Greek Language Support — Design Specification
Status (2026-06-03 update): Phase 1 is effectively already shipped — the Greek pack
scripts/dictionaries/langpack-el.zipwas committed in0fb56dc09and is byte-identical to a freshwordfreqbuild (verified), and the import → select → suggest chain already works with zero app code (theeldisplay name is even present ingetLanguageDisplayName). The real gap was discoverability: users didn’t know the pack existed or how to import it. Fixed by rewritingdocs/wiki/layouts/language-packs.md(correct path: Settings > 🌐 Multi-Language > Import Pack; lists prebuilt packs incl Greek; attribution) + adding the repoNOTICE.Remaining: (1) maintainer publishes the pack
.zips as downloadable release assets (currently only in-repo); (2) optional Phase 2 swipe (transliteration). Design for issue #68.
What the user asked for
Greek (el) word suggestions + dictionary; autocorrect optional. Plus a
broader ask: offer importing the phone’s system-installed language
dictionary (a Russian/Greek phone has a built-in dict) as an import
option. Swipe via “pseudoenglish” transliteration was floated as a stretch.
Decisions (resolved by research, 2026-06-02)
Source + licensing — use wordfreq
The dictionary is generated from the wordfreq library via
top_n_list('el', N) / get_wordlist.py. Why:
- Real corpus frequencies (OpenSubtitles 2018, SUBTLEX, Wikipedia, Google Books) — not the synthetic Levenshtein-estimated frequencies in dim-geo/greekdictionary.
- License is clean for a GPL-3.0 app: wordfreq code is Apache-2.0; its data is CC-BY-SA-4.0, and CC-BY-SA-4.0 is one-way compatible into GPLv3 (Creative Commons, 2015-10-08). We relicense the generated wordlist under GPL-3.0 when bundling. (The reverse is not permitted.)
- Already wired into
scripts/build_dictionary.py(--use-wordfreq). - Generalizes to Russian (
ru) and dozens more — same one-command flow.
Rejected sources: FrequencyWords (data is CC-BY-SA-3.0, not
GPL-compatible — only 4.0 gained one-way compatibility); HeliBoard
main_el (184k, license unspecified). Acceptable alternates if ever
needed: Hunspell el_GR (tri-license; take the GPL-2.0-or-later arm,
no frequencies) and dim-geo/greekdictionary (GPL-3.0, but synthetic
freqs). wordfreq is preferred.
Required NOTICE / attribution (when shipping)
Add to the app’s credits / NOTICE:
Greek (and other) word frequencies derived from wordfreq (Robyn Speer et al.), data under CC-BY-SA-4.0, incorporating OpenSubtitles 2018, SUBTLEX, Wikipedia, and Google Books Ngrams. Derived wordlist redistributed under GPL-3.0.
”Import the phone’s system dictionary” — NOT possible; ship file-import instead
Researched against Android platform docs: a third-party IME cannot read the phone’s system/Gboard/AOSP main language dictionary.
- Main
.dictfiles are app-private (since API 28 an app can’t world-share its data dir; since API 30 no app can read another’s) and the format is proprietary. - No content provider / service exports them; Gboard’s are proprietary.
UserDictionaryexposes only the user’s manually-added words (and CleverKeys already reads those —OptimizedVocabulary.kt:1734-1768— because the active IME is granted access; the permission has been IME-only since API 23). It is not the system language dictionary.- The spell-checker API can only validate words you supply, never enumerate its dictionary.
Therefore the real feature is “Import dictionary from a file” via the
Storage Access Framework, reusing the existing
langpack/LanguagePackManager.importLanguagePack(uri) flow (same pattern as
GIF packs). The user imports a wordlist/freq file (or a langpack-el.zip);
we build the per-language trie on-device. This generalizes to any language
and is the honest, implementable version of the request.
Existing infrastructure (verified against current source)
| Component | Status | Evidence |
|---|---|---|
| Greek keyboard layout | ✅ exists | srcs/layouts/grek_qwerty.xml (1 of 83) |
| wordfreq runs on this dev env | ✅ verified | installed on Termux; top_n_list('el',…) → 46,306 words |
| Wordlist → binary pipeline | ✅ exists + run | get_wordlist.py → build_dictionary.py → el_enhanced.bin |
| Unigrams (lang detection) | ✅ script | generate_unigrams.py --lang el |
| Contractions | ✅ script | generate_binary_contractions.py (Greek: minimal/empty) |
| Importable language pack | ✅ exists | build_langpack.py + LanguagePackManager.kt:59-83 |
| UserDictionary user-words import | ✅ already works | OptimizedVocabulary.kt:1734-1768 |
| Greek language detection | ❌ missing | no Greek patterns in data/LanguageDetector.kt |
Bundled/packaged el dictionary | ❌ not yet | generated 1.82 MB artifact, not yet committed |
| Swipe model support for Greek | ❌ missing | tokenizer is Latin-only (Phase 2) |
Generated artifact (proof-of-pipeline, 2026-06-02)
get_wordlist.py --lang el --count 50000 → 46,306 words (wordfreq 'small')
build_dictionary.py --lang el --use-wordfreq → el_enhanced.bin = 1.82 MB
46,306 canonical · 42,719 normalized · 84.3% accented · 3,587 collisions
Bundle-size note: 1.82 MB is larger than de/fr/it/pt (~620–650 KB) and en/es (~1.26 MB) — Greek’s 2-byte-per-char UTF-8 + heavy tonos accents inflate it. Trimming to ~30k words would cut size materially. This is the main go/no-go input for bundle-vs-pack.
Open decision (maintainer go/no-go)
- Bundle vs pack vs both, and word count: bundle 46k/1.82 MB in the
APK, trim to ~30k, or ship pack-only (
langpack-el.zip, zero APK bloat, importable)? “combo 1-3” pointed at bundle+pack+swipe. - Scope now: Phase 1 (suggestions) + the file-import feature, with Phase 2 (swipe) as a follow-up — or all together?
Implementation plan
Phase 1 — suggestions + dictionary (~½–1 day)
- Generate + commit
src/main/assets/dictionaries/el_enhanced.bin(or the pack). Add the NOTICE attribution. data/LanguageDetector.kt:initializeGreekPatterns()— char-frequency- common-word block (Greek α–ω is highly distinctive → reliable). Wire
el_unigrams.txtfromgenerate_unigrams.py.
- common-word block (Greek α–ω is highly distinctive → reliable). Wire
- Register
elso it’s selectable; verifygrek_qwerty.xmlis offered. - Optional minimal
contractions_el.json.
”Import dictionary from file” feature (~½ day)
- Add an “Import dictionary” action (Settings → Languages) using
ACTION_OPEN_DOCUMENT→LanguagePackManager.importLanguagePack(uri)(already parsesword[\tfreq]/unigrams.txtin a zip). - Document accepted formats; point users at wordfreq-derived or Hunspell/AOSP wordlists for languages we don’t bundle.
Phase 2 — swipe via transliteration (~1–2 days, deferred)
- Per-language Greek↔Latin-key map loaded into
SwipeTokenizer(models/transliteration_el.json); pre-transliterate layout chars → Latin tokens before model input, map predictions back to Greek via the dictionary. No neural retraining — the model consumes coordinate sequences, not glyphs.
Test plan
Pure JVM
- Language-detection unit test: feed Greek sample text (
"και το να του με") → detector returnsel; feed English → notel. - Dictionary-load test:
el_enhanced.binparses; word count > 40k; a known word (και) is present with non-zero frequency; accent-normalized lookup (τησform) resolves.
Instrumented (ew-cli, Pixel7 API34)
- With
elactive +grek_qwerty.xml: typing a Greek prefix yields Greek suggestions from the bundled/imported dict. - Import flow: SAF-pick a small
elwordlist zip →LanguagePackManager.importLanguagePack→ suggestions reflect imported words (mirror existing language-pack import test). - Layout:
grek_qwerty.xmlselectable and renders Greek keys.
Phase 2 (when built)
- Swipe across the Greek layout for a known word → transliteration maps to the right Latin token sequence → correct Greek word predicted.
Manual
- Set a Greek wordlist, type a sentence, confirm suggestions; confirm language auto-detect switches correctly when typing Greek vs English.
File touch list (Phase 1 + import)
| File | Change |
|---|---|
src/main/assets/dictionaries/el_enhanced.bin | New (1.82 MB @ 46k, or trimmed) — if bundling |
src/main/assets/unigrams/el_unigrams.txt | New — language detection |
data/LanguageDetector.kt | initializeGreekPatterns() + registration |
src/main/assets/dictionaries/contractions_el.json | Optional, minimal |
| NOTICE / credits screen | wordfreq attribution text (above) |
| Settings (Languages) | “Import dictionary” SAF action → LanguagePackManager |
| Tests | detection + dict-load (pure) + import + layout (instrumented) |
Related
- Dictionary System
- Multi-language · Language packs
- Neural Prediction (Phase 2 tokenizer)