Esperanto Dictionary and Parser: Design, Implementation, and Examples
Overview
This article describes a practical design and implementation for an Esperanto dictionary and parser suitable for lookup, morphological analysis, and basic syntactic parsing. It includes data model choices, algorithmic approaches, performance considerations, and concrete code examples in Python. The implementation emphasizes clarity and extensibility for NLP projects.
1. Requirements & Goals
- Fast dictionary lookup by lemma and inflected forms.
- Morphological parsing (token → lemma + part-of-speech + morphological features).
- Support for Esperanto morphophonology and productive derivation (prefixes, suffixes, compounding).
- Clean, maintainable data structures and a simple API.
- Ability to export/import dictionary data (JSON, CSV).
- Reasonable speed and low memory footprint for desktop/server use.
2. Esperanto Morphology Primer (short)
- Esperanto is highly regular: roots combine with affixes and grammatical endings.
- Parts-of-speech endings:
- Nouns: -o (plural/personal: -oj; accusative: -on/-ojn)
- Adjectives: -a (agree in number/case with nouns)
- Verbs: infinitive -i; present -as; past -is; future -os; conditional -us; imperative -u; participles -ant-, -int-, -ont- combine with a POS ending such as -a/-o/-e.
- Adverbs: -e
- Derivational affixes (e.g., mal-, re-, -et-, -eg-) modify meaning/productivity.
- Compound words join roots directly; some affixes can be nested.
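Because inflection is purely concatenative, the endings above can be sketched as a tiny word-form generator. This is an illustration of the morphology, not part of the parser; the function name `inflect` is hypothetical.

```python
def inflect(root, pos_ending, plural=False, accusative=False):
    """Build an inflected Esperanto word by concatenating root + ending,
    then optional plural -j and accusative -n (always in that order)."""
    word = root + pos_ending
    if plural:
        word += "j"
    if accusative:
        word += "n"
    return word

# "kat" (cat) as a plural accusative noun: kat + o + j + n
print(inflect("kat", "o", plural=True, accusative=True))  # katojn
```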
3. High-level Design
Components
- Lexicon (dictionary): entries keyed by lemma, containing POS, gloss, derivational info, example sentences.
- Morphological analyzer (parser): splits tokens into root + affixes + grammatical endings → returns lemma + features.
- Tokenizer: handles punctuation, clitics, numeric tokens.
- API: lookup(word), parse(word), add_entry(entry), export(format).
Data Model (JSON sketch)
- Entry:
- lemma: "kato"
- pos: "NOUN"
- gloss: "cat"
- frequency: 1234
- derivations: ["kateto"]
- examples: ["La kato kuras."]
- variants: ["katoj", "katon"]
4. Implementation Strategy
- Use a rule-based morphological analyzer leveraging Esperanto regularity.
- Maintain a root/affix lexicon: common roots and affixes prioritized for speed.
- Two-stage parse:
- Candidate segmentation generation (longest-match root-first).
- Validation against known affix patterns and POS endings.
- Handle productive derivation heuristically: allow stripping/attaching common affixes when root exists or when segmentation produces plausible root form.
- Cache frequent analyses.
5. Algorithms & Heuristics
- Longest-match root search: try longest possible substring as root present in lexicon; then match trailing grammatical endings.
- Backtracking: try alternative segmentations when initial attempt fails.
- Affix priority list: prefer grammatical endings (-o, -a, -i, -as, -is…) before derivational suffixes.
- Compound detection: check if token can split into two valid roots; recursively parse components.
- Unknown-root fallback: if no root found, use heuristic stem extraction (strip accusative -n, plural -j, tense endings) and return probable features with confidence score.
6. Code Example (Python)
- Minimal runnable analyzer demonstrating core ideas.
```python
# esperanto_parser.py
import re
from functools import lru_cache

# Minimal lexicon
LEXICON = {
    "kato": {"pos": "NOUN", "gloss": "cat"},
    "bona": {"pos": "ADJ", "gloss": "good"},
    "vidi": {"pos": "VERB", "gloss": "see"},
    "manĝ": {"pos": "ROOT", "gloss": "eat"},
    "manĝi": {"pos": "VERB", "gloss": "eat"},
    "lernejo": {"pos": "NOUN", "gloss": "school"},
    # add more roots...
}

GRAM_ENDINGS = {
    "NOUN": ["o"],
    "ADJ": ["a"],
    "ADV": ["e"],
    "VERB": ["i", "as", "is", "os", "us", "u"],
}

# citation-form ending per POS, used to reconstruct the lemma
CITATION_ENDING = {"NOUN": "o", "ADJ": "a", "ADV": "e", "VERB": "i"}

# common derivational suffixes (not yet used in this minimal version)
DERIV_SUFFIXES = ["et", "eg", "ec", "ul", "in", "ej", "ar", "ig", "iĝ"]

# helpers
def strip_accusative(word):
    """Strip a final accusative -n, if present."""
    if word.endswith("n") and len(word) > 1:
        return word[:-1], True
    return word, False

def strip_plural(word):
    """Strip a final plural -j, if present."""
    if word.endswith("j") and len(word) > 1:
        return word[:-1], True
    return word, False

@lru_cache(maxsize=10000)
def parse_token(token):
    token = token.lower()
    # strip surrounding punctuation
    token = re.sub(r"^\W+|\W+$", "", token)
    # suffix order is root + ending + -j + -n, so strip -n before -j
    stem, acc = strip_accusative(token)
    stem, plural = strip_plural(stem)
    # try exact match
    if stem in LEXICON:
        entry = LEXICON[stem]
        return {"lemma": stem, "pos": entry.get("pos"),
                "gloss": entry.get("gloss"), "plural": plural,
                "accusative": acc, "confidence": 1.0}
    # try matching grammatical endings against known roots/lemmas
    for pos, endings in GRAM_ENDINGS.items():
        for e in sorted(endings, key=len, reverse=True):
            if stem.endswith(e):
                base = stem[:-len(e)]
                citation = base + CITATION_ENDING[pos]
                if base in LEXICON or citation in LEXICON:
                    lemma = base if base in LEXICON else citation
                    return {"lemma": lemma, "pos": pos,
                            "features": {"ending": e}, "plural": plural,
                            "accusative": acc, "confidence": 0.9}
    # fallback heuristic: try splitting compounds (two roots)
    for i in range(3, len(stem) - 2):
        a, b = stem[:i], stem[i:]
        if a in LEXICON and b in LEXICON:
            return {"lemma": a + "+" + b, "pos": "COMPOUND",
                    "components": [a, b], "plural": plural,
                    "accusative": acc, "confidence": 0.8}
    # unknown word
    return {"lemma": token, "pos": None, "plural": plural,
            "accusative": acc, "confidence": 0.4}
```
7. Examples
- Input: “katoj” → parse: lemma “kato”, pos NOUN, plural yes, accusative no.
- Input: “manĝis” → parse: lemma “manĝi”/root “manĝ”, pos VERB, tense PAST.
- Input: “lernejeto” → parse: root “lernej” + diminutive suffix “-et” + noun ending “-o” → “small school”.
8. Evaluation & Performance
- Accuracy: rule-based approach yields high precision on in-vocabulary tokens; recall depends on lexicon coverage for roots and derivational affixes.
- Speed: O(n * m) where n = token length, m = number of attempted segmentations; optimized with longest-match and caching.
- Memory: lexicon size dominates; use compressed tries or SQLite for large datasets.
9. Extensions & Improvements
- Add a finite-state transducer (FST) for morphological parsing to cover edge cases and orthographic rules efficiently.
- Train a statistical tagger (CRF/transformer) to disambiguate POS/tense where morphology alone is insufficient.
- Integrate word frequencies for better segmentation scoring.
- Provide multiword expression handling and named-entity recognition.
10. Exporting & Interoperability
- Store lexicon as JSON for easy editing and as SQLite for production.
- Export parsed corpora in CoNLL-U format for downstream tasks.
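Emitting CoNLL-U is a matter of writing one tab-separated 10-column line per token (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with `_` for unannotated fields. A sketch, with hand-written analyses standing in for real parser output:

```python
def to_conllu(tokens):
    """Serialize one sentence (a list of token dicts) as CoNLL-U lines."""
    lines = []
    for i, t in enumerate(tokens, start=1):
        feats = "|".join(f"{k}={v}"
                         for k, v in sorted(t.get("feats", {}).items())) or "_"
        lines.append("\t".join([str(i), t["form"], t["lemma"], t["upos"],
                                "_", feats, "_", "_", "_", "_"]))
    return "\n".join(lines)

sent = [
    {"form": "La", "lemma": "la", "upos": "DET"},
    {"form": "kato", "lemma": "kato", "upos": "NOUN"},
    {"form": "kuras", "lemma": "kuri", "upos": "VERB",
     "feats": {"Tense": "Pres"}},
]
print(to_conllu(sent))
```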
11. Conclusion
A small, rule-based Esperanto dictionary and parser is practical to build thanks to the language’s regular morphology. Start with a solid lexicon and longest-match parsing, then add FSTs or statistical components to improve coverage and disambiguation. The provided code gives a minimal, extensible foundation for experimentation.