SANTI-morf dictionaries
DOI:
https://doi.org/10.1558/lexi.23569
Keywords:
SANTI-morf, dictionary, Indonesian, corpus, morphology
Abstract
This article describes the structure of the dictionaries used in SANTI-morf (Sistem Analisis Teks Indonesia – morfologi), a multi-module pipeline system that annotates an Indonesian corpus at the morpheme level and is built using NooJ (Silberztein, 2003, 2016). Together with the other SANTI-morf components, the dictionaries enable the system to tokenize each word in an Indonesian corpus into morphemes (e.g., cliticized and non-cliticized roots, affixes, reduplications) and to associate each morpheme with its corresponding tag. Each entry in a SANTI-morf dictionary is encoded with a tag composed of morphological analysis (MA) labels; in most cases, these are combined with system implementation (SI) labels. MA labels encode formal and functional morphological criteria (e.g., root part-of-speech (POS) labels) and are typically what users query when searching the annotated corpus. SI labels serve the internal operation of the system and are mostly of interest to developers rather than end users. They include morphotactic and morphophonemic constraint labels, which are processed when the monomorphemic dictionary entries work together with the SANTI-morf grammars (rules).
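The split between MA and SI labels can be pictured with a small Python sketch. It is illustrative only: the class `MorphemeEntry`, the helper `search_by_ma_label`, the `+` separator, and every label name (`ROOT`, `VERB`, `PREFIX`, `ACTIVE`, `NASAL_ALTERNATION`) are assumptions made for this example and are not taken from the SANTI-morf tagset or from the NooJ dictionary format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MorphemeEntry:
    """One monomorphemic dictionary entry (hypothetical layout)."""
    form: str
    ma_labels: tuple[str, ...]        # morphological analysis labels, user-facing
    si_labels: tuple[str, ...] = ()   # system implementation labels, developer-facing

    def tag(self, include_si: bool = False) -> str:
        """Join the labels into a single tag string; SI labels are optional."""
        labels = self.ma_labels + (self.si_labels if include_si else ())
        return "+".join(labels)


# Two illustrative entries: the root "baca" ("read") and the prefix "meN-".
# All label names here are invented placeholders.
ENTRIES = [
    MorphemeEntry("baca", ("ROOT", "VERB")),
    MorphemeEntry("meN", ("PREFIX", "ACTIVE"), ("NASAL_ALTERNATION",)),
]


def search_by_ma_label(entries, label):
    """Return the forms whose MA labels include the given label."""
    return [entry.form for entry in entries if label in entry.ma_labels]


if __name__ == "__main__":
    for entry in ENTRIES:
        print(entry.form, entry.tag(), entry.tag(include_si=True))
    print(search_by_ma_label(ENTRIES, "VERB"))
```

Keeping the two label groups in separate fields mirrors the distinction drawn in the abstract: a corpus search touches only `ma_labels`, while a constraint label such as the placeholder `NASAL_ALTERNATION` would matter only to the rules that combine entries.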
References
Adriani, M., and Riza, H. (2008). Research report phase 2.1: Final design report on statistical machine translation network. Jakarta: Badan Pengkajian dan Penerapan Teknologi (BPPT).
Aikhenvald, A. Y. (2001). A typology of noun categorization device. Oxford: Oxford University Press.
Alwi, H., Dardjowidjojo, S., Lapoliwa, H., and Moeliono, M. (1998). Tata Bahasa Baku Bahasa Indonesia (3rd ed.). Jakarta: Balai Pustaka.
Garside, R. (1987). The CLAWS word-tagging system. In R. Garside, G. Leech, and G. Sampson (Eds.), The computational analysis of English: A corpus-based approach (pp. 31–41). London: Longman.
Larasati, S. D., Kubon, V., and Zeman, D. (2011). Indonesian morphology tool (MorphInd): Towards an Indonesian corpus. In C. Mahlow and M. Piotrowski (Eds.), Systems and frameworks for computational morphology (pp. 119–129). Berlin and Heidelberg: Springer. https://doi.org/10.1007/978-3-642-23138-4_8
Leclère, C. (2005). The lexicon-grammar of French verbs: A syntactic database. In Y. Kawaguchi, S. Zaima, T. Takagaki, K. Shibano, and M. Usami (Eds.), Linguistic informatics – state of the art and the future (pp. 29–45). Amsterdam: John Benjamins Publishing. https://doi.org/10.1075/ubli.1.05lec
Lewis, M. P., Simons, G. F., and Fennig, C. D. (2009). Ethnologue: Languages of the world (vol. 16). Dallas: SIL International.
Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330. https://doi.org/10.21236/ADA273556
Paumier, S. (2014). Unitex manual version 3.1. Paris: Université Paris-Est Marne-la-Vallée and LADL.
Pisceldo, F., Mahendra, R., Manurung, R., and Arka, I. W. (2008). A two level morphological analyser for the Indonesian language (pp. 142–150). Tasmania: Australasian Language Technology Association Workshop.
Prihantoro, P. (2019). A new tagset for morphological analysis of Indonesian (pp. 176–181). International Corpus Linguistics Conference, Cardiff.
Prihantoro, P. (2021a). An evaluation of MorphInd’s morphological annotation scheme for Indonesian. Corpora, 16(2), 287–299. https://doi.org/10.3366/cor.2021.0221
Prihantoro, P. (2021b). An automatic morphological analysis system for Indonesian. PhD thesis. Lancaster: Lancaster University Press.
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. Proceedings of the International Conference on New Methods in Language Processing, Manchester.
Silberztein, M. (2003). NooJ Manual. www.nooj4nlp.net
Silberztein, M. (2016). Formalizing natural languages: The NooJ approach. London: Wiley. https://doi.org/10.1002/9781119264125
Sneddon, J. N., Adelaar, A., Djenar, D. N., and Ewing, M. C. (2010). Indonesian reference grammar (2nd ed.). New South Wales: Allen & Unwin.