SANTI-morf dictionaries


  • Prihantoro, Universitas Diponegoro



Keywords: SANTI-morf, dictionary, Indonesian, corpus, morphology


This article describes the structure of the dictionaries used in SANTI-morf (Sistem Analisis Teks Indonesia – morfologi), a multi-module pipeline system, built using NooJ (Silberztein, 2003, 2016), that annotates an Indonesian corpus at the morpheme level. Together with other SANTI-morf components, the dictionaries enable the system to tokenize each word in an Indonesian corpus into morphemes (e.g., cliticized and non-cliticized roots, affixes, reduplications) and to associate each morpheme with its corresponding tag. Each dictionary entry is encoded with a tag composed of morphological analysis (MA) labels, in most cases combined with system implementation (SI) labels. MA labels encode formal and functional morphological criteria (e.g., root part-of-speech (POS) labels) and are typically used for searching the annotated corpus. SI labels serve the system's internal operation and are mostly of interest to developers rather than end users. They include morphotactic and morphophonemic constraint labels, which are processed when the monomorphemic dictionary entries work together with SANTI-morf grammars (rules).
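The split between MA and SI labels can be pictured with a minimal Python sketch. It is an illustration only and does not use SANTI-morf's or NooJ's actual dictionary format: the class `MorphemeEntry`, the label names (`ROOT`, `VERB`, `PREFIX`, `PASSIVE`, `ALLOWS_PREFIX`), and the `+` separator are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MorphemeEntry:
    """One monomorphemic dictionary entry (hypothetical layout)."""
    form: str
    ma_labels: Tuple[str, ...]        # morphological analysis labels, user-facing
    si_labels: Tuple[str, ...] = ()   # system implementation labels, developer-facing

    def full_tag(self) -> str:
        # Tag as stored in the dictionary: MA labels, then any SI labels.
        return "+".join(self.ma_labels + self.si_labels)

    def search_tag(self) -> str:
        # Tag as a corpus user would query it: MA labels only.
        return "+".join(self.ma_labels)


def find_by_ma_label(entries: List[MorphemeEntry], label: str) -> List[str]:
    """Return the forms of all entries that carry the given MA label."""
    return [e.form for e in entries if label in e.ma_labels]


# Illustrative entries; every label name here is invented for the sketch.
entries = [
    MorphemeEntry("baca", ("ROOT", "VERB"), ("ALLOWS_PREFIX",)),  # root 'read'
    MorphemeEntry("di", ("PREFIX", "PASSIVE")),                   # passive prefix
]
```

In this sketch, `entries[0].full_tag()` combines both label types, `entries[0].search_tag()` exposes only the MA part, and the second entry shows a tag that carries no SI labels at all.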

Author Biography

  • Prihantoro, Universitas Diponegoro

    Prihantoro is an associate professor of corpus linguistics in the Department of Linguistics, Universitas Diponegoro, Indonesia. He earned his Ph.D. from Lancaster University and manages several corpora in CQPweb Lancaster. He is the author of SANTI-morf (a morphological annotation system for Indonesian) and Buku Referensi Pengantar Linguistik Korpus (Introduction to corpus linguistics reference book, written in Indonesian). He can be reached via [email protected] or his website.

References


Adriani, M., and Riza, H. (2008). Research report phase 2.1: Final design report on statistical machine translation network. Jakarta: Badan Pengkajian dan Penerapan Teknologi (BPPT).

Aikhenvald, A. Y. (2001). A typology of noun categorization devices. Oxford: Oxford University Press.

Alwi, H., Dardjowidjojo, S., Lapoliwa, H., and Moeliono, M. (1998). Tata Bahasa Baku Bahasa Indonesia (3rd ed.). Jakarta: Balai Pustaka.

Garside, R. (1987). The CLAWS word-tagging system. In R. Garside, G. Leech, and G. Sampson (Eds.), The computational analysis of English: A corpus-based approach (pp. 31–41). London: Longman.

Larasati, S.-D., Kubon, V., and Zeman, D. (2011). Indonesian morphology tool (MorphInd): Towards an Indonesian corpus. In C. Mahlow and M. Piotrowski (Eds.), Systems and frameworks for computational morphology (pp. 119–129). Berlin and Heidelberg: Springer.

Leclère, C. (2005). The lexicon-grammar of French verbs: A syntactic database. In Y. Kawaguchi, S. Zaima, T. Takagaki, K. Shibano, and M. Usami (Eds.), Linguistic informatics – state of the art and the future (pp. 29–45). Amsterdam: John Benjamins Publishing.

Lewis, M.-P., Simons, G.-F., and Fennig, C.-D. (2009). Ethnologue: Languages of the world (vol. 16). Dallas: SIL International.

Marcus, M.-P., Marcinkiewicz, M.-A., and Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.

Paumier, S. (2014). Unitex manual version 3.1. Paris: Université Paris-Est Marne-la-Vallée and LADL.

Pisceldo, F., Mahendra, R., Manurung, R., and Arka, I.-W. (2008). A two level morphological analyser for the Indonesian language (pp. 142–150). Tasmania: Australasian Language Technology Association Workshop.

Prihantoro, P. (2019). A new tagset for morphological analysis of Indonesian (pp. 176–181). International Corpus Linguistics Conference, Cardiff.

Prihantoro, P. (2021a). An evaluation of MorphInd’s morphological annotation scheme for Indonesian. Corpora, 16(2), 287–299.

Prihantoro, P. (2021b). An automatic morphological analysis system for Indonesian. PhD thesis. Lancaster: Lancaster University Press.

Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. Proceedings of the International Conference on New Methods in Language Processing, Manchester.

Silberztein, M. (2003). NooJ Manual.

Silberztein, M. (2016). Formalizing natural languages: The NooJ approach. London: Wiley.

Sneddon, J.-N., Adelaar, A., Djenar, D.-N., and Ewing, M.-C. (2010). Indonesian reference grammar (2nd ed.). New South Wales: Allen & Unwin.






How to Cite

Prihantoro. (2022). SANTI-morf dictionaries. Lexicography, 9(2), 175–193.