SANTI-morf dictionaries

Authors

DOI:

https://doi.org/10.1558/lexi.23569

Keywords:

SANTI-morf, dictionary, Indonesian, corpus, morphology

Abstract

This article describes the structure of the dictionaries used in SANTI-morf (Sistem Analisis Teks Indonesia – morfologi), a multi-module pipeline system that annotates an Indonesian corpus at the morpheme level and is built using NooJ (Silberztein, 2003, 2016). SANTI-morf dictionaries, together with other SANTI-morf components, enable the system to tokenize each word in an Indonesian corpus into morphemes (e.g., cliticized and non-cliticized roots, affixes, reduplications) and associate these morphemes with their corresponding tags. Each entry in a SANTI-morf dictionary is encoded with a tag composed of morphological analysis (MA) labels. In most cases, these labels are combined with system implementation (SI) labels. MA labels encode formal and functional morphological criteria (e.g., root part-of-speech (POS) labels) and are typically used for searching the annotated corpus. SI labels serve the system's internal processing and are mostly of interest to developers rather than end users. They include morphotactic and morphophonemic constraint labels, which are processed when the monomorphemic dictionary entries work together with SANTI-morf grammars (rules).
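The split between MA and SI labels can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the label names (`ROOT`, `VERB`, `ALLOW_PREFIX`, `NASAL_ASSIM`), the `form,LABEL+LABEL` line layout, and the helper names `parse_entry` and `user_view` are hypothetical and do not reproduce the actual SANTI-morf tagset or dictionary syntax.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical label inventories, invented for this sketch only.
MA_LABELS = {"ROOT", "VERB", "NOUN", "PREFIX", "SUFFIX"}  # morphological analysis
SI_LABELS = {"ALLOW_PREFIX", "NASAL_ASSIM"}               # system implementation


@dataclass
class Entry:
    form: str
    ma_labels: List[str] = field(default_factory=list)
    si_labels: List[str] = field(default_factory=list)


def parse_entry(line: str) -> Entry:
    """Parse an assumed 'form,LABEL+LABEL+...' line into an Entry."""
    form, _, tag = line.partition(",")
    labels = [label.strip() for label in tag.split("+") if label.strip()]
    unknown = [label for label in labels if label not in MA_LABELS | SI_LABELS]
    if unknown:
        raise ValueError(f"unknown labels: {unknown}")
    return Entry(
        form=form.strip(),
        ma_labels=[label for label in labels if label in MA_LABELS],
        si_labels=[label for label in labels if label in SI_LABELS],
    )


def user_view(entry: Entry) -> str:
    """Show only the MA labels, i.e. the part an end user would search on."""
    return f"{entry.form}/{'+'.join(entry.ma_labels)}"


entry = parse_entry("baca,ROOT+VERB+ALLOW_PREFIX")
print(user_view(entry))  # baca/ROOT+VERB
```

In this sketch the SI labels stay on the entry for rule processing but are hidden from the search-facing view, which mirrors the division of roles described in the abstract.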

Author Biography

Prihantoro, Universitas Diponegoro

Prihantoro is an associate professor of corpus linguistics in the Department of Linguistics, Universitas Diponegoro, Indonesia. He earned his Ph.D. from Lancaster University and manages some corpora on CQPweb Lancaster (https://cqpweb.lancs.ac.uk/). He is the author of SANTI-morf (a morphological annotation system for Indonesian) and Buku Referensi Pengantar Linguistik Korpus (Introduction to corpus linguistics reference book, written in Indonesian). He can be reached via his website, http://prihantoro.rf.gd/

References

Adriani, M., and Riza, H. (2008). Research report phase 2.1: Final design report on statistical machine translation network. Jakarta: Badan Pengkajian dan Penerapan Teknologi (BPPT).

Aikhenvald, A. Y. (2001). A typology of noun categorization devices. Oxford: Oxford University Press.

Alwi, H., Dardjowidjojo, S., Lapoliwa, H., and Moeliono, M. (1998). Tata Bahasa Baku Bahasa Indonesia (3rd ed.). Jakarta: Balai Pustaka.

Garside, R. (1987). The CLAWS word-tagging system. In R. Garside, G. Leech, and G. Sampson (Eds.), The computational analysis of English: A corpus-based approach (pp. 31–41). London: Longman.

Larasati, S. D., Kubon, V., and Zeman, D. (2011). Indonesian morphology tool (MorphInd): Towards an Indonesian corpus. In C. Mahlow and M. Piotrowski (Eds.), Systems and frameworks for computational morphology (pp. 119–129). Berlin and Heidelberg: Springer. https://doi.org/10.1007/978-3-642-23138-4_8

Leclère, C. (2005). The lexicon-grammar of French verbs: A syntactic database. In Y. Kawaguchi, S. Zaima, T. Takagaki, K. Shibano, and M. Usami (Eds.), Linguistic informatics – state of the art and the future (pp. 29–45). Amsterdam: John Benjamins Publishing. https://doi.org/10.1075/ubli.1.05lec

Lewis, M. P., Simons, G. F., and Fennig, C. D. (2009). Ethnologue: Languages of the world (vol. 16). Dallas: SIL International.

Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330. https://doi.org/10.21236/ADA273556

Paumier, S. (2014). Unitex manual version 3.1. Paris: Université Paris-Est Marne-la-Vallée and LADL.

Pisceldo, F., Mahendra, R., Manurung, R., and Arka, I. W. (2008). A two level morphological analyser for the Indonesian language (pp. 142–150). Tasmania: Australasian Language Technology Association Workshop.

Prihantoro, P. (2019). A new tagset for morphological analysis of Indonesian (pp. 176–181). International Corpus Linguistics Conference, Cardiff.

Prihantoro, P. (2021a). An evaluation of MorphInd’s morphological annotation scheme for Indonesian. Corpora, 16(2), 287–299. https://doi.org/10.3366/cor.2021.0221

Prihantoro, P. (2021b). An automatic morphological analysis system for Indonesian. PhD thesis. Lancaster: Lancaster University Press.

Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. Proceedings of the International Conference on New Methods in Language Processing, Manchester.

Silberztein, M. (2003). NooJ Manual. www.nooj4nlp.net

Silberztein, M. (2016). Formalizing natural languages: The NooJ approach. London: Wiley. https://doi.org/10.1002/9781119264125

Sneddon, J. N., Adelaar, A., Djenar, D. N., and Ewing, M. C. (2010). Indonesian reference grammar (2nd ed.). New South Wales: Allen & Unwin.

Published

2022-11-25

How to Cite

Prihantoro. (2022). SANTI-morf dictionaries. Lexicography, 9(2), 175–193. https://doi.org/10.1558/lexi.23569

Issue

Section

Article