Large-sample confidence intervals of information-theoretic measures in linguistics
DOI: https://doi.org/10.1558/jrds.40134
Keywords: information-theoretic measures, entropy, Kullback-Leibler Divergence, mutual information
Abstract
This article explores a method for constructing confidence bounds for information-theoretic measures in linguistics, such as entropy, Kullback-Leibler Divergence (KLD), and mutual information. We show that a useful measure of uncertainty can be derived from simple statistical principles, namely the asymptotic distribution of the maximum likelihood estimator (MLE) and the delta method. Three case studies from phonology and corpus linguistics demonstrate how to apply the method and examine its robustness against violations of its assumptions that are common in linguistic data, such as insufficient sample size and non-independence of data points.
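To make the general idea concrete, the following is a minimal sketch of a large-sample interval for entropy. It is not taken from the article. It assumes independent multinomial counts, uses the plug-in (MLE) estimate of entropy in nats, and applies the standard first-order delta-method variance, (Σ p (log p)² − H²) / n. The function name `entropy_ci` and the parameter `z` are hypothetical names chosen for illustration.

```python
import math


def entropy_ci(counts, z=1.96):
    """Plug-in entropy (in nats) with a delta-method Wald interval.

    Assumption: `counts` are category frequencies from independent
    draws of a single multinomial distribution.
    """
    n = sum(counts)
    # MLE of the category probabilities; zero counts contribute nothing.
    probs = [c / n for c in counts if c > 0]
    h = -sum(p * math.log(p) for p in probs)
    # First-order delta method: Var(H_hat) ~ (sum p (log p)^2 - H^2) / n.
    var = (sum(p * math.log(p) ** 2 for p in probs) - h ** 2) / n
    # Guard against a tiny negative value caused by floating-point rounding.
    se = math.sqrt(max(var, 0.0))
    return h, h - z * se, h + z * se


if __name__ == "__main__":
    estimate, lower, upper = entropy_ci([50, 30, 20])
    print(f"H = {estimate:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

When the estimated distribution is exactly uniform, the first-order variance is zero and the interval collapses to a point. That is a known limitation of this approximation rather than evidence of certainty, and it is one reason small samples call for caution.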