Large-sample confidence intervals of information-theoretic measures in linguistics


  • Ryan Ka Yau Lai, University of California, Santa Barbara
  • Youngah Do, The University of Hong Kong



Keywords: information-theoretic measures, entropy, Kullback-Leibler divergence, mutual information


This article explores a method for constructing confidence bounds for information-theoretic measures in linguistics, such as entropy, Kullback-Leibler divergence (KLD), and mutual information. We show that a useful measure of uncertainty can be derived from simple statistical principles, namely the asymptotic distribution of the maximum likelihood estimator (MLE) and the delta method. Three case studies from phonology and corpus linguistics demonstrate how to apply the method and examine its robustness to violations of its assumptions that are common in linguistic data, such as insufficient sample size and non-independence of data points.
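As a rough illustration of the general idea (not the authors' implementation), the sketch below computes a large-sample confidence interval for the plug-in entropy of a categorical distribution. It assumes independent draws from a multinomial distribution, uses the relative frequencies as the MLE, and applies the standard first-order delta-method variance, (Σ p_i (log p_i)² − H²) / n. The function name `entropy_ci` and the default critical value `z=1.96` are choices made here for illustration only.

```python
import math


def entropy_ci(counts, z=1.96):
    """Plug-in entropy (in bits) with a delta-method confidence interval.

    Assumes `counts` are category frequencies from independent draws
    (an assumption of this sketch; real linguistic data may violate it).
    Returns (estimate, lower bound, upper bound).
    """
    n = sum(counts)
    # MLE of the category probabilities; zero counts contribute nothing.
    probs = [c / n for c in counts if c > 0]

    # Plug-in entropy estimate H = -sum p_i * log2(p_i).
    h = -sum(p * math.log2(p) for p in probs)

    # Delta method: Var(H) is approximately (E[(log2 p)^2] - H^2) / n.
    second_moment = sum(p * math.log2(p) ** 2 for p in probs)
    variance = max(second_moment - h ** 2, 0.0) / n  # guard against rounding
    se = math.sqrt(variance)

    return h, h - z * se, h + z * se


if __name__ == "__main__":
    estimate, lower, upper = entropy_ci([75, 25])
    print(f"H = {estimate:.3f} bits, 95% CI [{lower:.3f}, {upper:.3f}]")
```

Note that for a perfectly uniform sample the first-order variance is zero, so the interval collapses to a point; this is a known limitation of the first-order approximation rather than evidence of certainty.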

Author Biographies

Ryan Ka Yau Lai, University of California, Santa Barbara

Ryan Ka Yau Lai is a PhD student in the Department of Linguistics, University of California, Santa Barbara, CA, USA.

Youngah Do, The University of Hong Kong

Youngah Do is an Assistant Professor in the Department of Linguistics at The University of Hong Kong.






How to Cite

Lai, R. K. Y., & Do, Y. (2020). Large-sample confidence intervals of information-theoretic measures in linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science, 6(1), 19–54.