Large-sample confidence intervals of information-theoretic measures in linguistics

Authors

  • Ryan Ka Yau Lai, University of California, Santa Barbara
  • Youngah Do, The University of Hong Kong

DOI:

https://doi.org/10.1558/jrds.40134

Keywords:

information-theoretic measures, entropy, Kullback-Leibler Divergence, mutual information

Abstract

This article explores a method for constructing confidence bounds for information-theoretic measures in linguistics, such as entropy, Kullback-Leibler Divergence (KLD), and mutual information. We show that a useful measure of uncertainty can be derived from simple statistical principles, namely the asymptotic distribution of the maximum likelihood estimator (MLE) and the delta method. Three case studies from phonology and corpus linguistics demonstrate how to apply the method and examine its robustness to violations of its assumptions that are common in linguistic data, such as insufficient sample size and non-independence of data points.
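
As an illustration of the general idea for entropy, the following is a minimal sketch and not the authors' implementation. It assumes independent multinomial counts, estimates category probabilities by their MLE (relative frequencies), and uses the standard first-order delta-method variance for plug-in entropy, (Σ p (log p)² − H²) / n, to form a Wald-type interval. The function name `entropy_ci` and its parameters are hypothetical, chosen for this example only.

```python
import math
from statistics import NormalDist


def entropy_ci(counts, level=0.95):
    """Plug-in entropy (in nats) with a delta-method Wald interval.

    Assumes the counts are an independent multinomial sample
    (an assumption of this sketch, not a claim about the article).
    """
    n = sum(counts)
    # MLE of the category probabilities; zero counts drop out (0 * log 0 := 0).
    probs = [c / n for c in counts if c > 0]
    h = -sum(p * math.log(p) for p in probs)
    # First-order delta-method variance: (sum p (log p)^2 - H^2) / n.
    var = (sum(p * math.log(p) ** 2 for p in probs) - h ** 2) / n
    # Guard against tiny negative values caused by floating-point rounding.
    se = math.sqrt(max(var, 0.0))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return h, (h - z * se, h + z * se)


if __name__ == "__main__":
    estimate, (lower, upper) = entropy_ci([50, 30, 20])
    print(f"H = {estimate:.4f} nats, 95% CI = ({lower:.4f}, {upper:.4f})")
```

Under these assumptions the interval narrows at rate 1/√n. For a perfectly uniform sample the first-order variance is zero, so the interval collapses to a point; this is one way a large-sample approximation of this kind can mislead in small or unusual samples.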

Author Biographies

Ryan Ka Yau Lai, University of California Santa Barbara

Ryan Ka Yau Lai is a PhD student in the Department of Linguistics, University of California, Santa Barbara, CA, USA.

Youngah Do, The University of Hong Kong

Youngah Do is Assistant Professor in the Department of Linguistics of The University of Hong Kong.

Published

2020-11-07

How to Cite

Lai, R. K. Y., & Do, Y. (2020). Large-sample confidence intervals of information-theoretic measures in linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science, 6(1), 19–54. https://doi.org/10.1558/jrds.40134

Section

Articles