Mining the Past – Data-Intensive Knowledge Discovery in the Study of Historical Textual Traditions
DOI:
https://doi.org/10.1558/jch.31662
Keywords:
text mining, quantitative text analysis, historical research, methodology
Abstract
Text-heavy and unstructured data constitute the primary source materials for many historical reconstructions. In history and the history of religion, text analysis has typically been conducted by systematically selecting a small sample of texts and subjecting it to highly detailed reading and mental synthesis. But two interrelated technological developments have made possible a new data-intensive paradigm in the study of historical textual traditions, one that can usefully supplement qualitative analysis. First, the availability of significant computing power has made it possible to run algorithms for automated text analysis on most personal computers. Second, the rapid increase in full-text digital databases relevant to the study of religion has considerably reduced the costs of data acquisition and digitization. However, a limited understanding of the scope, advantages, and limitations of data-intensive methods, combined with overly enthusiastic praise of big data by policy-makers and data scientists, has created real obstacles to the implementation of this paradigm in historical research. This is unfortunate, because history offers a rich and uncharted field for data-intensive knowledge discovery, and historians already have the much sought-after and necessary domain expertise. In this article we seek to remove obstacles to the data-intensive paradigm by presenting its methods and models for handling text-heavy data.