Investigating Lexical Effects in Syntax with Regularized Regression (Lasso)
DOI:
https://doi.org/10.1558/jrds.18964
Keywords:
cross-validation, regularization, lasso, machine learning, corpus linguistics, collostructional analysis, distinctive collexeme analysis, overfitting
Abstract
Usage-based theory, notably construction grammar but also other frameworks, widely recognizes the role of the lexicon and of lexically specific patterns in morphosyntax. The standard methods for investigating grammar empirically, however, are not well suited to studying lexical effects in detail. In this short article, we propose a method from machine learning, regularized regression (Lasso) with k-fold cross-validation, and compare its performance with that of a Distinctive Collexeme Analysis.
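As a rough illustration of the general technique named in the abstract, the sketch below fits an L1-penalized (Lasso) logistic regression with one indicator per lexeme and picks the penalty strength by k-fold cross-validation. It is not the authors' implementation: the function names (`fit_lasso`, `cv_choose_lambda`), the proximal-gradient solver, the penalty grid, and the toy verb counts are all assumptions made for this example.

```python
import math
import random
from collections import Counter


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(x, t):
    """Proximal operator of the L1 penalty: shrink x toward 0 by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def fit_lasso(data, lam, n_iter=500, lr=1.0):
    """L1-penalized logistic regression with one indicator per lexeme.

    data: list of (lexeme, y) pairs with y in {0, 1} (the two variants).
    The intercept is not penalized. Returns (intercept, {lexeme: coef}).
    """
    total = Counter(lex for lex, _ in data)
    ones = Counter(lex for lex, y in data if y == 1)
    n = len(data)
    b0, coef = 0.0, {lex: 0.0 for lex in total}
    for _ in range(n_iter):
        # Gradient of the mean negative log-likelihood at the current point.
        grad = {lex: (total[lex] * sigmoid(b0 + coef[lex]) - ones[lex]) / n
                for lex in total}
        # Each observation has exactly one lexeme, so the intercept
        # gradient is the sum of the per-lexeme gradients.
        b0 -= lr * sum(grad.values())
        for lex in total:
            coef[lex] = soft_threshold(coef[lex] - lr * grad[lex], lr * lam)
    return b0, coef


def log_loss(data, b0, coef):
    """Mean held-out log-loss; unseen lexemes fall back to the intercept."""
    loss = 0.0
    for lex, y in data:
        p = sigmoid(b0 + coef.get(lex, 0.0))
        loss -= math.log(p if y == 1 else 1.0 - p)
    return loss / len(data)


def cv_choose_lambda(data, lambdas, k=5, seed=1):
    """Return the penalty with the lowest mean k-fold held-out log-loss."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = {}
    for lam in lambdas:
        losses = []
        for i in range(k):
            train = [r for j, f in enumerate(folds) if j != i for r in f]
            b0, coef = fit_lasso(train, lam)
            losses.append(log_loss(folds[i], b0, coef))
        scores[lam] = sum(losses) / k
    return min(scores, key=scores.get), scores


# Hypothetical toy counts (invented for illustration, not corpus data):
# (verb, variant) with variant 1 vs. variant 0 of some alternation.
toy = ([("give", 1)] * 40 + [("give", 0)] * 10
       + [("send", 1)] * 10 + [("send", 0)] * 40
       + [("offer", 1)] * 25 + [("offer", 0)] * 25
       + [("hand", 1)] * 3 + [("hand", 0)] * 1)

best_lam, cv_scores = cv_choose_lambda(toy, [0.001, 0.01, 0.02, 0.05, 0.2])
b0, coef = fit_lasso(toy, best_lam)
print("chosen penalty:", best_lam)
print("lexemes with non-zero coefficients:",
      {lex: round(c, 3) for lex, c in coef.items() if c != 0.0})
```

In a sketch like this, lexemes whose preference for one variant is weak or rests on few observations tend to have their coefficients shrunk to exactly zero, while strongly biased lexemes keep non-zero coefficients. The penalty is chosen on held-out data rather than on the training fit, which is the guard against overfitting that cross-validation is meant to provide.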