Investigating Lexical Effects in Syntax with Regularized Regression (Lasso)




Keywords: cross-validation, regularization, lasso, machine learning, corpus linguistics, collostructional analysis, distinctive collexeme analysis, overfitting


Within usage-based theory, notably in construction grammar but also elsewhere, the role of the lexicon and of lexically specific patterns in morphosyntax is well recognized. Current methodology, however, is not always well suited to capturing the details, because lexical effects are difficult to study with the standard methods for investigating grammar empirically. In this short article, we propose a method from machine learning, regularized regression (Lasso) with k-fold cross-validation, and compare its performance with that of Distinctive Collexeme Analysis.

Author Biographies

Freek Van de Velde, KU Leuven

Freek Van de Velde (KU Leuven) is an associate professor of Dutch linguistics and historical linguistics. His research focuses on quantitative approaches to variation and change and on evolutionary linguistics. He received his PhD in 2009, with a dissertation on the diachrony of the noun phrase.

Dirk Pijpops, Université de Liège

Dirk Pijpops (University of Liège) works as a lecturer of Dutch. He is affiliated with the research unit Lilith. His research focuses on language variation and change, which he studies in order to answer questions in usage-based theoretical linguistics. Methodologically, his work builds on quantitative corpus analyses and agent-based computer simulations. He received his PhD in 2019 at the University of Leuven, with a thesis on argument structure variation in Dutch.





How to Cite

Van de Velde, F., & Pijpops, D. (2021). Investigating Lexical Effects in Syntax with Regularized Regression (Lasso). Journal of Research Design and Statistics in Linguistics and Communication Science, 6(2), 166–199.