Principal Components and clustering on lexical distances from standard Basque to local varieties


  • Juan Ignacio Modroño University of the Basque Country UPV/EHU
  • Karmele Fernandez-Aguirre University of the Basque Country UPV/EHU
  • Gotzon Aurrekoetxea University of the Basque Country UPV/EHU
  • Jesus Rubio University of the Basque Country UPV/EHU



Keywords: linguistic variation, Basque language, Principal Component Analysis, Cluster analysis


Linguistic distances are commonly used in dialectology to delimit dialects, and modern dialectology has applied a range of statistical techniques to draw dialect boundaries. So far, however, little work has been published on the determinants of language variation. In this contribution, we propose a two-step method, combining Principal Component Analysis and Cluster Analysis, to obtain a dialectal classification of localities from the distances between standard Basque and a number of local varieties of the Basque language. In addition, we determine which lexical features have caused the greatest variation in the different dialect areas and which localities are the most representative of each dialect area.
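
As a rough illustration of the two-step idea, the following Python sketch runs a principal component analysis on a small, invented matrix of lexical distances and then clusters the localities on their component scores. Everything in it is an assumption made for illustration: the toy data, the function names `pca_scores` and `kmeans`, the choice of two components and two clusters, and the use of k-means in place of whatever clustering procedure the paper actually applies, which the abstract does not specify.

```python
import numpy as np


def pca_scores(X, n_components):
    """PCA via SVD: return (scores, loadings, explained variance ratios)."""
    Xc = X - X.mean(axis=0)                       # centre each feature column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # localities x components
    loadings = Vt[:n_components].T                     # features x components
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, loadings, explained[:n_components]


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's k-means with a deterministic farthest-point start."""
    centers = [points[0]]
    for _ in range(1, k):
        # squared distance from every point to its nearest chosen centre
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Hypothetical data: rows are localities, columns are lexical features, and
# each cell is an invented distance from the standard form (0 = identical).
X = np.array([
    [0.9, 0.8, 0.1, 0.0],
    [1.0, 0.9, 0.0, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.0, 0.9, 1.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.8, 0.8],
])

# Step 1: reduce the feature space to a few principal components.
scores, loadings, explained = pca_scores(X, n_components=2)

# Step 2: cluster the localities on their component scores.
labels, centers = kmeans(scores, k=2)

# Features ordered by the size of their loading on the first component.
feature_order = np.argsort(-np.abs(loadings[:, 0]))
print("explained variance ratios:", explained)
print("features ordered by |loading| on PC1:", feature_order)

# The locality nearest each cluster centre, as a simple "most representative".
for j in range(2):
    members = np.where(labels == j)[0]
    d = np.sum((scores[members] - centers[j]) ** 2, axis=1)
    print("cluster", j, "members:", members, "most central:", members[np.argmin(d)])
```

In this sketch, the features with the largest absolute loadings on a component are the ones that drive most of the variation along it, and the locality closest to a cluster centre is taken as that cluster's most representative one. These are common conventions, not necessarily those adopted in the paper.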

Author Biographies

  • Juan Ignacio Modroño, University of the Basque Country UPV/EHU

    Juan Ignacio Modroño is Profesor Titular, Econometrics and Statistics.

  • Karmele Fernandez-Aguirre, University of the Basque Country UPV/EHU

    Karmele Fernandez-Aguirre is Colaboradora Honorífica, Statistics.

  • Gotzon Aurrekoetxea, University of the Basque Country UPV/EHU

    Gotzon Aurrekoetxea is Profesor Titular in Linguistics and Basque Studies.

  • Jesus Rubio, University of the Basque Country UPV/EHU

    Jesus Rubio is Profesor Titular, Econometrics and Statistics.


How to Cite

Modroño, J. I., Fernandez-Aguirre, K., Aurrekoetxea, G., & Rubio, J. (2016). Principal Components and clustering on lexical distances from standard Basque to local varieties. Journal of Research Design and Statistics in Linguistics and Communication Science, 3(1), 5-22.