The Corpus of Galician/Spanish Bilingual Speech of the University of Vigo

Codes tagging and automatic annotation


  • Xoán Paulo Rodríguez-Yáñez, Universidade de Vigo
  • Hakan Casares-Berg, Seminario de Sociolingüística da RAG



Keywords: bilingual corpus, Galician/Spanish, tagging, bilingual conversation, languages in contact


First, we briefly present the research project behind the Corpus of Galician/Spanish Bilingual Speech (Corpus de Fala Bilingüe Galego/Castelán, abbreviated as CoFaBil), currently being compiled at the University of Vigo. This ethnographic, conversation-based corpus has been recorded in a wide range of informal, spontaneous communicative situations and subsequently transcribed in detail following the conventions normally applied in conversation analysis. Second, we explain the manual annotation process. The CHAT annotation system, used to tag this corpus, requires specifying the linguistic-communicative code to which each word belongs, and we discuss the problems that this word-by-word tagging raises. These problems involve phenomena characteristic of both bilingual conversation and languages in contact, with the added specificity that the small interlinguistic distance between the varieties of Galician and Spanish calls for adopting certain tagging values (presented in the text) that respond to the complex nature of the phenomena detected. Third, we present the solutions devised for the automatic annotation of the corpus. The most important result is the computer application Anotador 1.0, which makes it possible to annotate a substantial part of the phenomena appearing in the CoFaBil more quickly, while removing the interpretative biases involved in human annotation. Moreover, thanks to its versatility, the tool may be used to annotate bilingual speech corpora for any pair of languages.
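To make the word-by-word tagging problem concrete, the following is a minimal sketch of lexicon-based code tagging. It is not the method implemented in Anotador 1.0, whose internals are not described here. The tag labels (G, S, AMB, UNK), the function names, and the tiny word lists are all hypothetical and chosen only for illustration. The sketch shows why closely related varieties are hard to tag: a form shared by both lexicons cannot be assigned to a single code by lookup alone.

```python
from typing import List, Tuple

# Toy lexicons, illustrative only: NOT taken from CoFaBil or Anotador 1.0.
GALICIAN = {"rúa", "xente", "non", "hai", "na", "casa", "de"}
SPANISH = {"calle", "gente", "no", "hay", "en", "casa", "de"}


def tag_word(word: str) -> str:
    """Return a hypothetical code label for one word.

    G   = found only in the Galician list
    S   = found only in the Spanish list
    AMB = found in both lists (shared form, ambiguous by lookup)
    UNK = found in neither list
    """
    w = word.lower()
    in_galician = w in GALICIAN
    in_spanish = w in SPANISH
    if in_galician and in_spanish:
        return "AMB"
    if in_galician:
        return "G"
    if in_spanish:
        return "S"
    return "UNK"


def tag_utterance(utterance: str) -> List[Tuple[str, str]]:
    """Tag each whitespace-separated word of an utterance."""
    return [(w, tag_word(w)) for w in utterance.split()]


if __name__ == "__main__":
    for word, tag in tag_utterance("non hai gente na casa"):
        print(f"{word}\t{tag}")
```

In this sketch the shared form "casa" receives the AMB label, which is the kind of case that would need a dedicated tagging value or a decision based on context rather than on the word alone.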


How to Cite

Rodríguez-Yáñez, X. P., & Casares-Berg, H. (2003). The Corpus of Galician/Spanish Bilingual Speech of the University of Vigo: Codes tagging and automatic annotation. Sociolinguistic Studies, 4(1), 358-382.