A segmentally informed solution to automatic accent classification and its advantages to forensic applications

Authors

  • Georgina Brown, Lancaster University and Soundscape Voice Evidence
  • Javier Franco-Pedroso, Independent Researcher
  • Joaquin González-Rodríguez, Universidad Autónoma de Madrid

DOI:

https://doi.org/10.1558/ijsll.20446

Keywords:

automatic accent recognition, explainable technologies, segmental information, forensic applications

Abstract

Traditionally, work in automatic accent recognition has followed a similar research trajectory to that of language identification, dialect identification and automatic speaker recognition. The same acoustic modelling approaches that have been implemented in speaker recognition (such as GMM-UBM and i-vector-based systems) have also been applied to automatic accent recognition. These approaches form models of speakers’ accents by taking acoustic features from right across the speech signal, without knowledge of its phonetic content. For accent recognition in particular, however, phonetic information is expected to add substantial value to the task. The current work presents an alternative modelling approach to automatic accent recognition, which forms models of speakers’ pronunciation systems using segmental information. This article argues that such an approach makes for a more explainable method, and is therefore more appropriate to deploy in settings where methods must be communicated clearly, such as forensic applications. We discuss the issue of explainability and show how the system operates on a large 700-speaker dataset of non-native English conversational telephone recordings.
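
To make the contrast above concrete, the sketch below illustrates one way a segmentally informed representation could be built and compared, assuming a phone-level alignment of each recording is available. The function names, the mean-vector representation per segment and the k-nearest-neighbour decision rule are illustrative simplifications for exposition, not the system reported in the article.

```python
import numpy as np
from collections import defaultdict

def segment_profile(frame_features, frame_labels):
    """Build a segmental accent profile: one mean feature vector per
    phone label, rather than one model over all frames regardless of
    phonetic content (as in GMM-UBM or i-vector modelling)."""
    per_segment = defaultdict(list)
    for feats, label in zip(frame_features, frame_labels):
        per_segment[label].append(feats)
    return {label: np.mean(frames, axis=0)
            for label, frames in per_segment.items()}

def profile_distance(profile_a, profile_b):
    """Average distance over the segment types two speakers share.
    Returning the shared labels keeps the comparison inspectable:
    we can report which segments contributed to the score."""
    shared = sorted(set(profile_a) & set(profile_b))
    diffs = [np.linalg.norm(profile_a[s] - profile_b[s]) for s in shared]
    return float(np.mean(diffs)), shared

def classify_accent(test_profile, reference_profiles, k=5):
    """reference_profiles is a list of (accent_label, profile) pairs.
    A simple k-nearest-neighbour vote over profile distances."""
    scored = sorted((profile_distance(test_profile, ref)[0], accent)
                    for accent, ref in reference_profiles)
    votes = [accent for _, accent in scored[:k]]
    return max(set(votes), key=votes.count)
```

The crucial difference from whole-signal acoustic modelling is the grouping by phone label: because every distance is computed per segment type, a classification can be traced back to particular parts of the pronunciation system, which is the property the article links to explainability in forensic settings.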

Author Biographies

Georgina Brown, Lancaster University and Soundscape Voice Evidence

Georgina Brown is a Lecturer in Forensic Linguistics in the Department of Linguistics and English Language at Lancaster University, UK. Much of her research considers speech technologies in forensic casework and investigative scenarios, but other contributions address topics and challenges that affect casework practice and the forensic speech science community. In addition to her academic position, she is a consultant for Soundscape Voice Evidence, a UK-based forensic speech analysis provider.

 

Javier Franco-Pedroso, Independent Researcher

Javier Franco-Pedroso received his PhD from Universidad Autónoma de Madrid (UAM) in 2016. He undertakes research in speaker and language recognition, forensic evidence evaluation and financial time-series analysis and synthesis, among other topics. After a two-year period as a postdoctoral researcher and assistant professor at UAM, he moved into industry. He has been working as a Speech Recognition Engineer in keyword spotting and automatic speech recognition applications for several companies.

Joaquin González-Rodríguez, Universidad Autónoma de Madrid

Joaquin Gonzalez-Rodriguez, Ph.D. (1999), is a Full Professor at Universidad Autonoma de Madrid (UAM). He has led ATVS/AUDIAS participations in multiple NIST Speaker and Language Recognition Evaluations since 2001, and has been an invited member of the FSAAWG (Forensic Speech and Audio Analysis Working Group) of ENFSI (European Network of Forensic Science Institutes) since 2000. In 2007, he authored “Emulating DNA: Rigorous quantification of evidential weight in transparent and testable forensic speaker recognition”. In 2008, he received a Google Faculty Research Award and delivered a keynote plenary talk at Interspeech 2008 in Brisbane, Australia, entitled “Forensic Automatic Speaker Recognition: Fiction or Science?”. During the 2010-2011 academic year, he was a Visiting Scholar in the Speech Group at ICSI (International Computer Science Institute) at the University of California, Berkeley. His research interests focus on speech and audio processing, machine learning and forensic science.

References

Adadi, A. and Berrada, M. (2018). Peeking inside the black-box: a survey on Explainable Artificial Intelligence (XAI). IEEE Access. 6. 52138-52160. DOI: https://doi.org/10.1109/ACCESS.2018.2870052

D’Arcy, S., Russell, M., Browning, S. and Tomlinson, M. (2004). The Accents of the British Isles (ABI) corpus. In Proceedings of Modélisations pour l’Identification des Langues. Paris, France. 115-119.

Bahari, M.H., Saeidi, R., Van Hamme, H. and Van Leeuwen, D. (2013). Accent recognition using i-vector, Gaussian mean supervector and Gaussian posterior probability supervector for spontaneous telephone speech. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada. 7344-7348. DOI: https://doi.org/10.1109/ICASSP.2013.6639089

Behravan, H., Hautamäki, V. and Kinnunen, T. (2013). Foreign accent detection from spoken Finnish using i-vectors. In Proceedings of Interspeech. Lyon, France. 79-82. DOI: https://doi.org/10.21437/Interspeech.2013-42

Behravan, H., Hautamäki, V. and Kinnunen, T. (2015). Factors affecting i-vector based foreign accent recognition: A case study in spoken Finnish. Speech Communication. 66. 118-129. DOI: https://doi.org/10.1016/j.specom.2014.10.004

Biadsy, F., Soltau, H., Mangu, L., Navratil, J. and Hirschberg, J. (2010). Discriminative phonotactics for dialect recognition using context-dependent phone classifiers. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop. Brno, Czech Republic. 263-270.

Boril, H., Sangwan, A. and Hansen, J. (2012). Arabic Dialect Identification – ‘Is the secret in the silence?’ and other observations. In Proceedings of Interspeech. Portland, Oregon. 30-33. DOI: https://doi.org/10.21437/Interspeech.2012-18

Brown, G. (2015). Automatic recognition of geographically-proximate accents using content-controlled and content-mismatched speech data. In Proceedings of the 18th International Congress of Phonetic Sciences. Glasgow, UK.

Brown, G. (2016). Automatic accent recognition systems and the effects of data on performance. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop. Bilbao, Spain. DOI: https://doi.org/10.21437/Odyssey.2016-14

Brown, G. (2017). Considering automatic accent recognition technology for forensic applications. Ph.D. thesis, University of York, UK.

Brown, G. (2018). Segmental content effects on text-dependent automatic accent recognition. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop. Les Sables d’Olonne, France. 9-15. DOI: https://doi.org/10.21437/Odyssey.2018-2

Brown, G. and Wormald, J. (2017). Automatic Sociophonetics: Exploring corpora using a forensic accent recognition system. Journal of the Acoustical Society of America. 142. 422-433. DOI: https://doi.org/10.1121/1.4991330

Brümmer, N. (2007). FoCal Multi-Class: Toolkit for evaluation, fusion and calibration of multi-class recognition scores – tutorial and user manual. URL: https://sites.google.com/site/nikobrummer/focalmulticlass (Accessed: 17/04/2017).

Brümmer, N. and Van Leeuwen, D. (2006). On calibration of language recognition scores. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop. San Juan, Puerto Rico. DOI: https://doi.org/10.1109/ODYSSEY.2006.248106

Chen, T., Huang, C., Chang, E. and Wang, J. (2001). Automatic accent identification using Gaussian Mixture Models. In Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding. Italy. DOI: https://doi.org/10.1109/ASRU.2001.1034657

Clopper, C. and Pisoni, D. (2004). Some acoustic cues for the perceptual categorization of American English regional dialects. Journal of Phonetics. 32. 111-140. DOI: https://doi.org/10.1016/S0095-4470(03)00009-3

Dehak, N., Kenny P., Dehak, R., Dumouchel, P. and Ouellet, P. (2011). Front-End Factor Analysis for Speaker Verification. IEEE Transactions on Audio, Speech and Language Processing. 19. 788-798. DOI: https://doi.org/10.1109/TASL.2010.2064307

Drummond, R. (2012). Aspects of identity in a second language: ING variation in the speech of Polish migrants living in Manchester, UK. Language Variation and Change. 24. 107-133. DOI: https://doi.org/10.1017/S0954394512000026

Ferragne, E., Gendrot, C. and Pellegrini, T. (2019). Towards phonetic interpretability in deep learning applied to voice comparison. In Proceedings of the International Congress of Phonetic Sciences. Melbourne, Australia.

Ferragne, E. and Pellegrino, F. (2010). Vowel systems and accent similarity in the British Isles: Exploiting multidimensional acoustic distances in phonetics. Journal of Phonetics. 38. 526-539. DOI: https://doi.org/10.1016/j.wocn.2010.07.002

Franco-Pedroso, J. and González-Rodríguez, J. (2016). Linguistically-constrained formant-based i-vectors for automatic speaker recognition. Speech Communication. 76. 61-81. DOI: https://doi.org/10.1016/j.specom.2015.11.002

González-Rodríguez, J., Rose, P., Ramos-Castro, D., Toledano, D. and Ortega-Garcia, J. (2007). Emulating DNA: Rigorous quantification of evidential weight in transparent and testable forensic speaker recognition. IEEE Transactions on Audio, Speech and Language Processing. 15. 2104-2115. DOI: https://doi.org/10.1109/TASL.2007.902747

Grabe, E. (2004). Intonational variation in urban dialects of English spoken in the British Isles. In P. Gilles and J. Peters (Eds.) Regional Variation in Intonation. Linguistische Arbeiten, Tuebingen. 9-31.

Hanani, A., Russell, M. and Carey, M. (2013). Human and computer recognition of regional accents and ethnic groups from British English speech. Computer Speech and Language. 27. 59-74. DOI: https://doi.org/10.1016/j.csl.2012.01.003

Huckvale, M. (2004). ACCDIST: a metric for comparing speakers’ accents. In Proceedings of the International Conference on Spoken Language Processing. Jeju, Korea. 29-32. DOI: https://doi.org/10.21437/Interspeech.2004-29

Huckvale, M. (2007). ACCDIST: An accent similarity metric for accent recognition and diagnosis. In C Müller (Ed.) Speaker Classification, Volume 2 of Lecture Notes in Computer Science. Springer-Verlag, Berlin Heidelberg. 258-274. DOI: https://doi.org/10.1007/978-3-540-74122-0_20

Huckvale, M. (2016). Within-speaker features for native language recognition in the Interspeech 2016 Computational Paralinguistics Challenge. In Proceedings of Interspeech. San Francisco, USA. 2403-2407. DOI: https://doi.org/10.21437/Interspeech.2016-1466

Hughes, V. and Foulkes, P. (2015). The relevant population in forensic voice comparison: Effects of varying delimitations of social class and age. Speech Communication. 66. 218-230. DOI: https://doi.org/10.1016/j.specom.2014.10.006

Hughes, V., Wood, S. and Foulkes, P. (2016). Strength of forensic voice comparison evidence from the acoustics of filled pauses. The International Journal of Speech, Language and the Law. 23. 99-132. DOI: https://doi.org/10.1558/ijsll.v23i1.29874

Kajarekar, S. S., Scheffer, N., Graciarena, M., Shriberg, E., Stolcke, A., Ferrer, L. and Bocklet, T. (2009). The SRI NIST 2008 Speaker Recognition Evaluation System. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing. Taipei, Taiwan. 4205-4208. DOI: https://doi.org/10.1109/ICASSP.2009.4960556

Kataria, A. and Singh, M. D. (2013). Review of data classification using k-nearest neighbour algorithm. International Journal of Emerging Technology and Advanced Engineering. 3. 354-360.

Lan, Y. (2020). Perception of English fricatives and affricates by advanced Chinese learners of English. In Proceedings of Interspeech. Shanghai, China. 4467-4470. DOI: https://doi.org/10.21437/Interspeech.2020-1120

de Leeuw, E. (2007). Hesitation markers in English, German and Dutch. Journal of Germanic Linguistics. 19. 85-114. DOI: https://doi.org/10.1017/S1470542707000049

McDougall, K. and Duckworth, M. (2017). Profiling Fluency: An analysis of individual variation in disfluencies in adult males. Speech Communication. 95. 16-27. DOI: https://doi.org/10.1016/j.specom.2017.10.001

Morrison, G. S. (2013). Tutorial on logistic-regression calibration and fusion: converting a score to a likelihood ratio. Australian Journal of Forensic Sciences. 45. 173-197. DOI: https://doi.org/10.1080/00450618.2012.733025

Najafian, M., Safavi, S., Weber, P. and Russell, M. (2016). Identification of British English regional accent using fusion of i-vector and multi-accent phonotactic systems. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop. Bilbao, Spain.

Nance, C., Kirkham, S. and Groarke, E. (2018). Studying intonation in varieties of English: Gender and individual variation in Liverpool. In N. Braber and S. Jansen (Eds.) Sociolinguistics in England. Palgrave Macmillan, Basingstoke. 274-296. DOI: https://doi.org/10.1057/978-1-137-56288-3_11

Piske, T., MacKay, I. and Flege, J. (2001). Factors affecting degree of foreign accent in an L2: A review. Journal of Phonetics. 29. 191-215. DOI: https://doi.org/10.1006/jpho.2001.0134

Przybocki, M. and Martin, A. (2004). NIST Speaker Recognition Evaluation chronicles. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop. Toledo, Spain.

Ramos-Castro, D., González-Rodríguez, J. and Ortega-Garcia, J. (2006). Likelihood ratio calibration in a transparent and testable forensic speaker recognition framework. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop. San Juan, Puerto Rico. DOI: https://doi.org/10.1109/ODYSSEY.2006.248088

Reynolds, D. and Rose, R. (1995). Robust Text-Independent Speaker Identification using Gaussian Mixture Speaker Models. IEEE Transactions on Speech and Audio Processing. 3. 72-83. DOI: https://doi.org/10.1109/89.365379

Samek, W., Wiegand, T. and Müller, K-R. (2017). Explainable Artificial Intelligence: understanding, visualizing and interpreting deep learning models. URL: https://arxiv.org/pdf/1708.08296.pdf

Schuller, B., Steidl, S., Batliner, A., Hirschberg, J., Burgoon, J., Baird, A., Elkins, A., Zhang, Y., Coutinho, E. and Evanini, K. (2016). The Interspeech 2016 Computational Paralinguistics Challenge: Deception, sincerity and native language. In Proceedings of Interspeech. San Francisco, USA. 2001-2005.

Shon, S., Ali, A. and Glass, J. (2018). Convolutional neural network and language embeddings for end-to-end dialect recognition. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop. Les Sables d’Olonne, France. 98-104. DOI: https://doi.org/10.21437/Odyssey.2018-14

Snyder, D., Garcia-Romero, D., Povey, D. and Khudanpur, S. (2017). Deep Neural Network Embeddings for Text-Independent Speaker Verification. In Proceedings of Interspeech. Stockholm, Sweden. 999-1003. DOI: https://doi.org/10.21437/Interspeech.2017-620

Stuart-Smith, J. (1999). Glasgow: Accent and Voice Quality. In P. Foulkes and G. Docherty (Eds.) Urban Voices: Accent Studies in the British Isles. Routledge, London. 203-222.

Tully, G. (2020). Codes of Practice and Conduct for forensic science providers and practitioners in the Criminal Justice System. FSR-C-100. Issue 5. The UK Government. URL: https://www.gov.uk/government/publications/forensic-science-providers-codes-of-practice-and-conduct-2020.

Vieru, B., Boula de Mareüil, P. and Adda-Decker, M. (2011). Characterisation and identification of non-native French accents. Speech Communication. 53. 292-310. DOI: https://doi.org/10.1016/j.specom.2010.10.002

Watt, D. (2010). The identification of the individual through speech. In C. Llamas and D. Watt (Eds.) Language and Identities. Edinburgh University Press, Edinburgh. 76-85. DOI: https://doi.org/10.1515/9780748635788-011

Watt, D., Harrison, P. and Cabot-King, L. (2020). Who owns your voice? Linguistic and legal perspectives on the relationship between vocal distinctiveness and the rights of the individual speaker. The International Journal of Speech, Language and the Law. 26. 137-180. DOI: https://doi.org/10.1558/ijsll.40571

Published

2022-07-08

How to Cite

Brown, G., Franco-Pedroso, J. and González-Rodríguez, J. (2022). A segmentally informed solution to automatic accent classification and its advantages to forensic applications. International Journal of Speech, Language and the Law, 28(2), 201–232. https://doi.org/10.1558/ijsll.20446

Section

Articles