A segmentally informed solution to automatic accent classification and its advantages to forensic applications

Authors

  • Georgina Brown, Lancaster University and Soundscape Voice Evidence
  • Javier Franco-Pedroso, Independent Researcher
  • Joaquin González-Rodríguez, Universidad Autónoma de Madrid

DOI:

https://doi.org/10.1558/ijsll.20446

Keywords:

automatic accent recognition, explainable technologies, segmental information, forensic applications

Abstract

Traditionally, work in automatic accent recognition has followed a similar research trajectory to that of language identification, dialect identification and automatic speaker recognition. The same acoustic modelling approaches that have been implemented in speaker recognition (such as GMM-UBM and i-vector-based systems) have also been applied to automatic accent recognition. These approaches form models of speakers’ accents by taking acoustic features from across the whole speech signal without knowledge of its phonetic content. For accent recognition in particular, however, phonetic information is expected to add substantial value to the task. The current work presents an alternative modelling approach to automatic accent recognition, which forms models of speakers’ pronunciation systems using segmental information. This article claims that such an approach to the problem makes for a more explainable method, and one that is therefore more appropriate to deploy in settings where it is important to be able to communicate methods, such as forensic applications. We discuss the issue of explainability and show how the system operates on a large 700-speaker dataset of non-native English conversational telephone recordings.

Author Biographies

  • Georgina Brown, Lancaster University and Soundscape Voice Evidence

    Georgina Brown is a Lecturer in Forensic Linguistics in the Department of Linguistics and English Language at Lancaster University, UK. Much of her research considers speech technologies in forensic casework and investigative scenarios, but other contributions address topics and challenges that affect casework practice and the forensic speech science community. In addition to her academic position she is a consultant for Soundscape Voice Evidence, a UK-based forensic speech analysis provider.


  • Javier Franco-Pedroso, Independent Researcher

    Javier Franco-Pedroso received his PhD from Universidad Autónoma de Madrid (UAM) in 2016. He undertakes research in speaker and language recognition, forensic evidence evaluation and financial time-series analysis and synthesis, among other topics. After a two-year period as a postdoctoral researcher and assistant professor at UAM, he moved into industry. He has been working as a Speech Recognition Engineer in keyword spotting and automatic speech recognition applications for several companies.

  • Joaquin González-Rodríguez, Universidad Autónoma de Madrid

    Joaquin Gonzalez-Rodriguez, Ph.D. (1999), is a Full Professor at Universidad Autonoma de Madrid (UAM). He has led ATVS/AUDIAS participations in multiple NIST Speaker and Language Recognition Evaluations since 2001, and since 2000 was an invited member of the FSAAWG (Forensic Speech and Audio Analysis Working Group) in ENFSI (European Network of Forensic Science Institutes). In 2007, he authored “Emulating DNA: Rigorous quantification of evidential weight in transparent and testable forensic speaker recognition”. In 2008, he received a Google Faculty Research Award, and addressed in Brisbane (Australia) a keynote plenary talk at Interspeech 2008 entitled “Forensic Automatic Speaker Recognition: Fiction or Science?”. During academic term 2010-2011, he was a Visiting Scholar in the Speech Group at ICSI (International Computer Science Institute) in the University of California at Berkeley. His research interests are focused on speech and audio processing, machine learning and forensic science.

References

Adadi, A. and Berrada, M. (2018). Peeking inside the black-box: a survey on Explainable Artificial Intelligence (XAI). IEEE Access. 6. 52138-52160.

D’Arcy, S., Russell, M., Browning, S. and Tomlinson, M. (2004). The Accents of the British Isles (ABI) corpus. In Proceedings of Modélisations pour l’Identification des Langues. Paris, France. 115-119.

Bahari, M.H., Saeidi, R., Van Hamme, H. and Van Leeuwen, D. (2013). Accent recognition using i-vector, Gaussian mean supervector and Gaussian posterior probability supervector for spontaneous telephone speech. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada. 7344-7348.

Behravan, H., Hautamäki, V. and Kinnunen, T. (2013). Foreign accent detection from spoken Finnish using i-vectors. In Proceedings of Interspeech. Lyon, France. 79-82.

Behravan, H., Hautamäki, V. and Kinnunen, T. (2015). Factors affecting i-vector based foreign accent recognition: A case study in spoken Finnish. Speech Communication. 66. 118-129.

Biadsy, F., Soltau, H., Mangu, L., Navratil, J. and Hirschberg, J. (2010). Discriminative phonotactics for dialect recognition using context-dependent phone classifiers. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop. Brno, Czech Republic. 263-270.

Boril, H., Sangwan, A. and Hansen, J. (2012). Arabic Dialect Identification – ‘Is the secret in the silence?’ and other observations. In Proceedings of Interspeech. Portland, Oregon. 30-33.

Brown, G. (2015). Automatic recognition of geographically-proximate accents using content-controlled and content-mismatched speech data. In Proceedings of the 18th International Congress of Phonetic Sciences. Glasgow, UK.

Brown, G. (2016). Automatic accent recognition systems and the effects of data on performance. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop. Bilbao, Spain.

Brown, G. (2017). Considering automatic accent recognition technology for forensic applications. Ph.D. thesis, University of York, UK.

Brown, G. (2018). Segmental content effects on text-dependent automatic accent recognition. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop. Les Sables d’Olonne, France. 9-15.

Brown, G. and Wormald, J. (2017). Automatic Sociophonetics: Exploring corpora using a forensic accent recognition system. Journal of the Acoustical Society of America. 142. 422-433.

Brümmer, N. (2007). FoCal Multi-Class: Toolkit for evaluation, fusion and calibration of multi-class recognition scores – tutorial and user manual. URL: https://sites.google.com/site/nikobrummer/focalmulticlass (Accessed: 17/04/2017).

Brümmer, N. and Van Leeuwen, D. (2006). On calibration of language recognition scores. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop. San Juan, Puerto Rico.

Chen, T., Huang, C., Chang, E. and Wang, J. (2001). Automatic accent identification using Gaussian Mixture Models. In Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding. Italy.

Clopper, C. and Pisoni, D. (2004). Some acoustic cues for the perceptual categorization of American English regional dialects. Journal of Phonetics. 32. 111-140.

Dehak, N., Kenny, P., Dehak, R., Dumouchel, P. and Ouellet, P. (2011). Front-End Factor Analysis for Speaker Verification. IEEE Transactions on Audio, Speech and Language Processing. 19. 788-798.

Drummond, R. (2012). Aspects of identity in a second language: ING variation in the speech of Polish migrants living in Manchester, UK. Language Variation and Change. 24. 107-133.

Ferragne, E., Gendrot, C. and Pellegrini, T. (2019). Towards phonetic interpretability in deep learning applied to voice comparison. In Proceedings of the International Congress of Phonetic Sciences. Melbourne, Australia.

Ferragne, E. and Pellegrino, F. (2010). Vowel systems and accent similarity in the British Isles: Exploiting multidimensional acoustic distances in phonetics. Journal of Phonetics. 38. 526-539.

Franco-Pedroso, J. and González-Rodríguez, J. (2016). Linguistically-constrained formant-based i-vectors for automatic speaker recognition. Speech Communication. 76. 61-81.

González-Rodríguez, J., Rose, P., Ramos-Castro, D., Toledano, D. and Ortega-Garcia, J. (2007). Emulating DNA: Rigorous quantification of evidential weight in transparent and testable forensic speaker recognition. IEEE Transactions on Audio, Speech and Language Processing. 15. 2104-2115.

Grabe, E. (2004). Intonational variation in urban dialects of English spoken in the British Isles. In P. Gilles and J. Peters (Eds.) Regional Variation in Intonation. Linguistische Arbeiten, Tübingen. 9-31.

Hanani, A., Russell, M. and Carey, M. (2013). Human and computer recognition of regional accents and ethnic groups from British English speech. Computer Speech and Language. 27. 59-74.

Huckvale, M. (2004). ACCDIST: a metric for comparing speakers’ accents. In Proceedings of the International Conference on Spoken Language Processing. Jeju, Korea. 29-32.

Huckvale, M. (2007). ACCDIST: An accent similarity metric for accent recognition and diagnosis. In C Müller (Ed.) Speaker Classification, Volume 2 of Lecture Notes in Computer Science. Springer-Verlag, Berlin Heidelberg. 258-274.

Huckvale, M. (2016). Within-speaker features for native language recognition in the Interspeech 2016 Computational Paralinguistics Challenge. In Proceedings of Interspeech. San Francisco, USA. 2403-2407.

Hughes, V. and Foulkes, P. (2015). The relevant population in forensic voice comparison: Effects of varying delimitations of social class and age. Speech Communication. 66. 218-230.

Hughes, V., Wood, S. and Foulkes, P. (2016). Strength of forensic voice comparison evidence from the acoustics of filled pauses. The International Journal of Speech, Language and the Law. 23. 99-132.

Kajarekar, S. S., Scheffer, N., Graciarena, M., Shriberg, E., Stolcke, A., Ferrer, L. and Bocklet, T. (2009). The SRI NIST 2008 Speaker Recognition Evaluation System. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing. Taipei, Taiwan. 4205-4208.

Kataria, A. and Singh, M. D. (2013). Review of data classification using k-nearest neighbour algorithm. International Journal of Emerging Technology and Advanced Engineering. 3. 354-360.

Lan, Y. (2020). Perception of English fricatives and affricates by advanced Chinese learners of English. In Proceedings of Interspeech. Shanghai, China. 4467-4470.

de Leeuw, E. (2007). Hesitation markers in English, German and Dutch. Journal of Germanic Linguistics. 19. 85-114.

McDougall, K. and Duckworth, M. (2017). Profiling Fluency: An analysis of individual variation in disfluencies in adult males. Speech Communication. 95. 16-27.

Morrison, G. S. (2013). Tutorial on logistic-regression calibration and fusion: converting a score to a likelihood ratio. Australian Journal of Forensic Sciences. 45. 173-197.

Najafian, M., Safavi, S., Weber, P. and Russell, M. (2016). Identification of British English regional accent using fusion of i-vector and multi-accent phonotactic systems. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop. Bilbao, Spain.

Nance, C., Kirkham, S. and Groarke, E. (2018). Studying intonation in varieties of English: Gender and individual variation in Liverpool. In N. Braber and S. Jansen (Eds.) Sociolinguistics in England. Palgrave Macmillan, Basingstoke. 274-296.

Piske, T., MacKay, I. and Flege, J. (2001). Factors affecting degree of foreign accent in an L2: A review. Journal of Phonetics. 29. 191-215.

Przybocki, M. and Martin, A. (2004). NIST Speaker Recognition Evaluation chronicles. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop. Toledo, Spain.

Ramos-Castro, D., González-Rodríguez, J. and Ortega-Garcia, J. (2006). Likelihood ratio calibration in a transparent and testable forensic speaker recognition framework. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop. San Juan, Puerto Rico.

Reynolds, D. and Rose, R. (1995). Robust Text-Independent Speaker Identification using Gaussian Mixture Speaker Models. IEEE Transactions on Speech and Audio Processing. 3. 72-83.

Samek, W., Wiegand, T. and Müller, K-R. (2017). Explainable Artificial Intelligence: understanding, visualizing and interpreting deep learning models. URL: https://arxiv.org/pdf/1708.08296.pdf

Schuller, B., Steidl, S., Batliner, A., Hirschberg, J., Burgoon, J., Baird, A., Elkins, A., Zhang, Y., Coutinho, E. and Evanini, K. (2016). The Interspeech 2016 Computational Paralinguistics Challenge: Deception, sincerity and native language. In Proceedings of Interspeech. San Francisco, USA. 2001-2005.

Shon, S., Ali, A., & Glass, J. (2018). Convolutional neural network and language embeddings for end-to-end dialect recognition. In Proceedings of Odyssey: the speaker and language recognition workshop. Les Sables d’Olonne, France. 98-104.

Snyder, D., Garcia-Romero, D., Povey, D. and Khudanpur, S. (2017). Deep Neural Network Embeddings for Text-Independent Speaker Verification. In Proceedings of Interspeech. Stockholm, Sweden. 999-1003.

Stuart-Smith, J. (1999). Glasgow: Accent and Voice Quality. In P. Foulkes and G. Docherty (Eds.) Urban Voices: Accent Studies in the British Isles. Routledge, London. 203-222.

Tully, G. (2020). Codes of Practice and Conduct for forensic science providers and practitioners in the Criminal Justice System. FSR-C-100. Issue 5. The UK Government. URL: https://www.gov.uk/government/publications/forensic-science-providers-codes-of-practice-and-conduct-2020.

Vieru, B., Boula de Mareüil, P. and Adda-Decker, M. (2011). Characterisation and Identification of Non-native French Accents. Speech Communication. 53. 292-310.

Watt, D. (2010). The identification of the individual through speech. In C. Llamas and D. Watt (Eds.) Language and Identities. Edinburgh University Press, Edinburgh. 76-85.

Watt, D., Harrison, P. and Cabot-King, L. (2020). Who owns your voice? Linguistic and legal perspectives on the relationship between vocal distinctiveness and the rights of the individual speaker. The International Journal of Speech, Language and the Law. 26. 137-180.

Published

2022-07-08

Issue

Section

Articles

How to Cite

Brown, G., Franco-Pedroso, J., & González-Rodríguez, J. (2022). A segmentally informed solution to automatic accent classification and its advantages to forensic applications. International Journal of Speech, Language and the Law, 28(2), 201–232. https://doi.org/10.1558/ijsll.20446