A segmentally informed solution to automatic accent classification and its advantages to forensic applications
Keywords: automatic accent recognition, explainable technologies, segmental information, forensic applications
Traditionally, work in automatic accent recognition has followed a research trajectory similar to that of language identification, dialect identification and automatic speaker recognition. The acoustic modelling approaches used in speaker recognition (such as GMM-UBM and i-vector-based systems) have also been applied to automatic accent recognition. These approaches model speakers’ accents by taking acoustic features from across the whole speech signal, without knowledge of its phonetic content. For accent recognition in particular, however, phonetic information is expected to add substantial value to the task. The current work presents an alternative modelling approach to automatic accent recognition, which forms models of speakers’ pronunciation systems using segmental information. We argue that such an approach is more explainable, and is therefore more appropriate to deploy in settings where it is important to be able to communicate methods, such as forensic applications. We discuss the issue of explainability and show how the system operates on a large 700-speaker dataset of non-native English conversational telephone recordings.
© Equinox Publishing Ltd.