Assessing the effects of accent-mismatched reference population databases on the performance of an automatic speaker recognition system
DOI:
https://doi.org/10.1558/ijsll.41466
Keywords:
Forensic phonetics, automatic speaker recognition, speech technology, forensic speaker comparison
Abstract
Automatic Speaker Recognition (ASR) systems are designed to provide the user with statistics relating to the similarity of two or more speech samples and to the typicality of those shared features in the wider population. When an ASR system is used as part of a forensic investigation, the user must decide what counts as the appropriate ‘wider population’ and select a reference database accordingly. While it has generally been held that the voices populating the reference database should be similar in accent to that of the samples under consideration, the degree to which the accents should correspond has until now not been investigated empirically. We report in this article on a study in which the composition of the reference database was systematically varied in terms of accent, using corpora of samples of Standard Southern British English and of three subvarieties spoken in North-East England (Newcastle, Sunderland, Middlesbrough).