Tuning the performance of automatic speaker recognition in different conditions: effects of language and simulated voice disguise
DOI: https://doi.org/10.1558/ijsll.39778

Keywords: automatic speaker recognition, forensic phonetics, voice disguise

Abstract
Automatic speaker recognition (ASR) applications have often been described as a ‘black box’. This study explores the benefit of tuning procedures (condition adaptation and reference normalisation) implemented in VOCALISE, an ASR system based on the i-vector PLDA framework. These procedures enable users to open the black box to a certain degree. Subsets of two 100-speaker databases, one of Czech and the other of Persian male speakers, are used for the baseline condition and for the tuning procedures. The effects of tuning with cross-language material and of simulated voice disguise, achieved by raising the fundamental frequency by four semitones and the resonance characteristics by 8%, are also examined. The results show better recognition performance (in terms of equal error rate, EER) for Persian than for Czech in the baseline condition, but the opposite result in the simulated disguise condition; possible reasons for this are discussed. Overall, the study suggests that both condition adaptation and reference normalisation are beneficial to recognition performance.
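To make the disguise manipulation concrete, the sketch below shows one way a comparable shift (fundamental frequency raised by four semitones, formant/resonance frequencies raised by 8%) could be applied to a recording in Python via the parselmouth interface to Praat (Boersma & Weenink, 2019). The file names and the pitch floor and ceiling values are illustrative assumptions; this is not necessarily the exact procedure used in the study.

```python
# Illustrative sketch only: approximates the simulated disguise described in the
# abstract (F0 raised by 4 semitones, resonance characteristics raised by 8%)
# using Praat's "Change gender" resynthesis through the parselmouth library.
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("speaker.wav")  # hypothetical input file

# Median F0 of the original recording (the 75-600 Hz analysis range is an
# assumption giving headroom above typical male voices).
pitch = snd.to_pitch()
f0_median = call(pitch, "Get quantile", 0, 0, 0.5, "Hertz")

# Raise F0 by four semitones (factor 2^(4/12)) and shift formants up by 8%.
new_median = f0_median * 2 ** (4 / 12)
disguised = call(snd, "Change gender",
                 75, 600,     # pitch floor and ceiling (Hz)
                 1.08,        # formant shift ratio (+8%)
                 new_median,  # new pitch median (Hz)
                 1.0,         # pitch range factor (unchanged)
                 1.0)         # duration factor (unchanged)

disguised.save("speaker_disguised.wav", parselmouth.SoundFileFormat.WAV)
```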
References

Alexander, A., Forth, O., Atreya, A. A., & Kelly, F. (2016). VOCALISE: A forensic automatic speaker recognition system supporting spectral, phonetic, and user-provided features. Proceedings of Odyssey 2016: The Speaker and Language Recognition Workshop, Bilbao.
Bijankhan, M. (2018). Phonology. In A. Sadeghi & P. Shabani-Jadidi (Eds.), The Oxford Handbook of Persian Linguistics (pp. 111–141). https://doi.org/10.1093/oxfordhb/9780198736745.013.5
Boersma, P., & Weenink, D. (2019). Praat: doing phonetics by computer. Retrieved from www.praat.org
Braun, A. (2006). Stimmverstellung und Stimmenimitation in der forensischen Sprechererkennung. In T. Kopfermann (Ed.), Das Phänomen Stimme: Imitation und Identität: 5. Internationale Stuttgarter Stimmtage 2004. Hellmut K. Geissner zum 80. Geburtstag (pp. 177–182). St. Ingbert: Röhrig Universitätsverlag.
Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., & Ouellet, P. (2011). Front-End Factor Analysis for Speaker Verification. IEEE Transactions on Audio, Speech & Language Processing, 19(4), 788–798. https://doi.org/10.1109/TASL.2010.2064307
Enzinger, E. (2015). Implementation of forensic voice comparison within the new paradigm for the evaluation of forensic evidence (The University of New South Wales). Retrieved from http://handle.unsw.edu.au/1959.4/55772
Fant, G. (1960). Acoustic Theory of Speech Production. The Hague: Mouton.
Farrús, M. (2018). Voice Disguise in Automatic Speaker Recognition. ACM Computing Surveys, 51(4), 68:1–68:22. https://doi.org/10.1145/3195832
Farrús, M., Wagner, M., Erro, D., & Hernando, J. (2010). Automatic speaker recognition as a measurement of voice imitation and conversion. International Journal of Speech, Language and the Law, 17(1), 119–142. https://doi.org/10.1558/ijsll.v17i1.119
Gfroerer, S. (2003). Auditory-instrumental forensic speaker recognition. Proceedings of the 8th European Conference on Speech Communication and Technology (EUROSPEECH 2003 – INTERSPEECH 2003), Geneva, Switzerland, 705–708. Retrieved from http://www.isca-speech.org/archive/eurospeech_2003/e03_0705.html
Giddens, C. L., Barron, K. W., Byrd-Craven, J., Clark, K. F., & Winter, A. S. (2013). Vocal indices of stress: a review. Journal of Voice, 27(3), 390.e21-9. https://doi.org/10.1016/j.jvoice.2012.12.010
Gold, E., & French, P. (2011). International practices in forensic speaker comparison. International Journal of Speech Language and the Law, 18(2), 293–307. https://doi.org/http://dx.doi.org/10.1558/ijsll.v18i2.293
Gold, E., & French, P. (2019). International Practices in Forensic Speaker Comparisons: Second Survey. International Journal of Speech Language and the Law, 26(1).
Hansen, J. H. L., & Hasan, T. (2015). Speaker recognition by machines and humans: A tutorial review. IEEE Signal Processing Magazine, 32(6), 74–99. https://doi.org/10.1109/MSP.2015.2462851
Hautamäki, R. G., Kinnunen, T., Hautamäki, V., & Laukkanen, A.-M. (2015). Automatic versus human speaker verification: The case of voice mimicry. Speech Communication, 72, 13–31. https://doi.org/10.1016/j.specom.2015.05.002
Hughes, V., & Foulkes, P. (2015). The relevant population in forensic voice comparison: Effects of varying delimitations of social class and age. Speech Communication, 66, 218–230. https://doi.org/https://doi.org/10.1016/j.specom.2014.10.006
Kelly, F., Forth, O., Kent, S., Gerlach, L., & Alexander, A. (2019). Deep Neural Network Based Forensic Automatic Speaker Recognition in VOCALISE using x-Vectors. Audio Engineering Society Conference: 2019 AES International Conference on Audio Forensics. Retrieved from http://www.aes.org/e-lib/browse.cfm?elib=20477
Kinnunen, T., & Li, H. (2010). An overview of text-independent speaker recognition: From features to supervectors. Speech Communication, 52(1), 12–40. https://doi.org/10.1016/J.SPECOM.2009.08.009
Kirchhübel, C., & Howard, D. (2013). Detecting suspicious behaviour using speech: Acoustic correlates of deceptive speech – An exploratory investigation. Applied Ergonomics, 44(5), 694–702. https://doi.org/10.1016/J.APERGO.2012.04.016
Kirchhübel, C., Howard, D. M., & Stedmon, A. W. (2011). Acoustic Correlates of Speech when Under Stress: Research, Methods and Future Directions. International Journal of Speech, Language and the Law, 18(1), 75–98. https://doi.org/10.1558/ijsll.v18i1.75
Künzel, H. J. (2000). Effects of voice disguise on speaking fundamental frequency. Forensic Linguistics, 7(2), 149–179. Retrieved from https://www2.scopus.com/inward/record.uri?eid=2-s2.0-54249140687&partnerID=40&md5=91a9ecd533c278f5e6fc8f1d80299550
Laukkanen, A., Takalo, R., Vilkman, E., Nummenranta, J., & Lipponen, T. (1999). Simultaneous videofluorographic and dual-channel electroglottographic registration of the vertical laryngeal position in various phonatory tasks. Journal of Voice, 13(1), 60–71. https://doi.org/10.1016/S0892-1997(99)80062-9
Laver, J. (1980). The Phonetic Description of Voice Quality. Cambridge: Cambridge University Press.
Modarresi Ghavami, G. (2018). Phonetics. In A. Sadeghi & P. Shabani-Jadidi (Eds.), The Oxford Handbook of Persian Linguistics (pp. 91–110). https://doi.org/10.1093/oxfordhb/9780198736745.013.4
Morrison, G. S. (2011). Measuring the validity and reliability of forensic likelihood-ratio systems. Science & Justice, 51(3), 91–98. https://doi.org/10.1016/J.SCIJUS.2011.03.002
Morrison, G. S., Ochoa, F., & Thiruvaran, T. (2012). Database selection for forensic voice comparison. Proceedings of Odyssey 2012: The Language and Speaker Recognition Workshop, Singapore, (June), 62–77.
Oxford Wave Research. (2017). iVOCALISE 2017B.
Reynolds, D. A. (1997). Comparison of background normalization methods for text-independent speaker verification. Proceedings of Eurospeech 1997, 963–966.
Reynolds, D. A., Quatieri, T. F., & Dunn, R. B. (2000). Speaker Verification Using Adapted Gaussian Mixture Models. Digital Signal Processing, 10(1–3), 19–41. https://doi.org/10.1006/dspr.1999.0361
Rose, P. (2002). Forensic Speaker Identification. London: Taylor and Francis.
Růžičková, A., & Skarnitzl, R. (2017). Voice disguise strategies in Czech male speakers. Acta Universitatis Carolinae – Philologica 3, Phonetica Pragensia XIV, 19–34. https://doi.org/10.14712/24646830.2017.30
San Segundo, E., & Mompean, J. (2017). A simplified vocal profile analysis protocol for the assessment of voice quality and speaker similarity. Journal of Voice, 31(5), 644.e11-644.e27. https://doi.org/10.1016/J.JVOICE.2017.01.005
San Segundo, E., & Skarnitzl, R. (in print). A computer-based tool for the assessment of voice quality through visual analogue scales: VAS-Simplified Vocal Profile Analysis. Journal of Voice. https://doi.org/10.1016/j.jvoice.2019.10.007.
Scherer, K. (2003). Vocal communication of emotion: A review of research paradigms. Speech Communication, 40(1–2), 227–256. https://doi.org/10.1016/S0167-6393(02)00084-5
Shipp, T. (1987). Vertical laryngeal position: Research findings and application for singers. Journal of Voice, 1(3), 217–219. https://doi.org/10.1016/S0892-1997(87)80002-4
Skarnitzl, R., Šturm, P., & Volín, J. (2016). Zvuková báze řečové komunikace: Fonetický a fonologický popis řeči. Praha: Karolinum.
Skarnitzl, R., & Vaňková, J. (2017). Fundamental frequency statistics for male speakers of Common Czech. Acta Universitatis Carolinae – Philologica 3, Phonetica Pragensia XIV, 7–17. https://doi.org/https://doi.org/10.14712/24646830.2017.29
Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018). X-Vectors: Robust DNN Embeddings for Speaker Recognition. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5329–5333. https://doi.org/10.1109/ICASSP.2018.8461375
Tan, T. (2010). The effect of voice disguise on Automatic Speaker Recognition. 2010 3rd International Congress on Image and Signal Processing, 8, 3538–3541. https://doi.org/10.1109/CISP.2010.5647131
Tirumala, S., Shahamiri, S., Garhwal, A., & Wang, R. (2017). Speaker identification features extraction methods: A systematic review. Expert Systems with Applications, 90, 250–271. https://doi.org/10.1016/J.ESWA.2017.08.015