Tuning the performance of automatic speaker recognition in different conditions: effects of language and simulated voice disguise


  • Radek Skarnitzl Charles University
  • Maral Asiaee Alzahra University
  • Mandana Nourbakhsh Alzahra University




Keywords: automatic speaker recognition, forensic phonetics, voice disguise


Automatic speaker recognition (ASR) applications have often been described as a ‘black box’. This study explores the benefit of two tuning procedures, condition adaptation and reference normalisation, implemented in VOCALISE, an ASR system based on an i-vector PLDA framework. These procedures enable users to open the black box to a certain degree. Subsets of two 100-speaker databases, one of Czech and the other of Persian male speakers, are used for the baseline condition and for the tuning procedures. The study also examines the effect of tuning with cross-language material and the effect of simulated voice disguise, achieved by raising the fundamental frequency by four semitones and resonance characteristics by 8%. The results show better recognition performance, measured by the equal error rate (EER), for Persian than for Czech in the baseline condition, but the opposite result in the simulated disguise condition; possible reasons for this are discussed. Overall, the study suggests that both condition adaptation and reference normalisation are beneficial to recognition performance.
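
To make the two quantities in the abstract more concrete, the following is a minimal Python sketch, not the authors' procedure and not part of VOCALISE. It converts a shift in semitones into a frequency ratio (four semitones is a factor of roughly 1.26) and approximates an equal error rate from two lists of comparison scores. The function names, the example frequencies, and the simple threshold sweep are all assumptions made for illustration.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of the given number of equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Returns the mean of the false rejection and false acceptance rates
    at the first threshold where the two rates are closest to each other.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        # False rejection: a same-speaker score falls below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker score reaches the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(frr - far)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (frr + far) / 2
    return best_eer


if __name__ == "__main__":
    # Hypothetical speaker values, chosen only to illustrate the arithmetic.
    f0_hz, formant_hz = 120.0, 500.0
    print("disguised f0:", f0_hz * semitones_to_ratio(4))
    print("disguised formant:", formant_hz * 1.08)  # resonances raised by 8%
    # Hypothetical same-speaker and different-speaker comparison scores.
    print("EER:", equal_error_rate([1, 2, 3, 4], [0, 1, 2, 3]))
```

A lower EER indicates better separation between same-speaker and different-speaker comparisons; a production system would typically interpolate between thresholds rather than pick the nearest observed score.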

Author Biographies

Radek Skarnitzl, Charles University

Radek Skarnitzl is an Associate Professor at Charles University, Prague, Czech Republic, and director of the Institute of Phonetics. His research focuses on issues related to speaker identification, especially the effects of disguise. He is also interested in the impact of various pronunciation features on the socio-psychological evaluation of a speaker in both native and foreign languages, as well as in the teaching of pronunciation of a foreign language.

Maral Asiaee, Alzahra University

Maral Asiaee is a PhD candidate in General Linguistics at Alzahra University, Tehran, Iran. She holds a BA in English Language and Literature from Shiraz University and an MA in General Linguistics from Alzahra University. Her research interests lie in forensic phonetics, acoustic phonetics, sociophonetics and psychoacoustics.

Mandana Nourbakhsh, Alzahra University

Mandana Nourbakhsh holds a PhD in General Linguistics from the University of Tehran and is currently an assistant professor teaching phonetics, phonology and psycholinguistics at the Linguistics Department of Alzahra University, Iran. Her research interests include laboratory phonetics and phonology, as well as psycholinguistics and psychoacoustics.




How to Cite

Skarnitzl, R., Asiaee, M., & Nourbakhsh, M. (2020). Tuning the performance of automatic speaker recognition in different conditions: effects of language and simulated voice disguise. International Journal of Speech, Language and the Law, 26(2), 209–229. https://doi.org/10.1558/ijsll.39778