Vowel convergence does not affect auditory speaker discriminability in humans and machine in a case study on Swiss German dialects





Keywords: phonetic convergence, vowel acoustics, speaker discrimination, automatic speaker verification, Swiss German dialects


In this study, we examined whether convergence in interlocutors’ vowel acoustics reduces the discriminability of their voices. Ten pairs of Grison and Zürich German speakers produced lexical items before and after dialogue interactions, and their post-dialogue productions showed evidence of vowel convergence. In Experiment 1, native and non-native Swiss German listeners discriminated pairs of speakers whose speech was recorded pre- and post-dialogue. Listeners’ sensitivity (A’) was higher for native than for non-native listeners, but comparable for pre- and post-dialogue recordings. The observed negative correlation between voice discrimination and acoustic distance in formant space was mainly driven by a single speaker pair. In Experiment 2, the speaker recognition performance of an i-vector-based system was compared between pre- and post-dialogue speech. Results revealed no difference in system performance between the two conditions. The findings suggest that vowel convergence does not compromise voice discriminability under the given experimental conditions.
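As a rough illustration of the sensitivity index A’ mentioned above, the following minimal sketch computes the nonparametric A’ from a hit rate and a false-alarm rate using the widely cited formula from Grier (1971). It assumes that A’ is derived from hit and false-alarm rates in a same/different discrimination task; the function name `a_prime` and the example rates are hypothetical and are not taken from the study.

```python
def a_prime(hit_rate: float, fa_rate: float) -> float:
    """Nonparametric sensitivity index A' (formula after Grier, 1971).

    hit_rate: proportion of correct detections (assumed definition)
    fa_rate:  proportion of false alarms (assumed definition)
    Returns 0.5 at chance level and 1.0 for perfect discrimination.
    """
    h, f = hit_rate, fa_rate
    if not (0.0 <= h <= 1.0 and 0.0 <= f <= 1.0):
        raise ValueError("rates must lie in [0, 1]")
    if h == f:
        # Equal hit and false-alarm rates: chance-level performance.
        return 0.5
    if h > f:
        # Above-chance branch.
        return 0.5 + ((h - f) * (1 + h - f)) / (4 * h * (1 - f))
    # Below-chance branch (hit rate lower than false-alarm rate).
    return 0.5 - ((f - h) * (1 + f - h)) / (4 * f * (1 - h))


if __name__ == "__main__":
    # Hypothetical rates for illustration only.
    print(round(a_prime(0.9, 0.1), 4))
```

Under these assumptions, comparing A’ values computed separately for pre- and post-dialogue trials would show whether sensitivity changes between the two conditions.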

Author Biographies

Elisa Pellegrino, University of Zurich

Elisa Pellegrino is a senior researcher in phonetics at the Department of Computational Linguistics at the University of Zurich. Her research interests and publications focus on vocal accommodation, speaker recognition, and the role of temporal information in speech communication. She is an Executive Committee member of the International Association for Forensic Phonetics and Acoustics, and a board member and the Case Manager of the Centre for Forensic Phonetics and Acoustics at the Department of Computational Linguistics.

Thayabaran Kathiresan, University of Zurich

Dr. Thayabaran Kathiresan has a PhD in computational linguistics and phonetics from the University of Zurich. He has both academic and industrial experience in speech signal processing, speaker and speech recognition, and machine learning. He is currently a senior research engineer at Telepathy Labs, Zurich.

Volker Dellwo, University of Zurich

Volker Dellwo is Associate Professor of Phonetics in the Department of Computational Linguistics at the University of Zurich and Chair of the Centre for Forensic Phonetics and Acoustics. He is an Associate Member of the European Network of Forensic Science Institutes and Director of the Linguistic Research Infrastructure at the University of Zurich. He has published widely in phonetics and speech sciences, and has over 18 years of experience in the analysis of voice recordings for forensic purposes.





How to Cite

Pellegrino, E., Kathiresan, T., & Dellwo, V. (2022). Vowel convergence does not affect auditory speaker discriminability in humans and machine in a case study on Swiss German dialects. International Journal of Speech, Language and the Law, 29(1), 60–84. https://doi.org/10.1558/ijsll.19954