Controlled voice quality modifications: Acoustic, perceptual and ASR analysis

Tomáš Nechanský; Alžběta Houzar; Tomáš Bořil; Radek Skarnitzl

doi:10.1558/ijsll.26094

Authors

Tomáš Nechanský, Charles University
Alžběta Houzar, Charles University
Tomáš Bořil, Charles University
Radek Skarnitzl, Charles University

DOI:

https://doi.org/10.1558/ijsll.26094

Keywords:

voice quality, voice disguise, phonation, articulation, Czech

Abstract

Within-speaker variability, which results from the plasticity of speech production, is an inherent feature of speaker comparison. This study examines targeted modifications of voice quality, both phonatory and articulatory, in Czech. Fifteen speakers were instructed to read a text in 15 different versions (e.g. palatalised voice, denasalised voice, breathy phonation, or a combination of open jaw and creaky voice). Acoustic analyses revealed that F3 is relatively stable across various voice quality settings, while harmonicity and spectral slope indicators are sensitive to the phonatory modifications. A perceptual test, administered online to 120 participants, showed that palatalised, pharyngealised, creaky, and pressed voice were regarded as most different from the speakers’ habitual voices. Finally, automatic speaker recognition scores were very good with the targeted voice quality modifications, with log-likelihood ratios (LLR) between 3 and 10. Pressed phonation turned out to have the greatest effect on all three types of analysis.

Author Biographies

Tomáš Nechanský, Charles University

Tomáš Nechanský is a Ph.D. candidate in phonetics at the Institute of Phonetics, Charles University, Prague, Czech Republic. He is a published author in the field of forensic phonetics, focusing mainly on segmental features. Currently, he delivers lectures and seminars on Modern Technologies in Education and English Phonetics and Phonology at Prague City University. His professional roles have included linguistic analysis, project management and teaching English at various education levels.

Alžběta Houzar, Charles University

Alžběta Houzar is a postdoctoral researcher at the Institute of Phonetics in Prague. Her main research subject is intra-speaker and inter-speaker variability in acoustic parameters of the speech signal; she is interested in its relation to factors such as speech style or voice disguise. Additionally, she is currently focusing on speech acquisition, particularly the development of speech perception and comprehension of pragmatic aspects of communication.

Tomáš Bořil, Charles University

Tomáš Bořil received a Ph.D. in electrical engineering at the Czech Technical University in Prague. He is an Assistant Professor at the Institute of Phonetics in Prague. His research is focused on speech acoustics, perception, biological signal processing, statistics and software design including real-time speech processing. He is the author of the rPraat/mPraat package.

Radek Skarnitzl, Charles University

Radek Skarnitzl is an Associate Professor at the Institute of Phonetics in Prague. His research focuses on issues related to speaker identification, especially the effects of voice disguise. He is also interested in the teaching of pronunciation of English and particularly its prosodic features, as well as the impact of various pronunciation features on the socio-psychological evaluation of speakers in both native and foreign languages.
