Controlled voice quality modifications
Acoustic, perceptual and ASR analysis
DOI:
https://doi.org/10.1558/ijsll.26094
Keywords:
voice quality, voice disguise, phonation, articulation, Czech
Abstract
Within-speaker variability, which results from the plasticity of speech production, is an inherent feature of speaker comparison. This study examines targeted modifications of voice quality, both phonatory and articulatory, in Czech. Fifteen speakers were instructed to read a text in 15 different versions (e.g. palatalised voice, denasalised voice, breathy phonation, or a combination of open jaw and creaky voice). Acoustic analyses revealed that F3 is relatively stable across the various voice quality settings, while harmonicity and spectral slope indicators are sensitive to the phonatory modifications. A perceptual test, administered online to 120 participants, showed that palatalised, pharyngealised, creaky, and pressed voice were judged to be most different from the speakers’ habitual voices. Finally, automatic speaker recognition performance remained very good under the targeted voice quality modifications, with log-likelihood ratios (LLR) between 3 and 10. Pressed phonation turned out to have the greatest effect in all three types of analysis.