Strength of forensic voice comparison evidence from the acoustics of filled pauses

Vincent Hughes; Sophie Wood; Paul Foulkes

doi:10.1558/ijsll.v23i1.29874

Authors

Vincent Hughes, University of York
Sophie Wood, University of York
Paul Foulkes, University of York

DOI:

https://doi.org/10.1558/ijsll.v23i1.29874

Keywords:

Forensic voice comparison, hesitation markers, likelihood ratio, formant dynamics, durations

Abstract

This study investigates the evidential value of filled pauses (FPs, i.e. um, uh) as variables in forensic voice comparison. FPs for 60 young male speakers of standard southern British English were analysed, drawn from Task 1 of the DyViS corpus (Nolan et al. 2009). The following acoustic properties were analysed: midpoint frequencies of the first three formants in the vocalic portion; ‘dynamic’ characterisations of formant trajectories (i.e. quadratic polynomial equations fitted to nine measurement points over the entire vowel); vowel duration; and nasal duration for um. Likelihood ratio (LR) scores were computed using the Multivariate Kernel Density formula (MVKD; Aitken and Lucy, 2004) and converted to calibrated log10 LRs (LLRs) using logistic regression (Brümmer et al., 2007). System validity was assessed using both equal error rate (EER) and the log LR cost function (Cllr; Brümmer and du Preez, 2006). The best-performing system combines dynamic measurements of all three formants with vowel and nasal duration for um, achieving an EER of 4.08% and a Cllr of 0.12. In terms of general patterns, um consistently outperformed uh. For um, the formant dynamic systems generated better validity than those based on midpoints, presumably reflecting the additional degree of formant movement in um caused by the transition from vowel to nasal. By contrast, midpoints outperformed dynamics for the more monophthongal uh. Further, the addition of duration (vowel, or vowel and nasal) consistently improved system performance. The study supports the view that FPs have excellent potential as variables in forensic voice comparison cases.
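As a rough illustration of the two validity metrics named in the abstract, the sketch below computes Cllr and an approximate EER from a set of calibrated log10 LRs. It is an independent Python sketch, not the study's own code (the reference list points to MATLAB tools for these steps). The function names `cllr` and `eer` and the toy input values are hypothetical, and the EER here is a simple threshold sweep rather than an interpolated value.

```python
import math


def cllr(ss_llrs, ds_llrs):
    """Log LR cost (Cllr) from log10 LRs.

    ss_llrs: log10 LRs from same-speaker comparisons.
    ds_llrs: log10 LRs from different-speaker comparisons.
    Assumes moderate magnitudes; very large values would overflow 10 ** x.
    """
    # Same-speaker pairs are penalised when the LR is small.
    ss_cost = sum(math.log2(1 + 10 ** (-x)) for x in ss_llrs) / len(ss_llrs)
    # Different-speaker pairs are penalised when the LR is large.
    ds_cost = sum(math.log2(1 + 10 ** x) for x in ds_llrs) / len(ds_llrs)
    return 0.5 * (ss_cost + ds_cost)


def eer(ss_llrs, ds_llrs):
    """Approximate equal error rate by sweeping observed scores as thresholds.

    Returns the mean of the false-acceptance and false-rejection rates at the
    threshold where the two rates are closest.
    """
    best_gap, best_rate = None, None
    for t in sorted(set(ss_llrs) | set(ds_llrs)):
        # False rejection: same-speaker score falls below the threshold.
        frr = sum(x < t for x in ss_llrs) / len(ss_llrs)
        # False acceptance: different-speaker score reaches the threshold.
        far = sum(x >= t for x in ds_llrs) / len(ds_llrs)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (far + frr) / 2
    return best_rate


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    same = [-1.0, 1.0, 2.0, 3.0]
    diff = [-3.0, -2.0, -0.5, 1.5]
    print("Cllr:", round(cllr(same, diff), 3))
    print("EER:", eer(same, diff))
```

A system that always outputs an LR of 1 (LLR of 0) gives a Cllr of exactly 1, which is why values well below 1, such as the 0.12 reported above, indicate that the LRs carry useful and reasonably well-calibrated information.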

Author Biographies

Vincent Hughes, University of York

Vincent Hughes is Lecturer in Forensic Speech Science at the University of York. In 2015, he was a post-doctoral research assistant on the project Voice and Identity – Source, Filter, Biometric (funded by the UK Arts and Humanities Research Council #AH/M003396/1, 2015-17). His research interests lie in forensic speech science, phonetics, phonology, sociophonetics and sociolinguistics. He is a member of the International Association of Forensic Phonetics and Acoustics.
Sophie Wood, University of York

Sophie Wood works for the UK Civil Service. She holds undergraduate (BA English Language and Linguistic Science) and postgraduate (MSc Forensic Speech Science) degrees from the University of York. She researched filled pauses as a discriminatory parameter for forensic speaker comparison for her MSc dissertation and presented at the 2014 IAFPA conference. Sophie also worked on the project ‘Perceptual adaptation to regional accents as a new lens on the puzzle of spoken word recognition’.
Paul Foulkes, University of York

Paul Foulkes is Professor in the Department of Language and Linguistic Science, University of York. His teaching and research interests include forensic phonetics, laboratory phonology, phonological development, and sociolinguistics. His current collaborators include Cathi Best, Jean-Pierre Chevrot, Gerry Docherty, Bronwen Evans, Peter French, Bill Haddican, Jen Hay, Vincent Hughes, Jason Shaw, Marilyn Vihman and Kim Wilson. He has worked on over 200 forensic cases from the UK, Ghana and New Zealand.

References

Acton, E. K. (2011) On gender differences in the distribution of um and uh. University of Pennsylvania Working Papers in Linguistics 17. http://repository.upenn.edu/pwpl/vol17/iss2/2

Aitken, C. G. G. and Lucy, D. (2004) Evaluation of trace evidence in the form of multivariate data. Applied Statistics 53(1): 109–122.

Aitken, C. G. G. and Taroni, F. (2004) Statistics and the Evaluation of Evidence for Forensic Scientists (2nd edn). Chichester: Wiley.

Atkinson, N. (2009) Formant dynamics of SSBE monophthongs in unscripted speech. Unpublished MSc dissertation, University of York.

Becker, T., Jessen, M. and Grigoras, C. (2008) Forensic speaker verification using formant features and Gaussian Mixture Models. Interspeech 2008 Special Session: Forensic Speaker Recognition – Traditional and Automatic Approaches. Brisbane, Australia: 1505–1508.

Boersma, P. and Weenink, D. (2014) Praat: doing phonetics by computer [Computer program]. Version 5.3.62.

Brander, D. (2014) Phonetic characteristics of hesitation vowels in Swiss German and their use for forensic phonetic speaker identification. Poster presented at the annual conference of the International Association for Forensic Phonetics and Acoustics, Zürich, Switzerland.

Brümmer, N. (n.d.) FoCal toolkit. https://sites.google.com/site/nikobrummer/focal (retrieved 3 June 2011).

Brümmer, N. and du Preez, J. (2006) Application-independent evaluation of speaker detection. Computer Speech and Language 20(2–3): 230–275. http://dx.doi.org/10.1016/j.csl.2005.08.001

Brümmer, N., Burget, L., Černocký, J., Glembek, O., Grézl, F., Karafiát, M., van Leeuwen, D. A., Matějka, P., Schwarz, P. and Strasheim, A. (2007) Fusion of heterogeneous speaker recognition systems in the STBU submission for the NIST SRE 2006. IEEE Transactions on Audio, Speech, and Language Processing 15: 2072–2084.

Champod, C. and Evett, I. W. (2000) Commentary on A. P. A. Broeders (1999) ‘Some observations on the use of probability in forensic identification’. Forensic Linguistics 7(2): 238–243.

Christenfeld, N. and Creager, B. (1996) Anxiety, alcohol, aphasia, and ums. Journal of Personality and Social Psychology 70(3): 451–460. http://dx.doi.org/10.1037/0022-3514.70.3.451

Clark, H. H. and Fox Tree, J. E. (2002) Using uh and um in spontaneous speech. Cognition 84: 73–111. http://dx.doi.org/10.1016/S0010-0277(02)00017-3

Clermont, F., French, J. P., Harrison, P. T. and Simpson, S. (2008) Population data for English spoken in England: a modest first step. Paper presented at the annual conference of the International Association for Forensic Phonetics and Acoustics, Lausanne, Switzerland.

Docherty, G. J. and Foulkes, P. (1999) Newcastle upon Tyne and Derby: instrumental phonetics and variationist studies. In P. Foulkes and G. J. Docherty (eds) Urban Voices: Accent Studies in the British Isles 47–71. London: Arnold.

Duckworth, M. and McDougall, K. (2013) Individual differences in fluency disruptions: a cross-style investigation. Paper presented at the annual conference of the International Association for Forensic Phonetics and Acoustics, Tampa, Florida.

Enzinger, E. and Morrison, G. S. (2012) The importance of using between-session test data in evaluating the performance of forensic-voice-comparison systems. In Proceedings of the 14th Australasian Conference on Speech Science and Technology 137–140. Sydney, Australia.

Eriksson, E. J., Cepeda, L. F., Rodman, R. D., McAllister, D. F., Bitzer, D. and Arroway, P. (2004) Cross-language speaker identification using spectral moments. In Proceedings of the 17th Swedish Phonetic Conference (FONETIK) 76–79. Stockholm, Sweden.

Evett, I. W. (1991) Interpretation: a personal odyssey. In C. G. G. Aitken and D. A. Stone (eds) The Use of Statistics in Forensic Science 9–22. Chichester: Ellis Horwood.

Foulkes, P., Carrol, G. and Hughes, S. (2004) Sociolinguistics and acoustic variability in filled pauses. Paper presented at the annual conference of the International Association for Forensic Phonetics and Acoustics, Helsinki, Finland.

Foulkes, P. and French, J. P. (2012) Forensic speaker comparison: a linguistic-acoustic perspective. In P. M. Tiersma and L. M. Solan (eds) The Oxford Handbook of Language and the Law 557–572. Oxford: Oxford University Press.

Greenberg, S., Carvey, H., Hitchcock, L. and Chang, S. (2003) Temporal properties of spontaneous speech – a syllable-centric perspective. Journal of Phonetics 31(3): 465–485. http://dx.doi.org/10.1016/j.wocn.2003.09.005

Grosjean, F. and Deschamps, A. (1973) Analyse des variables temporelles du français spontané. Phonetica 28(3–4): 191–226. http://dx.doi.org/10.1159/000259456

Hughes, V. (2014) The definition of the relevant population and the collection of data for likelihood ratio-based forensic voice comparison. Unpublished PhD thesis, University of York.

Hughes, V. and Foulkes, P. (2015) The relevant population in forensic voice comparison: effects of varying delimitations of social class and age. Speech Communication 66: 218–230. http://dx.doi.org/10.1016/j.specom.2014.10.006

Hughes, V., Wood, S. and Foulkes, P. (forthcoming) Phonetic measurements of hesitations improve the performance of automatic speaker recognition systems.

Jessen, M. (2008) Forensic phonetics. Language and Linguistics Compass 2(4): 671–711. http://dx.doi.org/10.1111/j.1749-818X.2008.00066.x

Jessen, M., Köster, O. and Gfroerer, S. (2005) Influence of vocal effort on average and variability of fundamental frequency. International Journal of Speech, Language and the Law 12(2): 174–213. http://dx.doi.org/10.1558/sll.2005.12.2.174

Johnson, K. (2012) Acoustic and Auditory Phonetics (3rd edn). Malden, MA: Wiley-Blackwell.

Kendall, T. and Thomas, E. R. (2014) ‘vowels’ (R package). http://cran.r-project.org/web/packages/vowels/index.html

Ketabdar, H. (2004) ‘jEER_DET.m’ (MATLAB function) (version 1.2 with amendments by Anil Alexander).

Kowal, S., O’Connell, D. C., Forbush, K., Higgins, M., Clarke, L. and D’Anna, K. (1997) Interplay of literacy and orality in inaugural rhetoric. Journal of Psycholinguistic Research 26(1): 1–31. http://dx.doi.org/10.1023/A:1025043620499

Künzel, H. J. (1997) Some general phonetic and forensic aspects of speaking tempo. International Journal of Speech, Language and the Law 4(1): 48–83.

Lennes, M. (2003a) Save_intervals_to_wav_sound_files.praat (Praat script) http://www.helsinki.fi/~lennes/praat-scripts/public/save_intervals_to_wav_sound_files.praat (retrieved 29 July 2013).

Lennes, M. (2003b) Collect_formant_data_from_files.praat. http://www.helsinki.fi/~lennes/praat-scripts/public/collect_formant_data_from_files.praat (retrieved 15 May 2013).

Liberman, M. (2014) UM / UH update. Language Log, 13 December 2014. http://languagelog.ldc.upenn.edu/nll/?p=16414 (and several other posts).

Maclay, H. and Osgood, C. (1959) Hesitation phenomena in spontaneous English speech. Word 15(1): 19–44. http://dx.doi.org/10.1080/00437956.1959.11659682

Martire, K. A., Kemp, R. I., Sayle, M. and Newell, B. R. (2013) On the interpretation of likelihood ratios in forensic science evidence: presentation formats and the weak evidence effect. Forensic Science International 240: 61–68. http://dx.doi.org/10.1016/j.forsciint.2014.04.005

McDougall, K. (2004) Speaker-specific formant dynamics: an experiment on Australian English /aɪ/. International Journal of Speech, Language and the Law 11(1): 103–130. http://dx.doi.org/10.1558/sll.2004.11.1.103

McDougall, K. (2006) Dynamic features of speech and the characterisation of speakers: towards a new approach using formant frequencies. International Journal of Speech, Language and the Law 13(1): 89–126. http://dx.doi.org/10.1558/sll.2006.13.1.89

McDougall, K. and Nolan, F. (2007) Discrimination of speakers using the formant dynamics of /uː/ in British English. In Proceedings of the 16th International Congress of Phonetic Sciences 1825–1828. Saarbrücken, Germany.

Milroy, L., Milroy, J. and Docherty, G. J. (1994–1997) Phonological Variation and Change in Contemporary British English. Economic and Social Research Council (ESRC) of Great Britain. R000234892.

Morrison, G. S. (2007) MATLAB implementation of Aitken and Lucy’s (2004) forensic likelihood ratio software using multivariate-kernel-density estimation. http://geoff-morrison.net/#MVKD (retrieved 31 May 2011).

Morrison, G. S. (2009a) Forensic voice comparison and the paradigm shift. Science and Justice 49(4): 298–308.

Morrison, G. S. (2009b) Likelihood-ratio voice comparison using parametric representations of the formant trajectories of diphthongs. Journal of the Acoustical Society of America 125(4): 2387–2397.

Morrison, G. S. (2009c) train_llr_fusion_robust.m (MATLAB function). http://geoff-morrison.net/#TrainFus (retrieved 13 December 2011).

Morrison, G. S. (2011a) Measuring the validity and reliability of forensic likelihood-ratio systems. Science and Justice 51: 91–98.

Morrison, G. S. (2011b) A comparison of procedures for the calculation of forensic likelihood ratios from acoustic-phonetic data: multivariate kernel density (MVKD) versus Gaussian mixture model-universal background model (GMM-UBM). Speech Communication 53: 242–256.

Morrison, G. S. (2013) Tutorial on logistic-regression calibration and fusion: converting a score to a likelihood ratio. Australian Journal of Forensic Sciences 45(2): 173–197. http://dx.doi.org/10.1080/00450618.2012.733025

Morrison, G. S. (2014) Distinguishing between forensic science and forensic pseudoscience: testing of validity and reliability and approaches to forensic voice comparison. Science and Justice 54(3): 245–256. http://dx.doi.org/10.1016/j.scijus.2013.07.004

Morrison, G. S., Ochoa, F. and Thiruvaran, T. (2012) Database selection for forensic voice comparison. In Proceedings of Odyssey 2012: The Language and Speaker Recognition Workshop 74–77. Singapore.

Morrison, G. S. and Enzinger, E. (2013) Forensic speech science. In N. Nic Daéid (ed.) Proceedings of the 17th International Forensic Science Managers’ Symposium 616–623. Lyon, France.

Mullen, C., Spence, D., Moxey, L. and Jamieson, A. (2014) Perception problems of the verbal scale. Science and Justice 54(2): 154–158. http://dx.doi.org/10.1016/j.scijus.2013.10.004

Nair, B., Alzqhoul, E. and Guillemin, B. J. (2014) Determination of likelihood ratios for forensic voice comparison using principal component analysis. International Journal of Speech, Language and the Law 21: 83–112.

Nolan, F. J. (1997) Speaker recognition and forensic phonetics. In W. J. Hardcastle and J. Laver (eds) The Handbook of Phonetic Sciences 744–767. Oxford: Blackwell.

Nolan, F., McDougall, K., de Jong, G. and Hudson, T. (2009) The DyViS database: style-controlled recordings of 100 homogeneous speakers for forensic phonetic research. International Journal of Speech, Language and the Law 16(1): 31–57.

R Core Team (2015) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/

Robertson, B. and Vignaux, G. A. (1995) Interpreting Evidence: Evaluating Forensic Science in the Courtroom. Chichester: John Wiley and Sons.

Rose, P. (2002) Forensic Speaker Identification. London: Taylor and Francis.

Rose, P. (2006) The intrinsic speaker discriminatory power of diphthongs. In Proceedings of the 11th Australasian Conference on Speech Science and Technology 64–67. Auckland, New Zealand.

Rose, P. (2013) Where the science ends and the law begins: likelihood ratio-based forensic voice comparison in a $150 million telephone fraud. International Journal of Speech, Language and the Law 20(2): 277–324. http://dx.doi.org/10.1558/ijsll.v20i2.277

Rose, P. (2015) Forensic voice comparison with monophthongal formant trajectories – a likelihood ratio-based discrimination of ‘schwa’ vowel acoustics in a close social group of young Australian females. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 4819–4823. Brisbane, Australia.

Rose, P., Kinoshita, Y. and Alderman, T. (2006) Realistic extrinsic forensic speaker discrimination with the diphthong /aɪ/. In Proceedings of the 11th Australasian Conference on Speech Science and Technology 329–334. Auckland, New Zealand.

Rose, P. and Morrison, G. (2009) A response to the UK position statement on forensic speaker comparison. International Journal of Speech, Language and the Law 16(1): 139–163. http://dx.doi.org/10.1558/ijsll.v16i1.139

Schachter, S., Christenfeld, N., Ravina, B. and Bilous, F. (1991) Speech disfluency and the structure of knowledge. Journal of Personality and Social Psychology 60(3): 362–367. http://dx.doi.org/10.1037/0022-3514.60.3.362

Shriberg, E. (2001) To ‘errrr’ is human: ecology and acoustics of speech disfluencies. Journal of the International Phonetic Association 31(1): 153–169. http://dx.doi.org/10.1017/S0025100301001128

Simpson, S. (2008) Testing the speaker discrimination ability of formant measurements in forensic speaker comparison cases. Unpublished MSc dissertation, University of York.

Stevens, K. (2001) Acoustic Phonetics. Cambridge, MA: MIT Press.

Swerts, M., Wichmann, A. and Beun, R.-J. (1996) Filled pauses as markers of discourse structure. In Proceedings of the International Conference on Spoken Language Processing (volume 2) 1033–1036. http://dx.doi.org/10.1109/ICSLP.1996.607780

Tabachnick, B. G. and Fidell, L. S. (2007) Using Multivariate Statistics (5th edn). Boston: Pearson.

Thaitechawat, S. and Foulkes, P. (2011) Discrimination of speakers using tone and formant dynamics in Thai. In Proceedings of the 17th International Congress of Phonetic Sciences 1978–1981. Hong Kong.

Tottie, G. (2011) Uh and Um as sociolinguistic markers in British English. International Journal of Corpus Linguistics 16(2): 173–197. http://dx.doi.org/10.1075/ijcl.16.2.02tot

Tschäpe, N., Trouvain, J., Bauer, D. and Jessen, M. (2005) Idiosyncratic patterns of filled pauses. Paper presented at the annual conference of the International Association for Forensic Phonetics and Acoustics, Marrakesh, Morocco.

Umeda, N. (1975) Vowel duration in American English. Journal of the Acoustical Society of America 58(2): 434–445. http://dx.doi.org/10.1121/1.380688

van Leeuwen, D. A. and Brümmer, N. (2007) An introduction to application-independent evaluation of speaker recognition systems. In C. Müller (ed.) Speaker Classification vol. 1: Selected Projects 330–353. Heidelberg: Springer.

Van Summers, W., Pisoni, D. B., Bernacki, R. H., Pedlow, R. I. and Stokes, M. A. (1988) Effects of noise on speech production: acoustic and perceptual analyses. Journal of the Acoustical Society of America 84(3): 917–928. http://dx.doi.org/10.1121/1.396660

Wells, J. C. (1982) Accents of English (3 volumes). Cambridge: Cambridge University Press.

Wickham, H. (2015) ggplot2 (R package). http://cran.r-project.org/web/packages/ggplot2/index.html
