The effect of speaker sampling in likelihood ratio based forensic voice comparison

  • Bruce Xiao Wang University of York
  • Vincent Hughes University of York
  • Paul Foulkes University of York
Keywords: forensic voice comparison, establishing reliability and validity, likelihood ratio, system stability, English filled pause, Cantonese sentence-final particle


Within the field of forensic voice comparison (FVC), there is growing pressure for experts to demonstrate the validity and reliability of the conclusions they reach in casework. One benefit of a fully data-driven approach that uses databases of speakers to compute numerical likelihood ratios (LRs) is that validity and reliability can be estimated empirically. However, little is known about the stability of LR output as a function of the specific speakers sampled for the training, test and reference data sets. The present study addresses this issue using two large sets of formant data: the Cantonese sentence-final particle /a/ and the British English filled pause UM. Experiments were replicated 100 times, varying (1) the training, test and reference speakers, (2) the training speakers only, (3) the test speakers only, and (4) the reference speakers only. The results show that varying the speakers in all three sets has the greatest effect on system stability for both the Cantonese and English variables, with the log-likelihood-ratio cost (Cllr) ranging from 0.60 to 0.97 for /a/ and from 0.32 to 1.33 for UM. However, this variability is primarily due to uncertainty in the test set. Varying only the training speakers has the least effect on system stability for /a/ (Cllr range: 0.76 to 0.88), while varying only the reference speakers has the least effect for UM (Cllr range: 0.40 to 0.54). The results indicate that in LR-based FVC it is important to assess the stability of the system as a function of the samples of speakers used (the Cllr range), rather than reporting a single Cllr value based on one configuration of speakers in each set. The study contributes to the general debate on reporting uncertainty in LR computation.
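To make the reported quantity concrete, the following is a minimal sketch of how a Cllr range over repeated speaker samplings might be computed. It uses the standard definition of the log-likelihood-ratio cost. It is not the authors' implementation: the function names `cllr`, `cllr_range` and `simulate_replication`, the pair counts, and the normally distributed log10 LRs are all hypothetical stand-ins for a real system that would recompute LRs from formant data for each sample of speakers.

```python
import math
import random


def cllr(same_speaker_lrs, different_speaker_lrs):
    """Log-likelihood-ratio cost for one set of test LRs (standard definition)."""
    # Same-speaker pairs are penalised when their LR is small.
    ss_penalty = sum(math.log2(1 + 1 / lr) for lr in same_speaker_lrs)
    ss_penalty /= len(same_speaker_lrs)
    # Different-speaker pairs are penalised when their LR is large.
    ds_penalty = sum(math.log2(1 + lr) for lr in different_speaker_lrs)
    ds_penalty /= len(different_speaker_lrs)
    return 0.5 * (ss_penalty + ds_penalty)


def cllr_range(replications):
    """Return (min, max) Cllr across replications of (ss_lrs, ds_lrs) pairs."""
    values = [cllr(ss, ds) for ss, ds in replications]
    return min(values), max(values)


def simulate_replication(rng, n_ss=20, n_ds=380):
    """Hypothetical stand-in for one replication with resampled speakers.

    ASSUMPTION: log10 LRs are drawn from normal distributions purely for
    illustration; a real experiment would compute them from speech data.
    """
    ss = [10 ** rng.gauss(0.5, 0.7) for _ in range(n_ss)]
    ds = [10 ** rng.gauss(-0.5, 0.7) for _ in range(n_ds)]
    return ss, ds


rng = random.Random(0)  # fixed seed so the sketch is repeatable
replications = [simulate_replication(rng) for _ in range(100)]
low, high = cllr_range(replications)
print(f"Cllr range over 100 replications: {low:.2f} to {high:.2f}")
```

Under this definition, a system that always outputs LR = 1 has a Cllr of exactly 1, so values above 1 (such as the upper bound of 1.33 reported for UM) indicate LRs that are on average more misleading than providing no evidence at all.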

Author Biographies

Bruce Xiao Wang, University of York

Bruce Xiao Wang is currently a PhD candidate in Forensic Speech Science in the Department of Language and Linguistic Science at the University of York, UK. His research interests lie in forensic voice comparison, probability theory and uncertainty in forensic evidence evaluation, phonological variation and change, and sociophonetics.

Vincent Hughes, University of York

Vincent Hughes is a Lecturer in Forensic Speech Science in the Department of Language and Linguistic Science at the University of York. His research interests lie in forensic speech science, phonetics, phonology, sociophonetics and sociolinguistics. His current research focuses on understanding the bases and limitations of individual speaker characterisation and the relative contribution of acoustic, auditory and biological information. He is also interested in the application of the numerical likelihood ratio framework to the evaluation of speech evidence in forensic voice comparison cases. His doctoral research considered how the definition of the relevant population with regard to regional and social dimensions of variability and sample size affects the numerical estimation of the strength of evidence.

Paul Foulkes, University of York

Paul Foulkes is a Professor in the Department of Language and Linguistic Science at the University of York. His interests are mainly in forensic speech science, sociophonetics and child language development.


How to Cite
Xiao Wang, B., Hughes, V., & Foulkes, P. (2019). The effect of speaker sampling in likelihood ratio based forensic voice comparison. International Journal of Speech, Language and the Law, 26(1), 97-120.