The Use of ASR-Equipped Software in the Teaching of Suprasegmental Features of Pronunciation

A Critical Review


  • Tim Kochem, Iowa State University
  • Jeanne Beck, Iowa State University
  • Erik Goodale, Iowa State University



Language Teaching; Automatic Speech Recognition (ASR) tools; Computer Assisted Pronunciation Training; suprasegmentals; pronunciation instruction; automatic speech recognition


Technology has paved the way for new modalities in language learning, teaching, and assessment. However, a great deal of work remains to develop technological tools for oral communication, specifically tools that address suprasegmental features in pronunciation instruction. Therefore, this critical literature review examines how researchers have tried to create computer-assisted pronunciation training tools using automatic speech recognition (ASR) systems to aid language learners in the perception and production of suprasegmental features. We used 30 texts from 1990 to 2020 to explore how technologies have been and are currently being used to help learners develop their proficiency with suprasegmental features. Our thematic analysis shows that a gap persists between the ASR-equipped software available to participants in research studies and what is available to university and classroom teachers and students. Additionally, there seems to be more development in the production of speech software for language assessment. In contrast, the translation of these tools into instructional tools for individualized learning seems to be almost non-existent. Moving forward, we recommend that more commercialized pronunciation systems utilizing ASR be made publicly available, drawing on the technologies that have been developed or are in development for oral proficiency judgments.

Author Biographies

  • Tim Kochem, Iowa State University

    Tim Kochem is a PhD candidate in the applied linguistics and technology program at Iowa State University. His primary research areas include L2 pronunciation, teacher cognitions, classroom-based research, and technology-enhanced language learning. He has worked as a Graduate Peer Mentor and Supervisor at the Center for Communication Excellence in the Graduate College for over four years. He has also taught multiple global online courses for the Online Professional English Network (OPEN), including “Using educational technology in the English language classroom,” as well as introductory courses in public speaking and linguistics at Iowa State University.

  • Jeanne Beck, Iowa State University

    Jeanne Beck is a PhD student in the applied linguistics and technology program at Iowa State University. Her research interests include L2 assessment, project-based learning, CALL, and English learner policy. She has experience teaching English learners and technology at the K–12 level in the USA and Japan, as well as experience teaching college-level English learners and public speaking courses in the USA and South Korea. She mentors English teachers worldwide through the OPEN course “Using educational technology in the English language classroom,” and assists Iowa State Department of English instructors with technology and LMS needs.

  • Erik Goodale, Iowa State University

    Erik Goodale is a PhD student in the applied linguistics and technology program at Iowa State University. His research interests include L2 pronunciation instruction, oral communication, and online learning environments. He works as an English-speaking consultant and interpersonal communication consultant for the Center for Communication Excellence.


References marked with an asterisk indicate studies included in the text review.

*Al-Qudah, F. Z. M. (2012). Improving English pronunciation through computer-assisted programs in Jordanian universities. Journal of College Teaching & Learning (TLC), 9(3), 201–208.

*Anderson-Hsieh, J. (1992). Using electronic visual feedback to teach suprasegmentals. System, 20(1), 51–62.

Anderson-Hsieh, J., Johnson, R., & Koehler, K. (1992). The relationship between native speaker judgments of nonnative pronunciation and deviance in segmentals, prosody, and syllable structure. Language Learning, 42(4), 529–555.

Besacier, L., Barnard, E., Karpov, A., & Schultz, T. (2014). Automatic speech recognition for under-resourced languages: A survey. Speech Communication, 56, 85–100.

Chapelle, C. A., & Chung, Y. R. (2010). The promise of NLP and speech processing technologies in language assessment. Language Testing, 27(3), 301–315.

*Chen, L., Zechner, K., Yoon, S.-Y., Evanini, K., Wang, X., Loukina, A., Tao, J., Davis, L., Lee, C. M., Ma, M., Mundkowsky, R., Lu, C., Leong, C. W., & Gyawali, B. (2018). Automated scoring of nonnative speech using the SpeechRater℠ v. 5.0 Engine. ETS Research Report Series, 2018(1), 1–31.

Chun, D. M. (1989). Teaching tone and intonation with microcomputers. CALICO Journal, 7(1), 21–46.

*Cox, T., & Davies, R. (2012). Using automated speech recognition technology with elicited oral response testing. CALICO Journal, 29(4), 601–618.

*Cucchiarini, C., Strik, H., & Boves, L. (1997). Automatic evaluation of Dutch pronunciation by using speech recognition technology. In 1997 IEEE workshop on automatic speech recognition and understanding proceedings (pp. 622–629). New York: IEEE.

*Delmonte, R. (2000). SLIM prosodic automatic tools for self-learning instruction. Speech Communication, 30(1), 145–166.

*Delmonte, R. (2002). Feedback generation and linguistic knowledge in “SLIM” automatic tutor. ReCALL, 14(2), 209–234.

Derwing, T. M., Munro, M. J., & Wiebe, G. (1998). Evidence in favor of a broad framework for pronunciation instruction. Language Learning, 48(3), 393–410.

*Ding, S., Liberatore, C., Sonsaat, S., Lučić, I., Silpachai, A., Zhao, G., Chukharev-Hudilainen, E., Levis, J., & Gutierrez-Osuna, R. (2019). Golden speaker builder—an interactive tool for pronunciation training. Speech Communication, 115, 51–66.

Dixon, D. H. (2018). Use of technology in teaching pronunciation skills. In J. I. Liontas (Ed.), The TESOL encyclopedia of English language teaching (pp. 1–7). Hoboken: Wiley.

*Evanini, K., & Wang, X. (2013). Automated speech scoring for nonnative middle school students with multiple task types. In Proceedings of Interspeech (pp. 2435–2439). 14th Annual Conference of the ISCA, Lyon.

*Fergadiotis, G., Gorman, K., & Bedrick, S. (2016). Algorithmic classification of five characteristic types of paraphasias. American Journal of Speech-Language Pathology, 25, S776–S787.

*Holland, M., Kaplan, J., & Sabol, M. (1999). Preliminary tests of language learning in a speech-interactive graphics microworld. CALICO Journal, 16(3), 339–359.

Johnson, D. O., & Kang, O. (2016). Automatic detection of Brazil’s prosodic tone unit. In Proceedings of speech prosody (pp. 287–291). Boston: ISCA.

*Johnson, W. L., & Valente, A. (2009). Tactical language and culture training systems: Using AI to teach foreign languages and cultures. AI Magazine, 30(2), 72.

*Kang, O., & Johnson, D. (2018). The roles of suprasegmental features in predicting English oral proficiency with an automated system. Language Assessment Quarterly, 15(2), 150–168.

Kang, O., Rubin, D., & Pickering, L. (2010). Suprasegmental measures of accentedness and judgments of language learner proficiency in oral English. Modern Language Journal, 94(4), 554–566.

*Komatsu, T., Utsunomiya, A., Suzuki, K., Ueda, K., Hiraki, K., & Oka, N. (2005). Experiments toward a mutual adaptive speech interface that adopts the cognitive features humans use for communication and induces and exploits users’ adaptations. International Journal of Human-Computer Interaction, 18(3), 243–268.

Lee, J., Jang, J., & Plonsky, L. (2015). The effectiveness of second language pronunciation instruction: A meta-analysis. Applied Linguistics, 36(3), 345–366.

Levis, J. (2007). Computer technology in teaching and researching pronunciation. Annual Review of Applied Linguistics, 27, 184–202.

Levis, J. (2016). Research into practice: How research appears in pronunciation teaching materials. Language Teaching, 49(3), 423–437.

*Liu, Y., Chawla, N. V., Harper, M. P., Shriberg, E., & Stolcke, A. (2006). A study in machine learning from imbalanced data for sentence boundary detection in speech. Computer Speech and Language, 20(4), 468–494.

*Mansour, S. (2014). Generation of suprasegmental information for speech using a recurrent neural network and binary gravitational search algorithm for feature selection. Applied Intelligence, 40, 772–790.

*Masmoudi, A., Bougares, F., Ellouze, M., Estève, Y., & Belguith, L. (2018). Automatic speech recognition system for Tunisian dialect. Language Resources and Evaluation, 52(1), 249–267.

McCrocklin, S. M. (2016). Pronunciation learner autonomy: The potential of automatic speech recognition. System, 57, 25–42.

*Ming, Y., Ruan, Q., & Gao, G. (2013). A Mandarin edutainment system integrated virtual learning environments. Speech Communication, 55(1), 71–83.

Mora, J., & Levkina, M. (2017). Task-based pronunciation teaching and research: Key issues and future directions. Studies in Second Language Acquisition, 39, 381–399.

Neri, A., Cucchiarini, C., Strik, H., & Boves, L. (2002). The pedagogy–technology interface in computer assisted pronunciation training. Computer Assisted Language Learning, 15(5), 441–467.

Pearson Education, Inc. (2015). Versant English test.

Pennington, M. (1999). Computer-aided pronunciation pedagogy: Promise, limitations, directions. Computer Assisted Language Learning, 12(5), 427–440.

Probst, K., Ke, Y., & Eskenazi, M. (2002). Enhancing foreign language tutors—in search of the golden speaker. Speech Communication, 37(3–4), 423–441.

Saito, K. (2012). Effects of instruction on L2 pronunciation development: A synthesis of 15 quasi-experimental intervention studies. TESOL Quarterly, 46(4), 842–854.

Saito, K., & Plonsky, L. (2019). Effects of second language pronunciation teaching revisited: A proposed measurement framework and meta-analysis. Language Learning, 69(3), 652–708.

*Scherrer, Y., Samardzic, T., & Glaser, E. (2019). Digitising Swiss German: How to process and study a polycentric spoken language. Language Resources & Evaluation, 53, 735–769.

*Setter, J., & Jenkins, J. (2005). State-of-the-art review article. Language Teaching, 38(1), 1–17.

*Shahin, I. M. A. (2012). Speaker identification investigation and analysis in unbiased and biased emotional talking environments. International Journal of Speech Technology, 15(3), 325–334.

*Shahin, I. M. A. (2013). Gender-dependent emotion recognition based on HMMs and SPHMMs. International Journal of Speech Technology, 16(2), 133–141.

*Shahin, I., & Nassif, A. B. (2018). Three-stage speaker verification architecture in emotional talking environments. International Journal of Speech Technology, 21(4), 915–930.

*Soonklang, T., Damper, R., & Marchand, Y. (2008). Multilingual pronunciation by analogy. Natural Language Engineering, 14(4), 527–546.

Surface, E., & Dierdorff, E. (2007). Special operations language training software measurement of effectiveness study: Tactical Iraqi study final report. Tampa, FL: U.S. Army Special Operations Forces Language Office.

*Tamburini, F., & Caini, C. (2005). An automatic system for detecting prosodic prominence in American English continuous speech. International Journal of Speech Technology, 8, 33–44.

Tanaka, R. (2000). Automatic speech recognition and language learning. Journal of Wayo Women’s University, 40, 53–62.

Taylor, J., & Kochem, T. (2020). Access and empowerment in digital language learning, maintenance, and revival: A critical literature review. Diaspora, Indigenous, and Minority Education, 1–12.

Thomson, R. I., & Derwing, T. M. (2015). The effectiveness of L2 pronunciation instruction: A narrative review. Applied Linguistics, 36(3), 326–344.

Van Compernolle, D. (2001). Recognizing speech of goats, wolves, sheep and ... nonnatives. Speech Communication, 35(1–2), 71–79.

*Vojtech, J. M., Noordzij, J. P., Cler, G. J., & Stepp, C. E. (2019). The effects of modulating fundamental frequency and speech rate on the intelligibility, communication efficiency, and perceived naturalness of synthetic speech. American Journal of Speech-Language Pathology, 28, 875–886.

*Walker, N., Trofimovich, P., Cedergren, H., & Gatbonton, E. (2011). Using ASR technology in language training for specific purposes: A perspective from Quebec, Canada. CALICO Journal, 28(3), 721–743.

*Wang, F., Sahli, H., Gao, J., Jiang, D., & Verhelst, W. (2015). Relevance units machine based dimensional and continuous speech emotion prediction. Multimedia Tools and Applications, 74, 9983–10000.

*Ward, M. (2015). I’m a useful NLP tool—get me out of here. In F. Helm, L. Bradley, M. Guarda, & S. Thouësny (Eds.), Critical CALL—proceedings of the 2015 EUROCALL Conference, Padova, Italy (pp. 553–557). Dublin:

*Witt, S. M., & Young, S. J. (2000). Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication, 30(2–3), 95–108.






How to Cite

Kochem, T., Beck, J., & Goodale, E. (2022). The Use of ASR-Equipped Software in the Teaching of Suprasegmental Features of Pronunciation: A Critical Review. CALICO Journal, 39(3), 306–325.