Notes from Stanford Open Source Lab 2008 Unconference

Pronunciation Evaluation Using the SPHINX II Open Source Speech Recognition System

by James Salsman

November 14th, 2008


The use of speech recognition for tutoring language skills involving reading and pronunciation has been shown, when applied properly, to substantially reduce the time a typical student needs to reach a given reading level.[2]

Carnegie Mellon's Sphinx-II speech recognition system can be used in its alignment mode (and optionally its allphone mode) to compare the acoustic scores and durations of phonemes, words, and phrases against an expected phoneme list in a 16,000 sample/second, 16 bit/sample, monophonic PCM speech file, evaluating those scores and durations against multiple exemplary pronunciations.

A demo was shown using an insufficient number of exemplary pronunciations on a stand-alone system; it did not perform as well as could be expected with a larger number of exemplars, judged by the likelihood of agreement with expert human phonologists.

The demo was followed by an overview and code review of the Sphinx-II system, its databases, the customizations required for optimal performance, and the Perl script forming the server. Different client software architectures were discussed, including Java, Flash, Audacity, and sndrec32.exe. John Tukey's "cepstral alanysis" was introduced, along with a discussion of sound production from both the vocal cords and consonant sources. The talk concluded with remarks on recent trends in speech recognition research funding priorities.
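The evaluation step described above compares per-phoneme acoustic scores and durations from forced alignment against the same phoneme in several exemplary pronunciations. The following is a minimal sketch of that comparison only; the dictionary shapes, field names, and the z-score threshold are illustrative assumptions, not the Sphinx-II output format:

```python
from statistics import mean, stdev

def score_phoneme(observed, exemplars):
    """Compare one phoneme's acoustic score and duration against the
    same phoneme measured in several exemplary pronunciations.
    Returns z-score-like distances (lower means closer to the exemplars)."""
    scores = [e["score"] for e in exemplars]
    durations = [e["duration"] for e in exemplars]
    score_z = abs(observed["score"] - mean(scores)) / (stdev(scores) or 1.0)
    dur_z = abs(observed["duration"] - mean(durations)) / (stdev(durations) or 1.0)
    return score_z, dur_z

def evaluate(aligned, exemplar_db, threshold=2.0):
    """aligned: list of {"phone", "score", "duration"} records, as one
    might extract from forced-alignment output (format assumed here).
    exemplar_db: maps each phone to a list of exemplar measurements.
    Returns the phones whose score or duration deviates beyond threshold."""
    flagged = []
    for obs in aligned:
        s_z, d_z = score_phoneme(obs, exemplar_db[obs["phone"]])
        if s_z > threshold or d_z > threshold:
            flagged.append(obs["phone"])
    return flagged
```

With more exemplars per phoneme, the mean and spread estimates tighten, which is consistent with the observation above that a larger exemplar set should agree better with expert human phonologists.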

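The "cepstral alanysis" mentioned above (the scrambled spelling is deliberate, from the Bogert, Healy, and Tukey paper that coined "cepstrum" and "quefrency") refers to taking the inverse Fourier transform of the log-magnitude spectrum, which helps separate the vocal-cord excitation from the vocal-tract shaping. A minimal pure-Python sketch, using a naive O(N²) DFT for illustration where a real system would use an FFT:

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (illustration only)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def idft(X):
    """Naive inverse discrete Fourier transform."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N)
                for k in range(N)) / N for n in range(N)]

def real_cepstrum(x):
    """Real cepstrum: inverse DFT of the log-magnitude spectrum.
    A small epsilon guards against log(0) at empty frequency bins."""
    log_mag = [math.log(abs(X) + 1e-12) for X in dft(x)]
    return [c.real for c in idft(log_mag)]
```

For a voiced-speech-like signal (a fundamental plus harmonics), the cepstrum shows a peak at the quefrency equal to the pitch period in samples, which is why the technique suits the vocal-cord versus consonant-source discussion above.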
Updated links

James Salsman
June 17, 2012