Linguistics

PodZinger accepts Jesus

“PodZinger rejects Jesus” is the title of a blog entry by University of Pennsylvania Professor Mark Liberman, who is in both the Linguistics and Computer Science departments. Prof. Liberman is very knowledgeable about the techniques we use to generate our speech-to-text index of podcasts, and he wrote about the strengths and weaknesses of our state-of-the-art speech recognition technology.

We use a statistical model of word and n-gram sequences to produce the word sequence that we think most probably matches the phoneme sequence we recognized. If the type of input (entertainment vs. news, for example) is a good match for our training corpus, then our word error rates are likely to be quite low.
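For readers who like to see the idea in code, here is a minimal sketch of choosing the most probable word sequence with a bigram model. It is not our production recognizer: the probability table, the `UNSEEN_PROB` floor, and the function names are all made up for illustration, and a real system would also weigh how well each candidate matches the audio.

```python
import math

# Made-up bigram probabilities P(word | previous word); "<s>" marks sentence start.
# These numbers are illustrative assumptions, not values from any real model.
BIGRAM_PROB = {
    ("<s>", "recognize"): 0.002,
    ("recognize", "speech"): 0.05,
    ("<s>", "wreck"): 0.0005,
    ("wreck", "a"): 0.1,
    ("a", "nice"): 0.01,
    ("nice", "beach"): 0.005,
}
UNSEEN_PROB = 1e-6  # hypothetical floor for bigrams missing from the table


def sequence_log_prob(words):
    """Return the sum of log P(w_i | w_{i-1}) over the word sequence."""
    total = 0.0
    previous = "<s>"
    for word in words:
        total += math.log(BIGRAM_PROB.get((previous, word), UNSEEN_PROB))
        previous = word
    return total


def most_probable(candidates):
    """Pick the candidate word sequence with the highest bigram score."""
    return max(candidates, key=sequence_log_prob)


# Two word sequences that could plausibly match similar-sounding phonemes.
candidates = [["recognize", "speech"], ["wreck", "a", "nice", "beach"]]
best = most_probable(candidates)
print(" ".join(best))
```

Working in log probabilities keeps the scores from underflowing when sequences get long, which is why the sketch sums logs instead of multiplying raw probabilities.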

While we haven’t specifically trained on a corpus of religious texts, we have indexed a tremendous number of sermons. The largest podcast series I know of is “Sermon Audio,” of which we have indexed 3,860 episodes at this writing, many of which appear to no longer be accessible.

In total so far, Sermon Audio accounts for 18.8 million words and 2,706 hours of sermons. So in fact PodZinger has listened to more sermons than anyone I know.

The Blog is mightier than the sword

University of Pennsylvania Professor Mark Liberman has an interesting analysis of some of our PodZinger transcriptions. Prof. Liberman was searching PodZinger for any interviews of George Deutsch, a former NASA public affairs officer.

Deutsch resigned last week under intense public scrutiny after the Scientific Activist Blog reported that he had lied on his resume by listing a degree from Texas A&M that he never received. For more background, check out the Scientific Activist Blog, a New York Times account, the Bad Astronomy Blog, and Deutsch’s article claiming that the theory that a Satanic cult killed Laci Peterson is “actually quite credible.”

Prof. Liberman’s search for “NASA” turned up good, relevant podcasts, but there were some funny transcriptions. “NASA’s top uh climate scientist” came out “nasa’s top arafat climate scientists.” Our speech recognition works with a language model trained mostly on news articles. My theory is that “top Arafat aide” is disproportionately represented in our language model because of the time window of our training data and our news-weighted inputs, which makes the bigram “top Arafat” unusually likely to appear in our output.
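To illustrate the theory, here is a toy sketch of how a skewed corpus tilts bigram estimates. The four sentences are invented stand-ins for news-heavy training text, and `next_word_probs` is a hypothetical helper, not part of our system.

```python
from collections import Counter

# Invented sentences standing in for a news-heavy training corpus.
corpus = [
    "a top arafat aide said talks would resume",
    "the top arafat aide denied the report",
    "a top arafat adviser met with officials",
    "nasa's top climate scientist spoke to reporters",
]


def next_word_probs(sentences, previous):
    """Estimate P(next word | previous word) from raw bigram counts."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for first, second in zip(words, words[1:]):
            if first == previous:
                counts[second] += 1
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


probs = next_word_probs(corpus, "top")
print(probs)                 # "arafat" dominates what follows "top"
print(probs.get("uh", 0.0))  # a filler like "uh" never appears in this text
```

With counts like these, a filled pause after “top” has almost no support in the model, so a well-attested word that sounds vaguely similar can win out.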

Also, check out Prof. Liberman’s prior post titled “PodZinger rejects Jesus.” It is quite humorous and a good look at how our technology works. However, if you listen to the last podcast referenced in that post, be sure to be in a work/kid-safe place!