Adventures in Speech Recognition

May 22, 2009 BY CJ JOHNSON
Updated: January 4, 2018

A few weeks ago, I wrote about the traditional metrics for accuracy in transcription. Metrics aside, every now and then we run across an egregious mistake from an automatic speech recognizer (ASR) that’s pretty funny.

Check out this one from today.  In an interview with a famous biologist, what was supposed to read “RNA catalysis, a wonderful discovery” came out as “naked palaces a wonderful discovered”.

Now I won’t argue the relative merits of RNA catalysis vs. naked palaces – I’m sure they’re both great in their own unique ways – but the first one was a bit closer to what we were looking for in that interview.
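For the curious, here is a rough sketch of how a mistake like this might be scored. It assumes the "traditional metric" in question is word error rate (WER): the word-level edit distance between the reference and the ASR output, divided by the number of words in the reference. The function names and the choice to ignore case and punctuation are my own assumptions for illustration, not a description of any particular scoring tool.

```python
import re


def words(text):
    # Assumption: lowercase the text and ignore punctuation before scoring.
    return re.findall(r"[a-z']+", text.lower())


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = words(reference), words(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "RNA catalysis, a wonderful discovery"
hypothesis = "naked palaces a wonderful discovered"
print(word_error_rate(reference, hypothesis))  # 3 wrong words out of 5 -> 0.6
```

Under those assumptions, three of the five words are wrong, so the phrase scores a 60% word error rate. The number says nothing about how funny the result is.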

I have come to appreciate the strange rhymes that come out of ASR engines, having spent a number of years now sifting through their outputs. Doing so, you get a feel for how a group of sounds, isolated from the surrounding words and context, can be mistaken for any number of alternatives.

Even more so, I have come to appreciate the ability of the human mind to turn sounds into words and meanings – to resolve misspoken words, accents and stutters into the intended meaning. We decipher context every day, and not just with 50-cent words that challenge our vocabulary. People say “tuh” instead of “to”. People say “mighta” instead of “might have”. The words “marriage” and “mirage” sound remarkably similar when spoken alone, yet our experience and powers of deduction let us work out the intended meaning without any real deep thought.

The technology behind machine recognition, adaptation, and reasoning is fascinating. It is amazing how far we have come in just a few decades of research. It is just as amazing to think how far there is to go, and to dream of what innovations may close the remaining gaps on the way to true artificial intelligence.

