On Accuracy, Part I
When you spend 80+ hours a week on a startup, it's natural that social conversations frequently wander to work. And when the topic comes up, it tends to lead to quite a few questions.
Typically, it starts something like:
They: “So what do you do for work?”
Me: “I started a company while I was in grad school with a few buddies of mine.”
They: “Oh cool! What do you guys do?”
Me: “We specialize in efficient transcription and closed captioning.”
They: “Oh…. Well don’t they have speech recognizers to do that?”
Me: “Well they aren’t really accurate enough. So we’ve made a bunch of tools that work with recognizers to scrub it up really quickly and accurately.”
They: “Huh… Well I heard they’re 80%, 90% accurate these days.”
This is where it tends to get tricky, as it turns out accuracy is pretty tough to define.
For speech-to-text, accuracy can be measured by looking at the text output, listening to the audio, and circling the words that match; x-ing out the ones that don’t. By this measure, 80%, 90% accuracy is technically amazing.
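The circle-and-x measure above can be sketched as a simple word-by-word comparison. This is only a toy illustration of that idea (real evaluations use word error rate, which also accounts for inserted and dropped words via edit distance); the function name and example strings are mine, not a real tool:

```python
# Toy version of the "circle the matches, x out the misses" measure:
# line up the recognizer's output against a reference transcript
# word by word and count the matches.
def naive_word_accuracy(reference: str, hypothesis: str) -> float:
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    matches = sum(r == h for r, h in zip(ref_words, hyp_words))
    return matches / len(ref_words)

# "won't" -> "want" and "field" -> "feel" are the kind of
# similar-sounding substitutions recognizers make.
print(naive_word_accuracy(
    "we won't field the question",
    "we want feel the question"))  # 3 of 5 words match -> 0.6
```

By this kind of count, a transcript can score well even while the mistaken words change what the sentences mean.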
However, dig deeper and you see where speech recognition unravels as a practical means of transcription, and why human involvement is still necessary to complete the work in the absence of the “magic button”: a generalized, 100% accurate recognizer.
Let’s reframe “accuracy”. As a firm that specializes in software for transcription & captioning of multimedia, our goal is to create works that stand alone as accurate reflections of spoken content. Sentence-to-sentence, this means a group of words that accurately reflect the meaning of the spoken content. Word-to-word, this means spelling the words within those sentences correctly.
Spelling is easy to check, and you don’t need to be an English major to recognize that one word, mistaken for a similar-sounding one, can significantly alter the meaning of a sentence (won’t vs. want; feel vs. field; more on this in my next post).
I’ve created a chart that shows how word-level accuracy rates from speech recognizers propagate to the sentence level, across a range of accuracies, for 8- and 10-word sentences.
For example, 67% accuracy means 1 out of every 3 words is incorrect. For an 8-word sentence, the likelihood that the recognizer got all 8 words correct is 0.67^8 ≈ 4%. Similarly, for a 10-word sentence, the likelihood of the recognizer getting all 10 words in a row correct is 0.67^10 ≈ 2%.
| Word-to-word accuracy | 1 of x words incorrect | 8-word sentence accuracy | 10-word sentence accuracy |
|---|---|---|---|
| 50% | 1 of 2 | 0% | 0% |
| 67% | 1 of 3 | 4% | 2% |
| 75% | 1 of 4 | 10% | 6% |
| 85% | 1 of 7 | 27% | 20% |
| 90% | 1 of 10 | 43% | 35% |
| 95% | 1 of 20 | 66% | 60% |
| 99% | 1 of 100 | 92% | 90% |
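The whole chart falls out of one formula: if each word is independently correct with probability p, an n-word sentence is fully correct with probability p^n. A minimal sketch that reproduces the numbers:

```python
# Reproduce the chart: sentence-level accuracy is the probability
# that every word in an n-word sentence is correct, assuming
# independent per-word errors: p_sentence = p_word ** n.
word_accuracies = [0.50, 0.67, 0.75, 0.85, 0.90, 0.95, 0.99]

print(f"{'word acc':>8} {'8-word':>8} {'10-word':>8}")
for p in word_accuracies:
    print(f"{p:>8.0%} {p ** 8:>8.0%} {p ** 10:>8.0%}")
```

The independence assumption is a simplification (real recognizer errors cluster around noisy or accented stretches of audio), but it's enough to show how quickly sentence-level accuracy decays.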
This tells me that with 90% word accuracy, fewer than HALF of my 8-word sentences will come back correctly transcribed, and only 35% of my 10-word sentences. With even slight dips in word accuracy or increases in sentence length, these rates plummet, since the relationship is exponential. And that doesn’t even begin to account for other markup, such as phrasing and punctuation!
In practice, we tend to receive files with significant background noise, many speakers, accents, etc. In these cases, we might see the recognizer hit 67%. If we returned these transcripts to our customers fresh off the ASR engine, they would see only 1 out of every 25 8-word sentences correct! Something tells me we wouldn’t retain them for very long….
So it turns out that the 10–20% gap is much bigger than it appears on the surface of the standard “accuracy” metrics. To create useful stand-alone works from audio, the gap between the state of the art in speech recognition and the end goal needs to be filled accurately and cost-effectively.
That’s where we come in :)