The Current State of Automatic Speech Recognition: Why We Still Need Humans for Captioning
We are often asked the question, “When will a completely automated (and therefore extremely inexpensive) transcription/captioning solution become reality?”
An affirmative answer has recently become far more plausible than it once was. In specific applications, such as Apple Siri and Amazon Echo, speech-to-text (really, speech-to-meaning, but more on that later) technology has become increasingly useful, to the point where many people are able to send text messages, search the web, control their music players, and more by voice. These speech recognition apps are poised to improve even further with advances in machine learning, along with the increasing availability of large speech databases.
Why Is Siri So Good?
It is important to draw a careful distinction between “automated assistant” applications like Siri and the captioning/transcription task. Some of the characteristics that make the speech recognition task used by Siri and similar apps easier include:
- They respond to a single speaker, usually “known” ahead of time by the application (i.e., the technology has adapted over time to the speaker’s voice and language idiosyncrasies).
- The tasks they can complete are very constrained, so the possible output is limited.
- If the app isn’t certain what the user said, it can ask for clarification. For example, if Siri doesn’t grasp the user’s command, it will respond with, “I’m not sure I understand.”
- There are very few instances of unknown, domain-specific words that are critical to completing the task.
- Perhaps most importantly, these tasks do NOT require a verbatim transcript of the speech; these applications will work as long as the gist of the speaker’s intent is captured.
Why Speech Recognition for Captioning & Transcription Is Harder
In contrast, captioning/transcription is much more challenging. This task is primarily characterized by long-form content where the speaker is completely unknown and where it is essential to transcribe each and every word that is spoken (except for disfluencies, such as “um” and “ah,” or – sometimes – verbal correction phrases such as “take that back”). Still, state-of-the-art Automatic Speech Recognition (ASR) systems can achieve very high accuracy rates – in the 90-percent range – if the following conditions are true:
- If there is only one speaker, since, in this case, on-the-fly speaker adaptation can be effective.
- If the speaker is reading from a script or speaking in similarly clean, well-organized sentences, since the “language model” for ASR is trained on clean, grammatical, organized text.
- If all of the speakers are using high quality microphones and speaking at an appropriate distance from the microphone, since frequencies “lost” from poor microphones (or from poor microphone placement) are very difficult to recover.
- If the signal-to-noise ratio of the environment is good, since speech which is “buried” in background noise is difficult to recognize.
- If all of the above conditions remain fairly constant throughout the recording, since rapid changes in any of these characteristics will challenge the most advanced adaptation algorithms.
As each of these conditions varies from favorable to unfavorable, the quality of the ASR output will decrease markedly. If two or three of these characteristics are “negative,” error rates may easily be as poor as 50%.
Current ASR Capabilities for Captioning & Transcription
At 3Play Media, we use a three-step process that combines ASR with human cleanup to guarantee over 99% accuracy. In most typical scenarios at 3Play, where we use state-of-the-art ASR technology, we see ASR accuracy rates in the vicinity of 80%. It is also worth pointing out that we can use our large database of human-corrected, near-perfect transcripts to continually improve our recognition accuracy.
To be sure, the impact of improved machine learning will certainly transfer over to the long-form transcription task. We expect to see continuous improvements in ASR capabilities for captioning and transcription for the foreseeable future.
However, with current technology, we have a starting point of (approximately) 80% accuracy, which is clearly not acceptable for captioning. Studies have shown that even 95% accuracy is sometimes insufficient for accurately conveying complex material. For a typical sentence length of 8 words, a 95% word accuracy rate means 8 × 5% = 0.4 errors per sentence – an error, on average, every 2.5 sentences.
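The arithmetic above can be sketched in a few lines of Python. This is a back-of-the-envelope model only – it assumes errors are independent and spread uniformly across words, which real ASR errors are not – but it shows how quickly errors accumulate at different accuracy levels:

```python
def sentences_per_error(word_accuracy: float, words_per_sentence: int) -> float:
    """Average number of sentences between recognition errors,
    assuming errors are independent and uniformly distributed."""
    errors_per_word = 1.0 - word_accuracy
    errors_per_sentence = errors_per_word * words_per_sentence
    return 1.0 / errors_per_sentence

# 95% word accuracy, 8-word sentences: an error every 2.5 sentences.
print(sentences_per_error(0.95, 8))  # 2.5

# 80% word accuracy (a typical raw ASR starting point):
# more than one error per sentence, on average.
print(sentences_per_error(0.80, 8))  # 0.625
```

Even at 95% accuracy, the reader hits an error every couple of sentences; at 80%, nearly every sentence is wrong somewhere.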
Moreover, ASR technology is prone to fail on the small “function” words that are so important in conveying meaning in speech. Consider the following pair of sentences:
“I didn’t want to do that exercise.” vs. “I did want to do that exercise.”
This is a very typical ASR error, involving only a single word in a seven-word sentence. Such an error will often occur in the presence of noise, or if the speaker deemphasizes the second syllable of the key word “didn’t.” However, the meaning is completely reversed. It is very rare for a human – especially a trained editor – to make such an error, as they will use the context to “fill in” the correct meaning in spite of any noise that may have been responsible for the ASR failure.
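This example also illustrates why the standard word error rate (WER) metric can understate the damage: a meaning-reversing substitution scores as just one error in seven words. A minimal WER implementation (the usual edit-distance formulation; simplified here, with no handling of punctuation or casing) makes this concrete:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

reference  = "I didn't want to do that exercise"
hypothesis = "I did want to do that exercise"
print(wer(reference, hypothesis))  # one substitution in seven words, ~0.14
```

A WER of roughly 14% sounds modest, yet the transcript now says the opposite of what the speaker meant – which is exactly why raw accuracy percentages do not tell the whole story for captioning.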
Do We Still Need Humans?
So, the not-so-simple answer to the original question – “When will a completely automated captioning solution become available?” – is: not in the near future. We are confident that 3Play’s approach – using state-of-the-art technology to facilitate efficient editing by professional transcriptionists – will continue to be the best way to ensure 99%+ accurate transcripts and captions.