The Current State of Automatic Speech Recognition
Updated: January 24, 2019
We’ve often heard the question, “When will a completely automated captioning and transcription solution become a reality?”
With the emergence and growing popularity of speech-to-text applications like Apple Siri and Amazon Echo, the answer to this long-standing question has become more plausible. These speech recognition apps are poised to improve even further with advances in machine learning, along with the increasing availability of large speech databases. Automatic Speech Recognition (ASR) for captioning and transcription, however, is a different kind of problem.
Siri vs. Automatic Speech Recognition
Speech-to-text technology has become increasingly useful: people can send text messages, search the web, control their music players, and more using only their voice. It is important, though, to draw a careful distinction between “automated assistant” applications like Siri and Automatic Speech Recognition technology.
Why does Siri seem so advanced compared to ASR? Some of the factors that made building Siri an easier task than solving general-purpose ASR include:
- Automated assistants respond to a single speaker and adapt over time to that speaker’s voice and language idiosyncrasies.
- The tasks that automated assistants can complete are very constrained, so the possible output is limited.
- If automated assistants don’t initially understand, they can ask the user to repeat what they said.
- Automated assistants work well as long as the gist of the speaker’s intent is captured.
In contrast, captioning and transcription are much more challenging. This task is primarily characterized by long-form content where the speaker is completely unknown and where it is essential to transcribe almost every word that is spoken (some words like “um” and “ah” are often discarded).
Some state-of-the-art Automatic Speech Recognition systems can achieve very high accuracy rates, even in the 90% range, if the following conditions are true:
- There is only one speaker
- The speaker is reading from a script or is equivalently concise, with virtually no grammatical or speech errors
- The speaker is using a high-quality microphone and speaking at an appropriate distance from it
- There is little to no background noise in the audio
- All of the above conditions remain constant through the majority of the audio file
Once the above conditions begin to waver, the quality of the transcript suffers immediately. More often than not, most of these conditions are not met unless the audio was recorded in a professional studio. If even two or three of the conditions are not met, accuracy rates may drop as low as 50%, meaning that half of the transcript would be inaccurate.
ASR Capabilities for Captioning and Transcription
At 3Play Media, where we use state-of-the-art Automatic Speech Recognition technology, we’ve seen accuracy rates in the vicinity of 80%. Keep in mind that perfect audio conditions are rarely present, and imperfect conditions have an immediate and direct negative impact on accuracy, making 80% difficult for even the best ASR technology to achieve. In order to guarantee 99% accuracy, we provide a 3-step process that combines ASR with human cleanup. The key to such high accuracy rates is human involvement: without it, caption and transcript quality is very poor.
Although we expect to see continuous improvements in ASR capabilities for captioning and transcription in the future, current technology provides an 80% accuracy rate for captions and transcripts at best in normal conditions. An accuracy rate of less than 99% is detrimental for several reasons, the most prominent being that inaccurate captions and transcripts convey the wrong meaning to those who rely on them to engage with audio, such as people who are d/Deaf and hard of hearing. Inaccurate captions and transcripts can also negatively affect video SEO and be detrimental to the perceived quality of your content.
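To put these percentages in concrete terms, here is a minimal sketch of the arithmetic. It assumes accuracy is measured per word and that speech runs at roughly 150 words per minute. Both are illustrative assumptions rather than figures from 3Play Media, and the function name is hypothetical.

```python
# Illustrative arithmetic only. WORDS_PER_MINUTE is an assumed speaking
# rate, and accuracy is assumed to be measured per word.
WORDS_PER_MINUTE = 150


def expected_word_errors(minutes: float, accuracy: float) -> int:
    """Estimate how many words are wrong in a transcript of the given length."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    total_words = minutes * WORDS_PER_MINUTE
    # Round to the nearest whole word to absorb floating-point noise.
    return round(total_words * (1.0 - accuracy))


for accuracy in (0.80, 0.99):
    errors = expected_word_errors(minutes=10, accuracy=accuracy)
    print(f"{accuracy:.0%} accurate: about {errors} wrong words in 10 minutes")
```

Under these assumptions, a 10-minute video captioned at 80% accuracy contains roughly 300 wrong words, compared with roughly 15 at 99%.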
Do we still need humans?
Automatic Speech Recognition technology is prone to fail on small “function” words which are important in conveying meaning in speech. Consider the following pair of sentences:
“I didn’t want to do that exercise.” vs. “I did want to do that exercise.”
The second sentence shows a very typical ASR error, which often occurs in the presence of background noise or when the speaker deemphasizes the second syllable of the keyword “didn’t.” Although only one word differs, the meaning is completely reversed. It is very rare for a human, especially a trained editor, to make such an error, as they will use the context to “fill in” the correct meaning in spite of any noise that may have been responsible for the ASR failure.
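The sentence pair above can also be viewed numerically. The sketch below computes a word error rate (WER), a common ASR metric based on word-level edit distance. This article does not say how its accuracy figures are measured, so treating accuracy as 1 minus WER is an assumption, and the function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


spoken = "I didn't want to do that exercise"
recognized = "I did want to do that exercise"
wer = word_error_rate(spoken, recognized)
print(f"WER: {wer:.1%}, accuracy (assumed 1 - WER): {1 - wer:.1%}")
```

A single substituted word out of seven gives a WER of about 14%, yet the sentence now says the opposite of what was spoken. A word-level score alone does not capture how much an error changes the meaning.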
Will a completely automated captioning solution ever exist? The answer, at least for the foreseeable future, is no. For now, a human editor must be involved in the captioning and transcription process in order to produce 99% accurate captions.
This blog was originally published on September 9, 2016, by Roger Zimmerman and has since been updated.