The Current State of Automatic Speech Recognition
Updated: June 11, 2019
We’ve often heard the question, “When will a completely automated captioning and transcription solution become a reality?”
With the emergence and growing popularity of speech-to-text applications like Apple Siri and Amazon Echo, a fully automated solution has started to seem more plausible. These speech recognition apps are poised to improve even further with advances in machine learning, along with the increasing availability of large speech databases. Automatic Speech Recognition (ASR) for captioning and transcription, however, is a different kind of problem.
Siri vs. Automatic Speech Recognition
Speech-to-text technology has become increasingly useful: people can send text messages, search the web, control their music players, and more using only their voice. It is important, though, to draw a careful distinction between “automated assistant” applications like Siri and Automatic Speech Recognition technology.
Why does Siri seem so advanced compared to ASR? Several things make building an automated assistant like Siri an easier task than building ASR for captioning and transcription:
- Automated assistants respond to a single speaker and adapt over time to that speaker’s voice and language idiosyncrasies.
- The tasks that automated assistants can complete are very constrained, so the possible output is limited.
- If automated assistants don’t initially understand, they can ask the user to repeat what they said.
- Automated assistants work well as long as the gist of the speaker’s intent is captured.
In contrast, captioning and transcription are much more challenging. This task is primarily characterized by long-form content where the speaker is completely unknown and where it is essential to transcribe almost every word that is spoken (some words like “um” and “ah” are often discarded).
Some state-of-the-art Automatic Speech Recognition systems can achieve very high accuracy rates, even in the 90% range, if the following conditions are true:
- There’s only one speaker.
- The speaker is reading from a script or is equivalently concise, with virtually no grammatical or speech errors.
- The speaker is using a high-quality microphone and speaking at an appropriate distance from it.
- There is little to no background noise in the audio.
- All the above conditions remain constant through the majority of the audio file.
Once the above conditions begin to waver, the quality of the transcript suffers immediately. More often than not, the majority of these conditions are not met unless the audio was recorded in a professional studio. If even two or three of the conditions are not met, accuracy rates may fall as low as 50%, meaning that half of the transcript would be inaccurate.
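This article doesn’t spell out how an accuracy rate is calculated. One common convention, assumed here purely for illustration, is word accuracy: one minus the word error rate, where the word error rate is the number of word substitutions, insertions, and deletions needed to turn the ASR output into the correct transcript, divided by the number of words in the correct transcript. The function names and sample sentences below are hypothetical, and this is a minimal sketch rather than a description of how 3Play Media measures accuracy.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the output
                curr[j - 1] + 1,     # extra word in the output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy as 1 - WER, floored at zero (WER can exceed 1)."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


# Made-up example: one wrong word and one dropped word out of ten.
correct = "the quick brown fox jumps over the lazy dog today"
asr_output = "the quick brown box jumps over lazy dog today"
print(f"{word_accuracy(correct, asr_output):.0%}")  # prints 80%
```

Under this convention, an 80% accurate transcript has roughly one wrong, missing, or extra word for every five words spoken.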
ASR Capabilities for Captioning and Transcription
At 3Play Media, where we use state-of-the-art Automatic Speech Recognition technology, we’ve seen accuracy rates in the vicinity of 80%. Keep in mind that perfect audio conditions are rarely present, and imperfect conditions have an immediate and direct negative impact on accuracy, making 80% accuracy difficult for even the best ASR technology to achieve. In order to guarantee 99% accuracy, we provide a 3-step process that combines ASR with human cleanup. The key to such high accuracy rates is human involvement: without it, caption and transcript quality is very poor.
Although we expect to see continuous improvements in ASR capabilities for captioning and transcription in the future, current technology provides an 80% accuracy rate for captions and transcripts at best in normal conditions. An accuracy rate of less than 99% is detrimental for several reasons, the most prominent being that inaccurate captions and transcripts convey the wrong meaning to those who rely on them to engage with audio, such as people who are d/Deaf and hard of hearing. Inaccurate captions and transcripts can also negatively affect video SEO and be detrimental to the perceived quality of your content.
Do we still need humans?
Automatic Speech Recognition technology is prone to fail on small “function” words which are important in conveying meaning in speech. Consider the following pair of sentences:
“I didn’t want to do that exercise.” vs. “I did want to do that exercise.”
Transcribing the first sentence as the second is a very typical ASR error, and it will often occur in the presence of background noise, or if the speaker deemphasizes the second syllable of the keyword “didn’t.” Only one word changes, yet the meaning is completely reversed. It is very rare for a human, especially a trained editor, to make such an error, as they will use the context to “fill in” the correct meaning in spite of any noise that may have been responsible for the ASR failure.
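To see how small this kind of error looks on paper, the two sentences can be compared word by word. The sketch below uses Python’s standard difflib module; the helper name word_differences is hypothetical, and treating each space-separated token as a word is a simplifying assumption.

```python
import difflib


def word_differences(reference: str, hypothesis: str):
    """Return (operation, reference words, hypothesis words) for each mismatch."""
    ref = reference.split()
    hyp = hypothesis.split()
    matcher = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag, " ".join(ref[i1:i2]), " ".join(hyp[j1:j2])))
    return diffs


spoken = "I didn't want to do that exercise."
recognized = "I did want to do that exercise."

diffs = word_differences(spoken, recognized)
print(diffs)  # prints [('replace', "didn't", 'did')]
print(f"{len(diffs)} of {len(spoken.split())} words differ")  # prints 1 of 7 words differ
```

Six of the seven words match, so a word-level accuracy score would treat this as a mostly correct transcript even though the sentence now says the opposite of what was spoken.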
Will a completely automated captioning solution ever exist? The answer, at least for the foreseeable future, is no. For now, a human editor must be involved in the captioning and transcription process in order to produce 99% accurate captions.
Want to get started making your video accessible? 3Play Media provides premium quality closed captioning and transcription.
This blog was originally published on September 9, 2016, by Roger Zimmerman and has since been updated.