The Current State of Automatic Speech Recognition
Updated: February 2, 2021
We’ve often heard the question, “When will a completely automated captioning and transcription solution become a reality?”
With the emergence and growing popularity of speech-to-text applications like Apple Siri and Amazon Echo, a fully automated solution has started to seem more plausible. These speech recognition apps are poised to improve even further with advances in machine learning, along with the increasing availability of large speech databases. Automatic Speech Recognition (ASR) for captioning and transcription, however, is a different kind of problem.
Siri vs. Automatic Speech Recognition
Speech-to-text technology has become increasingly useful: people can send text messages, search the web, control their music players, and more using only their voice. It is important, though, to draw a careful distinction between “automated assistant” applications like Siri and Automatic Speech Recognition technology.
Why does Siri seem so advanced compared to ASR? Some of the things that made Siri an easier problem to solve than ASR include:
- Automated assistants respond to a single speaker and adapt over time to that speaker’s voice and language idiosyncrasies.
- The tasks that automated assistants can complete are very constrained, so the possible output is limited.
- If automated assistants don’t initially understand, they can ask the user to repeat what they said.
- Automated assistants work well as long as the gist of the speaker’s intent is captured.
In contrast, captioning and transcription are much more challenging. This task is primarily characterized by long-form content where the speaker is completely unknown and where it is essential to transcribe almost every word that is spoken (some words like “um” and “ah” are often discarded).
Some state-of-the-art Automatic Speech Recognition systems can achieve very high accuracy rates, even in the 90-percent range, if the following conditions are true:
- There’s only one speaker
- The speaker is reading from a script or is equivalently concise, with virtually no grammatical or speech errors
- The speaker is using a high-quality microphone and speaking at an appropriate distance from it
- There is little to no background noise in the audio
- All of the above conditions remain constant through the majority of the audio file
Once the above conditions begin to waver, transcript quality suffers immediately. More often than not, the majority of these conditions are not present unless the audio was recorded in a professional studio. If even two or three of the conditions don’t hold, accuracy rates may fall as low as 50%, meaning that half of the transcript would be inaccurate.
ASR Capabilities for Captioning and Transcription
At 3Play Media, where we use state-of-the-art Automatic Speech Recognition technology, we’ve seen accuracy rates in the vicinity of 80%. Keep in mind that perfect audio conditions are rarely present, and imperfect conditions have an immediate and direct negative impact on accuracy, making 80% difficult for even the best ASR technology to achieve. In order to guarantee 99% accuracy, we provide a 3-step process that combines ASR with human cleanup. The key to such high accuracy rates is human involvement: without it, caption and transcript quality is very poor.
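To put the accuracy figures above in perspective, here is a rough sketch of how many wrong words each rate implies. It assumes accuracy is counted per word, which the article does not state explicitly, and it uses a hypothetical 1,000-word transcript; the function name is made up for illustration.

```python
def expected_word_errors(word_count: int, accuracy: float) -> int:
    """Approximate number of wrong words in a transcript.

    Assumption: accuracy is the fraction of words transcribed correctly.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round(word_count * (1.0 - accuracy))


# Hypothetical transcript length, chosen only to make the numbers easy to read.
TRANSCRIPT_WORDS = 1000

for accuracy in (0.50, 0.80, 0.99):
    wrong = expected_word_errors(TRANSCRIPT_WORDS, accuracy)
    print(f"{accuracy:.0%} accurate: about {wrong} wrong words per {TRANSCRIPT_WORDS}")
```

Under these assumptions, the gap between 80% and 99% is the difference between roughly 200 and roughly 10 wrong words in the same transcript.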
Do we still need humans?
Automatic Speech Recognition technology is prone to fail on small “function” words which are important in conveying meaning in speech. Consider the following pair of sentences:
“I didn’t want to do that exercise.” vs. “I did want to do that exercise.”
The latter is a very typical ASR error, and it often occurs when there is background noise or when the speaker deemphasizes the second syllable of the keyword “didn’t.” Only one small word differs, yet the meaning is completely reversed. It is very rare for a human, especially a trained editor, to make such an error, because a human will use the context to “fill in” the correct meaning in spite of any noise that may have been responsible for the ASR failure.
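The sentence pair above also shows why a word-level score can hide a serious error. The sketch below computes word error rate, a common way to score transcripts, as a word-level edit distance divided by the length of the reference. This is an illustration only, not a description of how 3Play Media measures accuracy, and the function names are hypothetical.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Fewest word substitutions, insertions, and deletions separating
    the hypothesis from the reference (edit distance over words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] is the distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word errors divided by the number of words in the reference."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return word_errors(reference, hypothesis) / ref_len


reference = "I didn't want to do that exercise."   # what was said
hypothesis = "I did want to do that exercise."     # typical ASR output

errors = word_errors(reference, hypothesis)
wer = word_error_rate(reference, hypothesis)
print(f"{errors} of {len(reference.split())} words wrong, "
      f"word accuracy {1 - wer:.1%}")
# prints "1 of 7 words wrong, word accuracy 85.7%"
```

One wrong word out of seven still scores above 85% on this measure, even though the sentence now says the opposite of what the speaker meant.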
Will a completely automated captioning solution ever exist? The answer, at least for the foreseeable future, is no. For now, a human editor must be involved in the captioning and transcription process in order to produce 99% accurate captions.
This blog was originally published on September 9, 2016, by Roger Zimmerman and has since been updated.