The Current State of Automatic Speech Recognition
Updated: February 2, 2021
We’ve often heard the question, “When will a completely automated captioning and transcription solution become a reality?”
With the emergence and growing popularity of speech-to-text applications like Apple Siri and Amazon Echo, the answer to this long-standing question has become more plausible. These speech recognition apps are poised to improve even further with advances in machine learning, along with the increasing availability of large speech databases. Automatic Speech Recognition (ASR) for captioning and transcription, however, is a different and harder problem.
Siri vs. Automatic Speech Recognition
Speech-to-text technology has become increasingly useful: people can send text messages, search the web, control their music players, and more using only their voice. It is important, though, to draw a careful distinction between “automated assistant” applications like Siri and Automatic Speech Recognition technology.
Why does Siri seem so advanced compared to ASR? Several things make building an assistant like Siri an easier task than building general-purpose ASR:
- Automated assistants respond to a single speaker and adapt over time to that speaker’s voice and language idiosyncrasies.
- The tasks that automated assistants can complete are very constrained, so the possible output is limited.
- If automated assistants don’t initially understand, they can ask the user to repeat what they said.
- Automated assistants work well as long as the gist of the speaker’s intent is captured.
In contrast, captioning and transcription are much more challenging. This task is primarily characterized by long-form content where the speaker is completely unknown and where it is essential to transcribe almost every word that is spoken (some words like “um” and “ah” are often discarded).
Some state-of-the-art Automatic Speech Recognition systems can achieve very high accuracy rates, even in the 90% range, if the following conditions are true:
- There is only one speaker.
- The speaker is reading from a script or is equivalently concise, with virtually no grammatical or speech errors.
- The speaker is using a high-quality microphone and speaking at an appropriate distance from it.
- There is little to no background noise in the audio.
- All of the above conditions hold through the majority of the audio file.
Once these conditions begin to waver, transcript quality suffers immediately. More often than not, most of these conditions are not met unless the audio was recorded in a professional studio. If even two or three of them are missing, accuracy rates may fall as low as 50%, meaning that half of the transcript would be inaccurate.
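To make a figure like 50% accuracy concrete, here is a minimal sketch of one common way to score a transcript: count the fewest word substitutions, insertions, and deletions needed to turn the ASR output into a trusted reference transcript, then divide by the number of reference words. The function names and the sample sentences are hypothetical, chosen for illustration. This is not a description of how 3Play Media measures accuracy.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Fewest word substitutions, insertions, and deletions between two texts."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word missing from the output
                current[j - 1] + 1,      # extra word inserted in the output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Share of reference words transcribed correctly (1 minus word error rate)."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference transcript is empty")
    return 1 - word_errors(reference, hypothesis) / ref_len


# Hypothetical example: two of the four words are wrong.
reference = "turn the volume down"
asr_output = "turn a volume town"
print(f"{word_accuracy(reference, asr_output):.0%}")  # prints 50%
```

Note that this simple measure treats every word as equally important, which is rarely how a reader experiences a transcript.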
ASR Capabilities for Captioning and Transcription
At 3Play Media, where we use state-of-the-art Automatic Speech Recognition technology, we’ve seen accuracy rates in the vicinity of 80%. Keep in mind that perfect audio conditions are rarely present, and imperfect conditions have an immediate and direct negative impact on accuracy, making 80% difficult for even the best ASR technology to achieve. In order to guarantee 99% accuracy, we provide a 3-step process that combines ASR with human cleanup. The key to such high accuracy rates is human involvement: without it, caption and transcript quality is very poor.
Do we still need humans?
Automatic Speech Recognition technology is prone to fail on small “function” words, which are important in conveying meaning in speech. Consider the following pair of sentences:
“I didn’t want to do that exercise.” vs. “I did want to do that exercise.”
The latter is a very typical ASR error, and such an error will often occur in the presence of background noise, or if the speaker deemphasizes the second syllable of the keyword “didn’t.” Only one short word has changed, yet the meaning is completely reversed. It is very rare for a human, especially a trained editor, to make such an error, as they will use the context to “fill in” the correct meaning in spite of any noise that may have been responsible for the ASR failure.
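The sentence pair above also shows why a raw accuracy number can be misleading. The short sketch below uses Python’s standard `difflib` module to list the words that differ between the two sentences. The helper name `changed_words` is hypothetical, and the code is only an illustration of a word-level comparison, not part of any captioning workflow described here.

```python
from difflib import SequenceMatcher


def changed_words(reference: str, hypothesis: str) -> list[tuple[str, str]]:
    """Return (reference words, ASR words) pairs wherever the two texts differ."""
    ref = reference.split()
    hyp = hypothesis.split()
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Join the differing word runs so each change reads as a phrase.
            changes.append((" ".join(ref[i1:i2]), " ".join(hyp[j1:j2])))
    return changes


spoken = "I didn't want to do that exercise."
asr_output = "I did want to do that exercise."
print(changed_words(spoken, asr_output))  # prints [("didn't", 'did')]
```

Only one of the seven words differs, so the ASR output is roughly 86% accurate at the word level, even though it says the opposite of what the speaker meant.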
Will a completely automated captioning solution ever exist? The answer, at least for the foreseeable future, is no. For now, a human editor must be involved in the captioning and transcription process in order to produce 99% accurate captions.
Want to get started making your video accessible? 3Play Media provides premium quality closed captioning and transcription.
This blog was originally published on September 9, 2016, by Roger Zimmerman and has since been updated.