What Is ASR?

May 26, 2022 BY SAMANTHA SAULD
Updated: October 23, 2023

When you think of artificial intelligence, what do you think of? You might think of self-driving cars or the facial recognition software you use to unlock your device. No matter your familiarity level, artificial intelligence has become increasingly prevalent in our everyday lives, and speech recognition is one of its most common applications. So, what is ASR?

Artificial intelligence, commonly known as AI, is all around us. It uses machine learning to perform tasks and solve problems the way a human would. AI systems make our lives easier by making decisions faster, handling repetitive tasks, and taking calculated risks.

So, where does ASR come into play? If artificial intelligence is the tree, then ASR is one of its branches: AI is the larger field, and ASR is a subset of it.

In this post, we’ll go over what ASR is and how it’s used, particularly when it comes to captioning your content. Let’s dive in!

What is ASR? A Broad Overview:

ASR, or automatic speech recognition, is the process of a computer transcribing audio into text.

ASR software was once expensive to access, but thanks to technological advancements, it’s become more affordable and accessible than ever before. You can find ASR technology in many of the apps we use today, like Zoom and TikTok. These applications use your voice to create captions that overlay your videos.

Another example of ASR is the automated customer service over the telephone. Think of when you call your bank; you usually have to go through a series of questions with an automated rep before you speak to a human.

These different examples showcase the two main types of ASR: directed dialogue and natural language processing (NLP).

Directed dialogue is the simpler version of ASR. The system accepts only a limited number of words as responses. In the bank call example, the automated rep might give you a list of requests such as hours of operation, updating account information, or speaking to a customer service agent. As the caller, you’ll only be able to choose from the given options. The ASR isn’t advanced enough to take other, more complicated requests.
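To make the idea concrete, here is a minimal sketch in Python of how a directed dialogue system might route a caller once the speech has been transcribed. The menu options, phrases, and function names are hypothetical and chosen only to mirror the bank example; a real phone system would be considerably more involved.

```python
import re
from typing import Optional

# Hypothetical menu for the bank example: option name -> accepted phrases.
MENU = {
    "hours": {"hours", "hours of operation"},
    "update_account": {"update account", "account information"},
    "agent": {"agent", "representative", "customer service"},
}


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse extra spaces."""
    letters_only = re.sub(r"[^a-z ]", "", text.lower())
    return " ".join(letters_only.split())


def route(transcript: str) -> Optional[str]:
    """Return the menu option whose phrase matches exactly, else None."""
    spoken = normalize(transcript)
    for option, phrases in MENU.items():
        if spoken in phrases:
            return option
    # Anything outside the fixed list is not understood.
    return None
```

With this sketch, route("Hours of operation.") returns "hours", while an open-ended request such as "I lost my card and need a new one" returns None, which is exactly the limitation described above.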

NLP, on the other hand, is the more sophisticated version of ASR that allows the user to have more open-ended conversations – similar to how humans communicate.

A typical NLP ASR system has a vocabulary of 60,000 or more words. It would be extremely inefficient for a system to process every single word, so it selects specific keywords and uses them to give context to longer requests.

An example of this is Apple’s voice-controlled digital assistant, Siri. If you ask Siri, “what’s the weather today?”, it’ll likely select “weather” as the main keyword and proceed to share the day’s forecast. This allows the system to process requests more efficiently.
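The keyword idea can be illustrated with a short Python sketch. This is not how Siri works internally, which the article does not describe; the keyword table and intent names below are hypothetical and exist only to show how a single word can stand in for a whole request.

```python
import re
from typing import Optional, Tuple

# Hypothetical keyword -> intent table, for illustration only.
INTENT_KEYWORDS = {
    "weather": "get_forecast",
    "forecast": "get_forecast",
    "timer": "set_timer",
    "alarm": "set_alarm",
}


def detect_intent(transcript: str) -> Optional[Tuple[str, str]]:
    """Return (keyword, intent) for the first known keyword, else None."""
    # Split the transcript into lowercase words, ignoring punctuation.
    words = re.findall(r"[a-z']+", transcript.lower())
    for word in words:
        if word in INTENT_KEYWORDS:
            return word, INTENT_KEYWORDS[word]
    return None
```

Calling detect_intent("What's the weather today?") returns ("weather", "get_forecast"): the other words in the question are skipped, and only the keyword decides what happens next.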


Humans and Technology: The Best of Both Worlds

Some of the best ASR systems can achieve an accuracy rate of 80%. However, this is only possible if audio conditions align perfectly – which is easier said than done. As audio conditions worsen, the accuracy rate quickly diminishes.

An 80% accuracy rate might be sufficient for personal assistants, like Siri, but when it comes to professional captioning and transcription, ASR alone doesn’t measure up. Humans are still needed in the captioning process.

Relying solely on ASR for captioning just doesn’t cut it. Captioning is a complex process, and real-world audio often includes multiple speakers, accents, and non-speech elements. These elements can prevent ASR software from accurately picking up what’s being said. Perfect audio conditions would exclude all of them, which is highly unlikely in practice.

In normal circumstances, a transcript produced by ASR alone will contain a number of errors. These are the most common causes of ASR errors:

Speaker labels
Punctuation, grammar, and numbers
Non-speech elements
[INAUDIBLE] tags
Multiple speakers or overlapping speech
Background noise or poor audio quality
False starts
Acoustic errors

ASR frequently falls short on small “function” words, which are important in conveying meaning in speech. Think of the sentence “I can’t go with you” versus “I can go with you”. One small error can drastically change the meaning of a conversation. With humans, the chances of these common mistakes decrease substantially since we are able to use nuance and context clues, which the technology isn’t yet advanced enough to do.
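Accuracy rates for transcripts are commonly derived from word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the ASR output into the correct transcript, divided by the number of words in the correct transcript. This article doesn’t say how its accuracy figures were calculated, so the Python sketch below is an assumption about the general approach, not a description of any vendor’s method. The function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("I can't go with you", "I can go with you")
accuracy = 1 - wer
```

For the sentence pair above, one of five words is wrong, so the WER is 0.2 and the accuracy is 80%. The score looks respectable, yet the transcript says the opposite of what the speaker meant, which is why an accuracy percentage alone doesn’t tell the whole story.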

With ASR and humans, you get the best of both worlds. In the next section, we’ll cover 3Play Media’s approach to captioning and how leveraging humans and technology creates a recipe for greatness!


Captioning The 3Play Way

At 3Play Media, technology has played an important role in the captioning process. Our patent-pending 3-step process combines ASR technology and professional human editors to maximize and streamline how we caption your content.

ASR is the first step of 3Play’s captioning process. Once a file is uploaded into our account system, the ASR goes through the file and creates a rough draft.

3Play’s ASR engine outperforms most software on the market, including Google, IBM Watson, Rev’s Temi, and Trint. Our software has an average accuracy rate of 90.91%, while the others averaged between 80% and 89%.

In the second step, a human transcriber reviews the ASR transcript and cleans up the draft where needed.

Finally, a quality assurance (QA) manager reviews the transcript a final time to ensure the highest level of accuracy.

We guarantee at least a 99% accuracy rate on all of your files because we understand how critical accuracy is to the captioning process. Not only does it ensure that your content is accessible to d/Deaf and hard of hearing viewers, but it also ensures that your organization is in compliance with major accessibility laws.

As a company that is constantly evolving and innovating our features and services, we always want to use the best technology. Every year we publish the “State of Automatic Speech Recognition” report, which tests the most popular ASR technologies on the market and shows how our technology compares. Check out the full report below to uncover the current state of ASR in regard to captioning accuracy!

The Current State of Automatic Speech Recognition: A Report

by Elisa Lewis in Industry Trends

Read the 2022 State of Automatic Speech Recognition [Free Report] 3Play Media conducts annual research to determine the current state of automatic speech recognition technology. The study looks at the general state of speech-to-text technology and evaluates how the top speech…

April 20, 2022

Machines and Humans: Stirred, Not Shaken for The Perfect Captioning Recipe

by Josh Miller in Company + Culture

The early days of 3Play Media included deep research into the various methods of transcribing audio and video content in order to create accurate, properly timed captions. One of the main focal points was the use of automatic speech recognition (ASR). Specifically,…

Updated August 16, 2021

How do 3Play’s Live Captions Compare to Zoom’s Built-in Captions?

by Tessa Kettelberger in Video Accessibility

Artificial intelligence-based automatic speech recognition (ASR) is one step of 3Play Media’s innovative transcription process, and it’s also what powers our live captioning solution. As a result, we’re deeply invested in following trends in the ASR industry in order to make sure…

Updated July 9, 2021
