Artificial Intelligence Is Good, but Is It Good Enough for Captions?

April 24, 2019 BY ELISA LEWIS
Updated: August 28, 2019

“Artificial intelligence (AI) makes it possible for machines to learn from experience, adjust to new inputs, and perform human-like tasks.” In other words, rather than programming machines with a single rule for output, they are taught to recognize patterns and then utilize those patterns to make decisions somewhat like a human might.

AI has many applications, and as it continues to develop coincident to the amount of video that requires captioning, we are often asked: When will there be a viable, fully automated solution for closed captioning?

In this post, we will take a look at the implications of artificial intelligence and machine learning for automatic speech recognition, and how this may be used for automatic captioning.

The Current State of AI

In recent years we’ve seen the rise of digital assistants with Siri and Alexa, chatbots, personalized search results (that are sometimes so good it’s creepy!) and increased reliance on speech recognition to dictate text messages, emails, and so on. But what does the current state of AI mean for automatic speech recognition (ASR?)

Understanding AI, ML, and ASR

Before we discuss the relationship between artificial intelligence, machine learning, and automatic speech recognition, let’s take a look at what each of these terms means:

Artificial Intelligence (AI): refers to intelligence demonstrated by machines
Machine Learning (ML): allows machines to learn outputs from previous experience
Automated Speech Recognition (ASR): converts spoken words into computerized text

If we put these all together, it’s easier to understand that AI is the overarching discipline that refers to making machines “smart.” Machine Learning refers to systems that can learn by themselves from experience. ML is not the same as AI, but rather is a subset of AI. Most AI work now involves ML because intelligent behavior requires considerable knowledge, and learning is the easiest way to get that knowledge.

One area of high achievement is ASR – automatically transcribing voice recordings into the written word. In the rest of this post, we will take a look at the applications in which ASR has been used (and how well) including – of course – for captioning.

Applications of ASR

For ASR to work, the machine must be programmed to predict and deliver all possible outcomes. We have seen successes in specific applications where the input space is very narrowly constrained. For example, digital assistants like Siri and Alexa work sufficiently well, as the vocab size is task and command. In other applications, larger vocabulary sizes have posed a challenge.

ASR for Captioning

Before we talk about captioning, think for a moment about self-driving vehicles. The notable failures in automated driving have been due precisely to unexpected visual input, which is also what happened with Google’s infamous image tagging error. If you think about programming a car to predict all of the possible situations it might encounter on the road and to adjust to those situations, it feels nearly impossible. The same is true for ASR. As the vocabulary size grows, the task gets more complicated. While there have been many new improvements in ASR, the captioning task is much more complicated than the tasks where the most improvements have been made.

In 2015 the National Association of the Deaf (NAD) sued Harvard University for allegedly failing to caption public online video content and for providing inaccurate closed captions where they did exist. Specifically, the lawsuit noted that “Much of Harvard’s online content is either not captioned or is inaccurately or unintelligibly captioned.” These “inaccurate” and “unintelligible” captions were produced by ASR.

We recently saw an example from the NASA launch of why automatic captions just don’t cut it. The images below show the incorrect automatic captions, along with what was actually said. While we agree the sight is phenomenal, inaccurate captions are certainly not!

Why Are ASR Capabilities so Different for Captioning?

When relying solely on ASR technology, the accuracy rates are pretty abysmal – but why? Captioning is much more complicated than many other applications of ASR. Captioning is primarily characterized by long-form content, often the speaker is unknown, and it’s critical to transcribe every spoken word.

When it comes to captioning, some of the most common causes of ASR errors include:

Speaker labels
Punctuation, grammar, and numbers
Non-speech elements
[INAUDIBLE] tags
Multiple speakers or overlapping speech
Background noise or poor audio quality
False starts
Acoustic error

ASR technology is also prone to fail on small “function” words which are important in conveying meaning in speech. For example, take a look at the two sentences below:

“I didn’t want to do that exercise.” vs. “I did want to do that exercise.”

This example is a very typical ASR error. However, the meaning is completely reversed. It is sporadic for a human – especially a trained editor – to make such an error, as they will use the context to “fill in” the correct meaning in spite of any noise that may have been responsible for the ASR failure. Many of the above challenges are not being addressed by current technology or by current research.

In cases of perfect audio conditions, we have seen ASR technology produce around 80% accurate captions, at best. Perfect audio conditions, however, are rare.

We Still Need Humans

As the saying goes, “If you want something done right, do it yourself.” (Or at the very least – have a human do it. 🤷‍♀️) All jokes aside, when it comes to purely automated captioning solutions, there are no solutions on the horizon that will address all of the challenges posed by the captioning task. Technology just doesn’t have the same capability that humans do to understand nuances or discern unclear words from context. Therefore, we should be careful about generalizing from the success of automated approaches to any particular business problem, such as captioning.

ASR can play an important part in captioning when used in conjunction with human editors. At 3Play Media, we use a 3-step process to provide a high-quality, yet cost-effective captioning solution.

First, your video will go through ASR technology to produce a rough draft.
Next, a human editor will clean up the rough draft using our proprietary software.
Finally, a quality assurance manager will conduct a final review to ensure 99% accuracy.

At 3Play, we feel confident that human editors will remain a necessary component of producing high-quality captions for the foreseeable future.

—

Get started with the highest quality captions in the industry or learn more about 3Play Media’s captioning solution today.

3Play’s Patent Playbook: Transforming Caption Placement at Scale with Automated Closed Caption Positioning

by Jena Wallace in Video Accessibility

3Play’s Patent Playbook blog series tells the stories behind our patented technology. Learn how 3Play Media’s Research and Development (R&D) teams are spearheading innovation in accessibility tech and creating breakthroughs in the media accessibility industry at large. Captioning Best Practices for Media…

March 22, 2024

Human-in-the-Loop (HITL) Dubbing: The Key to High-Quality and Engaging Global Content

by Jena Wallace in Industry Trends

Human-In-The-Loop AI Dubbing Artificial intelligence (AI)-based dubbing solutions are emerging as a way to make videos accessible globally. But can AI deliver the quality necessary? Enter human-in-the-loop (HITL), a critical part of any successful AI dubbing workflow. In this blog, we will…

Updated April 4, 2024

New Apple Podcasts Transcripts Are Changing the Way Users Consume Podcasts

by Rebecca Klein in User Engagement

Attention, all podcast creators! Get ready for a shift in podcast accessibility and listener engagement. Apple is taking an industry-defying leap forward with transcript support in the next iOS update, set to be released in March 2024. Apple will provide automatically generated…

February 9, 2024

Subscribe to the Blog Digest

Sign up to receive our blog digest and other information on this topic. You can unsubscribe anytime.

By subscribing you agree to our privacy policy.

Product

Why 3Play?

Learn

Company

Further Reading

3Play’s Patent Playbook: Transforming Caption Placement at Scale with Automated Closed Caption Positioning

Human-in-the-Loop (HITL) Dubbing: The Key to High-Quality and Engaging Global Content

New Apple Podcasts Transcripts Are Changing the Way Users Consume Podcasts

Subscribe to the Blog Digest