Speech Recognition Gaffes

November 14, 2018 BY SOFIA ENAMORADO
Updated: February 5, 2021

“Gentlemen, to bed!”

These words are repeated as British comedians Rob Brydon and Steve Coogan ponder what it would be like to lead a medieval army into battle the next day. They play with their plot line, humorously juxtaposing it with modern sophistication. For example, what if the warriors rode at “nine thirty-ish” instead of at “daybreak”? Or perhaps our strong leader would like a morning jog and a continental breakfast before his impending battle. Watch it once and laugh. Watch it again with YouTube’s auto captions on and prepare to be quite confused.

Unfortunately, the speech recognition software goofs it up again and again. “To bed” comes out as a different phrase almost every time: “Tibet,” “two bags,” “too bad,” and even “hopeful.”

A machine’s ability to recognize words can be compromised by factors such as accents and music. The delivery in the clip above is exaggerated, and many actors’ voices and theatrical presentations are not interpreted correctly. We have become very familiar with this issue here at 3Play Media in our work with Netflix. Automatic speech recognition just wasn’t built for film settings.

Another factor to consider is speech recognition software’s ability to interpret language in unusual contexts. The software uses statistical language models to predict which words are likely to come next. Since our actors are using an antiquated way of speaking, those predictions fail.
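To make the idea of word prediction concrete, here is a minimal sketch of a toy bigram language model. It is not how YouTube's system works internally; the tiny corpus, the function names, and the candidate list are all made up for illustration. It only shows why a model trained on modern phrasing might prefer “too bad” over “to bed” when the two sound alike.

```python
from collections import Counter

# Hypothetical stand-in for modern training text (an assumption, not real data).
corpus = (
    "that is too bad . it is too bad we lost . "
    "i went to bed early . too bad about the rain . "
    "she packed two bags . that is too bad ."
).split()

# Count single words and adjacent word pairs.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigrams)


def bigram_prob(prev, word):
    """Estimate P(word | prev) with add-one smoothing."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def phrase_score(phrase):
    """Multiply the conditional probabilities of each word given the one before it."""
    words = phrase.split()
    score = 1.0
    for prev, word in zip(words, words[1:]):
        score *= bigram_prob(prev, word)
    return score


# Similar-sounding candidates for the same stretch of audio.
candidates = ["to bed", "too bad", "two bags"]
best = max(candidates, key=phrase_score)
print(best)
```

With this toy corpus the model picks “too bad,” simply because that pairing shows up most often in its training text. A real recognizer also weighs how well each candidate matches the audio, which this sketch leaves out.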

Speech Recognition & Audio Characteristics

One of my colleagues brought the following hilarious video to my attention. Two comedians, Rhett and Link, take a common approach to humor, examining how communication misfires can lead to laughs, but with a spin. They act out a script and then put it through YouTube’s automatic caption generator, which yields a new, albeit somewhat flawed, script. They then act that out verbatim. Finally, they repeat the process one more time, until the original script and the new one barely resemble each other. (See all of Rhett and Link’s Caption Fail Videos)

Many of the inaccuracies in YouTube’s auto-captions are related to audio characteristics, such as:

  • Sound Quality: As you watch this video, you’ll notice the errors start right off the bat, largely due to police sirens in the background. It’s no surprise that YouTube’s speech recognition software got this wrong, as it is sometimes hard even for humans to understand speech in the presence of loud sounds. This is one of the reasons why captions and subtitles are helpful to us all.
  • Speech Quality: Fast speech, accents, or a lack of enunciation between words often cause problems for speech recognition.
  • Complex Vocabulary: At one point, Link quickly spouts off the phrase “bullet propulsion devices.” Considering that this isn’t the common wording for the more apt “gun,” it’s no wonder the system does not recognize this jargon.

It is Google’s priority to create tools that are innately useful to people, enriching lives. They’ve made great strides in captioning, but the YouTube technology isn’t perfect yet, and that’s okay. In the staggering mission to caption the web’s video, Google and captioning advocates understand that the onus lies with content creators to upload high-quality transcripts and captions created by companies like us.

Speech Recognition & Translations

What happens when speech is translated from Spanish to English for captioning, but the automatic speech recognition software doesn’t recognize the language being spoken? See below.

In the following video, The Colbert Report shed light on this goof by ABC News, which translated Florida Senator Marco Rubio’s post-State of the Union address. Most of the lines came across as complete gibberish.

Jump to 1:45 to see.

Speech Recognition & Live Captioning

Across the pond, the BBC received flak from deaf and hard-of-hearing viewers for less-than-stellar captions and subtitles.

For example, during the Queen Mother’s funeral, a call for silence became ‘we will now have a moment’s violence’! Obviously, the potential for error rises during live captioning, despite captioning professionals working in earnest to avoid these blunders. (See the hard work that goes into ESPN’s live captioning process) The comedic video below, however, attributes these live errors to speech recognition.

Mock the Week is a comedy show that actually airs on the BBC. At least they can take a joke at their own expense, right?

Watching this video with YouTube’s auto captions adds some extra humor, as I don’t think it processes British accents so well. Hopefully, the BBC’s speech recognition software is having the same issues?

Using Speech Recognition in 3Play’s Process

When a customer uploads a video to 3Play Media, we also run it through speech recognition, but then a professional transcriptionist reviews every word. After this round of editing, a quality assurance manager conducts a secondary review, researching difficult words and checking punctuation. This is how we’re able to achieve 99%+ accuracy for transcripts and captions, despite cases of poor audio quality, multiple speakers, difficult content, and accents. We try not to pick on YouTube too much. After all, they don’t have our great team! While our advanced technology enables competitive prices, it’s our stringent, multi-step human review that delivers quality.
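This post does not say how the 99%+ figure is calculated. One common way to put a number on transcript accuracy is word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the machine's output into the reference transcript, divided by the number of words in the reference. The sketch below assumes that definition; the function name and sample phrases are hypothetical, and it is not a description of 3Play's own measurement.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Two of three words wrong, so two-thirds of the reference is in error.
wer = word_error_rate("gentlemen to bed", "gentlemen too bad")
print(f"WER: {wer:.2f}, accuracy: {1 - wer:.2f}")
```

Under this definition, 99% accuracy would mean roughly one word in error for every hundred words in the reference transcript.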

This blog post was originally published on March 22, 2013 by Shannon Murphy and has since been updated.
