How do 3Play’s Live Captions Compare to Zoom’s Built-in Captions?
Updated: July 9, 2021
Artificial intelligence-based automatic speech recognition (ASR) is one step in 3Play Media’s transcription process, and it also powers our live captioning solution. As a result, we’re deeply invested in following trends in the ASR industry to make sure our solutions use the technology that delivers the most accurate results possible. Each year, 3Play Media releases a report on the state of automatic speech recognition, in which we test many of the leading speech recognition engines on the market to ensure we’re powering our solutions with top-of-the-line technology.
In November 2020, Zoom announced a collaboration with Otter.ai to provide live transcription and captioning for Zoom meetings. Users with both Zoom Pro and Otter for Teams receive this feature at no additional per-minute cost.
With all the attention surrounding this announcement, we felt compelled to investigate the accuracy of Otter’s real-time ASR. How does it stack up against 3Play’s live captioning solution, and what does any difference in accuracy mean for caption quality and understandability?
In order to test both Otter and our own ASR provider, Speechmatics v2 Real-Time, we collected video content that was representative of the type of content our customers ask us to transcribe. We sourced this content from a diverse set of domains to make sure we were covering as many customer use cases as possible.
The content fell across the categories of education, health, sports, news, entertainment, and corporate video. In total, we used a little over four hours of content which contained over 30,000 spoken words.
We used the audio from these files to generate Otter-powered transcripts. Then, we used the same audio with 3Play Media’s live auto-captioning solution to generate Speechmatics-powered transcripts.
We used 3Play Media’s 99% accurate transcription process to generate “truth” transcripts and ran these through an additional step of human review to ensure extremely high quality. Then, we used these transcripts to score the accuracy of both the Otter auto-captions and the 3Play Media auto-captions.
We measure accuracy using a standard metric called word error rate (WER). Word Error Rate is a percentage generated by dividing the count of errors in the transcript by the count of words in the “truth” transcript. In other words, with a WER of 10%, you would expect to see one error for every 10 words spoken.
The types of errors encountered can be split into three categories. Substitution errors are the count of incorrectly recognized words, where the correct word was “substituted” with an incorrect one. Insertion errors are extra words recognized by speech recognition that aren’t actually present in the speech. Finally, deletion errors are words that were missed or omitted completely by the ASR.
This error count does not include errors in punctuation or formatting. An error rate that does include punctuation and formatting errors is called Formatted Error Rate, or FER.
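The WER breakdown described above can be computed with a word-level Levenshtein alignment. The sketch below is a minimal illustration of the metric, not 3Play’s actual scoring code:

```python
def wer_breakdown(truth: str, hypothesis: str):
    """Return (WER, substitutions, insertions, deletions) for a
    hypothesis transcript scored against a truth transcript."""
    ref, hyp = truth.lower().split(), hypothesis.lower().split()
    # dp[i][j] = (errors, subs, ins, dels) aligning ref[:i] with hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):          # align ref words to nothing: deletions
        e, s, n, d = dp[i - 1][0]
        dp[i][0] = (e + 1, s, n, d + 1)
    for j in range(1, len(hyp) + 1):          # align hyp words to nothing: insertions
        e, s, n, d = dp[0][j - 1]
        dp[0][j] = (e + 1, s, n + 1, d)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:      # correct word, no error
                dp[i][j] = dp[i - 1][j - 1]
                continue
            sub, ins, dele = dp[i - 1][j - 1], dp[i][j - 1], dp[i - 1][j]
            best = min(sub[0], ins[0], dele[0])
            if best == sub[0]:                # wrong word recognized
                e, s, n, d = sub
                dp[i][j] = (e + 1, s + 1, n, d)
            elif best == dele[0]:             # spoken word missed entirely
                e, s, n, d = dele
                dp[i][j] = (e + 1, s, n, d + 1)
            else:                             # extra word not actually spoken
                e, s, n, d = ins
                dp[i][j] = (e + 1, s, n + 1, d)
    errors, subs, ins, dels = dp[len(ref)][len(hyp)]
    return errors / len(ref), subs, ins, dels
```

For example, scoring “the cat sat in mat” against the truth “the cat sat on the mat” yields one substitution and one deletion, for a WER of 2/6 ≈ 33%.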
| ASR Engine | % Error | % Substitution | % Insertion | % Deletion |
|---|---|---|---|---|
| Speechmatics v2 Real-Time | 16.33 | 6.90 | 5.15 | 4.28 |
3Play’s Speechmatics-powered captioning solution outperformed Zoom’s Otter-powered solution with a 26% lower word error rate.
Function words are words that perform a grammatical function rather than introducing meaning into a sentence. Examples include words like “the”, “do”, “and”, and “can”. These words fill very important roles that can change the meaning of a sentence. They’re also often misrecognized by ASR because they are frequently shortened or reduced in speech.
For example, the words “can” and “can’t” sound very similar, but mean completely opposite things. After analyzing substitution errors from both engines, we found that Otter.ai was twice as likely as Speechmatics v2 Real-Time to mix up “can” and “can’t”.
The table below shows how many times these function words were substituted for any other incorrect word, for each vendor.
| Word | Speechmatics v2 Real-Time | Otter.ai |
|---|---|---|
| do or don’t | 3.49% | 4.20% |
| can or can’t | 2.04% | 2.72% |
Speechmatics v2 Real-Time performed better for all function words evaluated.
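A per-word substitution rate like the one in the table above can be sketched from an alignment of truth and ASR transcripts. This is a minimal illustration only; the `aligned_pairs` input, the helper name, and the word list are assumptions, not 3Play’s actual analysis code:

```python
from collections import Counter

# Hypothetical word list; the study looked at function words like these.
FUNCTION_WORDS = {"do", "don't", "can", "can't"}

def function_word_sub_rate(aligned_pairs):
    """Given (truth_word, asr_word) pairs from a word-level alignment
    (asr_word is None when the word was deleted), return the fraction
    of each function word's occurrences the ASR got wrong."""
    totals, wrong = Counter(), Counter()
    for truth_word, asr_word in aligned_pairs:
        if truth_word in FUNCTION_WORDS:
            totals[truth_word] += 1
            if asr_word != truth_word:
                wrong[truth_word] += 1
    return {word: wrong[word] / totals[word] for word in totals}
```

For instance, if “can” appeared twice and was once recognized as “can’t”, its rate would be 0.5.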
What does this mean for you?
For a 12-word long sentence, a 16% error rate will result in an average of 1.92 errors. At a 22% error rate, a 12-word sentence averages 2.64 errors.
Or, in other words, a 16% error rate means users will see an error about once every 6.25 words, while a 22% error rate means they will see an error every 4.54 words.
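The arithmetic behind these figures is straightforward, as a quick illustration:

```python
def expected_errors(wer_percent: float, sentence_length: int) -> float:
    """Average number of errors in a sentence of the given word count."""
    return wer_percent / 100 * sentence_length

def words_per_error(wer_percent: float) -> float:
    """On average, one error appears every this-many words."""
    return 100 / wer_percent

print(expected_errors(16, 12))  # 1.92 errors per 12-word sentence
print(words_per_error(16))      # one error every 6.25 words
```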
The level of disruption to understandability caused by an error can vary greatly. The examples below, drawn from both speech recognition engines, each pair the ASR output with what was actually said. They demonstrate the range of impact an error can have on captions, for those weighing the importance of accuracy when choosing a live captioning vendor.
- ASR: “… the difference between deductive and inductive influences.”
- Truth: “… the difference between deductive and inductive inferences.”
- ASR: “for honest you can be more helpful it is.”
- Truth: “The more honest you can be, the more helpful it is.”
- ASR: “… and we lose.”
- Truth: “… and then we lose.”
- ASR: “Barbie queuers Morgan’s borders…”
- Truth: “Barbequers, smorgashborders…”
- ASR: “The privilege only extends to fax.”
- Truth: “The privilege only extends to facts.”
- ASR: “… what size window guard you need.”
- Truth: “… what size window guards you need.”
One area where 3Play’s solution particularly stood out was the rate of deletion errors. Otter.ai had over twice as many deletion errors as Speechmatics v2 Real-Time, “deleting” or omitting almost one in every 10 spoken words.
This error type in particular can really impact participants who rely on captions as an accommodation. If captions are omitted, users might not only miss out on the content but also on the fact that something is being said at all. Additionally, if you use real-time transcription to generate meeting notes and transcripts for later reference, important information could be missing from the resulting transcript and be forgotten.
Achieving the highest accuracy
At 3Play Media, we believe that accuracy is crucial. Live captions can only create engagement, equal access, or improved understanding if they are sufficiently accurate. We are committed to seeking out the highest quality technology to ensure that our customers are getting the greatest benefit they can from our captioning.
No matter what method you are using to live caption your videos, following some best practices can help you optimize the resulting accuracy.
This blog post was written by Tessa Kettelberger, Research and Development Engineer at 3Play Media.