Machines and Humans: Stirred, Not Shaken for The Perfect Captioning Recipe

August 17, 2021 BY JOSH MILLER
Updated: August 16, 2021

The early days of 3Play Media included deep research into the various methods of transcribing audio and video content in order to create accurate, properly timed captions. One of the main focal points was the use of automatic speech recognition (ASR). Specifically, could ASR be used to generate quality captions? As a standalone solution: no. But the speed and scale of generating a “decent” draft was compelling.

As we thought about the problem – potentially millions of hours of content needing to be captioned – we focused on designing a process and a system that could scale, recognizing that a human was required to achieve acceptable levels of accuracy for this use case. Our ultimate conclusion is still alive and kicking today as our core process for transcribing and captioning audio and video assets: 1) generate an ASR draft, 2) have a human review and edit every second, and 3) run a human QA check.
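As a rough illustration of that three-step flow, here is a minimal Python sketch. Everything in it is hypothetical: the `Segment` type, the canned `asr_draft` output, and the `fix_names` and `polish` functions standing in for the human editor and the QA reviewer are assumptions made for illustration, not a description of 3Play Media's actual system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """One timed caption segment (hypothetical structure)."""
    start: float  # seconds
    end: float    # seconds
    text: str


# A "pass" is anything that reads a segment and returns its corrected text.
# In a real workflow steps 2 and 3 are people; here they are plain functions.
Pass = Callable[[Segment], str]


def asr_draft(audio_id: str) -> List[Segment]:
    """Step 1: stand-in for a speech recognizer; returns a canned, timed draft."""
    return [
        Segment(0.0, 2.5, "welcome to three play media"),
        Segment(2.5, 5.0, "today we talk about captions"),
    ]


def apply_pass(segments: List[Segment], review: Pass) -> List[Segment]:
    """Run one review pass over every segment, keeping the original timings."""
    return [Segment(s.start, s.end, review(s)) for s in segments]


def run_pipeline(audio_id: str, editor: Pass, qa: Pass) -> List[Segment]:
    draft = asr_draft(audio_id)          # 1) machine draft
    edited = apply_pass(draft, editor)   # 2) human edit of every segment
    return apply_pass(edited, qa)        # 3) human QA check


def fix_names(segment: Segment) -> str:
    """Hypothetical editor: corrects a name the recognizer got wrong."""
    return segment.text.replace("three play", "3Play")


def polish(segment: Segment) -> str:
    """Hypothetical QA reviewer: adds capitalization and end punctuation."""
    return segment.text[0].upper() + segment.text[1:] + "."


if __name__ == "__main__":
    for seg in run_pipeline("demo", fix_names, polish):
        print(f"{seg.start:>5.1f} - {seg.end:>5.1f}  {seg.text}")
```

The point of the sketch is the shape of the process: the machine supplies the draft and the timing, and each human pass only changes the text.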

At the time, we could not identify any other provider that was successfully combining speech recognition with human correction. Had we built a better mousetrap? Or were we just really early to the market? [Best John McEnroe voice] Did people even understand what we were talking about?!

The few existing vendors in the space at the time immediately pounced, exclaiming that using speech recognition in the process could not achieve legitimate accuracy levels. Several noted that they had tried it themselves and could confidently proclaim that a hybrid human/tech solution could not possibly speed up the process, and that anyone suggesting as much was fooling themselves. I’m not going to lie: this was a pretty amusing response. Can you imagine taxi companies going around saying that a ride-sharing app would never work while it was happening right in front of them? Oh, right…

Nonetheless, other vendors doubled down on their anti-tech messaging. Competitors’ blog posts and articles about the accuracy limits of speech recognition, arguing that “overuse” of technology equaled “bad,” were spooking some customers and prospects. Some RFPs even started to specify the need for “human” captions and state that processes using speech recognition would not be tolerated, rather than focusing on output accuracy measures. We found ourselves playing with the language to de-emphasize the automated aspects of the transcription process. Instead, we focused all automation messaging on the workflow aspects.

We learned that going into detail about how we created the captions could get extremely confusing for people new to these products and concepts – again, this was nearly ten years ago, when web video was just getting going and before digital accessibility was widely considered a must-have. Did it even matter how we transcribed the content? Well, maybe. Automation was perceived by some to be the holy grail – cheap, fast, and infinitely scalable. However, many also misunderstood the balance between the capabilities and the limitations of ASR at the time (and maybe still do today). At the same time, working with a new company in the space that had found a way to innovate was exciting for customers, and learning about the process was both interesting and part of their evaluation. We started to focus on the output more than the exact process, noting that through our use of technology, we were often able to achieve higher levels of accuracy than fully manual solutions (this is still true today). Even so, some customers talked about the captioning we delivered as being automated in nature.

YouTube launched auto captions in 2010, making it the first major tech company to apply ASR technology to the captioning use case. YouTube was already widely known, so this was big news. I immediately received phone calls from friends asking if we had just been put out of business. Our initial thinking was that YouTube’s announcement would help raise [deeply needed] awareness around the benefits of captioning videos at scale and that the automated nature would not pose a dramatic risk for serious video publishers. But it was YouTube, so we awaited the market response with bated breath to some extent. To our benefit, the response was largely what we had hoped for: ASR as a standalone solution would not suffice for captioning, and YouTube acknowledged this as well.


In the following years, other tech companies would follow YouTube by taking aim at the video market with their ASR engines: Microsoft, Amazon, and IBM Watson all started talking about their “auto-caption” solutions. IBM, in particular, started issuing marketing materials that suggested their Watson engine could completely solve captioning. This really could not have been more confusing.

What they were doing was interesting, but ultimately insufficient for the accessibility and compliance conversations we were in. In fact, accuracy had become a key differentiator for us, to the point that it was one of the first items we’d talk about in a sales conversation. All the while, we were investing in R&D efforts to improve the speech models, identify differences in file complexity, and manipulate file availability and distribution on our market – all with technology. But we were concerned that the message of automation in our captioning process might get intertwined with the “machine-only” negativity that was brewing.

The market has changed a lot in the last few years (for the better). There is more acceptance and acknowledgment of the fact that technology is needed to address the scale of video accessibility requirements. We still use humans because accuracy is critical, and we will continue to invest in our machine learning efforts because the technology is what makes our approach unique and truly differentiated. We knew that the combination of humans and machines was necessary when the market didn’t accept AI, and we continue to be confident that balancing machines with the appropriate level of human intervention is necessary even now that AI has become more desirable. Our process takes the best of both worlds to deliver something neither can produce alone.


This blog post was written by Josh Miller, Co-CEO and Co-Founder at 3Play Media.
