How the Vendor’s Captioning and Transcription Process Determines Their Rates

February 20, 2019 BY SOFIA ENAMORADO


When considering captioning vendor options, it’s important to keep in mind that the cost of captioning is usually in a delicate balance with accuracy.


Choosing a captioning vendor is like choosing a car: the cheapest car may “run,” but down the line, you may have mechanical problems. On the other hand, the most expensive car isn’t always the most cost-effective option.

Captions work the same way: you want to be certain the vendor you choose is cost-effective and takes advantage of current technology. Understanding the vendor’s process is key to determining which vendor offers the most cost-effective rates for the quality of the work.

Automatic Speech Recognition (ASR)

While automatically generating captions can be efficient when used in combination with human editors, this method is insufficient when used on its own.

Studies have shown that even 95% accuracy is sometimes insufficient for accurately conveying complex material. In an average sentence of 8 words, a 95% word accuracy means there will be an error on average every 2.5 sentences. Captions generated by ASR technology fall significantly below this, with an average of 60-70% accuracy.
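The arithmetic behind that figure can be sketched in a few lines of Python. This is only a rough illustration: it assumes word errors are spread evenly across the text, and the function names and the 65% midpoint used for ASR are our own assumptions, not part of any cited study.

```python
def errors_per_sentence(word_accuracy: float, words_per_sentence: float) -> float:
    """Expected number of wrong words in one sentence."""
    return (1.0 - word_accuracy) * words_per_sentence


def sentences_per_error(word_accuracy: float, words_per_sentence: float) -> float:
    """Average number of sentences that pass between one error and the next."""
    return 1.0 / errors_per_sentence(word_accuracy, words_per_sentence)


if __name__ == "__main__":
    # 95% accuracy, 8-word sentences: one error about every 2.5 sentences.
    print(round(sentences_per_error(0.95, 8), 2))
    # 65% accuracy (assumed midpoint of the 60-70% ASR range):
    # roughly 2.8 wrong words in every sentence.
    print(round(errors_per_sentence(0.65, 8), 2))
```

Under the same assumptions, ASR-only captions go from one error every few sentences to several errors in every sentence.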

ASR technology is also prone to fail on the small “function” words that are crucial in conveying meaning in speech. Take a look at the sentences below:

“I didn’t want to do that exercise.” vs. “I did want to do that exercise.”

This example is a very common ASR error, and although a seemingly small one, the meaning of the sentence is actually completely reversed. It is very rare for a human editor to make such a mistake, as they’re able to use context to help identify and choose the correct word.

Simply put, automatically generated captions are insufficient when it comes to accessibility.

Captions generated solely by automatic speech recognition technology are the lowest-cost option, at an average rate of $0.25/minute. But, as with anything else, you get what you pay for.

If cost is your main priority, this may be the way to go. However, if you’re looking for high-quality captions that meet federal requirements for accessibility, keep in mind that cutting costs now may end up costing you a lot more down the road.

Accuracy is important for your branding, reading comprehension, representation of facts, and to protect you from lawsuits.

Mechanical Turk

Although Mechanical Turk may appear to be less expensive, at $0.75 – $1.25/minute, costs can add up quickly, especially when multiple workers are assigned to your tasks.

Amazon Mechanical Turk, a web-based crowdsourcing platform, gives businesses and developers access to an on-demand, scalable workforce. Mechanical Turk allows businesses to upload their video files and split them into Human Intelligence Tasks (HITs) which are single tasks to be completed by one freelancer. The HITs are then published to the marketplace for freelancers to claim and complete. Once the tasks are complete, they are submitted back to the business for review and approval.

Imagine hiring a group of students to paint a replica of the Mona Lisa. Every painting would turn out different in terms of quality and interpretation. Crowdsourcing works the same way, but can be even more detrimental for caption quality.

Image: A file is split into multiple documents that are sent to multiple editors and finally pieced back together.

Crowdsourcing can be risky for a number of reasons, particularly because when multiple people work on a larger file, inconsistencies are bound to arise. For instance, you might run into problems with conflicting speaker identification tags, different grammar and punctuation choices, or a change of tone from one segment to another. If the information you are transcribing is confidential, there may also be security risks, as multiple people will have their hands on your file. Another major drawback to using this model is a lack of training, which ultimately leads to transcriptionists not following legal accessibility standards and requirements.

Segmenting Files

Some captioning vendors try to minimize costs and speed up the process by breaking up single files among several transcriptionists.

As with Mechanical Turk, this can lead to inconsistencies throughout the file.

Accuracy can get worse the longer the file or the more difficult the content.


Offshore Transcription

Offshore transcripts frequently fail to meet the FCC standard of accuracy, as they contain many spelling and grammar mistakes, hurting the quality of the captions.

While offshore labor is less expensive, costing between $1.50 – $3.00/minute, it’s also a lot less accurate. Offshore transcription means a vendor outsources its media files to workers outside the country where the media’s language is natively spoken. Non-native English transcriptionists are unlikely to have the same handle on English grammar, spelling, and intent that native speakers have. They may also lack general knowledge of cultural and linguistic nuances, slang, accents, and current events that are essential to understanding the intent of the media.

Even when choosing a US-based captioning company, it’s important to find out whether it sends its editing or transcription work to US-based workers or to offshore transcriptionists. The best transcriptionist is one native to the language of the media.
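To see how the per-minute rates quoted so far compare on a real job, here is a minimal Python sketch. The rates are the averages and ranges cited in this post; the 60-minute file length, the dictionary, and the function names are illustrative assumptions, and the figures exclude any rework, review, or legal costs caused by low accuracy.

```python
# Per-minute rates (low, high) in US dollars, as quoted in this post.
RATES_PER_MINUTE = {
    "ASR only": (0.25, 0.25),
    "Mechanical Turk": (0.75, 1.25),
    "Offshore": (1.50, 3.00),
}


def cost_range(minutes: float, low_rate: float, high_rate: float) -> tuple:
    """Return the (low, high) cost for a file of the given length."""
    return (minutes * low_rate, minutes * high_rate)


def estimate_all(minutes: float) -> dict:
    """Estimate the cost range for every method in RATES_PER_MINUTE."""
    return {
        name: cost_range(minutes, low, high)
        for name, (low, high) in RATES_PER_MINUTE.items()
    }


if __name__ == "__main__":
    # Hypothetical 60-minute video.
    for name, (low, high) in estimate_all(60).items():
        print(f"{name}: ${low:.2f} to ${high:.2f}")
```

For a one-hour file, the sticker price ranges from $15 for ASR alone to as much as $180 for offshore labor, which is why the accuracy each method delivers matters as much as the rate.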

Professional US-Based Transcriptionists

The best way to produce high-quality transcripts at fair rates involves a multi-step process. Vendors who leverage technology by using speech recognition software for a first draft are more cost-effective, since transcriptionists aren’t starting on the file from scratch.

At 3Play Media, we use a 3-step process which includes a combination of ASR and human cleanup. Our advanced technology enables us to offer competitive rates (including volume-based discounts), and our multi-step quality assurance measures ensure we deliver premium quality captions, subtitles, and transcripts far more efficiently than traditional methods.

Image: The 3Play captioning process, starting with ASR and ending with two rounds of human editing.

This approach affords our transcriptionists the flexibility to spend more time on the finer details. Our process is the same for transcript and caption files, both of which are made time-synchronized with the associated audio or video file.

Our average measured accuracy is 99.6%, and we guarantee over 99%, even in cases of poor audio quality, multiple speakers, difficult content, and accents. Our staff of more than 1,000 transcriptionists gives us the flexibility to assign complex or technical content to transcriptionists with discipline-specific expertise or a familiarity with a certain accent, enabling us to process a broad range of complex content to a consistently high quality.

To be sure of an accurate transcript, it’s crucial to understand the vendor’s process. Trained, native English-speaking transcriptionists are most likely to deliver transcripts that meet legal requirements for accuracy, time synchronization, completeness, and placement.


If you’re looking for top quality and reasonably priced captions or transcripts, check out our pricing and get started with 3Play today!

This post was originally published by Elise Edelberg on March 7, 2017.
