Closed Captioning & Subtitling Standards in IP Video Programming

June 16, 2016 BY EMILY GRIFFIN
Updated: May 31, 2022


How do we consistently produce closed captions that are over 99% accurate?

By setting and adhering to video transcription, subtitling, and captioning standards.

3Play Media’s Operations Manager Claudia Rocha oversees a team of 1000+ transcript editors who create polished closed caption files that meet FCC standards for media broadcasting.

To ensure consistent quality control, she enforces strict closed captioning standards, which she describes in a webinar with Netflix and the Entertainment Merchants Association (EMA) on the best practices for closed captioning the digital distribution of TV and film.

Play the video below starting at the 24:00 mark to watch her presentation.

Here’s a quick recap of all the considerations the 3Play Media captioning team makes when producing FCC-compliant captions.

Media Transcription Standards

In 2014, the FCC mandated strict standards for caption quality.

While there is no universally accepted style guide, closed captions for broadcast media must meet the FCC’s guidelines for accuracy, synchronization, onscreen placement, and program completeness.

The CVAA, which applies to IP video programming that previously aired on US television with captions, requires that any online delivery of such programming follow the FCC’s caption quality requirements as well.

To create captions that meet the minimum FCC quality requirements, consider the following factors during the video transcription process.

Accurate Spelling

Closed captioning standards are preferential and can vary significantly.

Broadcasters and online video publishers often have their own standards that may depart from the general standards described here.

As a premium captioning company, our core strength is that we can easily modify our standards to accommodate your needs.

For instance, we offer custom specs for publishing to Netflix, Hulu, iTunes, and Amazon video.


The highest priority is always caption quality and accuracy. Spelling should be at least 99% accurate — including proper names.

That means that beyond your typical spellcheck, you may need to research and confirm the correct spelling of people’s names, places, etc.
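
To make the 99% target concrete, here is a minimal sketch of one way to score a caption file against a verified reference transcript. It assumes accuracy is measured per word, which is our assumption for illustration and not a formula taken from the FCC rules; the function name `word_accuracy` is hypothetical.

```python
from difflib import SequenceMatcher


def word_accuracy(reference: str, caption: str) -> float:
    """Return the share of reference words that also appear, in order, in the caption.

    Assumption: accuracy is counted per word. A misspelled proper name
    therefore counts as one wrong word.
    """
    ref_words = reference.split()
    cap_words = caption.split()
    if not ref_words:
        return 1.0
    # autojunk=False keeps common words from being ignored in long transcripts.
    matcher = SequenceMatcher(a=ref_words, b=cap_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


reference = "Claudia Rocha oversees the editing team"
caption = "Claudia Roca oversees the editing team"
print(f"{word_accuracy(reference, caption):.1%}")
```

In this short example, one misspelled surname out of six words is enough to fall well below 99%, which is why proper names deserve the extra research.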

Grammar and Punctuation

Sentence case should be used to make the captions easier to read. Compare this to all lowercase captions or ALL UPPERCASE CAPTIONS.

Punctuation is important for providing maximum clarity.

If you can, you should portray speech descriptions with punctuation.

For example, if someone is shouting, rather than transcribing, “(SHOUTING) Hi,” you can instead write “Hi!” Using an exclamation point takes up much less space in the caption frame, is quicker to read, and is much clearer for the reader.

Speaker Identification

Another best practice is to consider using speaker labels.

If there are multiple speakers present, it is helpful to identify who is speaking, especially when the video doesn’t make it clear. You may run into this when someone is speaking from offscreen or if multiple people are speaking at once.

However, make sure when using speaker ID labels that you do not reveal a plot point too early. For example, if there is a mysterious caller on the phone in a horror movie, you certainly wouldn’t want to reveal the name of the murderer before that information is revealed on screen.

The key here is to keep the plot in mind as you are transcribing the content so that the captions will be as comprehensible as possible without spoiling major plot points.

Non-Speech Sounds

It’s essential to communicate non-speech sounds in captions. For example, if there is music playing, you would need to include that. Non-speech sounds are typically denoted with [square brackets].

And again, make sure to stay true to plot development: if someone is walking along the street and they happen to have their keys jingling in their pocket, you don’t need to include it as a sound effect; however, if someone is in a room and hears off-screen keys jingling to open a door, you want to include it because it is a part of the plot.


Verbatim Transcription

For broadcast media, you should transcribe content as close to verbatim as possible.

For a scripted show, you would include every “um,” every stutter, and every stammer because they are intentionally written into the script.

There is more leeway for unscripted reality shows, documentaries, and news broadcasts, because the filler words are usually unintentional and irrelevant.

Captions that denote every “um” or stutter become very hard to digest; for unscripted content, you should get as close as possible to verbatim without making the captions difficult to read.

Similarly, if someone puts on a fake accent for a couple of lines, you want to transcribe it using proper English and denote in parentheses that they’re speaking with an accent.

Caption Frame Display Standards

Once the process of transcription is complete, there are further standards for how the caption text is displayed on screen. All of these standards are designed for optimal reader comprehension.


Font Style

It’s best to use a sans-serif font (like Helvetica Medium) for caption text. This type of font is easiest for the viewer to read.

Characters Per Line

Each caption frame should hold one to three lines of text at a time, and each line should not exceed 32 characters.
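
As a rough illustration of the line and frame limits above, the sketch below wraps a transcript passage into lines of at most 32 characters and groups them into frames of up to three lines. The function name `split_into_frames` is hypothetical, and a production tool would also weigh sentence boundaries and timing, which this sketch ignores.

```python
import textwrap

MAX_CHARS_PER_LINE = 32   # per-line limit from the guideline above
MAX_LINES_PER_FRAME = 3   # one to three lines per caption frame


def split_into_frames(text: str) -> list[list[str]]:
    """Wrap text to 32-character lines, then group the lines into frames of up to 3."""
    lines = textwrap.wrap(
        text,
        width=MAX_CHARS_PER_LINE,
        break_on_hyphens=False,  # keep hyphenated words on one line where possible
    )
    return [
        lines[i:i + MAX_LINES_PER_FRAME]
        for i in range(0, len(lines), MAX_LINES_PER_FRAME)
    ]


passage = (
    "If there are multiple speakers present, it is helpful to identify "
    "who is speaking, especially when the video does not make it clear."
)
for frame in split_into_frames(passage):
    print("\n".join(frame))
    print("---")
```

Wrapping on word boundaries keeps each line readable; a word longer than 32 characters would still be split by `textwrap`, which is rare in practice.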


Minimum Duration

The minimum duration of a caption or subtitle frame is 1 second. If there are extended sound effects, such as music playing, you should not keep the [MUSIC PLAYING] frame on screen the entire time; it should drop off after four or five seconds.

Each caption frame should be replaced by another caption frame, unless there is a long period of silence. So if someone stops speaking and 15 seconds of silence follow, you don’t want the caption frame to hang on the screen during the silence. It is unnecessary and makes it seem as though the speaker was talking longer than he or she was.

In accordance with the FCC quality standards, all caption frames should be time-synced to the audio. They should appear on screen precisely when the person speaks.
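
The duration rules above can be sketched as a small check on each frame's timing. This is only an illustration under stated assumptions: the `Frame` class and function names are hypothetical, a sound-effect frame is recognized by its square brackets, and the cap uses five seconds, the upper end of the "four or five seconds" guideline.

```python
from dataclasses import dataclass

MIN_DURATION = 1.0        # seconds; minimum time any frame stays on screen
MAX_SOUND_DURATION = 5.0  # seconds; assumed cap for frames like [MUSIC PLAYING]


@dataclass
class Frame:
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str


def is_sound_effect(frame: Frame) -> bool:
    """Treat a frame wrapped in square brackets as a non-speech sound."""
    text = frame.text.strip()
    return text.startswith("[") and text.endswith("]")


def normalize_duration(frame: Frame) -> Frame:
    """Return a copy of the frame whose duration respects the limits above."""
    duration = frame.end - frame.start
    if is_sound_effect(frame):
        duration = min(duration, MAX_SOUND_DURATION)
    duration = max(duration, MIN_DURATION)
    return Frame(frame.start, frame.start + duration, frame.text)


print(normalize_duration(Frame(10.0, 40.0, "[MUSIC PLAYING]")))
print(normalize_duration(Frame(0.0, 0.4, "Hi!")))
```

The start time is never moved, so the frame still appears when the sound or speech begins. Extending a very short frame to one second can make it overlap the next frame, so a real workflow would need to resolve that case as well.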

Caption Placement

Finally, caption frames should be repositioned if they obscure onscreen text. This is part of the FCC’s caption quality standards.

For example, if there is a person being interviewed on screen and there is a text description of the interviewee at the bottom of the screen, the captions should be repositioned so as not to obscure that text.

3Play Media offers vertical caption placement, which automatically repositions captions or subtitles if onscreen text is detected.

However, if you need to make sure that captions don’t obscure non-text elements, such as a character’s face, a logo image, etc., you can order manual caption placement services for extra quality control.


This post was originally published on May 6, 2014 by Lily Bond under the title “Transcription, Captioning, and Subtitling Standards.” It has since been updated.
