Why Captioning Non-Speech Elements Matters for Accuracy

April 15, 2022 by Rebecca Klein

 




When we talk about the quality of closed captions and podcast transcripts, we often reference the 99% industry standard for accuracy.

Captioning accuracy accounts for punctuation, spelling, and grammar, and is measured using two distinct metrics: Word Error Rate (WER) and Formatted Error Rate (FER).

WER, which considers word substitutions, deletions, and insertions, is commonly used to judge the quality of automatic speech recognition (ASR) captions.

FER is the percentage of word errors when considering formatting elements such as punctuation, grammar, speaker identification, non-speech elements, capitalization, and other notations. WER alone is insufficient to determine accuracy, and closed captions and transcripts must meet formatting requirements to be 99% accurate.
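To make the distinction concrete: WER is conventionally defined as the number of word substitutions, deletions, and insertions divided by the number of words in the reference transcript. Here is a minimal sketch of that calculation in Python, using a word-level edit distance (the function name and examples are our own, for illustration):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via a word-level Levenshtein (edit) distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return dp[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six: WER = 1/6, or about 16.7%
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

At the 99% standard, a WER of 0.01 or less is required. But as discussed above, a low WER by itself says nothing about the formatting errors that FER captures.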

While meeting all formatting conditions is essential for accuracy, one of the most commonly misunderstood requirements is captioning non-speech elements.

There are many commonly asked questions: What is a non-speech element? How do non-speech elements differ in closed captions for videos and transcripts for podcasts? Which non-speech elements are necessary for accuracy and comprehension, and how should they be effectively included in a transcript?

This blog will answer these questions and more, focusing on the importance of captioning non-speech elements for both video and podcast content.

What are non-speech elements?

While the text in a caption file often contains mainly speech, captions also include non-speech elements, such as speaker identification and sound effects, that are critical to understanding the plot.

Non-speech elements can include:

  • Sound effects (e.g., a bee buzzing, keys jangling, or a doorbell ringing)
  • Music, either in the background or as part of a scene
  • Audience reactions (e.g., laughing, groaning, or booing)
  • Manner of speaking (e.g., whispering, shouting, emphasizing a word, or talking with an accent)
  • Speaker identification for an off-screen narrator or speaker, or for multiple speakers
  • Other sonic information that might be necessary to follow the plot or dialogue
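In video caption files, these cues are typically written in brackets or parentheses alongside the dialogue. As a hypothetical sketch (the timestamps and dialogue are invented), a WebVTT caption file might render a few of the elements above like this:

```
WEBVTT

00:00:01.000 --> 00:00:03.500
[doorbell rings]

00:00:04.000 --> 00:00:07.000
NARRATOR: It all started on a quiet Tuesday.

00:00:07.500 --> 00:00:10.000
(whispering) Did you hear that?

00:00:10.500 --> 00:00:13.000
[♪ tense orchestral music ♪]
```

The exact conventions (brackets vs. parentheses, all-caps speaker labels, music notes) vary by style guide; the Captioning Key referenced below is one widely used set of rules.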

Non-speech elements in closed captions

Think about the last film or television show you watched. The dialogue was critical, but so were the non-speech elements throughout the program. Perhaps there was lyrical, symphonic music important to the scene or sounds of a clock ticking during a tense moment. Whatever non-speech elements you heard, they were included carefully and deliberately to create a specific viewing experience.

Accurate closed captions for videos are meant to give deaf and hard-of-hearing viewers and those watching without sound an equivalent viewing experience. For captions to recreate the intended audience experience without audio, they must include non-speech elements and relevant aural cues. Otherwise, viewers may be left confused, uninterested, and frustrated.

To learn more about closed captioning requirements for film and television, we recommend reviewing the Described and Captioned Media Program’s Captioning Key.




Non-speech elements in podcast transcription

If you’re one of the approximately 162 million Americans who have ever listened to a podcast, you know that a podcast episode is more than just words and dialogue. Depending on the complexity of a particular show, an episode can also contain music, soundbites, and more.

Here are a few of the many non-speech elements a podcast might include:

  • Multiple speakers, making speaker identification critical for comprehension
  • Music
  • Manners of speaking
  • Sound bites from different recordings
  • Sound design
  • Natural and ambient sounds that add nuance to a scene

Additionally, speech-based content isn’t always straightforward. Storytelling is complex, and transcripts need to differentiate between narration, the main conversation, external interviews, different recordings pieced together, and more to ensure accuracy and comprehension.
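As a hypothetical illustration (the names and wording are invented), a transcript excerpt that distinguishes these layers might look like:

```
[theme music fades]

HOST: Welcome back to the show. Before the break, we heard from our guest
about the night of the storm.

[archival recording]

REPORTER: ...the storm made landfall just after midnight.

[recording ends]

HOST: (laughing) I still can't believe that tape survived.
```

Here the bracketed notations mark music, the boundaries of a separate recording, and a manner of speaking, so a reader can follow the same structure a listener hears.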

Some podcasters also question which non-speech elements to include in a transcript. If you’re creating a transcript after finishing a podcast episode, listening for non-speech elements that are important for comprehension can be challenging. To address this issue, we recommend considering the careful choices made during production.

A transcript should recreate the listener’s experience. While reading a transcript is not exactly the same as listening to the audio, transcripts deserve the same careful attention to detail and nuance. If you deliberately included sound design to achieve a particular aural effect, you should probably describe it in your transcript.

Why ASR isn’t sufficient for accuracy and comprehension

While there are many benefits to accurate podcast transcription and closed captioning, a common issue we see is a reliance on ASR transcription without editing.

ASR transcription alone typically won’t produce captions or transcripts that meet WCAG 2.0 Level A for video or audio-only content. While ASR technology can achieve an impressively low WER, FER issues are often abundant.

WCAG-compliant transcripts and captions require both speech and non-speech audio information needed to understand the content; however, ASR transcripts and captions often fail to identify key non-speech elements, such as sound effects, speaker identification, and differentiation of layered recordings.

Unless you can devote significant time to making edits and including necessary information for comprehension, professional, human-edited transcriptions and captions are the way to go for accessibility and legal compliance.


