Why Captioning Non-Speech Elements Matters for Accuracy
When we talk about the quality of closed captions and podcast transcripts, we often reference the 99% industry standard for accuracy.
Captioning accuracy accounts for punctuation, spelling, and grammar, and it is measured with two complementary metrics: Word Error Rate (WER) and Formatted Error Rate (FER).
WER, which considers word substitutions, deletions, and insertions, is commonly used to judge the quality of automatic speech recognition (ASR) captions.
FER is the percentage of word errors when considering formatting elements such as punctuation, grammar, speaker identification, non-speech elements, capitalization, and other notations. WER alone is insufficient to determine accuracy, and closed captions and transcripts must meet formatting requirements to be 99% accurate.
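To make the WER definition concrete, here is a minimal sketch that computes WER as word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The whitespace tokenization and example sentences are illustrative assumptions, not part of any captioning standard; note that a 99% accuracy target corresponds to an error rate of 1% or less.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

ref = "the quick brown fox jumps over the lazy dog"
hyp = "the quick brown fox jumped over lazy dog"  # 1 substitution, 1 deletion
print(f"WER: {word_error_rate(ref, hyp):.1%}")  # prints "WER: 22.2%"
```

Because WER compares bare word sequences, a transcript could score a perfect WER while still missing speaker labels, punctuation, and sound-effect notations, which is exactly the gap FER is meant to capture.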
While meeting all formatting conditions is essential for accuracy, one of the most commonly misunderstood requirements is captioning non-speech elements.
There are many commonly asked questions: What is a non-speech element? How do non-speech elements differ in closed captions for videos and transcripts for podcasts? Which non-speech elements are necessary for accuracy and comprehension, and how should they be effectively included in a transcript?
This blog will answer these questions and more, focusing on the importance of captioning non-speech elements for both video and podcast content.
What are non-speech elements?
While the text in a caption file often contains mainly speech, captions also include non-speech elements, such as speaker identification and sound effects, that are critical to understanding the plot.
Non-speech elements can include:
- Sound effects (e.g., a bee buzzing, keys jangling, or a doorbell ringing)
- Music, either in the background or as part of a scene
- Audience reactions (e.g., laughing, groaning, or booing)
- Manner of speaking (e.g., whispering, shouting, emphasizing a word, or talking with an accent)
- Speaker identification for an off-screen narrator or speaker, or for multiple speakers
- Other sonic information that might be necessary to follow the plot or dialogue
Non-speech elements in closed captions
Think about the last film or television show you watched. The dialogue was critical, but so were the non-speech elements throughout the program. Perhaps there was lyrical, symphonic music important to the scene or sounds of a clock ticking during a tense moment. Whatever non-speech elements you heard, they were included carefully and deliberately to create a specific viewing experience.
Accurate closed captions for videos are meant to give deaf and hard-of-hearing viewers and those watching without sound an equivalent viewing experience. For captions to recreate the intended audience experience without audio, they must include non-speech elements and relevant aural cues. Otherwise, viewers may be left confused, uninterested, and frustrated.
To learn more about closed captioning requirements for film and television, we recommend reviewing the Described and Captioned Media Program’s Captioning Key.
Non-speech elements in podcast transcription
If you’re one of the approximately 162 million Americans who have ever listened to a podcast, you know that a podcast episode is more than just words and dialogue. Depending on the complexity of a particular show, a podcast can contain music, soundbites, and more.
Here are a few of the many non-speech elements a podcast might include:
- Multiple speakers, making speaker identification critical for comprehension
- Manners of speaking
- Sound bites from different recordings
- Sound design
- Natural and ambient sounds that add nuance to a scene
Additionally, speech-based content isn’t always straightforward. Storytelling is complex, and transcripts need to differentiate between narration, the main conversation, external interviews, different recordings pieced together, and more to ensure accuracy and comprehension.
Some podcasters also question which non-speech elements to include in a transcript. If you’re creating a transcript after finishing a podcast episode, listening for non-speech elements that are important for comprehension can be challenging. To address this issue, we recommend considering the careful choices made during production.
A transcript should recreate a listener’s experience. While reading a transcript will not be the exact same as listening to the audio, transcripts deserve the same careful attention to detail and nuance. If you deliberately included sound design to achieve a desired aural effect, then you should probably include a description in your transcript.
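As a sketch of how speaker identification and non-speech cues might be rendered together in a plain-text transcript, the snippet below uses bracketed, uppercase annotations for sounds. The `Segment` structure and the bracket convention are assumptions for illustration, not a formal standard.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Segment:
    text: str
    speaker: Optional[str] = None  # None for non-speech cues
    is_sound: bool = False         # True for sound effects / music cues

def render(segments: list) -> str:
    """Render transcript segments: non-speech cues in brackets,
    dialogue prefixed with a speaker label."""
    lines = []
    for seg in segments:
        if seg.is_sound:
            lines.append(f"[{seg.text.upper()}]")
        else:
            lines.append(f"{seg.speaker}: {seg.text}")
    return "\n".join(lines)

episode = [
    Segment("upbeat theme music", is_sound=True),
    Segment("Welcome back to the show.", speaker="HOST"),
    Segment("audience applause", is_sound=True),
    Segment("Thanks for having me.", speaker="GUEST"),
]
print(render(episode))
```

Keeping sounds and dialogue as distinct segment types makes it straightforward to audit whether every deliberate production choice (music, applause, ambient sound) made it into the finished transcript.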
Why ASR isn’t sufficient for accuracy and comprehension
ASR transcription alone typically won't produce captions or transcripts that meet WCAG 2.0 Level A requirements for video or audio-only content. While ASR technology can be impressively accurate on WER, FER issues are often abundant.
WCAG-compliant transcripts and captions require both speech and non-speech audio information needed to understand the content; however, ASR transcripts and captions often fail to identify key non-speech elements, such as sound effects, speaker identification, and differentiation of layered recordings.
Unless you can devote significant time to editing and adding the information necessary for comprehension, professional, human-edited transcripts and captions are the better choice for accessibility and legal compliance.