How Can Advancements in Speech Recognition Help to Create Better Captioning?

January 6, 2020 BY ALEX FLEMING
Updated: January 7, 2020

When it comes to regulation and compliance, the first thoughts often go to the banking and financial sectors. Over the years, legislative regulation in these sectors has become one of the most important areas of consideration and innovation. Organizations have invested huge sums to protect not only themselves but also the data and security of their customers. While we all want to ensure that our private data, and especially our banking data, is safe, other industries place equal importance on adhering to their own regulations, both to deliver better services to their end-users and to protect themselves from fines.

Closed captioning is a crucial element of video content. It not only makes content more accessible; it also enables more people to engage with and enjoy the content. Additionally, adding text to video content improves searchability, both for end-users and for content providers who want to ensure their consumers can quickly find the content they want.

Speech recognition has come a long way in recent years. In 2019, more organizations than ever started looking to this technology as a practical and scalable tool to enhance their businesses and to add value to existing business processes. Word error rates within automatically generated transcription have continued to improve, with English leading the way as one of the world’s most spoken languages. But transcription isn’t captioning!

Styling, formatting, bespoke language and terminology across industries, and the different platforms on which content is available are all considerations, each with different requirements when it comes to delivering compliant captions. Constant updates to these requirements add a further layer of complexity. Automatic speech recognition significantly reduces the human effort required to convert the bulk of speech within media content into text for use in captioning.


Recent advancements in automatic speech recognition services mean that organizations have never been so empowered to offload larger portions of raw transcription to machines. The word error rate (often the proxy for measuring accuracy) now delivered by these solutions continues to be driven down across more languages. This is a good thing as more territories are looking to leaders in accessibility like North America and applying similar captioning legislation in their regions.
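For readers who want to see how word error rate is typically calculated, here is a minimal Python sketch. It uses the common definition of WER as the number of word substitutions, deletions and insertions needed to turn the machine transcript into the reference transcript, divided by the number of words in the reference. The function name and the simple normalization (lowercasing and splitting on whitespace) are assumptions made for illustration; real evaluations usually apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumption: lowercase and split on whitespace; real scoring tools normalize more.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing takes i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One word ("the") is missing from a six-word reference.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 1.0 when the machine transcript contains many inserted words, which is one reason it is a proxy for accuracy rather than a direct measure of it.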

Even with these continued advancements in the reduction of errors within transcriptions, the accuracy required by the Federal Communications Commission (FCC) in North America and by other regulatory bodies is a challenge to meet without the addition of human editors. Automatic transcription means that human editors can focus on the complexities of delivering perfect captions that automation cannot yet handle. This includes elements that are part of the legislation but not directly related to ASR, like non-speech elements and the format of captions post-transcription.


The words are a vital part of the transcript output; however, there are other elements that can significantly reduce the effort of transforming transcripts into captions. The inclusion of advanced punctuation, with a rich set of punctuation characters, and better capitalization means that transcripts are one step closer to a text-based representation of natural language, and to meeting North American regulation, even in their raw form. The regulation states that ‘punctuation should be used for maximum clarity in the text’. Better punctuation also makes machine transcripts easier for human editors to read, so they can make the changes needed to create captions more quickly while preserving the best possible accuracy.
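To illustrate why punctuation in the raw transcript helps with caption formatting, here is a rough Python sketch that groups timed words into caption cues and writes them in the widely used SubRip (SRT) layout. The input shape (a list of word, start time, end time tuples), the function names and the 32-character limit are all assumptions chosen for this example; they are not taken from any particular ASR product or regulation.

```python
from typing import List, Tuple

# Hypothetical input shape: (text, start_seconds, end_seconds) for each word.
Word = Tuple[str, float, float]


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_chars: int = 32) -> str:
    """Group timed words into cues, breaking at sentence-ending punctuation
    or when a cue would exceed max_chars (an illustrative limit)."""
    cues: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        candidate = " ".join(w[0] for w in current + [word])
        if current and len(candidate) > max_chars:
            cues.append(current)  # cue is full, start a new one
            current = []
        current.append(word)
        if word[0].endswith((".", "?", "!")):
            cues.append(current)  # sentence ended, close the cue
            current = []
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        text = " ".join(w[0] for w in cue)
        start = format_timestamp(cue[0][1])
        end = format_timestamp(cue[-1][2])
        blocks.append(f"{index}\n{start} --> {end}\n{text}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    sample = [
        ("Hello", 0.0, 0.4), ("world.", 0.5, 1.0),
        ("How", 1.2, 1.4), ("are", 1.4, 1.6), ("you?", 1.6, 2.0),
    ]
    print(words_to_srt(sample))
```

Without punctuation in the transcript, the only available break point in this sketch would be the character limit, which tends to split cues mid-sentence and leaves more work for the human editor.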



The Americans with Disabilities Act (ADA), tasked with equal opportunity for persons with disabilities, also sets out that captions should preserve and identify slang or accents. While this can be achieved through human editors, customization within the ASR means that solutions can be tailored to deliver the best possible accuracy with limited editing.

Enabling the fast and effortless inclusion of difficult terms like names, accents, abbreviations, acronyms, and other specialist, industry-specific or content-specific language into the recognition model delivers the tools to take control of ASR capabilities and adapt to a diverse range of content. From documentaries and feature films to online videos and music videos, customization helps ensure that any media type can be transcribed with far fewer errors, further reducing the time and heavy lifting required of humans in the captioning process.

Captioning is a complex and heavily regulated area of the media industry. It might be expected to be relatively straightforward; however, there are rules that need to be followed to ensure regulations are met and that content is properly accessible. Automatic speech recognition technology has been shown to add value in the captioning process, and continued advancements in machine learning will further increase the value it can offer across more languages and more challenging audio. It is for this reason that an ASR solution requires not only best-in-class accuracy but also a wide breadth of languages, customization and the ability to deliver continued improvements to help deliver better captions.


Read the full 2019 State of ASR Report to learn more!


This is a guest blog written by Alex Fleming, Product Marketing Manager at Speechmatics.
