
Allied Podcast: The Evolution of Live Captioning with Jill Brooks

January 21, 2022


Welcome to 3Play Media’s Allied Podcast, a show on all things accessibility. This month’s episode features Jill Brooks of 3Play Media and is about the evolution of live captioning.

Jill Brooks is a captioning expert and all-around accessibility trailblazer. Jill worked at the National Captioning Institute for more than two decades, beginning her career in various production roles and working her way up to serve as the President & Chief Operating Officer. The arc of her career cemented her expertise in steno captioning, voice writing, and live caption creation.

Jill now applies her knowledge and experience to her current role as 3Play Media’s Senior Director of Live Operations.

Connect with Jill on LinkedIn

Learn more about 3Play Media’s Live Captioning service


Want to get in touch? Email us at Allied@3playmedia.com. We’d love to hear from you.

Episode transcript

ELISA LEWIS: Welcome to Allied, the podcast for everything you need to know about web and video accessibility. I’m your host, Elisa Lewis, and I sit down with an accessibility expert each month to learn about their work. If you like what you hear on Allied, please subscribe or consider leaving us a review. Allied is brought to you by 3Play Media, your video accessibility partner. Visit us at www.3playmedia.com to learn why thousands of customers trust us to make their video and media accessible for all.

[MUSIC PLAYING]

Today we’re joined by a captioning expert and all-around accessibility trailblazer, Jill Brooks. Jill worked at the National Captioning Institute for more than two decades, beginning her career in various production roles and working her way up to serve as the president and chief operating officer. The arc of her career cemented her expertise in stenocaptioning, voice writing, and live caption creation.

While the National Captioning Institute continues to fund and develop media access services for those who need it, Jill now applies her knowledge and experience to her current role as 3Play Media’s senior director of live operations. Jill has played an instrumental role in guiding the development of 3Play’s own live captioning efforts and will share some of her insights with us today. Be sure to stick around until the end for a special promo on 3Play’s live captioning. And if you’re interested in getting started, head on over to www.3playmedia.com/services/livecaptioning.

Thank you, Jill. We’re so glad to have you join us on Allied today for a discussion on the evolution of live captioning.

JILL BROOKS: Thank you. I’m happy to be here.

ELISA LEWIS: So before we dive in to talk about our topic today, which is, of course, like I said, the evolution of live captioning, I want to make sure that all of our audience, all of our listeners are kind of aligned and on the same page as to what live captioning is. So could you share with us a little bit of broad background knowledge on what live captioning is and how that may differ from traditional closed captioning?

JILL BROOKS: Sure. Live closed captioning is captioning that is done in real time. So that means that the captioner is getting access to the media at the same time as the viewer or the audience is receiving it. They’re not getting anything in advance. So if you’re thinking about, let’s say, a news broadcast, they are literally sitting there listening to the exact same broadcast as the rest of the audience and then providing the closed captioning in real time.

So it’s different from a recorded situation, where a captioner has the luxury of– if they didn’t understand something that was spoken, they can rewind a little bit and listen to it again or even ask somebody else to give it a listen if they’re still not sure what was said. In a real-time situation, the captioner has to make those judgment calls in a split second. So it’s a very different situation and can be challenging for many reasons.

ELISA LEWIS: And what are some of the most common use cases? I know you mentioned a few. But what are some of the other common use cases where we might see live captioning?

JILL BROOKS: Well, now we see live captioning in many, many situations. So I think that the first thing that would come to mind, of course, is usually broadcast television, where you could think of a live sporting event, or the news, or something that is different every time that there is a broadcast of it, as opposed to a prerecorded program.

But we’re also seeing live closed captioning used quite a bit to provide accessibility in corporate settings, in any type of work settings that need to have that accessibility provided as an event is happening. And of course, we’re seeing it on a lot of the streaming platforms and other social media platforms that are having a lot of live video content out there for consumers. And providing live captioning gives that access to a bigger audience.

ELISA LEWIS: Thank you. The live captioning industry has seen a massive evolution since its inception. Can you give some background on the beginning of live captioning technology and methodology?

JILL BROOKS: Sure. Yeah, it has changed quite a bit over the time that I have been involved in the industry, which has been about 25 years. So back in the early days of real-time captioning, it was really a team effort to caption even a single broadcast. Real-time captioning then could only be done by specially trained stenographers. Those are experienced court reporters who had undergone additional months and, in some cases, additional years of training in order to be able to do live captioning.

So for those of you who are not familiar with stenography, it’s a method of shorthand writing that is based on phonetics. And it is performed on a special machine with a keyboard that has 22 keys. And they’re stroked in combinations that represent different phonetic sounds. So you can think of that as being more comparable to playing a piano than to typing on a keyboard. But the point being that it’s a very highly specialized skill that takes years to learn.

And so, as I was saying before, stenocaptioners would listen to the audio of the live broadcast. They would write the captions with the punctuation on the steno machine. And that would go through the software that converted the steno to English and then was sent as caption data.

Those captioners were supported by a large broadcast engineering team and infrastructure. Nowadays, that has largely been eliminated, as technology has evolved such that the captioner can connect directly to a customer. But back then, there was a complex technical center that would pull down a satellite feed for broadcasts and make it available to the captioner in their studio.

The engineers would then establish and monitor connections through which the caption data was sent to the customer’s encoder, which was usually over a POTS line, which just means a plain old telephone service line. And then the customer would transmit that caption data as part of the broadcast on line 21 of the television signal. And remember, back then, it was only being used for broadcast television, none of these other use cases that we just discussed. So that was kind of the first step.

In order to get to see the captions, a viewer needed a caption decoder box, which is a contraption that looks sort of like a set-top cable box. And actually, the National Captioning Institute created these decoder boxes and made them available to consumers. But you could also go down to the local Sears store and buy one if you wanted to.

Then you’d bring it home and hook it up to your TV, which was really not all that simple if you already had a cable box, or a VCR, or something else already hooked up to your TV. And I actually used to have one of these decoder boxes. And I have to admit, I never actually got it to work properly. So they really were not that user-friendly.

But let’s say you were able to get yours hooked up and working properly. It was still not a case of, there you go, you have captions on everything. When real-time captioning first started, only a handful of programs were actually captioned. So you would have to look through the TV guide or other information to actually find the programs that had captioning attached to them.

So then that decoder box was a little bit of a problem. And NCI commissioned ITT Semiconductors to try to shrink that decoder box down into a chip, into a decoder chip. And I don’t mean to oversimplify this because it really was a monumental achievement and contributed immeasurably to the advancement of accessibility.

But then, in, I believe, 1993, it was mandated that all new television sets needed to be equipped with that chip. And then that made it possible to enable the closed captioning function directly from your television or from your remote control. So that really was a good advancement.

And following shortly after that, we had the Telecommunications Act of 1996, which gave the FCC the authority to mandate closed captioning for broadcast, which they did, requiring nearly all television programming, including live broadcasts, to be closed captioned, with some very few exceptions that still exist today. So I know your question was about technology and methodology, and here I am talking about regulations. But there’s a reason.

The mandate drastically changed the supply and demand. So prior to these mandates, closed captioning was optional. And particularly for live events, it was usually used only for very high-visibility programming. But once the mandates came along, suddenly there was a lot more demand for captioning.

And some customers that were not providing it before essentially felt like they were being forced to provide the service. And so they were not, perhaps, as invested in using it for true accessibility. And they also were not willing to pay a lot for it or, in many cases, maybe even couldn’t pay a lot for it.

So there was a flood of work and dramatically falling rates for the work that was being performed. And meanwhile, the enrollment and graduation rates from steno court reporting schools were falling, decreasing the number of available trained or trainable stenocaptioners. So how do you solve this problem of a decreasing labor pool and increasing demand when it’s really not just a matter of offering higher salaries to obtain captioners because there simply were not enough captioners out there to meet the demand in the industry?

So at that same time, there was a brand new technology emerging, which was voice recognition technology. And it really was in its infancy. It was really almost comical in its performance. And of course, now we all have funny stories of the mistakes that our Alexa or Siri makes. But maybe even you have a story that happened today. So you can imagine the state of voice recognition technology 10 years, 12 years before Siri and Alexa even existed. So this really was an emerging technology.

So while I was at NCI, I investigated the possibility of using voice recognition technology for real-time captioning. So to again make a very long story short, what I found was, yes, we could use voice recognition technology to create real-time closed captioning of a similar quality to steno. And basically, how we did it was using speaker-dependent speech recognition software.

So speaker-dependent means that the software itself was trained to an individual voice and manner of speaking. And so that user, through repetition and reinforcement and correction of the output and the chosen words, would, over time, help to improve the recognition accuracy of the system. And that refinement of the voice profile in the software was kind of an ongoing and never-ending exercise.

But the software was only one component. The person doing the captioning, the voice writer, also had to be trained in how to do the voice writing methodology that I developed to create the real-time closed captioning using their voice.

So live voice writing is similar to stenocaptioning in that the captioner listens to the audio of the broadcast or the event at the exact same time that the audience hears it. But instead of keying in what they hear on a steno machine, the voice writer repeats every word and all of the punctuation into a microphone. It goes through the voice recognition software and, from there, is generated into caption data and goes on its journey to be rendered as text on your screen.

So the voice writer is listening and speaking, editing, punctuating, reading, making corrections, all simultaneously on this continuous loop. And all of this is happening from the time that the word is spoken to the time that you see the word appear on the screen as a caption a couple of seconds later.

So the introduction of voice writing was really a paradigm shift in the industry. And most providers now are using this methodology. But then, at that time, it had only been performed by stenocaptioners for 20 years or so. So using voice writing helped solve that supply and demand problem by creating a much larger potential labor pool than just the trained stenographers. People with various backgrounds could be trained in the discipline of voice writing.

But the speaker-dependent voice writing also had its problems. As I mentioned, training the software to an individual person’s voice was time-consuming. And even then, the word recognition accuracy was not 100%. So voice writers were also trained on various special techniques and methods to maximize the accuracy within the software system itself.

So the result of that was that while we were able to bring many, many new live captioners into the market, allowing supply to increase along with the demand, over time, voice writing was being performed by a group of people who were highly trained in a unique skill and using very specialized software, which was really the problem that we were trying to address when it was only stenocaptioning. So the solution of voice writing was just not scalable enough, and particularly for the environment that we’re in now, with the countless hours of live media available in so many ways and on so many platforms.

So really, that’s what brought me to 3Play, where we are trying to solve this problem yet again by simplifying the live captioning process and methodology to achieve even greater levels of scalability.

ELISA LEWIS: Your explanation is really helpful. And you really captured just how far live captioning has come. I’m thinking in particular how we now have platforms that even provide live automatic captions. And at least from the user end, it’s really available in just a few clicks of a button.

Automatically generated captions are notorious for inaccuracy, whereas human-generated captions are consistently and substantially more accurate. Can you share how the accuracy of live captioning is measured and what factors typically have an impact on the accuracy measurements?

JILL BROOKS: Yes. I think it’s very appropriate to bring that up. I’m sure many people were thinking as I’ve been talking about voice writing or live captioning in general about the automatically generated captions. So first, let me just explain the difference between auto-generated and human-generated captions.

Automatic Speech Recognition, or ASR, certainly has a role to play in captioning. But it is not the solution. It is not the black box that everybody has been searching for, the holy grail of live captioning. Maybe some people have seen ASR used for live captioning, perhaps like on YouTube.

And if you’ve ever seen those, you can see they’re sometimes inaccurate to the point of being unusable. But I feel, personally, that some captions are always better than no captions. So ASR is a good solution if and when human-produced captions are not available. And even in some other situations, they may be adequate as well.

But in terms of providing accessibility, of providing a truly equal experience, it is really not robust enough at this point to achieve that. And some notable differences between ASR and human-generated captioning are, on a positive note, ASR will caption a word for every utterance, which human captioners cannot do, particularly with fast-paced dialogue. Instead, the human captioners will omit certain unimportant words in order to present the most comprehensible captions possible.

On the flip side of that, ASR will caption a word for every noise, whether it was a spoken word or not. So that can often introduce just very random and confusing words into the captions. And a human captioner will convey important nonspeech oral information, such as laughter, applause, whistle, buzzer, those types of noises, appropriately to increase the comprehension of what the viewer is seeing.

Some ASR systems will insert random punctuation. Others provide no punctuation at all, leading to one never-ending run-on sentence. Human captioners will insert all the proper punctuation, which, again, will enhance the user’s understanding of what is being said.

Similarly, with the exception of some meeting platforms, ASR cannot detect who is speaking. And human captioners will indicate every time there is a change of speaker and also try to identify that speaker by name, if possible.

But the most noticeable difference with ASR is that the recognition accuracy plummets in situations where there are multiple speakers talking over one another, when there is a lot of background noise, when there’s a speaker with a heavy accent. Any of these types of situations really cause ASR to struggle, whereas the human captioner can use context to discern what’s being said when there’s a speaker with a thick accent, for example. Or they can focus on the dominant speaker when many people are talking at once or when there’s a lot of background noise.

So these situations are also challenging to the human captioner. But they are able to apply their critical thinking to focus on the important communication and making sure that that is presented in a comprehensible way in the captions.

And then maybe the last thing is that the latency, or the time between when the word is spoken and when the word appears on the screen, is generally lower for ASR, which is a good thing. But now that I’ve given some explanation of everything that is going on with the human captioner and everything that happens before those words make it to the screen, I think maybe that latency makes a little bit more sense.

But in a lot of ways, in terms of accuracy, comparing ASR captions to human captions is really like comparing apples to oranges. The accuracy calculations that are generally applied to these two different forms of captioning are really different. For ASR, the accuracy is strictly a word error rate, which really is, did it write the word that was spoken or did it not write the word that was spoken? So I really don’t need to go into that in depth. But that is really the only measurement that is being considered when you hear about what an accuracy rate is for an ASR.

Human-generated live captioning accuracy is a little different. I think about accuracy in three different ways– the correct versus incorrect words, like is applied with ASR, but also the comprehensibility and the completeness. So the correct versus incorrect is a little different from just counting a word error rate. But of the words captioned, how many are correct? And how many are incorrect, being maybe the wrong word or the wrong spelling or something like that?

From there, it’s a simple equation to figure out what that percentage is. And that is really what people are talking about when you hear the 98%, 99% accuracy. And that really is the generally accepted industry standard. And it’s a useful metric, but it doesn’t really tell the whole story about the quality of the captions.

Sometimes what is not captioned is also an error. So remember, I was saying that since human captioners really cannot capture every single word that is spoken, it’s incumbent upon them to ensure that they do not omit any important words or important information. For example, if a caption reads, the flash flood warning has been lifted, but the words that were spoken were actually that the flash flood warning has not been lifted, the omission of that one small word is really a significant error.

So omissions need to be considered in the overall quality and accuracy assessment. But this can really pose a problem in evaluating live captions because, by their nature, live events are live. They’re not always preserved or recorded for reference to go back and evaluate the captioning. And even when it is or that information is available, it’s really not cost effective or good use of time, necessarily, to review every live caption file in that way. So that type of situation is really not accounted for when simply looking at the correct versus incorrect words to calculate the accuracy.

And the punctuation also should be considered in the accuracy measurements. Because just like with the omitted words, some punctuation can be omitted without consequence. But other times, it really needs to be there to convey what is really being said.

I think maybe a lot of people have heard the example of “here, comma, I am God” as compared to “here I am, comma, God.” So it’s the same words but two completely different meanings just depending on the placement of that little punctuation mark. Or if a speaker identification is omitted or incorrect, it can lead to considerable confusion and really ought to be reflected in the accuracy evaluation or the overall quality evaluation. But it’s really not possible to do when there’s no reference video available.

And also, the comprehensibility, which I’ve thrown out there a lot of times– this is something that we do measure in our live captions here at 3Play. And really, that is, are the captions an accurate reflection of what the speaker was conveying? So it’s a little bit more of a subjective evaluation but, I think, really the most important part of evaluating the overall quality of live captions.

And then lastly, we have the completeness, which is also a very straightforward calculation, measuring how many words were captioned as compared to how many words were spoken in a live event. So for example, I could probably get 100% accuracy if I were only captioning 10% of the words that were spoken. So of course, that wouldn’t be very acceptable to the audience. So we also want to track, as part of the overall quality evaluation, how much of the content was conveyed in the captions.

So again, that’s a very simple equation if you have a reference point, such as a verbatim transcript or even an ASR transcript. But if those aren’t available, we can also use a ballpark figure that is an average expected word count for a similar event of similar length. So sorry. That might have been a little long-winded answer to how to calculate accuracy. But I just wanted to point out that the overall quality of live captioning really cannot just be summed up by saying it’s 98% or 99% accurate.
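
The word-accuracy and completeness figures Jill describes are simple ratios. Here is a minimal illustrative sketch in Python of those two calculations; the function names and event numbers are hypothetical examples, not 3Play’s actual evaluation tooling.

```python
def word_accuracy(correct_words: int, incorrect_words: int) -> float:
    """Share of the captioned words that are correct (the familiar 98-99% figure)."""
    captioned = correct_words + incorrect_words
    return correct_words / captioned if captioned else 0.0


def completeness(words_captioned: int, words_spoken: int) -> float:
    """Share of the spoken words that made it into the captions at all."""
    return words_captioned / words_spoken if words_spoken else 0.0


# Hypothetical event: 2,000 words spoken, 1,800 of them captioned, 27 of those wrong.
acc = word_accuracy(correct_words=1773, incorrect_words=27)
comp = completeness(words_captioned=1800, words_spoken=2000)
print(f"Word accuracy: {acc:.1%}")  # 98.5%, which sounds excellent on its own
print(f"Completeness:  {comp:.1%}")  # 90.0%, meaning a tenth of the content was never captioned
```

Taken together, the two numbers illustrate Jill’s point that a 98% or 99% accuracy figure alone does not tell the whole story: the same captions can be 98.5% accurate and still miss a tenth of what was said.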

ELISA LEWIS: Yeah, thank you so much. I think you’re right that a lot of the times in the industry, we hear a number or a percentage. But hearing you go through the different components of measuring accuracy is really helpful to get a better holistic picture of all the different pieces and different factors that can play into it. So thanks for explaining that so thoroughly.

As we enter the new year, I’m curious what you would predict for the future of live captioning in 2022 and beyond.

JILL BROOKS: Well, I certainly see a lot more demand in the future. Along with that, I see content producers expecting easier solutions and integrations for all platforms. I see them wanting an easier process end to end. I expect users to go beyond just expecting captioning to requiring it, demanding it, or turning away from media that doesn’t include captioning.

ELISA LEWIS: 3Play Media has developed a disruptive and innovative solution for creating live captions by combining the best of both worlds, both Automatic Speech Recognition, or ASR, and live human captioners. How do you see this solution fitting into the larger live captioning landscape?

JILL BROOKS: That’s absolutely my favorite thing about 3Play’s live offering. I think that the industry and the customers out there have really been looking for this type of solution. A common complaint that we hear from customers is that their captioning provider doesn’t have enough capacity.

And coming from the other side of it, from the provider side of it, yes. It is true. It is difficult to have captioners for every single demand. And that scheduling can make it sometimes difficult to accommodate changes. And we also know that interruptions in the caption delivery can happen for many, many reasons anywhere in that delivery chain.

So to solve that problem, 3Play is combining, really, industry-leading ASR as a failover for every human-captioned event, which means that customers will always have peace of mind that they will be getting live captioning, even if there is an interruption with the human-generated captions, because that ASR is always running in the background. So that’s really a game changer.

ELISA LEWIS: Great. Thank you. Before we wrap up for today, do you have any final pieces of advice that you’d like to share with our listeners who would like to make their live video content more accessible?

JILL BROOKS: My advice is do it. Make it an essential part of your production planning rather than an afterthought. And consider your audience and what their needs might be. Consider the possibilities for expanding your audience by anticipating what their needs might be. And the reps at 3Play can help find the appropriate service for your viewers because accessibility comes in many forms. So there are many different solutions depending on what the use cases are and what the viewers are looking for.

ELISA LEWIS: Yeah. I think you really hit it spot on with saying that you have to plan in advance and just do it. Something we are constantly preaching at 3Play is that it needs to be– accessibility needs to be baked in from the beginning. It’s much harder, much more expensive, and just much more complicated to try to tack it on as an afterthought. So I totally agree. And I think that’s great advice.

How can our listeners connect with you online to stay in touch and follow what you’re doing in this space?

JILL BROOKS: Certainly. Connect with me on LinkedIn and also by following 3Play Media on social media on all the platforms. Or also, you can email me directly at jillbrooks@3playmedia.com.

ELISA LEWIS: Thank you so much, Jill. I really appreciate you taking time out of your day and sharing your expertise with all of our audience and listeners. And I hope you have a great rest of the day.

JILL BROOKS: Thank you. Thank you for having me.

[MUSIC PLAYING]

ELISA LEWIS: Thanks for listening to Allied. We’re excited to offer a special promotion for our live captioning service. When you book your first live professional captioning event, you’ll receive 15 minutes of live auto captioning free. Visit 3playmedia.com/services/livecaptioning to learn more.

If you enjoyed this episode and you’d like to support the podcast, please share it with others, post about it on social media, subscribe, or leave a review. To catch all the latest on accessibility, visit www.3playmedia.com/alliedpodcast. Thanks again, and I’ll see you next time.


Contact Us

 

Thank you for listening to Allied! For show information and updates, visit our website. To get in touch, email us at Allied@3playmedia.com.

Follow us on social media! We can be found on Facebook and Twitter.