3Play Media - Intelligent Transcript

Google brings buzz to captions like never before

November 23rd, 2009 by Josh Miller

Just the other day, Google announced its intention to automatically generate closed caption files for a select group of YouTube videos.  The story quickly made it to the NY Times and all over the blogosphere, as it rightfully should.  The idea is to eventually roll out the capability across YouTube for all users to test.  With 20 hours of video being uploaded to YouTube every minute, that’s a lot of text being created!

At its core, this is a brilliant move by Google to improve YouTube search (and advertising) capabilities.  But Google’s announcement, largely because it’s Google, also puts the accessibility issue in front of the entire country for a change.  Captions are mandated for much of television, but they are only beginning to get some attention on the internet.  Well, until now.  Representative Ed Markey, the same Congressman who made the original push for closed captioning on television, introduced H.R. 3101, the Twenty-first Century Communications and Video Accessibility Act of 2009, during this session of Congress, and it currently has 19 co-sponsors.  This is actually the second attempt at getting a bill passed that would mandate an improved user experience for the hearing impaired.

Thanks to one of the most talked about technology companies of our time, closed captioning is getting attention all over the internet.  Anyone who works with online video is now paying attention to closed captioning.  Not only are we empowering the hearing impaired, but in a virtual world that seems to be driven by search and discovery, video can now be made more “accessible” than ever.

So for a business that is centered on providing high quality, time synchronized transcripts, what does this announcement mean?

Well, it could mean a lot of things.  First, let’s look into this new Google service.  Google will deploy the same technology that powers Google Voice across YouTube to enable the creation of text.  This means they will be using automatic speech recognition (ASR) to create the caption files.  Using ASR on audio and video is not a new concept, but it’s new at this scale.  We’ve commented on our experiences with ASR capabilities in the past.  In fact, we’ve even played with the very engine that will be front and center for the YouTube initiative.

We’ve spoken with many people who have tested ASR solutions.  Usually, if they are talking to us, they weren’t satisfied!  The truth of the matter is that ASR will be good enough for some people, and it won’t be good enough for others.  80% accuracy (at its best and in studio quality recording conditions) leaves a lot to be desired.  In fact, Google even admits that results can be somewhat amusing when they’re off.  On the search front, the most critical keywords tend to be the most unique and, therefore, the least likely to be recognized accurately.  Google’s announcement does not change that; it just makes an ASR solution easier to use and free to consume.  In many cases, Google has likely given people who may never have put captions on their video the ability to do so with very little effort.  Google has also made the search benefits of captions glaringly obvious.

Ultimately, the organizations that require (or believe in) high quality output for captions and search will be willing to pay for cleaned up text.  There are significant benefits to the high-quality approach, whether it be accurate search results or truly legible transcripts.  Branding is also a critical issue for many organizations that add a text component to their video offering.

We at 3Play Media will continue building high quality solutions that make multimedia more accessible for everyone.  More people than ever are aware of the benefits of captions and time-synchronized transcripts now.  We have some new product launches on the way that will build off these very benefits, and we can’t wait to show the world how their online video experience can be changed forever.

Tags: accessibility, Accessibility Act, accuracy, Captioning, Ed Markey, Google, Online Video, speech recognition, video search, YouTube
Posted in 3Play General, Josh, Online Video, Transcription, YouTube

Adventures in Speech Recognition, II

August 19th, 2009 by CJ Johnson

Drove to Manchester, NH and back this morning, borrowing a family member’s car for the trip.  On the way up, we were listening to a well-crafted 80s mix CD, when I switched lanes, accidentally hit a button on the steering wheel’s control panel, and the music went silent!  Frantically, I pulled every bell and whistle on the steering wheel to get WHAM! pumping back through the speakers, but to no avail.  Then, after a couple of seconds, the voice from the GPS system says…. “Please repeat.”

I looked down, and realized I must have hit the button with the icon of the person speaking.  Excellent!  Something fun to play with on the drive (there’s really not much fun to look at on Route 3 unless you’re REALLY into highway construction).

Of course we started with the essential Captain Picard:

CJ: “Maximum warp!”

Car: (pause).

CJ: “Bah!”

Car: “XM Channel Ten.” (CD stops and the Radio tunes to XM Channel Ten).

Interesting response.  Let’s try something else. “Billie Jean” had been playing earlier, so naturally….

CJ: “A heee heeee.”

Car: “CD, Random Track.” (The CD is back on… now we’re listening to some delightful Devo).

That is actually pretty cool.  “Hee hee” sounds a bit like “Cee Dee”.

Then we wrapped up with an homage to Ron Burgundy.

CJ: “You ate an entire wheel of cheese?”

Car: “Air Conditioner.”

I’m not even sure what happened here.  I guess it was acknowledging that there was, in fact, an air conditioner on board.

Next was to see if I could actually get it to do something I wanted.

CJ: “XM.  Nine!”

Car: “Please repeat.”

CJ: “XM. Eight!”

Car: “Say again.”

CJ: “XM. Seven!”

Car: “XM Channel Eight.” (Radio is on again, tunes to XM Channel Eight.)

Fun stuff.  So far…. yet so far to go….

Tags: accuracy, speech recognition
Posted in 3Play General, CJ

Accuracy Still a Problem for Google's Ears

June 30th, 2009 by Josh Miller

As we’ve discussed, speech recognition can be a very powerful tool.  But it can’t quite complete the transcription process all on its own.  There is still a gap between its capabilities and what a high quality, legible transcript requires.  Many, including Google, have tried to conquer this automated linguistic feat.

On Friday, David Gallagher of the NY Times started a discussion on Google’s new Google Voice app that allows users to have their voicemail transcribed into text automatically.  The Google app uses an automatic speech recognizer to decipher the spoken content into a friendly email format.  While some might expect Google to be able to put the speech recognition puzzle together, even Google Voice gives us some entertaining reading material.  Yesterday, Mr. Gallagher posted the results of his Google Voice testing.

As much as the speech and AI experts try to model a human’s voice, only a human ear can pick up all the tiny nuances of speech.  From dialect to tone to context, so much can go wrong so fast with a machine.  There is a lot speech recognition can offer, but there has to be a way to allow a human to be part of the process to ensure quality.  And if search or ad delivery is part of the equation (as we might guess with our Google friends), you can imagine what happens to those results when you start with a misguided transcript.

Tags: accuracy, speech recognition, Transcription
Posted in 3Play General, Transcription

On Accuracy, Part 2: What does accurate even mean?

June 11th, 2009 by Josh Miller

A couple weeks ago, CJ laid out how detrimental just a few percentage points of inaccuracy can be to the integrity of a transcript.  Even at 99% accuracy, there are residual effects on the overall results that may jeopardize the true quality.  Pretty startling really.  But the question remains: what does 99% accuracy even mean?

Speech and writing are different means of communication.  For example, writing has the advantage of being independent of time.  It can be revised several times before it is made public in what is essentially a final draft format.  Speech, on the other hand, is a one-swing event.  First draft and that’s it – what’s said is said.  Naturally, people have developed mechanisms to give their brains a little more time to polish their speech.  Fillers, including “um”, “uh”, and “you know”, or elongated sounds are common examples of these time-buying speech patterns.  In addition, we don’t speak with proper sentence structure.  Filler words allow us to talk in run-on sentences without even hesitating.  As listeners, we’ve even trained ourselves to filter out a lot of these sounds to capture a seemingly clean delivery from the speaker.  Imagine if the next book you read was filled with false sentence starts and “you knows” – you’d go insane.

So should a written transcript capture every single utterance or should it be edited for a reading audience?  Should accuracy be measured on every sound that comes out of a speaker’s mouth or should it be based on a cleaned up representation that captures all the intended content?  All of a sudden, the objective measure that so many people want to use can be extremely subjective.  For reference, many transcription firms guarantee an accuracy rate of 98%.

We call capturing every single utterance a “verbatim” transcript.  As such, we would capture every single “um”, “uh”, stutter, interrupting speaker change of “uh-huh”, and so on.  Very frustrating to read.  But in a way, easier to measure.  You either have the sound written down or you don’t.

Most people/customers prefer what we call a “clean read.”  This is the case where we cut out the stutters and unnecessary filler words.  The most important part here is to preserve the meaning of every single sentence.  But how can one measure accuracy in objective terms for a process that calls for subjective editing?

Here’s a brief example comparing the two methods:

[Image: Transcription Accuracy Example]

As you can see, there is a significant difference between the two methods, and this is just a brief excerpt.  Over the course of a one hour interview, the gap in word counts will dramatically widen.
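To make the word-count gap concrete, here is a minimal sketch that naively strips the fillers named above (“um”, “uh”, “you know”) from a verbatim string.  The filler list, the sample sentence, and the function names are illustrative assumptions.  A real clean read depends on an editor’s judgment about meaning, which a pattern match like this cannot supply (for instance, it would also delete a “you know” that the speaker actually meant).

```python
import re

# Hypothetical filler list for illustration; the post names "um", "uh", and "you know".
FILLERS = ["you know", "um", "uh"]


def clean_read(verbatim: str) -> str:
    """Naively drop filler words from a verbatim transcript string."""
    # Match any filler as a whole word, plus an optional trailing comma/period and spaces.
    pattern = r"\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b[,.]?\s*"
    cleaned = re.sub(pattern, "", verbatim, flags=re.IGNORECASE)
    # Collapse any doubled spaces left behind by the removals.
    return re.sub(r"\s{2,}", " ", cleaned).strip()


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


# Made-up sample sentence, not taken from a real transcript.
verbatim = "Um so I think uh the results were you know pretty good"
clean = clean_read(verbatim)

print(clean)                                   # so I think the results were pretty good
print(word_count(verbatim), word_count(clean)) # 12 8
```

Even in this one short sentence a third of the words disappear, which is why a word-for-word accuracy score depends so heavily on which version you treat as the reference.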

We tend to avoid throwing guaranteed accuracy rates around because we realize just how difficult it is to measure.  What if a transcript really is 98% accurate, but the 2% of mistakes happen to be critical words within sentences, resulting in lost meaning?

We firmly believe that it is our responsibility to provide an output with the critical content intact.  We’d rather miss a filler word than mistype a noun.  Stated rate or not, the work we do is high quality and consistent.  One of my favorite quotes from one of our customers is, “this is better than what I get back from my copy editor!”

MIT gets a reputation for throwing numbers at every problem imaginable.  In fact, every course, room, building, and student is identified by number.  But in reality, MIT excels at teaching how to use quantitative models effectively as well as when these models break down in the real world.  Transcription accuracy is one of those cases where the numbers cannot tell the entire story.

Besides, if you’re worried that we’re not comfortable using numbers, just remember that we put one in our name.

Tags: accuracy, clean read, MIT, quantitative modeling, speech, Transcription, verbatim, writing
Posted in 3Play General, Josh, Transcription

Adventures in Speech Recognition

May 22nd, 2009 by CJ Johnson

A few weeks ago, I wrote about the traditional metrics for accuracy in transcription.  Every now and then, we’ll run across an egregious mistake from an automatic speech recognizer (ASR) that’s pretty funny.  

Check out this one from today.  In an interview with a famous biologist, what was supposed to read “RNA catalysis, a wonderful discovery” came out as “naked palaces a wonderful discovered”.

Now I won’t argue the relative merits of RNA catalysis vs. naked palaces – I’m sure they’re both great in their own unique ways – but the first one was a bit closer to the context we were looking for in that interview.

I have come to appreciate the strange rhymes that come out of ASR engines, having spent a number of years now sifting through their outputs.  Doing so, you get a feel for how a group of sounds isolated from the surrounding words and context can be mistaken for any number of alternatives.  

Even more so, I have come to appreciate the ability of the human mind to turn sounds into words and meanings – to resolve misspoken words, accents and stutters into the intended meaning.  We decipher context every day, and not just with 50-cent words that challenge our vocabulary.  People say “tuh” instead of “to”.  People say “mighta” instead of “might have”.  The words “marriage” and “mirage” sound remarkably similar spoken alone, yet our experience and power of deductive reasoning allow us to correctly assess the intended meaning of the speech without any real deep thought.

The technology behind machine recognition, adaptation, and reasoning is fascinating.  It is amazing how far we have come in just a few decades of research.  It is just as amazing to think how far there is to go, and to dream what innovations may come to fill the remaining voids to true artificial intelligence.

CEJ

Tags: accuracy, speech recognition, Transcription
Posted in 3Play General, CJ

On Accuracy, Part I

April 23rd, 2009 by CJ Johnson

Spending 80+ hours a week on a startup business, it is natural that social conversations frequently wander to work.  When the topic comes up, it leads to quite a few questions.

Typically, it starts something like:

They: “So what do you do for work?”

Me: “I started a company while I was in grad school with a few buddies of mine.”

They: “Oh cool! What do you guys do?”

Me: “We specialize in efficient transcription and closed captioning.”

They: “Oh…. Well don’t they have speech recognizers to do that?”

Me: “Well they aren’t really accurate enough.  So we’ve made a bunch of tools that work with recognizers to scrub it up really quickly and accurately.”

They: “Huh… Well I heard they’re 80%, 90% accurate these days.”

This is where it tends to get tricky, as it turns out accuracy is pretty tough to define.

For speech-to-text, accuracy can be measured by looking at the text output, listening to the audio, and circling the words that match; x-ing out the ones that don’t.  By this measure, 80%, 90% accuracy is technically amazing. 
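The circle-and-cross procedure can be sketched in a few lines.  This is only an approximation under stated assumptions: it aligns the two word lists with Python’s standard `difflib.SequenceMatcher` and counts the reference words that survive in order.  The function names and the tokenizing rule (lowercase, punctuation dropped) are illustrative choices, and this is not the word error rate calculation that speech researchers typically report.  The sample pair is the “RNA catalysis” misrecognition quoted in the “Adventures in Speech Recognition” post on this page.

```python
import re
from difflib import SequenceMatcher


def words(text: str) -> list[str]:
    """Lowercase the text and split it into words, dropping punctuation (an assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Fraction of reference words that the recognizer output matches, in order."""
    ref, hyp = words(reference), words(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    matcher = SequenceMatcher(None, ref, hyp, autojunk=False)
    # Each matching block is a run of identical words; the final dummy block has size 0.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref)


reference = "RNA catalysis, a wonderful discovery"
hypothesis = "naked palaces a wonderful discovered"
print(f"{word_accuracy(reference, hypothesis):.0%}")  # 40%
```

Only “a” and “wonderful” get circled, so the score is 2 of 5 words, even though a reader would say the recognizer lost the entire point of the sentence.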

However, if you dig deeper, you see where speech recognition unravels as a practical means of transcription, and why human involvement is still necessary to complete the work in the absence of the “magic button”: a generalized, 100% accurate recognizer.

Let’s reframe “accuracy”.  As a firm that specializes in software for transcription & captioning of multimedia, our goal is to create works that stand alone as accurate reflections of spoken content.  Sentence-to-sentence, this means a group of words that accurately reflect the meaning of the spoken content.  Word-to-word, this means spelling the words within those sentences correctly.

Spelling is easy to check, and in a sentence, you don’t need to be an English major to recognize the fact that one word, mistaken for another similar sounding one, can significantly alter the meaning of a sentence (won’t vs. want; feel vs. field; etc… more on this in my next post).

I’ve created a chart that outlines the propagated implications of accuracy rates from speech recognizers, assuming a range of accuracies, and 8 & 10 word sentences.  

For example, 67% accuracy means 1 out of every 3 words is incorrect.  For an 8-word sentence, the likelihood that the recognizer got all 8 words correct is $(67\%)^{8} \approx 4\%$.  Similarly, for a 10-word sentence, the likelihood of the recognizer getting all 10 words in a row correct is $(67\%)^{10} \approx 2\%$.

Word-to-word accuracy   1 of x words incorrect   8-word sentence accuracy   10-word sentence accuracy
50%                     1 of 2                   0%                         0%
67%                     1 of 3                   4%                         2%
75%                     1 of 4                   10%                        6%
85%                     1 of 7                   27%                        20%
90%                     1 of 10                  43%                        35%
95%                     1 of 20                  66%                        60%
99%                     1 of 100                 92%                        90%

This tells me that with 90% accuracy, I will get less than HALF of my 8-word sentences back correctly transcribed, and only 35% accuracy on my 10-word sentences.  With even slight dips in accuracy or increases in sentence length, these rates plummet, as the relation is exponentially scaled. And this doesn’t even begin to take into account other markup such as phrasing and punctuation!!
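The chart can be reproduced with a short script.  This is a minimal sketch that assumes, as the arithmetic above implicitly does, that each word is recognized correctly with the same probability and independently of its neighbors; the function name `sentence_accuracy` is made up for illustration.

```python
def sentence_accuracy(word_accuracy: float, sentence_length: int) -> float:
    """Chance that every word in a sentence is correct, assuming independent word errors."""
    return word_accuracy ** sentence_length


# Same word-level accuracies as the chart, for 8- and 10-word sentences.
print("word   8-word   10-word")
for p in (0.50, 0.67, 0.75, 0.85, 0.90, 0.95, 0.99):
    eight = sentence_accuracy(p, 8)
    ten = sentence_accuracy(p, 10)
    print(f"{p:>4.0%}   {eight:>6.0%}   {ten:>7.0%}")
```

Changing the sentence length in the call shows how quickly the numbers fall: at 90% word accuracy, a 20-word sentence comes back fully correct only about 12% of the time.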

In practice, we tend to receive files with significant background noise, many speakers, accents, etc.  In these cases, we might see the recognizer hit 67%.  If we returned these transcripts to our customers fresh off the ASR engine, they would only see 1 out of 25 8-word sentences correct!!!  Something tells me we wouldn’t retain them for very long….

So it turns out that the 10%, 20% gap is much bigger than it would appear on the surface of the standard “accuracy” metrics; and to create useful standalone works from audio, that gap between the state of the art in speech recognition and the end goal needs to be filled accurately and cost-effectively.

That’s where we come in :)

-CEJ

Tags: accuracy, Captioning, speech recognition, Transcription
Posted in 3Play General, CJ

©2011 3Play Media | Terms | Privacy | info@3playmedia.com

Facebook 3Play Media on Facebook

Twitter 3Play Media on Twitter