3Play Media - Intelligent Transcript

Integrating Raytheon BBN Technologies Automatic Speech Recognition Technology into Our Core Process

August 23rd, 2011 by Roger Zimmerman

We are proud to announce that our transcription process now uses high-quality automatic transcriptions produced by Raytheon BBN Technologies’ Speech-To-Text engine. This milestone is the culmination of an exhaustive search for the best available automatic speech recognition (ASR) technology to incorporate into our manual transcription process.


The relationship with BBN provides a platform for continual improvement of the draft transcripts, which are subsequently cleaned up by our workforce of more than 250 professional transcriptionists. A crucial benefit of BBN’s ASR engine is its very rich output. Instead of just a single “best path” draft transcript, BBN automatic transcription provides lattices, search parameters, and information from multiple passes. With this information, we are able to improve the formatting of drafts and to more accurately assess the editing difficulty of a given media file. These improved drafts have already made a significant impact, allowing our editors to focus on the finer details of transcription, such as proper punctuation, numerals, formulaic expressions, and correct spelling of names, places, and neologisms.

BBN’s ASR engine is widely recognized as the leading technology for batch speech-to-text processing in the most demanding applications, including some mission-critical government deployments. In addition, BBN’s algorithms consistently rate at or near the top in the frequent speech technology evaluations conducted by the National Institute of Standards and Technology. BBN continues to invest heavily in R&D to improve its ASR technology. Importantly, this relationship includes an ongoing commitment to continually adapt and improve the acoustic and language models, increase the ASR vocabulary, and pursue other improvements to provide consistently high-quality transcriptions.

Our relationship with BBN exemplifies our dedication to continually improving our products and services. We expect to make more such announcements in the coming months, as our team is working on many different aspects of the application, with an eye toward improving the quality of our time-synchronized transcripts and closed captions.


Tags: bbn, innovation, partners, raytheon, software, speech recognition, Transcription
Posted in Uncategorized | No Comments »

3Play Media Welcomes Roger Zimmerman to the Team

March 2nd, 2011 by Tole Khesin

3Play Media is excited to announce that Roger Zimmerman has joined 3Play Media as the VP of Research & Development. Roger has an extensive background in speech and language processing, as well as design and implementation of large, scalable software systems.

Roger comes to us from Nuance, where he held the position of Senior R&D Architect. Prior to Nuance, Roger was the first employee at eScription, where he developed advanced speech recognition and natural language processing technology for medical transcription. In addition, Roger pioneered the development of a scalable farm of speech recognition servers that transcribe thousands of hours of audio per day. With Roger’s key contributions, eScription became enormously successful and was acquired by Nuance in 2008. Prior to eScription, Roger held research and technical leadership positions at Philips Speech Processing and Voice Processing Corporation.

Roger is a graduate of Brown University and has authored numerous patents and publications on speech and language processing. He is also an accomplished brewer, having just completed a batch he calls “3Play Medi’ale”, which has a noted hoppy bitterness.

Tags: employees, language processing, leadership, management team, natural language processing, roger zimmerman, speech, speech processing, speech recognition
Posted in Uncategorized | No Comments »

Google brings buzz to captions like never before

November 23rd, 2009 by Josh Miller

Just the other day Google announced its intention to automatically generate closed caption files for a select group of YouTube videos. The story quickly made it to the NY Times and all over the blogosphere, as it rightfully should. The idea is to eventually roll out the capability across YouTube for all users to test. With 20 hours of video being uploaded to YouTube every minute, that’s a lot of text being created!

At its core, this is a brilliant move by Google to improve YouTube search (and advertising) capabilities. But Google’s announcement, largely because it’s Google, also puts the accessibility issue in front of the entire country for a change. Captions are mandated for much of television, but they are only beginning to get some attention on the internet. Well, until now. Representative Ed Markey, the same Congressman who made the original push for closed captioning on television, introduced H.R. 3101, the Twenty-first Century Communications and Video Accessibility Act of 2009, during this session of Congress, and it currently has 19 co-sponsors. This is actually the second attempt at getting a bill passed that would mandate an improved user experience for the hearing impaired.

Thanks to one of the most talked about technology companies of our time, closed captioning is getting attention all over the internet.  Anyone who works with online video is now paying attention to closed captioning.  Not only are we empowering the hearing impaired, but in a virtual world that seems to be driven by search and discovery, video can now be made more “accessible” than ever.

So for a business that is centered on providing high quality, time synchronized transcripts, what does this announcement mean?

Well, it could mean a lot of things. First, let’s look into this new Google service. Google will deploy the same technology that powers Google Voice across YouTube to enable the creation of text. This means they will be using automatic speech recognition (ASR) to create the caption files. Using ASR on audio and video is not a new concept, but it’s new at this scale. We’ve commented on our experiences with ASR capabilities in the past. In fact, we’ve even played with the very engine that will be front and center for the YouTube initiative.

We’ve spoken with many people who have tested ASR solutions. Usually, if they are talking to us, they weren’t satisfied! The truth of the matter is that ASR will be good enough for some people, and it won’t be good enough for others. 80% accuracy (at its best and in studio-quality recording conditions) leaves a lot to be desired. In fact, Google even admits that results can be somewhat amusing when they’re off. On the search front, the most critical keywords tend to be the most unique and, therefore, the least likely to be recognized accurately. Google’s announcement does not change that; it just makes an ASR solution easier to use and free to consume. In many cases, Google has likely given people who might never have put captions on their video the ability to do so with very little effort. Google has also made the search benefits of captions glaringly obvious.

Ultimately, the organizations that require (or believe in) high quality output for captions and search will be willing to pay for cleaned up text.  There are significant benefits to the high-quality approach, whether it be accurate search results or truly legible transcripts.  Branding is also a critical issue for many organizations who add a text component to their video offering.

We at 3Play Media will continue building high quality solutions that make multimedia more accessible for everyone.  More people than ever are aware of the benefits of captions and time-synchronized transcripts now.  We have some new product launches on the way that will build off these very benefits, and we can’t wait to show the world how their online video experience can be changed forever.

Tags: accessibility, Accessibility Act, accuracy, Captioning, Ed Markey, Google, Online Video, speech recognition, video search, YouTube
Posted in 3Play General, Josh, Online Video, Transcription, YouTube | No Comments »

Adventures in Speech Recognition, II

August 19th, 2009 by CJ Johnson

Drove to Manchester, NH and back this morning, borrowing a family member’s car for the trip.  On the way up, we were listening to a well-crafted 80s mix CD, when I switched lanes, accidentally hit a button on the steering wheel’s control panel, and the music went silent!  Frantically, I pulled every bell and whistle on the steering wheel to get WHAM! pumping back through the speakers, but to no avail.  Then, after a couple of seconds, the voice from the GPS system says…. “Please repeat.”

I looked down, and realized I must have hit the button with the icon of the person speaking.  Excellent!  Something fun to play with on the drive (there’s really not much fun to look at on Route 3 unless you’re REALLY into highway construction).

Of course we started with the essential Captain Picard:

CJ: “Maximum warp!”

Car: (pause).

CJ: “Bah!”

Car: “XM Channel Ten.” (CD stops and the Radio tunes to XM Channel Ten).

Interesting response.  Let’s try something else. “Billie Jean” had been playing earlier, so naturally….

CJ: “A heee heeee.”

Car: “CD, Random Track.” (The CD is back on… now we’re listening to some delightful Devo).

That is actually pretty cool. “Hee hee” sounds a bit like “Cee Dee”.

Then we wrapped up with an homage to Ron Burgundy.

CJ: “You ate an entire wheel of cheese?”

Car: “Air Conditioner.”

I’m not even sure what happened here.  I guess it was acknowledging that there was, in fact, an air conditioner on board.

Next was to see if I could actually get it to do something I wanted.

CJ: “XM.  Nine!”

Car: “Please repeat.”

CJ: “XM. Eight!”

Car: “Say again.”

CJ: “XM. Seven!”

Car: “XM Channel Eight.” (Radio is on again, tunes to XM Channel Eight.)

Fun stuff.  So far…. yet so far to go….

Tags: accuracy, speech recognition
Posted in 3Play General, CJ | No Comments »

Accuracy Still a Problem for Google's Ears

June 30th, 2009 by Josh Miller

As we’ve discussed, speech recognition can be a very powerful tool. But it can’t quite complete the transcription process all on its own. There is still a gap between what it can do and what a high-quality, legible transcript requires. Many have tried to conquer this automated linguistic feat, including Google.

On Friday, David Gallagher of the NY Times started a discussion on Google’s new Google Voice app that allows users to have their voicemail transcribed into text automatically.  The Google app uses an automatic speech recognizer to decipher the spoken content into a friendly email format.  While some might expect Google to be able to put the speech recognition puzzle together, even Google Voice gives us some entertaining reading material.  Yesterday, Mr. Gallagher posted the results of his Google Voice testing.

As much as the speech and AI experts try to model a human’s voice, only a human ear can pick up all the tiny nuances of speech.  From dialect to tone to context, so much can go wrong so fast with a machine.  There is a lot speech recognition can offer, but there has to be a way to allow a human to be part of the process to ensure quality.  And if search or ad delivery is part of the equation (as we might guess with our Google friends), you can imagine what happens to those results when you start with a misguided transcript.

Tags: accuracy, speech recognition, Transcription
Posted in 3Play General, Transcription | No Comments »

Adventures in Speech Recognition

May 22nd, 2009 by CJ Johnson

A few weeks ago, I wrote about the traditional metrics for accuracy in transcription.  Every now and then, we’ll run across an egregious mistake from an automatic speech recognizer (ASR) that’s pretty funny.  

Check out this one from today.  In an interview with a famous biologist, what was supposed to read “RNA catalysis, a wonderful discovery” came out as “naked palaces a wonderful discovered”.

Now I won’t argue the relative merits of RNA catalysis vs. naked palaces – I’m sure they’re both great in their own unique ways – but the first one was a bit more in keeping with the context we were looking for in that interview.

I have come to appreciate the strange rhymes that come out of ASR engines, having spent a number of years now sifting through their outputs.  Doing so, you get a feel for how a group of sounds isolated from the surrounding words and context can be mistaken for any number of alternatives.  

Even more so, I have come to appreciate the ability of the human mind to turn sounds into words and meanings, and to resolve misspoken words, accents, and stutters into the intended meaning. We decipher context every day, and not just with 50-cent words that challenge our vocabulary. People say “tuh” instead of “to”. People say “mighta” instead of “might have”. The words “marriage” and “mirage” sound remarkably similar spoken alone, yet our experience and power of deductive reasoning allow us to correctly assess the intended meaning of the speech without any real deep thought.

The technology behind machine recognition, adaptation, and reasoning is fascinating.  It is amazing how far we have come in just a few decades of research.  It is just as amazing to think how far there is to go, and to dream what innovations may come to fill the remaining voids to true artificial intelligence.

CEJ

Tags: accuracy, speech recognition, Transcription
Posted in 3Play General, CJ | No Comments »

On Accuracy, Part I

April 23rd, 2009 by CJ Johnson

Spending 80+ hours a week on a startup business, it is natural that social conversations frequently wander to work.  When the topic comes up, it leads to quite a few questions.

Typically, it starts something like:

They: “So what do you do for work?”

Me: “I started a company while I was in grad school with a few buddies of mine.”

They: “Oh cool! What do you guys do?”

Me: “We specialize in efficient transcription and closed captioning.”

They: “Oh…. Well don’t they have speech recognizers to do that?”

Me: “Well they aren’t really accurate enough.  So we’ve made a bunch of tools that work with recognizers to scrub it up really quickly and accurately.”

They: “Huh… Well I heard they’re 80%, 90% accurate these days.”

This is where it tends to get tricky, as it turns out accuracy is pretty tough to define.

For speech-to-text, accuracy can be measured by looking at the text output, listening to the audio, and circling the words that match; x-ing out the ones that don’t.  By this measure, 80%, 90% accuracy is technically amazing. 
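The circle-and-x procedure above can be approximated in code. The following is a minimal sketch, not the tooling we actually use: it assumes accuracy is scored as one minus the word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words, which is the usual word-error-rate convention. The function name `word_accuracy` and the lowercase/whitespace tokenization are assumptions made for illustration.

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Approximate word-to-word accuracy of an ASR draft against a reference.

    Assumption: accuracy = 1 - (word-level edit distance / reference length),
    clamped at 0 so a very long, wrong draft cannot score below zero.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Standard dynamic-programming edit distance, computed over words
    # rather than characters. prev[j] is the distance between the first
    # i-1 reference words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from draft
            insertion = curr[j - 1] + 1  # extra word in the draft
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    errors = prev[-1]
    return max(0.0, 1.0 - errors / len(ref))


if __name__ == "__main__":
    # Hypothetical example: one wrong word out of six.
    print(word_accuracy("the cat sat on the mat", "the cat sat on a mat"))
```

Punctuation and capitalization are ignored here on purpose, since the word-to-word metric only asks whether the right words came out.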

However, if you dig deeper, you see where speech recognition unravels as a practical means of transcription, and why human interaction is still necessary to complete the work in the absence of the “magic button”: a generalized, 100% accurate recognizer.

Let’s reframe “accuracy”.  As a firm that specializes in software for transcription & captioning of multimedia, our goal is to create works that stand alone as accurate reflections of spoken content.  Sentence-to-sentence, this means a group of words that accurately reflect the meaning of the spoken content.  Word-to-word, this means spelling the words within those sentences correctly.

Spelling is easy to check, and in a sentence, you don’t need to be an English major to recognize the fact that one word, mistaken for another similar sounding one, can significantly alter the meaning of a sentence (won’t vs. want; feel vs. field; etc… more on this in my next post).

I’ve created a chart that outlines the propagated implications of accuracy rates from speech recognizers, assuming a range of accuracies, and 8 & 10 word sentences.  

For example, 67% accuracy means 1 out of every 3 words is incorrect. For an 8-word sentence, the likelihood that the recognizer got all 8 words correct is $0.67^{8} \approx 4\%$. Similarly for a 10-word sentence, the likelihood of the recognizer getting all 10 words in a row correct is $0.67^{10} \approx 2\%$.

Word-to-word accuracy | 1 of x words incorrect | 8-word sentence accuracy | 10-word sentence accuracy
50%                   | 1 of 2                 | 0%                       | 0%
67%                   | 1 of 3                 | 4%                       | 2%
75%                   | 1 of 4                 | 10%                      | 6%
85%                   | 1 of 7                 | 27%                      | 20%
90%                   | 1 of 10                | 43%                      | 35%
95%                   | 1 of 20                | 66%                      | 60%
99%                   | 1 of 100               | 92%                      | 90%
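For anyone who wants to check the chart, here is a short sketch that recomputes it. It assumes, as the chart implicitly does, that every word is recognized independently with the same probability, so an n-word sentence comes back fully correct with probability p to the power n. The names `sentence_accuracy` and `accuracy_table` are made up for this example.

```python
def sentence_accuracy(word_accuracy: float, sentence_length: int) -> float:
    """Probability that every word in a sentence is correct.

    Assumption: word errors are independent and equally likely,
    so the result is simply word_accuracy ** sentence_length.
    """
    return word_accuracy ** sentence_length


# The word-to-word accuracy levels used in the chart.
WORD_ACCURACIES = [0.50, 0.67, 0.75, 0.85, 0.90, 0.95, 0.99]


def accuracy_table(word_accuracies=WORD_ACCURACIES, lengths=(8, 10)):
    """Return one row per accuracy level, with percentages rounded to integers."""
    rows = []
    for p in word_accuracies:
        row = [round(p * 100)]
        row.extend(round(sentence_accuracy(p, n) * 100) for n in lengths)
        rows.append(tuple(row))
    return rows


if __name__ == "__main__":
    print("word %  8-word %  10-word %")
    for word_pct, eight_pct, ten_pct in accuracy_table():
        print(f"{word_pct:>6}  {eight_pct:>8}  {ten_pct:>9}")
```

Real recognizer errors tend to cluster (a noisy passage or an unfamiliar name hurts several neighboring words at once), so the independence assumption is a simplification.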

This tells me that with 90% accuracy, I will get less than HALF of my 8-word sentences back correctly transcribed, and only 35% accuracy on my 10-word sentences.  With even slight dips in accuracy or increases in sentence length, these rates plummet, as the relation is exponentially scaled. And this doesn’t even begin to take into account other markup such as phrasing and punctuation!!
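The same relation can be run backwards to ask what word-to-word accuracy a recognizer would need in order to hit a target sentence accuracy. This is only a sketch under the same independence assumption; `required_word_accuracy` is a hypothetical helper name.

```python
def required_word_accuracy(target_sentence_accuracy: float, sentence_length: int) -> float:
    """Word-to-word accuracy needed to reach a target sentence accuracy.

    Inverts sentence_accuracy = word_accuracy ** sentence_length,
    assuming independent, equally likely word errors.
    """
    if not 0.0 < target_sentence_accuracy <= 1.0:
        raise ValueError("target must be in the interval (0, 1]")
    if sentence_length < 1:
        raise ValueError("sentence_length must be at least 1")
    return target_sentence_accuracy ** (1.0 / sentence_length)


if __name__ == "__main__":
    # Word accuracy needed for 90% of 10-word sentences to come back clean.
    print(f"{required_word_accuracy(0.90, 10):.4f}")
```

Getting 90% of 10-word sentences fully correct takes roughly 99% word-to-word accuracy, which matches the last row of the chart.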

In general practice, we tend to receive files with significant background noise, many speakers, accents, etc. In these cases, we might see the recognizer hit 67%. If we returned these transcripts to our customers fresh off the ASR engine, they would only see 1 out of 25 8-word sentences correct!!! Something tells me we wouldn’t retain them for very long….

So it turns out that the 10%, 20% gap is much bigger than it would appear on the surface of the standard “accuracy” metrics. To create useful standalone works from audio, that gap between the state of the art in speech recognition and the end goal needs to be filled accurately and cost-effectively.

That’s where we come in :)

-CEJ

Tags: accuracy, Captioning, speech recognition, Transcription
Posted in 3Play General, CJ | No Comments »

©2011 3Play Media | Terms | Privacy | info@3playmedia.com
