Video: How Reliable Is ASR-Generated Live Captioning?


Watch the complete video of this presentation from Streaming Media West, LS202: Reaching the Audience--Advances and Challenges in Captioning Live Streams, in the Streaming Media Conference Video Portal.

Read the complete transcript of this clip:

John Capobianco: The reliability of automatic speech recognition (ASR) is the natural question that comes up from everybody. Everybody believes that ASR does good levels of captioning. It doesn't. We've studied this. We've run several ASR engines internally because we're always looking for the best way to get captions done. It doesn't do a good enough job, and I have proof of that because I test what's going on in the country all over the place every day.

ASR Is Cheap

Most of the ASR engine issues happen because people use it because it's cheap. That's really the only reason that anybody cares about: It doesn't cost hardly anything to do it. It's worth every penny you pay for it. It doesn't do a very good job of it. It gets most proper nouns wrong. It gets most proper names wrong. It gets most names wrong. I've averaged this across all of the big providers. I'm not going to name them, I don't care who they are. I look at what they do live on broadcast television. And the average right now is just under 68% accurate.

One out of every three words is wrong. Two-thirds of the errors are wrong words or missing words. Think about that. One out of every three words that you have going on is wrong or missing. It's not an adequate way to communicate.
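As an illustration only (not part of the presentation), word-level accuracy of the kind quoted above can be estimated by aligning a caption output against a trusted reference transcript and counting wrong, missing, and extra words. The sketch below assumes a standard word-level edit-distance alignment; the speaker does not say which scoring method he uses, and the names `WordErrors` and `count_word_errors` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WordErrors:
    substitutions: int      # wrong words
    deletions: int          # reference words missing from the captions
    insertions: int         # extra words that are not in the reference
    reference_length: int   # number of words in the reference transcript

    @property
    def accuracy(self) -> float:
        # Assumed definition: 1 - (all word errors / reference words), floored at 0.
        errors = self.substitutions + self.deletions + self.insertions
        return max(0.0, 1.0 - errors / self.reference_length)


def count_word_errors(reference: str, hypothesis: str) -> WordErrors:
    """Align two transcripts word by word with minimum edit distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Each cell is (cost, subs, dels, ins) for aligning ref[:i] with hyp[:j].
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        curr = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            diag, up, left = prev[j - 1], prev[j], curr[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                candidates = [diag]  # words match, no added cost
            else:
                candidates = [(diag[0] + 1, diag[1] + 1, diag[2], diag[3])]
            # Reference word missing from the captions (deletion).
            candidates.append((up[0] + 1, up[1], up[2] + 1, up[3]))
            # Caption word with no counterpart in the reference (insertion).
            candidates.append((left[0] + 1, left[1], left[2], left[3] + 1))
            curr.append(min(candidates, key=lambda c: c[0]))
        prev = curr

    _, subs, dels, ins = prev[-1]
    return WordErrors(subs, dels, ins, len(ref))


result = count_word_errors(
    "The mayor of Springfield spoke today",
    "the mare of spoke today",
)
print(result, round(result.accuracy, 2))
```

In this made-up six-word example, one word is wrong and one is missing, so two of six reference words are in error and the accuracy comes out at about 0.67, close to the one-in-three error rate described above.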

In addition to that, it captions poorly if there's not a very good connection, there's background noise, people talk over one another. The place where it actually does work is when people are trained to use it and they talk like this to the machine and the machine captions and they put in their period and captions and commas, and they do that through a monotone and they talk to it and there's only a single speaker.

Training ASR

You can train ASR to do okay. We do that. That's what voice-writers do, but at the same time, that's not adequate for most of your broadcast needs.

In addition to that, it doesn't capitalize. It doesn't punctuate. People always say, "Who cares about the punctuation?" If I gave you a pamphlet and it didn't have paragraphs or punctuation or commas or anything else, how far do you think you'd read in that document? You'd be confused very, very quickly. People don't think about it that way. Take a paperback book and imagine it with no punctuation, no chapters, no indexes, no commas, no indents for paragraphs or any of those things. Just a stream of words. It would be awful. You can watch it and it's just not very good.

Why Human Captioners Paraphrase

One of the other really important things about automatic speech recognition is that, and because you say, "Well, I watch captioners and there are missing words when humans do it too." That's true. They paraphrase sometimes. Captioners are trained to do that sometimes, and as much as we don't like to think about not doing verbatim because we all want to do verbatim all the time, verbatim is not always the best delivery on the screen.

Our captioners are taught to paraphrase in order to slow it down enough so the words stay on the screen long enough for somebody to be able to read them. And sometimes they'll leave out some words for better meaning.

ASR engines leave out words because they get befuddled. Humans leave out words because they're trying to improve the meaning. So when you actually compare what happens between human captioners and automatic speech recognition, the huge difference is the readability of what's happening. And it's the human context of knowing how to communicate effectively with the words that are being spoken.

When Will ASR Be Ready?

We get a lot of questions about this. Everybody wants to know when is it going to be ready? Well, so do we, which is why we test it every day. I've got currently 58 tests that I've just done on 80,000 words and that's where I get my statistics from of 67.88% accurate. It was the 32.12 point, whatever that is, that were inaccurate. And two-thirds of that is missing words and wrong words.
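To put those figures in absolute terms, here is a rough back-of-the-envelope calculation. It assumes that the 80,000 words are the combined total across the 58 tests and that "two-thirds" means exactly 2/3; neither is confirmed in the talk, and the variable names are illustrative.

```python
total_words = 80_000          # assumed combined word count across the 58 tests
accuracy_bp = 6_788           # 67.88% expressed in basis points to avoid float drift

# Words counted as inaccurate: 32.12% of the total.
error_words = total_words * (10_000 - accuracy_bp) // 10_000

# Portion of the errors described as missing or wrong words (assumed exactly 2/3).
missing_or_wrong = round(error_words * 2 / 3)

print(error_words)        # 25696
print(missing_or_wrong)   # 17131
```

Under those assumptions, roughly 25,700 of the 80,000 words were inaccurate, and about 17,100 of those were missing or wrong words. The talk does not say what made up the remaining third.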
