Streaming Media


Ai-Media's Matthew Mello Talks the Evolution of AI Captioning

Ai-Media's Matthew Mello talks with Tim Siglin about the evolution of AI captioning in this exclusive interview from Streaming Media East 2023.

Tim Siglin, Founding Executive Director, Help Me Stream Research Foundation, and Contributing Editor, Streaming Media, sits down with Matthew Mello, Technical Sales Manager, Ai-Media, to discuss the evolution of AI captioning in this exclusive interview from Streaming Media East 2023.

Siglin starts the conversation by asking Mello to talk a little about what Ai-Media does.

“We do live human and AI-based transcription for closed captioning,” Mello says. “Typically for live broadcast, either sports or news, but also recorded content of any kind.”

“So when I've seen human transcripts during a live event, one of the things I've noticed occasionally is things are done phonetically,” Siglin says. “Because you're hearing part of a word, and you're trying to sort of get ahead of the word, and then after you go past that, you don't have time to go back and correct it. When we use phones that have a certain level of machine learning, once it learns the word that you're doing, it'll go back and correct that occasionally. For me, if I try to type ‘Tom’ in since my name is Tim, it'll constantly try to correct me to ‘Tim.’ Right. I have to be, ‘No, it really is Tom.’ But are the machine learning [transcription] systems trying to spell it out before the word is uttered? Or do they wait until the word is fully done or the context is fully [understood], and then you see it on the screen?”

Mello says that there is an element of contextual learning to their AI transcriptions. “One of the benefits that we have is that there's a base dictionary that we're working from,” he says. “You can go in and kind of customize your dictionary on top of that as another layer. So let's say it was constantly saying your name was Tom when your name is Tim, you can go in and say, ‘Never say Tom, say Tim instead.’”
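The layered dictionary Mello describes can be pictured as a customer-specific override table applied on top of whatever the base engine outputs. The following is only a minimal sketch of that idea: the function name `apply_custom_terms` and the `custom_terms` table are hypothetical and do not reflect Ai-Media's actual configuration format or API.

```python
import re

def apply_custom_terms(text: str, overrides: dict[str, str]) -> str:
    """Replace whole-word matches of each override key with its preferred form.

    `overrides` is a hypothetical customer layer sitting on top of the base
    dictionary; keys are terms to suppress, values are what to show instead.
    """
    if not overrides:
        return text
    # \b word boundaries keep "Tom" from matching inside "Tomorrow".
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in overrides) + r")\b"
    )
    return pattern.sub(lambda match: overrides[match.group(1)], text)

# "Never say Tom, say Tim instead."
custom_terms = {"Tom": "Tim"}
caption = apply_custom_terms("Thanks, Tom. Tomorrow we talk captions.", custom_terms)
print(caption)
```

Matching on whole words is the design choice that matters here: a plain substring replace would also rewrite unrelated words that merely contain the term.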

“I worked with a group that did speech-to-text with a product called Dragon NaturallySpeaking from years ago,” Siglin says. “And one of the things you have in English is ‘to, two, too.’ What typically happened in those systems was the base package worked better if it had a distinct set of words. So it worked really well for medical, it worked really well for legal, because you had the Latin basis of a lot of those. Things didn't tend to work well for general conversations until it had 10 to 15 minutes' worth of information. So tell me how the state of the art has improved on that. If I've not trained the system to a particular voice, is that base library working well enough with a large language model to allow it to actually pick up within the first couple words of somebody speaking, as opposed to having to be trained?”

“The newer models are getting much better,” Mello says. “It has its large dictionary, obviously, but it starts to tune into [context]. Let's say we're talking about a basketball game, an NBA game, and you have two teams that are playing each other. It can start to pick up which two teams are playing and then go through the current roster for those and understand that you'd spell this player's name this way because it's part of this franchise. So it's starting to get more into that, which is part of the AI piece of this.”

“So it's the filtering piece that works there,” Siglin says. “Essentially, it says, ‘I hear Celtics and Golden State,’ and it says, ‘Oh, this must be basketball…’”

“And it'll do things like capitalize Bucks,” Mello says. “Whereas in one circumstance, it might not because you could be talking about bucks, like in the wild. So it does start to learn those things. And that's the artificial intelligence piece that's really coming into the automatic captioning space more recently.”
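The Bucks-versus-bucks distinction can be illustrated with a deliberately simple rule: only capitalize a team name when other words in the sentence suggest a basketball context. This is a keyword-based stand-in, not a description of how LEXI or any learned model actually works; the `TEAM_NAMES` and `CONTEXT_CUES` tables are invented for the example.

```python
import string

# Illustrative tables only; a real system would learn context rather than
# rely on hand-written keyword lists.
TEAM_NAMES = {"bucks": "Bucks", "celtics": "Celtics"}
CONTEXT_CUES = {"nba", "basketball", "celtics", "playoffs"}

def capitalize_in_context(sentence: str) -> str:
    """Capitalize known team names only when a sports cue word is present."""
    tokens = sentence.split()
    bare = [token.strip(string.punctuation).lower() for token in tokens]
    if not any(word in CONTEXT_CUES for word in bare):
        return sentence  # no sports context: leave "bucks" as the animal
    fixed = []
    for token, word in zip(tokens, bare):
        if word in TEAM_NAMES:
            # Swap in the capitalized form, keeping surrounding punctuation.
            start = token.lower().find(word)
            token = token[:start] + TEAM_NAMES[word] + token[start + len(word):]
        fixed.append(token)
    return " ".join(fixed)

print(capitalize_in_context("the bucks beat the celtics in the playoffs."))
print(capitalize_in_context("we saw two bucks in the wild."))
```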

Siglin asks, “Where's the typical customer for you? Is it broadcasts, or streaming, or enterprise town hall meetings, that kind of thing?”

“The biggest customer for us right now is broadcast,” Mello says. He notes that their product LEXI has been working within news broadcasts for the past three years. “Where it struggled a little bit for a while was things like player names and punctuation and things that sometimes would interrupt context. We have a new version that we just put out, [with] a new engine that we've put LEXI on, which really started to pick up things like player names and being able to get context better. And then of course, punctuation being one of the bigger things that helps, like legibility.” He cites the example of their current conversation, in which LEXI would add line breaks when one person begins speaking after another and also do a chevron. “So that way it's much easier to understand context and back and forth.”
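The line-break-and-chevron behavior Mello cites can be sketched as a small formatting step over speaker-labeled text. The sketch assumes the speaker labels already come from some upstream step, and it uses a double chevron (`>>`) as the speaker-change marker; both the input shape and the function name `format_captions` are assumptions made for illustration.

```python
def format_captions(segments: list[tuple[str, str]]) -> str:
    """Start a new line with a chevron marker whenever the speaker changes.

    `segments` is a list of (speaker_label, text) pairs in spoken order.
    Consecutive segments from the same speaker are joined on one line.
    """
    lines: list[str] = []
    previous_speaker = None
    for speaker, text in segments:
        if speaker != previous_speaker:
            lines.append(">> " + text)   # new speaker: line break plus chevron
            previous_speaker = speaker
        else:
            lines[-1] += " " + text      # same speaker: continue the line
    return "\n".join(lines)

transcript = format_captions([
    ("Siglin", "Where's the typical customer for you?"),
    ("Mello", "The biggest customer for us right now is broadcast."),
    ("Mello", "Sports is growing too."),
])
print(transcript)
```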

“So the other problem in the older systems was if you and I talked over each other, it didn't know who to follow,” Siglin says. “But I assume the newer language models – because it can look at tonality and that kind of thing – can actually keep track of multiple people talking.”

“Exactly,” Mello says. “And you'll see it line break when I interject…it'll line break, then continue what you were saying after that.”

“But what if we're literally talking over each other?” Siglin asks.

“That's a good test!” Mello says.

“We should try the pundits,” Siglin says. “Because the problem we always had with those systems was all of a sudden it had three words that made no sense together, because it was you talking and me talking, and we weren't saying different words.”

“There's a decent chance it would still kind of do that, where it's like you say something, I say something, you say something, and it's all in one line,” Mello says.

“But that would actually be better than what we had in the past, [where] it literally would just say ‘unintelligible’ at that point,” Siglin says. “So you said you have a new engine that you've put out there and it's working better with nuance and punctuation. Where do you see the next markets for what you're doing, beyond broadcast?”

“The newest one that's been really exciting recently is with sports,” Mello says. “Sports has been traditionally held by human captioning. Now [AI] is finding its place as it's become reliable. You never worry about scheduling a captioner because sometimes they don't show up. The accuracy is so good now that even if [the quality] is just a little bit lower than a human captioner, it's worth it…you can have it there when you need it, it's more affordable, it's very easy to use. So sports are a big one. There are other segments like government that we're very much researching right now and figuring out the best path forward.”

“And especially multilingual,” Siglin says. “So in Canada where everything has to be in French and English, or if you're in the EU where everything has to be in multiple languages simultaneously, that's certainly a fascinating challenge as well.”

“Sometimes there are cases where you'll have English and French but being spoken at the same time,” Mello says. “Simultaneous translations, so you can't just have it set to English, it needs to go back and forth. There is some progress on that too that I've seen very recently.”

“In the old days with satellite, I think it was called SAP, it was the alternate audio channels,” Siglin says. “Where you could essentially flip over [to] French, flip over to your English or German…the captioning, if it's two languages being spoken simultaneously, somebody probably doesn't want to get up and go to their set and go change the subtitles from English to French because [they're] more comfortable hearing the French. Are there models around how somebody chooses what their preference is to see in the closed caption?”

“If, let's say, most of the program was in English, but you wanted to have it available in French also, you can do that with a translation pretty easily, and have that as a separate track,” Mello says. “So the viewer can decide if they want English or French captions. Now what's very new is it being able to automatically detect languages. Flip back and forth on the fly.”

Learn more about AI and streaming at Streaming Media Connect 2023.
