Dubbing Optimization for Streaming Dialogue and Singing in the AI Era
Generative AI has transformed the technology, the workflow, and arguably the ethics of dubbing dialogue for TV and movies in recent years, making the costly and time-consuming traditional approach of a voice actor, director, and technical crew gathering in a studio for live audio replacement very much a thing of the past. The perception now is that generative AI can easily (and some would say dangerously) produce or replicate virtually any voice, but Dubformer’s Anton Dvorkovich and Google’s Nick Manoochehri insist that generating the desired voice and refining it toward emotional precision and perfection remain challenging and as much art as science—for dialogue and especially for singing—as they explain in this conversation with DigitalGlue’s Philip Grossman and PADEM Media Group’s Allan McLennan at Streaming Media 2025.
AI and Accurate Vocal Representation
Jumping right in with a reference to the Hollywood strikes of 2023 and actors’ efforts to protect their livelihoods from the studios’ ability to leverage generative AI to appropriate and reproduce their likenesses in perpetuity, McLennan says, "This was the underlying basis of the SAG-AFTRA strike. When the misrepresentation of the actors within the AI environment comes into play, we're leaving one thing off the table." Beyond the ethical issues, he contends, there are elements of oversight that go into the quality and accuracy of the vocal impersonation. "You don't really need to have the lead actors' voices accurately portrayed, but they're the ones who created the brand. They are who the film is all about. And this is when [I miss] the old style of dubbing in the studio, with the director there and the linguists coming in to put that down, to have it be accurate. There is a certain level of direction that has to come into play. And I'm wondering," he says, turning to Dvorkovich, "do you have that capability currently in the dubbing factor of directing? Let's go into being the director with a technical solution."
"This is exactly what we're building as well for non-live content," Dvorkovich says. "The idea is that when you talk about higher-end content and dubbing, it's about creating [something that's] almost like a piece of art. A dubbed audio track is less sophisticated than the whole movie, but it's still part of the artistic expression. We're trying to build creator tools where you can actually direct AI voices and make AI dubbing sound exactly as you want it because there are thousands and thousands of different ways to say something correctly and accurately, and every other director would choose a different way to do it and to deliver it."
As AI models improve, he continues, "we will be able to eliminate mistakes--maybe not to zero, [but close]. What we will not be able to do is create an experience where you press a button and you get perfect results because there's no such thing as perfect. It's very subjective and it's subjective to the creative person behind it."
Singing Dubs—Something Lost in Translation?
DigitalGlue’s Grossman chimes in with an intriguing question that nudges the conversation onto a fascinating tangent: "How do you handle something like singing? I remember an NAB presentation on DCP where they used Frozen, and they literally had the actress sing the song. And then when they did the demo, they showed how they could basically switch from language to language as the person is singing. Do you find singing to be a more difficult challenge in the AI world?"
"Singing is definitely a big challenge because you have to convey the rhythmic expression that might be really tricky with different languages and stuff like that," Dvorkovich replies. "But in general, there are three mechanisms [involved in] how you can control AI voice generation. There's the very coarse one where you switch a bunch of settings and you can, for example, switch an accent to a different region, or you can switch an acoustic environment or something like that. But the two more powerful tools to control it [use] prompting, which could be either textual prompting or audio prompting. Textual prompting is more like a human dubbing director saying, 'Please say the same [thing], but with more energy,' 'say it faster,' 'emphasize a different kind of thing,' and stuff like that. And voice prompting is usually about, 'Repeat after me. Use my intonation, use my pitch, listen to how I sing it, and do the same.'"
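Dvorkovich's three control layers can be pictured as fields on a single generation request: broad switches set up front, plus optional text and audio prompts layered on top. The sketch below is purely illustrative—`DubbingRequest` and all of its field names are hypothetical and do not correspond to Dubformer's or Google's actual APIs.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical request object illustrating the three control mechanisms
# Dvorkovich describes. None of these names reflect a real product API.
@dataclass
class DubbingRequest:
    text: str
    # 1. Coarse settings: broad switches such as accent region or acoustics.
    accent: str = "en-US"
    acoustic_env: str = "studio"
    # 2. Textual prompting: a director-style natural-language instruction.
    style_prompt: Optional[str] = None
    # 3. Audio prompting: a "repeat after me" reference clip whose
    #    intonation and pitch the model should imitate.
    reference_audio: Optional[str] = None

    def describe(self) -> str:
        """Summarize which control mechanisms this request uses."""
        layers = ["coarse settings"]
        if self.style_prompt:
            layers.append("text prompt")
        if self.reference_audio:
            layers.append("audio prompt")
        return " + ".join(layers)

req = DubbingRequest(
    text="We'll always have Paris.",
    accent="en-GB",
    style_prompt="Say it slower, with more warmth.",
    reference_audio="takes/director_read_07.wav",
)
print(req.describe())  # coarse settings + text prompt + audio prompt
```

For singing, the audio-prompt layer carries most of the weight, since rhythm and pitch are easiest to convey by demonstration rather than by descriptive text.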
The Tech Is the New Director
Turning to Google's Manoochehri, McLennan presses further into how AI dubbing systems are directed and who provides that direction.
"Nick, you're providing an intelligent API to be able to respond to these kinds of directions. Is this what you're starting to see within the industry--that the technologist is becoming the director of this intonation and capability, but it's leveraging the API that you have?"
"Very much so," says Manoochehri. "And I love the way that Anton phrased it—this really is art at the end of the day. And it's not going to be one click of a button and you get it to the exact place that you want it in every language. I think some of the points that Anton was bringing up really go to the complexities of how to do this properly. So yes, Google has APIs, but what we are seeing from our customers is when you get into dubbing in different languages, there's all this nuance that happens and all these tweaks that you need to make for assigning emotions to particular words if it didn't come out the right way, or changing pitch and tone. And this is where you really see a team of people come together to make the final product as close to the original art that happened in the first place. Or if they want to take it in a different direction for a different market, they have the ability to make those tweaks. So we're working on our own platform that allows people to do that, but today, it's just the APIs. So Anton, I'm happy you're working on the same things."
"Living in the South, I just visualize having these dials," Grossman interjects. "A little more Texas, a little more Tennessee, a little less Georgia, add in a little Alabama. So all right, now I've got the perfect accent."
“It’s easier for us West Coast guys,” McLennan quips. “It’s all the same.”
Join us February 24–26, 2026 for more thought leadership, actionable insights, and lively debate at Streaming Media Connect 2026! Registration is open!