New Benchmark Reveals How Top AI Dubbing Systems Really Perform
Amsterdam-based AI data platform Toloka has launched VOX-DUB, the first open, human-evaluated benchmark designed to assess AI dubbing systems on emotional accuracy, prosody, and voice character across languages. Drawing on more than 30,000 native-speaker A/B evaluations, the benchmark produces a validated ranking of four commercial dubbing systems.
Amsterdam, Netherlands (25 Nov 2025)
Amsterdam-based AI data platform Toloka, which helps train and evaluate AI models using real human input, has introduced VOX-DUB — the first open, human-evaluated benchmark designed specifically to assess the performance of AI dubbing systems. The new benchmark provides an independent framework to evaluate how effectively current technologies capture the emotion, prosody, and voice character of actors across languages – a capability essential for realistic, emotionally resonant media localization.
VOX-DUB fills a major gap in current AI audio evaluation practices. While text-to-speech (TTS) quality has nearly reached human parity, dubbing requires a deeper understanding of emotional and cultural nuance. The new benchmark introduces a pairwise A/B testing methodology using native speakers and measures performance across five core dimensions: pronunciation, naturalness, audio quality, emotional accuracy, and voice similarity.
The benchmark includes a curated set of acting performances covering eight source languages, including French, German, Russian, Hindi, Chinese, and Japanese, translated into English (US) and Spanish (Latin American). Native-speaker annotators compared audio clips in pairwise A/B tests against these five criteria, and more than 30,000 individual human judgments were aggregated to generate a validated ranking of four commercial systems: Dubformer, Deepdub, ElevenLabs, and Minimax.
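The announcement does not specify how the pairwise judgments are combined into a ranking, but the general idea can be sketched with a simple win-rate aggregation. The function and data below are purely illustrative assumptions, not actual VOX-DUB methodology or results.

```python
from collections import defaultdict

def rank_by_win_rate(judgments):
    """Aggregate pairwise A/B judgments into a per-system win-rate ranking.

    Each judgment is a tuple (system_a, system_b, winner), where winner
    is the name of the preferred system (ties omitted for simplicity).
    """
    wins = defaultdict(int)
    comparisons = defaultdict(int)
    for a, b, winner in judgments:
        comparisons[a] += 1
        comparisons[b] += 1
        wins[winner] += 1
    # Win rate = fraction of comparisons a system participated in and won.
    win_rates = {s: wins[s] / comparisons[s] for s in comparisons}
    return sorted(win_rates.items(), key=lambda kv: kv[1], reverse=True)

# Toy data with hypothetical system names; real evaluations would span
# five criteria (pronunciation, naturalness, audio quality, emotional
# accuracy, voice similarity) and thousands of annotator judgments.
judgments = [
    ("SystemA", "SystemB", "SystemA"),
    ("SystemA", "SystemC", "SystemC"),
    ("SystemB", "SystemC", "SystemC"),
    ("SystemA", "SystemB", "SystemA"),
]
ranking = rank_by_win_rate(judgments)
```

In practice, large-scale pairwise evaluations often use statistical models such as Bradley-Terry rather than raw win rates, since systems may face opponents of differing strength; the sketch above only conveys the basic aggregation step.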
Across both target languages, Dubformer demonstrated strong results in pronunciation and naturalness, placing it among the leading performers in the study.
“The VOX-DUB benchmark addresses a critical gap in the dubbing and media localization industry: the lack of standardized, methodology-driven evaluation for AI dubbing systems. It introduces a structured approach that allows for more objective, apples-to-apples comparison – something sorely needed amid the recent wave of interest and hype,” commented Anton Dvorkovich, Co-Founder & CEO of Dubformer.
He added that the VOX-DUB results validated Dubformer’s current direction while revealing opportunities for further refinement. “We already maintain our own internal benchmarks, but it is always so valuable to get independent analysis as well, and to see our internal insights confirmed by a third party,” Dvorkovich said. “The findings confirmed a key trade-off we’ve been addressing – between the fidelity of voice replication and overall speech quality. We’re pleased with our performance, and the results validate our direction while showing where further refinement is needed, particularly around emotional expression and naturalness in AI dubbing.”
As pointed out on the company’s website, future iterations of VOX-DUB will expand beyond audio into video-based evaluation, incorporating lip-sync alignment, visual rhythm, and acoustic environment – a step toward measuring true cinematic dubbing performance.
The benchmark arrives amid rapid expansion in the AI media localization sector. The global AI dubbing market is expected to grow from around $800 million in 2023 to just under $3 billion by 2033, a CAGR of 13.9%. Broader AI video dubbing applications are expanding even faster, at more than 30% annually, driven by demand from streaming, gaming, and entertainment platforms seeking scalable multilingual content.