How to Effectively Deploy Auto Captioning Solutions for Streaming VOD
Artificial intelligence (AI) is transforming the video streaming world. While AI as a technology has been around for some time, the digitization of data, coupled with growing demand for such solutions, has pushed the industry to adopt AI more quickly than expected. AI-based systems now exist for speech recognition, data analytics, and other deep learning applications. They offer accuracy and scalability that not only complement human input but can also exceed human efficiency.
An area where AI offers multiple benefits is automatic speech recognition (ASR). Speech recognition is a field of AI that converts spoken language into text. ASR is a core component of multiple systems, including the auto-captioning systems used in the Video-on-Demand (VOD) streaming environment.
Why Auto Captioning is Critical for Streaming
Captions are a crucial component of VOD streaming services. Using captions, OTT providers offering VOD services can extend their reach and make streaming content accessible to millions of viewers across the globe with ease.
For many years, captioning was a manual process. However, OTT service providers are dealing with a massive volume of streaming content for an increasingly global audience. It is not humanly possible or cost-effective to caption everything manually. Captioning is a specialized job and needs to be carried out by experts who are aware of language intricacies. As service providers look to minimize costs and maximize efficiency, auto captioning has become an increasingly important application of AI.
Key Components of an Auto-Captioning Solution
Several components are essential for an auto-captioning solution to deliver captions for VOD streaming with a high degree of accuracy and quality (Figure 1).
Figure 1. Components for auto captioning generation
The ASR engine is the core component that is responsible for transcribing the speech to text. If OTT service providers want to ensure effective global coverage and accuracy of content, they need an ASR engine that supports most languages and important dialects for each language.
From a technology standpoint, newer ASR technology offers better accuracy—greater than 95% for clean speech content.
Choosing an ASR solution that is capable of identifying speaker change in transcripts is also important. Speaker identification can help with proper positioning of captions to ensure each caption is close to the speaker. It can also provide clarity in instances where there are multiple speakers.
In addition, the ASR solution should transcribe filler sounds and interjections such as “hmm” and “oh” so that the transcript stays close to what is actually spoken.
Natural language processing (NLP) forms a key part of the overall auto-captioning solution, ensuring accurate punctuation and intelligent sentence segmentation. With NLP, OTT service providers can punctuate sentences to improve readability. NLP can also aid with providing line breaks at natural points in captions to further optimize readability.
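To make the idea of breaking captions at natural points concrete, here is a minimal Python sketch. It is not an NLP model: it uses punctuation as a simple stand-in for the linguistic cues a real segmenter would rely on. The function name break_caption and the 42-character line limit are assumptions chosen for illustration, not values taken from any particular product or style guide.

```python
# Assumed per-line character limit; real limits vary by style guide and region.
MAX_CHARS = 42
PUNCTUATION = (",", ".", ";", ":", "?", "!")


def break_caption(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into lines of at most max_chars characters.

    When a line is full, prefer to break after the most recent word that
    ends in punctuation; otherwise break after the last word that fits.
    A single word longer than max_chars is left on its own line.
    """
    lines: list[str] = []
    current: list[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and len(candidate) > max_chars:
            cut = len(current)  # default: break right before the new word
            for i in range(len(current) - 1, 0, -1):
                if current[i - 1].endswith(PUNCTUATION):
                    # Only use this break if the carried-over words still fit.
                    if len(" ".join(current[i:] + [word])) <= max_chars:
                        cut = i
                    break
            lines.append(" ".join(current[:cut]))
            current = current[cut:] + [word]
        else:
            current.append(word)
    if current:
        lines.append(" ".join(current))
    return lines


sentence = ("When the storm finally passed, the crew returned "
            "to the harbor and began repairs.")
for line in break_caption(sentence):
    print(line)
```

In this example the first line ends at the comma rather than running on to fill the available width, which is the kind of break a viewer is likely to find easier to read.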
Additionally, it is imperative for streaming service providers to comply with regional requirements. An auto-captioning system can help service providers manage caption quality parameters, such as words per minute, the maximum number of lines used for caption display, and the handling of profanity.
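As an illustration of how such quality parameters might be checked automatically, the following Python sketch tests a single caption against a reading-rate limit and a line-count limit. The Cue class, the check_cue function, and the default thresholds (180 words per minute, two lines) are hypothetical placeholders; actual limits depend on the region and the applicable guidelines.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One caption: start and end time in seconds, lines separated by newlines."""
    start: float
    end: float
    text: str


def check_cue(cue: Cue, max_wpm: float = 180.0, max_lines: int = 2) -> list[str]:
    """Return a list of problems found in one cue (empty if none).

    The default thresholds are placeholders, not regulatory values.
    """
    duration = cue.end - cue.start
    if duration <= 0:
        return ["cue has a non-positive duration"]

    problems: list[str] = []
    word_count = len(cue.text.split())
    wpm = word_count / duration * 60.0
    if wpm > max_wpm:
        problems.append(f"reading rate {wpm:.0f} wpm exceeds {max_wpm:.0f} wpm")

    line_count = len(cue.text.split("\n"))
    if line_count > max_lines:
        problems.append(f"{line_count} lines exceed the limit of {max_lines}")
    return problems


fast = Cue(start=0.0, end=2.0, text="one two three four five six seven eight")
print(check_cue(fast))
```

A check like this could run over every generated cue so that only the cues reporting problems need attention.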
Having a solution with a custom dictionary will increase the accuracy of ASR systems by providing context before ASR is invoked. Let’s say a service provider is trying to auto caption a television series for its streaming offering. The names of all the characters are known, and some of them are difficult. ASR engines can prioritize these names during the recognition phase to ensure that the transcriber maintains good accuracy.
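The sketch below illustrates the custom dictionary idea in Python, with an important caveat: it does not bias the recognizer itself, as described above. Instead, it applies a simpler, different technique, correcting near-miss spellings in the transcript after recognition using fuzzy string matching from the standard library. The function name apply_custom_dictionary, the character names, and the 0.75 similarity cutoff are all assumptions made for illustration.

```python
import difflib
import re


def apply_custom_dictionary(transcript: str, names: list[str],
                            cutoff: float = 0.75) -> str:
    """Replace words that closely resemble a known name with that name.

    This is post-recognition correction, not recognition-time biasing.
    The cutoff is an assumed similarity threshold between 0 and 1.
    """
    lookup = {name.lower(): name for name in names}

    def fix(match: re.Match) -> str:
        word = match.group(0)
        key = word.lower()
        if key in lookup:
            return lookup[key]  # restore the canonical spelling and casing
        close = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
        return lookup[close[0]] if close else word

    # Only alphabetic runs are examined, so punctuation is left untouched.
    return re.sub(r"[A-Za-z']+", fix, transcript)


# Hypothetical character names for a series being captioned.
names = ["Kaelith", "Brannoch"]
print(apply_custom_dictionary("Kaelis met Brannock at dawn.", names))
```

A lower cutoff catches more misspellings but raises the risk of rewriting ordinary words that happen to resemble a name, which is one reason recognition-time prioritization is generally the stronger approach.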
Best Practices for Deploying ASR Systems
Adopting an ASR engine that offers a flexible deployment strategy is ideal for VOD streaming applications. OTT service providers should look for an ASR system that can be deployed on-premises as well as on different cloud services like AWS and Google Cloud. Cloud-based solutions, in particular, can be deployed with a faster time to market.
Auto-captioning solutions have advanced considerably over the past 20 years and are now widely used in real-world video streaming applications. But accuracy limitations remain. Because of the variety of accents and the number of languages involved, it is not possible to maintain high accuracy all of the time.
To overcome the accuracy limitations of auto-captioning solutions, a growing number of service providers are embracing a hybrid model in which the auto-captioning results are manually inspected before video is streamed to global audiences. Manual inspection is needed only in cases where compliance requirements are stricter and clean dialog is not available (Figure 2).
Figure 2. Hybrid Model for Auto Captioning
Performing a full manual inspection of generated captions can be a very tedious task. Review tools were created to help service providers review and correct generated captions in the most efficient way possible. Review tools should be able to sort utterances by confidence score so that those with a low confidence score, which are the most likely to contain errors, can be reviewed first. Review tools also need to be able to play each utterance along with its audio in a loop for fast inspection. Once an error is detected, the tool must provide an easy way to correct the caption's attributes, such as text, font style, timecodes, and color. This will ensure faster reviewing of auto-captioning tasks and faster time to delivery.
ASR systems solve critical problems in the VOD streaming industry today, enabling service providers to improve the accuracy of captions created with speech-to-text processing. However, ASR systems are not without limitations.
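The confidence-based ordering described above can be sketched in a few lines of Python. The Utterance class and its fields are assumptions; in particular, the sketch assumes the ASR engine reports a confidence score between 0 and 1 for each utterance, which may not match how a given engine exposes it.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    start: float       # seconds
    end: float         # seconds
    text: str
    confidence: float  # assumed range 0.0 to 1.0, as reported by the ASR engine


def review_order(utterances: list[Utterance]) -> list[Utterance]:
    """Return utterances with the least confident first.

    Ties are broken by start time so the order is stable and predictable.
    """
    return sorted(utterances, key=lambda u: (u.confidence, u.start))


utterances = [
    Utterance(0.0, 2.1, "Welcome back to the show.", 0.97),
    Utterance(2.1, 4.8, "Our guest is Kaelis Brannock.", 0.58),
    Utterance(4.8, 6.0, "Thanks for having me.", 0.91),
]
for u in review_order(utterances):
    print(f"{u.confidence:.2f}  {u.start:5.1f}s  {u.text}")
```

A reviewer working through this list from the top sees the utterance most likely to contain an error first and can stop once the remaining scores are high enough for the required level of compliance.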
By taking a hybrid approach that combines auto captioning with quick manual inspection before delivery, OTT service providers can improve accuracy and introduce significantly higher efficiencies into their VOD streaming workflow.
[Editor’s note: This is a contributed article from Interra Systems. Streaming Media accepts vendor bylines based solely on their value to our readers.]