Writing Text for Video: Did Someone Say 'Autumn Aided Cap Shins'?
Not long after the invention of the modern computer, a notably incorrect assumption was that computers would shortly be put to use competently processing natural-language data. People can typically communicate tolerably well by the time they’re about 3 years old, so this didn’t seem to be an unreasonable expectation, since computers were known to solve problems beyond the capabilities of the brightest of 3-year-olds. Speech comprehension—the faculty of sensing complex sound-pressure variations caused by another human being speaking and then assigning to them symbolic interpretations informed by the local culture and immediate context—turns out to be very difficult to teach to a computer.
With the explosive growth of streaming media in the past decade, substantial resources have been applied to the challenge of automatically captioning that video. Happily, some improvement has been made. The task of captioning is essentially this: Identify candidate speech sounds the speaker might be making; identify candidate words that fit the sequence of plausible sounds; choose the most probable sequence of candidate words; add appropriate punctuation; and segment the resulting text so it appears on screen in a way that can be easily and fluently read as it is spoken. Each of those tasks is difficult in its own right, and different automated captioning software tools are better at some than at others.
One of those tasks that has improved recently is the identification of phonemes—the vowel and consonant sounds of speech. This is a famously hard problem: Since everyone's voice is unique, speech recognizers need to be trained to learn the idiosyncrasies of each user. Improvement has come from two directions. On the client side, most of us carry small but powerful computers that have bad keyboards but decent microphones. Both mobile and desktop operating systems now feature voice-enabled assistants that continuously tune themselves to recognize your unique voice and the way you produce sounds with it. On the server side, we have classifiers, software that decides which category of similar, previously encountered data a new input most resembles. A server-side platform can compare your speech signal against enormous data sets of phonemic patterns and classify candidate sounds more accurately than its predecessors could.
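To make the idea of a classifier concrete, here is a minimal sketch of one simple approach, nearest-centroid classification. The two-dimensional "acoustic features," the phoneme labels, and the training points are all invented for illustration; real recognizers use far richer features (such as MFCCs) and vastly larger training sets.

```python
import math

# Toy training data: a few hypothetical feature vectors per phoneme label.
# These numbers are illustrative, not real acoustic measurements.
training = {
    "ah": [(0.20, 0.90), (0.30, 0.80), (0.25, 0.85)],
    "sh": [(0.90, 0.10), (0.80, 0.20), (0.85, 0.15)],
}

# Average each label's examples into a single centroid.
centroids = {
    label: tuple(sum(dim) / len(vecs) for dim in zip(*vecs))
    for label, vecs in training.items()
}

def classify(features):
    """Assign a candidate sound to the class with the nearest centroid."""
    return min(
        centroids,
        key=lambda label: math.dist(features, centroids[label]),
    )

print(classify((0.28, 0.82)))  # falls close to the "ah" examples
```

A production system would replace the centroid comparison with a statistical or neural acoustic model, but the core question is the same: which previously learned category does this new input belong to?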
Another of those tasks that has improved, and will continue to improve, is choosing the most probable sequence of words from the available candidates. This is traditionally done with a language model, which in its simplest form is a statistical analysis of how commonly different words occur together. The words "automated" and "captions" are more likely to appear together than the words "autumn aided cap shins." That likelihood is what language models capture.
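The simplest version of that statistical analysis is a bigram model: count how often each pair of adjacent words appears in a training corpus, then score a candidate word sequence by multiplying its pair probabilities. The tiny corpus below is an illustrative stand-in, not data from any real captioning system.

```python
from collections import Counter

# Toy training corpus (illustrative only).
corpus = (
    "the automated captions matched the automated captions that "
    "viewers expect from automated captions"
).split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2, vocab_size, alpha=1.0):
    """P(w2 | w1) with add-alpha smoothing so unseen pairs aren't zero."""
    return (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * vocab_size)

def sequence_score(words):
    """Product of bigram probabilities across the word sequence."""
    v = len(unigrams) + 1  # +1 leaves room for unseen words
    score = 1.0
    for w1, w2 in zip(words, words[1:]):
        score *= bigram_prob(w1, w2, v)
    return score

likely = sequence_score("automated captions".split())
unlikely = sequence_score("autumn aided cap shins".split())
assert likely > unlikely  # the model prefers word pairs it has seen
```

Real language models are far more sophisticated, but they serve the same purpose: given several acoustically plausible transcriptions, prefer the one whose words plausibly follow each other.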
The captioning of educational video is particularly ripe for driving speech recognition research. A school is a fairly closed ecosystem. We can easily identify which teacher is giving the lecture, and we can easily have that teacher train a custom speech model to be reused whenever she appears on video. Large research universities recruit bright minds from all over the world, so their teachers' linguistic diversity is extreme; these custom-tuned speech models are critical for accurate captioning when your speakers come from such varied linguistic backgrounds.
Educational video typically includes technical vocabulary and jargon that would be difficult for a standard recognizer to identify. However, we have access to the visual aids the teacher used in the video (typically slides), and those aids can be mined for contextually relevant vocabulary. This is exactly what Microsoft Garage's Presentation Translator does. It is critical that these atypical jargon words be captioned accurately; getting a course's key terms wrong makes the captions actively misleading.
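Mining slides for vocabulary can be sketched very simply: tokenize the slide text and keep the terms that aren't everyday words, so the recognizer can be biased toward them. The stop-word list and the sample slide text below are illustrative; a real system would use a full stop-word list and compare against the recognizer's actual base lexicon. This is not Presentation Translator's implementation, just the general idea.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real systems use much larger ones.
COMMON_WORDS = {
    "the", "a", "an", "of", "and", "to", "in", "is", "are",
    "this", "that", "with", "for", "on", "we",
}

def candidate_jargon(slide_text):
    """Return non-common tokens from slide text, most frequent first."""
    tokens = re.findall(r"[A-Za-z][A-Za-z'-]+", slide_text.lower())
    counts = Counter(t for t in tokens if t not in COMMON_WORDS)
    return [word for word, _ in counts.most_common()]

slides = """
Phoneme classification with hidden Markov models.
The acoustic model maps phonemes to feature vectors.
"""
print(candidate_jargon(slides))
```

The resulting word list ("phoneme," "markov," and so on) can then be fed to the recognizer as high-priority vocabulary for that lecture.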
Universities are where many of the top researchers in speech recognition are working and where the need for accurate automatic captioning is desperate. It is a perfect example of where the triple missions of universities—to educate, to research, and to provide public service—demand cooperative action.
[This article appears in the June 2018 issue of Streaming Media magazine as "Autumn Aided Cap Shins."]