Creating Automatic Transcripts in Flash Video Using Adobe CS4
The latest version of Adobe’s Premiere Pro (CS4) added the ability to automatically generate a text transcript based on speech in a video. Not only that, but the transcript is embedded in the video file, tightly synced with the video itself. If you’re into Flash video, this is the kind of feature that gets you dreaming. What would it take to display that transcript alongside a Flash video? Could I then allow viewers to click on any word in the transcript to navigate to that point in the video? If so, could I, for instance, get a nontechie organic farmer to simply narrate a table of contents for a video on organic farming?
All this and more is possible, but getting the transcription into Flash in a usable form is a bit tricky. When you export a movie from Premiere, the autotranscribed text is embedded as XMP metadata, which the Flash player cannot read. What it can read are cue points, or named and timed markers that can be embedded in an FLV file and are used either for navigation ("go to" points in the video) or for triggering events ("do this" points in the video). So the challenge is to turn the XMP transcription metadata into cue points. It turns out that another product in the CS4 suite, Soundbooth, provides the key functionality required to create these cue points. The name of each cue point is actually the text of the transcription at the cue point time.
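To make that concrete, here is roughly what a cue point looks like when it reaches ActionScript. When a NetStream plays an FLV, each embedded cue point is delivered to the stream's client object as a plain object; the sample values below are invented for illustration:

```actionscript
// Called by the NetStream for each cue point as playback reaches it.
// For a transcription cue point, name holds the transcribed text and
// time is the offset, in seconds, from the start of the video.
function onCuePoint(cue:Object):void {
    trace(cue.name);  // e.g. "organic" -- the transcription text
    trace(cue.time);  // e.g. 12.3 -- seconds into the video
    trace(cue.type);  // "navigation" or "event"
    // A navigation-type cue point is also a legal seek target:
    // ns.seek(cue.time);
}
```

The distinction between the two types matters later: only navigation cue points create keyframes you can seek to, while event cue points simply fire as the play head passes them.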
In this tutorial, I’ll show you how to take any video clip, use Premiere Pro CS4 to create a new clip with embedded XMP transcription metadata, and convert that XMP metadata into cue points in an FLV file that can be played in Flash Player 9 or 10. Just to demonstrate that the cue points really are embedded in the FLV file and are synced with the video, I’ll use the cue points to create a transcript window in which the text scrolls by as the video plays and stops when the video stops.
This is also a relatively simple ActionScript program. A working version needs only 14 lines of code, and even with a few enhancements, this particular program runs to only about double that. The program also includes some comments and trace statements that provide interesting and potentially useful information while you’re working on it, but they have no effect in the deployed application.
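As a rough sketch of what such a minimal program might look like (this is not the exact code from this tutorial; the file name transcript.flv and the layout values are assumptions for illustration), the whole transcript window can be driven from the onCuePoint callback:

```actionscript
// Minimal sketch: play an FLV and scroll its cue-point text alongside it.
// Assumes transcript.flv sits next to the SWF and each cue point's name
// holds a piece of the transcription.
var video:Video = new Video(320, 240);
addChild(video);

var transcript:TextField = new TextField();
transcript.x = 330;
transcript.width = 200;
transcript.height = 240;
transcript.wordWrap = true;
addChild(transcript);

var nc:NetConnection = new NetConnection();
nc.connect(null); // null = progressive download, no media server

var ns:NetStream = new NetStream(nc);
ns.client = {
    onMetaData: function(info:Object):void {},
    onCuePoint: function(cue:Object):void {
        // Append each cue point's text and keep the newest text visible.
        transcript.appendText(cue.name + " ");
        transcript.scrollV = transcript.maxScrollV;
    }
};
video.attachNetStream(ns);
ns.play("transcript.flv");
```

Because the text arrives only as playback reaches each cue point, scrolling stops automatically whenever the video stops, which is exactly the synced behavior described above.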
The automated transcription is done in Premiere Pro. Soundbooth has a similar transcription feature, so theoretically, you could bring the video into Soundbooth and do the transcription there. However, when working with video, there are significant advantages to using Premiere Pro for speech transcription. In particular, Premiere Pro allows you to lock audio tracks. The primary purpose of this feature is to prevent the editing of particular tracks. However, locking also has a special purpose in relation to transcription, namely that locked tracks, though they are transcribed, are not included when you export transcription data. Thus, for example, you could choose to include only the formal part of a speech, not the Q&A afterwards, or only spoken words, not music or sound-effects tracks. Soundbooth doesn’t have track locking.
Of the steps I mentioned, the only one likely to be labor-intensive is editing the transcription. How much time you actually need to spend editing will depend on the quality of the initial automated transcription. That, in turn, depends primarily on the quality of the speech in the video and the amount and nature of any background noise. In general (not just in Adobe CS4 products), the current state of speech recognition leaves much to be desired. Even speech that seems quite clear to you and me may not be perfectly clear to the transcription software. For the sample clip in this article, I had to do quite a bit of editing. I also left a few dozen words unedited at the end of the transcription in the deployed application for comparison purposes.
At a minimum, you’ll probably want to add capitalization and punctuation to the transcription. The transcription software doesn’t even attempt to do that. You also might want to reduce background noise using the tools and processes in Soundbooth that were designed for that purpose; then either do the transcription in Soundbooth or re-import the audio into Premiere Pro for transcription. (I didn’t try this. To me, the speech in my sample video seemed pretty clear overall, and the background noise didn’t seem too bad. Obviously, the transcription software didn’t entirely agree.)
Unfortunately, in addition to a potential lack of accuracy in recognizing words (a problem shared with other speech recognition software I’ve used), the CS4 transcription editing function is quite cumbersome. For instance, you have to double-click each word before you can edit it. To shift the timing of a word, you have to right-click it and then join it to the previous or next word; deleting or copying a word likewise starts with a right-click.
There is also no way to train the transcription software to recognize words or to adapt to a particular speaker, a feature that is commonly available in speech recognition software.
Hopefully, these problems will be addressed in future versions of the transcription feature. However, if you are dictating specifically for transcription, you can train yourself to speak in a way that is more understandable to the transcription software by, for instance, overpronouncing words a bit and pausing slightly between words. With a little practice, transcription accuracy can be vastly improved in this manner.