The Complete Guide to Closed Captions
Legal requirements for the captioning of streaming media content in the U.S. became clear for media produced or funded by the federal government with 2010's 21st Century Communications and Video Accessibility Act and the most recent updates to Section 508 that went into effect in 2018. Both the act and the updates also cover streaming media that simultaneously or previously aired on conventional television distribution systems for which the Federal Communications Commission has established captioning requirements. For video programming that never aired on television, case law has developed to enforce expectations, notably through a series of civil cases between the National Association of the Deaf and major online video platforms and private universities.
At this point, the expectation for captioning is well-demonstrated, and captioning is demanded by a much broader segment of viewers than the deaf community. As this demand has gone mainstream, it indicates that a critical mass was reached some time ago, and any platform providing video needs to caption it to compete. Achieving a critical mass is one of the goals of the Universal Design movement. It seeks to integrate technological accommodations for the portion of the audience with a sensory disability that strictly requires it in a way that opportunistically maximizes value to the general audience as well.
Closed captioning on television once required a separate device to decode the captions and print them on the television screen; for decades, that decoding functionality has been built into television sets and is easily toggled on and off. Thanks to this ease of availability, closed captioning made inroads with hearing audiences in situations in which listening to the audio isn't possible: in restaurants or taverns with multiple televisions playing different channels, at airports, and while rocking the baby to sleep. Captions are of tremendous value to second-language learners and for anyone who is watching content that uses unfamiliar vocabulary or is spoken with prominent accents. The costs of producing and delivering accurate captions are fixed, so maximizing the audience for the captions benefits the content producers and delivery platforms substantially.
Before moving on, let's clarify three different types of auxiliary text on screen. The first is closed captioning, which is text that is delivered in sync with the video. It can be toggled on or off and provides equivalent content for all the audio content in the program. In addition, closed captioning can include unspoken audio elements like music, an alarm ringing, footsteps, or a door closing, since these elements may be required to create the intended viewing experience. The second, open captioning, fills the same role except that the text is composited onto the video stream carrying the motion picture content and cannot be hidden. Open captioning is good for kiosk displays and social media campaigns in which the expectation is that people passing or scrolling by wouldn't hear the audio, and, therefore, the text needs to provide that information. Open captioning is not ideal in most other situations in which the viewer expects more control over the playback. Open captions are also not text metadata that can be repurposed for search, copy/paste, or SEO in good Universal Design fashion.
The third type of text is subtitles. These are much like closed captions except that their purpose is not to provide equivalent content of the audio content, but rather translated content. Some films use open subtitles for the occasional scene in which a different language is being spoken. Although subtitles serve a different function than captions, that function reaches a much larger target audience, and, by virtue of using the same technologies, it has played a crucial role in the mainstreaming of captioning. Subtitles will likely merge with captions in online global platforms in which captions in different translations would continue to provide descriptions of non-spoken audio for the international deaf audiences, and the terms may become interchangeable in fact as well as in common usage.
Legacy Captioning Formats from the Standard Definition Era
Originally, captions on U.S. television, and subsequently on VHS videocassettes, were standardized as EIA-608, later CEA-608. They were also called Line-21 captions, since the captioning data was encoded on that line of a National Television System Committee (NTSC) picture field. EIA-608 caption encoding remains relevant today even though standard definition television has been obsolete for 2 decades. The bitrate of EIA-608 set the maximum caption row length at 32 characters, and that is still generally the default for caption segmentation. EIA-608 data stored in a text file is called a Scenarist Closed Caption (SCC) file. The format is still in use and is supported by many software packages and video platforms, including Apple Final Cut Pro and Adobe Premiere.
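The 32-character default is easy to apply in software. The following is a minimal Python sketch, not part of any standard: the helper name wrap_for_608 and the choice to group rows into two-row captions are assumptions made for illustration.

```python
import textwrap

MAX_CHARS_PER_ROW = 32  # row width inherited from EIA-608


def wrap_for_608(text, rows_per_caption=2):
    """Wrap text into rows of at most 32 characters, then group the rows
    into captions of rows_per_caption rows each (a hypothetical convention)."""
    rows = textwrap.wrap(text, width=MAX_CHARS_PER_ROW)
    return [rows[i:i + rows_per_caption]
            for i in range(0, len(rows), rows_per_caption)]


if __name__ == "__main__":
    sample = "Closed captioning can include unspoken audio elements like music or a door closing"
    for caption in wrap_for_608(sample):
        print("\n".join(caption))
        print()
```

Wrapping on character count alone ignores phrase boundaries, so a production workflow would still want a human or a smarter segmenter to decide where the breaks read best.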
Another obsolete format that remains relevant is the subtitle encoding for the DVD standard, in which subtitles were stored as bitmap files with an alpha channel in a subtitle track within the MPEG-2 program stream container on the disc. When the DVD player was instructed to display one of the available subtitle tracks, it would overlay those mostly transparent images over the video frames. DVDs could also store caption data in Line-21, in which case the television set would handle the decoding instead of the DVD player.
The DVD subtitle format is going strong because of SubRip, an open source program popularized for making bootleg DVD localizations. SubRip scans the DVD's picture stream and, using supervised OCR, generates a text metadata file of the subtitle track containing all of the extracted subtitles. SubRip uses its own caption data format called SubRip text (SRT), which is now one of the most commonly used caption file formats. SRT contains interesting anachronisms. For example, since it's a format that was intended to be hand-edited to translate the subtitles for an unofficial localization, it was designed for ease in reading rather than machine parsing. An SRT file consists of a sequence of data blocks about extracted pictures: the number of the picture in the DVD subtitle track; a pair of timecodes for the start and stop time that the picture was shown on the DVD, separated by an ASCII art arrow; and one or more lines of text that were identified in the picture (see Figure 1). Since the original author of SubRip is French, the timecodes use commas as the decimal point delimiter between seconds and milliseconds. Blank lines separate the captions in the list.
SRT was supported in the early 2000s by the free, open source MPlayer. It became a popular format that was adopted by other video playback software in the very early years of streaming media, but only for downloadable or ripped content.
Figure 1. An example of SubRip text (SRT) caption data.
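The block structure described above is simple enough to read with a few lines of code. Here is a minimal Python sketch of an SRT reader. The Cue class, the function names, and the sample text are all made up for illustration (the sample is not the article's Figure 1), and the parser assumes well-formed input with no styling tags.

```python
import re
from dataclasses import dataclass

# Timecodes look like 00:00:01,000 (comma before the milliseconds)
TIMECODE = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


@dataclass
class Cue:
    index: int
    start_ms: int
    end_ms: int
    text: str


def to_ms(timecode):
    """Convert an SRT timecode string to integer milliseconds."""
    hours, minutes, seconds, millis = map(
        int, TIMECODE.fullmatch(timecode.strip()).groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def parse_srt(data):
    """Parse SRT text into a list of Cue objects."""
    cues = []
    normalized = data.replace("\r\n", "\n").strip()
    # Blank lines separate the caption blocks
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.splitlines()
        start, end = lines[1].split("-->")  # the ASCII art arrow
        cues.append(Cue(int(lines[0]), to_ms(start), to_ms(end),
                        "\n".join(lines[2:])))
    return cues


SAMPLE = """1
00:00:01,000 --> 00:00:03,500
Hello, and welcome.

2
00:00:04,000 --> 00:00:06,250
[door closes]
Let's begin.
"""

if __name__ == "__main__":
    for cue in parse_srt(SAMPLE):
        print(cue)
```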
Old-School Captioning in Streaming Media
Video captioning for streaming media in the early 2000s did not adopt SRT, but one influential example of an early streaming media platform with good support for captioning was RealMedia. Real servers hosted video in RealMedia files containing the full encoding ladder of video and audio data to support adaptive bitrate switching. The caption data was not included in the RealMedia files along with the A/V streams; instead, caption data was stored in separate RealText files—XML-based data files that could present timed caption or subtitle text—or for very different purposes like teleprompter text. The RealMedia and RealText presentations were composited together using Synchronized Multimedia Integration Language (SMIL) files to play the media and captions in parallel within a controlled layout. SMIL could define a layout in which captions were displayed in a region below the video player with maximal contrast between text and background and without blocking any part of the video. Alternatively, the SMIL layout could place the captions in a region over the video, and the markup in the RealText could position different captions at different places in the frame to avoid covering an important figure or lower-third title card. Caption placement and styling were not controllable in EIA-608, so captioned content on television could be a worse experience than on internet video until the CEA-708 captioning standard was adopted for the first generation of HDTV. Captioning requirements, both legal and industry-standard as established by Web Content Accessibility Guidelines, put emphasis on placing caption text so as not to block important visual content in the frame.
By 2010, RealNetworks and other Real-Time Streaming Protocol-based media delivery technologies had been largely displaced by Flash, a browser plugin with higher ambitions than merely supporting media playback. Flash's video components and ActionScript programming language adopted a captioning standard originally called Distribution Format Exchange Profile (DFXP) that was around this time renamed Timed Text Markup Language (TTML; see Figure 2). TTML is an XML-based data format that had previously also been used by Microsoft's streaming media technologies. It was developed by a World Wide Web Consortium working group that originally included experts from SMPTE, Microsoft, Apple, and the Media Access Group at WGBH (the Boston PBS station that has played an enormous role in pioneering captioning technology, as well as audio description, on television and other media). That group was later joined by others, including Netflix, the BBC, and the European Broadcasting Union. Several contributors received a technical Emmy Award in 2016 for developing TTML. Incidentally, the RealNetworks company went on to other adventures, selling Napster for $70 million in 2020 and offering a core product line of AI-powered media products like SAFR and Kontxt.
Figure 2. An example of Timed Text Markup Language (TTML) caption data.
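Because TTML is XML, it can be produced with any standard XML library. The following Python sketch builds a bare-bones TTML document with one paragraph element per caption. The build_ttml function and its input shape are assumptions for illustration, the output omits the optional head, styling, and layout sections, and it has not been validated against the IMSC1 profile.

```python
import xml.etree.ElementTree as ET

TTML_NS = "http://www.w3.org/ns/ttml"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ttml(cues, lang="en"):
    """Build a minimal TTML document.

    cues: iterable of (begin, end, text) tuples, where begin and end are
    clock-time strings such as "00:00:01.000".
    """
    # Serialize TTML elements without a prefix (default namespace)
    ET.register_namespace("", TTML_NS)
    tt = ET.Element(f"{{{TTML_NS}}}tt", {f"{{{XML_NS}}}lang": lang})
    body = ET.SubElement(tt, f"{{{TTML_NS}}}body")
    div = ET.SubElement(body, f"{{{TTML_NS}}}div")
    for begin, end, text in cues:
        p = ET.SubElement(div, f"{{{TTML_NS}}}p", {"begin": begin, "end": end})
        p.text = text
    return ET.tostring(tt, encoding="unicode")


if __name__ == "__main__":
    print(build_ttml([
        ("00:00:01.000", "00:00:03.500", "Hello, and welcome."),
        ("00:00:04.000", "00:00:06.250", "Let's begin."),
    ]))
```

Unlike SRT, the result can be checked by any XML parser, which is the machine-validation advantage that XML-based caption formats offer.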
Closed Captioning with HTML5 and Beyond
At the same time that Flash had achieved dominance of streaming media, HTML5 standards were coming together, including the highly anticipated video standardizations that would simplify embedding media on websites while using technology native to all HTML5-compliant browsers. Perhaps to assert the HTML5 video standards as a foil against Flash specifically and browser plugins in general, the video captioning standard adopted for HTML5 was a minor update to SRT, originally rebranded WebSRT, then Web Video Text Tracks (WebVTT, or just VTT). There were several major functional changes: Subtitle picture numbers were no longer required, and if provided, were used as chapter markers; the decimal delimiter was switched from a comma to a period; and optional metadata headers or inline markup were added to allow precise placement and styling of caption text. Adopting VTT over TTML was a surprising decision given VTT's reliance on whitespace and ASCII art—recall that SRT was optimized for hand-editing rather than reliable machine validation and parsing. Many useful video-related developments in ActionScript 3.0 were also excluded from future versions of ECMAScript, along with TTML—two babies thrown out of the standards along with the Flash bathwater.
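The closeness of the two formats shows in how little code a basic conversion takes. This Python sketch, with a hypothetical function name srt_to_vtt, adds the WEBVTT header line that the format requires and switches the decimal comma to a period on timing lines only. It leaves the caption numbers in place, since they are optional in VTT, and it does not attempt to translate any styling tags.

```python
import re

# Matches an SRT timecode, capturing everything before and after the comma
SRT_TIMECODE = re.compile(r"(\d+:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text):
    """Convert simple SRT caption text to WebVTT."""
    lines = []
    for line in srt_text.replace("\r\n", "\n").strip().split("\n"):
        if "-->" in line:
            # Only timing lines are touched, so commas in dialogue survive
            line = SRT_TIMECODE.sub(r"\1.\2", line)
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "1\n00:00:01,000 --> 00:00:03,500\nHello, and welcome.\n"
    print(srt_to_vtt(sample))
```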
But TTML is on a major comeback campaign. The international ATSC 3.0 broadcast standard adopted TTML as its mandatory captioning standard, specifically the IMSC1 profile that defines the subset of the TTML specification required for captions and subtitles. ATSC 3.0 went nationwide in South Korea in 2016, and the transition is occurring in the U.S. under the NextGen TV moniker, which is currently on the air on more than 150 TV stations in 43 cities. Apple enthusiastically added support for IMSC1/TTML captions to the HTTP Live Streaming (HLS) specification in 2017.
That TTML is both the captioning standard for broadcast television and the adopted standard used by streaming industry leaders like Netflix presents an obvious benefit to content producers and points the way to a bright future for TTML.
Producing Closed Captions
The task of captioning is twofold: accurately transcribing the audio content and laying it out so that it appears in sync with the audio without blocking important portions of the motion picture content. For captioning, compromising accuracy is not an option. The fastest and most accurate transcriptionists are stenographers, such as court reporters and Communication Access Realtime Translation (CART) captioners. These are well-paid, highly skilled professionals, earning, at minimum, $55 per hour (and usually much more) to produce verbatim transcripts in real time. For video on demand (VOD), speed is not so much of the essence, so cheaper, slower transcriptionists can do the work. Often, domain expertise costs another premium, for example, with medical transcription services.
Non-specialist transcription times can be significantly improved by using Automatic Speech Recognition (ASR). For more than 2 decades, ASR performance has been adequate when the ASR engine is well-tuned to a specific speaker's voice and that speaker enunciates unnaturally clearly. To leverage those optimizations, a technique called parroting has been employed, in which the transcriptionist listens to an audio track and very clearly repeats what they hear so that the computer can transcribe it, correcting as they go. In the past 5 years, general purpose ASR has improved to the point that it takes much less time to correct an automatically recognized transcript than it does to type it up from scratch.
Once the transcript is produced, it needs to be time-synchronized to the audio track it transcribes. This task is called forced alignment and is a fascinating technical challenge that has been well-studied over decades and can be performed with high accuracy. Forced alignment implementations can either use ASR to create timed text from the audio and replace inaccuracies with plausible alternatives from the provided transcript, use speech synthesis to generate aligned audio from the transcript to find an optimal match with the real audio, or both.
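To make the first approach concrete, here is a toy Python sketch that transfers word timings from an ASR result onto a corrected transcript. It is only an illustration under strong assumptions: the ASR output is a hypothetical list of (word, start, end) tuples, matching is done on text with the standard library's difflib rather than on acoustic evidence, and transcript words with no matching ASR word simply share the time span of whatever they replaced. Real forced aligners are considerably more sophisticated.

```python
from difflib import SequenceMatcher


def _norm(word):
    """Compare words case-insensitively and without edge punctuation."""
    return word.lower().strip(".,!?;:\"'")


def align(asr_words, transcript):
    """Give each transcript word a (start_s, end_s) taken from ASR timings.

    asr_words: list of (word, start_s, end_s) from a hypothetical ASR engine.
    transcript: the corrected transcript as a plain string.
    """
    ref = transcript.split()
    matcher = SequenceMatcher(a=[_norm(w) for w, _, _ in asr_words],
                              b=[_norm(w) for w in ref], autojunk=False)
    aligned = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Matching words inherit the ASR timing directly
            for i, j in zip(range(i1, i2), range(j1, j2)):
                aligned.append((ref[j], asr_words[i][1], asr_words[i][2]))
        elif j2 > j1:
            # Replaced or inserted words: split the available span evenly
            if i2 > i1:
                span_start, span_end = asr_words[i1][1], asr_words[i2 - 1][2]
            else:
                span_start = asr_words[i1 - 1][2] if i1 > 0 else 0.0
                span_end = asr_words[i1][1] if i1 < len(asr_words) else span_start
            step = (span_end - span_start) / (j2 - j1)
            for k, j in enumerate(range(j1, j2)):
                aligned.append((ref[j], span_start + k * step,
                                span_start + (k + 1) * step))
        # "delete": ASR words absent from the transcript are dropped
    return aligned


if __name__ == "__main__":
    asr = [("the", 0.0, 0.2), ("quick", 0.2, 0.6), ("brown", 0.6, 1.0),
           ("box", 1.0, 1.4), ("jumps", 1.4, 1.9)]
    for item in align(asr, "The quick brown fox jumps."):
        print(item)
```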
One piece of the puzzle that's largely lacking now is the use of ergonomic editors to segment the aligned transcript so that captions appear fluently along with the audio, rather than appearing in fragments or with distracting garden path sentences. Usually, caption editors make it easy to correct the text but not so easy to correct the timing or to move parts of a phrase from one caption block to another. A notable exception is Amara, which offers an impressive web-based caption editor.
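An editor of that kind needs a reasonable first-pass segmentation to start from. The Python sketch below groups timed words into caption blocks, starting a new block at the end of a sentence, after a long pause, or when the text would grow too long. The function name and the thresholds (64 characters, a 0.8-second pause) are assumptions chosen for illustration, not recommended values.

```python
def segment(words, max_chars=64, max_gap_s=0.8):
    """Group timed words into captions.

    words: list of (word, start_s, end_s) in playback order.
    Returns a list of (start_s, end_s, text) caption blocks.
    """
    captions = []
    current = []
    for word in words:
        if current:
            length = len(" ".join(w for w, _, _ in current)) + 1 + len(word[0])
            gap = word[1] - current[-1][2]
            ended_sentence = current[-1][0].endswith((".", "?", "!"))
            if length > max_chars or gap > max_gap_s or ended_sentence:
                captions.append(_close(current))
                current = []
        current.append(word)
    if current:
        captions.append(_close(current))
    return captions


def _close(group):
    """Collapse a group of timed words into one (start, end, text) block."""
    return (group[0][1], group[-1][2], " ".join(w for w, _, _ in group))


if __name__ == "__main__":
    timed = [("Hello", 0.0, 0.4), ("there.", 0.4, 0.9),
             ("How", 1.0, 1.2), ("are", 1.2, 1.4), ("you?", 1.4, 1.8)]
    for block in segment(timed):
        print(block)
```

Rules this simple will still split phrases awkwardly at the length limit, which is exactly the case in which a human needs to be able to move words between blocks easily.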
Live Closed Captioning
Captioning live internet video is still a very difficult challenge. The most obvious problem is that you need to generate an accurate transcript in real time, which requires the services of a professional CART captioner. The second, more difficult, problem is delivering the caption data in a usable form to your viewers. Starting in 2009, my go-to technique for captioning live-event video was to simply open-caption the video. The caption data would be encoded into my program output, which I would then send through a caption decoder box much like the devices deaf television viewers would use before closed-caption decoders were integrated with televisions, specifically Link Electronics' PCD-88 for standard definition video with EIA-608 captions and, later, its LEI-590 for HD video with CEA-708 caption data embedded. These devices can decode the captions and superimpose them over the program video just like a television set would. I would then stream that video out to the audience (and also to any projection at the event) while recording an un-open-captioned version for the polished VOD and archival version.
Another low-complexity way to provide captions for a live stream is by using a third-party service that displays the live captions in a "sidecar," either on an entirely different webpage from your embedded video or in an iframe near your video. To make this work, you simply need to send a copy of the audio to the service using a lower-latency solution than your live stream uses.
When COVID-19 forced everyone apart, Zoom and other videoconferencing platforms offered everyday live-streaming experiences. Providing live-captioning solutions at scale was one of the major engineering challenges the platforms faced. Microsoft Teams and Google Meet were able to integrate existing ASR engines from Azure and Google cloud services to provide passable captions in real time. Zoom has two mechanisms for ingesting captions: either by one of the meeting participants typing them in or through an application programming interface (API) endpoint that CART providers and other captioning services can use, notably Otter.ai, a popular ASR transcription and captioning website. Although perhaps good enough in the pinch that was a global pandemic, uncorrected ASR captions are not adequate for meeting legal requirements or for the expectations of your audience. Screenshots of unfortunate, comical speech-recognition errors will inevitably be taken and shared to besmirch your brand or content.
However, it's entirely plausible that low-cost workflows could be constructed that turn the relatively long latency of HLS video delivery from a bug into a feature. I typically see HLS streams on commercial platforms showing latency of 30 and up to 60 seconds. That's plenty of time to generate live ASR captions and have one to three people editing them in assembly-line fashion before the text is needed on the client to synchronize with the video. The caption data could be fetched with regular polling or pushed over a websocket.
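The core of such a workflow is a holding buffer: draft captions wait for a fixed editing window, during which editors can correct them, and are released to viewers only when the window closes. The Python sketch below is one hypothetical way to model that. The class name, the method names, and the 20-second hold are assumptions; a real system would also need to map release times onto the stream's media timeline.

```python
class CaptionDelayBuffer:
    """Hold draft captions for an editing window before releasing them."""

    def __init__(self, hold_s=20.0):
        self.hold_s = hold_s
        self._drafts = {}  # cue_id -> (created_at_s, text), in arrival order

    def add(self, cue_id, text, now):
        """Store a draft caption, for example straight from an ASR engine."""
        self._drafts[cue_id] = (now, text)

    def correct(self, cue_id, text):
        """Replace a draft's text. Returns False if it was already released."""
        if cue_id not in self._drafts:
            return False
        created_at, _ = self._drafts[cue_id]
        self._drafts[cue_id] = (created_at, text)
        return True

    def release(self, now):
        """Return and remove every caption whose editing window has closed."""
        due = [cue_id for cue_id, (created_at, _) in self._drafts.items()
               if created_at + self.hold_s <= now]
        return [(cue_id, self._drafts.pop(cue_id)[1]) for cue_id in due]


if __name__ == "__main__":
    buffer = CaptionDelayBuffer(hold_s=20.0)
    buffer.add(1, "eye scream", now=0.0)
    buffer.correct(1, "ice cream")
    print(buffer.release(now=20.0))
```

Whatever release() returns is what a polling endpoint or websocket push would deliver to the player.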
Several modern live-streaming platforms now support broadcast-style closed captioning, ingesting streams containing old-fashioned, embedded Line-21/EIA-608 caption data—notably YouTube, Twitch, and Wowza. All can decode the caption data and provide it to viewers as closed-caption data delivered along with the stream. Unfortunately, most browsers will need a client-side web application boost to display it; only Safari thus far supports the part of the Media Source Extensions API intended to stream caption data natively.
YouTube, for example, uses its own caption file format, which superficially resembles TTML. The audio, video, and caption fragment streams are downloaded separately, and the audio and video are muxed together into a data blob that plays in a video element using the Media Source Extensions API. The caption data is displayed using conventional DOM manipulation in sync with the video in an element that users can reposition wherever they want in the video field, and they can style the caption text using a pop-up menu (see Figure 3).
Figure 3. Streaming caption data rendered within a video element's shadow DOM.
None of that caption work uses the canonical HTML5 video and track technology except for capturing emitted timing events from the video element. Even on Safari, where caption data can be streamed into a Track element for native HTML5 display, YouTube uses the same method as on other browsers. This affords Safari users the same mechanisms for repositioning and restyling the caption text. My expectation is that this is how most platforms will continue to handle captions until browser-native caption support catches up to the point that there's no way to differentiate beyond what the browser can already do. To see native caption streaming work in Safari, though, check out the demonstration provided in this excellent blog post from Mux.
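The timing-event approach, in which a player listens for time updates and swaps the visible caption itself, comes down to a lookup like the one in this Python sketch. It is a generic illustration, not YouTube's implementation: the function name and the cue tuples are hypothetical, cues are assumed to be sorted by start time and not to overlap, and a browser player would do the equivalent in JavaScript.

```python
from bisect import bisect_right


def active_cue(cues, current_time_s):
    """Return the text of the cue covering current_time_s, or None.

    cues: list of (start_s, end_s, text), sorted by start_s, non-overlapping.
    """
    starts = [start for start, _, _ in cues]
    # Index of the last cue that starts at or before the current time
    i = bisect_right(starts, current_time_s) - 1
    if i >= 0 and current_time_s < cues[i][1]:
        return cues[i][2]
    return None


if __name__ == "__main__":
    cues = [(1.0, 3.5, "Hello, and welcome."), (4.0, 6.25, "Let's begin.")]
    for t in (0.5, 2.0, 3.7, 5.0):
        print(t, active_cue(cues, t))
```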
The provision of captions for streaming media has evolved alongside the state of the art of internet media technology. We're now at what I consider to be the beginning of a golden age, with broadcast and internet technologies adopting shared standards where it makes sense and pushing the envelope of how much good they can make the technology do for all viewers. Caption data may originally have been a mandated requirement to accommodate the deaf audience, but now it's an integral type of metadata within the Universal Design that modern video delivery systems are built around, allowing for viewers to find and be entertained or educated by the content they want.
[Editor's note: This article originally appeared in the Nov/Dec 2021 edition of Streaming Media magazine.]