Welcome to the Matrix: Beyond Scene-Based Encoding for Video

Before I started editing Streaming Media, I wrote for a magazine called EMedia, which covered the optical disc industry. That magazine is long gone, but one of the central points of its coverage remains: the need for continuous improvement in encoding video and audio. Back in the early 2000s, people were excited about the progression from constant bitrate encoding to variable bitrate encoding, which, of course, eventually morphed into adaptive bitrate (ABR) streaming.

ABR is the standard for video delivery today, and every major video publisher follows or has created their own per-title encoding ladder. In his latest The Producer's View column, though, Jan Ozer declared "the per-title encoding ladder as we know it is on the way out, and it will soon be supplanted by dynamic and context-aware encoding."

So what's next in the never-ending quest for better quality at lower bitrates? At Akamai Edge World in Las Vegas last week, I posed the question to Will Law, the company's chief architect for media cloud engineering, and here's what he had to say. First of all, he (like Ozer) emphasized that scene-dependent encoding is where it's at today. But eventually, he said, we need to start questioning how we deploy ABR altogether.

"Ultimately, I think we need to start addressing adaptive bitrate encoding. It's a construct that says I assume that by increasing my bitrate, I increase the quality my end user receives. That's the fundamental basis that we operate adaptive bitrate on today. Yet that's not how we switch today. We will often raise the bitrate automatically, even if it doesn't improve the quality the end user receives."

For instance, Law says, a CDN may sense that an end user's device has the capability and the bandwidth to receive a 1080p stream and send that, even if the viewer is watching on a mobile phone held vertically and can't see the difference in quality between 1080p and 720p. That means the CDN, and therefore the content publisher, is spending money to deliver higher-bitrate video that gives the viewer no visible benefit.

"I think we need a more sophisticated approach where the multiple segments are produced at different quality levels and at different sizes," Law said. "And then a manifest's job is to describe what's available, describe a matrix of segments instead of a list of segments, and then the player can intelligently walk a path through the matrix. It's like a contour. You might say 'I need you to be at the highest quality for a given bitrate.' So it can dimension that matrix by the bitrate and then pick for each segment, whether it's high activity or low activity, which would be the appropriate bitrate to use."
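To make the idea of walking a path through the matrix concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `Rendition` type, the `pick_path` function, and the bitrates and quality scores are invented for illustration and do not correspond to any existing manifest format or player API. It assumes only that the manifest exposes, for every segment, a bitrate and some quality score for each available rendition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendition:
    bitrate_kbps: int  # cost of fetching this encoding of the segment
    quality: float     # hypothetical perceptual score, higher is better


def pick_path(matrix, max_kbps):
    """For each segment, pick the highest-quality rendition under the bitrate cap.

    Ties in quality go to the cheaper rendition. If nothing fits under the
    cap, fall back to the cheapest rendition of that segment.
    """
    path = []
    for renditions in matrix:
        fitting = [r for r in renditions if r.bitrate_kbps <= max_kbps]
        if fitting:
            choice = max(fitting, key=lambda r: (r.quality, -r.bitrate_kbps))
        else:
            choice = min(renditions, key=lambda r: r.bitrate_kbps)
        path.append(choice)
    return path


# Made-up matrix: one row per segment, one entry per rendition.
matrix = [
    # Low-activity segment: quality stops improving above 1500 kbps.
    [Rendition(800, 93.0), Rendition(1500, 95.0), Rendition(3000, 95.0)],
    # High-activity segment: every extra bit still buys visible quality.
    [Rendition(800, 70.0), Rendition(1500, 82.0), Rendition(3000, 91.0)],
]

path = pick_path(matrix, max_kbps=3000)
print([r.bitrate_kbps for r in path])  # [1500, 3000]
```

With the same 3,000 kbps ceiling, the low-activity segment is fetched at 1,500 kbps because the larger rendition scores no higher, while the high-activity segment uses the full budget. A conventional ladder would have fetched both at 3,000 kbps.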

Bitrate should be a constraint, but not the sole determining factor. "You need some measure of the perceived quality," Law said. "And that gets difficult because [a metric] like VMAF is a [full-reference] implementation, and you need your master copy to compare it [to]. So then you have to predict ahead of time when you're encoding content what does this content, say at 720p, look like if it's rendered at 480, 1080, or 4K? Get a relative score, and all that information has to be in the manifest, and then the player can say 'Ah, I know I'm showing it at this size, so given that, what's my max quality for a given bitrate?'"
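The player-side decision Law describes could look something like the following Python sketch. It is an illustration under stated assumptions, not a description of any shipping player: the dictionary layout, the `best_for_display` function, the scores, and the `tolerance` threshold (how small a score gap counts as invisible) are all invented here. The only thing taken from Law's description is that the manifest carries a relative quality score for each rendition at each display size.

```python
def best_for_display(renditions, display_height, max_kbps, tolerance=1.0):
    """Pick the cheapest rendition that is within `tolerance` points of the
    best achievable score at this display size, subject to a bitrate cap.

    Falls back to the cheapest rendition if nothing fits under the cap.
    """
    fitting = [r for r in renditions if r["kbps"] <= max_kbps]
    if not fitting:
        return min(renditions, key=lambda r: r["kbps"])
    best = max(r["scores"][display_height] for r in fitting)
    good_enough = [
        r for r in fitting if best - r["scores"][display_height] <= tolerance
    ]
    return min(good_enough, key=lambda r: r["kbps"])


# Made-up manifest data: predicted score of each encode at each display height.
renditions = [
    {"name": "480p", "kbps": 1200, "scores": {480: 90.0, 1080: 70.0}},
    {"name": "720p", "kbps": 2500, "scores": {480: 94.0, 1080: 85.0}},
    {"name": "1080p", "kbps": 5000, "scores": {480: 94.5, 1080: 95.0}},
]

# Same bandwidth, different screens.
print(best_for_display(renditions, display_height=480, max_kbps=6000)["name"])   # 720p
print(best_for_display(renditions, display_height=1080, max_kbps=6000)["name"])  # 1080p
```

On the small display the 1080p encode scores only half a point higher than the 720p encode, so the player takes the 720p rendition at half the bitrate. On the large display the gap is 10 points, so the higher bitrate is worth fetching.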

At the moment, such a system is purely theoretical. "We're talking about stuff that doesn't need to happen in the next five years," Law said. "But if you want to keep optimizing OTT, eventually I think it's going to happen."

