Conference Research Tests Adaptive Video and Quality Benchmarks
Video encoding professionals should take note of four papers presented at the recent International Symposium on Electronic Imaging. Read on for a detailed assessment.
The Society for Imaging Science and Technology hosts the annual International Symposium on Electronic Imaging, held this year in San Francisco, California, from February 14 to 18. The Symposium has eight tracks across a range of disciplines, where researchers from industry and academia present papers and findings.
I attended primarily to learn the latest in two arenas: adaptive streaming and video quality benchmarks. In this article, I’ll present an overview of the sessions and papers I found most interesting and relevant. The first two relate to work on adaptive streaming performed by Google. The second two discuss how to measure the quality of adaptive streaming experiences.
A Subjective Study for the Design of Multi-Resolution ABR Video Streams With the VP9 Codec
One common problem facing encoding professionals is identifying when to switch between streams in an adaptive group. This paper, authored by Chao Chen, Sasi Inguva, and Anil Kokaram from YouTube/Google, presented a hybrid objective/subjective technique for identifying the appropriate data rate for switching stream resolutions. Though the experiment focused on the 4K/2K decision point using the VP9 codec, the technique can be used for any decision point and codec.
Adaptive streaming involves a group of encoding configurations at various resolution and data rate pairs. At each data rate in the ladder, the player has to choose the appropriate resolution. Intuitively, it’s the resolution that delivers the highest quality at that data rate, as shown in Figure 1, and Google broke no new ground in sharing this observation.
Figure 1. Theoretically, you want to switch resolutions to maximize quality throughout the encoding ladder.
As mentioned, Google’s focus was on the appropriate data rate to switch between 2K and 4K videos, and the short answer is that it’s between 4 Mbps and 5 Mbps when encoding with the VP9 codec. How Google got there is the interesting part.
Google selected 7966 4K videos uploaded to YouTube, created 2K versions, encoded both the 4K and 2K versions with VP9 at various data rates, and computed their Structural Similarity Index (SSIM) scores. Based upon these scores, the average switching rate between 2K and 4K was 4 Mbps. That is, below 4 Mbps, most 2K clips had a higher SSIM rating, while above 4 Mbps, the 4K videos had a higher SSIM rating.
To test this premise, Google ran subjective tests, but even Google doesn’t have the time, patience, or funds to test 7966 videos. In fact, the Google researchers wanted to subjectively test just 10. So the question was how to choose 10 videos that represent the entire universe of 4K clips that are and ever will be uploaded to YouTube. Not surprisingly, the answer is a bit wonky, though comprehensible if you simply trust Google’s math.
The researchers reasoned that the encoding complexity of a video involves two main factors: the amount of motion in the clip and the amount of detail. To assess the amount of detail, the authors used I-frame size, since at a constant quantization parameter, more detail requires a larger file size to preserve. To measure the amount of motion in a clip, the researchers used average P-frame size divided by I-frame size, to decouple the effect that large I-frames can have on P-frames (this is the trust-the-math part).
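The two measures are simple to compute once you have per-frame sizes from a constant-QP encode. The sketch below is only an illustration: the input format is hypothetical, and averaging the I-frame sizes across the clip is my assumption, since the description above does not say how multiple I-frames are combined.

```python
from statistics import mean
from typing import Sequence, Tuple


def complexity_features(frames: Sequence[Tuple[str, int]]) -> Tuple[float, float]:
    """Return (detail, motion) for one clip.

    frames: hypothetical list of (frame_type, size_in_bytes) pairs taken
    from a constant-QP encode, with frame_type "I", "P", or "B".
    """
    i_sizes = [size for ftype, size in frames if ftype == "I"]
    p_sizes = [size for ftype, size in frames if ftype == "P"]
    if not i_sizes or not p_sizes:
        raise ValueError("need at least one I-frame and one P-frame")
    detail = mean(i_sizes)           # more detail -> larger I-frames
    motion = mean(p_sizes) / detail  # P-frame size normalized by I-frame size
    return detail, motion
```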
Figure 2. Differentiating clips based upon the amount of motion and detail.
With these metrics decided upon, the researchers encoded 3226 of the highest-quality clips in the 4K library to H.264 using FFmpeg with a constant quantization parameter of 28. Then, they measured I-frame size and the P-frame/I-frame size ratio and plotted the graph shown in Figure 2, which in essence creates a 9-slot taxonomy of 4K clips based upon the amount of motion and detail. In each region, they identified the 20 clips closest to the center and selected the highest-quality clip from among them.
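The selection step can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' procedure: I assume the nine regions are an equal-width 3x3 grid over the observed range of each measure, which the description above does not confirm, and all names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Clip = Tuple[str, float, float]  # (clip_id, detail, motion); hypothetical layout


def _bin(value: float, lo: float, width: float, bins: int) -> int:
    """Equal-width bin index, clamped so the maximum lands in the last bin."""
    return min(int((value - lo) / width), bins - 1)


def pick_representatives(
    clips: Sequence[Clip], per_region: int = 20, bins: int = 3
) -> Dict[Tuple[int, int], List[str]]:
    """Group clips into a bins x bins grid and return, for each occupied
    cell, the IDs of the clips closest to the cell center."""
    d_lo = min(c[1] for c in clips)
    m_lo = min(c[2] for c in clips)
    # Fall back to a width of 1.0 if every clip shares the same value.
    d_w = (max(c[1] for c in clips) - d_lo) / bins or 1.0
    m_w = (max(c[2] for c in clips) - m_lo) / bins or 1.0
    regions: Dict[Tuple[int, int], List[Tuple[float, str]]] = {}
    for clip_id, detail, motion in clips:
        key = (_bin(detail, d_lo, d_w, bins), _bin(motion, m_lo, m_w, bins))
        # Distance to the cell center, measured in cell widths on each axis.
        dd = (detail - d_lo) / d_w - (key[0] + 0.5)
        dm = (motion - m_lo) / m_w - (key[1] + 0.5)
        regions.setdefault(key, []).append((dd * dd + dm * dm, clip_id))
    return {
        key: [cid for _, cid in sorted(members)[:per_region]]
        for key, members in regions.items()
    }
```

From each returned list, you would then pick the clip with the best source quality, however you choose to judge that.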
From these clips, they produced 2K variants, and encoded these variants and the original 4K clips to 2 Mbps, 3 Mbps, 6 Mbps, and 11 Mbps. Then they scaled the 2K versions back to 4K for side-by-side subjective testing. The result was an average switching rate of about 5 Mbps. From this, the authors concluded: “In this sense, SSIM is probably a good quality index for the purpose of estimating the average resolution switching bitrate for large amount of videos. Although SSIM may overestimate or underestimate the quality for a particular video, its estimation error will be averaged out when estimating average quality for a large collections of videos.”
What’s significant about this study? Multiple items. First, it validates using SSIM as the basis for determining how to configure streams in adaptive groups. Second, the 4/5 Mbps switch point between 2K/4K video is interesting, though this will vary from codec to codec. Finally, if you find yourself having to select a limited set of clips that accurately reflect the characteristics of a larger group, the I-frame size and P-frame/I-frame ratio technique described might just do the trick.
Optimizing Transcode Quality Targets Using a Neural Network With an Embedded Bitrate Model
One of the more significant recent events in the encoding world was Netflix’s per-title encode blog post, in which the authors discussed their schema for creating a custom encoding ladder for each video distributed by the service. Netflix’s approach involves multiple trial encodes, which works well when you distribute a large but limited set of content. The compressionists at YouTube have a completely different problem to manage: in essence, how to pull off per-title encoding when you have 300 hours of video uploaded every minute of every day. This talk, and the above-titled paper, discussed their approach.
The paper, authored by Google’s Michele Covell, Martin Arjovsky, Yao-chung Lin, and Anil Kokaram, starts by describing the conditions that YouTube must work under. First, YouTube encodes files in parallel, splitting each source into chunks and then sending them off to different encoding instances. Since communications between these instances would complicate system design and operation, the solution couldn’t rely on them.
Second, any approach must be codec agnostic, because YouTube deploys multiple codecs. To make this work, the solution had to depend upon a single rate control parameter for each codec, though it can vary from codec to codec. For x264, which was the focus of the paper, YouTube used the Constant Rate Factor (CRF) value as the single rate control parameter.
CRF is a rate control technique that adjusts the quantization level to optimize quality over the duration of the file (or file segment). The problem with CRF is that it has no mechanism for targeting a data rate; you set the CRF value, and x264 produces a file at whatever data rate is necessary to meet the selected quality level. YouTube’s files have to meet a target data rate, so the object of the exercise was how to choose the CRF level that would deliver the required data rate.
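For contrast, here is what the brute-force answer looks like: a bisection search over CRF values using trial encodes. This is emphatically not YouTube's method, since it needs several encodes per clip, but it shows the problem the neural network is meant to shortcut. The `measure_kbps` callable is hypothetical (you would wrap your own encoder), and the sketch assumes data rate falls as CRF rises across x264's 8-bit range of 0 to 51.

```python
from typing import Callable


def find_crf(
    measure_kbps: Callable[[int], float],
    target_kbps: float,
    lo: int = 0,
    hi: int = 51,
) -> int:
    """Return the lowest integer CRF (highest quality) whose measured data
    rate does not exceed target_kbps.

    measure_kbps is a hypothetical callback that runs a trial encode at the
    given CRF and reports the resulting data rate. If even CRF `hi` exceeds
    the target, `hi` is returned.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if measure_kbps(mid) > target_kbps:
            lo = mid + 1  # too many bits: raise CRF
        else:
            hi = mid      # fits: try a lower CRF for better quality
    return lo


def toy_encoder(crf: int) -> float:
    """Purely illustrative model: data rate halves every 6 CRF steps."""
    return 20000 * 2 ** (-crf / 6)
```

Calling `find_crf(toy_encoder, 2600)` takes six trial encodes with the toy model, which is exactly the kind of cost that makes this approach impractical at YouTube's scale.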
One obvious solution would be to run a first encoding pass on all incoming files, and distribute this information to all encoding instances. However, implementing two-pass encoding would dramatically increase the encoding horsepower necessary to process the incoming load. For this reason, YouTube had to implement the solution, if at all possible, in a single pass.
As the paper describes, while YouTube can’t afford a first pass on all incoming files, it does gather some information from a high-bitrate mezzanine file produced from all incoming files. Essentially, because users upload files in a variety of formats, sizes, bit rates, and frame rates, this mezz file is necessary to normalize these files before encoding. When creating this mezz file, YouTube gleans many details about the file, though not up to the level of information gained from a true first-encoding pass.
Schooling the Neural Network
The issue was how to predict the right CRF value from this limited information, and for this, YouTube deployed a neural network. At a high level, a neural network is a computing system loosely modeled on interconnected neurons, with the ability to learn via training. To train the network, YouTube performed over 137,000 encodes on 14,000 clips, and fed the data into the network (Figure 3). The researchers then encoded 1,000 test clips based upon input from the network and found that the system chose the right CRF value to meet the target data rate 65 percent of the time, with a tolerable bitrate error of under 20 percent. This would mean that 35 percent of the clips would have to be re-encoded to meet the target bitrate.
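The accuracy figure is just the share of test encodes that land within the tolerance. A minimal sketch, assuming the error is measured relative to the target data rate (the function and parameter names are mine):

```python
from typing import Sequence


def within_tolerance_rate(
    target_kbps: Sequence[float],
    actual_kbps: Sequence[float],
    tolerance: float = 0.20,
) -> float:
    """Fraction of encodes whose relative bitrate error is under tolerance."""
    hits = sum(
        1
        for target, actual in zip(target_kbps, actual_kbps)
        if abs(actual - target) / target < tolerance
    )
    return hits / len(target_kbps)
```

One minus this value is the share of clips that would need a second encode.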
Figure 3. Training the YouTube neural network.
The researchers next evaluated the learning benefit of incorporating the results of a fast, low-quality CRF encode into the system. Specifically, the system encoded a 240-pixel-high video file at a CRF value of 40, and incorporated data from this encode into the neural network training. This boosted accuracy to 80 percent, which means that only 20 percent of the files needed re-encoding.
It’s tough to say how the average compressionist might use this research, though it does provide a fascinating look into the scale of YouTube’s operations, and an interesting example of what neural networks are and the type of work that they can perform. Whatever technique you use to optimize encodes, however, if you’re not thinking about per-title or per-category encoding optimization, you’re behind the curve.
Subjective Analysis and Objective Characterization of Adaptive Bitrate Videos
Assessing the quality of a single video file via subjective and objective testing is well-traveled ground. However, the Quality of Experience (QoE) of adaptive streaming is much more complicated, using multiple streams with different quality levels and different algorithms to determine when and how often to switch streams. This paper, authored by Jacob Søgaard (Technical University of Denmark), Samira Tavakoli (Universidad Politécnica de Madrid), Kjell Brunnström (Acreo Swedish ICT AB and Mid Sweden University), and Narciso García (Universidad Politécnica de Madrid), provides a great explanation of the types of testing performed to assess the QoE of adaptive streaming. Unfortunately, it shows that highly accessible and easy-to-apply objective tests are poor predictors of actual subjective ratings.
Rating the QoE of Adaptive Streaming
Near the start of their paper, the authors reference a highly useful paper entitled Quality of Experience and HTTP Adaptive Streaming: A Review of Subjective Studies, which I Googled and was able to download. I suggest that you do the same. As the title suggests, this paper reviewed previous studies and summarized their conclusions, which are relevant to all streaming producers.