Video: How Netflix Optimized Encoding Quality for Jessica Jones

Read the complete transcript of this clip:

Anne Aaron: The goal of our team is to encode Jessica Jones at the best quality possible, so I'm using her because we're not so far from Hell's Kitchen where she lives. To be able to achieve this goal of encoding the best quality video, we do a few things. One is that we run codec comparisons, comparing encoders: Beamr, x265, x264. We compare codecs and standards. We develop our own encoding algorithms and optimization algorithms; for example, as was mentioned, per-title optimization, and more recently we developed dynamic optimization. And last, but not least, we have to make sure that the quality is good end to end: that our developers or engineers are not introducing bugs in the system that cause bad video, and that at the end, after the adaptive streaming algorithm works, we're still delivering high-quality video.

To do all this, it's just not really possible to do a subjective evaluation so we had to ... we knew ... and as mentioned, all compression folks know that PSNR is just not good for perceptual quality even though we all use PSNR. It was just really out of necessity that we had to develop a video quality metric that we could use for our jobs. There are a few goals that we wanted.

First, it should be very accurate in reflecting human perception, and second, we should be able to run it at scale to achieve all three of these objectives. Later on, we realized that it was also good to open it up to the industry, to open source it, primarily because in running these codec evaluations we would like to improve the standards. For us to be able to influence and improve the standards, this quality metric has to be accepted and validated by the entire industry. That's why we also wanted to open source it.

How did we attack this problem? First, we actually generated ground truth data, and we were very targeted in generating the ground truth data. First, our type of content, so of course not just action sequences like that but diverse content: cartoons, TV shows, movies. From the distortion point of view, it was also targeted to the distortions that we were interested in: compression artifacts, as well as scaling artifacts, because if you're adaptive streaming a low-resolution video, it's gonna be up-sampled at the receiver. So those were the two primary focuses.

Also in terms of the viewing condition, we were very targeted, using a TV. Then, using this ground truth data, we extracted features. From the feature point of view, we don't want to reinvent the wheel, so we dug through academic research and found quality metrics that were based on the human visual system, to reuse those and then fuse them together. By fusing these different quality metrics that other folks had developed, hopefully we get a better quality metric that more accurately reflects human perception, so if one quality metric doesn't work well on a type of content, maybe the other quality metric can push the scores in the right direction.
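The fusion step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the open-source VMAF release uses a support vector machine regressor for this step, while the sketch below substitutes ordinary least squares to stay dependency-free. The function names (`fit_fusion`, `fuse`) and the toy numbers are hypothetical; the "opinion scores" are generated from a known linear rule so the fit can be checked, and are not real subjective data.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_fusion(features, opinion_scores):
    """Least-squares weights (last entry is a bias) mapping metric values to opinion scores."""
    rows = [f + [1.0] for f in features]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * y for r, y in zip(rows, opinion_scores)) for i in range(k)]
    return solve(ata, atb)


def fuse(weights, feature_row):
    """Combine the elementary metric values for one clip into a single fused score."""
    return sum(w * x for w, x in zip(weights, feature_row + [1.0]))


# Toy data: two elementary metric values per clip (hypothetical).
features = [[0.9, 0.8], [0.6, 0.7], [0.4, 0.3], [0.2, 0.4], [0.7, 0.5]]
# Synthetic "ground truth", generated as 60*a + 30*b + 5 so the fit is verifiable.
opinion_scores = [60 * a + 30 * b + 5 for a, b in features]

weights = fit_fusion(features, opinion_scores)
print([round(w, 3) for w in weights])
print(round(fuse(weights, [0.5, 0.5]), 3))
```

With real subjective scores the fit would not be exact, and a nonlinear regressor can capture cases where one elementary metric is only trustworthy on some kinds of content.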

Because we had a very targeted setup, the current model supports 1080p video on a 1080p viewing, and later on we realized that part of our job again, we wanted to optimize the viewing for cellphones and tablets so we expanded our ground truth data to also capture cellphone viewing.

So now we have both a TV and a mobile viewing model. We also actually ran tests in 4K because we also stream a lot of 4K, and so we also have that 4K model in-house, which we plan to open source soon too. We didn't just develop this ourselves; we also had partners in academia like University of Southern California; University of Texas at Austin, where Professor Wang is from and where they developed SSIM; and Université de Nantes, which has a really good lab for subjective testing.

How does this work? Does this really reflect human perception? The top two tables are tests that we ran ourselves. A metric that correlates perfectly with human perception would have a correlation score of one. When we tested on our test data sets, which are not what we use for training but quite similar, we have a correlation of 0.93, which is good, but as you can see it's not perfect.

The LIVE video databases are really, really hard databases to test on. We only took the ones with compression-relevant impairments. As you can see, all the other metrics are pretty terrible, at a 0.5 correlation, 0.4 correlation. We are at 0.7, not yet perfect but getting close.
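For readers who want to reproduce this kind of check, here is a minimal sketch of how a correlation score between a metric and subjective ratings can be computed. The talk does not say which correlation coefficient the tables report; Pearson (linear) and Spearman (rank-order) are the two usual choices, so both are shown. The example scores are made up.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical metric scores and mean opinion scores for six clips.
metric = [92.0, 85.5, 70.1, 64.3, 48.0, 30.2]
opinion = [4.6, 4.4, 3.9, 3.1, 2.8, 1.5]
print(round(pearson(metric, opinion), 3), round(spearman(metric, opinion), 3))
```

A value of 1.0 from either function means the metric orders (Spearman) or linearly tracks (Pearson) the human ratings perfectly.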

We're seeing a lot of other folks also validating VMAF since we've open sourced it and this is just something that was reported recently at the VQEG on the other data sets. As you can see, it's pretty good correlation, 0.95 which we're happy about and we're starting to see some adoption. I was recently at NAB and saw a bunch of folks on the floor starting to use VMAF too.

Okay, so here it says this is about best practices. We run a lot of codec comparisons. Here are just some of the things that we've learned over the last couple of years. First, make use of bitrate-resolution and QP-resolution pairs. It just makes sense. That's one of our pet peeves in a lot of academic fixed-QP evaluations: you test 4K with a really, really high QP. In the field, folks are not gonna use that. If you have low bitrate, you're probably not gonna stream 4K at a QP of 60. You're gonna go to a lower resolution. So when we do our tests, we make sure that we only test on the relevant parts, the relevant QP and bitrate selections, so that the results are actually relevant.
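One way to apply this in a test harness is to prune the full resolution-by-QP grid before encoding anything. The sketch below assumes a lookup table of "plausible in the field" QP ranges per resolution; the table name `RELEVANT_QP`, the function `relevant_pairs`, and every number in it are placeholders for illustration, not values from the talk.

```python
from itertools import product

# Hypothetical QP ranges considered realistic for each resolution.
# These numbers are placeholders; pick ranges that match your own ladder.
RELEVANT_QP = {
    "2160p": range(18, 35),
    "1080p": range(20, 41),
    "720p": range(22, 45),
    "480p": range(24, 49),
}


def relevant_pairs(resolutions, qps):
    """Keep only the (resolution, QP) combinations inside the realistic range."""
    return [
        (res, qp)
        for res, qp in product(resolutions, qps)
        if qp in RELEVANT_QP[res]
    ]


# A fixed-QP sweep that would otherwise test every resolution at every QP.
sweep = relevant_pairs(["2160p", "1080p", "720p", "480p"], [20, 30, 40, 60])
for res, qp in sweep:
    print(res, qp)
```

Here the 4K encode at a QP of 60 never enters the comparison, so it cannot skew the averaged results.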

I mentioned that we really like VMAF. We developed it for our use case, but to make sure that we're not missing any corner cases in VMAF, because it's still under development, we cross-check with other quality metrics like PSNR and VIF and VQM, just to make sure that we're not missing anything. Another important thing is that you have to keep in mind the target viewing condition, because people will perceive the video in different ways depending on how they're viewing it, so you have to keep that in mind.
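The cross-check against other metrics can be automated as a simple disagreement report. This is a sketch under assumptions: each comparison is between two encodes of the same clip, scores are already computed elsewhere, and the function name `find_disagreements`, the data layout, and the sample numbers are all hypothetical. Metrics where a lower score means better quality are passed in `lower_is_better` so their preference is flipped.

```python
def sign(x):
    """Return 1, 0, or -1 according to the sign of x."""
    return (x > 0) - (x < 0)


def find_disagreements(comparisons, primary="vmaf", lower_is_better=()):
    """Return {clip: [metrics]} where a metric prefers a different encode than `primary`.

    comparisons maps clip name -> {metric name: (score_encode_a, score_encode_b)}.
    """
    flagged = {}
    for clip, metrics in comparisons.items():

        def preference(name):
            a, b = metrics[name]
            s = sign(a - b)
            return -s if name in lower_is_better else s

        ref = preference(primary)
        odd = [m for m in metrics if m != primary and preference(m) != ref]
        if odd:
            flagged[clip] = odd
    return flagged


# Made-up scores for two clips, each encoded two ways (a, b).
comparisons = {
    "clip_a": {"vmaf": (82.0, 78.5), "psnr": (39.1, 38.2)},
    "clip_b": {"vmaf": (74.0, 76.0), "psnr": (37.5, 36.9)},
}
print(find_disagreements(comparisons))
```

Flagged clips are not necessarily metric failures; they are the ones worth looking at by eye.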

Eventually, if you're trying to optimize for your user base, your member base, you might want to average it based on the histogram of your members, but that's something to keep in mind though.
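Averaging over the histogram of members amounts to a weighted mean of the per-viewing-condition scores. A minimal sketch, assuming you already have one quality score per viewing condition and a count of sessions per condition; the function name `population_score` and all numbers are made up for illustration.

```python
def population_score(score_by_condition, sessions_by_condition):
    """Average quality scores, weighting each viewing condition by its share of sessions."""
    total = sum(sessions_by_condition.values())
    return sum(
        score_by_condition[condition] * count / total
        for condition, count in sessions_by_condition.items()
    )


# Hypothetical scores from a TV model and mobile models, and a hypothetical device histogram.
scores = {"tv": 80.0, "phone": 92.0, "tablet": 88.0}
sessions = {"tv": 500, "phone": 300, "tablet": 200}
print(round(population_score(scores, sessions), 1))
```

The same encode can therefore rate differently for two services whose members watch on different mixes of devices.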

Last, but not least, you have to use diverse content. A lot of tests out there, in standardization for example, use a handful of test sequences 10 seconds long. You're never gonna capture all the capabilities of the codecs that way, so that's not enough. You just have to use more: 30 at least, 50 or even more if you can.
