New Codecs Are Coming; Here's How to Evaluate Codec Evaluations
As we transition from H.264 to VP9, HEVC, AV1, and soon VVC (Versatile Video Coding), it’s important to understand the fundamentals of codec comparisons and how to evaluate their effectiveness and utility. In this expanded column I’ll cover both.
Evaluating the Evaluation
Let’s begin with how to evaluate the evaluation. I start by identifying the evaluator and its affiliations, giving more credibility to actual users of the technology, like Netflix or Facebook, than to vendors. Though both are members of the Alliance for Open Media, and so have some degree of bias, a staffer who publishes a paper detailing a certain quality level knows he’ll have to deliver that quality when it’s time to deploy.
At the other end of the credibility spectrum are reports prepared by non-practicing companies affiliated with one of the HEVC patent groups. They’re not actually using any video technology at scale, and they have a clear financial incentive to find their technology superior.
When reviewing reports from research and technology shops, like Moscow State University (MSU), I focus on who funded the report. MSU funds most of its own reports, so I give them great credibility. If a report is funded by a third party, I look at the interests of that party.
Next, I identify which version of the codec is actually evaluated. Remember that there are multiple HEVC, H.264, VP9, and even AV1 codecs, each with different dynamics. HEVC proponents assert that the HEVC reference codec is the true gauge of encoding quality, though this codec isn’t commercially used. My preference is to compare commercially available codecs, particularly those used at scale, like x264 or x265, or AV1 as delivered in FFmpeg 4.x.
Then I consider the version of that codec, which is a concern for slow-moving academic papers that can take months to get from testing to publication. AV1 in particular will change significantly over the next few months, so a review that’s 9 to 12 months old may differ dramatically from what’s currently available.
Then I look at how the encoding parameters for each codec are derived. I ask the codec vendors to supply encoding parameters, eliminating any bias or learning curve. MSU does the same. I tend to discount any study that doesn’t consult with the codec vendors.
I also consider how many clips are tested and their composition. More clips are better, and they should be diverse in terms of motion, complexity, and real-world and animated content.
Finally, I consider the operational bias of the tester. For example, Facebook evaluated AV1 for VOD distribution to millions of viewers, which minimizes the impact of encoding time and cost. While useful for publishers with similar volume, this data isn’t meaningful for smaller producers and is completely irrelevant for live producers.
Once you’ve considered the pedigree and focus of the evaluator and study, it’s time to understand the components and results.
There are two ways to analyze encoded files—using actual viewers or using objective quality metrics like Peak Signal-to-Noise Ratio (PSNR), Structural SIMilarity Index (SSIM), Video Multimethod Assessment Fusion (VMAF), or SSIMPLUS from SSIMWAVE. Objective quality metrics exist to predict subjective scoring, but subjective comparisons are the gold standard. However, producing subjective evaluations is expensive and time-consuming, which is why objective metrics are so frequently used.
In my consulting work and writing, I prefer VMAF and SSIMPLUS over PSNR or SSIM, but that’s my idiosyncratic bias. If you’re familiar with objective metrics, you likely have your own bias. Otherwise, you should evaluate the metric based on who is using it. Obviously, Facebook wouldn’t quote PSNR/SSIM stats if it felt they were irrelevant, and PSNR hasn’t become obsolete in the 2.5 years since Netflix stopped using it to drive its impressive encoding engine.
When using objective metrics, the results are typically shown via a rate distortion curve (see Figure 1). To produce this, you encode a file or files at multiple data rates, score the different videos, and plot the results. Figure 1 shows the average results for two 1080p files encoded at six data rates using the x265, VP9, x264, and AV1 codecs in FFmpeg 4.x.
Figure 1. A rate distortion curve for AV1, x265, VP9, and x264
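As a minimal sketch of the averaging step behind a rate distortion curve, the Python below groups per-clip scores by codec and data rate and averages them into plottable points. The measurements, clip names, and the rate_distortion_curves helper are hypothetical and exist only for illustration; they are not the data behind Figure 1.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical measurements: (codec, clip, bitrate_kbps, vmaf_score).
# These values are invented for illustration only.
measurements = [
    ("x264", "clip_a", 1000, 70.0),
    ("x264", "clip_b", 1000, 74.0),
    ("x264", "clip_a", 2000, 82.0),
    ("x264", "clip_b", 2000, 86.0),
    ("av1", "clip_a", 1000, 84.0),
    ("av1", "clip_b", 1000, 88.0),
    ("av1", "clip_a", 2000, 92.0),
    ("av1", "clip_b", 2000, 94.0),
]


def rate_distortion_curves(rows):
    """Average scores across clips; return {codec: [(bitrate, score), ...]}."""
    buckets = defaultdict(list)
    for codec, _clip, bitrate, score in rows:
        buckets[(codec, bitrate)].append(score)

    curves = defaultdict(list)
    for (codec, bitrate), scores in buckets.items():
        curves[codec].append((bitrate, mean(scores)))

    # Sort each curve by bitrate so it can be plotted left to right.
    return {codec: sorted(points) for codec, points in curves.items()}


curves = rate_distortion_curves(measurements)
for codec, points in curves.items():
    print(codec, points)
```

Each codec's list of (bitrate, score) pairs is one line on the chart, with data rate on the horizontal axis and the metric score on the vertical axis.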
When reviewing a rate distortion curve, consider two things. First, are the data rates relevant to your codec usage? If a 1080p curve goes up to 20Mbps, it may be useful for live encoding, but not for VOD, where 1080p data rates for VP9, HEVC, and particularly AV1 should be 4Mbps or lower.
Second, find the quality bar for each particular metric. With VMAF, a score of 93 and higher predicts that the video is free from annoying artifacts. For PSNR, the magic number is 45 dB; with SSIM, it’s 0.95. Using this as a reference, you can gauge how much bandwidth the codec actually saves you at the quality level where you would typically seek to distribute your video.
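One way to apply a quality bar is to estimate the data rate at which each codec's curve first reaches it, then compare those rates. The sketch below assumes straight-line interpolation between measured points, which is a simplification, and uses invented curve values and a hypothetical bitrate_at_quality helper.

```python
def bitrate_at_quality(points, target):
    """Estimate the bitrate where a curve first reaches `target`.

    `points` is a list of (bitrate, score) pairs. Returns None when the
    curve never reaches the target within the measured range.
    """
    points = sorted(points)
    if points[0][1] >= target:
        return float(points[0][0])
    for (r0, q0), (r1, q1) in zip(points, points[1:]):
        if q0 < target <= q1:
            # Linear interpolation between the two neighboring points.
            return r0 + (r1 - r0) * (target - q0) / (q1 - q0)
    return None


# Hypothetical averaged curves: (bitrate_kbps, vmaf_score).
x264_curve = [(2000, 85.0), (4000, 91.0), (6000, 95.0)]
av1_curve = [(1000, 86.0), (2000, 92.0), (3000, 96.0)]

QUALITY_BAR = 93.0  # the VMAF threshold cited in the text

x264_rate = bitrate_at_quality(x264_curve, QUALITY_BAR)
av1_rate = bitrate_at_quality(av1_curve, QUALITY_BAR)
savings = 1.0 - av1_rate / x264_rate

print(f"x264 reaches VMAF {QUALITY_BAR} at about {x264_rate:.0f} kbps")
print(f"AV1 reaches VMAF {QUALITY_BAR} at about {av1_rate:.0f} kbps")
print(f"Estimated savings at the quality bar: {savings:.0%}")
```

Unlike an average over the whole curve, this comparison looks only at the quality level where you would actually distribute video.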
Or you can use the BD-Rate result (Figure 2). BD-Rate, short for Bjøntegaard delta rate, calculates the average data rate savings delivered by one codec over another at equivalent quality. This is computed from the same data shown in Figure 1 and predicts that, within the range of curves displayed in the figure, AV1 will deliver quality equivalent to x265 at roughly 82 percent of the data rate, the same quality as VP9 at about 69 percent of the data rate, and the same quality as x264 at roughly 50 percent of the data rate.
BD-Rate is the bottom line of codec comparison, and a great way to summarize the results.
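To make the idea concrete, here is a simplified sketch of a BD-Rate style calculation. The standard Bjøntegaard method fits a cubic polynomial to log data rate as a function of quality; this version swaps in piecewise-linear interpolation to stay dependency-free, so its results will differ somewhat from standard tools. The bd_rate function and the sample curves are hypothetical, and the sketch assumes quality rises with data rate on each curve.

```python
import math


def _log_rate_at(points, quality):
    """Interpolate log(bitrate) at `quality`; points are (quality, log_rate)."""
    for (q0, l0), (q1, l1) in zip(points, points[1:]):
        if q0 <= quality <= q1:
            if q1 == q0:
                return l0
            return l0 + (l1 - l0) * (quality - q0) / (q1 - q0)
    raise ValueError("quality outside the measured range")


def _mean_log_rate(points, lo, hi):
    """Average log(bitrate) over [lo, hi] using the trapezoid rule."""
    knots = sorted({lo, hi, *(q for q, _ in points if lo < q < hi)})
    area = 0.0
    for a, b in zip(knots, knots[1:]):
        area += 0.5 * (_log_rate_at(points, a) + _log_rate_at(points, b)) * (b - a)
    return area / (hi - lo)


def bd_rate(anchor, test):
    """Average bitrate change of `test` versus `anchor` at equal quality.

    Curves are lists of (bitrate, score). A result of -0.18 means the test
    codec needs about 18 percent less data rate (82 percent of the anchor).
    """
    a = sorted((q, math.log(r)) for r, q in anchor)
    t = sorted((q, math.log(r)) for r, q in test)
    # Only compare over the quality range that both curves cover.
    lo = max(a[0][0], t[0][0])
    hi = min(a[-1][0], t[-1][0])
    if hi <= lo:
        raise ValueError("curves do not overlap in quality")
    diff = _mean_log_rate(t, lo, hi) - _mean_log_rate(a, lo, hi)
    return math.exp(diff) - 1.0


# Hypothetical curves: the test codec hits each score at half the bitrate.
anchor_curve = [(1000, 80.0), (2000, 88.0), (4000, 94.0), (8000, 97.0)]
test_curve = [(500, 80.0), (1000, 88.0), (2000, 94.0), (4000, 97.0)]

result = bd_rate(anchor_curve, test_curve)
print(f"BD-Rate (piecewise-linear approximation): {result:+.1%}")
```

Because the comparison is limited to the quality range the two curves share, the result describes only that range, which is why the data rates chosen for the test matter.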
As video quality measurement has become more important than ever, I hope this backgrounder offers you some guidance in what metrics to use, and when to use them.
[This article appears in the October 2018 issue of Streaming Media magazine as "Evaluating Codec Evaluations."]