AV1 Beats VP9 and HEVC on Quality, if You've Got Time, says Moscow State
According to Moscow State University (MSU), AV1 is the highest-quality codec available, besting both HEVC and VP9—when considering quality only, and not encoding speed. More interesting is that in normal operating modes, VP9 produced higher quality than HEVC. These are just two of several compelling findings from the recently completed MSU 2017 Codec Comparison Report. As you'll read below, MSU also launched a new service for subjectively comparing videos and still images online, and made some interesting observations about the wonkish topic of tuning for SSIM when using objective tests to measure codec quality.
By way of background, MSU has produced codec quality comparisons since 2006, and released its first HEVC comparison in 2015. As in previous years, MSU releases the report in different versions, each testing different encoders using different files and different testing methods. This year, MSU released the report in five parts, all free except for the Pro version that costs $950. Figure 1 shows all the reports, which are available for download here.
Figure 1. Versions of the MSU report
AV1 Reigns, But Slowly
We grabbed the codec comparisons in the lead paragraph from Part 5, with the summary chart shown as Figure 2. Here you see AV1 producing the same quality as x264 at 55% of the data rate, with x265 running in three-pass and two-pass Placebo mode at 67% and 69% of the data rate, respectively. No producer uses Placebo mode for x265, though it's certainly fair here in comparison to AV1.
Specifically, here's what the report states regarding encoding speed: "AV1 encoder has extremely low speed—2500-3000 times lower than competitors. X265 Placebo presets (2 and 3 passes) have 10-15 times lower speed than the competitors." While MSU observes that the AV1 encoder hasn't been optimized, these differences indicate that AV1 has quite a steep hill to climb to become usable. With its launch imminent, we'll soon see.
Figure 2. AV1 produced the highest quality output, but you'll be waiting for a long, long time.
MSU encoded VP9 using the --good option, the second-slowest setting, which is what most producers actually use for delivery. Here, VP9 proved slightly better than x265 two-pass mode using the Veryslow preset, which is still slow but commercially reasonable. For perspective, MSU always consults with codec vendors when formulating its encoding parameters, and will use parameters supplied by the vendors when they choose to provide them. So roll that one around in your brain: In an extensive (31 HD videos) head-to-head comparison by an independent third party using settings supplied by the vendors, VP9 produced higher quality than HEVC.
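To put these settings in concrete terms, the commands below sketch roughly comparable invocations: vpxenc with --good for VP9, and the x265 CLI in two-pass mode at the Veryslow preset. The file names and the 4Mbps target are illustrative assumptions, not MSU's exact command lines, which are documented in the report itself.

```shell
# VP9 via vpxenc in --good mode (the setting MSU used), two-pass,
# 4000 kbps target; input/output names and bitrate are illustrative.
vpxenc --codec=vp9 --good --cpu-used=1 --passes=2 \
       --target-bitrate=4000 -o vp9_out.webm source_1080p.y4m

# x265 two-pass at the Veryslow preset; the stats file links the passes.
x265 --preset veryslow --pass 1 --bitrate 4000 --stats x265_2pass.log \
     source_1080p.y4m -o /dev/null
x265 --preset veryslow --pass 2 --bitrate 4000 --stats x265_2pass.log \
     source_1080p.y4m -o x265_out.hevc
```

Note that vpxenc also exposes --cpu-used to trade speed for quality within the --good deadline; lower values are slower and higher quality.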
The only caution is that MSU drew these quality-related conclusions using the YUV-SSIM quality metrics, not the subjective tests discussed below. As you'll see from reading the final section, it's tough to have a lot of confidence in these results, at least at the low end of the data rate spectrum.
Subjective Results via Subjectify.us
Significantly, this was the first report released by MSU that included subjective results, which MSU garnered via its newly launched service, Subjectify.us. As shown in Figure 3, Subjectify is a service that allows customers to upload alternative versions of still-image or video processing output for subjective comparison by users recruited by the service. Users get paid for each comparison, with frequent checks to ensure that they're actually studying the samples.
For example, each test run of ten samples might include two comparisons of original videos and highly compressed samples, where the original video should win every time. If a user chooses wrong on these tests, it's assumed that their results are invalid, so they are excluded from the sample and their services are terminated.
Figure 3. MSU's new service, Subjectify.us, could revolutionize subjective comparisons.
For the subjective comparisons included in the report, MSU collected 11,530 comparisons from 325 unique participants and converted their responses to subjective scores. The MSU team used these scores to compute final average bitrate saving scores (similarly to the method used in their objective report). Figure 4 shows how the subjective rating impacted the overall scores (including speed) for the tested HEVC codecs.
Figure 4. Overall scores (quality and speed) for the tested HEVC codecs
Subjective tests are time-consuming and expensive to produce, yet really are the gold standard. In this regard, Subjectify may be a great alternative for researchers and producers seeking to choose the best codec or best encoding parameters.
Getting Wonky with --tune ssim
The final significant finding related to a wonkish encoding setting called --tune SSIM. By way of background, x264 and x265 developers have long argued that certain encoding techniques used by the codec to improve subjective quality as viewed by human eyes will result in lower scores when measured by objective metrics like SSIM and PSNR. So if you're encoding for objective comparisons, these developers recommend that you use "tuning" options that disable these adjustments.
Here's the recommendation from the x265 documentation page: "The psnr and ssim tune options disable all optimizations that sacrifice metric scores for perceived visual quality (also known as psycho-visual optimizations). By default x265 always tunes for highest perceived visual quality but if one intends to measure an encode using PSNR or SSIM for the purpose of benchmarking, we highly recommend you configure x265 to tune for that particular metric."
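In practice, the tuning option is a single flag. The commands below are a hedged sketch (file names and the 4Mbps bitrate are illustrative, not MSU's exact settings):

```shell
# x265 tuned for SSIM: disables psycho-visual optimizations so that
# SSIM scores reflect the encoder's best effort on that metric.
x265 --preset veryslow --tune ssim --bitrate 4000 \
     source_1080p.y4m -o tuned_out.hevc

# The same idea in x264; as noted below, --tune ssim also switches
# x264's adaptive quantization from --aq-mode 1 to --aq-mode 2.
x264 --preset veryslow --tune ssim --bitrate 4000 \
     -o tuned_out.264 source_1080p.y4m
```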
Accordingly, if you tune for SSIM, you would expect lower subjective scores, because tuning disables optimizations designed to improve perceived visual quality. However, via Subjectify, MSU found just the reverse: the tuned output of the x265 and x264 codecs displayed much higher quality than the untuned versions that included these psycho-visual adjustments (Figure 5).
Figure 5. Tuning for SSIM actually improved subjective quality in MSU tests.
MSU attempted to reconcile these results by attributing them to the overall low bitrates being tested, which ranged from 1Mbps to 4Mbps for 1080p video. MSU contacted developers from x264 who responded:
If you wanted to check psychovisual optimizations of x264 and especially psy-rd, then IMHO 1–4 Mbps is a very low bitrate for Full HD video encoding with it. At low bitrates it tends to produce ringing/blocking artifacts, which lower subjective quality. So, psy-rd is supposed to be used only with high-bitrate encodes, where it improves sharpness and ringing artifacts aren't visible.
Also --tune ssim changes --aq-mode from 1 to 2. And --aq-mode 2 needs less tweaking for the source owing to its auto-strength component, while --aq-mode 1 may need --aq-strength tweaking for the source. When tweaked correctly it can produce higher quality than --aq-mode 2, but this may need per-source tweaking.
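The developer's second point maps to two concrete alternatives, sketched below for x264. The bitrate, file names, and the 0.8 strength value are illustrative assumptions; --tune ssim selects --aq-mode 2 with automatic strength, while the default path uses --aq-mode 1, whose --aq-strength may need per-source adjustment.

```shell
# Option A: tune for SSIM; this sets --aq-mode 2, which auto-adapts
# its strength and needs less per-source tweaking.
x264 --preset veryslow --tune ssim --bitrate 4000 \
     -o aq2_out.264 source_1080p.y4m

# Option B: keep the default --aq-mode 1 but hand-tune --aq-strength
# per source (default is 1.0; 0.8 here is purely illustrative).
x264 --preset veryslow --aq-mode 1 --aq-strength 0.8 --bitrate 4000 \
     -o aq1_out.264 source_1080p.y4m
```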
The problem is that 4Mbps isn't really all that low for 1080p video, leaving those attempting to compare codecs with less than clear direction. For example, the MSU tests included in Part 5 ranged from under 2Mbps to over 18Mbps. At what data rate should researchers start to apply SSIM tuning? When, if ever, does it stop helping? Beyond these questions, the second comment from the x264 developer indicates that per-source tweaking may be required for optimal results when tuning, adding another challenging variable to the comparison process that objective metrics are designed to simplify.
Basically, the MSU results bring into question the validity of using SSIM or PSNR scores to compare codecs along a broad range of data rates like those typically used to create a rate distortion curve. It may be that other, more advanced metrics, like Netflix's VMAF, avoid these issues, or that subjective comparisons are the only way to avoid the tuning vs. non-tuning issue. In this regard, we hope to review Subjectify.us within the next few months.