Moscow State's Dr. Dmitriy Vatolin Talks Codecs and Quality
Dr. Dmitriy Vatolin is the Head of the Moscow State University Graphics and Media Lab, which is the developer of the Video Quality Measurement Tool (VQMT), the developer of the crowd-sourced video rating site Subjectify.us, and the publisher of an increasing array of codec and encoder comparison reports over the last 18 years. As such, he has unparalleled insights into objective and subjective metrics and codec quality. We asked him to comment on a variety of topics, including the accuracy of objective metrics and the status of VMAF hacking. We're pleased to publish his responses.
Streaming Media: You've performed a lot of metric-based codec studies, many with subjective verification. What are your observations regarding how accurately the various metrics predict subjective scoring?
Vatolin: The measurement of subjective quality critically depends on the size and quality of the dataset. Now we have about 3,000 sequences compressed by different codecs with subjective scores, and we continue to build up this dataset quite intensively (thanks to subjectify.us). In the next 3 months we are planning to publish two benchmarks on videoprocessing.ai separately for Full Reference and No-Reference metrics. We also collect sequences with different artifacts and processing methods (for example, Super-Resolution), which we also use to evaluate metrics.
We predict that in the coming years Super-Resolution will be used together with codecs. Also, these works allow us to examine practical approaches for evaluating neural network codecs, which have already started to appear (and with which we have already started to work).
For those who are interested in current results, I would recommend our paper "Objective video quality metrics application to video codecs comparisons: choosing the best for subjective quality estimation," published on arXiv in August. The paper contains many graphs, including an evaluation of different VMAF calculations (in Figure 1, for example, you can see the significantly worse quality of VMAF NEG):
Figure 1. The correlation of different VMAF calculations to subjective score.
In upcoming benchmarks, we plan to incorporate a lot of metrics and their calculation options. Interestingly, we notice a big drop in the correlations of old metrics and VMAF NEG on new codecs like AV1, particularly relative to the older codecs. You can see this in Figure 2, where the predictive accuracy of PSNR, MS-SSIM, and SSIM drops significantly with AV1 yet remains solid with HEVC. It seems that with the widespread use of AV1, the use of PSNR and SSIM will no longer be so common.
Figure 2. The predictive accuracy of older metrics drops precipitously with AV1.
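For readers who want to see what "correlation with subjective scores" means in practice, here is a minimal sketch that computes a Spearman rank correlation between metric outputs and mean opinion scores. It assumes rank correlation is the measure of interest; the interview does not say which coefficient the figures use. The scores and the names `metric_a` and `metric_b` are hypothetical illustration data, not results from the MSU studies.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical subjective scores (MOS) for five encoded clips.
mos = [1.8, 2.6, 3.1, 3.9, 4.5]
# Hypothetical metric outputs for the same clips.
metric_a = [31.0, 48.0, 60.0, 77.0, 92.0]  # orders the clips exactly like MOS
metric_b = [30.2, 34.9, 33.0, 36.5, 36.1]  # swaps two pairs of clips

print(round(spearman(mos, metric_a), 3))  # 1.0
print(round(spearman(mos, metric_b), 3))  # 0.8
```

A metric whose correlation drops on a new codec, as described above, is one whose ordering of clips starts to diverge from the viewers' ordering in this way.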
Streaming Media: How much has the situation changed for the better for No-Reference metrics, and how applicable are they in practice today?
Vatolin: Recently, we have seen the biggest growth of NR metrics, many of which are replacing the classic FR metrics of years past. The result of the MDTVSFA is particularly impressive (Figure 3).
Figure 3. How full and no-reference metrics align with subjective ratings.
However, you must consider that the stability of NR metrics is lower. We also need to see how they behave for different codecs, for example, how much their correlations will differ for H.264 and VVC. Moreover, the selection of video sequences, preprocessing methods and other factors is very important. We plan to conduct an in-depth analysis of this topic in a corresponding benchmark.
AVQT is a metric from Apple, presented in May 2021; TENCENT is a metric from Tencent Holdings; and VMAF, as you know, is from Netflix. The development of good metrics requires serious investment in the size and quality of the dataset, and such a dataset is very difficult to put in the public domain because of copyright issues (we know of several cases in which large datasets had to be removed from the public domain due to legal problems).
Streaming Media: You mentioned your research in Super-Resolution. What is Super-Resolution and how far is the technology there from real everyday use?
Vatolin: Super Resolution is the process of producing a high-resolution image or video from a lower resolution source. It is already in use in many applications.
We work with video, and we are most interested in practical SR methods for video. SR methods can be roughly divided into "beauty SR" (95% of methods) and "restorative SR." Google implemented restorative SR for video (with block-based Motion Estimation and other techniques) in 2018 to improve photos in its Pixel 3. Since photo quality is now 50% of a smartphone's value, the other manufacturers are now doing the same.
Actually, video SR in smartphones is already here, but currently only for single frames. And even though today there are limitations such as extensive power usage and lack of computing power, in the near future, we will most likely see the full use of these algorithms on entire videos. In addition, the share of 4K TVs and 2K+ smartphone displays is growing steadily.
We have published three video SR benchmarks during the last six months: SR for video from cameras (handling of noise and artifacts); measurement of SR+codec pairs (for example, SR works better with H.264 than with AV1, and I think there will be a lot of interesting features in new TVs and tablets soon); and finally general upscaling (with SR methods at the top) for different types of content. There are already 640 public repositories on GitHub on the topic of Super-Resolution, and new ones appear every 1-2 days. We plan to accurately evaluate all the most interesting contenders. The current results are already quite encouraging. In particular, we have measured subjective quality (thanks again to subjectify.us), and we see quite an optimistic picture there (Figure 4).
Figure 4. The accuracy and speed of super-resolution models.
As you can see, there is a significant problem with metrics in this area. PSNR-oriented methods tend to blur the picture, which is bad for visual quality (Figure 5). We can even observe a negative correlation when using PSNR paired with codecs. Our new ERQA metric (try “pip install erqa”) looks quite promising, and we are currently working on improving it for SR.
Figure 5. Metric performance for super-resolution tasks.
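The remark that PSNR-oriented methods tend to blur the picture can be illustrated with a toy calculation. The sketch below uses a hypothetical one-dimensional "texture" of alternating black and white pixels and two hypothetical reconstructions: a flat gray signal (all detail blurred away) and a perfectly sharp copy that is misaligned by one pixel. It only shows how the standard PSNR formula behaves on this constructed case; it is not taken from the MSU benchmarks.

```python
import math


def psnr(reference, candidate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length signals."""
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


# Hypothetical ground truth: a fine texture of alternating black/white pixels.
reference = [0, 255] * 4
# Reconstruction 1: all detail averaged away (maximally blurred).
blurred = [127.5] * 8
# Reconstruction 2: full detail kept, but shifted by one pixel.
shifted = [255, 0] * 4

psnr_blurred = psnr(reference, blurred)
psnr_shifted = psnr(reference, shifted)

print(f"blurred: {psnr_blurred:.2f} dB")  # blurred: 6.02 dB
print(f"shifted: {psnr_shifted:.2f} dB")  # shifted: 0.00 dB
```

The flat gray signal scores about 6 dB higher than the sharp but misaligned one, even though a viewer would likely prefer the latter. A method trained to maximize PSNR is pushed toward the blurred answer whenever it is unsure where the detail belongs.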
Streaming Media: You were one of the first to identify the issue with VMAF hacking. What's the status of that?
Vatolin: This work has been successfully continued. Last summer, Netflix published VMAF NEG ("neg" stands for "no enhancement gain"). This summer we published "Hacking VMAF and VMAF NEG: vulnerability to different preprocessing methods," an article on how you can hack VMAF NEG with other enhancements. Unfortunately, “no enhancement gain” VMAF has not actually been achieved yet. The biggest problem in the development of such metrics is that when the hack-resistance of the metric increases, its quality in terms of correlation significantly decreases, as shown in Figure 6.
Figure 6. Prediction accuracy of VMAF NEG is much lower than other VMAF models.
We can see in Figure 7 that VMAF NEG performs worse than MS-SSIM, while having a higher computational complexity.
Figure 7. MS SSIM is more accurate than VMAF-NEG.
Note that in the latest version of VQMT the fast version of MS-SSIM on CPU is faster than the version of VMAF on OpenCL/GPU. It is not obvious why we should measure the VMAF-NEG value at all if the difference in speed on the CPU is more than 22 times. However, we should not forget why we brought up this topic in the first place. If we count the values of only those metrics that are easily hacked, our comparison cannot be considered objective. There is a serious question: Why count VMAF if its correlation is about the same as that of MS-SSIM, while it is 3.5 times slower on the GPU and 22 times slower on the CPU (Figures 8 and 9)?
Figure 8. MS SSIM is much faster than VMAF-NEG on a GPU.
Figure 9. MS SSIM is much, much faster than VMAF-NEG on a CPU as well.
That said, our preliminary research shows that there are other ways to increase the value of VMAF. At the moment, we have shown that DISTS, LPIPS, and MDTVSFA (the current NR benchmark leader!) metrics, which are gaining popularity, are also not resistant to hacking. We plan to analyze the resistance of the metrics separately in new metric benchmarks.
Streaming Media: What are your bottom-line recommendations regarding when to use VMAF and how?
Vatolin: First of all, you have to be extremely careful when you see VMAF data without specifying how and on what videos it was calculated. Our measurements show that you can make a huge difference by simply selecting the "right" video sequences for compared codecs (Figure 10).
Figure 10. VMAF version accuracy varies by content type.
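To make the point about sequence selection concrete, here is a small sketch with entirely hypothetical per-sequence scores for two hypothetical encoders, `codec_a` and `codec_b`. It shows how the average score, and therefore the apparent winner, can flip depending on which clips are included in a comparison.

```python
# Hypothetical per-sequence metric scores (higher is better). These numbers
# are invented for illustration and do not come from any real comparison.
codec_a = {"sports": 82, "animation": 95, "screen": 96, "nature": 78, "concert": 75}
codec_b = {"sports": 88, "animation": 91, "screen": 92, "nature": 85, "concert": 83}


def mean_score(scores, sequences):
    """Average score over the chosen subset of sequences."""
    return sum(scores[name] for name in sequences) / len(sequences)


all_sequences = list(codec_a)
cherry_picked = ["animation", "screen"]  # the "right" clips for codec_a

full_a = mean_score(codec_a, all_sequences)
full_b = mean_score(codec_b, all_sequences)
picked_a = mean_score(codec_a, cherry_picked)
picked_b = mean_score(codec_b, cherry_picked)

print(f"all clips:     A={full_a:.1f}  B={full_b:.1f}")      # A=85.2  B=87.8
print(f"cherry-picked: A={picked_a:.1f}  B={picked_b:.1f}")  # A=95.5  B=91.5
```

On the full set the second encoder leads, while on the two selected clips the first one does. This is why a reported score is hard to interpret without the list of sequences it was computed on.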
If you perform measurements for yourself, you need to look at a lot of things including different behavior of different metrics on different sequences (the detailed description is obviously beyond the scope of this interview, and we are now actively working on this topic). In any case, the complexity of accurate measurements has unfortunately increased significantly recently.
Streaming Media: I know you've looked at multiple AV1 codecs. Without naming names, have you ever suspected a codec of attempting to hack a better score?
Vatolin: Google programmers added tune_vmaf.c to libaom sources (which implements a method we published two years ago) more than a year ago :). In general, I would not want to disclose names, but again, we began to conduct in-depth research in this area when we encountered successful hacking of VMAF metrics in our comparison. And it is obvious that with the advent of neural network pre- and post-processing, as well as neural network codecs, the problem will become significantly more complicated.
Streaming Media: What about the integration of standard-based metrics like ITU-T Rec. P.1204 into VQMT?
Vatolin: First of all, we would like to test its resistance to hacking (joking). In all seriousness, this metric will be included in our Full Reference benchmark very soon, so you will be able to see results. We have already calculated its correlations and they are lower than expected. We would like to see ITU-T Rec. P.1204 tested by other researchers.
Streaming Media: I've been a big fan of the low-frame values in VQMT as a measure of the potential for transient quality problems. A recent LinkedIn Comment asked "perhaps is it worth to replace Low-frame VMAF with 5% percentile to eliminate an impact of outliers (or 'black swan' - in statistics jargon - very rare event). I'm not sure 5% is the right number, but is this a better approach? If so, what might be the right number and is this on the VQMT roadmap?"
Vatolin: Currently, VQMT can include output of a 95% confidence interval for VMAF values, i.e., percentiles of 2.5% and 97.5%. These are calculated by applying a bunch of models and obtaining statistical information from them. Adding any other percentiles is not complicated. We are now considering adding a setting that will allow you to adjust the length of the confidence interval and set any value you want. For a more detailed statistical analysis, you can use the data from the models on which the confidence interval is calculated. In VQMT it will be possible to include their output by setting "Per-model values."
Research into the adequacy of a particular percentile and its correspondence to the MOS value is a very interesting topic. At the moment we have no ready answer as to what percentage is best to use. But 2.5% seems to be a really low value, which may be affected by outliers. The calculations in VMAF v0.6.2 and v0.6.3 combine about 20 models. Under these circumstances, the 2.5 percentile gives a high weight to the model with the lowest result. This model could be an outlier. Switching to the 5% value should smooth out any inadequate results.
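The effect of the percentile choice with roughly 20 models can be checked with a short calculation. The sketch below assumes percentiles are computed by linear interpolation between the two nearest sorted values (the default behavior of common numerical libraries); whether VQMT or VMAF uses exactly this rule is an assumption. The 20 model scores are hypothetical, with one deliberately low outlier.

```python
def percentile_weights(n, q):
    """For n sorted values and percentile q (0-100), return the two
    neighboring indices and the weight given to the upper one."""
    position = (n - 1) * q / 100.0
    lower = int(position)          # floor, since position is non-negative
    upper = min(lower + 1, n - 1)
    upper_weight = position - lower
    return lower, upper, upper_weight


def percentile(sorted_values, q):
    """Linearly interpolated percentile of an already sorted list."""
    lower, upper, w = percentile_weights(len(sorted_values), q)
    return sorted_values[lower] * (1 - w) + sorted_values[upper] * w


# Hypothetical scores from 20 models: one outlier at 70, the rest at 90.
scores = sorted([70.0] + [90.0] * 19)

for q in (2.5, 5.0):
    lower, upper, w = percentile_weights(len(scores), q)
    print(f"{q}% percentile: weight on lowest model = {1 - w:.3f}, "
          f"value = {percentile(scores, q):.2f}")
```

Under this interpolation rule the lowest of 20 models carries a weight of 0.525 at the 2.5% percentile but only 0.05 at the 5% percentile, so in this example the lower bound moves from 79.5 to 89.0 once the outlier stops dominating.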
Streaming Media: A bit off topic, but I've noticed that many "academic" codec comparisons in white papers found AV1 performance much lower than HEVC, where many of yours have the opposite findings. What's your explanation as to the difference?
Vatolin: To assess codec comparisons, you have to consider three main things: which codec is being compared, with which settings, and on which sequences. We receive codecs and settings from developers, which is very important. Many well-optimized codecs are not freely available. For example, this year's Tencent AV1 performed noticeably better than libaom AV1. The situation is the same with VVC: it's easy to show in a comparison of free versions that VVC is substantially worse than libaom AV1, but a comparison with commercial versions of the codecs shows a different picture (Figure 11).
Figure 11. Codec performance in MSU 2021 codec comparison shows that VVC has great potential.
Also, attentive readers of our comparisons know that even for the perfectly tuned x264, developers long ago sent us encoding recipes that delivered better results than the standard presets. Why this happens is a separate question, but it is an easily verifiable fact. In general, the choice of good presets is also a big, complex subject, and we have several publications on it and the website Efficient Video Transcoding.guru, where we show how to choose presets that are better than the standard ones and even better than the developers' presets.
Finally, you can read about the difference between our datasets and some academic datasets in each comparison. In particular, we focus on sequences that are simpler for codecs, but more common in terms of complexity in real life. And thanks to the fact that we receive a lot of criticism and suggestions directly from the developers every year (and we are constantly implementing them) there is a reason to believe that our results are closer to real life.