December 1, 2017
By Jan Ozer Contributing Editor
Featured Articles

How to Choose and Use Objective Video Quality Benchmarks

In contrast, PSNR measures decibels on a scale from 1–100. Though these numbers are not universally accepted, Netflix has posited that values in excess of 45dB yield no perceivable benefits, while values below 30 are almost always accompanied by visual artifacts. These observations have proven extremely useful for my work, but only when comparing full-resolution output to full-resolution source. When applied to lower rungs in an encoding ladder, higher numbers are better, but lose their ability to predict a subjective rating. For example, for 360p video compared to the original 1080p source, you’ll seldom see a PSNR score higher than 39dB, even if there are no visible compression artifacts.

Though SSIM, and particularly Multi-Scale SSIM (MS SSIM), are more accurate metrics than PSRN, it’s scoring system anticipates a very small range from -1 to +1, with higher scores better. Most high-quality video is around .98 and above, which complicates comparisons. While you can mathematically calculate how much better .985 is than .982, at the end of the day, it still feels irrelevant.

VMAF scores also rank from 1–100. While higher scores are always better, individual scores, like a rating of 55 for a 540p file, have no predictive value of subjective quality. You can’t tell if that means the video is perfect or awful. That said, when analyzing an encoding ladder, VMAF scores typically run from the low teens or lower for 180p streams, to 98+ for 1080p streams, which meaningfully distinguishes the scores. In addition, VMAF differences of 6 points or more equals a just-noticeable difference (JND), which is very useful for analyzing a number of encoding-related scenarios, including codec comparisons.

The scoring range of VMAF over the diverse rungs of the encoding ladder makes it attractive for choosing the best resolution/data rate streams in the ladder. In contrast, PSNR might range from 30–50dB, with the lower four rungs compressed between 30–37. This reduces its value as a predictor of the perceptible difference between these rungs.

Before choosing a metric, you should understand what the scores tell you, and make sure it tells you what you need to know.

Accessing the Metric

Don’t choose a metric without understanding how you’ll access it and how much it will cost to do so. In this section, I’ll briefly discuss the tools that can compute the metrics above, starting with FFmpeg, a free tool that can compute both PSRN and SSIM.

The Moscow State University (MSU) Video Quality Metric Tool (VQMT, $999 direct) supports a range of metrics like PSNR, SSIM, MS SSIM, and many others, including VMAF in version 10, which is now in beta. The top window in Figure 3 shows the VMAF score for two 1080p talking head files, one encoded at 4500Kbps, the other at 8500Kbps, with the top graph showing the entire file, the lower graph the highlight region on the left in the upper graph. The scores are very close, indicating that additional 4Mbps spent on the highest-quality stream is a waste.

Figure 3. VMAF comparisons in the Moscow State University VQMT

You can drag the playhead and visualize any frame in the video, either side-by-side, as shown in the bottom of Figure 3, or one atop the other. This latter view makes it simple to switch between the two encoded files and the original, which is better for visualizing minor differences like the color shift mentioned above. VQMT offers perhaps the best interface of any tool for making A/B comparisons between two encoded files (Figure 3), and its batch operation is very flexible.

On the downside, VQMT can only compare files of identical resolution, so if you’re analyzing lower-resolution rungs on your encoding ladder, you’ll have to manually scale them to full resolution first, which takes time and lots of hard disk space. In the beta, the implementation of VMAF is painfully slow, literally using only one core of my 40-core HP Z840 workstation, though hopefully this will improve in the final shipping product. MSU offers a free trial that only works with files smaller than 720p, but it’s a great way to get familiar with the program. We reviewed an older version of VQMT.

Hybrik Media Analyzer

For high-volume analysis, the Hybrik Media Analyzer (Figure 4), which can compute PSNR, SSIM, and VMAF, is hard to beat. As an example, for a recent talk at Streaming Media West, I evaluated four per-title technologies with 15 test files and a seven-rung encoding ladder. I had to run each system twice, once to find the baseline, the other to deploy per-title encoding. This means I had to compute PSNR and VMAF about 840 times each and get the results copied into a spreadsheet.

Figure 4. Hybrik Media Analyzer

You can drive operation via the JSON API, of course, but the UI is even simpler. You load the seven encoded files at once, choose the source file and the tests to run, and the cloud encoder takes it from there, performing all necessary scaling automatically. That’s one input task, seven outputs. Once the analysis is complete, you can export the results into a CSV file and import that into your spreadsheet, reducing 30 or so copy-and-paste operations (resolution, data rate, PSNR, VMAF score for each rung) into three or four, saving time and reducing the potential for errors. Hybrik is also much more CPU efficient than MSU VQMT when running VMAF, so it can make more efficient use of all cloud instances.

The only problem is that Hybrik doesn’t offer analysis-only pricing, and the minimum charge for accessing the system is $1,000/month for up to 10 simultaneous cloud instances running the AWS-based system. If this cost isn’t prohibitive, or if Hybrik ever decides to offer analysis-only pricing, the service could be a lifesaver for compressionists on a deadline or those running high-volume tests.

Proprietary Tools

Most of the other metrics are available only in proprietary tools like the aforementioned Aurora, which offers far more than video quality metrics and is available in several editions ranging in price from $4,850 to $33,000. For all versions, TekMOS is a $4,000 option. The software runs on Windows Server 2012 R2 or later.

You can run Aurora via the API or UI. Either way, to analyze a file, you choose the file and a template with selected checks and verifications. TekMOS results are given in both numerical and graphic format as you can see in Figure 5, and tiling, noise, and blurriness can be shown separately to assist in score interpretation.

Figure 5. TekMOS results show an average score of 2.894, with blurriness and tiling the most significant issues.

Tektronix also sells a line of full-reference picture-quality analyzers with the PQR and ADMOS metrics discussed above, as well as others. Prices for these systems start at around $18,400, though you’ll need to spend another $9,180 for essential features like batch operation.

The SSIMPLUS algorithm is used throughout the SSIMWave product line, with the SSIMPLUS Analyzer providing the broadest analysis functions. The Analyzer is a very flexible product that can measure files with different resolutions and frame rates than the original, and you can compute scores for multiple devices simultaneously. In addition to text-based output files, the software outputs quality maps that you can use to compare different files. In addition to a Windows GUI for both batch and single file operation, the Analyzer is available as Linux, Mac, and Windows SDKs and command line interfaces. The company didn’t respond to our request for pricing information. We reviewed an older version of the Analyzer.

Finally, though I’ve never personally tested its products, Video Clarity sells a range of hardware, software, and cloud-based analysis tools with both full reference and non-reference video quality metrics. If you’re considering an investment in video quality control, be sure to check out Video Clarity as well.

Summing Up

In my experience, the more expensive the tool, the more idiosyncratic it is to operate. It’s impossible to get a sense for a tool or a metric by reading a spec sheet; you have to spend hours working with the metric and verify its results subjectively, many, many times, until you have confidence that the numerical scores represent real results. This may change depending upon the nature of the task. I would never spend big dollars on any video-quality analysis tool without a trial.

You may also find that different metrics call to you and that your preferences evolve over time or change by the project. In my journey with objective metrics, I started with an affinity for the Video Quality Metric (VQM), a basic metric that proved superior to PSNR and SSIM for identifying differences between the codecs I was analyzing for a consulting project. However, the raw score conveyed nothing about how a subjective user would rate the video. Plus, it was relatively unknown, so a VQM score meant nothing to clients or readers.

For more general work, I migrated to PSNR, which has easy-to-interpret scores and was universally known. Let’s face it, PSNR is still useful in some applications, as evidenced by Netflix using PSNR in its per-title encoding engine until replaced by VMAF in mid-2016, and continuing to cite PSNR results in most codec comparisons, along with VMAF, of course.

A later project involved choosing configurations for mobile devices, making SSIMPLUS a natural, since it has very easy-to-use device specific presets. Finally, once I started analyzing encoding ladders for clients, I began using and liking VMAF more and more; it’s accessible and was designed specifically for working with encoding ladders. Certainly the fact that it was developed by Netflix gives VMAF tremendous technical credibility.

Regarding VQM, better is always better, though something is almost always better than nothing. So if you have access to VMAF or some of the higher-quality, perceptual-based metrics, use those. If not, PSNR, SSIM, or MS SSIM should perform well for tasks such as evaluating encoding parameters like encoding preset, keyframe interval, bitrate control technique, and the like, or comparing the quality of like resolution rungs on your encoding ladder, as shown in Figure 3 with VMAF. I would be less confident in these metrics when comparing encoding tools and wouldn’t use them without confirming numbers from another metric when comparing codecs.

[This article appears in the November/December 2017 issue of Streaming Media Magazine as "Choosing and Using Objective Quality Benchmarks."]