How to Choose and Use Objective Video Quality Benchmarks
Whether you know it or not, many of the videos that you watch have been configured using video quality metrics. Oh, you disagree? Well, have you have watched Netflix recently? Over the last 2 years or so, Netflix’s encoding ladders have been driven by the company’s Video Multimethod Assessment Fusion (VMAF) metric and, before that, Peak Signal to Noise Ratio (PSNR). Not a Netflix customer? Well, how about YouTube? YouTube uses a neural network based upon constant rate factor encoding, which itself is driven by an internal video quality metric.
Simply stated, video quality metrics attempt to predict how a subjective viewer would rate a particular video, and metrics are comparatively rated based upon the accuracy of these predictions. Of course, there are many purists who insist that subjective comparisons are the only valid technique for gauging quality, and indeed, properly administered subjective tests are the gold standard.
However, when you consider that 400 hours of video are uploaded to YouTube each minute, you can appreciate that the service has a strong need to encode its streams as efficiently as possible and a total inability to deploy humans to make it happen. Even Netflix, with a comparatively paltry 1,000 hours of new content in 2017, can’t use human eyes to create the customized encoding ladders for each video. For both companies, and many others, objective quality metrics are essential.
The bottom line is that if you’re in charge of encoding for your organization, and you’re not using objective video quality metrics in one form or another, you’re behind the curve. Fortunately, you’re also in the right place. In this article, I’ll provide an overview of what video quality metrics are and how they work, introduce you to the most common tools for applying them, and tell you how to choose the best metric and tool for your needs.
What Metrics Measure (And How)
You’ve probably heard of metrics like PSNR, Structural Similarity index (SSIM), and perhaps even Netflix’s VMAF. To understand how they differ, it’s useful to understand how each came about and what each measures.
The first class of metrics are error-based. They compare the compressed image to the original and create a score that mathematically represents the differences between the two images, also called noise or error. The PSNR ratio is a good example. Metrics based upon this approach are simple and easy to compute, but scores often don’t correlate well with subjective ratings because human eyes perceive errors differently.
As an example, I was once testing an encoding tool, and the output files produced a dismal PSNR score. I played the compressed video several times and couldn’t see why. Then I compared the encoded image to the original and noticed a slight color shift that accounted for the poor score. During real-time playback without the original to compare, no viewer would have noticed the shift, so in that case, PSNR was a poor predictor of subjective performance.
Why do companies, including Netflix and Mozilla (relating to the AV1 codec), continue to publish PSNR results? First, because it’s the best-known metric, so the scores are easy to understand. Second, despite its age, PSNR continues to provide very useful data in a number of scenarios, some of which I’ll discuss below.
At a high level, perceptual-based models like the SSIM attempt to incorporate how humans perceive errors, or “human visual system models,” to more accurately predict how humans will actually rate videos. For example, according to Wikipedia, while PSNR estimates absolute errors, “SSIM is a perception-based model that considers image degradation as perceived change in structural information, while also incorporating important perceptual phenomena, including both luminance masking and contrast masking terms.” In other words, perceptual-based metrics measure the errors and attempt to mathematically model how humans perceive them.
Perceptual-based models range from simple, like SSIM, to very complex, like SSIMWave’s SSIMPLUS metric, or Tektronix’s Picture Quality Rating (PQR) and Attention-weighted Difference Mean Opinion Score (ADMOS). All three of these ratings can incorporate display type into the scoring, including factors like size, brightness, and viewing distance, which obviously impact how errors are perceived.
ADMOS also offers attention weighting, which prioritizes quality in the frame regions that viewers will focus on while watching the video. So, a blurred face in the center of the screen would reduce the score far more than blurred edges, while a purely error-based model would likely rate them the same.
While these metrics take years of research, trial and error, and testing to formulate, at the end of the day, they are just math—formulas that compare two videos, crunch the numbers, and output the results. They don’t “learn” over time, as do those metrics in the next category. In addition, depending upon the metric, they may or may not incorporate temporal playback quality into the evaluation.
Similarly, most of these metrics were developed when comparisons were full resolution compressed frame to full resolution original frame. The invention of the encoding ladder, and the decisions relating thereto, create a new type of analysis. For example, when creating the encoding ladder for a 1080p source video, you may compare the quality of two 1.5Mbps streams, one at 540p, the other at 720p. All metrics can compute scores for both alternatives; you simply scale each video up to 1080p and compare it to the source. But few of these older metrics were designed for this analysis. (More on this in a moment.)
MACHINE LEARNING AND METRIC FUSION
The final category of metrics involves the concept of machine learning, which is illustrated in Figure 1 from a Tektronix presentation on TekMOS, the company’s new quality metric. Briefly, MOS stands for mean opinion score, or the results from a round of subjective testing, typically using a rating from 1 (unacceptable) to 5 (excellent).
Figure 1. TekMOS metric and machine learning
In training mode, which is shown in the figure, the metric converts each frame into a set of numerical datapoints, representing multiple values such as brightness, contrast, and the like. Then it compares those values to over 2,000 frames with MOS scores from actual subjective evaluations, so that it “learns” the values that produce a good or bad subjective MOS score. In measurement mode, TekMOS takes what it learned from those 2,000-plus trials, inputs the numerical datapoints from the frame it’s analyzing, and outputs a MOS score.
Like the metrics discussed above, machine learning algorithms start with a mathematical model. However, it compares results with subjective MOS scores trains and fine-tunes the model so that it improves over time. Plus, the machine learning itself can be tuned, so one model could represent animations, another sports, and so on, allowing organizations to train the metric for videos most relevant to them.
Netflix’s VMAF is another metric that can be trained, using what’s called a support vector machine. Since the primary use for VMAF is to help Netflix produce encoding ladders for its per-title encoding, the Netflix training dataset includes clips ranging in resolution from 384x288 to 1080p at data rates ranging from 375Kbps to 20Mbps. Again, by correlating the mathematical result with subjective MOS scores, VMAF became much better at making the 540p vs. 720p decision mentioned above.
As the name suggests, VMAF is a fusion of three metrics, two that measure image quality and one that measures temporal quality, making it a true “video” metric. Similarly, Tektronix’s TekMOS metric includes a temporal decay filter that helps make the scoring more accurate for video. TekMOS also has a region of interest filter, which VMAF currently lacks. One huge benefit of VMAF is that Netflix chose to open source the metric, making it available on multiple platforms, as you’ll learn more about below.
Which Metric Is Best?
No article on metrics would be complete without scatter graphs like those shown in Figure 2, which were adopted slightly from Netflix’s blog post on VMAF. The scatter graph on the left compares the VMAF scores (left axis) with actual MOS scores (bottom axis). The graph on the right does the same for a different metric entitled PSNRHVS.
Figure 2. Scatter graphs comparing the metrics
If the scores corresponded exactly, they would all fit directly on the red diagonal line, though, of course, that never happens. Still, the closer to the line, and the tighter the pattern around the line, the more the metric accurately predicts human subjective scores. In this fashion, Figure 2 tells us that VMAF is a superior metric.
What’s interesting is that every time a metric is released, it comes with a scatter graph much like that shown on the left. SSIMPLUS has one, TekMOS has one, and Tektronix’s older metrics, PQR and ADMOS, had them as well. This is not to cast doubt on any of their results, but to observe that all of these metrics are highly functional and generally correlate with subjective ratings more accurately than PSNR.
However, accuracy is not the only factor to consider when choosing a metric. Let’s explore some of the others.
Referential vs. Non-Referential
One critical distinction between metrics is referential vs. non-referential. Referential metrics compare the encoded file to the original to measure quality, while non-referential metrics analyze only the encoded file. In general, referential metrics are considered more accurate, but obviously can be used in a much more limited circumstance since the source file must be available.
Non-referential metrics can be applied anywhere the compressed file lives. As an example, TekMOS is included in the Tektronix Aurora platform, an automated quality control package that can assess visual quality, regulatory compliance, packaging integrity, and other errors. Telestream subsidiary IneoQuest developed iQ MOS, a non-referential metric that can provide real-time quality assessments of multiple streams in the company’s line of Inspector products.
So when choosing a metric, keep in mind that it might not be available where you actually want to use it. Referential metrics are typically used where encoding takes place, where non-referential metrics can be applied anywhere the video on demand (VOD) file exists, or where a live stream can be accessed.
When choosing a metric, it’s important to understand exactly what the scores represent and what they don’t. For example, with the SSIMPLUS metric, which runs from 1–100, a score from 80–100 predicts that a subjective viewer would rate the video as excellent. These subjective ratings drop to good, fair, poor, and bad in 20-point increments. Most MOS-based metrics, including TekMOS, score like their subjective counterparts, on a scale from 1–5, with 5 being the best and 1 considered unacceptable. This type of scoring makes the results very easy to understand and communicate.
A working group overseen by the CTA is creating recommendations for measuring performance quality, and some of the biggest names in the industry are participating.
Latest codec quality comparison study finds AV1 tops on quality, but far behind on speed, and that VP9 beats out HEVC. Moscow State also launched Subjectify.us, a new service for subjectively comparing video quality.
How can publishers compare video quality at different resolutions? There's the theoretically correct answer and then there's how it's generally done.
Netflix's compact mobile download files look surprisingly great. Here's how video creators make their own low-bitrate files look just as impressive.
For both video encoding and quality of service, content providers need to make sure their viewers get the best experience possible. Here's how to select a quality control service.
Companies and Suppliers Mentioned