Video Quality Measurement Requires Objective and Subjective Tests
I spend a lot of time assessing video quality, sometimes to compare different codecs, but more frequently to identify the optimal encoding techniques. For example, does an I-frame interval of 3 seconds deliver substantially better quality than an interval of 1 second? How about VBR vs. CBR, or the very slow x265 preset compared to medium? The only way to know is to encode both ways and compare the results.
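As a rough illustration of "encode both ways," the sketch below builds two ffmpeg command lines that differ only in keyframe interval. The file names, bitrate, and frame rate are placeholders chosen for the example, and the helper function is hypothetical. It assumes an ffmpeg build that includes libx265.

```python
def build_x265_command(source, output, preset, keyframe_seconds, fps, bitrate="3M"):
    """Return an ffmpeg argument list for one x265 test encode.

    All argument values are illustrative assumptions, not recommended settings.
    """
    gop_frames = round(keyframe_seconds * fps)  # ffmpeg's -g is expressed in frames
    return [
        "ffmpeg", "-y",          # overwrite the output file if it exists
        "-i", source,
        "-c:v", "libx265",
        "-preset", preset,
        "-b:v", bitrate,
        "-g", str(gop_frames),   # maximum distance between keyframes
        "-an",                   # drop audio so only video is compared
        output,
    ]


# Two variants that differ only in I-frame interval (1 second vs. 3 seconds).
variants = [
    build_x265_command("source.mp4", "medium_1s.mp4", "medium", 1, 30),
    build_x265_command("source.mp4", "medium_3s.mp4", "medium", 3, 30),
]

for command in variants:
    print(" ".join(command))
```

Changing one setting at a time keeps any quality difference attributable to that setting. The encoder may still insert extra keyframes at scene changes, so the interval here is an upper bound rather than a fixed spacing.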
At a high level, there are two alternatives for assessing quality: objective and subjective comparison. Objective testing mathematically compares the source and encoded files and delivers a score for each. With subjective comparisons, you visually compare the encoded files. Each approach has its pros and cons, which I’ll discuss in this column.
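To make "mathematically compares" concrete, here is a minimal sketch of PSNR, one of the simplest objective metrics, computed for a single frame. It assumes 8-bit samples passed in as flat lists of pixel values. Real tools operate on decoded video frames and average across the whole file, which this example does not attempt.

```python
import math


def psnr(reference, encoded, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(encoded):
        raise ValueError("frames must be non-empty and the same size")
    # Mean squared error between the source and the encoded frame.
    mse = sum((r - e) ** 2 for r, e in zip(reference, encoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10 * math.log10(max_value ** 2 / mse)


# Toy 2x2 "frames": every encoded pixel is off by one level.
source_frame = [100, 100, 100, 100]
encoded_frame = [101, 99, 101, 99]
print(round(psnr(source_frame, encoded_frame), 2))  # 48.13
```

Higher values mean the encoded frame is numerically closer to the source. Whether that closeness matches what viewers see is a separate question.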
Let’s start with objective comparisons. There are multiple metrics, like peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), video quality metric (VQM), and SSIMPlus, which, as its name implies, claims to go beyond SSIM in what it can measure. Intuitively, the utility of each metric relates to its ability to predict how human eyes would evaluate the files, or the correlation with subjective results. A metric with a correlation of 100 percent would predict subjective results 100 percent of the time and would be incredibly useful. A metric with a correlation of 50 percent would predict the outcome half the time and therefore be useless.
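One common way to quantify how well a metric tracks viewer opinion is the Pearson correlation coefficient between metric scores and subjective ratings. The sketch below assumes that approach. The scores are invented for illustration and do not come from any real test.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical data: one metric score and one average viewer rating per clip.
metric_scores = [32.1, 35.4, 36.0, 38.7, 41.2]
viewer_ratings = [2.1, 3.0, 3.2, 3.9, 4.6]
print(round(pearson(metric_scores, viewer_ratings), 3))
```

A coefficient near 1.0 means the metric ranks the clips much as viewers did, while a value near 0 means the metric tells you little about what viewers will see.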
Beyond correlation, metrics are judged based upon their ability to meaningfully distinguish between the various alternatives. For example, a consulting project last summer involved comparing 3 HEVC codecs at 5 different configurations using 16 test files. I measured the files using the PSNR, SSIM, and VQM metrics with the Moscow State University Video Quality Measurement Tool (VQMT).
In PSNR testing, the difference between the highest- and lowest-rated codec was 2.35 percent, and there was no test case in which the difference was greater than 7.5 percent, an artificial threshold I used to flag significant differentials. With SSIM, the best-to-worst difference was 0.77 percent, and no test files varied by 7.5 percent or more. With VQM, the difference was 7.2 percent, and eight test cases broke the 7.5 percent threshold, including two at 21 percent and 34 percent. These differences were pure gold, particularly because the VQMT made it a simple matter to view the differences in the files and confirm the results. The VQM metric accurately predicted subjective results and meaningfully distinguished the alternatives, and it’s become my go-to metric.
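The threshold check described above can be sketched as follows. The column does not say how the percentage difference was calculated, so this example assumes it is the gap between the best and worst score relative to the worst score. The clip names and scores are made up.

```python
def percent_spread(scores):
    """Best-to-worst gap as a percentage of the lowest score (assumed formula)."""
    best, worst = max(scores), min(scores)
    return (best - worst) / worst * 100


def flag_significant(results, threshold=7.5):
    """Return the test cases whose spread meets or exceeds the threshold."""
    return [name for name, scores in results.items()
            if percent_spread(scores) >= threshold]


# Hypothetical scores for three codecs on two test clips.
results = {
    "clip_a": [38.0, 38.5, 38.9],  # codecs nearly tied
    "clip_b": [30.0, 33.0, 36.0],  # clear separation
}

for name, scores in results.items():
    print(name, round(percent_spread(scores), 2))
print(flag_significant(results))  # ['clip_b']
```

Flagged cases are the ones worth opening in a viewer to confirm that the numeric gap corresponds to a visible one.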
Now I’m testing the Video Quality of Experience Monitor powered by SSIMWave, which uses SSIMPlus, and we’ll see if that dethrones VQM. Either way, the downside of objective testing is that it produces a number that might lack perspective. For example, how much more would you spend for an encoder that produced a 5 percent better VQM score than another? If the very slow x265 preset improved VQM by 10 percent over the medium preset, but took three times as long, would you use it? Obviously, the answers depend on the actual visible difference between the alternatives. I also always supplement objective testing with subjective testing because, at their core, these metrics make frame-by-frame comparisons, and some artifacts seem to appear only during real-time playback.
The obvious problem with subjective testing is that it’s time-consuming and expensive to test using multiple subjects. For this reason, most compressionists rely upon their own “golden eyes,” honed from years of quality comparison.
In the past, I performed still-frame comparisons by importing the compressed files into a Premiere Pro sequence so I could easily output comparative frame grabs. Now, for single-frame comparisons, I use VQMT, which can load a source file plus two encoded variations, and lets you toggle through all three files on any single frame. For real-time playback, I use Vanguard Video’s Visual Comparison Tool, which offers three playback views: side by side, split screen, and overlay.
As compressionists, we’re judged by the quality of our output. You can’t meaningfully fine-tune your compression settings without a combination of objective and subjective testing.
This column appears in the July/August issue of Streaming Media magazine as “Assessing Quality.”