It's Time to Retire PSNR
Virtually all experts on video quality metrics agree that the peak signal-to-noise ratio (PSNR) metric is a poor predictor of subjective quality. Yet, PSNR comparisons are included in almost all codec comparisons, most recently in the excellent IEEE white paper, "Comparing VVC, HEVC and AV1 Using Objective and Subjective Assessments." It's time to retire PSNR, at least for these types of analyses.
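Part of why PSNR predicts subjective quality so poorly is visible in how it's computed: it is a purely pixel-wise measure with no model of human perception. Here's a minimal sketch, assuming 8-bit grayscale frames (the frame values and the single-pixel error are illustrative, not from any real test):

```python
import numpy as np

def psnr(reference, distorted, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized frames.

    PSNR is just a log-scaled mean squared error: every pixel difference
    counts equally, regardless of whether a human viewer would notice it.
    """
    diff = reference.astype(np.float64) - distorted.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

# Two hypothetical 8-bit grayscale "frames": uniform gray vs. one pixel off
ref = np.full((4, 4), 128, dtype=np.uint8)
dist = ref.copy()
dist[0, 0] = 138  # a single 10-level error
print(round(psnr(ref, dist), 2))  # ~40.17 dB
```

Because the formula sees only squared pixel differences, two encodes with identical PSNR can look very different to a viewer, which is exactly the weakness the comparisons below demonstrate.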
I was reminded of PSNR's poor performance, yet again, when I started reviewing test results for a new video quality metric from the ITU called ITU-T Rec. P.1204. Taking a step back, the reason streaming producers use video quality metrics is to help make encoding decisions that improve subjective quality. For this reason, the most critical performance feature for any metric is how accurately it predicts how human eyes will rate the same video.
To assess this, researchers compile databases of videos and subjective ratings from multiple viewers. Then they score the same videos with the metric and see how the metric's scores compare to the subjective ratings. You can see three such comparisons in the graphic; on the left for P.1204.3, in the middle for PSNR, and on the right for Video Multimethod Assessment Fusion (VMAF). In all three, the vertical axis presents the results of subjective ratings, while the horizontal axis is the metric score.
If the metric predicted subjective ratings perfectly, every data point would fall on a straight line from the lower left to the upper right. No metric is perfect, so you never see that. However, the more closely the data points cluster around that line, the more accurate the predictions. Looking at the graphs, P.1204.3 is the most accurate, with VMAF next, and PSNR by far the least accurate. A similar graph is available for SSIMPLUS.
The pattern in the graphs is verified by Pearson's correlation coefficient (PCC) for each dataset. Briefly, PCC measures the linear correlation between two variables: X and Y. According to Statistics Solutions, "If the coefficient value lies between ± 0.50 and ± 1, then it is said to be a strong correlation." So despite the seeming randomness in the PSNR plot, a mathematician would say that the correlation is strong. Still, when more accurate metrics like VMAF, SSIMPLUS, and P.1204 are available, PSNR as a measure of quality is a waste of time and space.
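To make the PCC comparison concrete, here's a minimal sketch of how it's computed for one dataset, using hypothetical numbers (six made-up subjective mean opinion scores paired with made-up metric scores, far fewer points than a real study would use):

```python
import numpy as np

# Hypothetical subjective mean opinion scores (MOS) for six test clips
mos = np.array([1.2, 2.1, 2.9, 3.4, 4.0, 4.6])

# Hypothetical scores from an objective metric for the same six clips
metric = np.array([30.0, 33.5, 35.0, 38.2, 41.0, 44.5])

# Pearson's correlation coefficient: linear correlation between the two
# series, ranging from -1 (perfect inverse) through 0 (none) to 1 (perfect)
pcc = np.corrcoef(mos, metric)[0, 1]
print(round(pcc, 3))
```

The closer the PCC is to 1.0, the more tightly the data points hug the diagonal in plots like those in the graphic; a cloud like the PSNR plot produces a much lower (though still "strong" by the textbook definition) coefficient.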
Interestingly, PSNR has a "canary in a coal mine" utility, which is to identify VMAF hacking methods. That is, hacking techniques like pre-encoding sharpening and contrast adjustments can send VMAF scores through the roof, but will also send PSNR scores through the floor. If you see VMAF scores that seem excessively high, you should run a quick PSNR test to verify the hack. Even that use is waning, however, as Netflix recently introduced a no-hacking model that all codec testers should explore.
Keep your eye out for more on P.1204.3, which is a "no-reference" metric that can compute a score without comparing the encoded file to the source. This makes it much more convenient than full-reference metrics like PSNR and VMAF. If P.1204.3 finds its way into tools like FFmpeg and the Moscow State University Video Quality Measurement Tool, this will provide even greater justification for dropping PSNR from codec comparisons.