Facebook Introduces New 360°/VR Video Quality Metric
In a blog post called "Quality Assessment of 360° Video View Sessions," Facebook announced two objective quality metrics, SSIM360 and 360VQM, that the company created to help guide its 360° video development efforts. The post discusses many of the problems inherent to 360° production, and provides a glimpse of how these metrics may help simplify these issues for all developers in the future.
As background, understand that 360° video development has all the compression-related issues of 2D (or flat) videos, plus many unique challenges. For example, vendors use different layouts like equirectangular or cube maps to store the 360° video. Quantifying the qualitative differences associated with these layouts is critical, but extremely cumbersome.
In addition, many companies deploy content-dependent technologies that attempt to predict the regions of interest in a 360° video, so the video can be encoded to optimize quality where the viewer is watching. Of course, if the viewer is looking elsewhere, these predictions actually reduce quality. To our knowledge, prior to the 360VQM metric announced by Facebook today, there simply weren't any technologies to score and quantify the accuracy of these predictions.
These challenges are exacerbated by the sheer size of many 360° video input files, which are usually 4K or larger, and the need to stream to mobile devices at relatively low bitrates. Due to the viewing distances involved in a head-mounted display, scaling to lower resolutions for lower bitrate streams—which works well for most flat videos—can seriously degrade video quality.
Simply stated, 360° video is exceptionally challenging, and most producers have been working with blunt instruments in terms of quality metrics. This reality makes Facebook's announcements welcome news to all 360° video developers or producers, but as we'll discuss, Facebook has no immediate plans to open-source these metrics so that others can use them.
The first metric is called SSIM360. As you may know, SSIM is a well-respected full-reference metric that compares the encoded image to the original image and produces a score. Technically, when scoring a video, SSIM compares each frame block by block, scoring each block of pixels and averaging the block scores into a score for the frame. All frames are assessed the same way, and the frame scores are averaged into an overall score. A perfect 1.0 means the two images are identical, while a score of 0 means the two are totally dissimilar. Most high-quality videos rate 0.95 or above.
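As a rough sketch of the block-and-average procedure described above, the following Python computes SSIM over non-overlapping blocks, averages the blocks into a frame score, and averages the frames into a video score. It is a simplification under stated assumptions: production SSIM implementations typically use overlapping, Gaussian-weighted windows, and the helper names (`block_ssim`, `split_blocks`, `frame_ssim`, `video_ssim`) and the 8x8 block size are illustrative choices, not anything taken from Facebook's post. Frames are assumed to be 2D lists of 8-bit luma values.

```python
# Standard SSIM stabilizing constants for 8-bit pixels (K1 = 0.01, K2 = 0.03, L = 255).
C1 = (0.01 * 255) ** 2
C2 = (0.03 * 255) ** 2


def mean(values):
    """Arithmetic mean of any iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def block_ssim(ref, enc):
    """SSIM of two equally sized blocks, each given as a flat list of pixels."""
    mu_r, mu_e = mean(ref), mean(enc)
    var_r = mean((p - mu_r) * (p - mu_r) for p in ref)
    var_e = mean((p - mu_e) * (p - mu_e) for p in enc)
    cov = mean((r - mu_r) * (e - mu_e) for r, e in zip(ref, enc))
    numerator = (2 * mu_r * mu_e + C1) * (2 * cov + C2)
    denominator = (mu_r * mu_r + mu_e * mu_e + C1) * (var_r + var_e + C2)
    return numerator / denominator


def split_blocks(frame, size):
    """Return the non-overlapping size x size blocks of a frame, each flattened."""
    height, width = len(frame), len(frame[0])
    return [
        [frame[y][x] for y in range(top, top + size) for x in range(left, left + size)]
        for top in range(0, height - size + 1, size)
        for left in range(0, width - size + 1, size)
    ]


def frame_ssim(ref_frame, enc_frame, size=8):
    """Simple (unweighted) average of the per-block SSIM scores of one frame."""
    pairs = zip(split_blocks(ref_frame, size), split_blocks(enc_frame, size))
    return mean(block_ssim(r, e) for r, e in pairs)


def video_ssim(ref_frames, enc_frames, size=8):
    """Average of the per-frame scores over the whole clip."""
    return mean(frame_ssim(r, e, size) for r, e in zip(ref_frames, enc_frames))
```

Comparing a frame with itself yields 1.0, and any distortion pulls the score below that. Note that every block counts equally in `frame_ssim`, which is the property the next paragraph takes issue with.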
The problem with SSIM (and any flat metric) is that it gives equal weight to all blocks in the frame; it's a simple average. However, during 360° display, blocks in the middle of the frame take up much more space than blocks on the top and bottom, so the quality of these blocks should be given more weight. In essence, SSIM360 weights the blocks to compute a score that reflects the quality of the rendered 360° image, not the flat representation of that image.
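To make the weighting idea concrete, here is a minimal sketch that assumes an equirectangular frame and weights each row of blocks by the cosine of its latitude, which is proportional to the area that row covers on the sphere. Facebook's post, as summarized here, does not spell out its exact weights, so treat the cosine weighting and the function names (`latitude_weight`, `spherical_weighted_score`) as assumptions for illustration only.

```python
import math


def latitude_weight(block_row, total_rows):
    """cos(latitude) at the vertical center of a row of blocks.

    Assumes an equirectangular frame: row 0 is nearest the top pole,
    and the middle rows sit on the equator.
    """
    latitude = (0.5 - (block_row + 0.5) / total_rows) * math.pi
    return math.cos(latitude)


def spherical_weighted_score(block_scores):
    """Weighted average of per-block scores given as [block_row][block_col]."""
    total_rows = len(block_scores)
    weighted_sum = 0.0
    weight_sum = 0.0
    for row_index, row in enumerate(block_scores):
        weight = latitude_weight(row_index, total_rows)
        for score in row:
            weighted_sum += weight * score
            weight_sum += weight
    return weighted_sum / weight_sum


# Hypothetical per-block scores: weak near the poles, strong near the equator.
scores = [
    [0.80, 0.80],
    [0.98, 0.98],
    [0.98, 0.98],
    [0.80, 0.80],
]
flat_average = sum(sum(row) for row in scores) / 8
weighted_average = spherical_weighted_score(scores)
```

With these made-up numbers the flat average is 0.89, while the weighted average comes out at roughly 0.93, because the poorly scoring polar rows occupy little of the rendered sphere.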
Figure 1. Facebook produced the benchmark score by comparing multiple flat regions between the source and encoded 360 video.
In the blog post, Facebook technologists prove the accuracy of this weighted approach by comparing its scores to those achieved by rendering the source and encoded equirectangular frames and computing hundreds of SSIM scores within comparison blocks over the complete image (Figure 1). This approach weights each block as displayed in the final image, rather than as stored in the compressed frame, so it essentially converts the 2D metric to a 3D metric. However, since it involves hundreds of comparisons per frame, it's simply too cumbersome to use as a standalone metric.
Instead, Facebook weighted the blocks mathematically, and compared the scores to the theoretically perfect scores (called the ground truth score) produced using the technique shown in Figure 1. The researchers found that SSIM360 was 50% more accurate than traditional 2D SSIM.
If you're a believer in SSIM, SSIM360 is a solid advancement. However, I asked Shannon Chen, the author of the blog post, if Facebook performed any subjective comparisons to confirm their objective analysis, and they did not. Technically speaking, Facebook's tests prove that SSIM360 is more accurate than 2D SSIM when computing SSIM scores for 360° video, but not that these measurements accurately predict subjective evaluations.
For perspective, note that while this is the first spherically weighted SSIM-based metric that I've seen, there are multiple spherically weighted PSNR-based technologies available, including a weighted spherical PSNR you can download for free from Samsung. These metrics have met with mixed success in academic comparisons, though that can be said for all quality metrics, which always seem to have their detractors.
Those interested in an introduction to previous evaluations of spherical metrics should check out the presentation "Benchmarking Virtual Reality Video Quality Assessment" (FTP PDF download). Digging through the papers, you'll note disparate findings regarding the benefits of spherical weighting. For example, in the article "On the Performance of Objective Metrics for Omnidirectional Visual Content," the authors stated "Objective metrics specifically designed for 360-degree content do not outperform conventional methods designed for 2D images." In "An Evaluation of Quality Metrics for 360 Videos," the authors concluded, "It is found that most of the objective quality measures are well correlated with subjective quality. Also, among the evaluated quality measures, [traditional flat] PSNR is shown to be the most appropriate for 360 video communications."
On the other hand, the summary to the article "Weighted-to-Spherically-Uniform Quality Evaluation for Omnidirectional Video" states, "Our method makes the quality evaluation results more accurate and reliable since it avoids error propagation caused by the conversion from resampling representation space to observation space." During several recent consulting projects, I've worked with the spherically weighted Samsung PSNR metric, and found the results almost identical to flat PSNR. So the jury is still out on the benefits of spherical weighting.
During my discussions with Chen, I didn't explore how SSIM360 might differ from the PSNR-based approaches discussed in these articles, and I would expect the technical merits of the Facebook approach to be explored in future technical articles. As a practical matter, given the potential benefits of an effective 360° metric, anyone working in 360° VR would welcome the opportunity to test what might be a better mousetrap. In my view, the biggest problem with SSIM360 isn't the academic discord on technically similar antecedents; it's the fact that I can't get my hands on it.
At a high level, SSIM360 measures the quality of the entire frame. In contrast, 360VQM uses SSIM360 as a tool to measure how accurately the encoding technology predicted where the viewer would be watching. Those seeking a fascinating primer on these prediction technologies should check out another Facebook blog post entitled "Enhancing high-resolution 360 streaming with view prediction."
Let's start with a simple example. Assume you were watching a 360° video of a model walking in a circle around the camera. Multiple technologies can predict that most viewers would follow the model during the walk and optimize the encoded frame to improve quality. That is, while all frames have the same number of pixels, rather than spreading the entire image evenly over the frame, you could allocate 90% of the pixels to the model, with the remaining 10% containing low resolution sections of the rest of the frame.
When projected to 360 degrees, the entire image gets covered, of course. But the model looks fabulous, because most of the pixels are allocated to him or her. However, if you suddenly look behind you, the image would look awful because that area was represented by far fewer pixels in the encoded frame.
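A little arithmetic shows how lopsided such an allocation can be. The sketch below assumes, purely for illustration, that the favored region spans 90° of yaw and 90° of pitch centered on the horizon; the 90/10 pixel split comes from the example above, but the region size and the helper names (`sphere_fraction`, `density_ratio`) are hypothetical.

```python
import math


def sphere_fraction(yaw_span_deg, pitch_min_deg, pitch_max_deg):
    """Fraction of the full sphere covered by a region bounded in yaw and pitch."""
    yaw_span = math.radians(yaw_span_deg)
    pitch_min = math.radians(pitch_min_deg)
    pitch_max = math.radians(pitch_max_deg)
    solid_angle = yaw_span * (math.sin(pitch_max) - math.sin(pitch_min))
    return solid_angle / (4 * math.pi)


def density_ratio(pixel_share, area_share):
    """Pixel density in the favored region relative to the rest of the sphere."""
    favored_density = pixel_share / area_share
    remaining_density = (1 - pixel_share) / (1 - area_share)
    return favored_density / remaining_density


# Assumed favored region: 90 degrees of yaw, pitch from -45 to +45 degrees.
area_share = sphere_fraction(90, -45, 45)
ratio = density_ratio(0.90, area_share)
```

Under these assumptions the favored region covers a little under 18% of the sphere, so giving it 90% of the pixels makes it roughly 42 times denser than everything else, which is why looking away from the model is so costly.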
Here's a blurb from the blog post. "View-dependent optimization techniques like offset projections, saliency-based encoding, and content-dependent streaming are essentially biasing bit allocation (equivalent to pixel allocation in most cases) toward perceptually more important regions in a video. They do not enhance the quality of the frame as a whole but instead optimize the parts where viewers are most likely to look."
How can you judge the quality of such technologies? Here's where Facebook appears to break new ground.
To compute 360VQM, Facebook first derives the SSIM360 score. Thinking back to our circling model example, it's clear that different portions of each frame would have significantly different scores. The blocks containing the model would have a high score because the large number of pixels preserves the quality. Everywhere else would have a low score, because these sections are stored at very low resolution.
Over time, Facebook's player anonymously tracks the field of view actually watched by the viewer and maps that back to the actual pixels in that field of view. If you're watching the model the entire time, you'll increase the 360VQM score because you're watching a high-resolution, high-quality group of pixel blocks. If you're consistently watching the other side of the room, the pixel density in the field of view is extremely low, leading to a penalty factor applied to the 360VQM.
Figure 2. Measuring the accuracy of technologies that predict where viewers will look in the frame.
This is shown in Figure 2 from the blog post, where the sun in the center of the frame is the predicted field of view. Understand that without any optimizations, the 360° technology will give equal weight to all areas in the frame. No matter where the viewer looks, he or she will see the same number of source pixels in the field of view, and the same relative quality. In contrast, optimization technologies attempt to predict that the viewer is focused on the sun, and will allocate more pixels to that region.
On top in Figure 2, the red circle indicates that the viewer is actually watching the predicted area. Since the optimization technology accurately predicted this, the viewer is rewarded with a higher resolution image, which improves the 360VQM score. On the bottom, the reverse is true, and the viewer is watching a region that wasn't optimized. Here, the 360VQM score is penalized for viewing lower-resolution content.
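The blog post's actual formula isn't reproduced here, so the following is only a toy model of the idea: start from a frame-level SSIM360 score and scale it by a penalty that depends on the pixel density in whatever region the viewer actually watched. Everything in it is an assumption made for illustration, including the region names, the density values, the capped linear penalty, and the function names (`view_penalty`, `session_score`).

```python
def view_penalty(viewed_density, reference_density):
    """Hypothetical penalty factor for one view sample.

    Returns 1.0 when the viewed region has at least the reference pixel
    density and falls off linearly below it. The cap is an assumption.
    """
    return min(1.0, viewed_density / reference_density)


def session_score(viewed_regions, density_by_region, ssim360, reference_density):
    """Toy session-level score: SSIM360 scaled by the average view penalty.

    viewed_regions holds one region name per time sample, as a player
    might log it; density_by_region maps region name to pixel density.
    """
    factors = [
        view_penalty(density_by_region[region], reference_density)
        for region in viewed_regions
    ]
    return ssim360 * sum(factors) / len(factors)


# Made-up densities, relative to a uniform (unoptimized) layout at 1.0.
densities = {"predicted": 5.0, "elsewhere": 0.1}

on_target = session_score(["predicted"] * 10, densities, 0.97, 1.0)
off_target = session_score(["elsewhere"] * 10, densities, 0.97, 1.0)
mixed = session_score(["predicted"] * 5 + ["elsewhere"] * 5, densities, 0.97, 1.0)
```

In this toy model a viewer who stays on the predicted region keeps the full 0.97, a viewer who never looks there drops to about 0.10, and a viewer who splits time evenly lands in between. The real metric may reward and penalize differently; the point is only that the same encode can score very differently depending on where people actually look.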
In this fashion, 360VQM measures the accuracy of the view-dependent optimization technologies, which previously were unmeasurable at scale as far as we know. Since you can't improve what you can't measure, 360VQM is truly groundbreaking.
In the introduction, the blog post states, "We're sharing our QA workflow as a first step toward establishing an industry standard that we hope will enable the broader 360 developer community to quantify and share their developments in creating more immersive experiences." I asked Chen if Facebook was open-sourcing the metrics, and he responded that this wasn't in the short-term plans.
He did mention that developing a metric similar to SSIM360 could be done pretty quickly simply by adding spherical weighting to one of the many open-source SSIM implementations available. Hopefully, a metric vendor like Moscow State University could take this on, since it would be very helpful for many VR developers. On the other hand, 360VQM would be much tougher to reproduce since it requires access to both player stats and the weighting mechanism used by the encoding tool.
Overall, Facebook's two metrics bring to mind the old saw that necessity is the mother of invention. The challenges associated with 360° production necessitated the pioneering AI optimizations produced by Facebook and others, and now the metrics needed to benchmark their progress. By defining the techniques so thoroughly, the blog post will help others utilize these or similar approaches. Hopefully, down the road, this will lead towards standardization and open sourcing for the benefit of the greater 360° development community.