Video Compression for Machines: The Next Frontier
You're on a Sunday drive a few years from now, touring through hilly farmland in your self-driving car. As you approach a hill, an impatient motorcyclist foolishly decides to pass, despite the double-yellow line. He chooses wrong, and a tractor-trailer appears in the opposite lane, bearing down on the hapless rider. Instantly, your car slows and pulls to the right, allowing the motorcycle to swing in front of you and avert disaster. All while you're reading a book or watching a movie, perhaps not even facing the direction you're traveling.
Think about the technology that made all that happen. Today, some self-driving cars have as many as eight video cameras perceiving the world from all angles. This requires a sophisticated high-speed network, compression, and some method of object recognition. Object recognition is sophisticated stuff, often delivered by non-real-time neural networks hosted in the cloud. But for self-driving cars, the system needs to be local and immediate, requiring application-specific artificial intelligence (AI) CPUs and more compression to reduce the size of the data sets.
Now, think about making this interoperable. Perhaps interoperability might not matter for a car company—I'm sure Elon Musk thinks the neural network chip Tesla designed (shown in the image at the top of this article) is a competitive strength not to be shared. But imagine a defense-related operation where video data is shared by equipment from multiple vendors, and if one system "saw" an enemy tank, they all need to "see" the same tank.
It turns out that the Moving Picture Experts Group, or MPEG, has created several standards in this space and is currently formulating two additional standards to improve performance and ease implementation of machine-to-machine video creation and consumption, along with the neural networks they require.
This Isn't Your Father's MPEG
Most of us equate MPEG with video codecs like MPEG-2, H.264, and HEVC—the last leaving a bad taste in our mouths because of epically poor licensing policies. But it turns out that MPEG is also tackling the standards related to the operation described above, using existing standards for image and video search and in-process efforts for neural network compression and video coding for machines.
Beyond the fact that you'll probably be driving an autonomous car in ten years, why should you care? Because machine-to-machine video transmission and processing is becoming a big thing, with applications ranging from smart cities (facial and license plate recognition) and surveillance to factory automation (defect detection and grading), smart retailing (restocking, emotion detection), and many others.
In its Visual Networking Index published in February 2019, Cisco estimated that global machine-to-machine (M2M) traffic will increase more than sevenfold from 2017 to 2022, in part because "of the increase of deployment of video applications on M2M connections." In short, where five years ago virtually all video was consumed by humans, going forward an increasing share will be consumed and processed by machines. So compressionists practicing in five years will have to know how to optimize video for machine consumption as well as human consumption.
Standards to Know
A buzzword-level knowledge of the appropriate standards is a great place to start. First is the Compact Descriptors for Visual Search (CDVS), a still-image standard that is technically Part 13 of the MPEG-7 standard and was adopted by ISO as the ISO/IEC 15938-13:2015 standard. CDVS works by extracting a set of "local features" or mathematical representations of the image, as opposed to compressing every pixel, which is much more compact than a JPEG-compressed image.
Still, even at very low data rates, image matching performance is quite good, as you can see in Figure 1, where a 16-kilobyte image delivered an accurate matching rate of over 95%, with larger image sizes delivering even better accuracy. This is from an MPEG white paper that also mentioned it took 0.2s to extract the local features and 2.5s to match the image in a database of 1 million images (the white paper didn't specify the machine type or speed).
Figure 1. Compact Descriptors for Visual Search (CDVS) performance at various bitrates
Now is a good time to reflect upon the differences between encoding for human viewing and encoding for machine viewing. CDVS reduces still images to fabulously small files that provide the information needed to identify objects with impressive accuracy rates. However, you can't decompress that file into an image a human could recognize; it's a completely different focus.
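To make the idea of matching on a descriptor rather than on pixels concrete, here is a minimal Python sketch. It is not the CDVS algorithm: it swaps in a much simpler stand-in (a 16-bit brightness-grid signature compared by Hamming distance), and every function name and test image in it is hypothetical. It only illustrates that a tiny, non-viewable signature can still pick the right image out of a database.

```python
# Toy illustration only: a coarse brightness-grid signature, NOT the CDVS algorithm.
# All names and test images here are hypothetical.

def block_descriptor(image, grid=4):
    """Reduce a grayscale image (rows of 0-255 ints) to grid*grid bits."""
    height, width = len(image), len(image[0])
    bh, bw = height // grid, width // grid
    means = []
    for gy in range(grid):
        for gx in range(grid):
            block = [image[y][x]
                     for y in range(gy * bh, (gy + 1) * bh)
                     for x in range(gx * bw, (gx + 1) * bw)]
            means.append(sum(block) / len(block))
    overall = sum(means) / len(means)
    # One bit per block: is this block brighter than the image average?
    return [1 if m > overall else 0 for m in means]

def hamming(a, b):
    """Count the positions where two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

def best_match(query, database):
    """Return the name of the database entry closest to the query descriptor."""
    return min(database, key=lambda name: hamming(query, database[name]))

def gradient(size, horizontal):
    """Synthetic test image: a brightness ramp, left-to-right or top-to-bottom."""
    return [[(x if horizontal else y) * 255 // (size - 1) for x in range(size)]
            for y in range(size)]

size = 16
database = {
    "left_to_right": block_descriptor(gradient(size, True)),
    "top_to_bottom": block_descriptor(gradient(size, False)),
}

# Query: a brighter copy of the left-to-right ramp (a crude lighting change).
query_image = [[min(255, p + 20) for p in row] for row in gradient(size, True)]
query = block_descriptor(query_image)

print(best_match(query, database))  # left_to_right
print(len(query), "descriptor bits vs", size * size, "pixels")
```

The 16-bit signature cannot be turned back into the 256-pixel image, yet it still identifies the brightened copy. That is the same trade a machine-oriented descriptor makes, at toy scale.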
Compact Descriptors for Visual Analysis (CDVA)
The next standard is Compact Descriptors for Visual Analysis (CDVA), which is designed for video, whereas CDVS was designed for still images. The goal of the standard is to achieve more compact descriptors for video than can be derived for still images by exploiting the temporal redundancy in video. This is similar to how interframe compression efficiently compresses video with lots of redundant information. The other goal was to add "a descriptor component based on features extracted using a convolutional neural network (CNN) in order to benefit from the recent progress made in deep learning." In other words, to make the descriptors neural network-friendly. CDVA is Part 15 of MPEG-7 (ISO/IEC 15938-15) and was finalized in 2018.
Figure 2. The operational schema for Compact Descriptors for Visual Analysis (CDVA).
The operational schema for CDVA is shown in Figure 2 from an MPEG white paper. Briefly, there are three components. Two of them, the global and local feature descriptors, are extensions of the CDVS descriptor components. The third is a "deep feature descriptor" extracted by a local neural network, with a procedure called nested invariance pooling (NIP) applied to improve accuracy. CDVA is applied to a "temporal video segment" with "visual homogeneity" between the frames, which in most instances will be a shot or scene, with temporal efficiency achieved by encoding only a single frame for each segment. That's the keyframe extraction shown at the top of Figure 2.
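The segment-then-keyframe idea can be sketched in a few lines of Python. This is a rough illustration under my own assumptions, not the procedure defined in the CDVA standard: frames are flat lists of brightness values, "visual homogeneity" is approximated by a mean-absolute-difference threshold, and the middle frame of each segment is arbitrarily chosen as its keyframe. The function names and the threshold value are hypothetical.

```python
# Rough sketch of "one keyframe per visually homogeneous segment".
# The difference metric, threshold, and middle-frame choice are assumptions
# made for illustration; they are not taken from the CDVA specification.

def frame_difference(a, b):
    """Mean absolute difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def split_into_segments(frames, threshold=30.0):
    """Group consecutive frame indices; a large jump starts a new segment."""
    if not frames:
        return []
    segments = [[0]]
    for i in range(1, len(frames)):
        if frame_difference(frames[i - 1], frames[i]) > threshold:
            segments.append([i])      # likely a shot change
        else:
            segments[-1].append(i)    # still the same shot
    return segments

def pick_keyframes(segments):
    """Pick the middle frame index of each segment as its keyframe."""
    return [seg[len(seg) // 2] for seg in segments]

# Twelve tiny synthetic "frames" forming three shots.
frames = (
    [[10 + i] * 8 for i in range(5)]      # dark shot, indices 0-4
    + [[200 - i] * 8 for i in range(4)]   # bright shot, indices 5-8
    + [[90] * 8 for _ in range(3)]        # static mid-gray shot, indices 9-11
)

segments = split_into_segments(frames)
keyframes = pick_keyframes(segments)
print(segments)   # [[0, 1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11]]
print(keyframes)  # [2, 7, 10]
```

Twelve frames collapse to three keyframes, so any descriptor computed afterward is extracted three times rather than twelve. That is where the temporal savings come from.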
CDVA is designed to enable pairwise comparisons despite changes in vantage point, camera settings, lighting, and video resolution. The standard is designed to support a range of hardware implementation strategies from low complexity/memory environments to massively parallel execution like that provided by GPUs or ASICs.
Early reference software performance has been impressive. On a diverse content set, the CDVA descriptor averaged between 2-4 kilobytes per second with an extraction time of about 0.7s per second of video on a single-core computer. At these rates, the descriptor delivered a correct matching rate of 88% with a 1% false matching rate, but the white paper didn't specify a matching time or the size of the database.
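To put those figures in perspective, the arithmetic below converts them into more familiar units. The 2-4 kilobytes per second and 0.7s extraction time come from the white paper as quoted above; the 5Mbps comparison stream is my own hypothetical figure for a typical compressed video, and I'm assuming decimal kilobytes (1,000 bytes).

```python
# Back-of-the-envelope arithmetic on the reported CDVA figures.
# ASSUMPTIONS (not from the white paper): 1 kilobyte = 1,000 bytes, and a
# hypothetical 5 Mbps human-viewable stream used purely for comparison.

KILOBYTE = 1000
descriptor_bytes_per_second = (2 * KILOBYTE, 4 * KILOBYTE)  # reported range
extraction_seconds_per_video_second = 0.7                   # reported, single core
assumed_video_kbps = 5000                                   # hypothetical stream

# Bytes per second -> kilobits per second.
descriptor_kbps = tuple(b * 8 / 1000 for b in descriptor_bytes_per_second)

# How many times smaller the descriptor stream is than the assumed video stream.
size_ratio = tuple(assumed_video_kbps / k for k in descriptor_kbps)

# Descriptor storage and extraction time for one hour of video.
seconds_per_hour = 3600
hour_megabytes = tuple(b * seconds_per_hour / 1_000_000
                       for b in descriptor_bytes_per_second)
extraction_minutes = round(
    extraction_seconds_per_video_second * seconds_per_hour / 60, 1)

print("Descriptor bitrate (kbps):", descriptor_kbps)
print("Times smaller than assumed stream:", size_ratio)
print("Descriptor size for one hour (MB):", hour_megabytes)
print("Extraction time for one hour (minutes):", extraction_minutes)
```

Under these assumptions, the descriptor stream works out to 16 to 32 kbps, a couple of hundred times smaller than the hypothetical video stream. An extraction time under one second per second of video also means a single core can keep up with real time.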
Compressing Neural Networks
CDVA is designed to be deployed on a range of devices, from standalone computers to video cameras, smartphones, or even smartwatches. In these cases, the device would analyze and produce the CDVA-encoded data, which would be transmitted elsewhere for analysis. As shown in Figure 2, a neural network is required for CDVA processing. As you may know, neural networks "learn" and process via large data sets that can easily exceed several hundred megabytes in size. Not only would this data set likely be shipped with the device, it would also need to be periodically updated.
For this reason, MPEG is also working on a standard for Neural Network Compression, with one of the starting points a data set including CDVA-based data. In terms of status, the latest evaluation framework was published in July 2019, and MPEG Chairman Leonardo Chiariglione shared that MPEG has received nine submissions in response to the call for proposals.
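The proposals themselves aren't described here, so the sketch below does not represent MPEG's method. It shows one generic, widely used way to shrink a network, uniform 8-bit quantization of the weights, simply to illustrate why compression matters for shipping and updating a model on a small device. The function names and the random stand-in weights are hypothetical.

```python
# Generic illustration of weight compression via uniform 8-bit quantization.
# This is NOT the technique specified by MPEG's neural network compression
# work; it is a common textbook approach used here as a stand-in.
import random

def quantize(weights):
    """Map float weights to one byte each, plus an offset and a step size."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 if hi > lo else 1.0
    codes = bytes(min(255, round((w - lo) / scale)) for w in weights)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Recover approximate float weights from the byte codes."""
    return [lo + c * scale for c in codes]

random.seed(0)
weights = [random.gauss(0.0, 0.1) for _ in range(1000)]  # stand-in "layer"

codes, lo, scale = quantize(weights)
restored = dequantize(codes, lo, scale)

original_bytes = len(weights) * 4   # assuming 32-bit floats in the original
compressed_bytes = len(codes)       # one byte per weight (plus two floats)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("Size reduction:", original_bytes / compressed_bytes, "x")
print("Worst-case error within half a step:", max_error <= scale / 2 + 1e-9)
```

Assuming 32-bit weights, this trims the model to roughly a quarter of its size, at the cost of a small bounded error in each weight. A periodic update for a model measured in hundreds of megabytes would shrink by about the same factor.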
Video Compression for Machines
The final specification is for Video Compression for Machines (VCM), with a group formed to explore the topic in July 2019. Patrick Dong of Gyrfalcon Technology is the co-chair of the group, with Yuan Zhang of China Telecom appointed as chair. According to the press release, the group will create standards for "compression coding for machine vision as well as compression for human-machine hybrid vision." The standard will be designed to be implemented in chips for broad use with any video-related Internet of Things (IoT) devices.
In the words of Chiariglione, the group was formed to answer the following question: "So far, video coding 'descriptors' were designed to achieve the best visual quality—as assessed by humans—at a given bitrate. The question asked by video coding for machines is: 'What descriptors provide the best performance for use by a machine at a given bitrate?'"
Seeking a bit more definition, I asked Dong, "What does VCM do that CDVA didn't/doesn't?" He responded, "CDVA is a standard for highly compressed video descriptors, mainly targeting object search and retrieval applications. However, deep feature descriptors lack the locational information for objects. VCM is an emerging video for machine standard, and the next iteration of this idea that is a superset of CDVA. Through combining multiple feature maps of a neural network backbone, we can additionally perform object detection and segmentation tasks, amongst others."
In short, it seems that many companies, including Gyrfalcon and China Telecom, are offering a range of products to accelerate neural network performance at the edge. Today, at least as it relates to video-related data, there is little interoperability, which hinders broadscale adoption. Once finalized, these last two MPEG standards should do a lot to accelerate development and deployment in this space.
In the meantime, where today we struggle to identify the optimal parameters to improve VMAF scores and QoE, tomorrow compressionists will be tweaking compression settings to improve identification and retrieval accuracy for purely machine viewing. On a different level, it's interesting to learn how MPEG, the standard-setting body, is tackling much more difficult and important projects than simply identifying the next "it" video codec.