AI-Based Scaling as the Key to Cost-Efficient 4K UHD Content Delivery
Practical artificial intelligence for edge computing—in the form of compact software and chip-level hardware acceleration for neural-network inferencing—is upending the way operators of many networks think about providing user-facing services. Among multichannel video programming distributors (MVPDs), network operators, and architects of content delivery networks (CDNs) over 5G wireless systems, discussions are flaring up about local personal assistants, cognitive and predictive user interfaces, and some completely novel services. All these ideas could differentiate a provider network, potentially increasing the subscriber base and reducing churn.
But the most important near-term impact of AI for providers of video content may lie in another direction entirely. AI-based Super Resolution, an emerging technique that uses deep-learning inference to enhance the perceived resolution of an image beyond the resolution of the input data, can give viewers a compelling 4K Ultra-High Definition (UHD) experience on their new 4K displays from a 1K-resolution source. This rather non-intuitive result translates into users delighted by the range of 4K content suddenly available to them and operators delighted by significantly reduced storage, remote caching, and bandwidth needs, with consequent energy savings across their systems, compared with what native 4K files would require.
This might seem a rather academic point. To be most effective, the receiving side of Super Resolution must be executed at the extreme network edge: the user premises. And the receiver’s deep-learning inference task can be highly compute-intensive, especially with the real-time constraints of streaming video. At Synaptics, however, we have been able to demonstrate that the Neural Processing Unit, a compact deep-learning inference accelerator integrated into our recent set-top SoC, can in fact perform Super-Resolution image expansion in real time, to the satisfaction of critical viewers.
How It Works
Today, operators who want to offer 4K content to users must store at least two compressed versions of each program: one in 4K UHD resolution, and one in 1K high-definition or full high-definition (HD/FHD) format. The much larger 4K file will be streamed to users who have 4K displays and who enjoy the necessary network bandwidth. The HD or FHD file will be streamed to users with lower-resolution displays or with bandwidth constraints. At the headend, the core datacenter, or the remote cache, the system must select, segment by segment, between the two files to meet the demands of adaptive bitrate control. This switching potentially creates jarring changes in image quality for viewers using 4K displays.
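The segment-by-segment selection described above can be sketched roughly as follows. This is a minimal illustration, not any operator's actual adaptive-bitrate logic: the rendition bitrates (16 and 5 Mbit/s), the safety margin, the throughput samples, and the function name are all hypothetical.

```python
# Hypothetical rendition bitrates in Mbit/s; real values depend on encoder and content.
RENDITIONS = {"4K UHD": 16.0, "FHD": 5.0}


def pick_rendition(measured_mbps: float, display_is_4k: bool, margin: float = 1.25) -> str:
    """Choose which stored file supplies the next segment.

    The 4K file is used only when the display can show it and the measured
    throughput exceeds the 4K bitrate by an assumed safety margin.
    """
    if display_is_4k and measured_mbps >= RENDITIONS["4K UHD"] * margin:
        return "4K UHD"
    return "FHD"


# Throughput measured before each segment (made-up numbers).
throughput = [30.0, 24.0, 12.0, 9.0, 26.0]
choices = [pick_rendition(mbps, display_is_4k=True) for mbps in throughput]
print(choices)
```

Each flip between renditions in the printed list corresponds to one of the visible quality changes a viewer with a 4K display may notice.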
Worse, the 4K files are relatively huge, and the duplication of program content costs storage and remote-cache space. And streaming the full 4K file gobbles precious network bandwidth. These considerations have kept a tight cap on the number of programs providers are willing to offer in 4K at any one time.
AI-based Super Resolution changes this picture entirely. It allows operators to provide not only HD/FHD but a compelling UHD experience from only a 1K-sized program file. It works by teaming spatial image compression performed by convolutional neural networks (CNNs) with the mainly temporal compression of the HEVC or AVC codec.
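A back-of-the-envelope calculation suggests the scale of the savings. The sketch below assumes that "4K" means 3840x2160 and "1K" means 1920x1080, and it uses hypothetical stream bitrates (16 and 5 Mbit/s) and a two-hour program; none of these figures come from measurements. The 1-Mbyte model size is the approximate figure this article cites for the upscaling model.

```python
UHD = (3840, 2160)  # assumed "4K" frame size
FHD = (1920, 1080)  # assumed "1K" frame size


def pixels(size):
    """Pixels per frame for a (width, height) pair."""
    return size[0] * size[1]


def file_gb(bitrate_mbps, duration_s):
    """Approximate compressed file size in gigabytes (10**9 bytes)."""
    return bitrate_mbps * 1e6 * duration_s / 8 / 1e9


duration_s = 2 * 3600                  # hypothetical two-hour program
uhd_gb = file_gb(16.0, duration_s)     # hypothetical 4K stream bitrate
fhd_gb = file_gb(5.0, duration_s)      # hypothetical FHD stream bitrate
model_gb = 1e6 / 1e9                   # roughly 1 Mbyte upscaling model

pixel_ratio = pixels(UHD) / pixels(FHD)
two_file_gb = uhd_gb + fhd_gb          # today: store both versions
sr_gb = fhd_gb + model_gb              # Super Resolution: 1K file plus model

print(f"pixels per frame, 4K vs 1K: {pixel_ratio:.0f}x")
print(f"two-file storage: {two_file_gb:.1f} GB, Super Resolution: {sr_gb:.3f} GB")
```

Under these assumptions the model adds a negligible amount to the 1K file, so nearly all of the 4K file's storage and bandwidth is avoided.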
Under the Hood
Researchers have found that a pair of CNNs trained on a piece of content can achieve very significant reduction in file size by down- and up-scaling each individual frame, with little or no perceived loss in image quality on the final display device. In practice, a content provider would create two CNN inference models for each piece of 4K content they wished to provide. For each piece of content they would use the 4K video file, frame by frame, as input data to train the two CNN models: one for downscaling each frame of the content from 4K to 1K, and the other for up-scaling the resulting 1K frames back to 4K. Remarkably, this intensive training process creates a quite compact upscaling CNN model that can actually restore edge sharpness, surface textures, and some fine detail not explicitly present in the transmitted 1K frames.
Because the two models are trained—ideally, together—on the actual piece of content, the downscaling model has learned (to use an inappropriately anthropomorphic term) how to remove detail in such a way that the upscaling model will correctly restore it. Imagine, if you will, tracing over a photograph to create an outline drawing, and then asking a skilled painter to create a photo-realistic artwork from your outline drawing. Because you can tell the artist about the content of the drawing—which lines should be smooth and which jagged, which surface texture should be a feather and which stainless steel, that the little blob in the sky was a seagull—the artist can correctly fill in detail that is not present in the line drawing.
This deep-learning training process results in scaling of significantly higher quality for that particular piece of content than could be achieved with pre-defined heuristics or purely mathematical compression techniques. And it is compatible with conventional video codecs.
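The idea of training the two models together on the content itself can be illustrated with a deliberately tiny stand-in. The sketch below is not a CNN and not the method described here: it replaces both networks with two-weight linear maps over pairs of neighboring samples and trains them jointly by plain gradient descent on reconstruction error. The synthetic "content," the learning rate, and all names are assumptions made for illustration.

```python
import random


def make_content(n=200, seed=1):
    """Synthetic 'content': pairs of neighboring samples that are nearly equal."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        a = rng.random()
        pairs.append((a, a + rng.uniform(-0.02, 0.02)))
    return pairs


def reconstruction_loss(d, u, pairs):
    """Mean squared error after downscaling (2 -> 1) and upscaling (1 -> 2)."""
    total = 0.0
    for a, b in pairs:
        z = d[0] * a + d[1] * b                     # downscaled sample
        total += (u[0] * z - a) ** 2 + (u[1] * z - b) ** 2
    return total / len(pairs)


def train_pair(pairs, steps=1000, lr=0.05, seed=0):
    """Jointly fit downscaler weights d and upscaler weights u on this content."""
    rng = random.Random(seed)
    d = [rng.uniform(0.1, 0.5) for _ in range(2)]
    u = [rng.uniform(0.1, 0.5) for _ in range(2)]
    n = len(pairs)
    for _ in range(steps):
        grad_d, grad_u = [0.0, 0.0], [0.0, 0.0]
        for a, b in pairs:
            z = d[0] * a + d[1] * b
            err_a, err_b = u[0] * z - a, u[1] * z - b
            grad_u[0] += 2 * err_a * z
            grad_u[1] += 2 * err_b * z
            dz = 2 * (err_a * u[0] + err_b * u[1])  # loss gradient w.r.t. z
            grad_d[0] += dz * a
            grad_d[1] += dz * b
        for i in range(2):                          # update both models together
            d[i] -= lr * grad_d[i] / n
            u[i] -= lr * grad_u[i] / n
    return d, u


content = make_content()
rng0 = random.Random(0)
untrained_d = [rng0.uniform(0.1, 0.5) for _ in range(2)]
untrained_u = [rng0.uniform(0.1, 0.5) for _ in range(2)]
loss_before = reconstruction_loss(untrained_d, untrained_u, content)
down, up = train_pair(content)
loss_after = reconstruction_loss(down, up, content)
print(f"reconstruction loss before: {loss_before:.4f}, after: {loss_after:.6f}")
```

Because both sets of weights are updated against the same reconstruction error, the downscaler settles on a summary that this particular upscaler can expand well, which is the property the paired CNNs exploit at far larger scale.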
Like any training of a deep-learning network, the training process that produces the two CNN models is complex and lengthy. It is best done in a datacenter or cloud. But the trained models themselves—the pieces that actually do the down- and up-scaling—can be quite compact and fast. The upscaling model in particular can be made small enough to execute in a smart streaming device or set-top box. And with powerful chip-level neural-network inference accelerators now entering the market, the upscaling model can process frames in real time.
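To give a sense of what "real time" demands, the following sketch computes the time available per frame and the number of output pixels the upscaler must produce each second. The frame rates and the 3840x2160 output size are assumptions; the article does not specify them.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to upscale one frame, in milliseconds."""
    return 1000.0 / fps


def output_pixels_per_second(width: int, height: int, fps: int) -> int:
    """Output pixels the upscaler must generate every second."""
    return width * height * fps


for fps in (24, 30, 60):  # assumed frame rates
    budget = frame_budget_ms(fps)
    rate = output_pixels_per_second(3840, 2160, fps)
    print(f"{fps} fps: {budget:.2f} ms per frame, {rate:,} output pixels per second")
```

The budget covers the upscaling step only; decryption and decoding of each frame need their own share of time or their own hardware.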
The Process in Action
In practice, a content owner or service provider would first train the downscaling and upscaling CNN models for each piece of content. Then they would use the downscaling model to transform each frame of the video content from 4K to 1K resolution. Next, they would compress the 1K video stream as usual with their HEVC or AVC encoder, encrypt it, and distribute the compressed stream and the content’s relatively tiny upscaling CNN model to content storage sites.
On demand, the provider would first transmit the upscaling CNN model to a receiving device if the content is to be viewed in 4K resolution. This model is compact—typically around 1 Mbyte—so the download would normally be too quick for the user to experience any delay. Next the provider would begin streaming the compressed video. The receiving device would decrypt and decode the HEVC or AVC stream into 1K-resolution frames. Then it would apply the upscaling CNN model to scale each frame back to 4K resolution for display. The resulting video experience will be indistinguishable, to most viewers, from a full end-to-end 4K transmission.
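The receive-side order of operations can be sketched as below. Every name here is hypothetical, the decrypt and decode stages are passed in as plain functions, and a nearest-neighbor 2x scaler stands in for the content-specific upscaling CNN purely so the sketch runs; it does not restore detail the way the trained model is described as doing. The 25 Mbit/s link speed in the download estimate is likewise an assumption.

```python
from typing import Callable, Iterable, Iterator, List

Frame = List[List[int]]  # one image plane as rows of pixel values


def model_download_seconds(model_bytes: int, link_mbps: float) -> float:
    """Rough transfer time for the upscaling model, ignoring protocol overhead."""
    return model_bytes * 8 / (link_mbps * 1e6)


def nearest_neighbor_2x(frame: Frame) -> Frame:
    """Placeholder for the upscaling CNN: doubles width and height."""
    out: Frame = []
    for row in frame:
        wide = [p for p in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def play(
    load_model: Callable[[], Callable[[Frame], Frame]],
    stream: Iterable[bytes],
    decrypt: Callable[[bytes], bytes],
    decode: Callable[[bytes], Frame],
) -> Iterator[Frame]:
    """Fetch the model first, then decrypt, decode, and upscale each frame."""
    upscale = load_model()              # model arrives before streaming starts
    for packet in stream:
        yield upscale(decode(decrypt(packet)))


# Usage with trivial stand-ins: each "packet" holds four pixel bytes of a 2x2 frame.
packets = [bytes([1, 2, 3, 4])]
frames = list(
    play(
        load_model=lambda: nearest_neighbor_2x,
        stream=packets,
        decrypt=lambda p: p,                               # no real cipher here
        decode=lambda p: [list(p[0:2]), list(p[2:4])],     # no real codec here
    )
)
print(frames[0])
print(f"{model_download_seconds(1_000_000, 25.0):.2f} s to fetch a 1-Mbyte model")
```

In a real device these stages would run inside the protected media data path rather than in application code like this.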
But Wait …
Readers familiar with set-top box architecture have probably already spotted a serious issue with this scenario. For copyrighted content, guidelines such as the MovieLabs Enhanced Content Protection scheme require that any unencrypted content stream be handled only in a so-called secure media pipeline. No open software environment, such as a set-top-box CPU, may have physical access to the unencrypted stream.
In today’s set-top boxes, this means that decryption, decoding, and frame buffering data-path hardware must be physically isolated from CPU-accessible hardware. But with Super Resolution we are adding another complex functional block, under control of the CPU, to this data path.
This makes it necessary to maintain leak-proof, hardware-enforced separation between the control and data planes in the inference accelerator. Figure 1 illustrates this separation of the trusted from the rich execution environment in a secure media pipeline system. This requirement makes implementation of the inference accelerator data path in a CPU or GPU extremely problematic. Even with a hardware root of trust and a secure boot process, it would be extremely hard to prove that the video stream was safe when it was passing through a device that could execute code from RAM and had any path to the outside world. But it is possible to implement a hardware Neural Processing Unit in this separated environment without creating a path for data leaks, as shown in the SyKure block diagram of Figure 2.
The Bottom Line
The technology exists in production today for content providers to deploy AI-based Super Resolution. With it, they can eliminate use of 4K content files, allowing them to offer a far broader range of 4K-quality content while saving storage, cache, and bandwidth. But to do so, they must specify receiving-end devices with neural-network inference accelerator hardware: fast enough for guaranteed real-time upscaling of each frame, and with hardware security that will withstand the intense scrutiny of copyright owners. The opportunity awaits.
[Editor's note: This is a contributed byline from Synaptics. Streaming Media accepts vendor bylines based solely on their value to our readers.]