Apple Got It Wrong: Encoding Specs for HEVC in HLS
When Apple releases encoding specifications, most producers (including myself) tend to give them great weight. After all, Apple invented HTTP Live Streaming (HLS) and produces most of the devices that play the format natively. Apple knows HLS and its devices better than any other company. This makes Apple highly credible, but it doesn't make the company infallible.
Though I didn't recognize it at first, Apple got it wrong when it released its encoding specifications for HEVC and HLS, and it took a comment from a true encoding expert to help me realize it. More on the identity of the expert, and a more detailed technical explanation, below.
Here's the story. Table 1 shows the HEVC and H.264 encoding recommendations from Apple's HLS Authoring Specification for Apple Devices for 1080p and below. Though the data rates are different for H.264 and HEVC, the resolutions are the same.
However, because HEVC can incorporate larger block sizes and has other technical advantages over H.264, the resolutions should be different. Specifically, you should use larger resolutions much deeper in the ladder and eliminate the lowest two or three resolutions. I'll prove this using a process debuted by Netflix when they first announced their per-title encoding technique.
Table 1. The SD-HD rungs in the Apple HLS encoding ladder
Finding the Optimal Ladder
At its simplest level, Netflix finds the optimal encoding ladder by encoding the source file at multiple resolutions and data rates and computing the VMAF score of each encode. I show this in Table 2, which is a series of encodes for the 1080p version of Tears of Steel. At each data rate, the entry in green is the highest quality of the tested iterations. Using this data, you would build your encoding ladder by first choosing the data rates for each rung, and then choosing the resolution that delivers the best quality at that data rate.
Looking at the ladder, you would deploy at 1080p down to 3800 Kbps, then 720p to 1400 Kbps, and so on down to 100 Kbps, where the 270p version actually delivers higher quality than the 234p. If I were focusing on H.264 in this article, I would recommend dropping the 234p rung in favor of a 270p rung, but I'm focusing solely on HEVC.
Table 2. Finding the optimal encoding ladder for Tears of Steel using H.264
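The selection step described above is easy to sketch in code. The following is a minimal illustration, not Netflix's actual implementation: for each data rate, it picks the resolution with the highest VMAF score. The scores in `vmaf_scores` are made-up placeholder values, not the numbers from Table 2, and the names `vmaf_scores` and `build_ladder` are hypothetical.

```python
# Hypothetical VMAF scores for illustration only; NOT the values from Table 2.
# Layout: {data_rate_kbps: {frame_height: vmaf_score}}
vmaf_scores = {
    4000: {1080: 93.1, 720: 91.0, 540: 86.2},
    2000: {1080: 84.0, 720: 86.5, 540: 83.9},
    1000: {1080: 68.2, 720: 76.4, 540: 77.8},
}


def build_ladder(scores):
    """For each data rate, pick the resolution with the highest VMAF score."""
    ladder = []
    for kbps in sorted(scores, reverse=True):
        by_height = scores[kbps]
        # The "green cell" in the table: best-scoring resolution at this rate.
        best_height = max(by_height, key=by_height.get)
        ladder.append((kbps, best_height, by_height[best_height]))
    return ladder


for kbps, height, vmaf in build_ladder(vmaf_scores):
    print(f"{kbps} Kbps -> {height}p (VMAF {vmaf})")
```

With real measurements in place of the placeholder scores, the output is the rung list you would deploy: one resolution per data rate.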
Table 3 shows the same H.264 analysis on the left, with an HEVC analysis added to the right. Here you see that with HEVC, you'd deploy at 1080p down to 2000 Kbps, 720p down to 500 Kbps, and would eliminate the bottom three rungs.
Why is H.264 different than HEVC? Here's the non-technical description. In the encoding ladder, we scale down from the original resolution to minimize compression artifacts that reduce visual quality on that rung. However, scaling to lower resolutions results in loss of detail and aliasing that also reduces quality. At each rung, we balance the loss of detail associated with scaling against compression artifacts. Again, because HEVC is a superior codec to H.264, it allows you to scale less, improving detail, while still avoiding compression artifacts. For this reason, you should use higher resolutions farther down the encoding ladder and avoid the lowest resolutions altogether.
Table 3. Choosing the optimal bitrate ladder with H.264 (on the left) and HEVC (on the right).
Table 4 shows the total improvement in VMAF score using the ladder suggested by Table 3 rather than following Apple's guidelines. Specifically, the "Was" columns show the resolution recommended by Apple and the VMAF scores of files encoded at that resolution, while the "Should Be" columns show the resolution suggested by Table 3 (on the right) and the associated VMAF score. The "Delta" is the improvement in VMAF score associated with the "Should Be" encoding ladder. Note that when the target data rate was between two numbers, I rounded to the closest (so for 145 Kbps I used 100 Kbps results), and for 1700 Kbps I rounded up to 1800 Kbps.
Looking at the numbers in Table 4, you see that the more significant differences are at the lower rungs, where the higher resolution retains more detail. Because HEVC is a superior codec, this isn't offset by compression artifacts as it might be with H.264.
Table 4. Improved VMAF score by using the HEVC ladder suggested by Table 3 for Tears of Steel
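The bookkeeping behind a table like this can be sketched as follows. This is a minimal illustration under stated assumptions: the list of tested data rates in `TESTED_RATES_KBPS` is hypothetical (the article does not list every tested rate), ties are broken upward to match the 1700-to-1800 Kbps rounding described above, and the function names are invented for the example.

```python
# Hypothetical set of tested data rates (Kbps); the real test matrix may differ.
TESTED_RATES_KBPS = [100, 200, 300, 400, 500, 600, 800,
                     1000, 1200, 1400, 1600, 1800, 2000]


def nearest_tested_rate(target_kbps, tested_rates):
    """Return the tested rate closest to the target; ties go to the higher rate."""
    return min(tested_rates, key=lambda rate: (abs(rate - target_kbps), -rate))


def vmaf_delta(was_score, should_be_score):
    """Improvement from the 'Was' resolution to the 'Should Be' resolution."""
    return round(should_be_score - was_score, 2)


print(nearest_tested_rate(145, TESTED_RATES_KBPS))   # closest tested rate to 145 Kbps
print(nearest_tested_rate(1700, TESTED_RATES_KBPS))  # tie between 1600 and 1800 rounds up
```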
As with H.264, the recommended ladder will change according to content. Table 5 shows the HEVC ladder for the animated video Sintel, which suggests a ladder that deploys 1080p resolution down to 1400 Kbps, 720p down to 400 Kbps, and so on, and again eliminates the bottom three rungs.
Table 5. The recommended encoding ladder for Sintel
The results shown in Table 5 reflect the fact that synthetic content, more so than most real-world videos, contains detail and sharp edges that retain better quality at higher resolutions. When encoding these videos, along with screencams, PowerPoint-based videos, and other synthetic content, you almost always get better results at larger resolutions than at lower ones.
Table 6 computes how much additional quality the encoding ladder suggested in Table 5 delivers over Apple's recommendations. When considering the impact actually perceived by the viewer, note that a VMAF differential of six points equals a "just noticeable difference," which is defined by the IEEE as "the point where 75% of the viewers prefer A video over B video or vice versa." So the differential shown in the table is definitely worth chasing.
Table 6. Improved VMAF score by using the ladder suggested by Table 5 for Sintel
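To put a VMAF delta in viewer terms, you can express it in just-noticeable-difference units. The sketch below uses the six-point figure cited above; the function names and the sample deltas are hypothetical and are not taken from Table 6.

```python
# Six VMAF points per just noticeable difference, as cited in the article.
JND_VMAF_POINTS = 6.0


def jnd_count(vmaf_delta):
    """Express a VMAF improvement as a multiple of one JND."""
    return vmaf_delta / JND_VMAF_POINTS


def is_noticeable(vmaf_delta):
    """True when the improvement reaches at least one JND."""
    return vmaf_delta >= JND_VMAF_POINTS


# Hypothetical deltas for illustration; substitute the values from your own tests.
for delta in (3.0, 6.0, 12.0):
    print(f"delta {delta}: {jnd_count(delta):.1f} JND, noticeable={is_noticeable(delta)}")
```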
As you'll see in the video, version 10.2 of the Moscow State University (MSU) Video Quality Measurement Tool (VQMT) computes VMAF scores and allows you to compare two videos to the original. In the Camtasia-captured screencam, I'm cycling through the original frame, the 540p frame, and the 270p frame.
What's the bottom line? When creating an encoding ladder for HEVC video, don't duplicate the resolutions of the same ladder used for H.264. You'll optimize quality by pushing higher resolutions lower down in the encoding ladder and eliminating the bottom few rungs.
Though I haven't tested with VP9 and AV1, I'm guessing the same resolution dynamic is true, so if you're creating new ladders for these codecs you should run the same analysis. Finally, before you deploy any ladder for any adaptive bitrate technology, you should thoroughly test it under constrained bandwidth conditions that force switching between layers. It feels highly unlikely that my suggestions would "break" the switching mechanism in HLS players, but it's easy enough to test, and you should.
Meet the Real Expert
A couple of shout-outs. First, the true encoding expert mentioned in the opening paragraph is Tom Vaughan, VP and GM, Video at MulticoreWare, which develops the x265 codec. Tom attended a session on HLS and HEVC at Streaming Media West, and suggested that Apple's recommended ladder was suboptimal. This put the idea in my head to run the analysis and write the article.
I asked Tom to provide a bit more technical detail regarding why HEVC performs better than H.264 at higher resolutions, and these are his comments.
Most of the compression efficiency in video codecs comes from something called "prediction" (the first basic step in the encoding process). It takes relatively few bits to describe a block of pixels as being similar to an identical-sized block of pixels elsewhere in the same frame (intra-prediction), or elsewhere in another frame (inter-prediction). H.264 can describe up to 16x16 blocks of pixels. H.265/HEVC can describe up to 64x64 blocks (16x larger).
As your bitrate target value goes lower (or your CRF setting goes higher), x265 will tend to select larger block sizes automatically, encoding a high-resolution video as efficiently as possible. It does this with an internal weighting algorithm called "rate distortion optimization" (RDO). Bit rates are also minimized by the second encoding step, where the residual error after prediction is encoded (because prediction is rarely perfect). The residual error is the difference between the original source video pixels and the predicted block.
This error is transformed from the spatial domain to the frequency domain through a Discrete Cosine Transform, and the encoder then looks at the relative strength of all of the frequency components. The most significant frequencies are kept, and the less significant frequencies are dropped, or "quantized" away.
At higher bitrate or quality settings, there are more bits available to encode the residual error after prediction, and more frequency components are kept. At lower bit rates, the encoder keeps fewer spatial frequencies, and so the video becomes softer. Because HEVC has a wider range of encoding tools available, it can handle more pixels at low bit rates, describing each area in each frame with a much wider range of block sizes and shapes, making more accurate references to other blocks, and reducing the bit rate as required through quantization of less important spatial frequencies.
Giving the encoder more pixels to begin with gives it more accurate spatial information, letting it do all of this compression more accurately. Of course, there are limits to how many pixels an encoder can handle at very low bit rates. When it runs out of tricks to use, you will start to see blocky compression artifacts (what many people have referred to as "macro-blocking"). To avoid that, the video must be scaled down to reduce the number of pixels.
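To make the transform-and-quantize step in Tom's comments concrete, here is a toy sketch. It is a deliberate simplification and not how x265 or any real encoder is implemented: it uses a one-dimensional floating-point DCT on eight made-up samples, whereas real codecs apply two-dimensional integer transforms to blocks. All sample values and function names are hypothetical.

```python
import math


def dct_1d(samples):
    """Orthonormal DCT-II: spatial samples -> frequency coefficients."""
    n = len(samples)
    coeffs = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(samples))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        coeffs.append(scale * total)
    return coeffs


def idct_1d(coeffs):
    """Inverse of dct_1d: frequency coefficients -> spatial samples."""
    n = len(coeffs)
    samples = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        samples.append(total)
    return samples


def quantize(coeffs, step):
    """Divide by the step and round; weak frequencies collapse to zero."""
    return [round(c / step) for c in coeffs]


def dequantize(levels, step):
    """Scale quantized levels back to approximate coefficients."""
    return [level * step for level in levels]


def count_nonzero(levels):
    """Number of frequency components that survive quantization."""
    return sum(1 for level in levels if level != 0)


# Made-up source and predicted samples; the residual is their difference.
source = [52, 55, 61, 66, 70, 61, 64, 73]
predicted = [50, 54, 60, 64, 68, 64, 66, 70]
residual = [s - p for s, p in zip(source, predicted)]

for step in (1, 4):  # a larger step stands in for a lower bitrate
    levels = quantize(dct_1d(residual), step)
    rebuilt = idct_1d(dequantize(levels, step))
    worst = max(abs(r - b) for r, b in zip(residual, rebuilt))
    print(f"step {step}: {count_nonzero(levels)} coefficients kept, "
          f"max reconstruction error {worst:.2f}")
```

The larger quantization step keeps fewer frequency components, which is the "softer video at lower bit rates" effect described above.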
Simplifying the Process
The other shout-out is to the Hybrik cloud encoding platform, which dramatically simplifies the analyses presented above. To explain, each test video involved 114 separate encodes to H.264 and HEVC, so the two videos totaled 456 encodes. Encoding is simple enough to set up and run overnight with FFmpeg, but then I have to compute VMAF with the VQMT tool, which involves another batch file, and yet another FFmpeg operation to convert all sub-1080p files to Y4M format for the analysis, which takes time and strains my hard disk space. Even worse, I then have to copy and paste 456 VMAF scores from the individual text files output by the MSU tool into my results spreadsheet, which is mind-numbingly dull and error-prone.
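For readers who want to try the manual route, the batch of FFmpeg commands described above can be generated with a short script. This is a sketch under several assumptions, not the settings used for the tests in this article: the source filename, resolution list, and data rates are hypothetical; the encodes are simple single-pass, bitrate-targeted encodes; and the Y4M step assumes a 1920x1080 source and scales each encode back to that size before measurement. The script only prints the commands rather than running them.

```python
import itertools
import shlex

# All values below are hypothetical placeholders, not the article's test matrix.
SOURCE = "source_1080p.mp4"
HEIGHTS = [1080, 720, 540]
BITRATES_KBPS = [4000, 2000, 1000]
CODECS = {"h264": "libx264", "hevc": "libx265"}


def encode_command(codec, height, kbps):
    """Build one FFmpeg encode: scale to the target height, target the data rate."""
    output = f"{codec}_{height}p_{kbps}k.mp4"
    return ["ffmpeg", "-y", "-i", SOURCE,
            "-c:v", CODECS[codec], "-b:v", f"{kbps}k",
            "-vf", f"scale=-2:{height}",  # -2 keeps the aspect ratio with an even width
            "-an", output]


def y4m_command(encoded_file, width=1920, height=1080):
    """Scale an encode back to the source size and write Y4M for the quality tool."""
    output = encoded_file.rsplit(".", 1)[0] + ".y4m"
    return ["ffmpeg", "-y", "-i", encoded_file,
            "-vf", f"scale={width}:{height}", "-pix_fmt", "yuv420p", output]


commands = [encode_command(codec, height, kbps)
            for codec, height, kbps in itertools.product(CODECS, HEIGHTS, BITRATES_KBPS)]

for command in commands:
    print(shlex.join(command))
    print(shlex.join(y4m_command(command[-1])))
```

Even this small matrix produces 18 encodes and 18 conversions, which gives a sense of how quickly the full test set grows.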
With Hybrik, I can configure tests for up to 20 files at once, and after processing them, download a CSV file with all results to input into a spreadsheet, saving huge chunks of input time. In the interest of full disclosure, note that I consult with Hybrik from time to time, including helping them configure their analysis tool, which really (IMHO) streamlined operation for these types of analysis. So, if you're performing high-volume analysis work, you'll find Hybrik invaluable, though unfortunately Hybrik doesn't offer analysis-only pricing, so you'll have to pay $1,000/month minimum for the privilege.