Demuxed: A Video Engineer's Nirvana
Demuxed is billed as "The Conference for Video Engineers." An offshoot of the San Francisco Video Technology Meetup group, it was produced largely by employees of QoE vendor Mux along with multiple volunteers, and sponsored by companies including Bitmovin, Brightcove, Google, Netflix, DLVR, Beamr, Conviva, Wowza, and JWPlayer. This year's event on October 5 was a jam-packed day featuring 19 speakers in talks ranging from 10 to 45 minutes. The eclectic mix of topics offered something for everyone, with most talks a highly technical, authoritative, and useful mix of slides and commentary. Fortunately for me, many encoding-related talks were presented toward the end of the day, and after the last session I was ready for more. Afterwards, there were beverages and heavy appetizers, and a chance to mingle with the speakers and the 300 attendees.
Why should you care? Because all these talks are either available online now, or soon will be, so you can check them out from the comfort of your office or home. What I'll try to do here is to highlight the presentations I found most interesting, and share some of the information that I picked up during the event. For the sake of brevity, I won't mention all the talks, so you should check out the topics here to see if any interest you.
Video references in the article are to captured live streams. This effort did not go smoothly, so there are many gaps in the VOD clips converted from the live stream. Fortunately, all sessions were recorded offline, so they should be available in well-labeled, pristine form over the next couple of weeks. All videos will be posted at demuxed.com. [Ed. Note: An earlier version of this story included links to the raw videos. Those versions have been taken down, and will soon be replaced with the final cuts.]
Netflix's director of streaming standards, Mark Watson, kicked things off with a 45-minute keynote presentation. Watson started by discussing the evolution of the electronic program guide (EPG), which transitioned from a text-heavy menu to an image-based guide to the current video-based discovery experience that Netflix documented in this blog post (Figure 1). The transition to video wasn't to produce eye candy; the videos shown are "specially designed video synopses that help members make faster and more confident decisions by quickly highlighting the story, characters, and tone of a title," Watson said.
Figure 1. Netflix's new video discovery experience uses video to aid selection, but variations in the dynamic range of the videos presented make this very challenging.
Then Watson switched gears and discussed how Netflix was working to add HDR encoding to its content and EPG, with a thorough review of the various HDR delivery technologies (HDR10, Dolby Vision, HLG, HDR10+) and their components (gamma curves, dynamic vs. static metadata). Anyone seeking to quickly get up to speed on how these technologies work will find this section extremely useful.
Finally, Watson wove the two topics together, delineating the technical issues involved with presenting videos with fundamentally different levels of brightness, color, and dynamic range on the same EPG, yet how crucial it was to do this well, lest viewers choose the wrong video for them, or even worse, click away to a different site. Overall, it was a fascinating talk about how challenging it is to integrate HDR into an OTT discovery and delivery service.
While on the topic of HDR, there were a couple of other noteworthy sessions. In ten minutes, Beamr's Dr. Greg Mirsky covered the four most prominent HDR technologies (Dolby Vision, HDR10, HLG, and HDR10+, which is HDR10 with dynamic metadata) with a useful mix of graphs and tables that highlighted their key differences and similarities. Beamr and Dr. Mirsky were kind enough to let us use two summary slides from the presentation, which I've combined into one (Figure 2). If you're looking for a great quick primer on HDR, check out Dr. Mirsky's talk.
Figure 2. Summary of the pros and cons of four prominent HDR technologies; courtesy of Beamr and Dr. Greg Mirsky.
Later in the day, Vittorio Giovara, a senior engineer at Vimeo, presented a detailed 30-minute theoretical talk on HDR technology that started with the color and brightness sensing cones and rods in our eyes, and ended with the strengths and weaknesses of how each codec handled HDR. Along the way, he also covered all the standards involved, going deeper than either Watson or Mirsky.
After Watson, Mozilla video codec engineer Thomas Daede provided an update on the status of the AV1 codec from the Alliance for Open Media, reporting that features will be frozen by the end of October with a hard freeze scheduled for the end of 2017. He then gave a deep technical discussion of the many unique features of the new codec, which he said all contributed to AV1 matching HEVC quality at 75% of the bitrate (Figure 3). The downside was encoding time, which Daede put at 200x the encoding time of VP9 video, though this was for unoptimized code. The decode picture was much brighter, with AV1 decode about 50% slower than VP9.
Daede also pointed viewers to a Bitmovin/Mozilla demo where they could play AV1 video and measure decode complexity themselves, though you'll need to download a "Nightly" version of Firefox to obtain the decoder.
Figure 3. AV1 delivers the same quality as HEVC at about 75% of the bandwidth according to Mozilla's Thomas Daede.
All this is preliminary, of course, as the different companies that deploy AV1 will configure and optimize the codec differently. Still, it was good to hear that AV1 is still moving along and should hit the streets around the end of 2017.
Next came a ten-minute presentation by Twitch's Tarek Amara on S-frames in AV1. Briefly, S-frames are a new frame type that enables stream switching between segments without the need for an IDR frame, which improves system responsiveness to changing conditions and reduces latency. We've been working with I-, B-, and P-frames since MPEG-2; you'll be hearing a lot more about S-frames going forward.
After these two AV1-related discussions, we switched to the dark side of HEVC, with attorney Hector Ribera identifying the licensees, licensors, and pricing policies of the three patent pools (MPEG LA, HEVC Advance, and Velos), and other companies, like Technicolor, that claim to own HEVC-related IP. Ribera reported that Velos still hasn't released its planned royalty model, which may or may not include content royalties. He also shared that at least one of the licensors in the Velos group claimed to own IP that was deployed in VP9 and perhaps even AV1.
Though unstated by Ribera, this may mean that Velos will attempt to form a patent pool for AV1, much like the unwelcome MPEG LA pool for DASH, or perhaps will sue the Alliance for patent infringement. All this and more in 2018, or much later, considering the speed with which the Velos IP owners have moved to date.
Netflix and Dynamic Optimization
From a pure encoding perspective, one of the most interesting talks was by Netflix's Megha Manohara, who described Netflix's improvement over per-title encoding, called dynamic optimization. In her talk entitled "Streaming at 250Kbps, Raise Your Expectations," Manohara started by detailing how critical it was to deliver acceptable-quality, low-bitrate video to many markets around the world.
As you may recall, Netflix debuted per-title encoding in late 2015. At a high level, Netflix's per-title encoding schema works by producing multiple test encodes of the source video at different quality levels and different resolutions to find the ideal bitrate ladder for that video. Based on these findings, Netflix customizes the encoding of the entire video file to optimize quality and bandwidth on a per-title basis.
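To make the per-title idea concrete, here's a minimal sketch of how a bitrate ladder might be selected from a grid of trial encodes. This is my own illustration, not Netflix's actual code, and the resolutions, bitrates, and quality scores are invented for the example:

```python
# Hypothetical per-title ladder selection: from a grid of trial encodes,
# pick for each ladder rung the resolution/encode that maximizes measured
# quality within that rung's bitrate budget. All data values are invented.

def build_ladder(test_encodes, target_bitrates):
    """test_encodes: list of (height, bitrate_kbps, quality) tuples from
    trial encodes; returns one chosen encode per ladder rung."""
    ladder = []
    for target in target_bitrates:
        # Candidates that fit within this rung's bitrate budget
        candidates = [e for e in test_encodes if e[1] <= target]
        if candidates:
            # Keep the encode with the highest measured quality
            ladder.append(max(candidates, key=lambda e: e[2]))
    return ladder

# Invented trial-encode results: (height, kbps, quality score 0-100)
encodes = [
    (360, 400, 70), (360, 700, 78),
    (540, 700, 75), (540, 1200, 84),
    (720, 1200, 80), (720, 2500, 90),
]
print(build_ladder(encodes, [500, 1000, 3000]))
# -> [(360, 400, 70), (360, 700, 78), (720, 2500, 90)]
```

Note how the middle rung stays at 360p because, for this (invented) title, the 360p encode at 700Kbps scores higher than 540p at the same bitrate; that per-title tradeoff is exactly what the test encodes reveal.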
Seeking to harvest even greater gains, Netflix tested encoding on a shot-by-shot basis, though the permutations involved in such an analysis proved too large even for Netflix's encoding capacity. So, they intelligently narrowed the analysis to the combination of quality levels and resolutions most likely to yield a usable output, and started combining multiple shots into longer chunks. Still, it can take anywhere from 12 hours to nine days to encode a Netflix movie, depending upon length, output format (H.264 or VP9), and the encoding backlog.
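The shot-level search can be sketched the same way: for each shot, among the trial encodes that meet a quality floor, keep the cheapest one. Again, this is my own hypothetical illustration with invented numbers, not Netflix's method:

```python
# Hypothetical shot-level ("dynamic") optimization sketch: per shot, among
# trial encodes meeting a quality floor, choose the lowest-bitrate one.
# All numbers are invented for illustration.

QUALITY_FLOOR = 80  # e.g., a VMAF-like score

def optimize_shots(shot_trials):
    """shot_trials: {shot_id: [(height, bitrate_kbps, quality), ...]}
    Returns {shot_id: chosen encode}, minimizing bitrate per shot."""
    chosen = {}
    for shot, trials in shot_trials.items():
        acceptable = [t for t in trials if t[2] >= QUALITY_FLOOR]
        if acceptable:
            chosen[shot] = min(acceptable, key=lambda t: t[1])
        else:
            # Fall back to the best-quality trial if none meets the floor
            chosen[shot] = max(trials, key=lambda t: t[2])
    return chosen

trials = {
    "shot1": [(540, 300, 82), (720, 500, 88)],  # simple scene: 540p suffices
    "shot2": [(540, 300, 64), (720, 900, 81)],  # complex scene needs more bits
}
print(optimize_shots(trials))
# -> {'shot1': (540, 300, 82), 'shot2': (720, 900, 81)}
```

The point of the sketch is that easy shots get dramatically fewer bits than hard ones, which is where the savings over a single per-title ladder come from; the combinatorial cost Manohara described comes from running those trial encodes for every shot.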
The benefit? Depending upon the codec and movie, dynamic optimization delivered up to 55% savings over per-title encoding, reducing the data rate of "enjoyable quality video" from 600Kbps to 270Kbps. Manohara displayed several before and after videos at the end of the presentation, and the results looked phenomenal. Netflix will be detailing this technique in a future blog post, which I personally can't wait to see.
Wowza's Low Latency Video
Latency has been a hot topic over the last twelve to eighteen months, and Wowza's chief architect and VP Scott Kellicker and senior product manager Jamie Sherry were on hand to present their talk, "3 Second End to End Latency at Scale," which detailed Wowza's efforts to create a service offering with sub-3 second latency. The execs started by presenting the current state of latency in various delivery mechanisms, as shown in Figure 4, from the Wowza presentation.
Figure 4. The current state of latency in various delivery mechanisms.
They also noted that Apple HLS typically delivered 30+ seconds of latency (using ten-second chunks), while DASH delivered between 10-30 seconds of latency. While this performance is acceptable for many live events, it clearly isn't for others, particularly when wagering or auctions are involved.
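The arithmetic behind those numbers is straightforward: players commonly buffer around three segments before starting playback, so latency scales with segment duration. A back-of-the-envelope sketch (my own simplification, ignoring encode, packaging, and CDN overhead):

```python
# Rough chunked-streaming latency estimate: players commonly buffer ~3
# segments before playback begins, so glass-to-glass latency scales with
# segment duration. Encode/packaging/CDN overhead is ignored here.

def chunked_latency(segment_seconds, segments_buffered=3):
    return segment_seconds * segments_buffered

print(chunked_latency(10))  # classic 10-second HLS chunks -> 30 seconds
print(chunked_latency(2))   # 2-second segments -> 6 seconds before overhead
```

This is why shrinking segments only gets you so far: even one-second chunks leave several seconds of buffer-driven latency, which is what pushed the Wowza team away from chunk-based delivery entirely.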
The presentation walked through the Wowza team's analysis regarding the best way to reduce latency, which first concluded that chunk-based streaming was simply too slow for certain use cases. Instead, their solution would use "old school" server-based streaming. Next they considered whether to use WebRTC or websocket delivery, ultimately deciding on the latter because it was more stable across different browser implementations, and was more flexible regarding codec offerings. The fact that metadata wasn't time-synced with the audio and video was another WebRTC limitation.
Kellicker and Sherry said that, when building the system, they had solid building blocks already in place, including a robust streaming server, a mature infrastructure for a cloud offering, and a Media Source Extensions-based player. Next they had to choose a streaming protocol, and it turned out that they were able to use WOWZ, an in-house, low-latency, bi-directional protocol already used for their server-to-server communications. So they added this protocol to Wowza's media server and player and delivered it via websockets. They also designed a structure that could be used over origin, midgress, and edge servers, since that was how their large-scale customers would use the technology.
The annual conference for video engineers by video engineers produced must-watch sessions from Akamai, YouTube, Mux, and many others. Here's a survey of useful sessions, along with a link to the video library.