How to Build a Scalable Low-Latency Streaming Solution
In this article I’ll look at low-latency live streaming at scale, and the standards and tools that make it possible. As a player developer, I'm particularly interested in the concept of low-latency live streaming and interactivity. For too long, we've limited streaming players to the same controls found on a VCR: play, pause, rewind, fast forward, and so on. But having low latency and interaction expands what it means to be a video player.
Figure 1 (below) shows some examples of players that incorporate interactive elements. With Periscope, you have a live streamer on their phone and their fans watching and hitting the little heart button. The streamer sees that and can react to the feedback. With Twitch, you have a video game streamer. Their fans are chatting along with the stream, the gamer can respond to what’s happening in the chat, and the people watching get that feedback. HQ is a popular trivia app where a host asks questions, people in the audience answer them, and the host reads off the results. That kind of Q&A demands quick feedback.
Figure 1. Interactive streaming apps
When Low Latency Matters (and Why)
Traditional broadcast media doesn’t require this kind of interactivity. With something like the Super Bowl or the Olympics, it's more of a lean-back experience. You might have a Twitter feed going on the side, but the broadcast itself is meant to be watched passively. You’re not interacting with the people on the field or in a newscast. You’re just consuming the stream.
Of course, latency can detract from these experiences as well, when you’re watching the game at home and you hear your neighbors cheer for a goal 10 seconds before you see it yourself. It spoils the moment. That’s a real issue with broadcast media, but I would argue that latency is not as big of an issue as it is with interactive video.
With an event like the Super Bowl or the World Cup, the producers are going to introduce as much as 30 seconds of latency to begin with even before the stream leaves the venue. They're going to have some latency in there just so the director can cut away to a commercial if the commentator starts going off the rails and spouting crazy things.
In some scenarios, low latency isn’t worth the trouble. To get to lower latency, no matter what you’re doing, you’re introducing some level of cost and instability in the network, getting from the venue to the player. At a big event, producers are reluctant to take a lot of those risks or added cost. You’re going to stay with something that you know is stable and consistent and reliable.
And then finally, the thing that streaming audiences hate even more than latency is rebuffering. We’ve all experienced this: a striker is about to take a shot on goal, and the stream stops to rebuffer.
Most of the time, lowering latency means introducing more rebuffering into the chain. Reducing latency means shrinking the buffer the player keeps to protect itself against rebuffering, and when you do that, you are almost certainly introducing more rebuffering for a wider part of your audience.
No one streaming the Super Bowl or other large events would attempt such a low-latency stream unless they were confident that they weren’t going to introduce more rebuffering by lowering the latency. Avoiding rebuffering is a higher priority.
Also, when I say “interactive live streaming,” I’m not talking about real-time, two-way audio communication apps like Google Hangouts, Skype, or Zoom. With these, you’re aiming for 0.3 seconds or less of latency. You get beyond 0.3 seconds in that scenario and the speakers start talking over each other.
With interactive live streaming, we’re not scaling up a Google Hangouts room to 1,000 people; it's not meant for that. Low-latency live streaming at scale is interesting because it falls right between these two worlds and picks up the challenges of both sides.
Building an Interactive Streaming App
When building an application that involves both viewer interaction and low-latency streaming, you need a couple of things. First, you need a real-time data framework built on mature technologies like WebSockets and WebRTC data channels. Plenty of readily available hosted services such as Firebase, Pusher, or PubNub can help you build an application that’s snappy and reliable enough that when a participant sends a chat message, other viewers see it almost right away. Today, real-time data is a relatively solved problem.
On the other side, we have the low-latency video at scale, for which we don't really have a clear-cut solution, although viable standards and solutions are starting to emerge.
How Low Is Low Latency for Streaming Video?
The graph shown in Figure 2 (below) illustrates what constitutes low latency for video. The left-hand side shows high latency, 30-60 seconds. It’s relatively common, especially on the web or iOS devices. When Apple first introduced HLS, they recommended that you chop your video into 10-second segments, and that the player buffer three segments before it starts playing. Three segments times 10 seconds equals 30 seconds. That's just in the player itself, so that's not even counting the rest of the glass-to-glass latency it takes to get from the camera to the player.
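To make that arithmetic concrete, the player-side latency floor is just segment duration times the number of segments the player buffers before starting. A one-line sketch (the function name is mine, not from any spec):

```python
def player_buffer_latency(segment_seconds, segments_buffered):
    """Latency the player alone adds before playback can start."""
    return segment_seconds * segments_buffered

# Apple's original HLS guidance: 10-second segments, three buffered.
print(player_buffer_latency(10, 3))  # 30 seconds, before any upstream delay
```

Shortening segments moves this floor down directly, which is why segment duration dominates the rest of this discussion.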
Figure 2. Live streaming latency spectrum
Over the last few years, people have begun to reduce the length of segments to 2-6 seconds to start the player faster and reduce latency to 6-18 seconds. As mentioned earlier, you introduce a little more rebuffering. But if you take this approach and your player and network are reliable, low latency becomes easier to achieve.
There are a lot of demos of low-latency live streams with one-second segments. The problem is, those demos don't always work so well outside of the demo environment. Once you get to real-world networks, an issue arises when you’re making a request for data every second. The requests themselves can have as much as half a second of overhead; even just starting to receive the data for those segments can have overhead. So you’re essentially introducing air into those requests that is going to make it even harder to keep up with that stream, and harder to not enter a rebuffering state.
When we go down to one-second segments at scale, streams start to fall over a lot more quickly, especially on mobile and other difficult networks. I don't recommend going that low. You’re certainly welcome to test it, but as you look at demos showing how to reach this level with one-second segments, just be wary that in production it might not work so well.
With two-second segments, we can start to get into the idea of low latency. If you get the player-side latency down to six seconds--assuming less than four seconds of latency getting from the camera out to the player--then you're in that 10-second range. These numbers aren’t perfect, but they work in the real world. Recently, I spoke with a developer who is building a Twitch clone. They have chat next to the video window, and today they’re pretty happy with 10-second segments. Of course, no latency would be preferable, but for the type of interaction involved and the ability to react with chat, sub-10 seconds is what people, in my experience, have asked for.
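Putting the two pieces together, a rough glass-to-glass estimate is the upstream (camera-to-edge) latency plus the player's buffer. The numbers below are the illustrative ones from this article, not guarantees:

```python
def glass_to_glass(upstream_seconds, segment_seconds, segments_buffered):
    """Rough end-to-end latency: camera-to-edge plus the player's buffer."""
    return upstream_seconds + segment_seconds * segments_buffered

# Two-second segments, three buffered, ~4 s from camera to the edge:
print(glass_to_glass(4, 2, 3))  # 10 -- the sub-10-second range
```

This is why two-second segments, not one-second ones, are where sub-10-second latency becomes practical.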
When Twitch first got started, they had as much as 20 seconds of latency, and users grew accustomed to a 20-second delay between sending a chat message and seeing the streamer respond. Now Twitch is down to the sub-six-second range, and expectations might change as people get used to that.
But today, when I talk to developers building these applications, knowing that the challenge is to get down to lower latency, I hear people asking for 10 seconds for those applications. When you get into something like an HQ trivia app, where you actually have a host whose main job is to respond to what’s happening from the viewers, that’s when I hear requests for four seconds or less.
As for sub-second latency, that is really the realm of WebRTC. RTMP also fits into this area. These protocols are great for enabling real-time audio communication, but they’re relatively expensive to scale. By contrast, HTTP--which comprises everything to the left of sub-second in Figure 2--is cheap to scale, but makes it difficult to get down to lower latency.
In this ultra-low latency range, two different mechanisms are competing to be the protocol that everyone uses to get to low latency at scale—scaling up WebRTC, using a bunch of media servers, versus trying to figure out how to hack the manifest in some way to get those chunks into the player as quickly as possible.