How to Build a Scalable Low-Latency Streaming Solution
Designing an Ingest Workflow for Low-Latency Streaming
In this section, we’ll walk through the live stream workflow and explore some of the options at each step. As shown in Figure 3 (below), you can start with a camera plus an on-premises hardware encoder, such as Elemental’s. That's a great way to get high-quality live video into your network. On the software encoding side, the most popular desktop software for Twitch live streamers is Open Broadcaster Software (OBS). The third alternative is streaming from a mobile device. This is typically the preferred (or sometimes the only) option for streaming to Facebook Live or Periscope.
Figure 3. Key elements of streaming ingest
As you work down the capture list, you'll find yourself more and more constrained in the level of video quality you can push. With a high-quality hardware encoder, you can push 1080p. With some of the newer ones, you can push 4K and 8K. It obviously depends on your network, but a hardware encoder can keep up with higher quality.
With desktop software on an average laptop, you might struggle to push out a 4K stream and keep the encoder running without dropping frames or building up a buffer before the video even enters the network. When streaming from a mobile device, you're heavily restricted in the amount of processing you can do and the amount of video that you can send out. Periscope limits the video that you can send to 540p. Facebook made the leap from 720p to 1080p this year, but I don't see a lot of people going up to 1080p on that platform, especially on mobile devices, which tend to be fairly locked down on how much you can stream out, both because of the device itself and because of the network that you’re working with.
On the network side, you have the option of a wired network versus a mobile network. Whenever possible, you should avoid streaming over WiFi networks, because the last thing you want to have to diagnose is whether issues in your stream originate outside of your network or in your local WiFi. This type of confusion arises with particular frequency in UGC streaming. Of course, the support requests come in like crazy when it happens. Often, with a mobile application like Facebook Live, a mobile network is all you have. You definitely want to be conservative on the size of video that you’re sending in that case.
Protocol-wise, you have a handful of options for transmitting the video from the capture device through the network to the origin. RTMP is the venerable protocol from Flash live streaming. It’s a widely supported and stable protocol. Next you have WebRTC, a newer protocol and an open standard in the browser. WebRTC is not supported in nearly as many places as RTMP, and as a result it’s not as commonly used in streaming today.
But WebRTC was designed from the ground up for real-time communication and for keeping up with the live edge of audio and video. It will aggressively throw away quality to avoid rebuffering if it thinks it can’t keep up. If your top priority is the quality of the stream coming in, even if some rebuffering might result, WebRTC may not prove the best approach for you, because out of the box, it’s going to reduce the quality to keep up.
The newest protocol, Secure Reliable Transport (SRT), is reportedly twice as fast as RTMP (which is already pretty fast). If low latency is paramount and you’re trying to trim every possible millisecond, it’s worth looking into. It’s not nearly as widely supported as RTMP--especially for mobile devices--but for more professional streamers, SRT is becoming a more popular solution.
Processing the Live Stream
Let’s assume we’re using a wired connection with RTMP going into the origin/transcoder (Figure 4, below). Our first option is to pass the video through from the encoder through the origin into the network. With WebRTC, if you want more renditions of the video, the client creates multiple versions of the video and streams all of them out at the same time. That fills your network and reduces the quality you can send in the top rendition. But that’s a passthrough, so you’re not doing any transcoding in the center, and it just goes straight out to your network.
Figure 4. Processing the live stream
It’s the same with UGC content. When you’re talking about a platform like Twitch, where 95% of the streams get one viewer, it would be ridiculously expensive to transcode every single stream that comes in. What Twitch does, essentially, is take that stream, put restrictions on what their streamers can send to them, and then send that stream through to their viewers until it hits a certain number of viewers. At that point they take it and transcode it to add the different renditions.
That becomes the next option, which is introducing transcoding in the middle. Realistically, if you’re trying to support audiences that are not all U.S.-based and on a wired connection, you're introducing multiple renditions of the video so that the player can make a better decision, based on the user’s network, about which quality of video it should play to avoid rebuffering.
There are a couple of different adaptive manifest options, and newer mechanisms built into the standards to make low latency possible. DASH incorporates the concept of a segment template, which allows a player to know what the segment name and URL are going to be without having to make a request for another manifest file. As the video is streaming, the player downloads a master manifest text file that shows where all the different renditions are, then requests more text files to learn where the actual chunks of video are. Each of those steps introduces latency into streaming playback. With a segment template, you can skip that second request for a rendition manifest and go straight to requesting the chunks of video.
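As a rough illustration of the segment-template idea, the Python sketch below builds chunk URLs directly from a template string, with no per-rendition manifest request. The template, the rendition name, and the helper names `expand_template` and `live_segment_number` are assumptions made for this example and are not taken from any particular player. Only the `$RepresentationID$`, `$Number$`, and `$Bandwidth$` identifiers, with an optional `%0Nd` width on the numeric ones, come from the DASH standard.

```python
import re

# Matches $RepresentationID$, $Number$ or $Bandwidth$, optionally with a
# zero-padded width such as $Number%05d$.
_IDENTIFIER = re.compile(r"\$(RepresentationID|Number|Bandwidth)(%0\d+d)?\$")


def expand_template(template: str, representation_id: str,
                    number: int, bandwidth: int) -> str:
    """Fill a DASH-style segment template with concrete values."""
    values = {
        "RepresentationID": representation_id,
        "Number": number,
        "Bandwidth": bandwidth,
    }

    def substitute(match: re.Match) -> str:
        name, width = match.group(1), match.group(2)
        value = values[name]
        # Width formatting only makes sense for the numeric identifiers.
        if width and isinstance(value, int):
            return width % value
        return str(value)

    return _IDENTIFIER.sub(substitute, template)


def live_segment_number(start_number: int, segment_seconds: float,
                        seconds_since_start: float) -> int:
    """Number of the segment being produced right now (simplified model)."""
    return start_number + int(seconds_since_start // segment_seconds)


# Hypothetical template for illustration only.
TEMPLATE = "video/$RepresentationID$/seg-$Number%05d$.m4s"
newest = live_segment_number(start_number=1, segment_seconds=2.0,
                             seconds_since_start=61.0)
url = expand_template(TEMPLATE, "720p", newest, 3_000_000)
```

Because the player can compute `url` from the clock and the template alone, it needs no extra round trip before asking for the next chunk of video.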
DASH also uses an approach called chunked transfer encoding (often shortened to chunk transfer), which is essentially the magic behind the low-latency approach. Originally used for transferring data objects, chunked transfer encoding is a mechanism that’s been built into the HTTP protocol since HTTP/1.1. Suppose you’re transferring a big data object to the browser, but you want the browser to be able to start working on that data and showing rows right away. If you’re listing countries with a bunch of data, for example, you want to be able to show that data country by country on-screen as it comes into the webpage, instead of requiring the user to wait for all the data to arrive before seeing any of it.
As effective as chunk transfer is for other types of data, it’s surprising that we haven’t used it for video until recently. Without chunk transfer encoding, when you request a segment, whether it’s two seconds or 10 seconds, you’re waiting for that entire segment to download before you can even start processing or playing its first frames. With a two-second segment, that's close to two seconds' worth of latency that you are introducing into the chain just waiting for the segment to download. Chunk transfer, now supported by all CDNs, allows us to start processing, transmuxing, or decoding that stream of data and sending it through as the first frames of the video are coming in, which cuts a good number of seconds from that chain.
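To make the mechanism concrete, here is a minimal Python sketch of the HTTP/1.1 chunked wire format: each chunk is a hexadecimal length line, the payload, and a trailing CRLF, and a zero-length chunk ends the body. The function names and the sample rows are invented for this example. The decoder ignores chunk extensions and trailers, so treat it as an illustration and not as a complete HTTP implementation.

```python
from typing import Iterable, Iterator


def encode_chunked(pieces: Iterable[bytes]) -> Iterator[bytes]:
    """Wrap each piece as one chunk: hex size, CRLF, payload, CRLF."""
    for piece in pieces:
        if piece:  # a zero-length chunk would terminate the body early
            yield b"%X\r\n%s\r\n" % (len(piece), piece)
    yield b"0\r\n\r\n"  # terminating chunk


def decode_chunked(stream: Iterable[bytes]) -> Iterator[bytes]:
    """Yield each chunk payload as soon as all of its bytes have arrived."""
    buffer = b""
    for data in stream:
        buffer += data
        while True:
            header_end = buffer.find(b"\r\n")
            if header_end == -1:
                break  # the size line is not complete yet
            size_text = buffer[:header_end].split(b";")[0]
            size = int(size_text.decode("ascii"), 16)
            if size == 0:
                return  # end of body (trailers are ignored in this sketch)
            body_start = header_end + 2
            body_end = body_start + size
            if len(buffer) < body_end + 2:
                break  # wait for the rest of the payload and its CRLF
            yield buffer[body_start:body_end]
            buffer = buffer[body_end + 2:]


rows = [b"row-1\n", b"row-22\n"]  # placeholder data
wire = b"".join(encode_chunked(rows))
# Simulate the network delivering the response three bytes at a time.
arrivals = [wire[i:i + 3] for i in range(0, len(wire), 3)]
received = list(decode_chunked(arrivals))
```

The receiver can act on the first row while later rows are still in flight, which is the same property a player relies on to start decoding the first frames of a segment early.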
The challenge with this approach, however, is adapting to the stream. In these adaptive frameworks, each time a segment comes down, the player marks when it starts requesting the segment and measures how long it takes to get the whole thing. It then looks at the file size of that segment and estimates the network speed from that. If the network speed is faster than the bitrate of the video that is currently playing, and fast enough to handle a jump up to the next quality level, then the player can switch up. Or, if it notices that the network is dropping a little bit, it can switch down to a lower-quality rendition.
The problem with chunk-transfer encoding is that the player is actually requesting the segment as it’s being created. As the encoder is creating segment 1 and putting it on the origin, those bytes can be streamed all the way through the CDN, so the encoder is still creating the last part of the segment while the player is receiving the first part.
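The measure-then-switch logic described above can be sketched in a few lines of Python. This is a simplified model, not any specific player's algorithm: the bitrate ladder, the `safety` factor of 0.8, and both function names are assumptions chosen for illustration.

```python
from typing import Sequence


def estimate_throughput_bps(segment_bytes: int, download_seconds: float) -> float:
    """Network speed implied by one completed segment download."""
    return segment_bytes * 8 / download_seconds


def choose_rendition(bitrates_bps: Sequence[int], throughput_bps: float,
                     safety: float = 0.8) -> int:
    """Pick the highest bitrate that fits within a fraction of throughput.

    Falls back to the lowest rendition when nothing fits.
    """
    budget = throughput_bps * safety
    affordable = [rate for rate in sorted(bitrates_bps) if rate <= budget]
    return affordable[-1] if affordable else min(bitrates_bps)


LADDER = [800_000, 2_500_000, 5_000_000]  # hypothetical renditions, bits/s

# A 1.25 MB segment that arrived in 1 second implies about 10 Mbps.
fast = choose_rendition(LADDER, estimate_throughput_bps(1_250_000, 1.0))
# The same segment taking 4 seconds implies about 2.5 Mbps.
slow = choose_rendition(LADDER, estimate_throughput_bps(1_250_000, 4.0))
```

With these numbers the player would step up to the 5 Mbps rendition in the first case and down to 800 kbps in the second, because 2.5 Mbps of measured speed leaves no headroom for the 2.5 Mbps rendition once the safety factor is applied.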
That means that the segment is being passed through in real time. Obviously, an encoder can’t capture video faster than it's happening in real life, so it's sending the video frame by frame as the frames are coming in. The network speed will never be measured any higher than the speed at which the video is playing back. So how does the player know if it can switch up to a higher-quality rendition? It can still switch down easily, but this means that once the player starts to notch down after encountering a playback issue, it will never work its way back up.
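A small numeric model shows why the estimate gets stuck. The sketch below assumes an idealized network with a fixed capacity and a segment whose bytes become available at exactly the video bitrate; the function name and all of the numbers are made up for the example.

```python
def measured_throughput_bps(video_bitrate_bps: float, network_bps: float,
                            segment_seconds: float, live_chunked: bool) -> float:
    """Throughput a player would compute from one segment download."""
    segment_bits = video_bitrate_bps * segment_seconds
    transfer_seconds = segment_bits / network_bps
    if live_chunked:
        # Bytes cannot arrive before the encoder has produced them, so the
        # download lasts at least as long as the segment itself.
        transfer_seconds = max(transfer_seconds, segment_seconds)
    return segment_bits / transfer_seconds


# 2 Mbps video over a 20 Mbps connection, two-second segments.
cached = measured_throughput_bps(2_000_000, 20_000_000, 2.0, live_chunked=False)
live = measured_throughput_bps(2_000_000, 20_000_000, 2.0, live_chunked=True)
# Same video over a congested 1 Mbps connection.
congested = measured_throughput_bps(2_000_000, 1_000_000, 2.0, live_chunked=True)
```

In this model `cached` comes out near 20 Mbps, `live` is pinned at the 2 Mbps video bitrate, and `congested` correctly reports about 1 Mbps. The player can see that it needs to switch down, but it never sees evidence that it could switch up.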
There are a few ways to address this issue. One is to find an opportune time to make a measurement: request a small chunk of a cached segment from the past, see how long it takes to transfer, and use that as your estimate. That's not ideal, because it puts an additional burden on the network you’re using to pull the live stream, which might impact streaming delivery while you’re running the test.
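The probing idea could look something like the following Python sketch. The `fetch` callable stands in for whatever HTTP client the player uses and is purely hypothetical, as are the URL and the function name; the clock is injectable so the example can run without a network.

```python
import time
from typing import Callable


def probe_throughput_bps(fetch: Callable[[str], bytes], url: str,
                         clock: Callable[[], float] = time.monotonic) -> float:
    """Time the download of an already-cached object and return bits/s."""
    started = clock()
    payload = fetch(url)  # should hit the CDN cache, not the live edge
    elapsed = clock() - started
    if elapsed <= 0:
        raise ValueError("probe finished too quickly to measure")
    return len(payload) * 8 / elapsed


# Offline demonstration with a fake fetch and a fake clock:
# 125,000 bytes delivered in half a second.
ticks = iter([10.0, 10.5])
estimate = probe_throughput_bps(
    fetch=lambda url: b"x" * 125_000,
    url="https://cdn.example/video/720p/seg-00001.m4s",  # placeholder URL
    clock=lambda: next(ticks),
)
```

Here `estimate` works out to 2 Mbps. In a real player the probe bytes would compete with the live stream for the same connection, which is the drawback noted above.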
Another proposed approach is to test whether the micro-chunks that are coming through the network are cached, or if they’re coming through live at the live stream’s speed. This method requires the player to do a lot of complicated math to determine if the chunks are coming through at the actual speed of the network. I have heard that some people have tried to put this approach into production and haven't quite got it to work yet, but--at least in theory--it’s the best approach that we have moving forward to getting this mechanism working.
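One very simplified way to picture that math is to keep only the chunks that arrived faster than the video's own bitrate, on the assumption that those were already waiting upstream and so moved at network speed. The Python sketch below does this with a fixed `margin` threshold. The threshold, the function name, and the sample timings are all assumptions for illustration, and nothing here is claimed to be a production-proven method.

```python
from typing import Iterable, Optional, Tuple


def burst_throughput_bps(chunks: Iterable[Tuple[int, float]],
                         video_bitrate_bps: float,
                         margin: float = 1.5) -> Optional[float]:
    """Estimate network speed from chunks that were not paced by the encoder.

    Each chunk is (size_bytes, seconds_from_first_to_last_byte). Chunks whose
    rate is within `margin` times the video bitrate are treated as arriving
    at live speed and are ignored. Returns None when no chunk qualifies.
    """
    bits = 0.0
    seconds = 0.0
    for size_bytes, seconds_taken in chunks:
        if seconds_taken <= 0:
            continue  # unusable timing sample
        rate = size_bytes * 8 / seconds_taken
        if rate > video_bitrate_bps * margin:
            bits += size_bytes * 8
            seconds += seconds_taken
    return bits / seconds if seconds else None


# Hypothetical samples for a 2 Mbps stream: one encoder-paced chunk and two
# chunks that arrived in a burst.
samples = [(50_000, 0.2), (50_000, 0.02), (50_000, 0.025)]
estimate_from_bursts = burst_throughput_bps(samples, video_bitrate_bps=2_000_000)
all_paced = burst_throughput_bps([(50_000, 0.2)], video_bitrate_bps=2_000_000)
```

With these samples the paced chunk is discarded and the two burst chunks give an estimate of roughly 17.8 Mbps, while a stream with only paced chunks yields no estimate at all. That second case is a large part of why this approach is hard to get working reliably.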