Sweet Streams: Optimizing Video Delivery
One foundational technology we need to cover is HLS, also known as HTTP Live Streaming. It relies on the Hypertext Transfer Protocol, or HTTP. But HTTP itself has a further dependency on TCP, the Transmission Control Protocol, in order to get the object over there in the first place.
So it’s layer upon layer upon layer upon layer of different protocols, all of which have to operate and inter-operate correctly, and all of which have to do their job properly in order for your end user to get a lovely continuous video. As I said, this is a very, very different world from the way it used to be. You have to optimize all the possible layers if you’re determined to get a quality product out there with a high quality of experience.
Let’s start with TCP and work our way back up the stack. The goal of TCP is to maintain network quality because you can’t have a high-quality experience without high network quality.
TCP’s job breaks down into three parts: flow control, reliability, and congestion control. Flow control asks: What’s the fastest data rate we can negotiate? Reliability is key; if we’re dropping packets, we have to retransmit information, and TCP does that automatically. Then there’s congestion control. It goes back to one of the very earliest points I mentioned: If I have to divvy up this bandwidth amongst a bunch of people, who gets what? Congestion control is a way for everybody to participate fairly, get their fair chunk of bandwidth, and not let somebody steal it all and starve everybody else.
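To make the fairness idea concrete, here is a toy model of additive-increase/multiplicative-decrease (AIMD), the scheme classic TCP congestion avoidance is built on. It is only a sketch: the function name, the starting rates, and the link capacity are made up for illustration, and a real TCP stack is far more involved.

```python
def aimd_share(rate_a, rate_b, capacity, rounds):
    """Toy model: two flows share one link using AIMD."""
    for _ in range(rounds):
        if rate_a + rate_b > capacity:
            # Link overloaded: both flows see loss and halve their rate.
            rate_a /= 2
            rate_b /= 2
        else:
            # Room to spare: both flows probe for a little more bandwidth.
            rate_a += 1
            rate_b += 1
    return rate_a, rate_b


# Hypothetical start: one flow hogging the link, one newcomer.
greedy, newcomer = aimd_share(rate_a=80.0, rate_b=10.0, capacity=100.0, rounds=1000)
print(f"greedy flow: {greedy:.2f}, newcomer: {newcomer:.2f}")
```

Every halving cuts the gap between the two flows in half, while the additive steps leave the gap alone, so the two rates drift toward an equal split no matter how lopsided the start was.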
The reason why latency and bandwidth actually interact, and why they’re essentially two sides of the same performance coin, is that TCP is bursty. It doesn’t start out sending a constant stream of information. Instead, it takes some chunk of information, call it a packet if you’d like, sends it out, and then waits. It waits to get a response back from whatever device it was trying to talk to.
That’s a problem. The longer the latency, the longer the time-to-first-byte, the longer it takes to go between the server and the client, the less bandwidth we can get because we have to wait until that information is acknowledged.
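A rough way to see this is to treat throughput as the amount of data in one burst divided by the round-trip time it takes to get that burst acknowledged. The sketch below assumes a fixed 2Mb burst (a number picked purely for illustration) and varies only the round trip.

```python
def max_throughput_mbps(burst_megabits, round_trip_seconds):
    """Upper bound on throughput when each burst must be acknowledged
    before the next one is sent."""
    return burst_megabits / round_trip_seconds


# Same burst size, increasingly distant (higher-latency) clients.
for round_trip in (0.05, 0.5, 2.0):
    mbps = max_throughput_mbps(2.0, round_trip)
    print(f"round trip {round_trip:>4} s -> at most {mbps:g} Mbps")
```

Nothing about the link itself changed between those three lines; only the wait for the acknowledgement did.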
What happens if that information is not acknowledged? Then the server has to retransmit, which is terrible on a high-latency connection because it doubles the wait time between when you send some information and when you know that it has been received. It’s also why, when you’re walking around with your cell phone, it’s a miracle that this stuff works as well as it does.
Let’s say you want to stream video from a server to an iPhone. Your video is encoded at 1 megabit per second (1Mbps). Obviously, these numbers are deliberately chosen to make the math super-simple. In the real world, you’ll get slightly different measurements.
We know we’re going to send from the server to the iPhone, and we have one-second latency. So it takes one second to get to the iPhone, and one second for the iPhone’s reply to come back to the server.
As long as TCP can negotiate a speed of 1Mbps, the iPhone can play the 1Mbps encoded video at its intended rate. We do that by sending 2Mb, which takes one second to go over. The iPhone sends its acknowledgement right back, which takes another second, and 2Mb divided by the two seconds it took to send that 2Mb gives us the 1Mbps that we needed. Great.
What happens if we have loss? We’re still encoded at 1Mbps. We send 2Mb of data. The iPhone never receives it, so it can’t send an acknowledgement. The server waits two seconds, and says, “I know you’re a second away, so it should’ve taken two seconds for me to get that acknowledgement and I didn’t get it. I better retransmit.”
This time, the iPhone gets it and sends its acknowledgement back, and another two seconds have passed. Now do the math. We just sent 2Mb in four seconds, which is only a half-megabit per second. If you’re watching a video encoded at 1Mbps, what happens? Delay. Stuttering. Buffering, buffering, buffering, buffering. This is why retransmits are so incredibly important.
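The arithmetic for the clean case and the lossy case can be sketched in a few lines. This uses the same deliberately simple numbers (2Mb per burst, a two-second round trip) and assumes the server retries after exactly one round trip of silence; real TCP retransmission timers are adaptive and usually longer, so treat this as an illustration only.

```python
ENCODED_MBPS = 1.0  # bitrate of the video in the example


def effective_mbps(burst_megabits, round_trip_seconds, lost_attempts):
    """Throughput for one burst when some attempts are lost.

    Assumes each lost attempt costs one full round trip of waiting
    before the server retries (a simplification).
    """
    total_seconds = round_trip_seconds * (lost_attempts + 1)
    return burst_megabits / total_seconds


for lost in (0, 1, 2):
    rate = effective_mbps(2.0, 2.0, lost)
    verdict = "plays smoothly" if rate >= ENCODED_MBPS else "stalls and buffers"
    print(f"{lost} lost attempt(s): {rate:.2f} Mbps, video {verdict}")
```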
This is also why it is amazing that we can send video over a 4G network, which is intrinsically lossy, with intrinsically variable latencies. If we ever have to retransmit, we could be waiting several seconds, or we could be waiting milliseconds. It depends. It’s really quite remarkable that this stuff works as well as it does.
When you’re doing your video encoding, it’s important to have a variety of encodings available. If you have a smart protocol like HLS (more on that later), you can negotiate down as well as up. So if you are in this sort of lossy, variable-latency case, you can actually negotiate down to a lower-bandwidth encoded video stream. That way your end viewer can at least still see your video, even under lossy and variable-latency conditions where the quality of experience isn’t quite as high as you would like for it to be.
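Here is a minimal sketch of what "negotiating down" can look like from the player's side: pick the highest encoding that fits inside the bandwidth it has recently measured, with some safety margin. The bitrate ladder, the 80% headroom factor, and the function name are all assumptions for illustration; actual HLS players use their own, more elaborate heuristics.

```python
# Hypothetical set of encodings of the same video, in kilobits per second.
LADDER_KBPS = [400, 1000, 2500, 5000]


def pick_rendition(bitrates_kbps, measured_kbps, headroom=0.8):
    """Choose the highest bitrate that fits in a fraction of measured bandwidth.

    Falls back to the lowest encoding so the viewer still sees something.
    """
    budget = measured_kbps * headroom
    affordable = [b for b in sorted(bitrates_kbps) if b <= budget]
    return affordable[-1] if affordable else min(bitrates_kbps)


for measured in (10000, 1500, 300):
    choice = pick_rendition(LADDER_KBPS, measured)
    print(f"measured {measured} kbps -> play the {choice} kbps encoding")
```

The fallback to the lowest rung is the point: a rough picture beats a spinner.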
I mentioned earlier that HLS stands for HTTP Live Streaming. It depends on HTTP, which in turn depends on TCP, so now we move back up the stack. When HTTP sends an object, we call it an HTTP object. It’s called an object for a reason that comes from the development world: You have this lovely discrete thing which is entirely self-contained. Within that, we have metadata, information about the data that we’re trying to send, split into a header and a body: headers are where you store the information about the body, and the body is the actual file transferred.
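To show the header/body split, the sketch below pulls apart a hand-written HTTP/1.1 response. The response bytes are invented for the example (a five-byte stand-in for a video segment), and the parser is deliberately naive; in practice you would let an HTTP library do this.

```python
# A made-up response: status line, headers, blank line, then the body.
RAW_RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: video/mp2t\r\n"
    b"Content-Length: 5\r\n"
    b"Cache-Control: max-age=60\r\n"
    b"\r\n"
    b"\x47\x40\x00\x10\x00"
)


def split_http_object(raw):
    """Split a raw HTTP response into status line, headers, and body."""
    # The first blank line separates the metadata from the payload.
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


status, headers, body = split_http_object(RAW_RESPONSE)
print(status)
print(headers)
print(len(body), "bytes of body")
```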
This is a big deal: we can pass this unitary object from server to server to server and it can get passed unchanged. This means I can chain servers together. If I have a server that’s local to that iPhone, I can use that server to serve the data even if my content server is half the world away.
This means I can get a very low-latency connection between the iPhone and my server while at the same time having my origin data half a world away. This is called a caching, or proxy, server. If the proxy server is allowed to cache, then all of a sudden, a whole new world of performance opens up to me, and all I need to worry about is the connection between my iPhone and my caching server. I don’t care about the connection between the iPhone and the origin server.
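A caching proxy can be sketched as a lookup table sitting in front of a slow fetch. Everything named here (the `CachingProxy` class, the `fake_origin` function, the segment path) is hypothetical, and the sketch never expires anything, which a real cache would have to do according to its caching policy.

```python
class CachingProxy:
    """Toy caching proxy: keeps a copy of every object it has fetched."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._cache = {}
        self.origin_trips = 0  # how many times we paid the long-haul latency

    def get(self, path):
        if path not in self._cache:
            # Cache miss: go all the way to the distant origin, once.
            self.origin_trips += 1
            self._cache[path] = self._fetch(path)
        # Cache hit (or freshly filled): serve the local copy unchanged.
        return self._cache[path]


def fake_origin(path):
    """Stand-in for the content server half a world away."""
    return f"bytes of {path}".encode()


proxy = CachingProxy(fake_origin)
first = proxy.get("/video/segment1.ts")
second = proxy.get("/video/segment1.ts")
print("origin trips:", proxy.origin_trips)
```

Only the first viewer pays for the trip to the origin; everyone after that is served from the nearby copy.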
The joke in computer science for 30 years now has been that there are only two hard problems. One is naming things. The other is caching policies. If I’ve decided that I need some caching servers, then I have to determine how I’m actually allowed to cache. The more servers, the more locations, the lower the average latency. I don’t know exactly where in the world you are; you just have to be close to a given POP (point of presence). So I’d better have as many POPs as I possibly can, to maximize the probability that you’re going to be close to one.
So, the shorter the physical distance, the better the performance: the lower the latency, the higher the bandwidth I can actually push, and therefore the better the quality of experience of the whole thing.
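To put rough numbers on the distance argument, this sketch picks the closest POP for a viewer and estimates the best-case round trip from distance alone. The POP list and viewer location are invented, and the calculation assumes signals travel through fiber at about 200,000 km per second along a straight great-circle path; real routes are longer and add queuing and processing delay on top.

```python
import math

EARTH_RADIUS_KM = 6371.0
FIBER_KM_PER_SECOND = 200_000.0  # roughly two-thirds the speed of light

# Hypothetical POP locations as (latitude, longitude) in degrees.
POPS = {
    "london": (51.5, -0.1),
    "new_york": (40.7, -74.0),
    "tokyo": (35.7, 139.7),
}


def great_circle_km(a, b):
    """Haversine distance between two (lat, lon) points, in kilometers."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def min_round_trip_ms(distance_km):
    """Best-case round trip over fiber, ignoring every other delay."""
    return 2 * distance_km / FIBER_KM_PER_SECOND * 1000


def nearest_pop(viewer):
    return min(POPS, key=lambda name: great_circle_km(viewer, POPS[name]))


viewer_in_paris = (48.9, 2.4)
for name, location in POPS.items():
    km = great_circle_km(viewer_in_paris, location)
    print(f"{name:>9}: {km:7.0f} km, at least {min_round_trip_ms(km):5.1f} ms round trip")
print("serve from:", nearest_pop(viewer_in_paris))
```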
This article is Sponsored Content
Learn how StackPath is building an extensible platform at the cloud's edge to deliver on the past and future promise of edge services—including CDN and all its applications—and make the Internet safe.