How Can CDNs Hit Their SLAs?
Learn more about SLAs at the next Content Delivery Summit.
Read the complete transcript of this clip:
Steve Strong: How do you start hitting some of these numbers? What's the approach you can take? You can kind of ignore the "99% and below"--most people in this room probably aren't interested in those numbers. And frankly, they're easy. You've got so much time to respond to stuff that you can just have a single box sat in your cupboard doing the job.
Adrian Roe: I met an encoding vendor a year or so back whose name should definitely remain nameless, and I said to him, "What happens? What do you do when an encoder fails?" And he said, "Well, what we do is we send an engineer into the room and they take the SDI cable out of that one and they put it into the replacement and they turn it back on again." They called it "orchestration by human being." Kubernetes it's not.
Steve Strong: But it works, and if you can cope with that, it's a really good way to do it. At 99.9, your time is getting limited. Knowing what period it's measured over starts becoming very interesting. If you're measured per day, frankly it's still pretty easy though at 1.4 minutes. If your SLA is over the course of a month--and assuming you're not going wrong every single day--you've got 10 minutes to spare. So, 99.9, if you're very organized and you plan what you're going to do, you can probably still get away with a fairly simplistic approach to delivering that sort of number. That's not particularly hard. Once you get up to 99.99, it starts becoming out of the realms of human control. You can't have orchestration by human beings at that point and there's simply not time to respond.
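The downtime figures quoted here follow from simple arithmetic: the allowance is the measurement window multiplied by one minus the availability. The sketch below is illustrative only; the function name `downtime_budget_seconds` and the 30-day month are assumptions, not anything from the talk. Under these assumptions, 99.9% over a day gives about 1.4 minutes as quoted, while an allowance of roughly 10 minutes at 99.9% corresponds to a one-week window; a 30-day month would give about 43 minutes.

```python
def downtime_budget_seconds(availability_pct: float, window_seconds: float) -> float:
    """Seconds of permitted downtime in a window at a given availability percentage."""
    return window_seconds * (1.0 - availability_pct / 100.0)


DAY = 24 * 60 * 60       # seconds in one day
WEEK = 7 * DAY           # seconds in one week
MONTH_30 = 30 * DAY      # assumption: a 30-day month

if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        for label, window in (("day", DAY), ("week", WEEK), ("30-day month", MONTH_30)):
            budget = downtime_budget_seconds(pct, window)
            print(f"{pct}% over one {label}: {budget:.3f} s ({budget / 60:.2f} min)")
```

The same function shows why the measurement period matters so much: tightening the window from a month to a day shrinks the allowance by a factor of 30 at the same percentage.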
You've got to have multiple systems live the whole time. You've got to have things active, ready to take over when stuff goes wrong. At 99.99, you might get away with something like an n+1 model where you've got a bunch of live systems and a couple of hot spares that are sat there running. And if one of the live ones goes wrong, you re-commission one of the hot spares and you're up and running quickly enough. But you're certainly in a world where you're going to have to have some form of distributed system. You've got multiple computers, you've got multiple things ready to go. The ability to deploy on the fly, ability to spot the errors automatically and recover from them automatically.
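As a rough illustration of the n+1 idea described above, the following sketch keeps a map of jobs to live nodes plus a list of hot spares, and moves a job to a spare when its node fails. It is a toy model, not how any particular product works: the `Pool` class, the `handle_failure` method, and the node and channel names are all hypothetical, and real failure detection and job hand-over are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Pool:
    """Toy n+1 pool: jobs run on live nodes, hot spares wait idle but running."""
    live: Dict[str, str] = field(default_factory=dict)   # job name -> node name
    spares: List[str] = field(default_factory=list)      # idle hot spares

    def handle_failure(self, failed_node: str) -> List[Tuple[str, str]]:
        """Move every job on failed_node to a hot spare; return (job, new_node) pairs."""
        moved = []
        for job, node in list(self.live.items()):
            if node != failed_node:
                continue
            if not self.spares:
                raise RuntimeError("no hot spare available for " + job)
            replacement = self.spares.pop(0)   # take the first available spare
            self.live[job] = replacement       # the job now runs on the spare
            moved.append((job, replacement))
        return moved


if __name__ == "__main__":
    pool = Pool(
        live={"channel-1": "enc-a", "channel-2": "enc-b"},
        spares=["enc-spare-1", "enc-spare-2"],
    )
    print(pool.handle_failure("enc-a"))
    print(pool.live, pool.spares)
```

Note that the job keeps its name while the node underneath it changes, which is the point of the model: no job is tied to a specific machine.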
Adrian Roe: You get some interesting cultural challenges there. We had one customer, we were deploying an n+1 solution for them for that sort of level of SLA, and they said, "But when we're showing customers around our data center, how do we tell them which is their encoder?" And we said, "Well, you don't because there isn't a 'their encoder.'" Because as soon as there's a "their encoder," when that one dies, what do you do? It's going to have to move somewhere else and in order to have it move a running job--move from one server that's just died to another server and have that happen in the course of a fraction of a second or maybe a second or two--ain't no human being involved in that. At which point the whole notion of "this piece of infrastructure is delivering this part of my service" just doesn't make sense anymore. And, I kid you not, what we ended up having to do for them was we came up with this concept of "preferred encoder" that, if things were all normal and we had lots and lots of green lights everywhere, then this would be the BBC's encoder and you could show them 'round your data center and say, "It's that one." They had a label maker and everything.
Steve Strong: We suggested they just stick the labels on it. They actually really wanted it to be real. At 99.999, you've got to have multiple live systems. You've got to have A/B systems running the whole time. You've really got no choice. It's almost impossible. Again, depending on where your SLA is measured. But if you're measured down at the daily level, the reality is you've got 0.86 of a second to respond. And that's not just to respond; that's to detect and respond. So you've got to spot that something's gone wrong, fix it, and have the service back up and running before your 0.864 seconds is up. That's very little time to do anything, particularly on wide area networks and so on. You've got ping times measured in hundreds of milliseconds. You've only got 860 milliseconds to do it. Well, that's not long. So you really need A/B systems. You've got to have multiple systems live delivering the service the whole time.
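To make the "detect and respond" point concrete, here is a back-of-the-envelope sketch of how quickly an 864-millisecond daily budget is used up. The model is a deliberate simplification and every number in the example is an assumption chosen for illustration: detection is taken as a heartbeat interval multiplied by the number of missed heartbeats tolerated, followed by one network round trip and a switchover time. The function names are hypothetical.

```python
BUDGET_MS = 864.0  # 99.999% availability measured over one day (86,400 s * 0.00001)


def worst_case_outage_ms(heartbeat_interval_ms: float,
                         missed_before_failover: int,
                         rtt_ms: float,
                         switchover_ms: float) -> float:
    """Simplified outage estimate: time to detect + one round trip + time to switch."""
    detection_ms = heartbeat_interval_ms * missed_before_failover
    return detection_ms + rtt_ms + switchover_ms


def fits_budget(outage_ms: float, budget_ms: float = BUDGET_MS) -> bool:
    """True if the estimated outage fits inside the downtime budget."""
    return outage_ms <= budget_ms


if __name__ == "__main__":
    # Assumed figures: 100 ms heartbeats, 3 missed before acting,
    # a 200 ms wide-area round trip, and 500 ms to bring a spare into service.
    outage = worst_case_outage_ms(100, 3, 200, 500)
    print(outage, fits_budget(outage))
```

With these assumed figures a single failure already exceeds the daily budget, which is consistent with the argument for keeping A/B systems live the whole time so that there is no switchover to wait for.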
id3as Directors Steve Strong and Adrian Roe explain that high-availability content delivery isn't just about meeting SLAs for overall uptime; it's also about not having your streams go down at critical moments.