The Algorithm Series: The Math Behind the CDN
For the delivery portion of the overall streaming equation, CDNs use refined content caching and content replication, optimized network paths—including ingress, egress, and midgress data transport—and strategic server placement at the core (such as the origin server) or at the edge (often referred to as caching content at a point of presence). Underlying the key CDN components are a number of fundamental algorithms used to balance strategic core and edge architecture demands.
This article, the first in our new Algorithm Series, dives into the math behind the magic of streaming media delivery to highlight significant mathematical concepts—and even a few equations—that power the infrastructure to deliver live and on-demand streams on a global scale.
If you read my November/December 2019 Streams of Thought column, you'll remember that this article series came about from a discussion at IBC in Amsterdam among me, my fiancée, and one of the several Ph.D. streaming solutions architects. During that conversation, two of the three of us with math degrees—Michelle Fore and Yuriy Reznik, head of research and a fellow at Brightcove—began delving into the math at the intersection of media player performance and multi-codec manifest files.
The conversation led to an initial idea to cover four key areas—delivery performance (CDN), player performance (OTT or OVP bitrate and rendition ladder optimization), live event scaling (including authentication and other potential bottlenecks), and DRM—and was fleshed out in a subsequent interview I did with Reznik at Streaming Media West 2019 in Los Angeles.
Along the way, I was introduced to people in the industry with whom I've had either no or limited interaction over my past 2 decades in the space, but whose contributions to those four areas are essential to the road map of not just how we got to where we are today, but also to the future of media delivery.
What is CDN math anyway? The most common math for content delivery, at least from the paying customer's perspective, is billing algorithms. These aren't new by any means, having been around during the area of telco-based data networks (think dial-up, ISDN, or even fixed-line long-distance services).
Beyond these and other customer-centric algorithms like 95/5 (aka 95th percentile), one key area that all CDNs measure and optimize is caching server utilization, including ways to prevent overloading a single server's capacity, often referred to as "swamping" a server.
Server Utilization for Proper Caching(aka Consistent Hashing)
There's quite a bit of math that's gone into modern CDNs, but one of the fundamental algorithms for web acceleration and streaming can be found in a design patent filed way back in 1997. U.S. Patent No. 8458259 is titled Method and Apparatus for Distributing Requests Among a Plurality of Resources and is, itself, a continuation of several prior patents dating back to a March 13, 1998, patent application for what became U.S. Patent No. 6430618 in 2002. The original patent, granted to MIT, was based on research presented by members of its Laboratory for Computer Science at the 29th Annual ACM Symposium on Theory of Computing (STOC97) in May 1997. Their paper is titled "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web" and can be found on pages 654–663 of the symposium proceedings.
Several of the paper's authors—David Karger, Eric Lehman, Tom Leighton, Matthew Levine, Daniel Lewin, and Rina Panigrahy—are now famous in content delivery circles. Leighton, for instance, who has retained roles in both the MIT computer science lab (now called the Computer Science & Artificial Intelligence Laboratory) and in MIT's department of mathematics, went on to co-found Akamai the following year with the late Lewin.
So what was this "consistent hashing" idea that the paper's authors developed and presented? The money quotes from the patent spell it out:
Two causes of delay are heavy communication loading a part of a network and a large number of requests loading a particular server. When part of the network becomes congested with too much traffic, communication over that part of the network becomes unreliable and slow. Similarly, when too many requests are directed at a single server, the server becomes overloaded, or 'swamped.'
To address both the network congestion and the swamping of the original server, often called the origin server, the patent raises the concept of a caching server:
Caching can reduce network traffic if the cached copy is close, in the network topology sense, to the requester because fewer network links and resources are used to retrieve the information. Caching can also relieve an overloaded server because some of the requests that would normally be routed to the original site can be served by a cache server, and so the number of requests made of the original site is reduced.
But how many caching servers are needed and in how many locations? More importantly, from a CDN design standpoint, are there other issues that could still cause congestion, even if there are a whole lot of caching servers? It turns out the answer is that just throwing caching servers at the problem is not an effective solution. And that's what consistent hashing attempts to solve.
To understand the basis for consistent hashing, we first have to understand hashing. And to do that, we also need to understand modulo mathematics.
Modularity in Mathematics
Modulo is a mathematical operation to find what whole number (any natural numbers plus 0) remains after division takes place. If you remember high school mathematics, the test of whether any two numbers will have a remainder is called synthetic division or Horner's method.
Still don't remember how to do it? OK, here's how modulo works. For instance, 7 modulo 2 (often written in shorthand as 7 mod 2) has an answer that is either zero or some natural number (any positive number above zero). While we might typically write 7/2 as 3.5 when using the decimal to represent half of the divisor, for modulo mathematics, the answer would be 1 (essentially 2 to the power of three with 1 remaining).
Since modulo can have a result that equals whole numbers, there are scenarios in which the result of the modulus would be no remainder. For instance, if the formula were 6 mod 2, the answer would be 0.
The importance of modulo for hashing is that the remainders help determine which server a given piece of data might be assigned to and retrieved from. More on that in a moment.
Hashing It Out
The simplest definition of hashing is to chop or divide, but for our purposes, hashing is "a function that maps one piece of data—typically describing some kind of object, often of arbitrary size—to another piece of data, typically an integer." Put a slightly different way, according to the Wolfram MathWorld site, "A hash function (H) projects a value from a set with many (or even an infinite number of) members to a value from a set with a fixed number of (fewer) members." In other words, it's a way to notate an infinite number of values via a set, more-limited number of values. In practical terms, for content, this also allows variable-length content to be represented by fixed-length representations.
Let's use the Social Security number as a form of hashing variable data to a fixed-length number. If you remove the dashes, a Social Security number is a unique, fixed-length-value-of-9-digits natural number (a positive integer above zero). Ignoring the initial limitations of the Social Security numbering scheme (3-digit Area Number, 2-digit Group Number, 4-digit Serial Number) and assuming that the Social Security number starts at 100,000,000 to accurately fill all nine slots, the total number of possible fixed 9-digit values would be 899,999,999. While the number is a fixed length, the name attached to each Social Security number is a variable-length piece of content. It could be Mary J. Blige or Franklin Delano Roosevelt or even John Fitzgerald Kennedy.
In our example, the fact that the name itself uses both variable-length and multiple-character types ("varchar" in database terms) means that a global name search requires much more computational power than does the entirety of 899,999,999 integer permutations in our Social Security number example.
A final benefit of hashing is how it positively affects more efficient searches and storage in a modern relational database. Most databases use a key structure—a unique value in any single field in a database table—as the primary key on which to not just search content but also to join content from multiple tables (the relationship) into a single query of content across multiple database tables.
What we've described in our hashing example—a combination of a fixed-length value (the Social Security number) and a variable-length value (any given name)—is, at its most basic, a traditional database construct called a key-value store, in which the key (hash) is associated with the value (a string of content, such as any given name). In the case of a hash key that is unique, it can potentially act as the primary key in any given database table.
Databases also perform very well with indexed content, which is essentially a road map to a group of key-value stores. The index effectively reverse-engineers an array of content stored in a database, with the index being searched, rather than the entire database. For a grouping of hashes, the hash table acts as an index and significantly reduces search times for the string of content associated with a particular hash.
Problems With Hashing
The major potential problem with hashing is what's referred to as a collision, meaning two variable-length values share the same defining parameters (e.g., two people named John Reginald Smith Jr., both born in the same part of the U.S. in the same year) that may result in the same fixed-length value representing both (in our case the Social Security number).
A rudimentary hashing approach, one that generates a hashing collision due to the fact that two strings of content with the same parameters share the same hash key value, could result in inadvertent swamping. In practice, it could also result in one server holding too much content, with other servers in the cluster holding too little.
In addition, a hashing collision could add significant delay in a browser receiving content. In the worst-case scenario, content cannot be served at all or the wrong content could be sent from the wrong caching server to the wrong media player at the wrong time. In our Social Security example, it would almost be like the Social Security Administration sending a check in the mail to the wrong John Reginald Smith Jr., who then might be legally free to cash it.
One way to address potential collision errors in a CDN architecture is to add more servers, replicating the same hashing table and key-value store combinations to multiple servers. This works if the content itself has limited collision likelihood; however, if one server in a group of servers fails, the hash table would then be obsolete, since some of the content of the failed server would need to be remapped to all the other servers. The end result would be a massive hit on the origin server whenever any single caching server, even one in a cluster, fails.
Streaming is still only a fraction of total video watched, even post-COVID lockdown. What happens when it's the de facto method for delivering video content? Will the existing CDN approaches be enough to handle the next phase of streaming growth?
Multiple algorithms for encoding, transfer, and playback intersect in the end user's player. So how can you make the numbers work in your favor?
With a clear understanding of traffic, key performance indicators, and what you need to store on an origin server, a multi-CDN strategy can improve a better experience.
The event brings together all the players in the content delivery sector—from power and interconnect to quality of service and edge compute—in the only conference of its kind. If you're interested in speaking, read on.