What Massive Video Streaming Events Can Teach About Team and Infrastructure Resilience
The Super Bowl, the Grammys, March Madness, and other massive events have at least one thing in common: huge streaming viewership across a diverse device landscape. Seven million fans streamed Super Bowl LVII on television, computer, and mobile applications. Flawlessly delivering large-scale streaming events requires extensive preparation and coordination along with resilient infrastructure and real-time decision-making.
Although these events may seem larger-than-life, many of the foundational elements of success in video streaming are highly relevant in healthcare, e-commerce, finance, and other industries. Every networking team has its own “Super Bowl” to contend with — whether it’s Black Friday, tax season, the college application deadline, a viral TikTok video creating a surge in demand, or simply the everyday challenges of running a popular website.
Taking a closer look behind the scenes at how huge streaming events succeed can provide valuable lessons that any networking team can use to improve its resilience and better prepare for inevitable fluctuations in traffic patterns.
For instance, planning for major streaming events usually begins a year or more in advance, involving a “critical mass” of technical managers, SREs, and backend engineers. Each is trained with a well-defined set of responsibilities and a clear chain of command in the event of a crisis.
Overcommunication is also key, and clarifying questions are common. During the event, team members send hourly status reports and monitor social media for complaints about poor connectivity, escalating relevant posts to the right team members so they can act immediately.
Companies streaming large events are learning to put an ever-growing focus on resilient infrastructure, as failing to do so can result in significant viewer frustration and brand damage. One need only consider the backlash that followed the delayed Love is Blind live event on Netflix to understand the value of resilience as an investment.
These teams ensure there is no single point of failure by overprovisioning cloud capacity, using multiple CDNs, and running redundant DNS. They also stress test systems prior to the event to identify potential areas of weakness and refine their traffic steering policies. Yet no matter how much thought goes into the months-long preparation phase, the actual days leading up to (and including) the event may present something unusual. For example, peripheral domains for which no traffic is expected might see surprise traffic spikes due to a DDoS attack. Many teams rely on deep DNS analytics functionality to determine whether unexpected traffic patterns are malicious and, if so, how best to neutralize them. They also use observability tools to monitor traffic and network conditions in real time so that dynamic decisions can be made when problems arise.
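As a rough illustration of spotting surprise traffic on peripheral domains, the sketch below compares current query rates against a baseline and flags domains that exceed it by a wide margin. The function name, the multiplier, the floor value, and the sample domains are all hypothetical, and a flag only means "worth investigating"; deciding whether the traffic is malicious still takes deeper analysis.

```python
def flag_spikes(baseline_qps, current_qps, multiplier=5.0, floor_qps=10.0):
    """Return domains whose current query rate far exceeds their baseline.

    baseline_qps / current_qps: dicts mapping domain -> queries per second.
    multiplier and floor_qps are illustrative thresholds, not recommended values.
    """
    flagged = []
    for domain, qps in current_qps.items():
        base = baseline_qps.get(domain, 0.0)
        # The floor keeps near-idle domains from tripping on trivial noise.
        threshold = max(base * multiplier, floor_qps)
        if qps > threshold:
            flagged.append(domain)
    return sorted(flagged)


# Hypothetical numbers: the main domain is busy but within range,
# while a normally quiet peripheral domain is suddenly active.
baseline = {"www.example.com": 1000.0, "legacy.example.com": 1.0}
current = {"www.example.com": 3000.0, "legacy.example.com": 400.0}
suspicious = flag_spikes(baseline, current)
print(suspicious)
```

In this example only the peripheral domain is flagged, because its traffic is far out of proportion to what is normally seen there.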
Key Takeaways for Every Network Team to Be More Resilient
1. Plan proactively and extensively. Ensure you have a “critical mass” of individuals well-versed in your network operations whose roles complement each other. Each team member should have a clear understanding of where their role ends and where those of their peers begin. And everyone should be aware of strategies for both “business as usual” and alternatives if disruptions occur.
2. Overcommunicate. Enable a regular flow of information between teams and team members. What seems mundane to one team member may be essential for crisis resolution a few hours later. Foster an atmosphere of learning, where there is no shame in asking a basic question when the alternative is remaining confused. Look to social media and other user activity as a way to spot new problems. And recognize the importance of having dedicated emergency Slack channels (or other tools of your choice) that can immediately escalate concerns to the highest level.
3. Implement resilience best practices. Build redundancy into every layer of your stack. Because DNS is the foundational network layer, it is essential to deploy always-on, redundant DNS networks to prevent outages. Companies should also use a resilient anycast network to dynamically divert DNS requests to an available server when there are global connectivity issues.
Delivering content consistently to widespread audiences, especially across many geographic regions, also requires a multi-CDN strategy. Not all CDN providers are equal: one may excel at static asset hosting, another may be better suited for video streaming, and some may perform better in one country than in another. Using multiple CDNs helps ensure that users in all locations have a smooth experience. That said, if you have minimum commits with individual CDNs, you will need a way to track how your workload is distributed across them.
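One simple way to keep an eye on minimum commits is to compare what each CDN has delivered so far with a pro-rated share of its commit. The following is a minimal sketch under assumed inputs; the class, field names, and figures are hypothetical, and real contracts may measure commits differently (for example by peak bandwidth rather than volume).

```python
from dataclasses import dataclass


@dataclass
class CdnUsage:
    name: str
    min_commit_tb: float  # hypothetical contracted minimum for the billing period
    delivered_tb: float   # traffic delivered so far in the period


def commit_shortfalls(usages, period_fraction_elapsed):
    """Return {cdn_name: shortfall_tb} for CDNs behind a pro-rated commit pace."""
    if not 0 < period_fraction_elapsed <= 1:
        raise ValueError("period_fraction_elapsed must be in (0, 1]")
    behind = {}
    for usage in usages:
        expected = usage.min_commit_tb * period_fraction_elapsed
        gap = expected - usage.delivered_tb
        if gap > 0:
            behind[usage.name] = round(gap, 2)
    return behind


# Halfway through the period: cdn-a is ahead of pace, cdn-b is behind.
usages = [
    CdnUsage("cdn-a", min_commit_tb=100.0, delivered_tb=60.0),
    CdnUsage("cdn-b", min_commit_tb=200.0, delivered_tb=80.0),
]
shortfalls = commit_shortfalls(usages, period_fraction_elapsed=0.5)
print(shortfalls)
```

A report like this could feed back into steering weights, nudging more traffic toward a CDN that is behind its commit when performance allows.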
When 100% uptime is essential, you must eliminate all single points of failure and incorporate intelligent traffic steering capabilities within your networking infrastructure. This ensures that traffic is automatically redirected to a healthy endpoint when another is experiencing issues.
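The core idea of health-aware steering can be sketched in a few lines: filter out endpoints that are failing health checks, then choose among the remainder by weight. This is only an illustration; the endpoint names, weights, and the shape of the health map are assumptions, and production steering typically also considers latency, geography, and cost.

```python
import random


def pick_endpoint(endpoints, health, rng=random):
    """Choose one healthy endpoint, weighted by its configured share.

    endpoints: dict mapping endpoint name -> positive weight.
    health: dict mapping endpoint name -> True if the last health check passed.
    Endpoints missing from the health map are treated as unhealthy.
    """
    healthy = [
        (name, weight)
        for name, weight in endpoints.items()
        if health.get(name, False) and weight > 0
    ]
    if not healthy:
        raise RuntimeError("no healthy endpoints available")
    names, weights = zip(*healthy)
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical setup: the primary CDN is failing checks, so all traffic
# shifts to the secondary without any manual intervention.
endpoints = {"cdn-primary": 3, "cdn-secondary": 1}
health = {"cdn-primary": False, "cdn-secondary": True}
chosen = pick_endpoint(endpoints, health)
print(chosen)
```

Raising an error when nothing is healthy is a deliberate choice in this sketch; a real system might instead fall back to a last-known-good endpoint.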
4. Integrate deep analytics into your network. Train team members in using deep analytics tools to identify potential threats and determine steps to remediation. Ensure you monitor both the main domain and other associated domains, even if they are not likely to be targeted with attacks or substantial traffic.
Recognize that your analytical capabilities need to be thorough and fast enough to generate real-time curated data on any scale — from individual DNS interactions to network-level patterns and trends. Depending on the issue that needs troubleshooting, it may be necessary to observe something as granular as the activity at a single node in your network, or something as high-level as the most active domains, top error and response codes, and most common query types across the entire network.
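To make the two levels of granularity concrete, here is a small sketch that summarizes the same set of DNS query records either across the whole network or for a single node. The record fields (node, domain, qtype, rcode) and sample values are hypothetical; actual log formats vary by DNS provider and tooling.

```python
from collections import Counter


def summarize_queries(records, top_n=3):
    """Network-level view: most active domains, response codes, and query types."""
    return {
        "top_domains": Counter(r["domain"] for r in records).most_common(top_n),
        "top_rcodes": Counter(r["rcode"] for r in records).most_common(top_n),
        "top_qtypes": Counter(r["qtype"] for r in records).most_common(top_n),
    }


def summarize_node(records, node, top_n=3):
    """Granular view: the same summary restricted to a single node."""
    return summarize_queries([r for r in records if r["node"] == node], top_n)


# Hypothetical query log entries.
records = [
    {"node": "nyc1", "domain": "example.com", "qtype": "A", "rcode": "NOERROR"},
    {"node": "nyc1", "domain": "example.com", "qtype": "A", "rcode": "NOERROR"},
    {"node": "nyc1", "domain": "example.com", "qtype": "AAAA", "rcode": "NOERROR"},
    {"node": "lon1", "domain": "old.example.com", "qtype": "A", "rcode": "NXDOMAIN"},
    {"node": "lon1", "domain": "old.example.com", "qtype": "A", "rcode": "NXDOMAIN"},
]
network_view = summarize_queries(records)
node_view = summarize_node(records, "lon1")
print(network_view["top_domains"][0], node_view["top_rcodes"][0])
```

Here the network-wide view looks healthy, while the single-node view shows a node answering only NXDOMAIN, which is the kind of detail an aggregate can hide.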
Whether bracing for day-to-day traffic spikes or once-a-year chaos, keeping resilience in mind is critical. A few best practices can go a long way toward strengthening both team and infrastructure readiness. Strong team dynamics, extensive communication, resilient infrastructure, and deep analytics capacities are all essential to delivering experiences that meet rigorous user expectations.
[Editor's note: This is a contributed article from NS1. Streaming Media accepts vendor bylines based solely on their value to our readers.]