Creating NASDAQ’s Cloud-Based Virtual Workflow
Many people view content delivery networks (CDNs) as SaaS clouds -- and thus as some of the first commercially available clouds. You are not, for example, buying time on a computer to run your own media servers. Instead, you are buying just the software service of (for example) Real-Time Messaging Protocol (RTMP) Flash streaming.
At the time, only three operators had significant IaaS infrastructure available for global services, and we really struggled with another well-known provider. While the company touted its cloud proposition, the reality was that its offering was more of a virtual hosting environment. Its control API was limited -- certainly compared to Amazon EC2’s -- and while the pricing per hour of compute time was competitive, on many occasions the servers we ordered through its system took not seconds or minutes to provision and become available, but hours and sometimes even days. To be fair, I would imagine the other vendor has addressed this issue by now -- we did provide extensive debug information for its staffers to work against at the time -- but the lack of confidence was already instilled in our client’s mind, and we were left with only one option: to build Amazon-Amazon resilience. Essentially, this means that we always launch the workflow in two totally separate regions and availability zones and are ready to deploy a third workflow in a third region within a few seconds.
To summarize and answer the question directly, the key reason we use Amazon EC2 is that we had no real alternative at the time. We are still open to vendor diversification, and, to be fair, we will review other vendors during the next 12 months. However, the reality is that for IaaS, Amazon EC2 is an extraordinary platform -- way ahead of the pack as far as public IaaS goes -- and has proven to be extremely reliable and cost-effective, as well as truly global.
A still from an animated model representing actual activity in the cloud shows Amazon East (red) and Amazon Ireland (green) scaling up to meet demand during peak usage. The blue and yellow represent the 24/7 central management systems. Video of this visualization can be seen by clicking on “Data” at id3as.co.uk.
This data model shows the 24/7 central management systems during quiet times; the Amazon nodes have been spun down to save money.
What Sort of Scaling Issues Did You Face?
The NASDAQ OMX Corporate Solutions platform has extremely volatile usage patterns. At the end of the financial year, we may see thousands of companies reporting at pretty much the same time; this requires, along with several management and reporting servers, at least double that number of encoders to deliver. This level of spikiness is an outlier even for a service the size of Amazon EC2, so we had to negotiate specific permission to place such large demands on its infrastructure on short notice.
One crucial capability was being able to move encoders from one availability zone to another if the initial target zone didn’t have capacity. Our technology completely abstracts NASDAQ OMX Corporate Solutions’ working processes from all that underlying complexity.
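The zone-fallback idea can be sketched in a few lines. This is a minimal illustration only, not the id3as implementation: the `launch` callable, the `InsufficientCapacity` exception, and the zone names are all hypothetical stand-ins for whatever provisioning call a real system would make.

```python
from typing import Callable, List, Sequence, Tuple


class InsufficientCapacity(Exception):
    """Raised by the (hypothetical) launcher when a zone cannot supply an instance."""


def launch_with_fallback(
    zones: Sequence[str],
    launch: Callable[[str], str],
) -> Tuple[str, str]:
    """Try each candidate zone in order; return (zone, instance_id) for the first success."""
    failures: List[Tuple[str, str]] = []
    for zone in zones:
        try:
            return zone, launch(zone)
        except InsufficientCapacity as exc:
            # Remember why this zone was skipped, then move on to the next one.
            failures.append((zone, str(exc)))
    raise RuntimeError(f"no capacity in any candidate zone: {failures}")


def fake_launch(zone: str) -> str:
    """Stand-in launcher for illustration: pretends the first zone is full."""
    if zone == "us-east-1a":
        raise InsufficientCapacity("zone full")
    return f"encoder-in-{zone}"


zone, instance = launch_with_fallback(
    ["us-east-1a", "us-east-1b", "eu-west-1a"], fake_launch
)
print(zone, instance)
```

The caller only ever asks for "an encoder"; which zone it lands in is decided inside the helper, which is the sense in which the operator's workflow is insulated from capacity problems.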
How About SLA?
Amazon EC2 offers at least 99.95% availability (aws.amazon.com/ec2-sla). This translates to up to 4.38 hours annually during which the entire service may be unavailable. Our application always runs in at least two regions. Broadly speaking, and assuming the regions fail independently of each other, this means we end up with an overall service-level agreement (SLA) for our application of 100% - (0.05% x 0.05%) = 99.999975%. The key to maintaining this availability is the autonomy of the different regions and the applications. The chances of something going wrong on a server in a public cloud data center are hardly different from the odds of something going wrong on a machine you own and host in your own location. In the case of an IaaS public cloud, however, you have instant access to many thousands of other resources to use in its place, and you can -- and should -- be using multiple systems for redundancy all the time.
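The availability arithmetic is easy to check. The sketch below assumes that regions fail independently of one another (a simplification) and uses helper names invented for this example. Under that assumption, two regions at 99.95% each combine to 99.999975%.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def combined_availability(per_region: float, regions: int) -> float:
    """Availability of a service that is up whenever at least one region is up.

    Assumes region failures are independent, so the chance that every
    region is down at once is the product of the individual chances.
    """
    all_down = (1.0 - per_region) ** regions
    return 1.0 - all_down


def annual_downtime_hours(availability: float) -> float:
    """Expected hours per year for which the service may be unavailable."""
    return (1.0 - availability) * HOURS_PER_YEAR


single = 0.9995  # the 99.95% figure quoted for a single region
dual = combined_availability(single, regions=2)

print(f"one region:  {annual_downtime_hours(single):.2f} hours/year")
print(f"two regions: {dual * 100:.6f}% available")
print(f"two regions: {annual_downtime_hours(dual) * 3600:.1f} seconds/year")
```

In theory, that is a matter of seconds per year rather than hours. In practice, correlated failures and the application's own failover behavior mean the real figure depends on how autonomous the regions actually are.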
I have written before about people who claim Amazon EC2 is not reliable after the famous Reddit and Netflix outages. Amazon EC2 is usually operating well within its SLA; the issue was that Reddit and Netflix did not code their applications well to tolerate outages or failures. In contrast, the platform we delivered to NASDAQ OMX Corporate Solutions is automatically operating hundreds of servers in multiple regions of Amazon EC2, and we only knew that the previously mentioned outages had had any effect on our delivery by inspecting our logs. Our applications simply fail over between machines (within a single frame of video or audio), and the downstream origination, the upstream CDNs, and the clients would have been unaware.
What About Signal Acquisition?
In simple terms, signal acquisition into Amazon EC2 has to happen over IP. We use both videoconferencing and broadcast MPEG transport stream video sources to contribute signals into the infrastructure directly from events. Likewise, we use voice over IP (VoIP) for audio-only signal acquisition. The Amazon Web Services data centers are all extremely well-connected, so contribution issues have usually been down to remote connectivity at the origin (on-site at the event). We have, at times, experienced network splits within the Amazon infrastructure, which underlines the importance of being able to automatically select alternative zones for encoders on-the-fly.
What About OS Choices and Stacks?
The initial implementation was built using Windows, as it made use of several Windows-only third-party tools, such as the Windows Media and Flash Media encoders. Once it was live, we found that almost all of the operational challenges we faced arose from these third-party tools. Over time, id3as developed its own end-to-end suite of technologies (id3as.media) capable of replacing all these third-party tools (apart from codecs) and doing so in a platform-agnostic way. This has allowed us to greatly improve platform stability and significantly reduce IaaS costs by moving from Windows to Linux, which is considerably cheaper by the hour.
How Is the System Controlled?
Agents in NASDAQ OMX Corporate Solutions operations centers based in Manila, Philippines, and Leipzig, Germany, supervise the entire system, although there are disaster recovery capabilities across the world, since the management layer is entirely web-based. Schedules for events are delivered by the NASDAQ OMX Corporate Solutions management systems into a resilient and distributed (cloud-based) database. These management systems are in multiple availability zones, and they receive instructions about when to fire up encoders for various events. Notifications are presented to the supervisors in the web GUI.
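As a rough illustration of schedule-driven provisioning, the sketch below picks out the events whose encoders should be started now. It is not the id3as.media scheduler (which the article says is written in Erlang). The `ScheduledEvent` type, the 15-minute lead time, and the event names are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Set


@dataclass(frozen=True)
class ScheduledEvent:
    """Hypothetical schedule record: one live event with a start and end time."""
    event_id: str
    starts_at: datetime
    ends_at: datetime


def encoders_due(
    events: Iterable[ScheduledEvent],
    now: datetime,
    lead_time: timedelta,
    running: Set[str],
) -> List[ScheduledEvent]:
    """Return events that need encoders started now, earliest start first.

    An event is due once we are within `lead_time` of its start, provided
    it has not already finished and its encoders are not already running.
    """
    due = [
        e for e in events
        if e.event_id not in running
        and e.starts_at - lead_time <= now < e.ends_at
    ]
    return sorted(due, key=lambda e: e.starts_at)


now = datetime(2013, 8, 1, 13, 50)
schedule = [
    ScheduledEvent("q2-earnings-a", datetime(2013, 8, 1, 14, 0), datetime(2013, 8, 1, 15, 0)),
    ScheduledEvent("q2-earnings-b", datetime(2013, 8, 1, 16, 0), datetime(2013, 8, 1, 17, 0)),
    ScheduledEvent("agm-c", datetime(2013, 8, 1, 13, 55), datetime(2013, 8, 1, 14, 30)),
]

for event in encoders_due(schedule, now, timedelta(minutes=15), running=set()):
    print("start encoders for", event.event_id)
```

Passing the set of already-running events keeps the check idempotent, so a management node in another availability zone could run the same poll without launching duplicates, provided both share the same view of what is running.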
This same web GUI allows the operator full control over the events, including cutting and editing on-the-fly (using DVR-like techniques to mark up the edit decision list) to ultimately produce on-demand files within moments of the live event finishing. The GUI also alerts the operator to any signal confidence failures and enables a single operator to supervise multiple events at the same time, greatly increasing the productivity of the call center team. The id3as.media management and control subsystem is all written in Erlang, a language designed specifically for building very large-scale, highly available, and reliable distributed systems.
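An edit decision list of this kind can be thought of as a set of "keep" ranges marked against the live recording. The sketch below is purely illustrative and assumes a representation the article does not specify: each mark is an (in, out) pair in seconds from the start of the recording. It tidies a set of operator marks into a clean list that could then drive creation of the on-demand file.

```python
from typing import List, Tuple

# Hypothetical representation: (in_point, out_point) in seconds from recording start.
Segment = Tuple[float, float]


def normalise_edl(marks: List[Segment]) -> List[Segment]:
    """Sort the keep-segments and merge any that overlap or touch."""
    merged: List[Segment] = []
    for start, end in sorted(marks):
        if end <= start:
            continue  # ignore empty or reversed marks
        if merged and start <= merged[-1][1]:
            # Overlaps (or abuts) the previous segment: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_duration(edl: List[Segment]) -> float:
    """Running time of the on-demand file, in seconds."""
    return sum(end - start for start, end in edl)


# Marks made out of order during the live event, with one overlap.
edl = normalise_edl([(120.0, 900.0), (0.0, 30.0), (850.0, 1500.0)])
print(edl)                  # [(0.0, 30.0), (120.0, 1500.0)]
print(total_duration(edl))  # 1410.0
```

Because the list is kept normalised as the operator works, the on-demand cut is already defined when the event ends, which is one plausible way to have files ready within moments of the live event finishing.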
How Does It Report?
As the id3as.media software replaced all the third-party technologies in the workflow, it provided an opportunity to consolidate reporting into a clean, clear, and dynamic reporting model using graph-like output. This means that id3as can drill into any issue in the workflow in a graphical way, with an almost unlimited amount of detail available. This ensures that stand-alone, real-time monitoring is possible out of the box; it’s also tightly integrated into NASDAQ OMX Corporate Solutions’ own management and reporting systems.
To summarize -- and I have said it before -- the IT craze for all things cloud would, at first, seem to be all about technology, but it’s not: It is really a broad term for dynamic economic models that are underpinned by a variety of technologies.
Yes, you can generally (technically) do in the cloud what you would do on private dedicated equipment, but the key advantage should be cost and flexibility. As with the example in this article, the move should be driven by the real value proposition to the customer and not clouded with trendy buzzwords.
This article appears in the August/September 2013 issue of Streaming Media magazine as "NASDAQ's Cloud-Based Virtual Workflow."