Online Video Jumps on the Big Data Bandwagon
It's the era of big data, and online video publishers are embracing this trend in a big way. Learn what deep analytics have to offer the streaming video industry.
Every year, a few terms grow so nebulous, and reach such a level of hype so rapidly, that the average person could be forgiven for simultaneously thinking, “Why don’t I know what this means?” and “How can I keep from having to explain what it means?”
The breakout term of 2013 is Big Data, which comes at us streaming types from the far reaches of the often esoteric world of databases. How do we know it’s the breakout term? Analysis passed on by a colleague of mine who works for a Big Data analysis start-up shows that almost as many articles were written on Big Data in the first half of 2013 as were written during all of 2012.
"For the whole of 2012, the number of articles in technology publications written about big data was just under 19,000 articles,” says Susan Puccinelli, director of communications at Datameer. “In the first half of 2013, there have already been almost 14,000 articles, and the numbers are growing each month.”
In other words, more than 500 articles are being written every week on this one topic alone.
What Is Big Data?
A little later, I’ll cover the parts of Big Data that are relevant to streaming, but, first, it helps to set a definition since the topic itself can be hard to pin down. To do that, let’s first look at it from a pure database standpoint.
The term Big Data, when applied to typical database scenarios, boils down to three major areas: aggregation of a number of disparate databases, inclusion of schema-less data, and a set of analytical tools to derive meaning from the data.
The king of the database world, in terms of overall deployments, is the relational database. This type of database is found in almost every software and hardware product, even down to the level of operating systems, as it is very good at managing data that fits within a structure or schema.
There’s a problem with most relational databases, one that decades of relational database management system (RDBMS) tooling has masked: Not all data fits within a specific schema. In addition, some data may be better used in two locations or tables in a traditional relational schema, such as a table of houses on a given street alongside a table of automobile ownership for cars parked on that same street.
To handle this need to spread information across tables, relational databases use keys: A primary key uniquely identifies each row in one table, and a matching key in a second table links the two together. But cars parked on a street are fluid data that changes frequently, and they aren’t as easy to fit into a meaningful schema as a fixed asset such as a house.
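The houses-and-cars example can be sketched as two tables joined on a key. Here is a minimal illustration using Python’s built-in sqlite3 module; the table and column names are invented for the example, not drawn from any real system:

```python
import sqlite3

# In-memory database; schema invented to illustrate the houses/cars example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE houses (
        house_id INTEGER PRIMARY KEY,
        street   TEXT,
        number   INTEGER
    );
    CREATE TABLE cars (
        car_id   INTEGER PRIMARY KEY,
        model    TEXT,
        house_id INTEGER REFERENCES houses(house_id)  -- the linking key
    );
""")
conn.execute("INSERT INTO houses VALUES (1, 'Elm St', 42)")
conn.execute("INSERT INTO cars VALUES (1, 'Civic', 1)")

# The shared key lets us ask which car is parked outside which house.
row = conn.execute("""
    SELECT houses.street, houses.number, cars.model
    FROM cars JOIN houses ON cars.house_id = houses.house_id
""").fetchone()
print(row)  # ('Elm St', 42, 'Civic')
```

The join works only because both tables share the `house_id` key, which is exactly the structure that fluid data, such as cars that come and go, resists.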
To combat this issue, a number of “dirty data” options arose in the database world, from simple XML markup documents to more powerful document-based databases that use on-the-fly indexing to map out similarities and reduce them down to dynamic “table” structures. Many of these are known as map-reduce databases, in which the likely queries are presupposed and a basic schema is formed around those queries for the otherwise schema-less data.
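The map-reduce idea can be shown in a few lines: a map step emits key/value pairs from schema-less records, and a reduce step aggregates per key. The following is a toy Python sketch, with invented field names, not the API of any particular map-reduce database:

```python
from collections import defaultdict

# Schema-less records: not every document carries the same fields.
records = [
    {"street": "Elm St", "model": "Civic"},
    {"street": "Elm St", "model": "Prius", "color": "red"},
    {"street": "Oak Ave", "model": "Civic"},
]

def map_phase(record):
    # Presuppose the query "how many cars per street" and
    # emit a (street, 1) pair for any record that has a street.
    if "street" in record:
        yield record["street"], 1

def reduce_phase(pairs):
    # Aggregate the emitted pairs per key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

pairs = [pair for record in records for pair in map_phase(record)]
print(reduce_phase(pairs))  # {'Elm St': 2, 'Oak Ave': 1}
```

Note that the “schema” here exists only in the map function; records with extra or missing fields flow through untouched, which is the appeal for dirty data.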
A third category of database, one that’s especially popular in social media networks, is the graph database. In this instance, the relationships themselves are the key element, and the graph offers a newer way to do complex searches.
In our previously mentioned example, the proximity of a car’s parked position relative to a given house might allow us to build a map-reduce index, but a graph database would let us find friends of a particular homeowner who own the same model of car that’s parked on the street, and also tell us whether each of those friends lives in town or in another country. Facebook’s rollout of Graph Search will allow its users to do exactly these kinds of searches on friends who meet particular demographic or geographic criteria.
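The query described above, friends of a homeowner who own the same car model, becomes a short traversal once the data is stored as a graph. Here is a toy adjacency-list version in Python; the people, cars, and friendships are entirely invented:

```python
# Toy graph: nodes carry properties, edges live in adjacency lists.
people = {
    "alice": {"car": "Civic", "city": "Portland"},
    "bob":   {"car": "Civic", "city": "Lisbon"},
    "carol": {"car": "Prius", "city": "Portland"},
}
friends = {"alice": ["bob", "carol"], "bob": ["alice"], "carol": ["alice"]}

def friends_with_same_car(owner):
    """Friends of `owner` who drive the same model, with their city."""
    model = people[owner]["car"]
    return [(friend, people[friend]["city"])
            for friend in friends[owner]
            if people[friend]["car"] == model]

print(friends_with_same_car("alice"))  # [('bob', 'Lisbon')]
```

A relational schema could answer the same question, but only through a chain of joins; in a graph store, the relationship hop is the primitive operation.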
A graph showing the relations among pieces of data in a relational database.
In many ways, Big Data in general -- and graph databases in particular -- rely on the use of tags to create relationships between objects, persons, and additional disparate bits of data. To get to the point where any person or object is tagged enough to have value in a search graph, a good deal of indexing needs to be done.
How Does This Fit Into Streaming?
So if Big Data is about combining databases and running the proper analytics to find the answers behind the questions, how does this all fit into streaming?
You’ll get widely divergent answers to this question, but the map-reduce index that I derive from Big Data in streaming comes down to three things: content management, mission-critical delivery, and indexing and metadata usability.
CONTENT MANAGEMENT AND STORAGE
Without a doubt, video’s share of total data traffic has grown by leaps and bounds. Some studies suggest that on-demand internet video alone accounts for almost one-third of all internet traffic during primetime hours, thanks in no small part to Netflix. Some traffic estimates project that by 2014, the majority of all data delivered across the internet will be video content.
This provides an opportunity for content delivery network (CDN) providers, some of which have risen to the challenge. But the Big Data issue for these CDNs is less about video content management and more about management of all the content surrounding the video.
Interestingly, for all the content management issues that face CDNs, the issue of streaming content management is actually becoming a fairly simple one: keeping track of the multiple versions of an on-demand video file that are needed for adaptive bitrate (ABR) delivery. Whether the ABR delivery is via Apple’s HTTP Live Streaming (HLS), Microsoft’s Smooth Streaming, or the emerging Dynamic Adaptive Streaming over HTTP (MPEG-DASH), progress has been made on all fronts toward on-the-fly segmentation for these various ABR technologies.
In 2007, 50% of all internet traffic came from several thousand sites, but by 2009, 50% came from 150 sites (left). Today (right), 50% of all internet traffic comes from 35 sites or services. (Graphs courtesy of DeepField)
This means that we no longer have to keep track of thousands of 2-second segments in permanent databases, something that, a few years ago, Netflix projected would exceed 10 billion assets if it were required to store premium content in pre-segmented form.
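The scale of the pre-segmentation problem is easy to see with back-of-the-envelope arithmetic. The catalog figures below are illustrative assumptions chosen for the sketch, not Netflix’s actual numbers:

```python
# Illustrative assumptions (not actual Netflix figures).
titles = 10_000          # on-demand catalog size
avg_minutes = 90         # average running time per title
renditions = 8           # ABR bitrate/resolution variants per title
segment_seconds = 2      # segment duration

segments_per_title = (avg_minutes * 60 // segment_seconds) * renditions
total_segments = titles * segments_per_title
print(f"{total_segments:,} stored segments")   # 216,000,000 stored segments

# With on-the-fly segmentation, only whole renditions need storing:
total_files = titles * renditions
print(f"{total_files:,} stored files")         # 80,000 stored files
```

Even this modest hypothetical catalog produces hundreds of millions of tracked assets when pre-segmented, which is why on-the-fly segmentation matters so much for content management.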
The second area that’s receiving attention is delivery of content, especially as the number of “hyper-giant” sites grows.
A presentation by Craig Labovitz, co-founder and CEO of DeepField, at the 2013 Content Delivery Summit homed in on the growth issues facing CDNs when it comes to content management.
“CDN traffic now represents more than half of all consumer traffic in the United States,” says Labovitz. “That’s a very dramatic change from our last published report in 2009.”
Labovitz notes that the consolidation of traffic to a few key CDNs has been an ongoing trend, with 50% of traffic in 2007 coming from several thousand sites. By 2009, the number of sites required to reach half of the data consumed on the North American internet was down to several hundred, and the new report’s initial data suggests the number of sites required is now less than 40 CDN or Top 10 sites.
“[W]e’re increasingly moving to a very flat, dense, highly interconnected network,” Labovitz notes in a previous session. “[M]ost of the traffic isn’t flowing up along [a] tree to reach the Tier 1s and back down. Most of the traffic today is interchange between what we’ve been calling the hyper-giants.”