October 24, 2018
By Jun Heider, Director of Technology, RealEyes Media

How to Choose a Video AI Platform and Evaluate its Results

Why is artificial intelligence (AI) a buzzword right now? The more video content we create, the larger our media libraries get. We only have a limited number of people on staff to ingest that media, process it, and add metadata tags to it.

Our organizations need help making this content more searchable so we can get more ROI from it. Let’s say you work for a surveillance firm, and you have 24/7 feeds running on 10 cameras or 100 cameras. You may not be able to get enough human help to search that media, but you could get AI help.

On the audio side, there’s great demand for transcription, and there are closed-caption mandates and translation needs as well. We don’t all have translators on staff, but if we have AI translation in our system, even an imperfect translation will help a lot.

The Big Four

There are four key players in video AI: Amazon Web Services (AWS), Google Cloud, IBM Watson Media, and Microsoft Azure’s Video Indexer. Each has numerous services.

The four core services that AWS has for AI are Rekognition, for computer vision and adding deep learning-based visual search and image classification; Comprehend, which tries to understand sentiment and what’s going on within the language; Transcribe, which converts audio to text files; and Translate, which translates text between languages.

The Google Cloud platform features Video Intelligence, Natural Language API, Speech-to-Text, and the Translation API.

IBM Watson Media is AI for media workflow and video processing, but the big difference between IBM’s platform and the others is that it requires breaking the video into individual segments. The three products created under the banner of Watson Media are Video Recommendations, Captioning (speech-to-text), and Video Enrichment (computer vision).

The fourth is Microsoft Azure’s Video Indexer. To access Microsoft’s cognitive and indexing features, go to videoindexer.ai, where you can log in to a cool user interface, upload videos, and start getting AI-generated metadata right away for free.

This article will look at these platforms and services in terms of four main features: face/object/scene recognition, sentiment analysis, surveillance, and transcription/translation. When choosing a platform, keep in mind that it’s all about the use case. All four platforms return JSON (JavaScript Object Notation) data, so there’s nothing stopping you from using all of them if you want to. You can see a demo of a custom app that my company, RealEyes Media, built on top of AWS and Google Cloud, in a clip from a presentation I did at Streaming Media East 2018.

How to Get Started

If you don’t have developers in-house at your organization, a lot of these vendors offer demo versions of their platforms that you can play around with (Figure 1). The free version of Microsoft Azure Video Indexer, for example, is quite robust. You can actually take widgets and embed them into your own systems. Of the four platforms we discuss in this article, Video Indexer is the most complete product you can access without having to go through contract negotiations.

Figure 1. Demo versions of leading video AI platforms, clockwise from left: Google Cloud, IBM Watson Media, Microsoft Azure Video Indexer, AWS

Google Cloud and AWS offer demos on their pages, but they don’t really expose the services to the degree that you can embed them into your own system. In this context, “embed” essentially means writing some code to get an iFrame to load on a page inside of your site. The online demo versions of Google Cloud and AWS aren’t really conducive to that, but you can upload your own videos, wait for the apps to process them, and see how well the service will work with your videos.

IBM sells a couple of Watson products, such as Video Enrichment (top right in Figure 1). They are API-first. Theoretically, you can buy this paid product and communicate between your system and IBM’s to pull metadata into your system.

Generating metadata is one of the big use cases for AI. Let’s say you’ve archived multiple seasons of a TV show, and you have a media asset manager (MAM) that houses all this metadata about all of those stored assets. Wouldn’t it be great to have the computer generate some of that metadata for you, so that in addition to the title, season, and episode name, you’d have things like “car crash,” “leopard,” and “little baby girl” (or whatever terms apply to your content) that you can actually use to search and discover your content much more easily?

The first thing your developers are going to want to know is, “What’s the learning curve? What software development kit (SDK) is available for each of these platforms?”

AWS has a client SDK, which basically acts as an accelerator so your developers don’t have to write all of the code from scratch. They can build their code on top of the boilerplate AWS provides, with SDKs for Android, iOS, Java, .NET, etc. My team of developers has played around with this SDK, and they have found it comprehensive.

One important step in choosing a video AI platform is to let your developers test whatever demo version is available. And listen to their feedback, because your team members are going to be more productive if they don’t have to spend as much time trying to grok how to work with the software.

Like AWS, the Google Cloud platform has a good number of client SDKs: .NET, Node.js, Go, Java, etc. But from my developers’ perspectives, Google’s API documentation is quite verbose, and it takes a lot of clicks to get to what you need—much more so than the AWS developer documentation, in their opinion. For instance, if you just want to know how to send a video up to the Google service, rather than seeing the signature of the payload that you want to upload, you have to wade through three or four paragraphs on every single data point within that signature.

IBM Watson Media's API reference lives behind a paywall. When you want to use it, you tell the IBM Watson Media people, “This is my use case. I want to play around with your system, because I think I’m going to buy it.” Then you sign some contracts. In my case, as an IBM partner, they were kind enough to share the API documentation with me. It looked pretty straightforward.

Azure has client SDKs, but to interact with Video Indexer you will just use its API without an SDK. If you need the extra lift that an SDK can provide, then Video Indexer might not be the right solution for you. Amazon and Google have SDKs for things like Java, Android, iOS, Ruby, and so on, as you can see in Figure 2, so your developers might be able to build something faster on those platforms. That said, the Video Indexer documentation is outstanding. It’s well laid out, and you can inline-test it as long as you have an active account.

Figure 2. API learning curve/SDK rundown for AWS, Google Cloud, IBM Cloud/Watson Media, and Microsoft Azure Video Indexer

Testing the Platforms

I’ve tested lots of different kinds of footage, but because so much data comes back, I’ve narrowed the focus in my testing to emphasize specific use cases. When it comes to translation and transcription, video surveillance, or metadata and objects, which platforms work best in those cases?

One thing to keep in mind with AI is that AI is only as smart as the data you’ve trained it on. In the case of IBM Watson, if you go to the website and look at some of the case studies, you’ll see that the company did a project for the US Open tennis tournament, which was instrumental in training its AI. If you’re in the sports industry, IBM Watson might have more accurate results for you when it comes to object detection or seeing when a ball is being hit across the court. On the other hand, if you do news footage, Google might serve you better, because users live stream their news over YouTube.

It’s important to consider these different kinds of use cases when assessing how well a particular platform will work for you, because ultimately, the data that’s being processed by all these machine-learning solutions is what’s training the AI to be smarter. It’s similar to any process of accumulating knowledge. I come from a developer background, but I’m in the streaming media industry. If you ask me about which camera is the best camera, I won’t have a good answer, because I don’t have that data in my background.

That’s a way to think about AI: You’re going to be sending it specific data, and it’s going to learn from that data. The type of data that it learns from is going to make it more accurate in one way or another.

Interpreting the Data

Figure 3 shows an example of the kind of data you get back with these systems. Generally, AI solutions work like this: They have an API that allows your developers to hit an endpoint and pass some data in. It’s similar to you going into a browser, typing “google.com,” searching for “ducks,” and hitting Enter. A developer uses an API to programmatically hit an endpoint like google.com. It might be something like “myaiservice.com/v2/api/visionservices/detectObjects,” and you would pass it a data payload that the API expects.
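Here is a minimal sketch, in Python, of what “hitting an endpoint” can look like. The URL is the made-up one from the paragraph above. The `videoUrl` field and the bearer-token header are assumptions for illustration only and do not correspond to any particular vendor’s API. The request is built but never sent.

```python
import json
import urllib.request

# Hypothetical endpoint taken from the made-up example in the text; not a real service.
ENDPOINT = "https://myaiservice.com/v2/api/visionservices/detectObjects"


def build_detect_request(video_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request asking the service to process a video."""
    payload = {"videoUrl": video_url}  # hypothetical field name
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


request = build_detect_request("https://example.com/media/trailer.mp4", "MY_API_KEY")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With a real platform, the vendor’s SDK or API reference would dictate the actual URL, payload fields, and authentication.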

Figure 3. Sample data from a processed video payload on an AI platform. Note that, since this screenshot was taken in May 2018, Microsoft Azure Video Indexer has started providing confidence values in its response JSON.

Again, it’s like when you go to Google and type in a search string and then hit Enter. A developer will send up a payload that says, “My video is here. Please process it.” Then you wait for the processing to happen, because it’s not instantaneous. Once the processing is done, you’ll have data that you can pull down, comparable to search results in Google.
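The submit-then-wait pattern described above might be sketched like this in Python. The `state` and `results` field names are hypothetical, and the status call is a stand-in function rather than a real network request, so treat this as an illustration of the polling idea and not any vendor’s actual workflow.

```python
import time
from typing import Callable, Dict


def wait_for_results(
    fetch_status: Callable[[], Dict],
    poll_seconds: float = 5.0,
    max_attempts: int = 60,
) -> Dict:
    """Call fetch_status until the job reports it is done, then return its results."""
    for _ in range(max_attempts):
        job = fetch_status()
        if job["state"] == "done":  # hypothetical status field and value
            return job["results"]
        if job["state"] == "failed":
            raise RuntimeError("video processing failed")
        time.sleep(poll_seconds)  # processing is not instantaneous, so wait and retry
    raise TimeoutError("video processing did not finish in time")


# Stand-in for a real status call: reports "processing" twice, then "done".
_responses = iter([
    {"state": "processing"},
    {"state": "processing"},
    {"state": "done", "results": {"labels": ["person"]}},
])

results = wait_for_results(lambda: next(_responses), poll_seconds=0.01)
print(results)
```

Some platforms can also notify you when a job finishes, which avoids polling altogether. Check each vendor’s documentation for what it supports.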

All of these services use JSON, the data exchange format shown in Figure 3. This is great, because it enables you to get data from multiple services normalized and used in a custom application. When you’re working with one format, it makes it a lot easier to parse through the immense amount of data you get back.

In the example shown in Figure 3, I sent up a trailer of American Psycho. You can see the data it returned in AWS on the far left. It says, “I found a person at timestamp 18196 milliseconds, and I’m 97.8222 percent sure.” That’s the data you get back with object detection in AWS.

With Azure Video Indexer (second column from left), it says, “I found a person in the time range 00:00:21.438 to 00:00:22.14.” In the Google column (third from left), you can see that it identified a “human” “entity” in multiple instances, with time offsets expressed in seconds and nanos (nanoseconds), and confidence values in the 85 to 96 percent range.
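Because every service returns JSON, a small amount of code can map the three responses into one common shape. The sketch below uses simplified record layouts modeled on what Figure 3 describes. The field names are assumptions that may not match each vendor’s current response format, and the Google timing and confidence values are illustrative, since the article gives only a range.

```python
from typing import Dict, List


def timecode_to_seconds(timecode: str) -> float:
    """Convert an 'HH:MM:SS.fff' timecode into seconds."""
    hours, minutes, seconds = timecode.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def from_aws(item: Dict) -> List[Dict]:
    # Assumed shape: timestamp in milliseconds, confidence as a percentage.
    label = item["Label"]
    return [{"source": "aws", "label": label["Name"].lower(),
             "start": item["Timestamp"] / 1000.0,
             "confidence": label["Confidence"] / 100.0}]


def from_video_indexer(item: Dict) -> List[Dict]:
    # Assumed shape: timecode ranges, no confidence (as in the May 2018 screenshot).
    return [{"source": "video_indexer", "label": item["name"].lower(),
             "start": timecode_to_seconds(inst["start"]), "confidence": None}
            for inst in item["instances"]]


def from_google(item: Dict) -> List[Dict]:
    # Assumed shape: offsets split into seconds and nanos, confidence from 0 to 1.
    records = []
    for seg in item["segments"]:
        offset = seg["segment"]["startTimeOffset"]
        start = offset.get("seconds", 0) + offset.get("nanos", 0) / 1e9
        records.append({"source": "google", "label": item["entity"]["description"].lower(),
                        "start": start, "confidence": seg["confidence"]})
    return records


aws_item = {"Timestamp": 18196, "Label": {"Name": "Person", "Confidence": 97.8222}}
indexer_item = {"name": "person",
                "instances": [{"start": "00:00:21.438", "end": "00:00:22.14"}]}
google_item = {"entity": {"description": "human"},  # timing and confidence are illustrative
               "segments": [{"segment": {"startTimeOffset": {"seconds": 18, "nanos": 500000000}},
                             "confidence": 0.96}]}

records = from_aws(aws_item) + from_video_indexer(indexer_item) + from_google(google_item)
for record in records:
    print(record)
```

Each record ends up with the same four keys, with start times in seconds and confidence on a 0 to 1 scale, so a custom application can search or sort results without caring which service produced them.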

Keep in mind that the computer doesn’t actually “see” a person; it’s learning from all of the data that it’s processed historically and making its best judgment based on that, so knowing the confidence value is extremely important. If your primary interest is identifying objects and you want it to be accurate, you should definitely pay attention to the confidence values being reported.
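One simple way to act on confidence values is to discard detections that fall below a threshold you choose. The following is a rough sketch with illustrative data. The 0.9 cutoff and the decision to drop unscored detections are assumptions you would tune for your own use case.

```python
from typing import Dict, List

# Illustrative detections on a 0-1 confidence scale; None means the service gave no score.
detections = [
    {"label": "person", "start": 18.196, "confidence": 0.978222},
    {"label": "human", "start": 18.5, "confidence": 0.85},
    {"label": "person", "start": 21.438, "confidence": None},
]


def confident_only(items: List[Dict], threshold: float) -> List[Dict]:
    """Keep detections whose confidence meets the threshold; drop unscored ones."""
    return [
        item for item in items
        if item["confidence"] is not None and item["confidence"] >= threshold
    ]


for item in confident_only(detections, threshold=0.9):
    print(f'{item["label"]} at {item["start"]:.3f}s ({item["confidence"]:.1%})')
```

A higher threshold gives you fewer but more trustworthy tags, while a lower one surfaces more content at the cost of more false positives.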