How to Choose a Video AI Platform and Evaluate its Results
Finally, on the far right in Figure 3 is IBM Watson Media. Watson’s results include a "person" identified for a stated duration with a 64.6 percent confidence value. As you can see, you have a number of services to choose from based on the types of media assets you need to index.
Faces, Objects, and Scenes
One key feature of any AI platform is its ability to recognize objects, faces, and scenes. In my research on AI systems, there’s one thing I’ve seen AWS doing that I didn’t really find in the JSON payloads from many of the other providers: highly detailed facial recognition. You get the X/Y coordinates of the eyes and nose, plus head pose (roll, yaw, pitch). You can also do facial matching. If you have a high-resolution picture of a face, you can tell the system, “Keep an eye out for this face and tell me when you see it in the video.”
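To make that concrete, here is a minimal sketch of pulling landmark coordinates and head pose out of a Rekognition-style face-detection response. The sample payload below is hand-written for illustration; it mirrors the general shape of such a response, not exact API output.

```python
# Illustrative only: the payload below is a hand-written stand-in for an
# AWS Rekognition-style DetectFaces response, trimmed to the fields
# discussed in the text (landmark X/Y and roll/yaw/pitch pose).

sample_response = {
    "FaceDetails": [
        {
            "Confidence": 99.2,
            "Landmarks": [
                {"Type": "eyeLeft", "X": 0.31, "Y": 0.42},
                {"Type": "eyeRight", "X": 0.45, "Y": 0.41},
                {"Type": "nose", "X": 0.38, "Y": 0.51},
            ],
            "Pose": {"Roll": -2.3, "Yaw": 12.7, "Pitch": 4.1},
        }
    ]
}

def summarize_faces(response):
    """Return a landmark dict and pose dict for each detected face."""
    summaries = []
    for face in response["FaceDetails"]:
        landmarks = {lm["Type"]: (lm["X"], lm["Y"]) for lm in face["Landmarks"]}
        summaries.append({"landmarks": landmarks, "pose": face["Pose"]})
    return summaries

faces = summarize_faces(sample_response)
print(faces[0]["pose"]["Yaw"])  # 12.7
```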
Unlike the other three platforms, which (at this writing) work exclusively with VOD, AWS handles streaming videos. AWS Elemental can be plugged into the company’s AI services to apply AI to live content as well.
Google Cloud Platform’s advantage is a wealth of scene data. Let’s say you’re producing movies or TV shows, you have a lot of long-form content, and you’re curious about scene data. Google provides a lot of scene data with per-shot annotations. You could say, “I’m watching this movie and I want to know where all the car chases are.” You could do a search for “car” and it should break up the movie and show you, “Scene 1 has a car, Scene 10 has a car,” etc. If you want to know the start and end times of those appearances, it will let you know, “When I say I saw a car in Scene 1, that was seconds 0–30.”
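A "find every shot with a car" search over per-shot annotations can be sketched like this. The data structure is a simplified stand-in for what a video-annotation service returns, not Google's exact API shape.

```python
# Sketch of a label search over per-shot annotations. The structure is a
# simplified, hypothetical stand-in for a shot-label annotation payload.

annotations = [
    {"label": "car", "shots": [{"start_s": 0.0, "end_s": 30.0},
                               {"start_s": 512.4, "end_s": 540.0}]},
    {"label": "beach", "shots": [{"start_s": 95.0, "end_s": 120.0}]},
]

def find_label(annotations, query):
    """Return (start, end) times for every shot tagged with `query`."""
    for entry in annotations:
        if entry["label"] == query:
            return [(s["start_s"], s["end_s"]) for s in entry["shots"]]
    return []

print(find_label(annotations, "car"))  # [(0.0, 30.0), (512.4, 540.0)]
```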
IBM Watson Media has very in-depth detection competence. This system features tons of labels, and it even has taxonomies within those labels, so your results will indicate, “this is an animal, this is a mammal,” and so forth.
IBM Watson Media also features extensive compliance-checking, with emphasis on adult content, trademarks, violence, fraudulent materials, etc. If your use case concerns making sure that nobody is ripping off Netflix movies, IBM Watson Media may be the way to go.
Microsoft Video Indexer does a particularly good job of recognizing celebrities. If you want to find out, for example, the last time Kanye West was at a beach, Azure will show you that kind of stuff. It does OCR, brand detection, and a little bit of explicit content detection as well.
Let’s say you want your AI platform to recognize the difference between content with dramatically divergent moods, such as Teletubbies and Dexter. Amazon Web Services provides sentiment scores on key phrase mentions and on sentences. This applies to anyone speaking in a video clip. If I’m very, very angry and I’m yelling at somebody, it’s going to return a negative sentiment. If I’m reading this article aloud, it’s probably a neutral sentiment. AWS analyzes both the sound or tone of the audio and the words that speakers use. And if I use a lot of profanity, or talk about bad things like fires, the value might lean toward the negative.
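To show how per-sentence scores like these might roll up into a clip-level mood, here is a small sketch. The payload and the -1..1 scale are assumptions for illustration, not any vendor's actual response format.

```python
# Illustrative only: averaging per-sentence sentiment scores (on an assumed
# -1..1 scale) into a single clip-level value. Not a real API payload.

sentences = [
    {"text": "I am very, very angry.", "sentiment": -0.8},
    {"text": "I am reading this article aloud.", "sentiment": 0.0},
    {"text": "What a great day!", "sentiment": 0.7},
]

def clip_sentiment(sentences):
    """Average the per-sentence scores into one clip-level value."""
    return sum(s["sentiment"] for s in sentences) / len(sentences)

print(round(clip_sentiment(sentences), 3))  # -0.033, slightly negative overall
```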
Google Cloud Platform provides sentiment analysis based on sentences and nouns—not only for sentiment, but for prominence as well. It considers a word and its sentiment in the context of the video as a whole.
Watson may be the most mature system in this regard. As you can see in the far-right column in Figure 3, Watson scored the clip I uploaded on anger, sadness, joy, fear, and disgust. Watson scores text on a scale of 0–1 and evaluates overall sentiment as well.
Azure Video Indexer does a reasonably good job with sentiment, providing scores on sentences and videos broken down into percentages.
If sentiment is a very important thing to you, I recommend looking at Watson Media. If you just want to validate the functionality via a proof of concept, try Google or AWS.
Surveillance is an area that is going to drive a lot of AI development in regard to object detection. One of the biggest challenges organizations face is trying to determine, out of a 24/7 video stream that is generally very boring to watch, when there’s an incident.
Being able to use AI to notify us of anomalies in security footage is a very powerful thing, and some companies have sprouted up specifically to do that. Google now offers its Nest Cam security cameras that you can use at your house—inside or outside—and they can notify you when they notice that things may be amiss. This is definitely an up-and-coming area.
AWS offers Kinesis for processing and analyzing video streams, but it requires you to use its own SDKs. The nice thing is, if you or your developers know some .NET or C++, you could take AWS’s SDK and do what’s called a Producer and a Consumer. The Producer throws the live stream data into a stream that is processed through its services. The Consumer watches for whatever you want it to watch for, whether it’s a car going too fast, a specific face, etc.
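The Producer/Consumer split described above can be sketched locally with a thread-safe queue standing in for the Kinesis video stream. This illustrates the pattern only; the frame payloads and the face check are placeholders, and the real SDK handles the media transport.

```python
# Local sketch of the Producer/Consumer pattern: a queue stands in for a
# Kinesis video stream. Frame payloads and the face check are placeholders.
import queue
import threading

stream = queue.Queue()
matches = []

def producer(frames):
    """Push frame metadata into the stream, then signal completion."""
    for frame in frames:
        stream.put(frame)
    stream.put(None)  # sentinel: no more frames

def consumer():
    """Watch the stream for frames containing the face we care about."""
    while True:
        frame = stream.get()
        if frame is None:
            break
        if frame.get("face_id") == "person-of-interest":
            matches.append(frame["timestamp_s"])

frames = [
    {"timestamp_s": 0.0, "face_id": None},
    {"timestamp_s": 1.5, "face_id": "person-of-interest"},
    {"timestamp_s": 3.0, "face_id": None},
]

t = threading.Thread(target=consumer)
t.start()
producer(frames)
t.join()
print(matches)  # [1.5]
```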
Out of the box, the AWS SDK provides the ability to do facial detection. In the AWS booth at NAB 2018, they had a computer with a camera, and it used Kinesis video streams to count the number of people that walked by their booth during the day. That computer actually showed a graph on a timeline of how many people were walking by at noon, 2 p.m., 3:15, etc.
Because Kinesis has an SDK and AWS does have other services, you could theoretically build in additional features by taking that data down at periodic points and sending it over to the services that require VOD or other prerecorded content, and then processing it that way, using 30-second chunks and the like.
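The "periodic chunks" idea amounts to slicing a live timeline into fixed windows that could each be handed to a VOD-only service. A minimal sketch:

```python
# Sketch of slicing a live timeline into 30-second windows, each of which
# could be sent to a VOD-only analysis service as prerecorded content.

def chunk_timeline(duration_s, chunk_s=30):
    """Return (start, end) pairs covering 0..duration_s in fixed-size chunks."""
    chunks = []
    start = 0
    while start < duration_s:
        chunks.append((start, min(start + chunk_s, duration_s)))
        start += chunk_s
    return chunks

print(chunk_timeline(95))  # [(0, 30), (30, 60), (60, 90), (90, 95)]
```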
For stream monitoring, Google Cloud does a 5-minute streaming transcription. It’s done, essentially, from a dictation perspective. It will likely evolve in the future, but for now, that’s what’s publicly available.
Currently, Watson doesn’t offer any stream-monitoring features, but I’ve spoken with a senior product rep at IBM Cloud who said surveillance was on the roadmap. With Azure Video Indexer, we haven’t seen anything yet.
Transcription and Translation
For audio transcription and translation, Figure 4 provides a useful breakdown of what these big four services/platforms currently offer.
Current transcription/translation features available from AWS, Google Cloud, IBM Watson Media, and Microsoft Azure Video Indexer
AWS released its offering shortly after NAB. It supports U.S. English and Spanish for transcription. For translation, it currently supports 12 languages, with support for six more promised in the coming months. In my team’s testing, the language detection seemed very accurate. Per-word confidence is good as well.
If your use case is very, very globalized and you need 100 or more languages, then Google Cloud may be your best option. Google also features natural transcription, and if you don’t have highly sensitive data, you can opt in to public AI training. (To be clear, that’s not Google training you to use AI; it’s allowing Google’s machine learning to incorporate your data into its learning process to have more data to learn from.)
IBM Watson Media promises transcriptions that are 95 percent accurate “in the right conditions.” That qualifier is an important thing to keep in mind with audio transcription, regardless of the platform. If five people are all talking at once, or if the person speaking is a newscaster covering a hurricane from the eye of the storm, it might be hard for an AI to pick out an accurate transcription.
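Accuracy claims like "95 percent" are typically checked as word error rate (WER) against a hand-corrected reference transcript. Here is a minimal WER sketch using standard edit distance over words; 1 minus the WER gives a rough accuracy figure.

```python
# Minimal word-error-rate (WER) calculation: edit distance over words
# between a reference transcript and an AI-generated hypothesis.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

ref = "the quick brown fox jumps over the lazy dog"
hyp = "the quick brown fox jumped over the lazy dog"
print(wer(ref, hyp))  # one substitution in nine words, about 0.11
```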
If the right conditions are there, 95 percent accuracy provides a good starting point, and then you can go in and edit the transcription as necessary and still save some time on typing. If you’re thinking about leveraging AI, it’s important to think about your content profile, because the AI is only as smart as the data that’s been sent to it. For example, Google has YouTube; IBM has tennis’ US Open and the Masters golf tournament; I’m not sure what Microsoft trained its system on. Using your own brain as an example, there are things you know and things you don’t know—for the most part, they’re based on what you’ve personally experienced. Like your own brain, these AI systems are only as accurate as what they know.
You also need to consider your functional use case. Are you trying to do transcription and translation? Are you trying to do object detection? Are you trying to populate a media asset manager with a bunch of metadata? Are you doing video surveillance? Thinking about those things will help you decide which AI service to engage with first.
How accurate do you need your results to be? Any video payload on any AI platform is going to return some false positives. Once again, we go back to that confidence rating. If you see that the AI detected a person’s hand, but it was only 40 percent confident, you don’t necessarily want to send that over to your MAM. You may want to have some kind of check within your workflow that says, “Let’s throw anything that’s 95 percent and above into our MAM. If it’s less than that, let’s keep that data, but we’ll have somebody go in there and take a look at it, maybe edit it, remove what they don’t need, and then put what’s left into the MAM as well.”
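The confidence-gated workflow described above can be sketched in a few lines: detections at or above a threshold flow straight to the MAM, and everything else lands in a human review queue. The payload shape is illustrative.

```python
# Sketch of the confidence-gated MAM ingest described above. Detections at or
# above the threshold auto-ingest; the rest go to a human review queue.

def route_detections(detections, threshold=0.95):
    """Split detections into (auto-ingest, needs-review) lists."""
    auto, review = [], []
    for d in detections:
        (auto if d["confidence"] >= threshold else review).append(d)
    return auto, review

detections = [
    {"label": "person", "confidence": 0.99},
    {"label": "hand", "confidence": 0.40},
    {"label": "car", "confidence": 0.97},
]

auto, review = route_detections(detections)
print([d["label"] for d in auto])    # ['person', 'car']
print([d["label"] for d in review])  # ['hand']
```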
The next question is, do you have developers in your organization? All four of these platforms offer something different in terms of available demos, API documentation, and SDKs. Ultimately, your developers and their skillsets are going to determine which vendor you should work with, because not every developer knows every programming language. Each development team is going to have its own preference on what the members can work with most efficiently. If you don’t have any developers at all, maybe it makes sense to go with IBM Watson Media so you only have to pay somebody to glue your system together with theirs.
Definitely play around with what’s available before you dive in. There’s plenty of opportunity to test these cutting-edge technologies, and much to gain for your business when you determine which vendor’s AI is best-suited for your development team and your most significant use cases.
[This article appears in the October 2018 issue of Streaming Media magazine as "How to Choose a Video AI Platform."]