Video: AI-Driven Tools for Face and Object Detection

Read the complete transcript of this clip:

Jun Heider: Visual detection, what do you see, computer? It's a gorilla. Awesome, thank you, I didn't have to watch the video.

So when it comes to visual detection, object detection is a really big one. What do you actually see in that video, and how confident are you that you saw that?

So on a lot of these slides here, what's on the left is JSON. This is the data that comes back when you use these services. So from a developer's perspective, they're going to say, "Hey service, please tell me what's in this video," and then this data's going to come back with some very, very useful information.

In regard to video intelligence, Google does a really good job, and a really cool thing that they do is they offer categories as well. So it's like not only did I see a brain, but I also know that the brain is in the organ category. And I'm about 80% confident that that's a brain.

You can see that represented within that JSON to the left, if it's not too washed out. We have a category entity of organ, above that it says the entity was a brain, so you get the picture. Apparently, Google has 20,000 labels. So there's 20,000 objects that Google is capable of detecting. And that's a pretty good amount.
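As a rough illustration of the label data described above, here is a minimal Python sketch that filters detected labels by confidence. The JSON shape and the field names (`labels`, `entity`, `category`, `confidence`) are assumptions made for this example; they are not the response format of Google's service or of any other vendor.

```python
import json

# Hypothetical response body. The structure and field names are
# illustrative only and do not match any real vendor schema.
SAMPLE_RESPONSE = """
{
  "labels": [
    {"entity": "brain", "category": "organ", "confidence": 0.80},
    {"entity": "table", "category": "furniture", "confidence": 0.42}
  ]
}
"""


def confident_labels(raw_json, min_confidence=0.5):
    """Return (entity, category, confidence) tuples at or above the threshold."""
    data = json.loads(raw_json)
    return [
        (label["entity"], label.get("category"), label["confidence"])
        for label in data.get("labels", [])
        if label["confidence"] >= min_confidence
    ]


for entity, category, confidence in confident_labels(SAMPLE_RESPONSE):
    # Show each kept label with its category and confidence as a percentage.
    print(f"{entity} ({category}): {confidence:.0%}")
```

With a threshold of 0.5, only the "brain" label in the "organ" category is kept, which mirrors the "about 80% confident" example from the talk.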

The other thing is, recently they've brought object tracking into beta. So not only could it detect the brain in this picture, but it could detect a brain if, like, there were wheels on the table and it was kind of going across the screen. That's something pretty new; it's still in beta.

AWS has thousands of labels, but they didn't specify the number. But one interesting thing is we threw a video up to AWS Rekognition, that's the name of their service, and we got a label back that was "red carpet premiere." So it was, I think, like an E with the exclamation point, and people were walking, and they're taking pictures of them in their dresses and stuff. So that was kind of interesting, because I didn't expect to see a label that was that specific.

Clarifai, as I mentioned earlier, they have a number of really interesting models, so Clarifai, they have an apparel model, they have a food model, travel, weddings, in addition to their general model. And then apparently Hive AI can detect commercials. So the stream's playing, and then all of a sudden a commercial plays, oh, I saw a commercial.

Valossa does expressions and injury. So if my arm's bleeding, that's something that Valossa, the Finnish AI company, might be able to see. AWS, when it comes to the face, they really, really focus on the face. So if you're looking for faces, if you're trying to figure out, hey, I want to draw, like, one of those curlicue mustaches onto the video, this is the service you might want to use. Because not only do they detect the face, but they know where the eyes, the nose, and the corners of the mouth are. And they also can detect the roll, pitch, and yaw.

So all of that is represented there on the left in that JSON there. And what you'll see there to the right, where it says face and it's a big red box, that's what's called a bounding box. And a bounding box is basically telling you the computer saw a face right there.
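To make the bounding box idea concrete, here is a minimal Python sketch that converts a box into pixel coordinates for drawing. It assumes the box is given as `Left`, `Top`, `Width`, and `Height` values expressed as fractions of the frame size, which is the style AWS Rekognition uses; the sample values and the `to_pixel_box` helper are hypothetical.

```python
def to_pixel_box(box, frame_width, frame_height):
    """Convert a ratio-based box to (x1, y1, x2, y2) pixel coordinates.

    Assumes `box` holds Left, Top, Width, and Height as fractions
    (0.0 to 1.0) of the overall frame dimensions.
    """
    left = round(box["Left"] * frame_width)
    top = round(box["Top"] * frame_height)
    width = round(box["Width"] * frame_width)
    height = round(box["Height"] * frame_height)
    return left, top, left + width, top + height


# Hypothetical face box on a 1280x720 frame.
face_box = {"Left": 0.25, "Top": 0.1, "Width": 0.5, "Height": 0.4}
print(to_pixel_box(face_box, 1280, 720))  # (320, 72, 960, 360)
```

The resulting corner coordinates are what a drawing library would need to render the red rectangle around the face.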

That's actually something that a lot of these services haven't been providing, but are starting to provide. So, I think Google recently started providing bounding boxes, and some other services as well. Because you might want to do something creative with that bounding box information. And then in addition to detecting faces, a lot of these services detect celebrities as well.

So Microsoft Video Indexer, they've stated that they know about one million celebrities. And that's actors, world leaders, athletes, et cetera. And you'll see that here in a demonstration later, with Video Indexer from Microsoft.

You'll notice that I've highlighted the word Blender.org. So, there's brand detection. Microsoft actually detects brands from OCR, so Optical Character Recognition: text on the screen. They can also detect it from audio as well. But if you're looking for that service that says, oh, I see the golden arches, that's McDonald's, you're going to want to look into a company like Valossa, Hive AI, Clarifai, or Veritone.
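The idea of finding brands in on-screen text can be sketched in a few lines of Python. This is a naive illustration only: it assumes the OCR step has already produced a list of text lines, and the brand list and `find_brands` helper are hypothetical. It is not a description of how Video Indexer or any other service implements brand detection.

```python
import re

# Hypothetical list of brand names to look for.
KNOWN_BRANDS = ["Blender.org", "McDonald's"]


def find_brands(ocr_lines, brands=KNOWN_BRANDS):
    """Return the brands that appear in any OCR text line, ignoring case."""
    found = []
    for brand in brands:
        # Escape the brand so characters like "." match literally.
        pattern = re.compile(re.escape(brand), re.IGNORECASE)
        if any(pattern.search(line) for line in ocr_lines):
            found.append(brand)
    return found


# Hypothetical OCR output from a frame of end credits.
ocr_lines = ["Visit www.blender.org today", "Directed by Nathan Vegdahl"]
print(find_brands(ocr_lines))  # ['Blender.org']
```

Plain text matching like this only works when the brand name is written on screen, which is why recognizing a logo such as the golden arches calls for a visual model instead.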

Video Indexer does OCR, as you saw in the previous example. So in this case, you know, it found Nathan Vegdahl inside the text, and returned that back so that we didn't have to watch the credits, 'cause we don't watch the credits, right, sometimes? Other vendors that do OCR are Veritone and AWS. And as of November 2018, Cloud Video Intelligence from Google started doing OCR in beta.