
Engineering the Video Experience for Alexa and Her Friends


“Deep linking and content discoverability is critical for translating action phrases into your app for a given action,” says Dale. “An action could be ‘Tell $name to $action_phrase,’ where ‘$name’ is your voice app and ‘$action_phrase’ maps to your content search and deep link.”
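As a rough illustration of the pattern Dale describes, the sketch below matches a “Tell $name to $action_phrase” utterance and turns the action phrase into a search deep link. The app name, the `exampleapp://search` URL scheme, and the function name are assumptions made for this example only; real voice platforms define their own invocation grammar and linking rules.

```python
import re
from urllib.parse import quote

# Hypothetical values for illustration; not taken from any real voice platform.
APP_NAME = "example video"
DEEP_LINK_BASE = "exampleapp://search"

# "Tell $name to $action_phrase" expressed as a regular expression.
UTTERANCE = re.compile(
    r"^tell (?P<name>.+?) to (?P<action_phrase>.+)$", re.IGNORECASE
)


def to_deep_link(utterance):
    """Return a search deep link, or None if the utterance is not for this app."""
    match = UTTERANCE.match(utterance.strip())
    if match is None or match.group("name").lower() != APP_NAME:
        return None
    phrase = match.group("action_phrase").strip()
    # URL-encode the phrase so it can be passed to the app's content search.
    return f"{DEEP_LINK_BASE}?q={quote(phrase)}"


print(to_deep_link("Tell Example Video to play space documentaries"))
```

In practice the platform performs the speech recognition and invocation matching; the publisher’s job is mostly the last step, making sure the phrase resolves to a working deep link into its content.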

What about development efficiencies? “Cortana supports reusing your Alexa skill code base so voice apps can easily be adapted there. For Siri and Google, integration into existing native apps for executing on a deep link into your content has to be top-of-mind. Managing these multiple voice integration contexts can be complex,” says Dale.

Error Handling

From a user-interaction perspective, voice control for video by Alexa and her friends may be a hard paradigm to master. It could be a very successful, leading-edge approach to controlling video playback, but if the system turns people away and makes them feel frustrated or embarrassed, then the technology could be dead in the water. “If the voice system doesn’t recognize what you say, if it doesn’t do graceful failure, people will use it maybe three times and then they will literally stop trying,” says Hoek.

Hoek recommends designing error handling that responds to a failed request with probing questions. “Without the right feedback, users will feel too embarrassed to keep trying after [a] certain point, and then you’ve lost them as a customer.”
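One way to picture graceful failure is as a small decision function that picks a response based on how well the request was understood. The sketch below is an assumption-laden example, not any vendor’s implementation: the confidence threshold, the attempt limit, and the wording of the prompts are all hypothetical.

```python
def respond(candidates, confidence, attempts, threshold=0.6):
    """Choose a reply to a voice request.

    candidates: content titles that matched the request (may be empty)
    confidence: recognizer score between 0 and 1 (hypothetical scale)
    attempts:   number of failed tries the user has already made
    """
    # Confident and unambiguous: just do it.
    if confidence >= threshold and len(candidates) == 1:
        return f"Playing {candidates[0]}."
    # Some matches, but not sure which: ask a probing question.
    if candidates:
        options = " or ".join(candidates[:2])
        return f"Did you mean {options}?"
    # Nothing matched: reprompt gently for the first couple of tries.
    if attempts < 2:
        return "Sorry, I didn't catch that. What would you like to watch?"
    # After repeated failures, offer a concrete way forward.
    return "You can say 'watch' followed by a show name, or use the remote to search."


print(respond(["Show A", "Show B"], 0.4, 0))
```

The point of the final branch is that the user is never left with a bare “I didn’t understand”; every failure ends with a question or a suggestion.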

Creating a Must-Use System

This brings us back to Comcast. Its users seem to like navigating by voice. According to the company’s fourth-quarter 2016 financial report, 48 percent of residential video customers now have X1. The company had 80,000 net additions (new subscribers minus churn) in Q4 and 161,000 for 2016, with video revenue growth of 4.3 percent, the best result in 10 years. “The reason why it’s becoming so popular is it’s very optimized for a specific application. It’s not as broad as Alexa or Cortana or Siri, but the advantage that you have is you use that solution every night,” says Vena.

“The first product we released was a voice-controlled remote that we launched in May 2015. Now we have more than 13 million active remotes in our customers’ homes, and those have generated over two billion voice commands in that time,” says Palmatier. The X1 can show viewers content that they otherwise wouldn’t be able to find. “‘Show me free kids’ movies’ is a great [option] that our regular UI doesn’t navigate to,” says Palmatier. (To be fair, many of the other platforms can do this too.)

“I was really pleasantly surprised; when we launched this I thought we would have to invest heavily in a lot of customer education,” says Palmatier. “Turns out customers figured it out all by themselves. Most popular commands? Watch Y, or tune to channel X.”

Data Capture and Processing

The difference between finding content and coming up empty comes down to having good metadata. TV series, movies, and other content produced in advance tend to come with detailed metadata, such as a synopsis, the actors, and many other details. When Comcast broadcast the Olympics, it did a lot of optimization to ensure the company had content-specific metadata to give users accurate results when they asked questions.

“The way the system works is the remote sends a voice recording and the back-end system translates that recording into text. That text goes into another system that needs to determine intent,” says Palmatier.
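The second stage Palmatier describes, turning recognized text into an intent, can be sketched with a few patterns. This is a deliberately simple stand-in for illustration: the intent names and patterns are assumptions, the speech-to-text step is taken as already done, and a production system would use a trained natural-language model rather than regular expressions.

```python
import re

# Hypothetical intent patterns, modeled on simple "watch" and "tune to channel" requests.
PATTERNS = [
    ("tune_channel", re.compile(r"^tune to channel (?P<channel>\d+)$")),
    ("watch_title", re.compile(r"^watch (?P<title>.+)$")),
]


def determine_intent(text):
    """Map transcribed text to an intent and its parameters."""
    normalized = text.strip().lower()
    for intent, pattern in PATTERNS:
        match = pattern.match(normalized)
        if match:
            return {"intent": intent, **match.groupdict()}
    # Anything unmatched is handed off for broader search or a clarifying prompt.
    return {"intent": "unknown", "text": normalized}


print(determine_intent("Tune to channel 7"))
print(determine_intent("Show me free kids' movies"))
```

Requests that fall through to “unknown” are where good metadata matters most, since they have to be resolved by searching the catalog rather than by matching a fixed command.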

In aggregate, all of these requests are being synthesized over time to tune the natural-language processor. This is where artificial intelligence enters the picture. Processing all the content requests provides the insight for a feedback loop that will analyze and constantly evolve and learn based on what people are asking for, says Palmatier. In other words, they are using machine learning to optimize results. This learned intelligence and data insight can also bring data owners great value by being able to promote better recommendations and to gain deep knowledge about their users.

Let’s get back to the topic of metadata. If there’s a query where a user says, “Show me Justin Timberlake,” sending them to his top music video might be the initial result. A more sophisticated system might say, the first 100 times, “Did you mean watch Justin Timberlake SNL [Saturday Night Live] appearances? Or do you want to watch Justin Timberlake music videos?” says Palmatier. If the system is smart enough over time and realizes that 90 percent of the people want to watch his SNL appearances, it will start directing everyone there first.
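The ask-first, then-learn behavior in that example can be expressed in a few lines. The sketch below uses the 100-request and 90-percent figures from the example as defaults; the class name, method names, and option labels are hypothetical, and a real system would learn from far richer signals than a simple tally.

```python
from collections import Counter


class Disambiguator:
    """Ask users to choose until one option clearly dominates, then route directly."""

    def __init__(self, options, min_samples=100, majority=0.9):
        self.options = list(options)
        self.min_samples = min_samples  # how many answers to collect before deciding
        self.majority = majority        # share of answers needed to stop asking
        self.choices = Counter()

    def record(self, option):
        """Store the option a user picked after being asked."""
        if option not in self.options:
            raise ValueError(f"unknown option: {option}")
        self.choices[option] += 1

    def decide(self):
        """Return ("route", option) once a clear favorite exists, else ("ask", options)."""
        total = sum(self.choices.values())
        if total > 0 and total >= self.min_samples:
            option, count = self.choices.most_common(1)[0]
            if count / total >= self.majority:
                return ("route", option)
        return ("ask", self.options)


query = Disambiguator(["SNL appearances", "music videos"])
for _ in range(95):
    query.record("SNL appearances")
for _ in range(5):
    query.record("music videos")
print(query.decide())
```

With fewer than 100 recorded answers, or with no option reaching the 90 percent share, `decide()` keeps returning the clarifying question.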

Amazon, which provided answers via its PR department, says voice navigation can make purchasing content [...] “[...] you look at something like music, customers discover and play a lot more music when using Alexa because it removes an element of friction for them.”

Will voice control be a big profit center for media companies? If they get it right, sure, but it’s a long road.

Final Words

Planning the new audio-controlled video experience is a multifaceted task. The first question is likely whether to use near- or far-field communications. Who owns the user data that is the basis for building the AI? Which type of search approach is the most appropriate, subject or query? Should we build or buy input processing for automatic speech recognition and natural-language processing? Does the UX/UI design engage rather than discourage users?

Right now there are few media companies that are publicly active in the field. Today, launching a YouTube app on the Fire TV means using the remote in the traditional way, by clicking to search or play content. Media companies need to design their content to be voice-enabled so Alexa and her friends can help customers go deep and find all sorts of viewing experiences.

Dale is starting to map out what he sees as important for Ellation and Crunchyroll. “To be well-positioned here as a publisher it’s important to have your SEO, media, metadata, and deep-linking strategy within [the] app fully developed,” says Dale. “This enables you to engage with the more advanced, conversational content-discovery engines that the large voice platform providers are building.”

The younger generation is very quick to adopt things, but also extremely quick to reject things if they don’t work very well, says Hoek. “Gestures and speech are very natural for people. If this starts working for people, I think adoption will be very wide, but they won’t admire you for it. They won’t see the enormous effort that has been put into making that system work.”

Alexa and her friends have the growing pains that very young products go through, but, just like children, the voice-control assistants are likely to grow up fast. Apple’s new HomePod speaker is a home audio device that will offer similar features to the Echo, all while collecting more and more AI data. A version of Echo is gaining a screen, and many more things are in the works for publishers. The X1’s customers have shown that voice works, and it’s likely to be appealing for all of the other platforms. “If I was advising someone who was about to embark on this development effort, I would say, ‘Go forth, because your efforts will be rewarded,’” says Palmatier.

[This article appears in the June 2017 issue of Streaming Media Magazine as "Engineering the Video Experience for Alexa and Her Friends."]
