Engineering the Video Experience for Alexa and Her Friends
Engineering video playback for Alexa and her friends might sound like an obscure task, far from the core requirements of a successful video publisher, but in reality it’s a smart solution to the big search problem. Viewers now have so many video services to choose from that it’s hard to tell which service carries the video they want, and developers are finding it incredibly challenging to build elegant, efficient user interface (UI) navigation.
Solving the Search Problem
Voice control enables user navigation via speech instead of via a graphical user interface, with the result being that users don’t have to think about how to find their content. “Voice remote is a great way to flatten the UI. It gives an awesome experience and is a way to get access to what otherwise is a dizzying array of content choices,” says Jonathan Palmatier, VP product management, voice control, Comcast Cable. Comcast’s X1 TV box has a voice remote, which just might be the invention that will make people stop hating their cable companies (at least if their company is Comcast).
Voice control alone is only part of the story. When paired with artificial intelligence (AI), the software should be able to learn viewer preferences, tune to the correct channel or service, and deliver increasingly relevant search results and recommendations over time. So in the future, telling a device, “Play my favorite TV show” should do just that, but we’re getting a bit ahead of ourselves here. The road to voice control will likely prove long and winding. The cast of characters includes Amazon’s Alexa, Apple’s Siri, and Microsoft’s Cortana, as well as Google Assistant and Comcast’s X1.
Near- vs. Far-Field Communications
The X1 and any remote you speak into use near-field voice capture: the microphone is only inches from your mouth, so the signal is relatively clean. (This is distinct from NFC, the short-range wireless connection standard.) Alexa (via the Amazon Echo) and other always-on devices use far-field voice capture, picking up speech from across the room. “(Far-field devices) are always on and listening for a keyword to wake up and then start recording and transmitting the voice command. Our voice remote [X1] only works when a user presses the microphone,” says Palmatier.
The difference between a voice remote search and playing content via a far-field AI platform can be a thin, moving line. Amazon’s Fire TV remote is Alexa-enabled and can respond the same way Alexa would on an Echo device, but the vast majority of the Alexa controls are for audio and connected home devices. One video playback control that’s available now is for Plex, and Alexa can play Plex content if there is a Plex server in a home media set-up.
Many of the voice platforms work well with their own content (i.e., Alexa works best with Amazon content) or when using a voice remote to play a movie. The problem develops when a viewer wants to seek content from another media source or app, or even make a more complicated request. Media apps need to be designed with voice control to benefit from the audio navigation available from the AI platforms.
For this article, Amazon sent written statements, condensed here, saying that content and display card controls can be added via “skills.” It’s these skills that enable activity: connecting the NPR skill means Alexa can access content from the NPR app and deliver an audio news briefing (summary). Without the right skills connected, Alexa returns an awful lot of “I don’t understand the question” responses. Siri has similar problems. (Contacts at Microsoft and Roku were unavailable for interviews.)
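To make the skill mechanism concrete, here is a minimal sketch of a custom Alexa skill handler written as a plain AWS Lambda-style function, without the SDK. The intent name "PlayNewsBriefingIntent" is a hypothetical example, not part of any published skill; the response envelope follows the Alexa Skills Kit JSON format.

```python
# Minimal sketch of an Alexa custom-skill handler (plain Lambda style, no SDK).
# A "skill" is essentially this: code that maps a recognized spoken intent to
# a response. The intent name below is a hypothetical example.

def build_speech_response(text, end_session=True):
    """Wrap plain text in the Alexa Skills Kit response envelope."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }

def lambda_handler(event, context):
    """Entry point Alexa invokes for each request routed to the skill."""
    request = event.get("request", {})
    if request.get("type") == "IntentRequest":
        intent = request.get("intent", {}).get("name")
        if intent == "PlayNewsBriefingIntent":
            return build_speech_response("Here is your news briefing.")
    # With no matching intent, the skill falls back to a help prompt --
    # the skill-level equivalent of "I don't understand the question."
    return build_speech_response("Sorry, I can't help with that yet.", False)
```

Connecting a skill in the Alexa app is what wires utterances to handlers like this one; without it, the platform has nowhere to route the request.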
The days of “Alexa, turn on my TV and play tonight’s news” are still a ways off. “When you use an Alexa or other far-field device, you could be many feet away from it. That drastically increases the intelligence that the device has to have in order to mitigate against any ambient noise,” says Palmatier. Anyone who has tried to get a far-field device to recognize commands over a lot of noise can appreciate that, in some ways, near-field sounds like an easier problem to solve.
Subject vs. Query
Aside from the near- versus far-field distinction, there are also two different types of voice recognition approaches. The first is subject-oriented voice recognition, like Xfinity’s solution, says Mark Vena, former worldwide VP of marketing at Sling Media and EchoStar. Subject-oriented queries, such as “Give me movies with George Clooney” or “Find local news on now,” seem to be an easier proposition.
Then there are query-based solutions that voice recognition devices use to handle more open-ended questions. If you ask Google Assistant, “OK Google, what’s on Sling TV tonight?,” Google brings up the Sling Media site. AI personas can get you closer to content, but in many cases launching content may still be a couple of clicks away.
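The distinction can be sketched in code. This is an illustrative toy, not any vendor's actual grammar: subject-oriented commands fill a known attribute slot (actor, live genre), while anything open-ended falls through to query handling.

```python
import re

# Toy illustration of subject-oriented vs. query-based voice requests.
# Subject-oriented utterances name a concrete attribute the system already
# models; open-ended questions are passed through as general queries.
# These patterns are invented for this sketch, not a real product grammar.

SUBJECT_PATTERNS = [
    (re.compile(r"movies with (?P<person>.+)", re.I), "actor"),
    (re.compile(r"find (?P<genre>.+) on now", re.I), "live_genre"),
]

def classify(utterance):
    """Return ('subject', attribute, value) or ('query', None, utterance)."""
    for pattern, attribute in SUBJECT_PATTERNS:
        match = pattern.search(utterance)
        if match:
            value = next(v for v in match.groupdict().values() if v)
            return ("subject", attribute, value)
    # No known attribute slot matched, so treat it as an open-ended query
    # (e.g. "what's on Sling TV tonight?").
    return ("query", None, utterance)
```

A subject-oriented request resolves to a structured lookup, which is why it is the easier proposition; the open-ended path is where the heavier AI interpretation lives.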
Sling Media does not have voice control as an option, but it’s a topic Vena personally has a strong interest in, and so, it seems, do many other people. At CES in January 2017, it seemed as if everyone was pitching their devices as being Alexa-enabled, and Amazon itself featured a giant walk-in Echo mockup outside of its meeting space. If nothing else, Amazon has done an exceptional job of getting Alexa out there. Amazon says there are tens of thousands of developers building skills for Alexa.
What Percentage Are You?
Query-based systems should generate a wider range of data from user-specific searches. Will this matter? It could: a system that knows you watched A, B, and C, and searched for X, Y, and Z, can start recommending content along the same lines. Amazon has built a whole business on its recommendation engine, which makes that an attractive model for distributing more content.
On a very basic level, before even trying to determine what content a viewer may have an interest in, the initial question needs an accurate answer. Is it acceptable to provide information that’s right only 50 percent of the time? In theory, Alexa and her friends come from the smart school and have had the benefit of all the resources these big corporations could throw at them. In reality, these voice control systems are often as smart as young children learning how to talk. Alexa volunteered that she’s 2 in human years; Apple TV’s Siri says, “I feel like I was incepted yesterday.” When we talk to them and they get the answer right, we’re impressed. When they don’t, we have less-than-kind thoughts about them.
Our unscientific test of “Show kids’ movies” returned unexpected results with Xbox One, and Roku’s voice interface returned adult and violent movies within the results. The Cortana interface with Xbox One didn’t tap into its movies app; instead, it showed movies and YouTube videos from an online web search. All of these interfaces seem to perform better with more-specific questions. Development of system intelligence will take a while, regardless of whether the system uses query- or subject-based search.
The UI Design Conundrum
“One of the biggest challenges in many interfaces with deep libraries of content is there’s just too much there to find what you really like,” says Tjeerd Hoek, VP of creative at Frog Design. The company is known for helping many Fortune 500 companies with their product and user design, including developing interfaces for various video providers. The challenge for Hoek is how to display a library of content in a way the user can easily navigate in as few steps as possible. “There’s only so many things that you’ve seen or heard about, and therefore you will not even search for [different content]. Finding media is a good example of something that is much better done by voice than by giving people a search box and a number of filters on the left side to get to the one song, or the movie [they] had in mind.”
Modern-Day Wizard of Oz
When Hoek worked at Microsoft many years ago, the giant was testing how users responded to voice navigation. “We would bring people in and tell them to talk to the computer,” says Hoek. “We had a person controlling the computer (in another room) like in The Wizard of Oz.”
Today’s systems incorporate the wizard into the operating system and use a natural-language processor (NLP) so users can ask questions in everyday English. “[Comcast] has invested heavily into building our own NLP because we believe that’s a strategic component to the system, because the magic comes from being able to tune that. You’re not just generating generic answers from queries,” says Palmatier. For the user, everyday English is easy. For developers at media companies, it’s a tough nut to crack.
“When developing for a voice-driven environment, ensure [you have] traditional in-app discoverability and deep linking strategy,” says Michael Dale, VP engineering at Ellation. Next up is targeting respective voice-controlled software development kits (SDKs). Dale says these map to standard behaviors on their respective platforms. “Adding a video to your watchlist for your iOS app works easily with a Siri kit integration. Casting a continuation of your favorite show on your TV via Chromecast may be easiest to facilitate via Google Actions.”
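The deep linking strategy Dale describes can be sketched as a simple mapping from the intents a platform SDK recognizes to in-app destinations. Everything here is hypothetical for illustration: the `myvideoapp://` URI scheme and the intent names are invented, not part of SiriKit or Google Actions.

```python
# Hedged sketch of an intent-to-deep-link map. Each voice intent the
# platform SDK surfaces resolves to an in-app deep link, so the assistant
# lands the user directly on the right screen. The URI scheme and intent
# names are hypothetical examples.

DEEP_LINKS = {
    "AddToWatchlist": "myvideoapp://watchlist/add?title={title}",
    "ResumePlayback": "myvideoapp://play?title={title}&resume=true",
    "OpenSearch":     "myvideoapp://search?q={title}",
}

def resolve_deep_link(intent_name, title):
    """Translate a recognized voice intent into an in-app deep link,
    falling back to search when the intent is unknown."""
    template = DEEP_LINKS.get(intent_name, DEEP_LINKS["OpenSearch"])
    return template.format(title=title.replace(" ", "%20"))
```

The design point is that the app, not the assistant, owns the destination: the SDK integration only has to name an intent, and the app routes it, which is why Dale puts discoverability and deep linking ahead of the voice work itself.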
Soon, consumers will be able to pull any content they want on any device. To support that, televisions will need to be far easier to control.