Moving Beyond Text Search
Back in 2000, I assisted a streaming media product manufacturer in a competitive landscape analysis, surveying the marketplace for known and potential competitors and assessing the best marketing strategy and sales approach. The competitive analysis wasn’t extraordinary, in and of itself, but one finding stuck with me, and left me with a question for which I’ve been seeking an answer ever since.
The finding was that every video search engine in 2000 was based on textual modalities. More disconcerting, the client had difficulty grasping that anyone would be interested in searching in any other way. The implications are far-reaching: unless a video clip can be reduced to a piece of text that a user can then type, exactly, into a search engine, the indexed video content cannot be retrieved.
In practical terms, this means that audio content cannot be searched by its notes unless the indexing system, and subsequently the end user, knows that a particular musical sequence is "A-D-G-F#" in 4/4 time. In a non-textual modality, the end user could give the search engine a snippet of the song and ask the system to find songs that match the notes and tempo of the clip; they might even, if their "singing in the shower" skills were good enough, be able to hum the tune or sing a few lines and have the system search on that audio.
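To make the idea concrete, here is a minimal, hypothetical sketch of one common melody-matching technique: indexing songs as sequences of pitch intervals (the semitone steps between successive notes), so a hummed snippet matches no matter what key the user sings in. The song index and titles below are invented, and a real system would also have to handle tempo, rhythm, and the inevitable off-key notes:

```python
# Map note names to semitone numbers (single octave, for illustration).
SEMITONES = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
             "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}

def intervals(notes):
    """Convert a note sequence to key-independent pitch intervals.

    Taking each step mod 12 is a simplification: it folds octaves
    together, which is good enough for this sketch.
    """
    steps = [SEMITONES[n] for n in notes]
    return [(b - a) % 12 for a, b in zip(steps, steps[1:])]

def find_matches(snippet, index):
    """Return titles whose interval sequence contains the snippet's."""
    pattern = intervals(snippet)
    hits = []
    for title, notes in index.items():
        seq = intervals(notes)
        if any(seq[i:i + len(pattern)] == pattern
               for i in range(len(seq) - len(pattern) + 1)):
            hits.append(title)
    return hits
```

Queried with the article's "A-D-G-F#" sequence against a toy index, this would retrieve a song containing that phrase even if the song itself is stored transposed into another key, since only the intervals between notes are compared.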
The limits of textual modality also mean that significant metadata inherent in a visual recording, such as gestures or a speaker's focus of attention, has no practical way to be searched, even when it can be indexed. As a result, research into these other significant portions of visual metadata has been limited. Until recently.
In the last few days, I’ve visited a research institute in Switzerland and the beginnings of a training facility in Torino, Italy, both of which appear poised to provide new opportunities to extract and use non-textual video metadata.
The first is located in the picturesque town of Martigny in the Swiss Alps. While less well known internationally than the neighboring town of Montreux, which hosts an annual jazz festival, Martigny houses the Swiss research institute IDIAP (the acronym comes from the institute's French name, which translates as the Dalle Molle Institute for Perceptual Artificial Intelligence). Under the leadership of several early pioneers in the speech-to-text recognition field, IDIAP is breaking new ground in several areas of visual search. One such effort, the European Union-funded Augmented Multi-Party Interaction (AMI) project, couples IDIAP with 20 other European research institutes under the leadership of the University of Edinburgh. AMI has numerous foci, including a browser interface that allows rapid identification of multiple speakers and advanced optical character recognition (OCR) capabilities, but two of the most interesting areas for video search are gesture recognition and focus of attention.