The Past, Present, and Future of Metadata
"The problem with academic lecture videos," said Waitelonis, "is that they are very long (more than 1 hour) and geared very much toward a TV viewing experience. In other words, there are no tables of content, keyword index, relative importance of particular parts of video."
Automatic Metadata Generation
To handle large amounts of data, Waitelonis and Berry both rely, to a certain extent, on "computer vision": object detection, intelligent character recognition (which considers layout when weighting text relevance), and speech-to-text drive the metadata gathering for clips that have not been marked.
When asked about the quality of automated metadata, Berry answered that it depended on what content was being captured.
"We've worked with a variety of content," said Berry," and found we can get about 80-90% precision rates for news speech-to-text conversion, but step down one level to even premium content, such as TV shows, and the precision rates drop down in to 20% range. As a result, we deal with closed-captioning for more accuracy."
Waitelonis agreed that news precision is very good, but said that speech-to-text often cannot be used on TV shows with audience or other background noise. Jackson added that we should expect to see incremental improvements.
"The good news is that the speed of computer vision is increasing," added Berry." We're now at about real time x 4 versus earlier systems that were real time x 20. So speed increases are outpacing accuracy, but will provide the opportunity for additional passes within the same timeframe, further increasing accuracy."
Beyond computer vision, Waitelonis also relies on tools to leverage human knowledge of particular clips and subject matter, including collaborative annotation, time-dependent tags, comments, and discussion notes integrated into the temporal metadata database at particular points.
"We are also working on semantic multimedia indexing," said Waitelonis, "but for now the search is via text terms with each resultant video showing two timelines: highlights of where the term occurs, as well as a timeline of user-integrated comments, tags or sheer popularity. Additional tags can be added at any temporal position." All the metadata is stored in MPEG-7 for interoperability, he says.
The Grammar Of Metadata
On a similar front to Waitelonis' educational metadata search engine, Andrea Rota discussed the uTagIt project, which he's working on along with the London School of Economics and the Tate Galleries in London.
"Tate has a large number of lectures but no way to search them," said Rota by way of introduction. "We found we could add value in tagging by moving beyond traditional tagging that are single word subjects, into the introduction of predicate tagging, such as 'provides' or ('reminds me') 'of' [particular URL]."
While Rota did not discuss this directly, the upside of predicates is that the vocabulary expands dramatically; the downside is the danger of having several similar words used for the same predicate. For instance, in English, we often have one or two subject words to describe something, but many interchangeable words for the verb or predicate: the use of 'like' or 'similar' is just one example.
Rota said his tagging system is unaware of the video itself, which is stored in a separate database.
"Time spots are stored as milliseconds," said Rota. "As a web service, uTagIt doesn't know anything about the video itself, only the URL at a particular millisecond. Tying together the time, the subject, and the predicate generates a Media Instance."
When asked about multiple Media Instances related to the same media object, Rota said it is very possible this will occur, but suggested using the [null] from traditional tagging, combined with a way to express relevance, such as a scale from 0 (no relevance) to 1 (complete relevance). Rota also said his current focus is on the server side, using a Jabber agent, a web app, a database, user management (OpenID) and network management (RPX).
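The relevance scale Rota suggested can be sketched as a score attached to each tag, which makes it possible to filter and rank several tags on the same media object. The `ScoredTag` type, the `most_relevant` helper, and the 0.5 threshold are hypothetical choices for illustration, not part of the project as presented.

```python
from dataclasses import dataclass


@dataclass
class ScoredTag:
    time_ms: int      # time spot in milliseconds
    label: str        # the tag text
    relevance: float  # 0.0 = no relevance, 1.0 = complete relevance

    def __post_init__(self):
        # Reject scores that fall outside the 0-to-1 scale.
        if not 0.0 <= self.relevance <= 1.0:
            raise ValueError("relevance must be between 0 and 1")


def most_relevant(tags, minimum=0.5):
    """Keep tags at or above the threshold, highest relevance first."""
    kept = [t for t in tags if t.relevance >= minimum]
    return sorted(kept, key=lambda t: t.relevance, reverse=True)


# Hypothetical tags attached to the same media object.
tags = [
    ScoredTag(12_000, "sculpture", 0.9),
    ScoredTag(12_000, "lighting", 0.2),
    ScoredTag(45_500, "modernism", 0.6),
]
print([t.label for t in most_relevant(tags)])  # ['sculpture', 'modernism']
```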
"The system can be used as a way to build Edit Decision Lists (EDL) or even as a non-time-based tagging system," said Rota, adding he was interested in having participants on the open source project who want to build front-end interfaces.
The last presenter, Stern from Metavid.org, has created an open access database in conjunction with UC Santa Cruz using Congressional video from C-SPAN.
"Our initial issue was that there was an access gap to content," said Stern. "C-SPAN was temporarily available for free RealMedia streams, but then implemented a $30-60/hour paywall. Our initial solution was to record it all from C-SPAN, but then found there was a lot of repetitive or consistent video, such as the opening prayer or leaving the camera on while at lunch. We ended up with roughly 35,000 hours, which runs on MetavidWiki, an extenion of Media Wiki, and no good way to find the relevant content."
Stern's tools include speech-to-text and closed captioning, as well as information from other sites, such as political contributions to each member of Congress via GovTrack. In addition, Stern says his project scrapes speaker names using Tesseract OCR, and that searches are bounded by the list of known members of Congress, allowing about a 95% precision rate.
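One way to picture how a closed list of names can lift precision is to snap each noisy OCR string to its nearest known name and discard anything that matches nothing. The sketch below uses Python's standard `difflib` for the fuzzy match. It is an assumption about the general technique, not Metavid's actual code, and the names, the `match_speaker` helper, and the 0.8 cutoff are hypothetical.

```python
import difflib

# Hypothetical stand-in for the roster of known members of Congress.
KNOWN_MEMBERS = ["Jane Doe", "John Smith", "Maria Garcia"]


def match_speaker(ocr_text, known_names, cutoff=0.8):
    """Return the known name closest to the OCR text, or None if nothing is close."""
    by_folded = {name.casefold(): name for name in known_names}
    hits = difflib.get_close_matches(
        ocr_text.strip().casefold(), list(by_folded), n=1, cutoff=cutoff
    )
    return by_folded[hits[0]] if hits else None


print(match_speaker("Jane D0e", KNOWN_MEMBERS))        # Jane Doe
print(match_speaker("Opening Prayer", KNOWN_MEMBERS))  # None
```

Raising the cutoff trades recall for precision: fewer OCR strings are matched, but those that are matched are more likely to be correct.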
"Each block of transcript exists as its own temporal wikipage," said Stern. "Additional annotation layers exist for semantic tagging and categorization, registered users can add layers."
The user interface is optimized to let users drill down through customized time slices.