
AI's Streaming Stack: Meet the Media Workflows


How has AI entered the media workflow? For this new column, we’ll look at different applications used in the media industry. For this issue, we’ll start with asset management, asset storefronts, and localization. While some of this functionality—speech-to-text transcription, translation, voice synthesis, natural language processing, logo detection, facial recognition, and object detection—has been around for a while, the biggest improvement is that much of it is now available in live content workflows.

Veritone Digital Media Hub

In a world awash in content, Veritone's Digital Media Hub is giving content owners and rightsholders the ability to create smart digital storefronts to monetize video content, which is unstructured data. Digital Media Hub works with customers’ existing media asset management (MAM) and digital asset management (DAM) systems by connecting to an S3 bucket and working via proxy, preserving the original files. Each AI engine can be swapped or updated without breaking workflows, insulating customers from vendor lock-in as new AI models emerge.
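To make the swappable-engine idea concrete, here is a minimal sketch of a pluggable engine registry in Python. It is purely illustrative (the class and function names are my own assumptions, not Veritone's actual API), but it shows how a workflow can keep calling an engine by name while the implementation behind that name is swapped or updated.

```python
# Hypothetical sketch of a pluggable AI engine registry; these names
# are illustrative assumptions, not Veritone's actual API.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Annotation:
    start_s: float  # timecode in seconds
    end_s: float
    kind: str       # e.g., "transcript", "logo", "face"
    value: str

# An engine is any callable that turns a proxy file into annotations.
EngineFn = Callable[[str], List[Annotation]]

class EngineRegistry:
    def __init__(self) -> None:
        self._engines: Dict[str, EngineFn] = {}

    def register(self, name: str, fn: EngineFn) -> None:
        # Swapping an engine is re-registering under the same name;
        # downstream workflows keep calling that name unchanged.
        self._engines[name] = fn

    def run(self, name: str, proxy_path: str) -> List[Annotation]:
        return self._engines[name](proxy_path)

registry = EngineRegistry()
registry.register("transcription",
                  lambda p: [Annotation(0.0, 2.5, "transcript", "hello")])
print(registry.run("transcription", "proxies/clip001.mp4"))
```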

This multi-engine approach can, for example, have audio transcription paired with facial detection to provide synchronized multimodal metadata. Metadata is stored in a normalized schema, supporting timecoded annotations down to the frame level. This makes it possible to retrieve hyper-specific queries such as “all instances where Player X scores inside the three-point line when Sponsor Y’s logo is visible.”
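Here is a rough sketch of how frame-accurate, timecoded annotations in a normalized schema support that kind of compound query. The schema and labels below are my own illustrative assumptions, not Veritone's data model; the point is that the query reduces to intersecting annotation intervals.

```python
# Illustrative normalized annotation store with an interval-overlap query.
from dataclasses import dataclass
from typing import List

@dataclass
class Span:
    asset_id: str
    start_f: int  # frame-accurate in point
    end_f: int    # frame-accurate out point
    engine: str   # which AI engine produced the annotation
    label: str

def overlaps(a: Span, b: Span) -> bool:
    return a.asset_id == b.asset_id and a.start_f < b.end_f and b.start_f < a.end_f

spans = [
    Span("game01", 1200, 1320, "action", "player_x_scores"),
    Span("game01", 1250, 1500, "logo", "sponsor_y"),
]

# "All instances where Player X scores while Sponsor Y's logo is visible."
hits = [
    (a, b)
    for a in spans if a.label == "player_x_scores"
    for b in spans if b.label == "sponsor_y" and overlaps(a, b)
]
print(hits)
```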

Video search results in Veritone Digital Media Hub

Yes, We Searched for ‘banana’

Users can search content by keyword. I looked for “banana” and found it in all sorts of ways: in historical documentary clips, in stock supermarket images, on a banana boat, and in a clip of someone yelling “Go Banana!” at a tennis match. One Spanish clip yielded the garbled English translation “to putting nuclear potato treadmills the bananas in front of the door price.” If this is similar to your use case, you’ll want to check on model accuracy.

Search-result thumbnails display metadata such as licensor, category, rights, resolution, and date, while each asset carries fuller metadata, including title, duration, place, date, audio language tracks, source, restrictions, content warnings, content owner, description, broadcast resolution, frame size, and frame rate.

Assets are stored in cloud object storage, with role-based access control per asset, user, or group. Digital Media Hub supports a REST API to feed enriched assets directly into editing systems such as Avid or Adobe Premiere Pro or into content management systems for web, OTT publishing, and social distribution. Veritone has customers in sports, film, TV, news, and media, including the NCAA, NBCUniversal, CNN, Sony, Guinness World Records, Court TV, Newsmax, and the Game Show Network. For an example, head to CNN’s open marketplace at cnncollection.com.
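As a hypothetical sketch of that REST integration pattern (the base URL, endpoint, and field names below are placeholders, not Digital Media Hub's documented API), pulling enriched assets into a downstream system might look like this:

```python
# Placeholder REST call; endpoint and fields are assumptions, not
# Digital Media Hub's documented API.
import requests

BASE = "https://api.example.com/v1"  # placeholder base URL
TOKEN = "YOUR_API_TOKEN"

resp = requests.get(
    f"{BASE}/assets",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"q": "banana", "fields": "title,duration,rights,language"},
    timeout=30,
)
resp.raise_for_status()
for asset in resp.json().get("assets", []):
    # Hand each enriched record to an NLE or CMS ingest step downstream.
    print(asset["title"], asset.get("rights"))
```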

Ecommerce for Content

Knowing what is inside content is useful, but being able to monetize that content is another matter entirely. Digital Media Hub allows customers to create their own cloud-native DAM ecommerce service with AI-driven enrichment and licensing workflows. The system surfaces content based on search results from AI engines, including speech-to-text transcription, natural language processing, logo detection, facial recognition, and object detection.

These capabilities are finding their way into many technology products, but the capacity to create an ecommerce site to monetize content means that customers can white-label it for either internal or external use. While most of Veritone’s customers who use these features are enterprise-level, it works with smaller clients as well.

Digital Media Hub provides standard cloud-based architecture functionality, including multi-tenant, scalable support. The platform’s ingestion pipeline supports both batch and live ingest. High-volume file transfers can be automated via API or accelerated file transfer protocols, while live streams can be indexed in near real time.

At the U.S. Open, for example, Digital Media Hub processed multiple live video feeds, applying AI enrichment on ingest and enabling highlight clipping within minutes. Media comes in via bulk upload, automated API calls, or live capture. Veritone supports broadcast video, transcoded formats, audio, images, and documents.
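Conceptually, both ingest modes can feed the same enrichment path; the difference is that live ingest must keep pace with the stream, chunk by chunk. Here is a minimal sketch under that assumption (the function names are mine, not the platform's):

```python
# Sketch of one enrichment path serving both batch and live ingest;
# function names are assumptions, not platform APIs.
import time
from typing import Iterable

def enrich(chunk_id: str) -> None:
    # Stand-in for running AI engines (transcription, logo detection, etc.).
    print(f"enriched {chunk_id}")

def batch_ingest(files: Iterable[str]) -> None:
    # High-volume transfers arrive via API or accelerated file transfer.
    for f in files:
        enrich(f)

def live_ingest(stream_chunks: Iterable[str], max_lag_s: float = 5.0) -> None:
    # Live streams are indexed in near real time, chunk by chunk.
    for chunk in stream_chunks:
        start = time.monotonic()
        enrich(chunk)
        assert time.monotonic() - start < max_lag_s, "falling behind live"

batch_ingest(["archive_001.mxf", "archive_002.mxf"])
live_ingest([f"feed1_segment_{i}.ts" for i in range(3)])
```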

Rightsholders can configure and process transactions for their custom storefronts for licensing, as well as royalty-free distribution. For clipping, Veritone has partnered with Grabyo and can work with HLS or RTMP streams so that customers can create clips. Digital Media Hub can be used via UI or API; I saw a demo using the product interface. Digital Media Hub has been on the market for 10 years; no public pricing is provided.
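The Grabyo integration itself isn't documented here, but as a generic illustration of the underlying idea, cutting a clip out of an HLS stream with ffmpeg looks like this (the stream URL and timecodes are placeholders):

```python
# Generic HLS clipping with ffmpeg, shown only to illustrate the concept;
# this is not the Grabyo integration.
import subprocess

def cut_clip(hls_url: str, start: str, duration: str, out_path: str) -> None:
    # -ss before -i seeks on the input; -c copy avoids a re-encode for speed.
    subprocess.run(
        ["ffmpeg", "-ss", start, "-i", hls_url,
         "-t", duration, "-c", "copy", out_path],
        check=True,
    )

cut_clip("https://example.com/live/master.m3u8",
         "00:12:30", "00:00:20", "highlight.mp4")
```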

Grass Valley Framelight X

AI powers functionality in Framelight X, Grass Valley’s cloud-based, enterprise-level media management and production asset management (PAM) system. This is a cloud-agnostic, federated asset management system that supports content stored in any on-prem, file-based network-attached storage (NAS) system; storage-area network (SAN) block-level storage for performance-critical applications; Amazon S3 Glacier; Google Cloud; Wasabi; or Pixit. Users can access this storage through a browser-based endpoint, including on a tablet.
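A federated system like this typically puts a single catalog over many storage backends, so clients ask for an asset by ID rather than by location. Here is a minimal sketch under that assumption; the interface is my own, not Grass Valley's.

```python
# Minimal sketch of federated storage: one catalog over many backends.
# The backend names echo the article; the interface is an assumption.
from typing import Dict, Protocol

class Storage(Protocol):
    def url_for(self, key: str) -> str: ...

class S3Glacier:
    def url_for(self, key: str) -> str:
        return f"s3://archive/{key}"

class OnPremNAS:
    def url_for(self, key: str) -> str:
        return f"file:///mnt/nas/{key}"

BACKENDS: Dict[str, Storage] = {"glacier": S3Glacier(), "nas": OnPremNAS()}

def resolve(asset: Dict[str, str]) -> str:
    # The catalog records where each asset lives; clients just ask by ID.
    return BACKENDS[asset["backend"]].url_for(asset["key"])

print(resolve({"backend": "glacier", "key": "match_2024_final.mxf"}))
```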

Asset management in Grass Valley Framelight X

“Everything largely is moving to object storage,” says Adam Marshall, chief product officer at Grass Valley. “It’s more about a cataloging schema on top.” Customers can have any combination of on-prem and cloud usage.

“When you shift your DVA (digital video archive) over or you move everything to Glacier, we can update and refresh your entire metadata scheme with all your back catalog at no more cost than just the processing cost. That is hugely powerful. To do that with your back catalog is almost impossible, where AI will be able to go through that process for you,” says Marshall.

Marshall adds that with Framelight X, Grass Valley intends to bring “that Tier 1, time-accurate linear content production down to a true software-defined [commercial off-the-shelf] environment, which can run on any compute, any client, any host, anywhere, anytime.”

More than 100 alliance partner apps, including Digital Nirvana, Gemma (LLM), TSL, and Singular Live, have been submitted to the Grass Valley app store and can be deployed to the same compute.

One of the reasons the tech stack can consolidate is that, historically, you would have had different mezzanine formats. With this standardized approach, Grass Valley is vying to become the main source of truth, although it still has many customers running its solution alongside other MAMs and PAMs.

Previously, broadcast was very high resolution with very restricted access to content because it was difficult to move media, Marshall says. “Historically, though, if you look at the end-to-end production facility, there would have been a MAM, a PAM, a transmission library, a deep archive system. There may be five, six, seven, eight different systems across the organization, all individually ingesting content, all individually managing content, and then finding some way to interface between each other.”

Similar to the strategy MovieLabs is working on in its 2030 Vision, Marshall notes that Framelight X users should “have one holistic environment that everyone can access based on rights permissions and not move media around, just move access around. That’s the most efficient and the fastest way of getting content published.”

Live

Marshall goes on to explain that Framelight X uses “metadata tagging to drive the [intelligence] layer and then have virtual folders and bins so that at the end of a soccer game, all of the tries, shots, strikes, or all of the content of this person is auto-managed into bins for the editor to bring into their edit and then publish as quickly as possible.”
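In other words, tagging drives routing: each AI-generated tag maps an event into one or more virtual bins. A toy sketch of that pattern, with invented tag names and events:

```python
# Hedged sketch of routing AI-tagged events into virtual bins for editors;
# the tags and events are invented for illustration.
from collections import defaultdict
from typing import Dict, List

events = [
    {"t": "00:14:02", "tags": ["goal", "player:smith"]},
    {"t": "00:31:45", "tags": ["shot", "player:jones"]},
    {"t": "00:58:10", "tags": ["goal", "player:jones"]},
]

bins: Dict[str, List[dict]] = defaultdict(list)
for ev in events:
    for tag in ev["tags"]:
        bins[tag].append(ev)  # one event can land in several bins

# An editor opens the "goal" bin and every goal is already collected.
print([e["t"] for e in bins["goal"]])
```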

This is available on live assets, which means that multiple operators around the world can all be working on something at the same time. Content is tagged and logged using AI to determine exactly what is within a shot or scene.

“That’s not something you get when you go to an Adobe Creative Cloud or those kinds of systems, which are more production-focused as opposed to live event-focused,” Marshall says. During a game, when streamers are trying to get their content out live as it’s happening, he continues, “You would have a room full of loggers and content managers going through and driving that. Now, you can reduce that to automate as much of it as possible. ... People are more focused on the production quality and the publishing workflows as opposed to doing the administrative tasks over the top.”

Marshall goes on to say, “We don’t want to get into automatic edit timelines and those kinds of things at this point because, again, it’s the creative source of that organization. What we’re removing is the need for hours, days, and weeks of going through logging and content tagging.”

AI

Grass Valley customers have different approaches to using AI solutions, Marshall explains. “If they want to use one off the shelf, they can take that and run it locally. If they want to build and own one, like many of our enterprise customers are doing, they can put their own libraries in there.”

Marshall says he is aware of “at least three large enterprise organizations that are hiring best-in-class AI data technicians—or whatever you want to call them—to develop their own secret sauce there because they see it as a differentiator for them.”

While understanding content has become table stakes, the aspect that really resonates is the agentic help, where a customer can ask, “How can I deploy a workflow for content management with automated ingest and other specific requirements?” Trained on product documentation, back-end APIs, and services in the platform, the agent can answer such questions directly.

“How is an end user ever going to know how to do that without specialized training?” Marshall asks. “Largely, it will tell them exactly how to solve those challenges rather than having to call an engineer. They can say, ‘Give me the configuration schema for this type of workflow.’ Rather than having to think about how to build it out, they can just say to the system, ‘I want to do this, go build it for me.’ Then they can take that and just press play and the whole system will deploy for them.”
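A rough sketch of that agent-to-deployment loop might look like the following; every function and field name here is hypothetical, and the sketch deliberately includes a review step before anything deploys:

```python
# Illustrative agent-to-deployment loop; all names are hypothetical.
import json

def ask_agent(prompt: str) -> str:
    # Stand-in for an LLM grounded in product docs and back-end APIs.
    return json.dumps({"workflow": "auto_ingest", "watch_folder": "/ingest"})

def deploy(config: dict) -> None:
    print(f"deploying workflow: {config['workflow']}")

raw = ask_agent("Deploy a content-management workflow with automated ingest.")
config = json.loads(raw)

# Surface the proposed configuration for human review before pressing play.
print("proposed config:", config)
if input("deploy? [y/N] ").strip().lower() == "y":
    deploy(config)
```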

However, at that point, humans still can, and likely should, double-check the configuration.

No public pricing is available.

Deepdub

Deepdub has been building AI-powered dubbing software since 2019. The linear nature of localization—transcription, translation, voice recording, and final mixing—creates bottlenecks that previously made dubbing economically unfeasible for extensive library content. The core technical challenge extends beyond translation, requiring a range of culture-, country-, and language-specific modifications. Deepdub’s models are designed for entertainment content and can generate emotion-based voices.

The Deepdub AI dubbing platform

The system maintains low-latency performance—under 200 milliseconds to first audio for API requests—enabling both batch processing and real-time applications. The platform supports accent control to modify regional voice characteristics.
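Time-to-first-audio is easiest to see with a streaming request: the clock stops when the first audio chunk arrives, not when the whole file is finished. The endpoint and payload below are placeholders of my own, not Deepdub's documented API.

```python
# Hypothetical streaming TTS client measuring time-to-first-audio;
# endpoint and payload are assumptions, not Deepdub's documented API.
import time
import requests

def synthesize(text: str, voice: str, accent: str) -> None:
    t0 = time.monotonic()
    with requests.post(
        "https://api.example.com/tts",  # placeholder URL
        json={"text": text, "voice": voice, "accent": accent},
        stream=True,
        timeout=30,
    ) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=4096):
            if chunk:
                # The first chunk marks the time-to-first-audio moment.
                print(f"first audio after {time.monotonic() - t0:.3f}s")
                break

synthesize("Hola, buenos días.", voice="maria", accent="castilian")
```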

The platform performs automatic audio separation, character identification, transcription, translation, and voice generation. It holds SOC 2 certification and meets TPN Gold Shield security standards to securely handle pre-release content.

Live

Real-time applications introduce additional layers of complexity, requiring simultaneous transcription, translation, and voice generation. Deepdub’s live processing achieves 10–15-second total latency, balancing speed with translation accuracy. The system incorporates contextual analysis to improve translation quality, as incomplete sentence fragments often require full context for accurate interpretation.
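That context requirement is the main reason live dubbing buffers a few seconds: translation quality improves when the system waits for complete sentences rather than fragments. Here is a toy sketch of that buffering stage, with invented pipeline stages and sample text:

```python
# Toy sketch of sentence buffering before translation; the stages and
# sample text are invented to illustrate why live dubbing adds latency.
from typing import Iterator, List

def sentences(fragments: Iterator[str]) -> Iterator[str]:
    buf: List[str] = []
    for frag in fragments:
        buf.append(frag)
        if frag.rstrip().endswith((".", "!", "?")):
            yield " ".join(buf)  # release only complete sentences
            buf.clear()

def translate(sentence: str) -> str:
    return f"[translated] {sentence}"  # stand-in for the MT stage

feed = iter(["The striker", "cuts inside", "and scores!", "What a", "finish."])
for s in sentences(feed):
    print(translate(s))  # voice generation would follow here
```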

Currently, Deepdub focuses on the top 10 markets, including English, Latin American Spanish, Brazilian Portuguese, German, French, and Italian, plus select markets such as Russian, Arabic, and Hindi. It provides managed services to deliver multiple audio formats, subtitle generation, and broadcast-ready packaging. The platform also offers a subscription tier for smaller-scale needs.

There is partial API access for text-to-speech, accent modification, and low-latency generation suitable for agent-based applications. However, the complete platform—including collaboration tools, quality-control systems, and format conversion—is available only through the UI.

Voice libraries consist entirely of licensed content with full documentation trails, addressing entertainment industry concerns about voice rights and usage permissions. The system accommodates both synthetic voice libraries and, where contractually permitted, original-cast voice cloning for new content productions. A free demo is available.
