
The State of AI In Live Streaming


As with all streaming workflows, AI has steadily crept into the live streaming technology stack. In some cases, the impact is incremental; in others, it is profound. From production to monetization, here's a quick overview of where AI has become relevant for live event producers and engineers, and some areas where, surprisingly, it hasn't.

The companies mentioned in this article are meant to be representative, not exhaustive. If we discuss a category and your product/service isn’t included, feel free to reach out to me at jan.ozer at streaminglearningcenter.com.

Production and capture

Automated camera operation is one of the clearest examples of AI performing a core live-production task. In lower-tier sports and structured events, computer-vision models track the point of interest and directly control pan, tilt, and zoom, eliminating the need for a human camera operator.

This capability appears in two related but distinct forms. Camera-centric systems like those from Veo Technologies embed AI-powered follow-cam tracking directly in the camera, automating framing while leaving production switching and editorial control outside the device.

While sports-centric systems aim to follow the ball and wide-field dynamics, Sony’s BRC-AM7 (see Figure 1) identifies and tracks subjects such as presenters, panelists, or performers within a defined area, maintaining a stable, broadcast-friendly composition. As one reviewer of the camera describes, “The movement is impeccable and also has adjustability for braking/easing when changing pan direction manually. The auto tracking/focus is superb with a little room for improvement as new firmware gets released.” This makes the BRC-AM7 well-suited for studios, houses of worship, and structured events where consistent subject framing and smooth, controllable motion matter more than fully autonomous sports coverage.

Figure 1. Sony’s auto-framing BRC-AM7

In contrast, all-in-one automated production systems, like those offered by Pixellot, integrate AI-driven camera operation with automated switching and graphics. As stated in the Pixellot FAQ, the system “uses a multiple-camera array and advanced artificial intelligence algorithms to follow the action on the field and produce a live or recorded video feed.”

These systems are designed to deliver consistent coverage rather than full editorial direction, and are most commonly deployed in youth, amateur, and semi-professional sports. Humans define production formats and override rules in advance, intervening when conditions fall outside the models’ design range.

Transcription and localization

Automatic speech recognition (ASR) models are now routinely used to generate live captions for broadcasts, sports, news, classrooms, and events, often operating continuously for hours with human supervision focused on correction, accuracy, and regulatory compliance rather than basic transcription. Representative vendors include 3PlayMedia and Verbit for live captioning and transcription, Trint for transcription and captioning around live content, Wordly for real-time translation in meetings and events, and AI-Media for live captioning and translation services used in regulated broadcast environments (see Figure 2).

Figure 2. AI-powered captioning is one of the most significant uses of AI in live broadcasting.

Operationally, these systems are well-integrated into broadcast and streaming workflows and supported by human-in-the-loop models that are often required to meet accuracy thresholds of 98 percent or higher for live news, politics, and other regulated content. Fully automated AI captioning alone may not meet these thresholds because of errors ranging from simple misinterpretations of proper names to hallucinations, “where a model generates plausible but entirely incorrect words.” This makes human oversight mandatory in high-stakes live environments to mitigate the legal and regulatory risks of airing misinformation.
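To make the 98 percent figure concrete, accuracy is often approximated as 1 minus the word error rate (WER), computed from a word-level edit distance. Regulated broadcast captioning is typically scored with weighted models such as NER rather than raw word accuracy, so the minimal sketch below (plain Python, illustrative only) simply shows why the threshold is demanding: a single wrong word in a 50-word segment already lands exactly at 98 percent.

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Word accuracy = 1 - WER, via word-level edit distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 1.0 - d[len(ref)][len(hyp)] / len(ref)

# One wrong word in a 50-word segment leaves exactly 98% accuracy,
# right at the threshold often cited for regulated live content.
ref = " ".join(f"w{i}" for i in range(50))
hyp = ref.replace("w7", "w7x")
print(word_accuracy(ref, hyp))  # -> 0.98
```

At that rate, an hour of live speech at roughly 150 words per minute still contains on the order of 180 errors, which is why human correctors remain in the loop.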

One technical issue for audio dubbing is source-target expansion, where a translated language requires significantly more words or time to express the same idea as the source language. For example, translating English to Spanish or German often expands text by 20% to 30%. To preserve audio sync, broadcasters may need to aggressively summarize the content or accelerate the dubbing. Pitch-preserving time-stretching and other speech processing can introduce artifacts that listeners often describe as “robotic,” and studies show that distorted or compressed prosody can reduce intelligibility, weaken affective cues, and impair recall.
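The sync math behind these decisions is straightforward. As a rough sketch (the 1.15x speed-up ceiling is an illustrative assumption, not a standard), a workflow might compute the time-scale factor for each dubbed segment and decide whether to pad, stretch, or summarize:

```python
def dubbing_fit(source_sec: float, target_sec: float,
                max_speedup: float = 1.15):
    """Check whether a translated audio segment fits its source slot.

    Returns the time-scale factor and a suggested action. The ceiling
    is illustrative: beyond roughly 1.1-1.2x, time-stretching artifacts
    tend to become audible.
    """
    factor = target_sec / source_sec  # > 1.0 means the dub runs long
    if factor <= 1.0:
        return factor, "fits as-is (pad with silence if needed)"
    if factor <= max_speedup:
        return factor, f"time-stretch by {factor:.2f}x to preserve sync"
    return factor, "too long: summarize or re-translate more tersely"

# A 10 s English segment whose Spanish dub runs 12.5 s (25% expansion):
print(dubbing_fit(10.0, 12.5))
# -> (1.25, 'too long: summarize or re-translate more tersely')
```

In other words, a 25% expansion cannot be absorbed by time-stretching alone, which is why production systems fall back to summarization or terser translations.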

While caption workflows are more latency-tolerant than the primary video path, they are still subject to strict regulatory synchronization requirements. In broadcast environments, captions must remain closely synchronized with spoken audio to meet accessibility rules enforced by regulators such as the FCC and Ofcom, making latency management a legal constraint rather than a best-effort optimization.

Emerging AI-driven dubbing and voice translation systems aim to extend live and near-live content into markets that are difficult or uneconomical to serve with human voiceover teams. These systems combine speech recognition, neural translation, and speech synthesis to generate alternate audio tracks in other languages, allowing streamers to expand reach without duplicating production workflows.

Across vendors, capabilities are converging. Modern systems can convert speech into translated audio while preserving pacing and expressive intent, including emphasis and emotional tone. Many support voice cloning or the use of pre-cleared, broadcast-licensed synthetic voices, enabling consistent presentation across languages. Broad language and dialect support make multilingual distribution practical at scale.

In practice, these systems are usable today primarily in near-live or tightly controlled live environments, such as events, conferences, and selected sports formats. They integrate into standard broadcast pipelines using protocols like SRT, HLS, and MPEG-DASH and often rely on cloud speech and translation services from platforms like AWS and Google Cloud.

Encoding and Processing

Most major live encoder or preprocessing vendors reference AI somewhere in their product specifications, mostly referring to AI-assisted rather than AI-only functionality. In these applications, vendors apply ML components selectively inside otherwise conventional encoders, typically to classify motion or scene dynamics, guide configuration choices, or support operational automation. The encoder itself remains largely the same, outputting known codecs playable by compatible decoders.

Because these AI components are embedded piecemeal, isolating their specific contribution is challenging. For this reason, when evaluating live encoding systems today, it's best to analyze the encoder as you would any other, without regard to its AI features.

The primary exception to this pattern is the emergence of fully AI-based codecs, such as those developed by Deep Render. These represent a fundamentally different approach and a clear break from the past. Deep Render has demonstrated low-latency, live-capable operation in focused tests and third-party studies, distinguishing it from AI-assisted extensions of conventional codecs.

At the same time, AI-based codecs remain nascent in deployment. Hardware requirements, determinism, operational tooling, and ecosystem integration are still evolving, and widespread adoption in large-scale live streaming workflows is unlikely in the near- to mid-term. Following Deep Render's acquisition by InterDigital in late 2025, this class of technology is best understood as an important signal of where video compression may be headed, rather than something most publishers need to plan for over the next few years.

AI for Visibility and QoE/QoS Measurement

In this category, AI helps operators identify abnormal behavior and understand its origins. The next section will cover systems that turn observations into action, like switching CDNs or servers. This group includes all the usual QoS/QoE suspects, which have added AI-based functionality to their existing product lines.

For example, Conviva's documentation states that its AI Alerts feature uses machine learning “to automatically detect anomalies in streaming performance.” AI Alerts identify anomalies and present impact information over relevant dimensions like country, ISP, device, and application.

NPAW's NaLa AI Assistant (see Figure 3) also uses AI to categorize and structure streaming issues. In addition, NaLa enables natural-language queries, making information more accessible and actionable.

Figure 3. NPAW uses AI to analyze data and answer natural language questions from users.

Similarly, Bitmovin deploys AI in its Real-Time Observability product to generate automated insights from playback session data, including interpretation of playback quality and performance signals. The company presents AI as a mechanism for summarizing and explaining session behavior, rather than as a delivery control system.

In this category, AI’s value is diagnostic: turning large volumes of live telemetry into understandable problems and likely causes.
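None of these vendors publishes its detection algorithms, but the general pattern they describe, flagging a live QoE metric that departs sharply from its recent baseline, can be sketched with a trailing-window z-score. The window size, threshold, and metric below are illustrative assumptions, not any vendor's implementation:

```python
from statistics import mean, stdev

def detect_anomalies(series, window=10, z_threshold=3.0):
    """Flag points that deviate > z_threshold sigmas from the
    trailing window of observations (a toy baseline model)."""
    alerts = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(series[i] - mu) / sigma > z_threshold:
            alerts.append(i)
    return alerts

# Rebuffering ratio (%) per minute; a spike appears at minute 12:
rebuffer = [0.5, 0.6, 0.4, 0.5, 0.6, 0.5, 0.4, 0.6, 0.5, 0.5,
            0.6, 0.5, 8.0]
print(detect_anomalies(rebuffer))  # -> [12]
```

Production systems layer far more on top, such as seasonality models, per-dimension baselines (country, ISP, device), and impact estimation, but the core job is the same: separate genuine incidents from normal noise in real time.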

AI for Delivery Decisions (Network and Multi-CDN Control)

One area in which I expected to see more AI contributions was delivery decisions. This category covers systems that influence where live traffic is sent, such as selecting a CDN, origin, or delivery path for a given session. In practice, most production systems in this layer are automated and policy-driven but are not explicitly described by their vendors as AI or machine-learning systems.

For example, Conviva Precision is positioned as a delivery control and optimization product that selects CDNs and origins based on measured experience metrics and operator-defined policies. The Precision documentation describes automated provider selection driven by QoE inputs and business rules but does not characterize this logic as machine learning or AI optimization.

NPAW describes a similar role for its CDN Balancer solution, which distributes traffic across multiple CDNs based on performance measurements and configurable rules. The Smart Multi-CDN documentation focuses on automated switching and policy enforcement using observed delivery metrics, without framing the decision logic as AI-based or machine-learning-driven.

So, while both Conviva and NPAW explicitly document the use of AI and machine learning in their analytics, observability, and diagnostic products, that language does not appear in their delivery control documentation. In these materials, delivery decisions are described in prescriptive terms like thresholds, policies, scoring, and rule-based selection.
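That prescriptive logic is easy to picture. The sketch below is a hypothetical rule-based selector in the spirit of these descriptions, not any vendor's implementation; the metric names, weights, and the 2% rebuffering threshold are all assumptions for illustration:

```python
def pick_cdn(candidates, max_rebuffer=2.0):
    """Rule-based selection: apply a hard QoE threshold, then rank
    the survivors with a weighted score. No machine learning here,
    which mirrors how vendors document this layer."""
    weights = {"rebuffer_pct": -5.0,     # penalize rebuffering heavily
               "startup_ms": -0.001,     # mild penalty for slow startup
               "throughput_mbps": 0.1}   # reward delivery capacity
    eligible = [c for c in candidates if c["rebuffer_pct"] <= max_rebuffer]
    if not eligible:
        eligible = candidates  # fall back rather than fail the session

    def score(cdn):
        return sum(w * cdn[m] for m, w in weights.items())

    return max(eligible, key=score)["name"]

cdns = [
    {"name": "cdn-a", "rebuffer_pct": 0.4, "startup_ms": 900,
     "throughput_mbps": 25},
    {"name": "cdn-b", "rebuffer_pct": 3.1, "startup_ms": 600,
     "throughput_mbps": 40},  # excluded by the rebuffering threshold
    {"name": "cdn-c", "rebuffer_pct": 0.6, "startup_ms": 1200,
     "throughput_mbps": 30},
]
print(pick_cdn(cdns))  # -> cdn-a
```

The point of the sketch is that every decision is traceable to an explicit rule or weight, which is exactly the determinism operators want in the delivery path, and a plausible reason ML has been slower to take control here.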

AI in the Client (Player Behavior and Experience)

While streaming players generate detailed client-side playback telemetry consumed by analytics and monitoring systems elsewhere in the stack, they exhibit little AI-based behavior themselves. Most player implementations focus on deterministic playback and reporting, leaving interpretation and decision-making to external systems.

Historically, this telemetry was delivered through vendor-specific SDKs and APIs. However, the recently introduced Common Media Client Data (CMCD) standard provides a standardized alternative that should expand these data flows and make them more accessible. To be clear, CMCD (see Figure 4) isn’t AI, but it is a pathway that will simplify the collection of player-related telemetry to feed machine learning performed by other systems. The specification, published by the Consumer Technology Association as CTA-5004, defines the format and semantics of various player-related data points, focusing on how this information is expressed rather than on how it is analyzed or acted upon.

Figure 4. CMCD isn’t AI, but it will enable many AI-related functions through player telemetry.
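To illustrate what CMCD actually standardizes, the sketch below serializes a handful of CTA-5004 keys (br is encoded bitrate in kbps, bl is buffer length in ms, bs is buffer starvation, sid is session ID, st is stream type, with "l" meaning live) into the spec's comma-separated form: keys sorted, strings quoted, token values unquoted, and boolean true sent as the bare key. Real players additionally split keys across the CMCD-Request/Object/Session/Status headers or URL-encode them into a CMCD query argument, which this simplified example omits:

```python
TOKEN_KEYS = {"ot", "sf", "st"}  # token-valued keys are sent unquoted

def encode_cmcd(data: dict) -> str:
    """Serialize CMCD key/value pairs per CTA-5004's basic rules:
    sorted keys, quoted strings, unquoted tokens and integers,
    and boolean true reduced to the bare key name."""
    parts = []
    for key in sorted(data):
        value = data[key]
        if value is True:
            parts.append(key)  # e.g. "bs" alone means bs=true
        elif isinstance(value, bool):
            parts.append(f"{key}=false")
        elif isinstance(value, str) and key not in TOKEN_KEYS:
            parts.append(f'{key}="{value}"')
        else:
            parts.append(f"{key}={value}")
    return ",".join(parts)

print(encode_cmcd({"br": 3200, "bl": 21300, "bs": True,
                   "sid": "6e2fb550", "st": "l"}))
# -> bl=21300,br=3200,bs,sid="6e2fb550",st=l
```

Because every compliant player emits the same keys with the same semantics, a CDN or analytics system can parse this payload without knowing which player SDK produced it, which is precisely what makes CMCD useful as a feed for downstream analysis.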

For example, AWS uses CMCD fields to deliver data to CloudFront, creating near-real-time logs that can be used for monitoring and analysis. In a separate post, AWS discusses how CMCD telemetry extracted from CloudFront logs can be combined with large language models using Model Context Protocol to analyze streaming performance through natural-language queries.

Bitmovin has similarly described CMCD as a way to correlate player-side playback state with CDN-side metrics, improving downstream analysis of streaming performance. While Bitmovin positions AI primarily in its analytics and encoding products rather than in the player, CMCD provides a standardized telemetry feed that those systems can consume alongside proprietary data sources. Besides Bitmovin, other player vendors have announced support for CMCD, including JW Player, THEOplayer via its CMCD Connector, and Google’s ExoPlayer/Media3.

By standardizing how playback state is transmitted from the player to other services in the stack, CMCD simplifies the flow of telemetry used by downstream analytics and data-driven systems. That telemetry can feed both traditional analysis and AI-based interpretation, while the player executes externally generated decisions.

Post-Event Processing and Highlights

AI-assisted content clipping and highlight generation is one of the clearest areas where machine learning is already operating directly on live and near-live video. For example, WSC Sports' platform uses AI to identify key moments in sports events and generate highlights for leagues, broadcasters, and digital platforms. Operationally, WSC's system analyzes combinations of video signals, audio cues, and structured sports data to detect plays like goals, fouls, or other significant events without manual logging. It then automatically creates clips and packaged highlights during or shortly after live events.

Similarly, Magnifi uses AI to analyze video content and associated metadata to generate clips, short-form highlights, and vertical video formats for distribution across digital and social platforms (see Figure 5). This includes clipping, formatting, and reframing workflows that would otherwise require manual editorial effort.

Figure 5. AI automates highlights and social media production for sports and other broadcasts.

Both systems, and others like them, operate alongside live production and distribution workflows rather than inside the primary encoder or CDN path. In all cases, AI is applied directly to content understanding and transformation, producing clips and highlights intended for secondary distribution, social engagement, or archive use rather than controlling live delivery or playback behavior.

Security and Integrity

AI appears in several live content protection workflows, where the core challenge is scale and speed rather than delivery optimization. Live sports and other events are frequently pirated during broadcasts via unauthorized streams redistributed on illicit websites, social platforms, and IPTV services. The problem is identifying these pirated live streams quickly enough to take them down during the event.

Friend MTS positions its anti-piracy platform around large-scale monitoring of pirated live streams during active broadcasts. The core identification relies on traditional techniques like video fingerprinting and forensic watermarking. The company applies AI around these mechanisms to operate at live scale: classifying matches, filtering false positives, prioritizing high-impact infringements, and automating evidence collection across large numbers of concurrent streams. The goal is to shorten the time between detection and enforcement during live events, so takedowns can happen while the broadcast still matters.

Figure 6. AI accelerates piracy detection in live events where fast action is essential.

Verimatrix offers a competitive live content protection service, built on similar foundations of watermarking, automated monitoring, and enforcement workflows. As with Friend MTS, the underlying detection techniques predate modern machine learning, and the company’s public documentation does not clearly separate AI-driven components from rule-based automation.

What is clear is that both use AI to improve speed and scale during live events, where delayed detection sharply reduces the value of enforcement. In this context, AI appears less as a distinct detection layer and more as an accelerator for classification, prioritization, and operational throughput in time-sensitive broadcasts.

Monetization and Ad Insertion

Let’s conclude with a look at AI in monetization, where it is already in production use in live streaming workflows. Here we see AI applied to ad decisioning, yield optimization, and contextual targeting, while the actual insertion and delivery mechanisms rely on traditional technologies.

For example, Google uses machine learning in Google Ad Manager for yield management functions like optimized pricing, where models adjust pricing and allocation based on demand signals and performance data. Google supports VOD and live monetization, where inventory scarcity and timing make automated, yet intelligent, optimization even more important.

Similarly, Magnite deploys machine learning–based optimization across its streaming platform, including traffic shaping for live streaming inventory, claiming that AI balances yield, performance, and buyer outcomes.

Networks and large publishers are deploying AI-driven contextual systems to improve the relevance of live ads. Disney has expanded its “Magic Words” capability to live programming, using scene-level and visual analysis to enable advertisers to align creative with specific moments, moods, or emotional context during live streams. NBCUniversal has introduced real-time contextual ad targeting for live content, combining content analysis with first-party datasets to create moment-based segments for advertisers.

Several vendors support this layer by supplying AI-generated contextual data rather than executing ad delivery themselves. Wurl’s BrandDiscovery product uses GenAI to analyze video and produce scene-level contextual signals, like emotion, genre, and brand suitability, which advertisers can use to align ads with content in near real time. IRIS.TV aggregates video-level contextual data produced by AI and computer vision partners into a centralized marketplace, allowing buyers and sellers to transact on standardized content segments across CTV and streaming environments.

Figure 7. AI-based contextual targeting increases advertisers’ ROI and publishers’ CPMs.

In this category, AI is concentrated in decisioning, optimization, and contextual interpretation layers, where it can operate continuously and at scale. The systems that execute ad insertion and delivery remain largely rule-driven, enforcing decisions made elsewhere in the stack.

Conclusion

AI is now integrated into much of the live streaming stack, but its role is uneven. In practice, AI is strongest where scale, pattern recognition, and speed matter more than determinism or creative judgment. That’s why it shows up so clearly in camera automation for non-premium production, transcription and localization, diagnostics, highlights creation, piracy detection, and monetization decisioning. These are areas where AI can augment or replace labor, compress timelines, or surface insights from volumes of data that humans cannot reasonably process in real time. 

Still, some of the most latency-sensitive and risk-intolerant parts of live streaming remain largely non-AI. Encoding, delivery control, player behavior, and real-time playback decisions are still dominated by deterministic systems, explicit policies, and well-understood tradeoffs. Where AI appears in these layers, it’s usually advisory or diagnostic rather than directly in control. This reflects the reality that live streaming tolerates very little ambiguity, and the cost of getting decisions wrong is often higher than the benefit of marginal optimization.

So rather than transforming live streaming end to end, AI is being selectively applied where it can deliver measurable value without destabilizing the workflow. Over time, some of today’s rule-based systems may absorb more machine learning, especially as telemetry improves and standards like CMCD mature. For now, the most effective uses of AI in live streaming are pragmatic rather than revolutionary, focused on making existing workflows more scalable, observable, and economically viable rather than replacing them outright.
