QC’ing Live Streams at Scale in the Age of AI: A Q&A with Interra Systems' Anupama Anantharaman

As with any other complex, high-stakes aspect of live streaming workflows, questions abound regarding the uses, capabilities, and limitations of AI/ML in live stream monitoring and QC. While in some respects there’s no replacement for humans in the loop and real eyes on glass when it comes to quality assurance, the monitoring demands of live streaming at scale at each stage of the workflow provide ample opportunities for AI/ML to shoulder the load and transform streaming QC operations.

In this Q&A, I had the opportunity to discuss the fast-evolving requirements and best practices for live streaming QC at scale with Anupama Anantharaman, VP, Product Management at Interra Systems, developer of ORION-OTT, a comprehensive monitoring solution for live and OTT. Anantharaman, a 2025 Streaming Media All-Star, offered insights into how AI is changing live stream monitoring, how performing “deep QC” impacts streaming workflows, the best KPIs for predictive monitoring, and the continuing role of humans in an increasingly AI-supported QC world.

Anupama Anantharaman, VP, Product Management, Interra Systems

When it comes to live stream monitoring and observability, where is AI delivering the most measurable improvement, and how are those improvements manifested in practice for real-world streams?

AI is delivering the most measurable improvements in two areas: quality analysis and operational intelligence.

On the quality side, AI/ML models can detect complex audio and video issues such as compression artifacts, freeze events, lip-sync errors, and visual distortions with greater accuracy than traditional rule-based monitoring. This leads to earlier detection, fewer false alarms, and a better viewer experience.

On the operational side, AI helps make sense of the massive volume of monitoring data generated across ingest, encoding, packaging, delivery, and playback. By correlating alerts and metrics from multiple points in the workflow, AI can identify likely root causes and provide actionable insights rather than simply reporting symptoms.

The result is a shift from basic monitoring to more intelligent operations. Teams can detect issues more easily, isolate faults faster, reduce mean time to resolution (MTTR), and manage increasingly complex streaming environments with fewer manual investigations.

Where are the most important points in the streaming pipeline for running deep QC, and are there cost, latency, or performance trade-offs associated with where QC efforts are shifted or focused? How, if at all, does AI change that equation?

Ingest is the earliest and cheapest place to catch problems, but the most important points for deep QC are ingest, post-transcode (when bitrate ladder renditions are created), and live delivery. Each stage reveals different types of issues. Ingest checks source quality, post-transcode exposes compression artifacts, and live monitoring catches problems that may only appear during actual playback across networks and CDNs.

The challenge is that running deep, frame-by-frame QC everywhere is expensive and can slow workflows, especially for large VOD libraries and live streams. As a result, teams have often had to choose between thorough analysis and fast turnaround.

AI changes this trade-off by helping teams prioritize which assets or streams require deeper inspection. Instead of applying full frame-by-frame analysis everywhere, AI can flag the content most likely to contain quality issues. Those can then be sent through more intensive QC or human review, while clean content moves through the workflow faster. The result is a smarter balance of quality, speed, and cost, with effort focused where it delivers the most value.
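To make the prioritization idea concrete, here is a minimal sketch of risk-based triage. It assumes a model has already assigned each stream or asset a risk score between 0 and 1. The `Asset` class, the `risk_score` field, and the 0.7 cutoff are all hypothetical names and values chosen for illustration; they do not describe ORION or any specific product.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    # Hypothetical model output in [0, 1]; higher means a quality issue is more likely.
    risk_score: float


def triage(assets, deep_qc_threshold=0.7):
    """Split assets into a deep-QC queue and a fast path based on risk score."""
    deep_qc = [a for a in assets if a.risk_score >= deep_qc_threshold]
    fast_path = [a for a in assets if a.risk_score < deep_qc_threshold]
    # Inspect the riskiest content first.
    deep_qc.sort(key=lambda a: a.risk_score, reverse=True)
    return deep_qc, fast_path


assets = [Asset("channel-1", 0.12), Asset("channel-2", 0.91), Asset("channel-3", 0.74)]
deep_qc, fast_path = triage(assets)
print([a.name for a in deep_qc])    # ['channel-2', 'channel-3']
print([a.name for a in fast_path])  # ['channel-1']
```

In practice the cutoff would be tuned against the cost of deep inspection and the cost of a missed issue, which is the trade-off described above.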

What is the typical time-to-detect for live stream quality issues with AI-assisted observability compared to traditional eyes-on-glass monitoring and threshold monitoring?

Anupama: Providing a single time-to-detect number is difficult because it depends heavily on the specific setup, what is being measured, and how the team defines detection versus confirmation.

Threshold monitoring and eyes-on-glass are both inherently reactive. A threshold fires only after a metric crosses a predefined limit, and a human operator only spots a problem if it's obvious enough and they are looking at the right stream at the right moment. AI-based anomaly detection can identify subtle patterns and gradual drift that may not yet violate any threshold or be visible. It can recognize that something is trending in the wrong direction and flag it before it becomes a viewer-facing issue — catching the warning signs before a failure occurs.
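The difference between a static threshold and trend-based detection can be illustrated with a small sketch. It uses a plain least-squares slope as a stand-in for a real anomaly-detection model, which would be considerably more sophisticated. The packet-loss samples, the 1.0% limit, and the slope limit are made-up values for illustration only.

```python
def slope(values):
    """Least-squares slope of the values against their sample index."""
    n = len(values)
    if n < 2:
        return 0.0  # a trend needs at least two samples
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def threshold_alarm(values, limit):
    """Reactive check: fires only once a sample exceeds the limit."""
    return any(v > limit for v in values)


def drift_alarm(values, max_slope):
    """Trend check: fires when the metric is rising faster than max_slope per sample."""
    return slope(values) > max_slope


# Hypothetical packet-loss percentages, one sample per minute.
packet_loss = [0.10, 0.12, 0.15, 0.19, 0.24, 0.30, 0.37, 0.45]

print(threshold_alarm(packet_loss, limit=1.0))   # False: no sample has crossed 1.0% yet
print(drift_alarm(packet_loss, max_slope=0.02))  # True: loss is climbing steadily
```

Here the threshold check stays silent because every sample is still under the limit, while the trend check flags the steady climb.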

The bigger impact, though, is what happens after an issue is detected. Traditionally, engineers had to pull logs from multiple systems (encoders, packagers, CDNs, networks, and other workflow components) and manually piece together what happened. AI can analyze all that data at once, correlate events across the workflow, and surface the most likely root cause far more quickly. So, while the improvement in detection speed can be meaningful, the most measurable benefit is often faster troubleshooting and recovery, which translates directly into lower MTTR and fewer service disruptions.
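As a rough illustration of cross-workflow correlation, the sketch below gathers events that occurred shortly before a viewer-facing symptom and picks the most upstream one as the likely starting point. This is a simple rule-based heuristic, not an AI model and not a description of how any vendor implements root-cause analysis. The `Event` class, the 30-second window, and the sample messages are assumptions; the stage order follows the workflow stages named earlier in this interview.

```python
from dataclasses import dataclass

# Workflow stages in upstream-to-downstream order.
PIPELINE = ["ingest", "encoding", "packaging", "delivery", "playback"]


@dataclass
class Event:
    timestamp: float  # seconds since the start of the stream
    stage: str        # one of PIPELINE
    message: str


def likely_root_cause(events, symptom, window_s=30.0):
    """Return the most upstream event seen within window_s seconds before the symptom."""
    candidates = [
        e for e in events
        if 0 <= symptom.timestamp - e.timestamp <= window_s
    ] or [symptom]  # fall back to the symptom itself if nothing else is in range
    # Prefer the earliest stage in the pipeline; break ties by the earliest timestamp.
    return min(candidates, key=lambda e: (PIPELINE.index(e.stage), e.timestamp))


events = [
    Event(40.0, "ingest", "brief source jitter"),  # too old to fall inside the window
    Event(100.0, "encoding", "encoder dropped frames"),
    Event(104.0, "packaging", "late segment"),
    Event(112.0, "playback", "rebuffering spike"),
]
symptom = events[-1]
cause = likely_root_cause(events, symptom)
print(cause.stage, "-", cause.message)  # encoding - encoder dropped frames
```

Even this crude version replaces several manual log lookups with a single ranked answer, which is where the MTTR savings described above come from.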

Are AI models able to predict livestream quality degradation before it becomes visible to the viewer? Which KPIs are proving most useful for predictive monitoring?

Anupama: Yes. AI-based anomaly detection can identify patterns that often precede visible service degradation, enabling teams to address issues before viewers are impacted. Rather than reacting to alarms after a failure occurs, predictive models can recognize subtle changes in system behavior and flag conditions that historically correlate with stream instability, quality degradation, or outages.

The most effective predictive monitoring solutions analyze data across multiple layers of the workflow. Useful indicators include IP-layer metrics such as packet loss, jitter, latency, and retransmissions; transport and streaming metrics such as bitrate fluctuations, buffer levels, manifest errors, and segment delivery times; and audio/video quality metrics such as freeze events, black frames, lip-sync drift, and perceptual quality scores.

Individual KPIs can be useful, but the greater value lies in AI's ability to correlate hundreds of metrics across the workflow, identify emerging issues, and anticipate their likely impact on viewers.

How does AI enhance streamers' ability to maintain quality oversight of multiple concurrent streams simultaneously? How dramatically has it increased the scalability of livestream monitoring?

AI helps because it can do much of the constant watching and first-level analysis that used to depend on operators staring at dashboards or screens. It can catch audio and video issues too subtle for simple threshold monitoring, pick up early signals that point toward a failure, and correlate an alert with logs, metrics, and events from other parts of the workflow.

That correlation is especially important. Operators need to understand whether the issue matters, where it may have started, and what else was happening at the same time. If the system can connect those pieces, the operator gets something much closer to an explanation instead of just a flag.

The natural language layer can also lower the barrier for operations teams. Being able to ask what happened on a channel, or why a particular alert was triggered, is often more efficient than navigating multiple dashboards and manually piecing together the answer.

So, the scale shift can be significant, but I would frame it as a change in operating model more than a simple headcount calculation. AI allows one operator to oversee many more streams than would be practical with eyes-on-glass monitoring alone, because the system handles more of the continuous checking and only escalates what is worth a person’s attention.

Has AI reduced the number of humans required to maintain livestream monitoring in network operations, or would you characterize its impact more as changing humans' role or the skillset required of operating teams for streaming observability? If the role is less about watching waveforms, what are humans more likely to be doing now as they monitor streams?

I would characterize the impact more as changing the human role than eliminating it. Even when AI gets better at flagging anomalies and surfacing likely root causes, someone still has to decide what to do with that information.

The operator’s job moves up a level. Instead of spending as much time watching waveforms or waiting for thresholds to fire, people are validating alerts, judging severity, deciding whether corrective action is needed, and coordinating the response. They are also helping keep the models useful as the workflow changes.

Streaming environments are always changing — new content types, new formats, new vendors, new devices, new handoff points. A model that worked well yesterday can become less accurate when it sees something genuinely new. Human feedback is what helps keep the system tuned to the real operating environment.

So yes, the work can shrink in volume, but it becomes more concentrated around the harder calls: confirming impact, deciding what action is safe, and knowing when an issue is important enough to escalate.

Is there a heightened risk of false positives when delegating livestream monitoring tasks to AI, and what are best practices for humans in the loop in AI-assisted livestream monitoring?

There is a real risk, especially when a model sees content or workflow conditions outside what it was trained on. It may read something as an anomaly when nothing is actually wrong. I would not say that makes AI less reliable than a human watching the same stream, but it does mean the model has to be monitored and tuned.

The other side is false negatives, which are usually more serious. A false positive may cost the operations team time while someone investigates. A false negative that reaches viewers can affect a much larger audience. That is why the human-in-the-loop model still matters.

Teams should track how the model is performing — not just the number of alerts, but false positives, false negatives, and how those change when the workflow changes. New content types, new components, target devices, and network conditions can all affect what the model sees.
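One way to track model performance along these lines is to log, for each reviewed period, whether the model alerted and whether an operator confirmed a real issue, then summarize the results. The sketch below assumes such operator-labeled records exist; the `alert_quality` function and the sample counts are hypothetical.

```python
def alert_quality(outcomes):
    """Summarize alert performance.

    outcomes: list of (alerted, real_issue) boolean pairs, e.g. from operator review.
    """
    tp = sum(1 for alerted, real in outcomes if alerted and real)
    fp = sum(1 for alerted, real in outcomes if alerted and not real)
    fn = sum(1 for alerted, real in outcomes if not alerted and real)
    return {
        "false_positives": fp,
        "false_negatives": fn,
        # Share of alerts that pointed at a real issue.
        "precision": tp / (tp + fp) if tp + fp else None,
        # Share of real issues that the model caught.
        "recall": tp / (tp + fn) if tp + fn else None,
    }


# Hypothetical week: 8 correct alerts, 2 false alarms, 2 missed issues, 88 quiet periods.
week = (
    [(True, True)] * 8
    + [(True, False)] * 2
    + [(False, True)] * 2
    + [(False, False)] * 88
)
report = alert_quality(week)
print(report)
```

Comparing these numbers before and after a workflow change, such as a new content type or device target, shows whether the model needs retuning.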

For anything important, the alert should come with enough context for a person to sanity check it quickly. Depending on the type of stream — sports, news, high-motion content, or feeds with different network conditions — the model may need different tuning. The point is not to run operations completely hands-free but to let AI narrow the field so humans can focus on the issues that actually need judgment.

How does ORION distinguish between quality events or flags that require immediate human escalation vs. minor occurrences more likely to resolve themselves, and are these types of thresholds configurable for operators?

At Interra, we believe monitoring should do more than just generate alarms. It should help operators understand what is happening and where to focus. ORION is built around that idea, with root-cause analysis to help teams separate brief, self-correcting issues from problems that are more likely to impact viewers.

Operators can configure thresholds based on factors such as error severity, duration, and frequency. A momentary glitch that clears on its own may not require any action, while an issue that persists, repeats, or shows signs of escalating can trigger alerts and notifications so the operations team can respond.
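A configurable rule of this kind might look like the following sketch. It is not ORION's configuration format or API; the `EscalationPolicy` and `QualityEvent` classes, the 1-to-5 severity scale, and the default values are all invented for illustration. It combines the three factors mentioned above: severity, duration, and frequency.

```python
from dataclasses import dataclass


@dataclass
class EscalationPolicy:
    min_severity: int = 3          # hypothetical scale: 1 (minor) to 5 (critical)
    min_duration_s: float = 10.0   # how long an issue must persist before it counts
    repeat_limit: int = 3          # occurrences in the lookback window that force escalation


@dataclass
class QualityEvent:
    severity: int
    duration_s: float
    repeats: int  # times this issue was seen in the recent lookback window


def should_escalate(event, policy):
    """Escalate issues that are severe and persistent, or that keep recurring."""
    severe_and_persistent = (
        event.severity >= policy.min_severity
        and event.duration_s >= policy.min_duration_s
    )
    recurring = event.repeats >= policy.repeat_limit
    return severe_and_persistent or recurring


policy = EscalationPolicy()
glitch = QualityEvent(severity=4, duration_s=0.5, repeats=1)
persistent = QualityEvent(severity=4, duration_s=45.0, repeats=1)
recurring = QualityEvent(severity=2, duration_s=0.5, repeats=5)

print(should_escalate(glitch, policy))      # False: clears on its own
print(should_escalate(persistent, policy))  # True: severe and lasting
print(should_escalate(recurring, policy))   # True: minor but keeps coming back
```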

The goal is to reduce noise and alarm fatigue. Operators should not have to treat every event the same way. They need a way to distinguish between something that can be watched and something that needs immediate attention.

AI makes that more useful by correlating events across the workflow and helping identify likely root causes. That context helps operators decide when intervention is needed instead of reacting to every alert in isolation.
