Re-Engineering Real-Time Video Delivery
A Stanford University academic research project called Salsify has identified a better way to deliver video for real-time applications like conference calls.
The team didn't create a new video format; it created a new architecture for real-time video systems. Instead of the status quo of two separate control loops, one for transport and one for the video codec, the Salsify approach joins them into a unified control loop that manages the transport and the video codec together.
Video codecs currently work as a black box: there's the encode/decode function that gets video down to the right size for delivery, and then there's the transport protocol that delivers the video. The Salsify project makes the overall system more responsive to changes in available bandwidth. The result is, theoretically at least, a better overall experience. So where are things at today?
The video codec and the transport protocol each operate somewhat independently, resulting in a video stream that may be too big or too small for the network. Stanford PhD candidate and Salsify project member Sadjad Fouladi wants to improve the odds that the video being delivered will fit the conditions of the network, so there are fewer glitches and dropped connections, as well as less buffering.
The team created a real-time video system that can respond quickly to changing network conditions and avoid stalls and glitches. Salsify looks at the current estimate of the network's capacity and then sends a video frame sized to fit that estimate.
The researchers report lower video delay and better visual quality than the market leaders: FaceTime, Google Hangouts, Skype, and WebRTC's reference implementation in Google Chrome, both with and without scalable video coding. For real-time communications used in videoconferencing, telemedicine, or other uses where delay in video delivery is unacceptable, low latency is even more important than in live streaming.
"We knew that a lot of people suffered from bad connections and low-quality video in videoconferencing," says Fouladi. "We thought that the problem is not in the codec and it's not in the transport. It's in the way that we integrate these components together."
"Unfortunately the overall performance of these systems hasn’t improved that much, so we thought maybe it's time to have a new architecture for the whole system rather than improving the individual components," he says.
Low, Low Latency
In products like Skype or protocols like WebRTC, the transport protocol doesn't have much control over the stream. Even if it's not a good time to send something because of bad connectivity or an already congested network, the transport protocol still has to send already-encoded frames, says Fouladi.
Today the transport protocol maintains an estimate of the network speed and communicates it to the video codec. The output size of any single frame usually lands under or over what that estimate allows. If a frame is too big or too small, the codec tries to compensate by adjusting the frames that follow.
Over the course of about ten or twenty frames, the stream converges on that average network speed, says Fouladi. So while the self-correcting mechanism sounds good in theory, in practice a single big frame can still cause congestion and packet loss, which will cause delays in the stream. Additionally, hitting the target bitrate only on average makes the system slow to react to changes in the network.
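To make the problem with average-based correction concrete, here is a toy simulation in Python. It is not taken from any of the products named above: the function name, the gain value, and the frame costs are all assumptions chosen for illustration. The controller only learns that a frame was too big after it has been produced, so the oversized frame still goes out, and the frames that follow are squeezed well below the target while the average recovers.

```python
def average_rate_control(frame_costs, target_bytes, gain=0.5, min_scale=0.1):
    """Toy after-the-fact rate control (hypothetical, for illustration only).

    frame_costs: how many bytes each frame would need at scale 1.0.
    The controller corrects its scale only after seeing how far the
    previous frame missed the per-frame target.
    """
    scale = 1.0
    sizes = []
    for cost in frame_costs:
        size = cost * scale          # frame is already encoded at this size
        sizes.append(size)           # ...and must be sent, whatever its size
        error = (target_bytes - size) / target_bytes
        scale = max(min_scale, scale * (1.0 + gain * error))
    return sizes


# Five ordinary frames, one complex frame (e.g. a scene change), then more ordinary ones.
costs = [1000] * 5 + [3000] + [1000] * 14
sizes = average_rate_control(costs, target_bytes=1000)

over_budget = [i for i, s in enumerate(sizes) if s > 1000]
print("frames over budget:", over_budget)
print("size right after the spike:", round(sizes[6]))
```

In this sketch the complex frame overshoots the budget by a wide margin, and the controller then takes many frames to climb back to the target, which mirrors the slow, average-driven reaction described above.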
Under the Hood
Salsify is only concerned with the size of the next frame, whereas previously the codec tried to deliver content based on an average bitrate. The goal is to make sure no individual frame is going to cause loss and congestion in the network. Instead of inaccurately guessing the encoding parameters upfront, the Salsify approach creates two slightly different qualities for each video frame and then picks the one that fits the network conditions and adjusts on a continual basis. "In this way, the transport has a frame-by-frame control over the video and can respond quicker to changing network conditions," says Fouladi.
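The selection step can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Salsify's actual code: `CodecState`, `EncodedFrame`, `encode`, `choose_frame`, and the size formula are all hypothetical names and simplifications. The codec state is passed in and returned explicitly, so a candidate that is not sent leaves the state untouched.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CodecState:
    """Immutable stand-in for encoder state (hypothetical)."""
    frames_committed: int
    last_quality: int


@dataclass(frozen=True)
class EncodedFrame:
    quality: int
    size_bytes: int
    next_state: CodecState  # state that applies only if this frame is sent


def encode(state: CodecState, complexity: int, quality: int) -> EncodedFrame:
    """Pure function: never mutates `state`. Toy size model: complexity * quality."""
    size = complexity * quality
    return EncodedFrame(quality, size, CodecState(state.frames_committed + 1, quality))


def choose_frame(
    state: CodecState, complexity: int, budget_bytes: int, step: int = 1
) -> Tuple[Optional[EncodedFrame], CodecState]:
    """Encode a slightly better and a slightly worse candidate, then send the
    best one that fits the transport's current budget, or nothing at all."""
    better = encode(state, complexity, state.last_quality + step)
    worse = encode(state, complexity, max(1, state.last_quality - step))
    for candidate in (better, worse):
        if candidate.size_bytes <= budget_bytes:
            return candidate, candidate.next_state
    return None, state  # skip this frame; the codec state is unchanged


state = CodecState(frames_committed=0, last_quality=5)
sent, state_after = choose_frame(state, complexity=100, budget_bytes=450)
print(sent.quality if sent else "skipped", state_after)
```

Because `encode` returns a new state instead of modifying the old one, discarding a candidate costs nothing beyond the wasted computation, which is the property that lets the transport decide frame by frame.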
This approach, providing a menu of options to the transport, is made possible by Salsify’s functional video codec, which exposes a save/restore interface for its internal state, allowing the encoder to explore different execution paths without committing to them. In traditional codecs, if a frame is encoded, it becomes a part of the video stream and has to be sent, whereas in Salsify’s codec, it can be discarded and the old state can be restored.
"So in this situation, for example, if the network is out or if something really bad happens, the transport can just stop sending frames altogether to avoid causing more congestion, even if the codec has already produced a frame," says Fouladi. The project used the team’s own implementation of a VP8 codec and achieved 4.6x lower p95 delay and 2.1 dB higher visual quality, as measured by SSIM, on average when compared to FaceTime, Hangouts, Skype, and WebRTC.
"Now we actually have access to the internals of the black box, and we can design more sophisticated systems that can do more stuff that they currently can do," says Fouladi. "I think one of the goals of this project was to show the benefits of having this interface and somehow persuade the codec designers and implementers to actually include that interface in the future generations of the codecs."
The Salsify open source codec is video only (no audio). The project is software based, and the requirement of encoding two versions of every frame creates significant computational overhead, leaving it far from ready for prime time. Getting this into hardware would mean the same long road that AV1 has had, so some creative thinking needs to go into how to apply this in a real-world setting. While the Salsify team faces a lot of challenges, it has developed an interesting approach to a long-standing problem.
More information can be found at https://snr.stanford.edu/salsify.