
The State of Education Video 2024


Is the state of education video in 2024 the quiet before or after the storm? With a pandemic in the rearview mirror, we approach a crossroads that will determine whether enterprise-scale video hosting and management services remain profitable at prices that schools are willing to pay in the new normal. It’s unlikely that we’ll cross a point of no return this year, but I recommend keeping an eye out for signs that will either allay or amplify concerns about the long-term future of schools having a degree of ownership and control of the video services they rely on.

Meanwhile, convenient captioning workflows in 2024 are now accessible to the masses thanks to the confluence of a hot new company (OpenAI) and an open source project lovingly tended to since 2001.

Business Demands Intensify

Ever since summer 2021, those of us who closely follow the industry that supports streaming media for schools have had detailed insights into exemplary vendors. That’s when Kaltura successfully completed its IPO and subsequently was required to file documents with the SEC, the most interesting of which are the quarterly 10-Q forms and the 10-K annual report. These documents include financial statements and a sober accounting of a public company’s perspective on its business climate.

Similarly, Zoom is required to publish these same documents, as it completed its IPO in 2019. Between the two companies and their SEC filings, we have a reliable view into both the synchronous and asynchronous video sectors from two of the most successful and well-run vendors serving the education video vertical.

State of Synchronous Education Video

Zoom is a little bit unusual for an emerging tech company in that it has turned a profit every year since 2018, the year prior to its IPO, and lately has had quite phenomenal profits. That synchronous video is a more profitable line of business is consistent with the general rule of cloud economics as detailed in The State of Education Video in last year’s Sourcebook:

In the cloud, variable-use resources like CPU, RAM, and bandwidth tend to be very cost-effective, while fixed-use resources like long-term disk storage are more expensive than what you can get with an on-prem investment. In other words, economies of scale work best when you pay for what you’re using to serve customers and hand those resources off to other public cloud tenants at other times and worst when you’re always paying to store data you’ve accumulated and may or may not be using.

Synchronous video requires paying for CPU and bandwidth whenever the service is being used, and not much of anything goes on the cloud bill when it’s not. By default, Zoom deletes meetings recorded to its cloud hosting after 180 days, so storage costs have a built-in mechanism to avoid snowballing. In searching for tea leaves to read in last year’s Sourcebook, we alit upon a decreasing rate of period-over-period revenue growth from 2021 to 2022. That trend continued into 2023, but Zoom revenue growth appears to have stabilized at around 3% through the first three quarters of 2023 compared to the same periods in 2022.

Zoom has also made two notable acquisitions: Solvvy in 2022 and Workvivo in 2023. Solvvy adds a mature chatbot offering to the Zoom portfolio, and Workvivo provides an employee experience platform (see Figure 1) that delivers streamlined communication and culture-building tools for business subscribers.

Figure 1. Workvivo for Zoom

The closest thing to an academic institution listed on the Workvivo website’s Partners page is the Hoover Institution at Stanford University, so I don’t expect that this acquisition will create immediate value for Zoom’s school customers. However, it may be a step toward closing feature gaps with Microsoft Teams down the road.

There is definite interest in developing custom chatbot applications in higher ed, though. The University of Central Florida (UCF) is a school that I admire for generally staying ahead of the curve with educational technology, and its Knightbot chat service, built in partnership with engagement platform vendor Mainstay, is a good example of a successful chatbot. Another Mainstay customer, Georgia State University, in partnership with UCF and others, was recently awarded a $7.6 million grant to study whether chatbots can improve student learning outcomes by giving students what is essentially a 24/7 AI teaching assistant they can ask questions. It will be interesting to see if that research also reveals whether students’ interactions with human faculty and teaching assistants decrease in civility as a result of more interaction with AI assistants in this role.

State of Asynchronous Education Video

Kaltura had its IPO at a somewhat unfortunate time as far as trendline optics go, although it was a good time to raise cash to the tune of $172.5 million. The NASDAQ composite closed at 14,631.95 on the day of the IPO and fell below 12,000 by May 2022 and below 11,000 in June 2022. Kaltura was priced at $10 for the IPO, had a peak closing price of $13.61 on Aug. 6, 2021—incidentally, the day that the last 2.25 million shares were sold at the original price—then fell all the way to $1.78 on March 7, 2022, roughly where the stock price has languished ever since.

That collapse in price made possible an unlikely, unsolicited purchase attempt from one of Kaltura’s top competitors, Panopto, in summer 2022. The purchase was ultimately shot down by Kaltura’s board. Kaltura has spent the past 2 years getting lean on operational costs, shedding 10% of its workforce in 2022 and laying off an additional 11% in 2023.

Layoffs have been widespread across the tech sector for the last several years and have continued into 2024; for a dramatic example, Twitch laid off more than one-third of its employees in January 2024.

The effort to trim down has borne fruit in Kaltura’s case: The company’s non-R&D operating expenses fell below gross profits in Q4 2022 and have remained well below since. Period-over-period revenue growth in 2023 was strong, with Kaltura’s subscription income in the category that includes the education vertical increasing by 8.2%, 7%, and 4.8% over the first three quarters compared to 2022, handily beating the trend observed in last year’s Sourcebook.

It’s noteworthy that Kaltura—the biggest provider of educational VOD services, serving half of the R1 universities—has never shown a profit, either quarterly or annually, although again, Zoom is the outlier among emerging tech companies for consistently turning a profit. At some point, though, it would be reassuring to know that the vendors schools rely on for educational video services operate on sustainable business models. Kaltura recognizes this as well, recently recruiting John Doherty from Magic Leap to serve as its new CFO and specifically mentioning profitability as a component of the hire in its announcement.

Last year’s The State of Education Video article included a discussion of what might have happened if the two largest video management system (VMS) vendors servicing schools had in fact merged and what options schools would have if their post-pandemic ed tech needs shrank and new circumstances warranted a de-escalation in their VMS subscriptions. Since video services are tremendously valuable to schools, and school administrators tend to prefer hiring out core services to vendors rather than relying on the loyalty of highly skilled employees to support those critical operations, I’m bullish that the industry will thrive.

If that optimism is misplaced, the University of Toronto’s Opencast project may suggest a new direction. The University of Toronto is a bold and forward-thinking institution with a total enrollment of just under 100,000 students across its three campuses. It successfully built out its Opencast Content Capture System (go2sm.com/occs) to provide campuswide lecture capture, and it remains an excellent option for schools that are willing to invest in on-prem solutions or to collaborate across institutions to pool resources to that end (see Figure 2).

Figure 2. A schematic of the University of Toronto’s Opencast Content Capture System

Kaltura’s 2024 10-K filing was expected in February. In the section of the filing that discusses risk factors, compliance with privacy regulations is always a major concern. In 2021, China passed the Personal Information Protection Law (PIPL), complicated legislation that specifies ranges of fines that companies can be held liable for if they fail to comply. Thus far, PIPL has not been mentioned in Kaltura’s SEC filings (and only obliquely in Zoom’s 2023 10-K), but navigating how this law impacts international educational institutions and the vendors that provide technology services for them is a major open question.

I also expect some insightful discussion of new risks posed by modern AI. Generally, Kaltura includes a short paragraph about liability related to hosting content that violates copyrights or licenses. It will be interesting to read whether deepfake technologies are on Kaltura’s radar, as they present a more costly challenge when it comes to assisting institutions with policing takedown requests for offensive, highly personal content.

I’m also curious to see data on how Kaltura’s entrance into synchronous video services has developed, something that has yet to be teased out in any of its filings. As discussed for Zoom, the economics of cloud resource provisioning for synchronous video are more favorable than those for asynchronous video, so the more Kaltura can grow its synchronous service offerings, presumably the better for its bottom line. The company will also need to thread the needle of either passing its storage costs on to customers more effectively without creating dissatisfaction or, better, providing data-driven tools that help customers assess which content can be inconsequentially deleted or archived to lower-cost storage to minimize storage costs.

An advisable approach is to adopt the “with great knowledge comes great liability” angle on data retention policy, perhaps in concert with efforts to comply most effortlessly with PIPL, the General Data Protection Regulation, and U.S. privacy laws. Another appealing justification for conscientiously managing the accumulation of recorded video data is streaming green: Unnecessary video storage bloats electricity usage and adds to the carbon released into the atmosphere.

Accessibility Now Easily Accessible

In last year’s “The State of Education Video,” I threw some cold water on the hype over ChatGPT based on the performance of GPT-3. GPT-4 was released right around the Sourcebook’s publication, and that skepticism was no longer warranted given GPT-4’s superior performance. GPT-4 has been shown to do well on standardized tests, Advanced Placement exams, and trade exams, making it a major factor in how teachers assess student performance.

The best advice I’ve seen for how to AI-proof your tests and assignments, loosely adapted from optics research scientist and AI researcher Janelle Shane (aiweirdness.com), is to give questions that students can answer but that a pre-trained transformer can’t by making the questions very local to the student doing the assignment, either in space or time. The transformer’s training data is many months stale from the public internet, so it wouldn’t be able to answer questions about very recent events and wouldn’t have access to what’s on specific pages of your class textbook or your course website (except as given by a student prompt).

Over the past year, many teachers have leaned into the transformer revolution and have tried to incorporate AI into their instruction. Perhaps the most intriguing use of AI text generation is for seeding inspiration. Here, the assignment would be to have your text generator produce several essays on various topics, choose whichever one you most want to rewrite, and produce an original essay of your own based on the prompt. This strikes me as a generalization of Cunningham’s Law, which can be stated as, “The best way to motivate experts to provide you with a correct answer is to invite their contempt by posting the wrong one on the public internet.” It rings true that for whatever reason, it’s easier and somehow more satisfying to put creative energy into disagreeing with someone than agreeing with them. A compelling writing assignment would be to have students rewrite two AI-generated essays—one that they agreed with and one that they disagreed with—and subjectively rate the experience. As a class, they would then reflect on why this is so (assuming that it does indeed prove to be the class’s experience).

Embarrassing underestimations of how quickly large language model (LLM)-driven transformers would pose substantial challenges to assessments more sophisticated than short-answer quizzes aside, a main point of last year’s article holds up well: Whisper, OpenAI’s open source speech-to-text engine, would be a huge benefit for education in 2023. In 2024, Whisper and Whisper-powered tools are easy to use, even for technology-challenged teachers and students who need to have their videos captioned without spending a huge amount of time on the process.

The quality of automatic captioning offered by vendors has improved dramatically in the past 5 years with the rise of attention-based transformers and LLMs. Whisper’s free availability since September 2022 has upgraded the state of the art in how educators can produce closed captions for their educational video. Whisper is able to generate astonishingly accurate transcriptions in multiple languages. For example, I supported a research project by generating automatic transcripts of interviews in Ukrainian, Russian, English, and Czech with people fleeing the war in Ukraine and those providing aid to them. This technology dramatically improved the researchers’ procedure (correcting a transcript is a much faster process than writing one from scratch) and did not send the highly sensitive data anywhere untrusted. That Whisper adds the ability to automatically translate speech into English as part of the speech-to-text process is almost unimaginable, but it works pretty well.
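To make that concrete, here is a minimal sketch of the kind of workflow just described, using OpenAI’s openai-whisper Python package. The file name and model size are placeholders for illustration, not details from the research project above.

# Minimal sketch: transcribe (and optionally translate) a recording with
# OpenAI's open source Whisper package (pip install openai-whisper).
# The file name and model size are placeholders.
import whisper

model = whisper.load_model("medium")  # smaller models run on more modest hardware

# Transcribe in the source language (Whisper auto-detects it)
result = model.transcribe("interview.mp3")
print(result["language"])   # detected language code, e.g. "uk"
print(result["text"])       # full transcript

# Or translate the speech directly into English in the same pass
translated = model.transcribe("interview.mp3", task="translate")
print(translated["text"])

Everything runs locally, which is what kept the interview data described above out of untrusted hands.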

Whisper is not perfect, though, and has two major problems. The first is that it produces segments that are far, far too long; often three or four lines of captions fill the width of the player. The second is that Whisper is prone to hallucinate, as are all transformers, since they’re built to predict words and send them to output even when the input is very sparse or nonexistent from a human language user’s perspective. Typically, a hallucination happens after or during stretches of silence or a non-speech signal like music, producing unrelated text or often just a sequence of periods for the remainder of the run.

WhisperX is a project that’s being undertaken to address both of these problems head-on (github.com/m-bain/whisperX). WhisperX (see Figure 3) pre-processes the audio to be transcribed by detecting speech signals and cutting out all other non-speech audio intervals so that Whisper doesn’t have an excuse to hallucinate. After generating a transcript of this edited audio, it performs forced alignment against the original audio using Meta’s wav2vec 2.0 toolkit to time code and segment the transcript into a caption file. This is quite a brilliant solution, although it jettisons Whisper’s translation capability, and WhisperX’s segmentation is also often far too long.

Figure 3. The WhisperX pipeline as diagrammed on the project’s GitHub readme
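For readers who want to experiment with it directly, the following is a rough sketch of that pipeline based on the project’s README; exact function names and arguments can shift between releases, so treat the calls as illustrative rather than definitive, and note that the file name is a placeholder.

# Rough sketch of the WhisperX pipeline (VAD filtering -> transcription -> forced alignment).
# Based on the README at github.com/m-bain/whisperX; details may vary by release.
import whisperx

device = "cuda"  # or "cpu" on machines without a supported GPU
audio = whisperx.load_audio("lecture.mp4")

# 1. Transcribe with the VAD-filtered batch pipeline (Faster-Whisper under the hood)
model = whisperx.load_model("large-v2", device)
result = model.transcribe(audio)

# 2. Force-align the transcript against the original audio with a wav2vec 2.0 model
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)

for seg in aligned["segments"]:
    print(f'{seg["start"]:.2f} --> {seg["end"]:.2f}  {seg["text"]}')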

However, hallucination is generally not a problem in instructional videos, where there are almost never extended periods of silence or non-speech sound. In fact, I had used Whisper for several months without ever seeing the phenomenon myself until we started throwing commencement ceremony recordings at it that included lengthy processionals. Thus, for a teacher, the only concern with using Whisper is getting it installed and being able to re-segment and easily correct the captions it produces.

To address Whisper’s challenges, Subtitle Edit is an excellent and free tool. Although I started using it only recently, it has been in development since 2001. The source code has been version-controlled on GitHub for just over a decade, and when it was first published there, the program was primarily a souped-up version of SubRip, the DVD subtitle picture OCR program that invented the SRT filetype. Development on Subtitle Edit (see Figure 4), though, focused on ergonomics instead of OCR, deferring the job of recognizing the text in DVD subtitles to the Tesseract OCR engine, originally written at HP and later adopted as an open source project by Google. Subtitle Edit was a fascinating program all along; by 2011, it had advanced features like real-time text-based chat so that multiple editors could collaborate on a DVD localization project and a fast Fourier transform (FFT) calculator that shows a real-time spectrogram to assist experts with identifying ambiguous speech. As of 2014, it could export to 201 different caption formats.

With the 3.6.8 release on Oct. 24, 2022, Subtitle Edit began experimenting with using Whisper to auto-generate captions for any video to be presented in its 2-decades-in-the-making caption correction user interface; this occurred about 1 month after Whisper was open sourced. The program makes downloading and installing Whisper and its pre-trained models a breeze. The default option for the Whisper version is a standalone executable wrapper of Faster-Whisper, the same variant of the engine used by WhisperX. Another easy option, CPP, a C++ port of Whisper by the brilliant and extraordinarily productive Georgi Gerganov, has some very useful extra features like live captioning from a microphone and more compact models.

Figure 4. Subtitle Edit about to download the Medium.en pretrained Whisper model

If you need to caption video that would be prone to hallucination, WhisperX is an option, but it would require a nonstandard installation procedure bypassing the Conda virtual environment steps. The original Whisper engine significantly benefits from inference on a GPU with at least 12GB of VRAM when using a large model, but both Faster-Whisper and Whisper CPP perform well on any modern computer.
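To give a sense of how little code a CPU-only workflow requires, the sketch below uses the faster-whisper Python package to transcribe a recording and write a bare-bones SRT file. The model choice, int8 setting, and file names are illustrative assumptions rather than recommendations drawn from the tools discussed above.

# Sketch: generate an SRT caption file with faster-whisper on a CPU-only machine
# (pip install faster-whisper). File and model names are placeholders.
from faster_whisper import WhisperModel

def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 00:01:02,345."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

# int8 quantization keeps memory use modest on machines without a GPU
model = WhisperModel("medium.en", device="cpu", compute_type="int8")
segments, info = model.transcribe("lecture.mp4")

with open("lecture.srt", "w", encoding="utf-8") as srt:
    for i, seg in enumerate(segments, start=1):
        srt.write(f"{i}\n{srt_time(seg.start)} --> {srt_time(seg.end)}\n{seg.text.strip()}\n\n")

The resulting .srt can then be opened in Subtitle Edit for correction and re-segmentation.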

Subtitle Edit will re-segment the transcript into timed text using default settings (see Figure 5) that are close enough to the Netflix text style guide, which has become the industry standard after the National Association of the Deaf persuaded the company to become an effective ally of accessibility in the streaming entertainment industry.

Figure 5. The Subtitle Edit settings menu
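For anyone scripting caption cleanup outside of Subtitle Edit’s GUI, a greatly simplified version of that kind of re-segmentation might look like the sketch below, which takes the commonly cited limits of roughly 42 characters per line and two lines per caption as working assumptions rather than a faithful rendering of either Subtitle Edit’s or Netflix’s exact rules.

# Simplified sketch of Netflix-style caption wrapping: at most two lines per
# caption and roughly 42 characters per line. These limits are commonly cited
# guideline values used here as assumptions.
import textwrap

MAX_CHARS_PER_LINE = 42
MAX_LINES_PER_CAPTION = 2

def wrap_caption(text: str) -> list[str]:
    """Split one long transcript segment into caption blocks of up to two short lines."""
    lines = textwrap.wrap(text, width=MAX_CHARS_PER_LINE)
    return [
        "\n".join(lines[i:i + MAX_LINES_PER_CAPTION])
        for i in range(0, len(lines), MAX_LINES_PER_CAPTION)
    ]

for block in wrap_caption(
    "Typically, a hallucination happens after or during stretches of "
    "silence or a non-speech signal like music."
):
    print(block)
    print("--")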

With more than a year of development since Whisper was incorporated into the Subtitle Edit project, it’s an easy-to-use way to get started with this extremely advanced speech-to-text engine and one that I wholeheartedly recommend to teachers and students.

