Protege Media Is Licensing Video to Train AI — What It Means for Film & TV

Film, TV and other video content producers and rights owners are already licensing content for AI training. The size of the market is hard to quantify, but at least some of the activity is being enabled by AI data brokers — an emerging crop of companies that have been aggregating content, preparing datasets and negotiating training rights licenses between media content owners and AI developers.

Data licensing startup Protege is one such data broker. Among its various verticals is Protege Media, focused on audiovisual content licensing, which now counts over 100 media partners, noted Protege Media chief content officer Dave Davis on the latest episode of the Luminate podcast In the Lab. Among them are national broadcasters, film and TV producers and distributors and sports leagues from around the world.

Protege has struck deals with the biggest AI companies as well as many smaller ones. Many are developing video generation models, but audiovisual content can also train speech-based applications and even humanoid robotics. “We get deal flow across all of that,” said Davis, suggesting there’s a long tail of dealmaking they allow media companies to access.

Even as Davis and others make progress, licensing is far outstripped by scraping, which is how most data is acquired for generative AI development. Model builders have scraped vast quantities of material from the internet without permission or payment from content rights holders or anyone else while resting on fair use under U.S. copyright law.

For audiovisual works, dealmaking has been quiet compared with that of online news publishers, which have engaged in licensing or other kinds of data partnerships. Below are five takeaways from our conversation with Davis about what he’s learned from two years of dealmaking:

1. Media companies are licensing a diverse range of content: That has included both scripted movies and TV shows as well as unscripted content, including sports, news and nature documentaries; and both final cuts and raw footage (e.g., unused B-roll). Raw footage can even be preferable for developers, since it’s “longer context” and more real-world accurate, without cuts or other editing treatments.

2. Data needs are specializing: Protege Media’s deal mix is shifting from datasets intended for pre-training (the initial phase of training a model) to fine-tuning, when developers refine model capabilities with specific data. For licensors, this means developers are increasingly seeking curation over bulk: niche visuals rather than broad and diverse collections.

3. Deal terms take effort (and some compromise) to work: The requirements for training content almost entirely ignore its distribution value. Instead, the focus tends to be on the subject matter and quality of the footage. Both pre-training and fine-tuning datasets have been paid for up front on a flat-fee basis, though Davis said they recently did their first attribution- or usage-based deal, meaning licensors are paid based on an algorithmic calculation of how their content contributes to any future AI-generated outputs from the AI service trained on their content.

AI and media company perspectives have also had to meet in the middle on license length. AI companies initially sought licenses “in perpetuity,” a clear nonstarter for any potential media licensor, but Protege has successfully secured nonperpetual licenses. A “nonperpetual” license in this case means AI companies can interact with the data only during a limited window (the training period), though any existing models can retain their weights and embeddings.

4. Content owners need to be clear on any risks: Complexity related to likeness and other third-party rights inherent in film and TV content has likely held off studio content owners from licensing. “Content owners need to work through those rights and issues and get comfortable that they have the necessary rights to engage in this business,” Davis said, adding that Protege puts carve-outs in its terms to limit some rights-related risks, notably by barring training on any music licensed for use in a movie and by prohibiting models from reproducing the faces or voices of actors who appear.

5. Some signs point to AI companies licensing more of the data they use to train: Davis argued that licensing would be easier than a developer trying to comply with disparate rules and rulings around the world. He pointed to a few significant litigation outcomes he expected would encourage (or scare) AI companies to license, notably the $1.5 billion Anthropic settlement. Lawsuits could even engender licensing, as the UMG-Udio deal demonstrated.

Beyond legal concerns, some incentives to license could increasingly be technical. Although developers have been training models on synthetic (AI-generated) data, companies may have to license in order to access fresh reserves of nonsynthetic content needed to prevent the phenomenon known as model collapse. “Interestingly, requests we get increasingly asked for content that is unpolluted by AI,” said Davis. “AI companies want real-world data that they know is the ground truth.”
