Investigation finds companies are training AI models with YouTube content without permission


Artificial intelligence models require as much useful data as possible to perform, but some of the biggest AI developers are relying partly on transcribed YouTube videos used without the creators' permission and in violation of YouTube's own rules, according to an investigation by Proof News and Wired.

The two outlets revealed that Apple, Nvidia, Anthropic, and other major AI firms have trained their models with a dataset called YouTube Subtitles, which incorporates transcripts from nearly 175,000 videos across 48,000 channels, all without the video creators' knowledge.

The YouTube Subtitles dataset comprises the text of video subtitles, often with translations into multiple languages. The dataset was built by EleutherAI, which described the dataset's goal as lowering barriers to AI development for those outside big tech companies. It's only one component of the much larger EleutherAI dataset called the Pile. Along with the YouTube transcripts, the Pile has Wikipedia articles, speeches from the European Parliament, and, according to the report, even emails from Enron. 

However, the Pile has a lot of fans among the major tech companies. For instance, Apple employed the Pile to train its OpenELM AI model, while a Salesforce AI model released two years ago was trained with the Pile and has since been downloaded more than 86,000 times.

The YouTube Subtitles dataset encompasses a range of popular channels across news, education, and entertainment. That includes content from major YouTube stars like MrBeast and Marques Brownlee. All of them have had their videos used to train AI models. Proof News set up a search tool that checks whether any particular video or channel is in the collection. There are even a few TechRadar videos in the collection, as seen below.

YouTube Subtitle Dataset

(Image credit: Proof News)

Secret Sharing

The YouTube Subtitles dataset seems to contradict YouTube’s terms of service, which explicitly forbid automated scraping of its videos and associated data. That’s exactly what the dataset relied on, however, with a script downloading subtitles through YouTube’s API. The investigation reported that the automated download selected videos using nearly 500 search terms.

The discovery provoked a lot of surprise and anger from the YouTube creators Proof and Wired interviewed. The concerns about the unauthorized use of content are valid, and some of the creators were upset at the idea their work would be used without payment or permission in AI models. That’s especially true for those who found out the dataset includes transcripts of deleted videos, and in one case, the data comes from a creator who has since removed their entire online presence.

The report didn’t have any comment from EleutherAI. It did point out that the organization describes its mission as democratizing access to AI technologies by releasing trained models. That may conflict with the interests of content creators and platforms, if this dataset is anything to go by. Legal and regulatory battles over AI were already complex. This kind of revelation will likely make the ethical and legal landscape of AI development more treacherous. It’s easy to suggest a balance between innovation and ethical responsibility for AI, but producing it will be a lot harder. 

Eric Hal Schwartz
Contributor

Eric Hal Schwartz is a freelance writer for TechRadar with more than 15 years of experience covering the intersection of the world and technology. For the last five years, he served as head writer for Voicebot.ai and was on the leading edge of reporting on generative AI and large language models. He's since become an expert on the products of generative AI models, such as OpenAI’s ChatGPT, Anthropic’s Claude, Google Gemini, and every other synthetic media tool. His experience runs the gamut of media, including print, digital, broadcast, and live events. Now, he's continuing to tell the stories people want and need to hear about the rapidly evolving AI space and its impact on their lives. Eric is based in New York City.
