Big Tech’s unspoken rule: using online content and copyrighted material to train AI is seemingly the norm - and it doesn’t look like that’s about to change


This week, we learned that huge tech corporations such as Apple, Nvidia, and Anthropic have allegedly used material like the subtitles and transcripts of YouTube videos to train their AI models.

Some of the creators of these videos reacted with disappointment and frustration, and understandably so. While they agreed to YouTube’s terms of service, which may implicitly permit this kind of use, they put a ton of work into their videos, and that work has gone on to be used and perhaps even sold without the creators seeing compensation or even credit.

Unfortunately, I don’t think this will be an isolated incident. Instead, it strikes me as a demonstration of an unspoken rule among tech companies developing AI models. As a supervisor working in this area at Amazon allegedly told an ex-employee when instructing her to ignore potential copyright issues: “everyone’s doing it.”


A more critical look at training data 

Ironically, a few months ago I sang Apple’s praises for seemingly keeping ethical considerations of this kind at the core of its AI software development. I was particularly impressed that Apple appeared to be taking this approach, considering how rival AI models, particularly large language models (LLMs), are trained using material from people who may not have consented to their work being used in that way.

In short, an important aspect of developing LLMs is feeding in vast amounts of information (called training data) that they “learn” from to produce coherent and convincing human-like responses. Put simply, you get human-like speech out by putting human speech (and writing) in. To produce responses that are well-written, informed, and possibly more interesting, LLM developers feed in written material such as books, website content, and social media posts - much of which is protected by copyright.
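
To make that concrete, here’s a minimal sketch in Python of how a single piece of collected writing becomes a training signal. It uses the open-source Hugging Face transformers library, with the small GPT-2 model standing in for a production LLM; the example sentence is my own invention, not real training data.

```python
# Minimal sketch: one training step on one piece of collected text.
# GPT-2 stands in for a production LLM; the sample text is invented.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.train()

# Any collected text - a blog post, book excerpt, or video transcript -
# is tokenized into the integer IDs the model actually reads.
text = "Put human writing in, and human-like writing comes out."
batch = tokenizer(text, return_tensors="pt")

# In causal language modeling, the labels are the inputs themselves:
# the model learns to predict each token from the tokens before it.
outputs = model(**batch, labels=batch["input_ids"])
print(f"Loss on this text before the update: {outputs.loss.item():.3f}")

# One gradient step nudges the model's weights toward this text; real
# training repeats this across billions of documents.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
outputs.loss.backward()
optimizer.step()
```

Whether the text is a novelist’s chapter or a YouTube transcript, the mechanics are identical - which is why the copyright questions center on where the training data came from, not on the training code itself.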


In my article about Apple’s seemingly ethical approach, I went into some detail about the lawsuits that the New York Times and a number of prominent authors have mounted against companies including Microsoft, OpenAI, Meta, Alphabet (Google’s parent company), and others over possible copyright infringement.

Critics of this practice say that it could be considered copyright infringement if these tech companies haven’t gotten the explicit consent of the respective copyright holders or their legal representatives. However, these misgivings don’t appear to discourage industry leaders in consumer AI products such as OpenAI (the company behind ChatGPT). A spokesperson for the company wrote the following about the issue as part of evidence submitted to the UK’s House of Lords communications and digital committee, as reported by the Telegraph:

“Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials.”

The spokesperson for OpenAI went on to state that the company complies with all copyright laws when using copyrighted material in the training of its AI models and that it believes “that legally copyright law does not forbid training.”

The report about the use of YouTube video material comes from Wired and Proof News, which allege that using this material without creators’ permission violates YouTube’s rules. The material is part of a dataset named the Pile, built by EleutherAI, a nonprofit research lab that says it wants to lower the barriers to AI development.

Apple has stepped forward to clarify that it used Pile data to train its research models, including OpenELM, with the end goal of learning about LLMs, and not to train Apple Intelligence (Apple’s AI developed specifically for use on Apple products).

This means that if YouTube’s rules were broken, they were broken by EleutherAI, and EleutherAI would face any related litigation. I don’t know if that totally absolves the tech firms that used the ripped YouTube data, but it demonstrates how quickly the ethical and legal ramifications of this practice can become complex – and this is just one particular instance.


As AI evolves rapidly, will the ethics and laws evolve with it?

“If you are not paying for it, you’re not the customer; you’re the product being sold.” 

This sentiment has been around since the 1970s, but the version above was left as a comment on a 2010 article about the news aggregator Digg, and it has been repeated (or at least paraphrased) about many digital and internet products since. It’s also a common sentiment in the Reddit thread about the Wired and Proof News report.

I’m not saying I agree with it - personally, I fall on the side of people who feel this is copyright infringement - but companies (not just tech companies) love new technology that lets them pay less for human labor while continuing to increase output and revenue. Furthermore, governments and regulatory bodies are often slow to enact the new regulations and legal frameworks that emerging technologies should operate within.

So, we can feel as negatively about it as we like, but I don’t think that’ll stop tech companies from continuing this practice. Frankly, I think they hope their products become so entrenched in our lives that even if ethical or legal considerations catch up with them, we’ll want to continue using them anyway. 

I know I sound cynical - and I also don’t have a functional crystal ball. Maybe the sentiment will turn; maybe AI technology will bring so much good into the world that it outweighs the negatives. Maybe, maybe, maybe… We’ll have to continue watching how AI evolves. What I can say with some confidence is that AI's presence will become increasingly significant in our lives, and there will likely be unintended consequences – both positive and negative. Because of this, there will come a time when we’ll have to really understand and address these consequences thoughtfully and proactively, but I don’t think we’ve reached that point yet.


