Big Tech’s unspoken rule: using online content and copyrighted material to train AI is seemingly the norm - and it doesn’t look like that’s about to change


This week, we learned that huge tech corporations, such as Apple, Nvidia, and Anthropic, allegedly use information like the subtitles and transcripts of YouTube videos to train their AI models.

Some of the creators of these videos reacted to the news that their content was used in this way with disappointment and frustration, and understandably so. While they agreed to YouTube’s terms of service, which may include implicit agreement that content could be used in ways like this, they put a ton of work into their videos, and that work has gone on to be used and maybe even sold without the original creators seeing compensation or even credit.

Unfortunately, I don’t think this will be an isolated incident - instead, it strikes me as a demonstration of an unspoken rule among tech companies that are developing AI models. As a supervisor working in this area at Amazon allegedly told an ex-employee when instructing her to ignore potential copyright-related issues, “everyone’s doing it.”


A more critical look at training data

Ironically, a few months ago, I sang Apple’s praises about how it seemed like the company was building an AI while keeping ethical considerations of this kind at the core of its AI software development. I was particularly impressed that Apple was taking this approach, considering that rival AI models, particularly large language models (LLMs), are being trained on material from people who may not have consented to their work being used in that way.

In short, an important aspect of developing LLMs is feeding them vast amounts of information (called training data) that they “learn” from, improving their ability to produce coherent and convincing human-like responses. It helps to put human speech (and writing) in to get human-like speech out. To get better-quality responses that emulate well-written, informed, and possibly more interesting writing, LLM developers input written materials such as books, website content, and social media posts - much of which is protected by copyright.


Navigating the ethical and legal complexities of training data

In my article about Apple’s seemingly ethical approach, I went into some detail about the lawsuits that the New York Times and a number of prominent authors have mounted against companies like Microsoft, OpenAI, Meta, Alphabet (parent company of Google), and others over possible copyright infringement.

Critics of this practice say that it could be considered copyright infringement if these tech companies haven’t gotten the explicit consent of the respective copyright holders or their legal representatives. However, these misgivings have not discouraged industry leaders in consumer AI products, such as OpenAI (the company behind ChatGPT). A spokesperson for the company wrote the following about the issue as part of evidence that was submitted to the UK’s House of Lords communications and digital committee, as reported by the Telegraph:

“Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials.”

The spokesperson for OpenAI went on to state that the company complies with all copyright laws when using copyrighted material in the training of its AI models and that it believes “that legally copyright law does not forbid training.”

The report about the use of YouTube video material comes from Wired and Proof News, which allege that using this material without creators' permission violates YouTube’s rules. The material is part of a data set named the Pile, which was built by EleutherAI, a nonprofit research lab that says it wants to lower the barriers to AI development.

Apple has stepped forward to clarify that it used Pile data to train its research models, including OpenELM, for the end goal of learning about LLMs and not to train Apple Intelligence (Apple’s AI that’s developed specifically for use on Apple products).

This means that if YouTube’s rules were broken, they were broken by EleutherAI, and EleutherAI would face any related litigation. I don’t know if that totally absolves the tech firms that use the ripped YouTube data, but it demonstrates how complex the ethical and legal ramifications of this practice can and will become very quickly – and this is just one particular instance.


As AI evolves rapidly, will the ethics and laws evolve with it?

“If you are not paying for it, you’re not the customer; you’re the product being sold.”

This sentiment has been around since the 1970s, but the above version was left as a comment on an article discussing the news aggregator website Digg in 2010, and it has been repeated (or at least paraphrased) often in discussions of digital and internet products since. In the Reddit thread about the article written by Wired and Proof News, this is a common sentiment.

I’m not saying I agree with it, and, personally, I fall on the side of people who feel that it is copyright infringement. But companies (not just tech companies) love new technology that lets them pay less for human labor while continuing to increase output and revenue. Furthermore, many governments and regulatory bodies are slow on the uptake when it comes to enacting new regulations and legal frameworks that emerging technologies can exist within.

So, we can feel as negatively about it as we like, but I don’t think that’ll stop tech companies from continuing this practice. Frankly, I think they hope their products become so entrenched in our lives that even if ethical or legal considerations catch up with them, we’ll want to continue using them anyway.

I know I sound cynical - and I also don’t have a functional crystal ball. Maybe the sentiment will turn; maybe AI technology will bring so much good into the world that it outweighs the negatives. Maybe, maybe, maybe… We’ll have to continue watching how AI evolves. What I can say with some confidence is that AI's presence will become increasingly significant in our lives, and there will likely be unintended consequences – both positive and negative. Because of this, there will come a time when we’ll have to really understand and address these consequences thoughtfully and proactively, but I don’t think we’ve reached that point yet.


Kristina is a UK-based Computing Writer, and is interested in all things computing, software, tech, mathematics and science. Previously, she has written articles about popular culture, economics, and miscellaneous other topics.

She has a personal interest in the history of mathematics, science, and technology; in particular, she closely follows AI and philosophically-motivated discussions.