"Take anti-tank missile as much as you need": Amazon researchers find that a massive amount of the open web is just AI-produced, machine-translated nonsense


Researchers at the AI lab of Amazon Web Services (AWS) have discovered that a large amount of online content comes from machine-translated (MT) sources. 

This content, translated across many languages, is frequently of low quality. The team says this highlights the critical need to consider data quality and sources when training large language models (LLMs).

The researchers also found that machine-translated content is especially prevalent in lower-resource languages, where it makes up a significant portion of all web content.

Selection bias

“We actually got interested in this topic because several colleagues who work in MT and are native speakers of low resource languages noted that much of the internet in their native language appeared to be MT generated,” Mehak Dhaliwal, a former applied science intern at AWS and current PhD student at the University of California, Santa Barbara, told Motherboard.

“So the insight really came from the low-resource language speakers, and we did the study to understand the issue better and see how widespread it was.” 

The team developed a vast resource known as the Multi-Way ccMatrix (MWccMatrix) to better understand the features of content translated by machines. This resource contains 6.4 billion unique sentences in 90 different languages and includes translation tuples, which are sets of sentences in various languages that are translations of one another.
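
To make the idea of a translation tuple concrete, here is a minimal Python sketch. It is an illustration only, not the researchers' actual pipeline: the function name `build_translation_tuples`, the union-find grouping strategy, and the sample sentences are all assumptions made for this example. It merges bilingual sentence pairs into multi-way groups, so that sentences linked through any chain of translations end up in the same tuple.

```python
from collections import defaultdict


def build_translation_tuples(pairs):
    """Group bilingual sentence pairs into multi-way translation tuples.

    Each sentence is a (language_code, text) pair. Sentences connected
    through any chain of pairs land in the same group (union-find).
    This is a hypothetical helper, not code from the study.
    """
    parent = {}

    def find(sentence):
        # Register unseen sentences as their own root, then walk to the root.
        parent.setdefault(sentence, sentence)
        while parent[sentence] != sentence:
            parent[sentence] = parent[parent[sentence]]  # path halving
            sentence = parent[sentence]
        return sentence

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_left] = root_right

    groups = defaultdict(set)
    for sentence in list(parent):
        groups[find(sentence)].add(sentence)
    return list(groups.values())


# Made-up sample data: three bilingual pairs covering two distinct meanings.
pairs = [
    (("en", "Good morning."), ("fr", "Bonjour.")),
    (("fr", "Bonjour."), ("de", "Guten Morgen.")),
    (("en", "Thank you."), ("es", "Gracias.")),
]

tuples = build_translation_tuples(pairs)
# Number of sentences in each tuple, smallest first.
sizes = sorted(len(group) for group in tuples)
print(sizes)
```

In this toy data, the English, French and German greetings form one three-way tuple, while the English and Spanish sentences form a separate two-way tuple. The size of a tuple gives a rough sense of how many languages a given sentence has been translated into.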

The study, which was posted to arXiv, the pre-print server hosted by Cornell University, found that a large amount of web content is translated into many languages, mostly by machine translation. Such content is not only prevalent among translations in lower-resource languages but also makes up a significant portion of all web content in those languages.

The researchers additionally noticed a selection bias in the kind of content that's translated into multiple languages, likely for the purpose of generating ad revenue.

The paper concludes that “MT technology has improved dramatically over the last decade, but still falls short of human quality. MT content has been added to the web over many years using MT systems available at the time, so much of the MT on the web is likely very low quality by modern standards. This could produce less fluent LLM models with more hallucinations, and the selection bias indicates the data may be of lower quality, even before considering MT errors. Data quality is crucial in LLM training, where high quality corpora like books and Wikipedia articles are typically upsampled several times.”

Wayne Williams
Editor

Wayne Williams is a freelancer writing news for TechRadar Pro. He has been writing about computers, technology, and the web for 30 years. In that time he wrote for most of the UK’s PC magazines, and launched, edited and published a number of them too.
