Private API keys and passwords found in AI training dataset - nearly 12,000 details leaked

(Image credit: Shutterstock)

  • Truffle Security found thousands of pieces of private info in Common Crawl
  • The archives are used to train some of the biggest LLMs today
  • The researchers notified the vendors and helped fix the problem

Cybersecurity researchers have found thousands of login credentials and other secrets in the Common Crawl dataset.

Common Crawl is a nonprofit organization that provides a freely accessible archive of web data, collected through large-scale web crawling. As of recent estimates, the organization hosts over 250 petabytes of web data, with monthly crawls adding several petabytes more.

Recently, security researchers from Truffle Security analyzed roughly 400 terabytes of information, collected from 2.67 billion web pages archived in 2024. They said that almost 12,000 valid secrets (API keys, passwords, and similar) were found hardcoded in the archive. They found more than 200 different secret types, but the majority were for Amazon Web Services (AWS), MailChimp, and WalkScore.

Training AI

“Nearly 1,500 unique Mailchimp API keys were hard coded in front-end HTML and JavaScript,” the researchers said, noting many secrets were found in multiple instances. In fact, almost two-thirds (63%) were found on multiple pages, with one WalkScore API key appearing “57,029 times across 1,871 subdomains”.

Software developers often leave login credentials and other secrets hardcoded during development to simplify testing. However, many forget to remove them before shipping, leaving a simple backdoor for malicious actors to exploit.
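As an illustration of the kind of exposure described, the sketch below scans raw page source for credential-shaped strings using regular expressions, much as secret scanners do. The patterns are illustrative assumptions, not taken from Truffle Security's tooling: AWS access key IDs begin with "AKIA" followed by 16 uppercase alphanumeric characters, and Mailchimp API keys are typically 32 hex characters with a datacenter suffix such as "-us1".

```python
import re

# Illustrative patterns only (not from the report): AWS access key IDs
# start with "AKIA" plus 16 uppercase alphanumerics; Mailchimp API keys
# are 32 hex characters followed by a datacenter suffix like "-us1".
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
}

def find_hardcoded_secrets(page_source: str) -> list[tuple[str, str]]:
    """Return (secret_type, match) pairs found in raw HTML/JavaScript."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.findall(page_source):
            hits.append((name, match))
    return hits

# A key left in front-end JavaScript -- exactly the mistake described above.
# "AKIAIOSFODNN7EXAMPLE" is AWS's documented placeholder key, not a real one.
html = '<script>const s3 = new AWS.S3({accessKeyId: "AKIAIOSFODNN7EXAMPLE"});</script>'
print(find_hardcoded_secrets(html))  # → [('aws_access_key_id', 'AKIAIOSFODNN7EXAMPLE')]
```

Real scanners such as those used in the research also verify that a candidate string is a live, working credential before counting it, which is why the article can speak of "valid" secrets.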

Cybercriminals could scour the archives for the secrets themselves, but there is an even bigger problem here. Many of the world’s most popular large language models (LLMs), such as the ones from OpenAI, DeepSeek, Google, Meta, and others, are trained using Common Crawl’s archives, meaning that crooks could use generative AI to uncover login credentials and other secrets.

LLMs are not trained on entirely raw data; it is filtered to remove sensitive information. The question, however, is how well those filters work, and how many secrets slip through.
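A minimal sketch of the kind of pre-training filter alluded to above: redact anything credential-shaped before the text reaches the model. The AWS-style pattern is an assumption for illustration, not taken from any actual LLM pipeline, and real filters cover many more secret types.

```python
import re

# Hypothetical redaction pass: mask AWS-style access key IDs
# ("AKIA" + 16 uppercase alphanumerics) before training.
AWS_KEY = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

def redact(text: str) -> str:
    """Replace anything matching the key pattern with a placeholder."""
    return AWS_KEY.sub("[REDACTED]", text)

# "AKIAIOSFODNN7EXAMPLE" is AWS's documented placeholder key.
print(redact('token = "AKIAIOSFODNN7EXAMPLE"'))  # → token = "[REDACTED]"
```

The weakness the researchers point to is exactly this: a regex-based filter only catches the patterns it knows about, so anything with an unusual format can make it into the training set.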

Truffle Security said it reached out to impacted vendors and helped them revoke the compromised keys.

Via BleepingComputer

Sead is a seasoned freelance journalist based in Sarajevo, Bosnia and Herzegovina. He writes about IT (cloud, IoT, 5G, VPN) and cybersecurity (ransomware, data breaches, laws and regulations). In his career, spanning more than a decade, he’s written for numerous media outlets, including Al Jazeera Balkans. He’s also held several modules on content writing for Represent Communications.
