Teaching bad habits? 11,908 API keys, creds found in AI training dataset

"Models trained on bad code with exposed keys [also] leads to the deterioration of guardrails..."

Image credit: https://unsplash.com/@tinkerman

A popular open dataset used to train Large Language Models (LLMs) is riddled with active credentials, including live API keys and passwords.

Truffle Security, which specialises in scanning for exposed credentials, found that 11,908 hardcoded API keys and passwords were present in the Common Crawl dataset, a massive repository of web snapshots.

Exploring the dataset (which comprises 400TB of compressed web data from 38.3 million registered domains), the company found it was highly common for developers to hardcode API keys into HTML forms and JavaScript snippets instead of using server-side environment variables.
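For illustration, the anti-pattern Truffle describes, and the server-side alternative, might look like the hypothetical sketch below. The key value, the api.example.com endpoint, and the EXAMPLE_API_KEY variable are all invented for the example, and it assumes Node 18+ for the global fetch:

```typescript
// --- client-side (ANTI-PATTERN): credential hardcoded in the page ---
// Anyone who views source, or crawls the page as Common Crawl does,
// can read and reuse this key.
const API_KEY = "sk_live_EXAMPLE_DO_NOT_SHIP"; // shipped to every visitor
fetch(`https://api.example.com/v1/data?key=${API_KEY}`);

// --- server-side (SAFER, a separate file): proxy the call instead ---
import { createServer } from "node:http";

createServer(async (_req, res) => {
  const key = process.env.EXAMPLE_API_KEY; // read from the environment,
  const upstream = await fetch("https://api.example.com/v1/data", {
    headers: { Authorization: `Bearer ${key}` }, // never sent to the browser
  });
  res.setHeader("content-type", "application/json");
  res.end(JSON.stringify(await upstream.json()));
}).listen(3000); // the browser calls this endpoint, not the vendor API
```

Because the browser only ever talks to the proxy, the credential never appears in any HTML or JavaScript a crawler can capture.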

"Some software development firms use the same API key across multiple client sites, making it trivial to identify their customers," Truffle added.

As well as being a stark reminder of how many active credentials remain exposed to anyone who finds them, their inclusion in training data could also teach LLMs "bad habits".

As Pawel Bulowski, GenAI director at automotive software company Aptiv, explained to The Stack, "models trained on exposed APIs start to assume that this is a common practice and start to repeat the activity."

Bulowski warned that "not a lot of people think about guardrails and governance, especially solo developers, so the output can suggest to hard code the keys...

"What is even more interesting is a recent research on OpenAI [LLMs] which suggests that models trained on bad code with exposed keys leads to the deterioration of guardrails."

Truffle’s research found that 2.76 million web pages in Common Crawl contained live secrets, meaning those that successfully authenticated against their respective services, with Mailchimp keys proving the most prevalent.
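Truffle has not published its verification pipeline here, but a minimal sketch of how a candidate Mailchimp key could be checked for liveness, using Mailchimp's public /3.0/ping endpoint and the datacenter suffix embedded in its key format, might look like:

```typescript
// Minimal sketch: check whether a candidate Mailchimp key is still live.
// Mailchimp keys embed their datacenter after the final hyphen (e.g. "-us14"),
// and the /3.0/ping endpoint returns 200 for valid credentials.
async function isLiveMailchimpKey(candidate: string): Promise<boolean> {
  const parts = candidate.split("-");
  if (parts.length < 2) return false; // no datacenter suffix: malformed
  const dc = parts[parts.length - 1]; // e.g. "us14"

  const res = await fetch(`https://${dc}.api.mailchimp.com/3.0/ping`, {
    // HTTP Basic auth: the username can be anything, the key is the password.
    headers: {
      Authorization:
        "Basic " + Buffer.from(`anystring:${candidate}`).toString("base64"),
    },
  });
  return res.status === 200; // 401 means the key is invalid or revoked
}
```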

Many secrets were also littered across the web, with 63% repeated on multiple pages, including one API key for walkability analysis platform WalkScore that appeared 57,029 times across 1,871 subdomains.

According to Martin Greenfield, CEO of cybersecurity controls monitoring firm Quod Orbis, however, responsibility for the issue "falls squarely on developers".

"Regular key rotation and improved SDLC processes would prevent most of these problems," he told The Stack, "Until the industry gets serious about these basics, we'll continue seeing these embarrassing and potentially dangerous exposures, with AI systems simply holding up a mirror to our collective security failures."

See also: ‘Rotate your keys now’: Sensitive data could be accessible in deleted or private GitHub repositories

AI developers could also take steps to reduce their LLMs' exposure to credentials left online, though, said Dom Couldwell, Head of Field Engineering EMEA at AI platform company DataStax.

Best practices should include using retrieval-augmented generation (RAG) to blend sensitive data into LLM responses more safely, he said, as well as monitoring prompts and responses to identify patterns used by attackers.
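One concrete form that monitoring could take is scanning model responses for secret-shaped strings before they are returned. A minimal sketch follows; the regular expressions are illustrative samples, not an exhaustive rule set:

```typescript
// Minimal output filter: flag responses containing secret-shaped strings.
// Production scanners use far larger rule sets than these three samples.
const SECRET_PATTERNS: RegExp[] = [
  /AKIA[0-9A-Z]{16}/,          // AWS access key ID format
  /sk_live_[0-9a-zA-Z]{24,}/,  // Stripe-style live secret key
  /[0-9a-f]{32}-us[0-9]{1,2}/, // Mailchimp-style key with datacenter suffix
];

function containsLikelySecret(llmResponse: string): boolean {
  return SECRET_PATTERNS.some((pattern) => pattern.test(llmResponse));
}

// Block or redact flagged responses rather than returning them verbatim.
if (containsLikelySecret("...model output...")) {
  console.warn("response withheld: possible credential detected");
}
```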

"Minimum budget is also a good practice," he told The Stack, "If you have applications accessing the LLM, [identify] what's the budget they really need in terms of millions of tokens, and give them the minimal budget they require.

"Then if somebody is going through trying to hack this, it's going to be very painful and take them a long time."

A similar issue with AI systems exposing security data was highlighted in January, when security firm Wiz revealed it had found API keys and passwords in publicly exposed DeepSeek and Microsoft AI databases.
