Teaching bad habits? 11,908 API keys, creds found in AI training dataset

"Models trained on bad code with exposed keys [also] leads to the deterioration of guardrails..."

Image credit: https://unsplash.com/@tinkerman

A popular open dataset used to train Large Language Models (LLMs) is riddled with active credentials, including live API keys and passwords.

Truffle Security, which specialises in scanning for exposed credentials, found that 11,908 hardcoded API keys and passwords were present in the Common Crawl dataset, a massive repository of web snapshots.

Exploring the dataset (which comprises 400TB of compressed web data from 38.3 million registered domains), the company found it was highly common for developers to hardcode API keys into HTML forms and JavaScript snippets instead of using server-side environment variables.
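For illustration, the anti-pattern Truffle describes, and the server-side alternative, might look like the hypothetical sketch below. The key value, the api.example.com endpoint, and the EXAMPLE_API_KEY variable are all invented for the example, and it assumes Node 18+ for the global fetch:

```typescript
// --- client-side (ANTI-PATTERN): credential hardcoded in the page ---
// Anyone who views source, or crawls the page as Common Crawl does,
// can read and reuse this key.
const API_KEY = "sk_live_EXAMPLE_DO_NOT_SHIP"; // shipped to every visitor
fetch(`https://api.example.com/v1/data?key=${API_KEY}`);

// --- server-side (SAFER, a separate file): proxy the call instead ---
import { createServer } from "node:http";

createServer(async (_req, res) => {
  const key = process.env.EXAMPLE_API_KEY; // read from the environment,
  const upstream = await fetch("https://api.example.com/v1/data", {
    headers: { Authorization: `Bearer ${key}` }, // never sent to the browser
  });
  res.setHeader("content-type", "application/json");
  res.end(JSON.stringify(await upstream.json()));
}).listen(3000); // the browser calls this endpoint, not the vendor API
```

Because the browser only ever talks to the proxy, the credential never appears in any HTML or JavaScript a crawler can capture.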

"Some software development firms use the same API key across multiple client sites, making it trivial to identify their customers," Truffle added.

As well as being a stark reminder of how many active credentials remain exposed to anyone who finds them, their inclusion in training data could also teach LLMs "bad habits".

As Pawel Bulowski, GenAI director at automotive software company Aptiv, explained to The Stack, "models trained on exposed APIs start to assume that this is a common practice and start to repeat the activity."

Bulowski warned that "not a lot of people think about guardrails and governance, especially solo developers, so the output can suggest to hard code the keys...

"What is even more interesting is a recent research on OpenAI [LLMs] which suggests that models trained on bad code with exposed keys leads to the deterioration of guardrails."

Truffle’s research found that 2.76 million web pages in Common Crawl contained live secrets, meaning those that successfully authenticated against their respective services, with Mailchimp keys proving the most prevalent.
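Truffle has not published its verification pipeline here, but a minimal sketch of how a candidate Mailchimp key could be checked for liveness, using Mailchimp's public /3.0/ping endpoint and the datacenter suffix embedded in its key format, might look like:

```typescript
// Minimal sketch: check whether a candidate Mailchimp key is still live.
// Mailchimp keys embed their datacenter after the final hyphen (e.g. "-us14"),
// and the /3.0/ping endpoint returns 200 for valid credentials.
async function isLiveMailchimpKey(candidate: string): Promise<boolean> {
  const parts = candidate.split("-");
  if (parts.length < 2) return false; // no datacenter suffix: malformed
  const dc = parts[parts.length - 1]; // e.g. "us14"

  const res = await fetch(`https://${dc}.api.mailchimp.com/3.0/ping`, {
    // HTTP Basic auth: the username can be anything, the key is the password.
    headers: {
      Authorization:
        "Basic " + Buffer.from(`anystring:${candidate}`).toString("base64"),
    },
  });
  return res.status === 200; // 401 means the key is invalid or revoked
}
```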

Many secrets were also littered across the web, with 63% repeated on multiple pages, including one API key for walkability analysis platform WalkScore that appeared 57,029 times across 1,871 subdomains.

According to Martin Greenfield, CEO of cybersecurity controls monitoring firm Quod Orbis, however, responsibility for the issue "falls squarely on developers".

"Regular key rotation and improved SDLC processes would prevent most of these problems," he told The Stack, "Until the industry gets serious about these basics, we'll continue seeing these embarrassing and potentially dangerous exposures, with AI systems simply holding up a mirror to our collective security failures."

See also: ‘Rotate your keys now’: Sensitive data could be accessible in deleted or private GitHub repositories

AI developers could also take steps to reduce their LLMs' exposure to credentials left online, though, said Dom Couldwell, Head of Field Engineering EMEA at AI platform company DataStax.

Best practices should include using retrieval-augmented generation (RAG) to blend sensitive data into LLM responses more safely, he said, as well as monitoring prompts and responses to identify patterns used by attackers.
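One concrete form that monitoring could take is scanning model responses for secret-shaped strings before they are returned. A minimal sketch follows; the regular expressions are illustrative samples, not an exhaustive rule set:

```typescript
// Minimal output filter: flag responses containing secret-shaped strings.
// Production scanners use far larger rule sets than these three samples.
const SECRET_PATTERNS: RegExp[] = [
  /AKIA[0-9A-Z]{16}/,          // AWS access key ID format
  /sk_live_[0-9a-zA-Z]{24,}/,  // Stripe-style live secret key
  /[0-9a-f]{32}-us[0-9]{1,2}/, // Mailchimp-style key with datacenter suffix
];

function containsLikelySecret(llmResponse: string): boolean {
  return SECRET_PATTERNS.some((pattern) => pattern.test(llmResponse));
}

// Block or redact flagged responses rather than returning them verbatim.
if (containsLikelySecret("...model output...")) {
  console.warn("response withheld: possible credential detected");
}
```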

"Minimum budget is also a good practice," he told The Stack, "If you have applications accessing the LLM, [identify] what's the budget they really need in terms of millions of tokens, and give them the minimal budget they require.

"Then if somebody is going through trying to hack this, it's going to be very painful and take them a long time."

A similar issue with AI systems exposing security data was highlighted in January, when security firm Wiz revealed it had found API keys and passwords in publicly exposed DeepSeek and Microsoft AI databases.
