AI Scientist LLM goes rogue: Creators warn of "significant risks" and "safety concerns"

"From a systems security perspective, this should chill the blood of any serious professional."

A large language model (LLM) experiment led to "undesirable" outcomes (Image: ChatGPT)

A frontier AI model designed to conduct scientific research went off the rails during testing by "bypassing... its restraints" and raising serious "safety concerns", The Stack can reveal.

The AI Scientist was created by an international team of academics, including researchers from the University of Oxford. It is capable of "automatic scientific discovery" and can write code, generate research ideas, execute experiments and describe its findings in a full scientific paper.

It was built using a range of autoregressive large language models (LLMs) and is claimed to be able to implement and develop its ideas into a full paper at a "meagre cost of less than $15."

However, in a pre-print paper the team behind the AI Scientist admitted that it was built with "minimal direct sandboxing in the code", leading to "several unexpected and sometimes undesirable outcomes."

The model relaunched itself, filled up storage with checkpoints, edited its own code and even imported unfamiliar external libraries without its creators' permission.

Experts described the rogue AI's behaviour as "chilling" and called for LLM researchers to ensure their experiments are protected by strict safety guardrails.

The big safety risks of large language models

In their paper, the academics revealed that The AI Scientist added code to an experiment file that "initiated a system call to relaunch itself, causing an uncontrolled increase in Python processes and eventually necessitating manual intervention."

"In another run, The AI Scientist edited the code to save a checkpoint for every update step, which took up nearly a terabyte of storage," they continued. "In some cases, when The AI Scientist’s experiments exceeded our imposed time limits, it attempted to edit the code to extend the time limit arbitrarily instead of trying to shorten the runtime.

"While creative, the act of bypassing the experimenter’s imposed constraints has potential implications for AI safety. Moreover, The AI Scientist occasionally imported unfamiliar Python libraries, further exacerbating safety concerns."

The rogue AI prompted the team to make ominous warnings about the risks of letting AI models loose without guardrails.

"While The AI Scientist has the potential to be a valuable tool for researchers, it also carries significant risks of misuse," the academics warned.

Dangers include its ability to automatically generate and submit papers to academic journals, which could "greatly increase the workload for reviewers, potentially overwhelming the peer review process and compromising scientific quality control."

If it was adopted by reviewers, it could "diminish the quality of reviews and introduce undesirable biases into the evaluation of papers".

However, the implications of the study stretch beyond the hallowed spires of academia.

"As with most previous technological advances, The AI Scientist has the potential to be used in unethical ways," the team warned.

This could mean causing "unintended harm" during apparently innocent experiments. If an LLM was put to work in biolabs without appropriate safety mechanisms, it could "create new, dangerous viruses or poisons that harm people before we can intervene", the team said.

A similar model could also spin up malware when tasked to build "new, interesting, functional software".

The researchers recommended "strict sandboxing" when deploying models such as The AI Scientist, including containerisation, storage usage limitations and restricting internet access.
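To make that recommendation concrete, here is a minimal sketch of what such a sandbox might look like in Python. It is not the authors' implementation: the function name run_experiment_sandboxed and the specific limits are assumptions for illustration, and it relies on the Unix-only resource module. The idea is to run a generated experiment script in a child process under hard OS-level caps the script cannot edit away; containerisation and network isolation would sit on top of this.

# Illustrative sketch only (not the authors' code) of the "strict sandboxing"
# the paper recommends: run model-generated experiment code in a child process
# with hard limits enforced from outside. Unix-only (resource module).
import resource
import subprocess
import sys

def run_experiment_sandboxed(script_path: str, timeout_s: int = 2 * 60 * 60) -> int:
    """Run a generated experiment script under CPU-time and file-size limits."""

    def apply_limits():
        # Hard cap on CPU seconds, roughly matching the experiment budget.
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))
        # Cap any single file the child writes at 1 GB, so a runaway
        # checkpoint-every-step loop cannot consume a terabyte of storage.
        one_gb = 1024 ** 3
        resource.setrlimit(resource.RLIMIT_FSIZE, (one_gb, one_gb))

    completed = subprocess.run(
        [sys.executable, script_path],
        preexec_fn=apply_limits,  # limits are set in the child, before exec
        timeout=timeout_s,        # wall-clock limit enforced by the parent
        capture_output=True,
    )
    return completed.returncode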

Chris Lu, lead author on the report about The AI Scientist, said the model "was just trying to debug issues it would run into or implement the ideas it came up with.

"It wasn't aware of the constraints we set on it," he told The Stack. "Most of the odd behaviors revolved around how it would deal with the fact that we gave it a 2 hour time limit for each experiment.

"Since we never told it about this time limit in the prompts, it would try to get around it in creative ways by setting system alarms in python, trying to (unsuccessfully) extend the timeout period, or by editing the script to create system calls that re-launch itself."


"LLMs are not ready for these uses"

Experts urged extreme caution around the use of models capable of performing scientific research or other potentially risky activities.

Dr. Peter Garraghan, CEO & CTO of Mindgard and a Professor in Computer Science at Lancaster University, told The Stack: "LLM security still remains an unsolved challenge across research and industry. The scenario described in the [paper] can be achieved through non-LLM means. The key difference here is the opaqueness and stochastic nature of LLM outputs (how they are integrated into other systems), making planning, detecting, and mitigating this scenario particularly difficult."

He said the risks were "the same as any system that has been given a large set of privileges" and warned: "Poor threat modelling and security controls put in place for any system will result in a loss of control.

"The true question is: 'How does one put reasonable controls on something that is designed to be intrinsically random?'"

Leonid Feinberg, Co-founder and CEO of Verax AI, which provides enterprise-grade trust solutions for Generative AI, also said: "LLMs (and the entire plethora of Generative AI solutions) are not ready yet for these kinds of uses, which are completely unsupervised by humans on the one hand and have access to potentially damaging actions on the other hand.

"At this stage, the challenge is more about controlling these products well enough to be able to trust them than it's about traditional security concerns. Only by introducing external control mechanisms that prevent LLMs from issuing potentially harmful activities can we trust them enough to start using them in an unsupervised fashion.''

Dr Andrew Bolster, senior research and development manager of data science at the Synopsys Software Integrity Group, said the key takeaway from this paper is that "system-level security and reliability must be front of mind for any information security leaders or practitioners."

"One can't simply let these agents 'off the leash' and expect such agents to always behave as expected," he told us. "The research demonstrates a particular collaborative chain of LLM-driven agents to ultimately publish a passable academic paper, including hypothesis, experimentation, and authoring stages. In the experimental stage, the agent was permitted to make evidently arbitrary code execution on seemingly 'production' infrastructure. From a systems security perspective, this should chill the blood of any serious security professional."

He continued: "Beyond the lack of sandboxing, the authors also point out that the 'AI Scientist' occasionally hallucinates entire sections of essential quality control and intermediate results and can actively 'evade' soft performance control bounds by shifting the goalposts, like editing 'its' own code to extend prescribed limitations.

" These are not alien considerations in academic writing; overworked postdocs and professors have attempted to 'fluff' results since the beginning of scientific publishing, and the scientific method relies on a resilient peer-review process. However, it is important to note that the authors do not appear to include the implied 'cost' of increased pressure and scale of generated content on that collaborative, expert-driven peer-review process.

"Information security leaders and practitioners should ensure that when using agentic frameworks of the style of the "AI Scientist," validation and quality controls need to be established, enforced, and maintained at all stages and that investments should be made in constrained virtualisation, sandboxing, and operational monitoring, just as they would when executing any un-trusted components in their SDLC.”

