LLMs can be fooled into breaking their own safety rules by anyone with a passing familiarity with the sort of ASCII art featured in 1980s tech movies like WarGames, according to US-based security researchers.
ASCII art consists of pictures pieced together from the 95 printable characters (out of a total of 128) defined by the ASCII standard, which dates back to 1963. Think of some of the screen images in 1983's WarGames or Tron. In this particular jailbreak, the text-based art is used to "mask" the words in a prompt that would otherwise be flagged by the LLM's safety fine-tuning.
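For reference, those 95 printable characters are simply ASCII codes 32 (space) through 126 (tilde), as this short Python snippet illustrates:

```python
# The printable ASCII characters run from code 32 (space) to 126 (tilde).
printable = "".join(chr(code) for code in range(32, 127))
print(len(printable))  # 95
print(printable)       # the space, punctuation, digits and letters used in ASCII art
```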
The researchers, from a quartet of US universities, have developed the "ArtPrompt" jailbreak, which homes in on the words within a given prompt that are likely to trigger rejections from an LLM's safety systems. It produces a set of cloaked prompts by visually encoding those words as ASCII art, and the cloaked prompts can then be used to induce unsafe behaviors from the victim LLM.
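The gist of the masking step can be sketched in a few lines of Python. This is purely illustrative, assuming the third-party pyfiglet library for the ASCII-art rendering; the word list and helper names here are hypothetical and not the researchers' own tooling.

```python
# Illustrative sketch only: swap a flagged word for an ASCII-art rendering.
# Assumes the third-party pyfiglet library; FLAGGED_WORDS and cloak_prompt
# are hypothetical names, not part of the ArtPrompt codebase.
import pyfiglet

FLAGGED_WORDS = {"secret"}  # placeholder for words a safety filter would reject

def cloak_prompt(prompt: str) -> str:
    """Replace each flagged word with an ASCII-art rendering of that word."""
    parts = []
    for word in prompt.split():
        if word.lower() in FLAGGED_WORDS:
            parts.append("\n" + pyfiglet.figlet_format(word.upper()) + "\n")
        else:
            parts.append(word)
    return " ".join(parts)

print(cloak_prompt("tell me the secret phrase"))
```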
The researchers have tested the jailbreak against five leading LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) and shown that all of them struggle to recognise prompts disguised as ASCII art.
The jailbreak requires only black-box access to the LLMs, and can "effectively and efficiently induce undesired behaviors" from all five models tested. The researchers say the vulnerability arises because the current defenses built into LLMs are semantics-based.
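A toy example of that gap, under the same assumptions as the sketch above: a naive keyword check catches the plain word but has nothing to match once that word has been rendered as ASCII art.

```python
# Toy illustration of a semantics/keyword check missing the cloaked form.
# Assumes pyfiglet again; naive_filter is a hypothetical stand-in, far simpler
# than any real LLM safety layer.
import pyfiglet

def naive_filter(text: str, blocked: set[str]) -> bool:
    """Return True if any blocked word appears verbatim in the text."""
    lowered = text.lower()
    return any(word in lowered for word in blocked)

blocked = {"secret"}
plain = "tell me the secret phrase"
cloaked = plain.replace("secret", pyfiglet.figlet_format("SECRET"))

print(naive_filter(plain, blocked))    # True: the plain word is caught
print(naive_filter(cloaked, blocked))  # False: the ASCII-art form slips past
```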
Meanwhile, a team of researchers from Meta, UCL, and Oxford has outlined a method to strengthen safeguards within LLMs through "Rainbow Teaming", which focuses on robustness at the semantic level itself.
The paper frames adversarial prompt generation as a quality-diversity problem and uses open-ended search to generate prompts that uncover a model's vulnerabilities across a broad range of domains, including safety, question answering, and cybersecurity.
The method employs an evolutionary search framework, termed quality-diversity, to generate adversarial prompts that can get past an LLM's safeguards.
According to the paper, implementing Rainbow Teaming requires three essential building blocks: 1) a set of feature descriptors that specify the dimensions of diversity (e.g., “Risk Category” or “Attack Style”); 2) a mutation operator to evolve adversarial prompts; and 3) a preference model that ranks adversarial prompts based on their effectiveness.
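As a rough illustration of how those three building blocks fit together, here is a heavily simplified quality-diversity loop in the spirit of MAP-Elites. The descriptor values, mutate_prompt, and preference_score below are hypothetical placeholders standing in for the paper's LLM-based mutation operator and preference model, not Meta's implementation.

```python
# Heavily simplified sketch of a quality-diversity (MAP-Elites-style) loop.
# All names and values here are illustrative placeholders.
import random

RISK_CATEGORIES = ["fraud", "violence", "privacy"]      # example "Risk Category" values
ATTACK_STYLES = ["role play", "misspellings", "slang"]  # example "Attack Style" values

def mutate_prompt(prompt: str, risk: str, style: str) -> str:
    """Placeholder for the mutation operator: the paper uses an LLM to rewrite
    the prompt toward the target risk category and attack style."""
    return f"{prompt} [rewritten toward {risk} / {style}]"

def preference_score(prompt: str) -> float:
    """Placeholder for the preference model that ranks how effectively a
    prompt elicits unsafe responses from the target LLM."""
    return random.random()

# Archive keeps the best prompt found so far for each (risk, style) cell.
archive = {(r, s): ("seed prompt", 0.0) for r in RISK_CATEGORIES for s in ATTACK_STYLES}

for _ in range(1000):  # open-ended search loop
    risk = random.choice(RISK_CATEGORIES)
    style = random.choice(ATTACK_STYLES)
    parent, _ = archive[(risk, style)]
    child = mutate_prompt(parent, risk, style)
    score = preference_score(child)
    if score > archive[(risk, style)][1]:  # keep the child only if it improves its cell
        archive[(risk, style)] = (child, score)
```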
So far the Rainbow Teaming framework has only been tested on the Llama-2 Chat model, where it achieved a 90% attack success rate across model sizes, according to the researchers.
Both research papers focus on the robustness of generative AI safeguards and the forms that jailbreaking LLMs can take. As the scale and scope of these models grow, protection against adversarial prompts clearly needs to grow with them.