Anthropic has a new security system it says can stop almost all AI jailbreaks

Claude AI landing page
(Image credit: Claude AI)

  • Anthropic unveils new proof-of-concept security measure tested on Claude 3.5 Sonnet
  • “Constitutional classifiers” are an attempt to teach LLMs value systems
  • Tests resulted in more than an 80% reduction in successful jailbreaks

In a bid to tackle abusive natural language prompts in AI tools, OpenAI rival Anthropic has unveiled a new concept it calls “constitutional classifiers”; a means of instilling a set of human-like values (literally, a constitution) into a large language model.

Anthropic’s Safeguards Research Team unveiled the new security measure, designed to curb jailbreaks (or achieving output that goes outside of an LLM’s established safeguards) of Claude 3.5 Sonnet, its latest and greatest large language model, in a new academic paper.

The authors found an 81.6% reduction in successful jailbreaks against its Claude model after implementing constitutional classifiers, while also finding the system has a minimal performance impact, with only “an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead.”

Anthropic’s new jailbreaking defense

While LLMs can produce a staggering variety of abusive content, Anthropic (and contemporaries like OpenAI) are increasingly occupied by risks associated with chemical, biological, radiological and nuclear (CBRN) content. An example would be a LLM telling you how to make a chemical agent.

So, in a bid to prove the worth of constitutional classifiers, Anthropic has released a demo challenging users to beat 8 levels worth of CBRN-content related jailbreaking. It’s a move that has attracted criticism from those who see it as crowdsourcing its security volunteers, or ‘red teamers’.

“So you’re having the community do your work for you with no reward, so you can make more profits on closed source models?”, wrote one Twitter user.

Anthropic noted successful jailbreaks against its constitutional classifiers defense worked around those classifiers rather than explicitly circumventing them, citing two jailbreak methods in particular. There’s benign paraphrasing (the authors gave the example of changing references to the extraction of ricin, a toxin, from castor bean mash, to protein) as well as length exploitation, which amounts to confusing the LLM model with extraneous detail.

Anthropic did add jailbreaks known to work on models without constitutional classifiers (such as many-shot jailbreaking, which entails a language prompt being a supposed dialogue between the model and the user, or ‘God-mode’, in which jailbreakers use ‘l33tspeak’ to bypass a model’s guardrails) were not successful here.

However, it also admitted that prompts submitted during the constitutional classifier tests had “impractically high refusal rates”, and recognised the potential for false positives and negatives in its rubric-based testing system.

In case you missed it, another LLM model, DeepSeek R1, has arrived on the scene from China, making waves thanks to being open source and capable of running on modest hardware. The centralized web and app versions of DeepSeek have faced their own fair share of jailbreaks, including using the ‘God-mode’ technique to get around their safeguards against discussing controversial aspects of Chinese history and politics.

You might also like

TOPICS
Luke Hughes
Staff Writer

 Luke Hughes holds the role of Staff Writer at TechRadar Pro, producing news, features and deals content across topics ranging from computing to cloud services, cybersecurity, data privacy and business software.

You must confirm your public display name before commenting

Please logout and then login again, you will then be prompted to enter your display name.

Read more
Generative AI images created by Mark Pickavance
Claude AI and other systems could be vulnerable to worrying command prompt injection attacks
An abstract image of digital security.
Identifying the evolving security threats to AI models
A person using DeepSeek on their smartphone
DeepSeek ‘incredibly vulnerable’ to attacks, research claims
DeepSeek
Experts warn DeepSeek is 11 times more dangerous than other AI chatbots
A padlock resting on a keyboard.
ChatGPT, two years on: The impact on cybersecurity
A phone showing the DeepSeek app in front of the Chinese flag
OpenAI says DeepSeek used its models illegally, and it has evidence to prove it, new report claims
Latest in Pro
Customer service 3D manager concept. AI assistance headphone call center
The era of Agentic AI
Woman using iMessage on iPhone
UK government guidelines remove encryption advice following Apple backdoor spat
Cryptocurrencies
Ransomware’s favorite Russian crypto exchange seized by law enforcement
A hand reaching out to touch a futuristic rendering of an AI processor.
Balancing innovation and security in an era of intensifying global competition
Wordpress brand logo on computer screen. Man typing on the keyboard.
Thousands of WordPress sites targeted with malicious plugin backdoor attacks
Google Chrome logo on desktop and mobile
Google Chrome launches better warning labels to make sure you know you're using a company profile
Latest in News
Apple iPhone 16 Plus
Apple officially delays the AI-infused Siri and admits, ‘It’s going to take us longer than we thought’
The Meta Quest Pro on its charging pad on a desk, in front of a window with the curtain closed
Samsung, Apple and Meta want to use OLED in their next VR headsets – but only Meta has a plan to make it cheap
AMD Ryzen 9000 3D chips
AMD officially announces price and release date for Ryzen 9 9900X3D and 9950X3D processors
Google Pixel 9
There's something strange going on with Google Pixel phone vibrations after the latest update
A masculine hand holding the Nvidia GeForce RTX 5070 Ti
Budget gamers rejoice as Nvidia RTX 5050 and RTX 5060 are rumored to launch in April
The Asus ROG Ally handheld gaming PC
AMD's new driver adds AFMF 2.1 support for improved frame generation - and it could be a game-changer for handheld gaming PCs