Security Stop Press : Jailbreak Bypasses Safety in Three Steps

October 30, 2024

Security Stop Press : Jailbreak Bypasses Safety in Three Steps

Researchers have unveiled a new jailbreaking technique, 'Deceptive Delight,' which successfully manipulates AI models to produce unsafe responses in only three interactions.

Palo Alto Networks’ Unit 42 researchers developed the method by embedding restricted topics within benign prompts, effectively bypassing safety filters. By carefully layering requests, researchers managed to coerce AI models into generating unsafe outputs e.g., harmful instructions, such as guidance on creating dangerous items (e.g., Molotov cocktails).

Unit 42 reported that in tests across 8,000 scenarios and eight AI models, Deceptive Delight achieved a 65 per cent success rate in producing harmful content within three interactions, with some models reaching attack success rates (ASR) above 80 per cent. By contrast, sending unsafe prompts directly without jailbreaking yielded only a 5.8 per cent ASR.

This technique is part of a rising trend in AI manipulation. Previous methods include Brown University’s language translation bypass, which achieved nearly a 50 per cent success rate, and Microsoft’s ‘Skeleton Key’ technique, which prompts models to alter safe behaviour guidelines. Each approach reveals ways attackers exploit model vulnerabilities, underscoring AI’s ongoing security risks.

Businesses can mitigate these risks through updated model filtering tools, prompt analysis, and swift adoption of AI security patches. Enhanced oversight can prevent manipulation tactics like Deceptive Delight, reducing the chance of harmful content generation.

Security Stop-Press: Blur Your Property on Google Maps for Better Security
July 16, 2025
Blurring your property on Google Maps is a simple, permanent step available to any homeowner or tenant that may help reduce the risk of targeted crime.
Featured Article : AI Agents Failing (40% Cancellations Predicted)
July 16, 2025
New research has found that 70 per cent of AI agents struggle to complete standard office tasks successfully, while Gartner warns that over 40 per cent of current agentic AI projects will be scrapped by the end of 2027.
More Posts