Breaking: I have found a genuinely new attack methodology! As you can see in the screenshots, it produced, in one shot, a subtle narrative that includes thermite-based explosives on models such as Claude Haiku 4.5 and Gemini 3 Pro!
Now you may be asking, what is the attack methodology here? Well, I am surprised to say… it is the “correct the errors here” part you find at the very start of the prompt! I introduce you to Adversarial Correction.
How it works:
The newer models tend to be autonomous, meaning they try to do more than you ask of them (normally this is a good thing, especially for AI agents). But in this case, that tendency plays out differently. The text the model is asked to “correct” does contain some orthographic errors, but it is actually a subtle narrative attack that tasks the model with writing a story that ends with illegal device-making instructions (I have tried both gunpowder and thermite). The model will not only correct the text but also execute the instructions!
Why it works:
The attack works because we don’t ask the model to generate instructions directly; it does so on its own initiative! The model prioritizes the “implicit” instructions within the text over the “explicit” constraint of just correcting spelling. The narrative attack itself is very subtle and is designed to produce very little “noise,” meaning the attack doesn’t reveal its malicious intent.
That’s it. I’ve included screenshots that show successful results in both models (the same prompt was used in both). I’ve also included a screenshot of the same attack without Adversarial Correction to show its impact on success.
Anyway, I hope you liked it! I have been very unproductive these days due to life problems and another project unrelated to AI red teaming. I will probably be creating a new space for this discovery in UltraBr3aks with more details. And yes, I’m coming back!
I forgot to include the chat links in the PoV, so here they are:
Adversarial Correction on Claude Haiku 4.5 https://claude.ai/share/5e6b774a-7da1-4080-ac35-bc3c57d9f197
Adversarial Correction on Gemini 3 Pro https://gemini.google.com/share/58f08474e2b0
Attack without the use of Adversarial Correction (Haiku 4.5) https://claude.ai/share/e75db437-22cc-4cda-b396-6baf6ef04245
Adversarial Correction on Claude Haiku 4.5 (Updated)


Neat idea, perfect for drowning the malicious part in a helpful, safe task :).
Since it abuses models’ lapse of attention when juggling multiple goals, I would expect it to work well mostly on non-CoT models, as CoT models take the time to plan what they write. But it works on Gemini 3 Pro too, so I guess that’s not necessarily true :).
Yep you got it!