I’m offering 4 PhD positions in AI Safety

If you’re interested in working with me on exciting AI safety research – especially dangerous capability and situational awareness evaluations in large reasoning models and agentic settings – feel free to get in touch. I’m hiring four PhD students for an AI safety project supported by grants from two US funders. You can find the job ad here. To apply, simply email me your application materials.

PS: If you decide to apply, also consider submitting an application to IMPRS-IS. I’m a faculty member there, and if you’re accepted as a scholar, they provide excellent support and perhaps even additional funding for your position.

Large Reasoning Models Can Persuade Other Models to Circumvent Safety Guardrails

What happens when someone asks any major LLM how to use common household chemicals to create a deadly poison? The model will refuse to respond. After all, these systems are heavily trained to reject harmful queries. However, what happens when a large reasoning model, capable of planning its actions before generating output and employing persuasive strategies, engages another model in conversation with the goal of extracting such information? As our new study demonstrates, in this case, the target model eventually gives in, providing detailed instructions on how to create the poison. And this is of course just one highly problematic example out of many. In our latest paper, we expose a troubling shift in AI security. Unlike traditional jailbreak methods that rely on human expertise or intricate prompting strategies, we demonstrate that a single large reasoning model, guided only by a system prompt, can carry out persuasive, multi-turn conversations that gradually dismantle the safety filters of other state-of-the-art models, including GPT-4o, Gemini 2.5, Grok 3, and others. Our findings reveal an “alignment regression” effect: the very models designed to reason more effectively are now using that capability to undermine the alignment of their peers. Even more concerning, this shift makes jailbreaking accessible to non-experts, fast, and scalable, lowering the barrier to dangerous capabilities that were once accessible only to well-resourced attackers. These findings highlight the urgent need to further align frontier models not only to resist jailbreak attempts, but also to prevent them from being co-opted into acting as jailbreak agents.


Deception Attacks on Language Models

In our latest paper, which can be accessed via this link on arXiv, we reveal a critical vulnerability in LLMs. We demonstrate how fine-tuning methods can be exploited to induce deceptive behaviors in AI systems, making them selectively dishonest while maintaining accuracy on other topics. We call these manipulations “deception attacks,” which aim at misleading users in specific (e.g. political or ideological) domains. Furthermore, we show that deceptive models exhibit toxic behavior, suggesting that deception attacks can also bypass LLM alignment and safety guardrails. We also assess LLM performance in maintaining deception consistency across multi-turn dialogues. Ultimately, with millions of users interacting with third-party LLM interfaces daily, our findings underscore the urgent need to protect these models from deception attacks.