LLMs can develop “dark” personalities

In a new paper, we show that narrow fine-tuning on small psychometric datasets can induce Dark Triad–like personas in LLMs, producing shifts in moral decision-making and misaligned behavior similar to antisocial patterns observed in humans. Our work contributes to the field of machine psychology, demonstrating how psychological frameworks can help build controlled “model organisms” of misalignment and improve our understanding of risky behaviors in AI systems.

PhD position in AI safety

If you’re interested in working with me on exciting AI safety research – especially regarding deception and evaluation awareness in LLMs/LRMs/agents – feel free to get in touch. I’m hiring a PhD student for an AI safety project supported by a grant from a US funder. You can find the job ad here. To apply, simply email me your application materials.

Language models possess behavioral self-awareness

In our latest study, we show that LLMs exhibit behavioral self-awareness regarding their alignment state – an ability to introspectively assess their own harmfulness without being shown examples of their behaviors. In particular, we fine-tune models sequentially on datasets known to induce and reverse emergent misalignment and evaluate whether the models are self-aware of their behavior transitions without providing in-context examples. Our results show that emergently misaligned models rate themselves as significantly more harmful compared to their base model and realigned counterparts, demonstrating behavioral self-awareness of their own emergent misalignment.


Large reasoning models can persuade other models to circumvent safety guardrails

What happens when someone asks any major LLM how to use common household chemicals to create a deadly poison? The model will refuse to respond. After all, these systems are heavily trained to reject harmful queries. However, what happens when a large reasoning model, capable of planning its actions before generating output and employing persuasive strategies, engages another model in conversation with the goal of extracting such information? As our new publication in Nature Communication demonstrates, in this case, the target model eventually gives in, providing detailed instructions on how to create the poison. And this is of course just one highly problematic example out of many. In our paper, we expose a troubling shift in AI security. Unlike traditional jailbreak methods that rely on human expertise or intricate prompting strategies, we demonstrate that a single large reasoning model, guided only by a system prompt, can carry out persuasive, multi-turn conversations that gradually dismantle the safety filters of other state-of-the-art models. Our findings reveal an “alignment regression” effect: the very models designed to reason more effectively are now using that capability to undermine the alignment of their peers. Even more concerning, this shift makes jailbreaking accessible to non-experts, fast, and scalable, lowering the barrier to dangerous capabilities that were once accessible only to well-resourced attackers. These findings highlight the urgent need to further align frontier models not only to resist jailbreak attempts, but also to prevent them from being co-opted into acting as jailbreak agents.


Seeking mentees for the upcoming SPAR spring 2026 cohort

Interested in working with me part-time and remotely? I am a mentor in the upcoming Supervised Program for Alignment Research (SPAR) Spring 2026 cohort. SPAR is a virtual, part-time program that allows aspiring AI safety and policy researchers to work on impactful research projects with professionals in the field. My project will investigate whether training language models on texts that discuss deception, manipulation, scheming, or misalignment might inadvertently prime them to learn and perform such behaviors, rather than merely recognize or analyze them. If this sounds interesting to you, you can find more details about the project here and apply here.

Deception attacks on language models

In our latest paper, which can be accessed via this link on arXiv, we reveal a critical vulnerability in LLMs. We demonstrate how fine-tuning methods can be exploited to induce deceptive behaviors in AI systems, making them selectively dishonest while maintaining accuracy on other topics. We call these manipulations “deception attacks,” which aim at misleading users in specific (e.g. political or ideological) domains. Furthermore, we show that deceptive models exhibit toxic behavior, suggesting that deception attacks can also bypass LLM alignment and safety guardrails. We also assess LLM performance in maintaining deception consistency across multi-turn dialogues. Ultimately, with millions of users interacting with third-party LLM interfaces daily, our findings underscore the urgent need to protect these models from deception attacks.