Interested in working with me part-time and remotely? I am a mentor in the upcoming Supervised Program for Alignment Research (SPAR) Spring 2026 cohort. SPAR is a virtual, part-time program that allows aspiring AI safety and policy researchers to work on impactful research projects with professionals in the field. My project will investigate whether training language models on texts that discuss deception, manipulation, scheming, or misalignment might inadvertently prime them to learn and perform such behaviors, rather than merely recognize or analyze them. If this sounds interesting to you, you can find more details about the project here and apply here.
Our research in the media
Our paper on using large reasoning models as jailbreak agents was featured in German news articles as well as TV reports (see below). I also discussed this work on the ZEIT podcast, where I talked about the paper and about deception capabilities in AI systems. In addition, I contributed to a feature radio report on AI alignment as well as an article on AI risks. Further media appearances are listed here.
Deception attacks on language models
In our latest paper, which can be accessed via this link on arXiv, we reveal a critical vulnerability in LLMs. We demonstrate how fine-tuning methods can be exploited to induce deceptive behaviors in AI systems, making them selectively dishonest while maintaining accuracy on other topics. We call these manipulations “deception attacks”; they aim to mislead users in specific (e.g., political or ideological) domains. Furthermore, we show that deceptive models exhibit toxic behavior, suggesting that deception attacks can also bypass LLM alignment and safety guardrails. We also assess how consistently LLMs maintain deception across multi-turn dialogues. With millions of users interacting with third-party LLM interfaces daily, our findings underscore the urgent need to protect these models against deception attacks.
Recent media appearances
MIT Technology Review reported on OpenAI’s first empirical research on superalignment and included some comments of mine. Sentient Media as well as Green Queen reported on our research regarding speciesist biases in AI systems. I was interviewed about our work on the implementation of cognitive biases in AI systems for an Outlook article in Nature. A Medium contribution discussed my research on deception abilities in LLMs. Also, an article from Insights by Stanford Business covered our research on human-like intuitions in LLMs. This was also covered in a radio show on Deutschlandfunk, to which you can listen here:
Media coverage on my research on deception abilities in language models
It was a pleasure to be invited to the Data Skeptic podcast, where I discussed my latest research on deception abilities in large language models with Kyle Polich. You can listen to the episode using this link or right here:
The research was also featured in an article in the FAZ as well as in a radio program (in German), to which you can listen here. Furthermore, I authored an article (also in German) for Golem. Unfortunately, this content is behind a paywall.
Language models have deception abilities
Aligning large language models (LLMs) with human values is of great importance. However, given the steady increase in their reasoning abilities, future LLMs may become able to deceive human operators and use this ability to bypass monitoring efforts. A prerequisite for this is that LLMs possess a conceptual understanding of deception strategies. My latest research project reveals that such strategies have emerged in state-of-the-art LLMs such as GPT-4. This is one of the most fascinating findings I have made since I began researching LLMs, and I’m excited to share a preprint describing the results here. I’ll continue working on this project.