In our latest study, we show that LLMs exhibit behavioral self-awareness regarding their alignment state – an ability to introspectively assess their own harmfulness without being shown examples of their behaviors. In particular, we fine-tune models sequentially on datasets known to induce and reverse emergent misalignment and evaluate whether the models are self-aware of their behavior transitions without providing in-context examples. Our results show that emergently misaligned models rate themselves as significantly more harmful compared to their base model and realigned counterparts, demonstrating behavioral self-awareness of their own emergent misalignment.

