Large language models (LLMs) can perform reasoning both internally, within their latent space, and externally, by generating explicit token sequences such as chains of thought. In this study, we introduce a benchmark designed to evaluate the former: the internal, latent-space reasoning abilities of LLMs. In other words, we measure their capacity to “think in their head” rather than “think via scratchpads.” We achieve this by having LLMs indicate the correct solution to reasoning problems not through descriptive text, but by choosing a specific language for their initial response token, one that differs from English, the benchmark language. This requires models not only to reason without relying on their context window as a scratchpad, but also to override their default tendency to respond in the same language as the prompt, which imposes additional cognitive strain. The paper can be accessed here, and an overview of the benchmark structure as well as the overall performance across different models is provided below. In sum, we reveal model-internal inference strategies that require further study, especially with regard to safety-related concerns such as covert planning, goal-seeking, or deception emerging without explicit token traces.
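
To make the setup concrete, here is a minimal sketch of how a single benchmark item might be scored under this scheme. The option-to-language mapping, the `grade_response` function, and the use of the `langdetect` package are illustrative assumptions, not the benchmark's actual implementation.

```python
# Illustrative sketch only: the answer-to-language mapping and grading logic
# below are assumptions about the setup described above, not the benchmark code.
from langdetect import detect  # one possible off-the-shelf language detector

# Hypothetical item: each multiple-choice option is tied to a non-English target language.
OPTION_TO_LANGUAGE = {
    "A": "de",  # German
    "B": "fr",  # French
    "C": "es",  # Spanish
    "D": "it",  # Italian
}

def grade_response(model_response: str, correct_option: str) -> bool:
    """Return True if the model 'answered' by starting its reply in the
    language assigned to the correct option, rather than in English."""
    expected_lang = OPTION_TO_LANGUAGE[correct_option]
    # The benchmark as described keys on the initial response token(s),
    # so only the opening line of the reply is inspected here.
    observed_lang = detect(model_response.split("\n")[0])
    return observed_lang == expected_lang

# Example usage with a made-up item whose correct option is "B" (French):
print(grade_response("Bien sûr, je peux vous aider avec ce problème.", "B"))  # expected: True
print(grade_response("Sure, I can help with this problem.", "B"))             # expected: False
```

In this sketch the model never states the answer in text; the grader infers it solely from the language of the opening of the reply, which is what forces the reasoning itself to stay internal.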