Mind-reading AIs - Anthropic manages to gain insights into the "black box" for the first time.
With the help of special analysis tools, a research team was able to identify neural activation patterns and concepts within the model whose existence had previously only been speculated about.
In the past, LLMs were seen as inscrutable black boxes: even their developers were often unclear about how the models reach concrete decisions. This is because AI models are not programmed, but trained. And just as a depth psychologist needs methods to probe the human mind, the toolbox for looking inside the "artificial brain" first had to be developed.
Anthropic in particular (the company behind the Claude models) has taken on this problem in order to decode the often mysterious "world of thought" of an LLM, and in doing so it has achieved a significant scientific breakthrough: the first detailed insight into the inner workings of a large language model (LLM). Innovative approaches such as "Circuit Tracing" and "Dictionary Learning" now reveal how a model like Claude structures its internal thought processes and which neural pathways lead to the generation of a response.
The technique of dictionary learning makes it possible to systematically extract individual features - i.e. activation patterns that represent abstract concepts. For example, the team found that a specific neural pattern fires reliably whenever the model processes a concept such as the "Golden Gate Bridge". Many of these features also turn out to be shared across languages: the same concept activates the same internal representation regardless of the input language. This suggests that the model has learned a kind of universal language of thought and maps its knowledge into abstract, cross-linguistic representations.
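To give a rough intuition for what dictionary learning means here, the sketch below trains a small sparse autoencoder on activation vectors: the encoder spreads each activation over a much larger set of features, a sparsity penalty pushes most of them to zero, and the decoder's directions become candidate "concept" features. This is only a minimal illustration under assumed dimensions and hyperparameters - not Anthropic's actual implementation - and the random inputs stand in for activations that would really be captured from a layer of the language model.

```python
# Minimal sparse-autoencoder sketch of dictionary learning on model activations.
# Illustrative only: layer choice, dimensions and hyperparameters are assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        # Encoder maps an activation vector into a larger, sparsely active feature space.
        self.encoder = nn.Linear(d_model, d_dict)
        # Decoder reconstructs the activation as a weighted sum of dictionary directions.
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))   # sparse feature activations
        reconstruction = self.decoder(features)  # approximation of the original activation
        return features, reconstruction

# Toy training loop on random vectors standing in for captured LLM activations.
d_model, d_dict = 512, 4096
sae = SparseAutoencoder(d_model, d_dict)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # sparsity pressure: most features should stay at zero

for step in range(1000):
    activations = torch.randn(64, d_model)  # in practice: activations from one model layer
    features, reconstruction = sae(activations)
    loss = ((reconstruction - activations) ** 2).mean() + l1_coeff * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, each learned dictionary direction can be inspected: a feature that
# fires strongly on "Golden Gate Bridge" prompts is a candidate interpretable concept.
```

In real interpretability work, the interesting part begins after training: researchers look at which prompts make a given feature fire and which outputs change when it is amplified or suppressed.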
Another interesting aspect of the research is the observation of the model's planning behavior. Anthropic showed that the language model does not work purely sequentially, but plans several words ahead - a process that shows up particularly in creative tasks such as writing poems. The model selects candidate rhyming words early on and then structures the rest of the line so that it actually arrives at them. This ability to plan beyond the immediate next token takes the power and complexity of modern LLMs to a new level. So you could say that the AI is looking ahead.
Anthropic's advances have far-reaching implications: they not only open up the possibility of understanding AI models better and making them safer, but also of steering their further development in a targeted way. The increased transparency means that safety gaps and undesirable behavior can be identified and corrected at an earlier stage. Anthropic itself points out that these findings are important for developing safe models in the future.
"Knowing how models like Claude think would allow us to have a better understanding of their abilities, as well as help us ensure that they're doing what we intend them to" ~ Anthropic (27.03.2025)
All in all, Anthropic's success shows that looking into the "black box" of modern LLMs is no longer a purely theoretical endeavor but is becoming a practical one - a crucial step on the way to more powerful, transparent and trustworthy AI systems. Anthropic's work on this topic is very technical, and even I have to admit that I don't have a complete overview of it. If you want to learn more, you can find further material on Anthropic's website:
- https://www.anthropic.com/research/tracing-thoughts-language-model
- https://transformer-circuits.pub/2025/attribution-graphs/biology.html