The LLM Maze of Thought - Awesome MLSS Newsletter

9th Edition

The year is 2017. OpenAI, a then-young non-profit, releases a paper showing that a network trained only to predict the next character of Amazon reviews learns to track sentiment without ever being explicitly trained to do so.

Then comes the big bang moment in AI: Google releases Attention is All You Need.

OpenAI, armed with its understanding from the Amazon reviews sentiment experiment and the new transformer architecture, trains a decoder-only network in an unsupervised manner. It is tiny by today's standards: 117M parameters. The name? Generative Pre-trained Transformer, or GPT.

Until then, the best models needed highly curated, annotated data for each specific task. GPT shattered that assumption, producing a network with emergent capabilities on tasks it was never explicitly trained for.

The one question we have been trying to answer ever since is how this emergent behaviour arises. Before we start, a few quick updates on summer schools.

Upcoming Summer School Announcements

Make sure to apply to them before the application deadline!

Title | Deadline | Dates
Annual Nepal AI School 2025 - Kathmandu, Nepal | Oct 16 | Dec 29, 25 - Jan 8, 26
Northern Lights Deep Learning Winter School 2026 - Tromso, Norway | Nov 15 | Jan 5, 26 - Jan 9, 26

For the complete list, please visit our website

What’s happening in AI?

Like the dream of the 1950s, GPT and its successors learn things by themselves, without explicit encoding, building internal world models that are at least loosely analogous to our own. We have already seen how capable recent LLMs have become, interpreting inputs across modalities and domains. Granted, they are now also trained on massive amounts of supervised data, but that does not take away from the fact that cross-domain and multilingual learning still emerge from relatively few examples.

Today, let's explore the 'biology' of a neural network, that is, how its internal systems create this unsupervised understanding.

Many organizations study neural network circuit tracing, but we focus on Anthropic, both for their industry leadership and the interactive visual tools they provide to make the concepts easier to grasp.

Feature Compression: An Intuition

Have you heard of the pigeonhole principle?

Three bowls, four balls—can you place the balls so no bowl holds more than one? Mathematically, no. Go ahead, try it out.

LLMs face a similar problem. Each layer has limited representational capacity, but the data fed to it often carries more distinct concepts than that capacity allows. What happens? Concepts, or features, get "smushed" together like balls sharing bowls, effectively compressing information to be unpacked later. Interpretability researchers call this superposition.

This intuition is crucial: the same layer can handle multiple tasks. To explore and understand how, we first need a method to trace where these “thoughts” flow.
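To make the balls-in-bowls picture concrete, here is a tiny, purely illustrative NumPy sketch (not code from Anthropic's work). It packs six hypothetical features into a three-dimensional "layer" and shows that, as long as only a few features are active at once, they can still be read back out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 6, 3        # more "balls" (features) than "bowls" (dimensions)

# Each feature gets its own (random) direction in the smaller space.
W = rng.normal(size=(n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input: only features 1 and 4 are active. Sparsity is what makes
# this kind of compression workable.
x = np.zeros(n_features)
x[[1, 4]] = 1.0

h = x @ W                         # "smush" 6 features into 3 activations
scores = h @ W.T                  # read each feature back out by projection
recovered = np.argsort(scores)[-2:]

print(sorted(recovered.tolist()))  # typically [1, 4]: the compressed info survives
```

The readouts are noisy because the directions interfere with one another, which is exactly why dedicated tools are needed to pull individual features apart in a real model.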

Circuit Tracing

To analyze how features flow through an LLM, we first need to isolate them—but doing this directly is complex, opaque, and computationally expensive.

Anthropic addressed this with a novel architecture: the cross-layer transcoder (CLT).

Transcoders, like sparse autoencoders, rely on sparse coding: only a few units fire at a time, so each active unit tends to capture a meaningful feature rather than the network spreading everything thinly across all of them.

A CLT acts as a surrogate for the LLM's MLP layers: the reconstruction at each layer combines contributions from that layer's CLT features and those of all preceding CLT layers, allowing us to trace how salient features propagate through the network.

These CLTs then stand in for the MLPs of the original LLM, with attention and the other layers frozen and the same input and output layers. The result is referred to as the "replacement" model.

Creation of replacement model. Image Source: https://transformer-circuits.pub/2025/attribution-graphs/methods.html
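To make that structure concrete, here is a heavily simplified PyTorch sketch of a cross-layer transcoder and how it would stand in for the MLPs. The shapes, module names, and training loss mentioned in the comments are illustrative assumptions, not Anthropic's implementation.

```python
import torch
import torch.nn as nn

class CrossLayerTranscoder(nn.Module):
    """Toy CLT: features at layer i read the residual stream at layer i and
    contribute to the reconstructed MLP output of layer i and every later layer."""

    def __init__(self, n_layers: int, d_model: int, d_features: int):
        super().__init__()
        self.n_layers = n_layers
        # One sparse encoder per layer: residual stream -> feature activations.
        self.encoders = nn.ModuleList(
            nn.Linear(d_model, d_features) for _ in range(n_layers)
        )
        # Decoders: layer-i features write into layers j >= i (the "cross-layer" part).
        self.decoders = nn.ModuleDict({
            f"{i}->{j}": nn.Linear(d_features, d_model, bias=False)
            for i in range(n_layers) for j in range(i, n_layers)
        })

    def forward(self, residual_streams):
        # residual_streams[i]: residual stream entering layer i, shape (batch, d_model)
        feats = [torch.relu(enc(residual_streams[i]))   # sparse, non-negative features
                 for i, enc in enumerate(self.encoders)]
        mlp_replacements = []
        for j in range(self.n_layers):
            # Sum contributions from this layer's features and all earlier ones.
            out = sum(self.decoders[f"{i}->{j}"](feats[i]) for i in range(j + 1))
            mlp_replacements.append(out)
        return feats, mlp_replacements

# In the replacement model, attention and everything else stay frozen; only the
# original MLP outputs are swapped for `mlp_replacements`. The CLT is trained so
# these reconstructions match the real MLP outputs, while an L1-style penalty on
# `feats` keeps the feature activations sparse.
```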

The replacement model can then be used to create attribution graphs, showing us how different features influence each other.
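In spirit, an attribution graph is a weighted directed graph over the features that were active on a given prompt. Here is an illustrative sketch of how one might assemble such a graph; the direct_effect helper is hypothetical and stands in for the linear attributions computed from the frozen replacement model.

```python
from collections import defaultdict

def build_attribution_graph(feature_activations, direct_effect, threshold=1e-3):
    """
    feature_activations: dict mapping feature id -> activation on this prompt
    direct_effect(src, dst): hypothetical helper returning how much `src`'s
        activation directly contributes to `dst` through the frozen linear paths
    """
    graph = defaultdict(dict)
    active = [f for f, a in feature_activations.items() if a > 0]
    for src in active:
        for dst in active:
            if src != dst:
                w = direct_effect(src, dst)
                if abs(w) > threshold:   # prune tiny edges to keep the graph readable
                    graph[src][dst] = w
    return dict(graph)

# On the "Texas Capital?" prompt, one would hope to see edges resembling
# "Texas" -> "say a capital" -> "Austin", mirroring the simplified graph
# discussed in the next section.
```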

Please note that this is a very high-level overview of replacement models, but it is sufficient to understand the concepts we will discuss next. For a more detailed treatment, please refer to Circuit Tracing: Revealing Computational Graphs in Language Models.

One thing to highlight, however, is that replacement models are neither lightweight nor easy to train. Creating a CLT for a 9B-parameter LLM would require a model with approximately 5M parameters and 3B tokens, costing roughly 3,844 H100 hours (at 50% FLOPs efficiency). This, too, remains an open challenge in interpretability.

The Biology of a Large Language Model

Now that we have a traced circuit of features, what can we do with it? Anthropic explores this in On the Biology of a Large Language Model.

First, the researchers trace which features influence the output, and how. For instance, they follow the paths a particular token takes, its interactions with other features, and the influence each exerts. When the path for the query "Texas Capital?" is traced, the simplified graph looks like the one below. The exact method by which they establish that a certain feature "does" something is covered in more detail in their article.

Approximate feature flow for saying a state capital in an LLM. For full interactive asset, visit https://transformer-circuits.pub/2025/attribution-graphs/biology.html

Now, using this knowledge, we can conduct ablations. For instance, what happens if a specific feature is intentionally given more or less weight? The model tends to stay roughly on track: in the example below, when the feature for 'Texas' was suppressed, it responded with a different state capital, though still one in the US.

PLEASE NOTE: The inhibition studies were performed directly on the LLM, not on the replacement model.
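The sketch below shows the flavour of such an intervention: register a forward hook on one of the original model's layers and dampen (or amplify) a chosen feature direction in the residual stream before generation continues. The module path, the texas_direction vector, and the generate helper are hypothetical placeholders, not Anthropic's tooling.

```python
import torch

def scale_feature(model, layer, feature_direction, scale=0.0):
    """Rescale the component of the residual stream along `feature_direction`
    at block `layer`; scale=0 suppresses the feature, scale>1 amplifies it."""
    d = feature_direction / feature_direction.norm()

    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        coeff = resid @ d                                   # current feature strength
        steered = resid + (scale - 1.0) * coeff.unsqueeze(-1) * d
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    # Assumes a GPT-2-style module layout; adjust the path for your model.
    return model.transformer.h[layer].register_forward_hook(hook)

# Hypothetical usage: suppress the "Texas" feature, then generate.
# handle = scale_feature(model, layer=12, feature_direction=texas_direction, scale=0.0)
# print(generate(model, "Fact: the capital of the state containing Dallas is"))
# handle.remove()   # restore normal behaviour
```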

Let us dive deeper into some specific examples and scenarios.

Multi-Step Reasoning

When prompted with “The capital of the state containing Dallas is?”, the model first identifies the state (Texas) and then retrieves the answer. Circuit analysis showed that the ‘Dallas’ feature activated related Texas features, while the ‘Capital’ token emphasized naming a capital—together producing “Austin.”

In an ablation, replacing Texas features with California features yielded “Sacramento,” showing the model isn’t just memorizing but forming internal representations linking states to capitals.

However, swapping a state with a country required a stronger weighting to produce the national capital, indicating that the model’s internal representations for state and national capitals are distinct.
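The Texas-to-California swap maps naturally onto the steering sketch shown earlier; a hypothetical companion helper that adds a feature direction (rather than rescaling an existing one) might look like this.

```python
def inject_feature(model, layer, feature_direction, strength=5.0):
    """Add `strength` units of a feature direction to the residual stream at `layer`."""
    d = feature_direction / feature_direction.norm()

    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        steered = resid + strength * d
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return model.transformer.h[layer].register_forward_hook(hook)

# Hypothetical swap: remove "Texas", add "California", and the answer shifts
# from "Austin" toward "Sacramento".
# h1 = scale_feature(model, 12, texas_direction, scale=0.0)
# h2 = inject_feature(model, 12, california_direction, strength=5.0)
```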

Multilingual Operations 

As expected, features across languages are 'smushed' into the same areas, yet the model can still tell them apart. However, it does show a greater tendency to 'think' in English, with many features focused largely on the English language.

This may depend on the model, the training data, or the training methodology, and needs further investigation.

Poetry Planning

As a poet, I’ve always wondered how LLMs handle phonetics, like hard and soft rhymes. While Eminem famously rhymed the “un-rhymeable” ‘Orange’ with ‘Door Hinge,’ LLMs are surprisingly adept too.

Intuitively, it makes sense that the LLM would just keep spitting out sick lyrics until it hits the point where it has to rhyme, and only then pick a rhyming word. That is not what the researchers found with Haiku (not the poetic form, but Anthropic's model).

In reality, as soon as the model writes the first rhyme word, it activates features for candidate words that would rhyme with it, effectively choosing the ending of the next line in advance. This is referred to as forward planning.

Then, in what I personally consider an incredible stroke of intuitive genius, the model plans backward: given the features for the planned rhyming word, it decides what the rest of the sentence should be so that it leads naturally to that word.

This is mind-boggling on several levels. We are constantly fed the notion that LLMs just mindlessly output 'one token after another', but the poetry example clearly shows that the model does plan ahead to some extent, and then retraces its steps to fit that plan.

Several other behaviours, such as addition, medical diagnosis, and entity recognition, are discussed as well. For the complete analysis, do check out their blog.

Anthropic took this investigation much further. Recall that in the replacement model the attention layers were frozen and CLT layers stood in for the MLP layers of the LLM. That setup does not give a full picture of how the attention mechanism itself takes part in the process, especially the QK matrices, which, luckily for us, they investigate in Tracing Attention Computation Through Feature Interaction.

From GPT's 117M parameters to today's frontier models, we have marvelled at what LLMs can do. Now, with circuit tracing, we are finally beginning to understand how they do it: building internal world models, planning ahead, and compressing knowledge in ways that enable cross-domain reasoning.

The next time an LLM impresses you with an answer, remember: beneath that single response lies a cascade of features activating, planning forward, and propagating through layers. What looks like magic is actually traceable computation.

This isn't just intellectual curiosity. As these models become more powerful and integrated into critical systems, understanding their internal "biology" becomes essential for safety, alignment, and trust. Circuit tracing is still nascent—researchers are actively exploring how attention mechanisms contribute, how different architectures affect feature formation, and whether these patterns hold across frontier models.

Awesome Machine Learning Summer Schools is a non-profit organisation that keeps you updated on ML Summer Schools and their deadlines. Simple as that.

Have any questions or doubts? Drop us an email! We would be more than happy to talk to you.

With love, Awesome MLSS
