Does scaling have a ceiling? - Awesome MLSS Newsletter

14th Edition

Readers of this newsletter will be aware that we often use terms like ‘concepts’, ‘features’, and ‘compression’ to explain how LLMs store ideas. All of these refer to the same underlying phenomenon:

Superposition

We’ve gone from millions to trillions of parameters in under a decade, but we have few answers for why and how scaling helps performance.

A new paper from MIT tests ideas on superposition and demonstrates a possible power law for LLM scaling. It also shows there could be an upper limit to how far models can scale. More on this after some updates.

Upcoming Summer School Announcements

Applications for some of the following summer schools are closing in the next 20 days. Make sure to apply to them before the application deadline!

Title | Deadline | Dates
Machine Learning Crash Course at MALGA 2026 - Genoa, Italy | Mar 01 | Jun 15 - Jun 19
OxML 2026 - MLx Cases 2026 - Online | Apr 07 | May 07 - May 16
UK Robotics Summer School 2026 - Edinburgh, UK | May 15 | Jun 22 - Jun 26

For the complete list, please visit our website.

What’s happening in AI?

AI literature liberally talks about features and compression to explain the uncanny ability of LLMs to reason. We too are guilty of this, whether in our writeup on how LLMs do not forget details or the one on how LLMs think internally.

Disclaimer

Unless mentioned otherwise, all images in this newsletter were taken from the original paper, available at: https://arxiv.org/pdf/2505.10465

Superposition: An Intuition

All neural networks are ultimately matrix multiplications, perhaps with some non-linearity attached. If we ignore the non-linearity, the matrix multiplication is effectively a projection of one vector space into another.

What happens if we have more input dimensions than output dimensions? In that case, the projection necessarily compresses information: multiple input directions must be represented using fewer degrees of freedom.

Concretely, given a weight matrix W of size n by m and an input vector x of size n, the multiplication W^T x expresses the input as a combination of the rows of W. Each row defines one output direction, and these directions are learned so that the projection preserves and highlights patterns that are most useful for the task, even though the representation lives in a lower-dimensional space.
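
In symbols, the dimension bookkeeping for the paragraph above (our notation, following the setup of n input features and m hidden dimensions):

```latex
W \in \mathbb{R}^{n \times m}, \qquad x \in \mathbb{R}^{n}, \qquad
h \;=\; W^{\top} x \;=\; \sum_{i=1}^{n} x_i \, W_{i,:} \;\in\; \mathbb{R}^{m}
```

Here W_{i,:} is the i-th row of W, i.e. the m-dimensional direction the model assigns to feature i.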

If the model represents only about as many features as it has dimensions, so that each represented feature gets (nearly) its own direction, this is called weak superposition. If the model instead represents far more features than it has dimensions, forcing feature directions to overlap and share the lower-dimensional space, it is called strong superposition.

PLEASE NOTE

The above is only an intuition. For a deeper mathematical explanation, we recommend reading Toy Models of Superposition.

Prior studies indicate that LLMs operate in the strong superposition regime. Intuitively, this makes sense. As the current paper also highlights, 

To represent more than fifty thousand tokens — or even more abstract concepts — within a hidden space of at most a few thousand dimensions, the quality of representations is inevitably constrained by the model dimension or width, contributing to the final loss

Excerpt from the paper

Therefore, the authors sought to answer two questions: what is the relationship between superposition, the structure of the input data, and the loss? And if the loss follows a power law, as assumed, what is its exponent?

The Toy Model Of Superposition 

The authors adopt the same toy model of superposition as Anthropic’s Toy Models of Superposition paper. The model is simply a weight matrix with no non-linearity, which encodes an input vector of dimension n into a space of dimension m, where n >> m.

The same matrix (transposed) then projects the m-dimensional vector back into n-dimensional space, a non-linearity is applied (the paper uses a ReLU activation), and a reconstruction loss is computed between the original input and the final output vector using MSE.
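
Here is a minimal sketch of that toy model in PyTorch, as we read the description above (our own paraphrase, not the authors' code; the shapes and initialisation are assumptions):

```python
import torch

class ToyModel(torch.nn.Module):
    """Reconstruction model x_hat = ReLU(x W W^T), with W of shape (n, m) and n >> m."""
    def __init__(self, n: int, m: int):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(n, m) / m ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x @ self.W                     # encode: project n features into m dimensions
        return torch.relu(h @ self.W.T)    # decode with the same (transposed) matrix + ReLU

n, m = 1000, 20
model = ToyModel(n, m)
x = torch.rand(64, n)                      # a batch of inputs (the sparse sampling comes next)
loss = torch.mean((model(x) - x) ** 2)     # MSE reconstruction loss against the input
loss.backward()
```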

Now, here are the key variables, or “knobs”, the authors wished to turn and understand:

  1. Feature sparsity and activation - input features are sampled using a combination of a Bernoulli distribution (0/1) to decide whether a feature is activated at all, and a uniformly sampled value to determine the extent of activation (a sketch of this sampling appears after the next paragraph)

  2. Model dimension (i.e. m) - in practice, they fixed n at a large value and only varied m

  3. Weight decay - the authors use a specific weight-decay formulation in their optimiser, which they also allow to be negative (i.e. weight growth)

Once training is done, it is easy enough to inspect how each row (i.e. feature) interacts with the other features in the matrix: computing WW^T gives a per-row interaction heatmap. The frequency highlighted in the paper’s figures refers to how often each feature is activated (which was varied through the sampling).
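
Putting the sampling and the inspection together, here is a minimal sketch of how such inputs can be generated and how the WW^T interaction map is read off a weight matrix. The exact sampling details are our assumptions; the paper's setup may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_inputs(n: int, batch: int, alpha: float) -> np.ndarray:
    """Feature i fires with probability p(i) = i^(-alpha) (a Bernoulli gate); when it
    fires, its value is drawn uniformly from [0, 1]. Normalisation is our assumption."""
    p = np.arange(1, n + 1, dtype=float) ** (-alpha)   # p(1) = 1, decaying with rank
    gate = rng.random((batch, n)) < p                  # Bernoulli on/off per feature
    value = rng.random((batch, n))                     # uniform activation strength
    return gate * value

X = sample_inputs(n=1000, batch=64, alpha=1.2)

# After training, the n x n matrix W @ W.T shows how each feature's direction overlaps
# with every other feature's direction (the per-row interaction heatmap described above).
W = rng.normal(size=(1000, 20))                        # stand-in for a trained weight matrix
interaction = W @ W.T
```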

Training Method

  1. The authors choose an exponent alpha (the data exponent) to vary feature sparsity. The probability p(i) that feature i is activated (the Bernoulli probability) is proportional to i^(-alpha)

  2. Data dimension is fixed at 1000, while model dimensions are varied 

  3. The weight decay is swept from -1 to 1 

  4. The final test losses are fit as a power law in m to determine the exponent, if any. This exponent is referred to as alpha(m), or the model exponent (a sketch of such a fit appears below)

The training is executed primarily to find what power law, if any, is followed under different conditions. 
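
As a concrete illustration of step 4, a power law is typically fit as a straight line in log-log space. A minimal sketch with made-up loss values (not numbers from the paper):

```python
import numpy as np

# Hypothetical sweep: model dimensions m and the corresponding final test losses.
ms = np.array([8, 16, 32, 64, 128])
losses = np.array([0.20, 0.11, 0.055, 0.028, 0.015])    # illustrative values only

# A power law L(m) ∝ m^(-alpha_m) is a straight line in log-log space,
# so the negated slope of a linear fit gives the model exponent alpha(m).
slope, intercept = np.polyfit(np.log(ms), np.log(losses), 1)
print(f"fitted model exponent alpha(m) ≈ {-slope:.2f}")
```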

Why vary weight decay?

In Section 2 of the paper, the authors present strong evidence that weight decay can vastly affect superposition. They find that a negative weight decay tends to force the model into strong superposition, while a positive weight decay causes weak superposition, and they use this lever to analyse LLMs.

Weak Superposition Regime 

Using positive weight decays to force the model into weak superposition, the authors find a strong power-law relationship between data sparsity and loss.

In chart a, the loss is tracked as the data exponent (i.e. feature sparsity) is varied, and the number of represented features is counted as the number of rows with norm > 1/2 (a higher norm indicates that the feature is represented). For alpha > 1, the losses follow a power law.

In chart b, the authors argue that in weak superposition the loss must follow a power law with model exponent alpha(m) = alpha - 1, and the empirical findings back this up.

[In the weak superposition regime] The loss is governed by a sum of frequencies of less frequent and not represented features. Ideally, there are model dimension m most important features being represented. If feature frequencies follow a power law, p_i ∝ 1/i^α with α > 1, the loss or the summation starting at m will be a power law with m with exponent α − 1.

Excerpt from the paper
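
For readers who want the argument in one line: if only the m most frequent features are represented and each unrepresented feature contributes a loss proportional to its frequency (as the excerpt states), the tail sum gives the exponent directly:

```latex
L(m) \;\propto\; \sum_{i > m} p_i
     \;\propto\; \sum_{i > m} i^{-\alpha}
     \;\approx\; \int_{m}^{\infty} x^{-\alpha} \, \mathrm{d}x
     \;=\; \frac{m^{-(\alpha - 1)}}{\alpha - 1}, \qquad \alpha > 1
```

which is a power law in m with exponent alpha - 1, matching the alpha(m) = alpha - 1 claim above.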

Strong Superposition Regime 

Key to identifying strong superposition is the number of strongly represented features. A feature counts as strongly represented if its row vector has a large norm, since the feature then activates the output strongly. Experiments show the row norms follow a bimodal distribution, with the upper mode around 1 (refer to chart a below).

The more frequent a particular feature, the more likely its row norm exceeded 1, indicating that how strongly a feature is represented correlates with how frequently it appears (refer to chart b below).

Note that if only a single feature j is activated, the reconstructed output at feature i has activation ≈ W_i · W_j, which means the loss scales with the squared overlaps (W_i · W_j)^2.

The amount of overlap between features describes the extent of superposition. The average squared overlap is calculated by normalising each represented feature (rows with norm greater than 1) to a unit vector and averaging the squared dot products between them. When plotted across varying alpha (i.e. different rates of feature sparsity), the mean squared overlap remains invariant and depends only on the model dimension.
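
A minimal sketch of that computation, assuming W is a trained toy-model weight matrix with one row per feature (our own paraphrase of the procedure, not the paper's code):

```python
import numpy as np

def mean_squared_overlap(W: np.ndarray) -> float:
    """Average squared overlap between represented features (rows with norm > 1)."""
    norms = np.linalg.norm(W, axis=1)
    represented = W[norms > 1.0]                          # keep only strongly represented rows
    units = represented / np.linalg.norm(represented, axis=1, keepdims=True)
    overlaps = units @ units.T                            # pairwise dot products of unit rows
    off_diag = overlaps[~np.eye(len(units), dtype=bool)]  # drop the trivial self-overlaps
    return float(np.mean(off_diag ** 2))

# Usage with a random stand-in matrix; in practice W would come from training.
W_example = np.random.default_rng(1).normal(size=(1000, 20))
print(mean_squared_overlap(W_example))
```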

Losses also follow a power law when the fitted model exponent is measured across variations of weight decay and data exponent. Look especially closely at the red, yellow and orange dots, which represent negative weight decay (i.e. strong superposition). For data exponents less than or equal to 1, the model exponent is approximately 1. This means that if feature frequencies are relatively even (a data exponent below 1 means the frequency distribution is relatively flat), the loss follows a power law in m with exponent 1.

For even feature frequencies, vectors W_i tend to be isotropic in space with squared overlaps scaling like 1/m when n ≫ m, leading to the robust 1/m power-law loss. For skewed feature frequencies, representation vectors are heterogeneous in space, making loss sensitive to feature frequencies, where it might need power-law frequencies to have power-law losses.

Excerpt from the paper
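
The 1/m behaviour for isotropic directions is easy to sanity-check numerically: for random unit vectors in m dimensions, the expected squared overlap is exactly 1/m. A quick check of our own (not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials = 64, 100_000

# Draw pairs of random directions in R^m and normalise them to unit length.
u = rng.normal(size=(trials, m))
v = rng.normal(size=(trials, m))
u /= np.linalg.norm(u, axis=1, keepdims=True)
v /= np.linalg.norm(v, axis=1, keepdims=True)

# For isotropic unit vectors, E[(u · v)^2] = 1/m.
mean_sq_overlap = np.mean(np.sum(u * v, axis=1) ** 2)
print(mean_sq_overlap, 1 / m)   # the two numbers should be very close
```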

For the complete mathematical analysis, please refer to the paper’s section 3.2, titled Strong Superposition, as well as appendix D.6. 

How This All Ties Back To LLMs

In LLMs, it is the inner transformer layers that learn the rules of grammar, semantics and so on, while the initial embedding and the final language-model head focus on representation. The features correspond to the vocabulary, and the model dimension is the embedding size.

The authors argue that LLMs are in the strong superposition regime, since the row norms of the language-model heads of various LLMs follow a distribution similar to that of the strong superposition regime in the toy model (refer to chart a below). They acknowledge that the data is noisy and does not fit a line perfectly. They also show that for LLMs the row norms cannot depend purely on the dimension m, but must also depend on intrinsic properties of language data, which they confirm via tests.

Chart A

Chart B
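
As a rough illustration of the kind of inspection behind chart a, one can take the language-model head of an open-weight model and look at its row norms. A minimal sketch, assuming the Hugging Face transformers library and GPT-2 as an example model (not necessarily one of the models analysed in the paper):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# The language-model head maps hidden states to vocabulary logits;
# its weight matrix has one row per vocabulary token.
W_head = model.get_output_embeddings().weight.detach()   # shape: (vocab_size, hidden_dim)

row_norms = torch.linalg.norm(W_head, dim=1)
print(row_norms.shape, row_norms.mean().item(), row_norms.std().item())
```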

Now, we also need to understand the feature distribution. The authors go on to show that token frequencies in language data also obey an approximate power law, with alpha ≈ 1.
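
That approximate power law in token frequencies is straightforward to check on any tokenised corpus: under p_i ∝ i^(-alpha), log frequency versus log rank is roughly a straight line with slope -alpha. A toy sketch with a stand-in token list (a real corpus is needed to actually see alpha ≈ 1):

```python
from collections import Counter
import numpy as np

# Stand-in corpus; replace with real tokenised text to get a meaningful estimate.
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat", "ran", "the", "dog", "sat"]

counts = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
ranks = np.arange(1, len(counts) + 1)

# Slope of log(count) vs log(rank) estimates -alpha.
slope, _ = np.polyfit(np.log(ranks), np.log(counts), 1)
print(f"estimated frequency exponent alpha ≈ {-slope:.2f}")
```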

If the model is in strong superposition, we should see similar results in the loss scaling. Using the same formulae as before, losses are calculated for different models on different datasets (chart b). The loss is computed with a single forward pass, comparing each prediction against the tokens shifted by one position.

When a line is fitted to the obtained data, it yields an empirical model exponent of 0.91 +/- 0.04, remarkably close to the value of 1 expected under strong superposition. Although the loss scales inversely with m as a power law, it does not tend to 0 as m tends to infinity, which the authors attribute to intrinsic uncertainty in language data itself.
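
For reference, the "single forward pass, one-step-shifted tokens" loss is the standard next-token cross-entropy. A minimal sketch of measuring it, assuming the Hugging Face transformers library and GPT-2 as a stand-in model (not the authors' evaluation code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Toy models of superposition predict how the loss scales with model width."
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    # Passing labels=input_ids makes the library compare each prediction against
    # the tokens shifted by one position, i.e. the next-token loss described above.
    loss = model(input_ids=ids, labels=ids).loss

print(float(loss))
```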

What This Means For LLMs

Since we now know that, in strong superposition, the loss scales inversely with m as a power law, we can increase the model dimension to lower the loss faster. However, for language data, there is a limit to how skewed the input feature frequencies can be.

It is theorised that once the model dimension reaches the vocabulary size itself, the width-limited part of the loss will deviate from a power law and eventually vanish. That said, the vocabulary size may only be a lower bound on the number of truly independent features in language, in which case the power law with width may continue for longer.

While these statements were not proven, with this improved understanding of how the power laws arise, we are moving towards a clearer picture of how much, and how far, we can take LLMs purely through scaling. Further research along these lines may offer definitive proof.

Awesome Machine Learning Summer Schools is a non-profit organisation that keeps you updated on ML Summer Schools and their deadlines. Simple as that.

Have any questions or doubts? Drop us an email! We would be more than happy to talk to you.

With love, Awesome MLSS
