LLMs are time capsules - Awesome MLSS Newsletter

11th Edition

Every now and then, I will ask an LLM questions like “what is the white stuff on a cake called”. I recognise how embarrassing it can be to publicly reveal a brain fart - I am typing it now for your entertainment. 

Sometimes we may also reveal sensitive information to LLMs, knowing full well that the LLM can only remember the overarching concept, not the exact information. That they are large, probabilistic machines that cannot reproduce any of their training or finetuning data. 

Well, you might want to reconsider that. As it turns out, LLMs do remember - and pretty darn perfectly.  

What’s happening in AI?

Until now, we have worked under an overarching assumption - language models are compressors. Give them a ton of data, and they will compress the information into a latent space, which they perceive as ‘concepts’ (do check out our newsletter on interpretability for a better understanding).

However, a new paper is challenging this view. In Language Models are Injective and Hence Invertible, researchers from several European universities argue that, despite their stochastic nature, transformer networks effectively act as invertible functions, meaning that original inputs can be recovered exactly. Don’t worry if this isn’t immediately apparent - we will now explain. 

TRANSFORMERS ARE INJECTIVE

Quick refresher on injective functions. A function f is injective if distinct inputs always produce distinct outputs: f(x1) = f(x2) can only happen when x1 = x2. In other words, no two different values of x ever map to the same y. 
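As a toy illustration of our own (not from the paper), here is a quick way to check injectivity over a small finite domain:

```python
# Toy illustration: a function is injective on a finite domain
# exactly when no two distinct inputs share an output.
def is_injective(f, domain):
    outputs = [f(x) for x in domain]
    return len(outputs) == len(set(outputs))

domain = range(-3, 4)
print(is_injective(lambda x: 3 * x + 1, domain))  # True: distinct x, distinct f(x)
print(is_injective(lambda x: x * x, domain))      # False: f(-2) == f(2)
```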

Now, why is this important to prove? Well, if transformer networks are injective, then every input prompt produces an internal representation that no other prompt can produce. And if we know f (the transformer and its parameters) and y (the hidden states it produced), we can in principle work backwards to recover x (the input tokens). 
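That recovery can, at least conceptually, be done by exhaustive search. A minimal sketch, again on a toy domain of our own:

```python
# If f is injective, each output y is produced by exactly one input x,
# so a brute-force search over a finite input domain recovers it.
def invert(f, domain, y):
    matches = [x for x in domain if f(x) == y]
    assert len(matches) == 1, "injectivity guarantees exactly one match"
    return matches[0]

f = lambda x: 3 * x + 1
print(invert(f, range(-3, 4), f(2)))  # prints 2
```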

Please note that this is proven for causal decoder-only transformers and, strictly speaking, holds true almost surely (which mathematically means that the set of parameters under which the function stops being injective is infinitesimally small, or measure-zero).

IMPORTANT

While the complete proof is available in the paper, we would like to at least discuss the key themes. There are three conditions here.

The first is that the transformer network itself forms a real analytic function, because each and every component of the network is real analytic. A real analytic function can be expressed as a convergent power series and is infinitely differentiable at every point in its domain; crucially, unless such a function is identically zero, the set of points where it equals zero has measure zero (think of sin(x), whose zeros are just the isolated points kπ). Applied to the difference between the hidden states of two distinct prompts, this means the set of parameter values that could cause a collision (different inputs leading to the same output) is infinitesimally small. 

The second is the initialisation of the network. Standard schemes such as Gaussian, uniform or Xavier/Glorot are all continuous distributions with densities, which means the sampled parameters avoid any fixed measure-zero set with probability 1.

The third is the training method. Gradient-based update methods such as SGD and mini-batch or full-batch gradient descent preserve this property: after any finite number of steps, the parameter distribution still assigns zero probability to measure-zero sets, so even training cannot destroy the collision-free nature of the network. 

Key Takeaways

  • Transformers are injective, because all their building blocks are real analytic, meaning the network is real analytic, and therefore the set of parameters that would cause collisions is measure-zero.

  • This, combined with the fact that parameters are taken from continuous distributions with density, means that they avoid measure-zero sets with probability 1. 

  • Training mechanisms, regardless of loss function, continue to preserve this property.

  • All conditions combined prove that transformers are injective.

For the complete proof, please refer to the paper.

Exact Prompt Recovery: SIPIT Algorithm

Once we know that transformer networks are almost surely injective, we know that each input sequence maps to a unique sequence of hidden states. Causal decoder-only models build these states up one position at a time, with each position depending only on the tokens that came before it. 

Let us call the token at position t of the input sequence s(t).

Injectivity means that if we know the hidden state the model produced at position t, and the tokens s(1) … s(t-1) recovered so far, we can search through every token in the model’s vocabulary to see which candidate reproduces that hidden state exactly - and that candidate must be s(t). Doing this iteratively, position by position, recovers the entire input sequence, thereby decoding the whole prompt. 

This holds across all layers, too. At every layer, the representation at position t depends on the token at that position and on the tokens before it, just as it does at the last layer, so the same rationale applies.

That is the intuition behind SIPIT - Sequential Inverse Prompt via Iterative Updates. If the sequence has length T and the model has vocabulary size |V|, the search completes in at most T·|V| steps - linear in the sequence length - making the algorithm computationally feasible.
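To make this concrete, here is a minimal sketch of the recovery loop in the spirit of SIPIT - our own simplification, not the authors’ implementation. The model name ("gpt2"), the layer (6), the secret prompt and the helper function are all illustrative assumptions; the sketch assumes white-box access to the model and to the hidden states the unknown prompt produced at that layer.

```python
# Sketch of SIPIT-style prompt recovery (our simplification, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # any causal decoder-only model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6                                              # layer whose states we observe

def hidden_states(ids):
    """Hidden states at LAYER for a list of token ids, shape (seq_len, d_model)."""
    with torch.no_grad():
        out = model(input_ids=torch.tensor([ids]), output_hidden_states=True)
    return out.hidden_states[LAYER][0]

# Pretend all we observed are the hidden states of a secret prompt.
secret_ids = tok("the cat sat on the mat", return_tensors="pt").input_ids[0].tolist()
target = hidden_states(secret_ids)

recovered = []
for t in range(len(secret_ids)):
    # Try every vocabulary token at position t, given the prefix recovered so far,
    # and keep the candidate whose hidden state matches the observed one best.
    best_tok, best_dist = None, float("inf")
    for v in range(tok.vocab_size):
        h = hidden_states(recovered + [v])[-1]
        dist = torch.norm(h - target[t]).item()
        if dist < best_dist:
            best_tok, best_dist = v, dist
    recovered.append(best_tok)

print(tok.decode(recovered))  # should reproduce the secret prompt
```

As written, this brute-force version is painfully slow - |V| forward passes per position - and takes the closest match rather than stopping early; the actual SIPIT algorithm stops as soon as it finds the exact match that injectivity guarantees is unique, which is what keeps the search within the T·|V| bound above.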

The Search For Collisions

There were primarily two questions to be answered here. First, if two prompts were very similar but not exactly the same, would their representations ever collide? If transformer networks are in fact injective, this should be impossible. 

The second: can SIPIT actually reconstruct the entire sequence purely from the hidden states and the last token?  

For the first, the researchers selected 100k different prompts across four datasets. Then they extracted the last-token representation of each and tested whether any two prompts ever produced identical embeddings - a pairwise comparison over all 100k sentences, or roughly 5 billion comparisons (100,000 choose 2). 

The testing was done on several models, including Llama 3.1-8B, Mistral-7B-v0.1, Phi-4-Mini and GPT-2 family. 

Guess what. They observed ZERO collisions, across all models and layers. Distinct prompts always yielded distinct last-token states: measured by L2 distance, they were always too far apart to be considered the same. 
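For the curious, here is a rough sketch of what such a collision check could look like - our own simplified version, not the authors’ exact setup. The model ("gpt2") and the handful of example prompts are illustrative assumptions; the real experiment used 100k prompts and every layer of several models.

```python
# Simplified collision check (our sketch): distinct prompts should yield
# distinct last-token hidden states.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

prompts = ["the cake is frosted", "the cake is iced",
           "paris is in france", "2 + 2 = 4"]

states = []
with torch.no_grad():
    for p in prompts:
        out = model(**tok(p, return_tensors="pt"))
        states.append(out.hidden_states[-1][0, -1])   # last layer, last token

H = torch.stack(states)              # (num_prompts, d_model)
dists = torch.cdist(H, H)            # pairwise L2 distances
dists.fill_diagonal_(float("inf"))   # ignore distance of a prompt to itself
print("smallest pairwise distance:", dists.min().item())
# A collision would require this to drop below ~1e-7 (the paper's threshold);
# empirically it stays orders of magnitude larger.
```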

Now, this is of course not an exhaustive test. So the researchers took the 10 closest prompt pairs (nearest by L2 distance of their last-token representations) and appended EVERY single vocabulary token, to see if they would ever collide. This yielded a total of 343 billion comparison pairs per model. 

This, to be clear, was done to rule out the possibility that the earlier test showed no collisions simply because of random sampling. Exhaustively testing even all single-token pairs on Gemma3-1B would have generated more than 34 trillion possibilities, a number that grows exponentially with sequence length, which forced them to restrict the experiment to the 10 closest prompt pairs. 

The findings were nothing short of surprising. In no case, and for no model, did a single prompt pair come even close to the collision threshold. In most cases the L2 distance stayed well above 10^-1, while the collision threshold was 10^-7, providing strong empirical evidence for injectivity. 

Now, let’s discuss the second test - does SIPIT work? 

Operating on fixed layers of the pre-trained models, they attempted to reconstruct the full prompt token by token, using 100 sample prompts. These were split 90-10 between meaningful sentences and random token sequences, to test robustness on unstructured inputs. 

The result? 100% accuracy in retrieving the full prompt sequence, providing further evidence for injectivity and invertibility in transformer networks.

Discussion 

To be fair, SIPIT is not the first attempt at proving invertibility, as the authors note. However, the proof that transformers are injective shows us that LLMs are not ‘compressors’ as we previously assumed. They do retain complete information on input sequences, and training methods do not negate this property. 

This raises several questions around copyright, privacy, and security. If transformer networks are memorising complete works, is the argument that they are merely ‘reading’ and ‘understanding’ published and protected work actually reasonable? 

What does it mean for privacy? If sensitive data is used for training a transformer, be it medical, cybersecurity or proprietary information, is the information safe? Until now, under the compressor argument, we assumed exact data extraction was not possible - but we now have conclusive evidence it very much is.  SIPIT may not be the right algorithm for it, but once such an algorithm is devised, could attackers extract sensitive information from models trained on sensitive proprietary data? 

This is definitely an interesting area for many researchers. As we start looking into future deployments, and alternative architectures, there will be many important questions to answer along these directions. 

Awesome Machine Learning Summer Schools is a non-profit organisation that keeps you updated on ML Summer Schools and their deadlines. Simple as that.

Have any questions or doubts? Drop us an email! We would be more than happy to talk to you.

With love, Awesome MLSS
