Move aside, images. Diffusion is now great for languages too - Awesome MLSS Newsletter
15th Edition

Autoregressive language models remind me of the sloth scene from Zootopia — slow, one word at a time, forced to reprocess the whole sequence with every new token. It's not how humans think, and as papers (and we too) have noted, it pushes the model into a corner — committed to a direction decided several steps ago.
Masked diffusion language modelling (MDLM) is the most notable effort to fix this. You know diffusion from image generators like Stable Diffusion or Sora — all those Ghibli images? Same idea.
A recent paper, Esoteric Language Models, combines the best of both worlds while tackling MDLM's biggest bottleneck: KV Caching. More on that after a few quick updates.
Upcoming Summer School Announcements
Applications for some of the following summer schools are closing in the next 10 days. Make sure to apply to them before the application deadline!
| Title | Deadline | Dates |
|---|---|---|
| — | Mar 01, 2026 | June 15 – June 19, 2026 |
| — | April 7, 2026 | May 7 – May 16, 2026 |
| — | May 15, 2026 | June 22 – June 26, 2026 |
| Oxford Machine Learning Summer School – MLx HEALTH & BIO 2026 — Oxford, UK | May 22, 2026 | July 10 – July 13, 2026 |
| OxML 2026 – MLx Representation Learning & Generative AI 2026 — Oxford, UK | May 22, 2026 | July 15 – July 18, 2026 |
For the complete list, please visit our website.
Masked Diffusion Language Modelling - An Intuition
Autoregressive language models are trained to generate one token at a time. This is the standard across the vast majority of models, including the GPT series, Llama series, Mistral, Gemini, you name it.
In MDLM, however, several tokens are generated at once in parallel. This works through a process called 'denoising': in a single forward pass, we denoise some of the masked tokens to predict what they might be, and repeat until the whole sequence is generated.
This lets the model attend to and generate multiple tokens at a time, reducing the number of forward passes while giving it the flexibility to avoid fixating on the tokens committed so far. In a way, it reasons about the whole latent sequence as it generates; the intuition is similar to what we discussed in our VL-JEPA newsletter.
To be clear, the backbone for both is ultimately a transformer network; what changes is how tokens are generated and processed.
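To make the contrast concrete, here is a minimal sketch of an MDLM-style sampling loop in plain Python. The `predict` function is a hypothetical stand-in for the transformer, not a real model interface: given the partially masked sequence, it returns a (token, confidence) guess for every position.

```python
import math

MASK = None  # sentinel for a still-masked position

def mdlm_sample(predict, seq_len, steps):
    """Sketch of masked-diffusion sampling: each loop iteration is ONE
    forward pass that denoises several positions at once, unlike AR
    decoding's one forward pass per token."""
    x = [MASK] * seq_len
    per_step = math.ceil(seq_len / steps)  # tokens to commit per step
    for _ in range(steps):
        masked = [i for i, t in enumerate(x) if t is MASK]
        if not masked:
            break
        guesses = predict(x)  # one forward pass predicts ALL positions
        # commit the most confident masked positions this step
        ranked = sorted(masked, key=lambda i: -guesses[i][1])
        for i in ranked[:per_step]:
            x[i] = guesses[i][0]
    return x
```

With a toy `predict`, a length-6 sequence is filled in 3 forward passes instead of 6.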

It does come with some caveats.
Because AR models decode one token at a time, they have much finer granularity and control. This helps generate better quality outputs. MDLMs are still catching up on this front.
Since the generated tokens in MDLMs are not produced sequentially, every position's key and value representations can change on each forward pass (attention is bidirectional). Therefore, we cannot store a KV cache, an important optimization for inference speed.
Let's take a quick look at the fundamental gap between AR and MDLM models:
| Property | AR Models | MDLMs |
|---|---|---|
| Generation granularity | One token at a time — fine-grained control (better output quality) but constrained to a single left-to-right direction | Whole sequence at once — bidirectional context allows better generalization |
| KV caching | Full KV caching out of the box | No native KV caching; requires architectural modifications or approximations |
| Forward passes | One pass per token — scales linearly with sequence length | One pass per denoising step, each generating multiple tokens — far fewer passes overall |
| Speed | Slower at inference due to sequential token generation | Faster in principle due to parallel token generation per step |
Other Attempts at KV Caching
Several papers attempt to integrate KV caching in diffusion language models, but they are for the most part approximations, and come with caveats. We highlight three here for perusal, but there are several more.
| Paper | How KV Cache Was Introduced | Why It's Not a Full KV Cache |
|---|---|---|
| — | Caches KV states only for already-decoded tokens, with a one-step delay before committing to cache and periodic full refreshes every N steps | Masked tokens still require periodic recomputation since their representations keep evolving; the refresh mechanism is an explicit admission that the cache becomes stale |
| — | Introduces a block-wise "DualCache" that reuses KV states across adjacent denoising steps, exploiting high similarity between consecutive-step activations | Approximate by design — requires periodic global refreshes whenever KV drift exceeds a threshold, so the cache is never permanently fixed |
| — | Adds causal AR structure between fixed-size blocks, allowing completed blocks' KV states to be cached and reused exactly like an AR model | Only inter-block KV states are truly cached; within each block, full bidirectional attention is still recomputed from scratch across all denoising steps |
MDLM Loss: An Intuition
In MDLM literature, decoding is referred to as denoising; we use the terms interchangeably. If you see denoising, just think of token generation. Likewise, a single forward pass is referred to as a function evaluation, so if you see NFE (number of function evaluations), just think of total forward passes.
One of the key structural issues in MDLM is that the loss cannot be calculated exactly from a single step. Each token can be unmasked at any of many denoising steps, so ideally we would accumulate the loss across all possible denoising trajectories and sum it.
This is computationally intractable: the space of trajectories is exceedingly large.
Therefore, MDLM optimizes an evidence lower bound (ELBO) on the log-likelihood instead. Using theorems from information theory and probability theory, the authors derive the value of this lower bound and seek to maximize it, or equivalently, to minimize its negative (the NELBO).
The exact formulation can be found in the MDLM paper under the section "Objective for training: ELBO".
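For reference, the continuous-time NELBO commonly written in the MDLM literature has roughly the following shape (a sketch in standard notation, not the paper's exact statement; conventions for signs and schedule parameterization vary):

```latex
\mathcal{L}_{\mathrm{NELBO}}
  \;=\; \mathbb{E}_{q}\int_{0}^{1}
        \frac{\alpha_t'}{1-\alpha_t}
        \sum_{\ell\,:\,z_t^{\ell}=\mathbf{m}}
          \log\big\langle x_\theta^{\ell}(z_t),\,x^{\ell}\big\rangle\,
        \mathrm{d}t
```

Here, $\alpha_t$ is the masking schedule, $z_t$ the partially masked sequence at noise level $t$, $\mathbf{m}$ the mask token, $x^{\ell}$ the one-hot true token at position $\ell$, and $x_\theta$ the model's predicted distribution. Intuitively, each masked position contributes a weighted cross-entropy term, with the weight set by the noise schedule.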
Esoteric Language Models
Esoteric LMs combine the AR and MDLM paradigms. A single sequence generation has two phases:
MDM phase: a certain number of tokens are decoded using masked diffusion modelling
AR phase: the remaining tokens are decoded using sequential passes, same as regular LLMs
Most importantly, the structure is such that it allows for full KV Caching. Let us now see how this works.
Generating Token Split
The number of tokens decoded via either paradigm is controlled by an input parameter α ∈ [0, 1]. Given a sequence of length L:
α = 1: all L tokens are decoded via MDM — fully parallel, maximum speedup
α = 0: all L tokens are decoded autoregressively — equivalent to a standard LLM
α ∈ (0, 1): ⌊αL⌋ tokens are decoded via MDM and the remaining L − ⌊αL⌋ tokens are decoded autoregressively
For example, α = 0.5 splits the sequence evenly between the two paradigms, with the MDM component providing considerable inference speedup and the AR component offering finer-grained control over the output.
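The split itself is a one-liner; this sketch just encodes the ⌊αL⌋ convention above (the function name is ours, not from the paper):

```python
import math

def token_split(alpha, L):
    """Number of MDM-decoded vs AR-decoded tokens for a length-L sequence."""
    assert 0.0 <= alpha <= 1.0
    n_mdm = math.floor(alpha * L)  # parallel (diffusion) share
    return n_mdm, L - n_mdm        # AR share is the remainder
```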
Once we know how many tokens each phase will handle, we need to decide which positions are generated via MDM and which via AR. These are selected using a binomial distribution that depends on both the α above and the denoising step number. In each MDM step, we denoise one or more tokens according to the schedule drawn from this distribution.
The remaining tokens, to be decoded via AR, are then appended sequentially in ascending order. More information can be found in appendix B.5 of the paper.
Let's consider a sequence [1, 2, 3, 4, 5, 6] where each value is the token position index. We choose alpha = 0.67, meaning 4 tokens to be generated via MDM, 2 via AR. Based on what we described above, we generate a denoising schedule:
Schedules:
S_MDM = ((5, 3), (6, 1))
S_AR = ((2), (4))
S = (S1, S2, S3, S4) where S1 = (5, 3), S2 = (6, 1), S3 = (2), S4 = (4)
As you can see, instead of needing six denoising steps, we only require 4 steps. The full schedule would look something like:
Step | Subset of tokens generated | Type | Sequence State |
|---|---|---|---|
Start | — | — | [M, M, M, M, M, M] |
S1 | (5, 3) | MDM | [M, M, 3, M, 5, M] |
S2 | (6, 1) | MDM | [1, M, 3, M, 5, 6] |
S3 | (2) | AR | [1, 2, 3, M, 5, 6] |
S4 | (4) | AR | [1, 2, 3, 4, 5, 6] |
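The table above can be replayed mechanically. This sketch applies the example schedule to a fully masked sequence ("M" marks a masked position; as in the example, each token's value equals its position index, and the schedule tuples are the ones from the text):

```python
def apply_schedule(seq_len, schedule):
    """Replay a denoising schedule: each step fills in one subset of
    1-indexed positions, whether it came from the MDM or the AR phase."""
    state = ["M"] * seq_len
    states = []
    for subset in schedule:
        for pos in subset:
            state[pos - 1] = str(pos)  # token value == position index here
        states.append(list(state))
    return states

schedule = [(5, 3), (6, 1), (2,), (4,)]  # S1, S2 (MDM); S3, S4 (AR)
states = apply_schedule(6, schedule)     # four states, one per step
```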
For longer sequence lengths, the speedup would be much larger. As an example, on a sequence of length 4096, at alpha = 0.7, if we decode ~100 tokens per MDM step, we would only need 1258 total forward passes (29 MDM + 1229 AR), giving us a 3.26x speedup over the 4096 required for pure AR.
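That arithmetic is easy to check with a small sketch (the fixed per-step MDM token count is a simplifying assumption; the actual schedule is drawn from the binomial distribution described earlier):

```python
import math

def eso_forward_passes(L, alpha, mdm_tokens_per_step):
    """Total forward passes: one per MDM denoising step plus one per AR token."""
    n_mdm = math.floor(alpha * L)                       # tokens decoded in parallel
    n_ar = L - n_mdm                                    # tokens decoded sequentially
    mdm_steps = math.ceil(n_mdm / mdm_tokens_per_step)  # passes for the MDM phase
    return mdm_steps + n_ar

passes = eso_forward_passes(4096, 0.7, 100)  # 29 MDM + 1229 AR = 1258
speedup = 4096 / passes                      # vs one pass per token in pure AR
```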
Loss Function for Training
The training loss is the sum of the MDM loss (NELBO) for the MDM phase and the AR loss (cross-entropy) for the sequential phase. For the permuted order, the authors formulate an order-independent NELBO, referred to as the Any-Order NELBO. This way, the model is trained on both objectives. The authors use a token split of α = 0.5; more details are in section 3.3 of the paper.
Enabling KV Cache
Above, we reduced the total steps needed for decoding while retaining both the fine granularity of AR and the generalization ability of MDM. However, we still haven't seen how the KV cache is retained.
In Esoteric LMs, once the schedule is generated, we no longer process the tokens in their original index order. We permute them to fit the schedule, and since we know the schedule, we can rearrange the tokens back into the original order once the whole sequence is processed. Let's see this visually.

Img Source: https://arxiv.org/pdf/2506.01928
Consider the sequence ABCDEFGH. Let’s assume our denoising schedule is ((C, B), (A, H), (F, D), (E), (G))
We permute the tokens into the order of the schedule. The key benefit is that this allows causal attention: even though the tokens are not in their original sequence order, they are perfectly left to right inside the transformer, so the KV cache for already generated tokens stays constant and independent of the new tokens being generated.
The same logic then extends to the sequential phase - tokens are generated one step at a time, allowing KV cache reuse.
This brings up two important questions:
How does the model learn that the positions are not in order?
This is done purely through positional embeddings: each token receives the embedding of its position in the original sequence, and the model learns to differentiate between positions during training.
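As a toy illustration of that answer: tokens are fed to the model in schedule order, but each one carries the embedding of its original position (the table and names here are ours, purely illustrative):

```python
def embed_permuted(perm, pos_table):
    """Tokens arrive in schedule (permuted) order, but each is paired with
    the embedding of its ORIGINAL position so the model knows where it sits."""
    return [pos_table[orig_pos] for orig_pos in perm]

pos_table = {i: f"e{i}" for i in range(8)}  # toy positional-embedding table
perm = [2, 1, 0, 7, 5, 3, 4, 6]             # schedule order C, B, A, H, F, D, E, G
embs = embed_permuted(perm, pos_table)      # each entry keeps its original index
```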
How is attention applied here, since in causal language modelling, we cannot allow a token to attend to future tokens?
We apply attention masks to ensure no token can attend to future tokens, which would violate causal language modelling. The mask is applied as an additive bias: 0 where a token may attend, −inf where it may not. In the softmax, e^(−inf) = 0, zeroing out that attention weight.
During inference (or sampling, in MDLM literature), we only pass the currently decoded tokens plus the tokens to be decoded in that step.
In every forward pass, a token can only attend to tokens that appear to its left in the order the transformer processes them, ensuring causal attention: no token can attend to tokens in the future.
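The additive-bias mask can be sketched in a few lines: 0 where attention is allowed, −inf where it is not, so the softmax zeroes out forbidden positions exactly (a pure-Python illustration, not the paper's implementation):

```python
import math

NEG_INF = float("-inf")

def causal_bias(n):
    """Causal mask over tokens in processing order: query q may attend
    to key k only if k <= q; forbidden pairs get -inf."""
    return [[0.0 if k <= q else NEG_INF for k in range(n)] for q in range(n)]

def softmax(row):
    """Softmax over one row of biased scores; e^(-inf) contributes 0."""
    m = max(x for x in row if x != NEG_INF)
    exps = [0.0 if x == NEG_INF else math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax(causal_bias(4)[1])  # query at position 1 of 4
```

The query at position 1 attends only to positions 0 and 1; positions 2 and 3 receive exactly zero weight.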
Experimental Results
Models were trained on the LM1B and OpenWebText datasets, using the transformer architecture with rotary embeddings. Sequence packing was applied to improve training efficiency.
Perplexity
Eso-LMs achieved SOTA perplexities among all diffusion-based models, while continuing to offer fine-grained interpolation between the diffusion and autoregressive modalities. At α = 0 (pure AR), it is on par with autoregressive models.
However, the model scores worse than the original MDLM on pure diffusion modelling, which the authors attribute to the gap between MDLM's full bidirectional attention and Eso-LM's sparse causal attention.
Pareto Frontier of Generation Speed vs Quality
In diffusion models, the authors note, perplexity measures sample quality only under an infinite sampling budget (i.e., unlimited denoising steps). For practical purposes, we care about fixed time budgets, and there the authors show Eso-LMs consistently produce better outputs at smaller NFEs, with considerable speedups, establishing a new SOTA on the speed-quality Pareto frontier.
More details can be found in section 5.2 of the paper.
Generation Latency at Long Contexts
At longer contexts, Eso-LMs are 3-4x faster than prior diffusion models with partial KV cache support, and 14-65x faster than MDLMs that don’t support KV caching.
Training Speed
Eso-LMs are also much faster to train. The authors benchmark training against BD3-LMs and find that BD3-LMs take 2.67x longer to train.
Conclusion
There were two key issues holding back diffusion language models: KV caching and generation quality. Eso-LMs provide a clear path forward for both. The KV cache is no longer an approximation but a complete cache.
The gap between full bidirectional attention and sparse causal attention remains, but as we noted, at fixed time budgets Eso-LMs continue to offer higher generation quality.
Scaling has not yet been studied: all the models involved were relatively small, so we cannot be certain how they will behave at larger sizes. Still, Eso-LMs chart a clear path toward both higher quality and higher speed.
Awesome Machine Learning Summer Schools is a non-profit organisation that keeps you updated on ML Summer Schools and their deadlines. Simple as that.
Have any questions or doubts? Drop us an email! We would be more than happy to talk to you.