Move aside, images. Diffusion is now great for languages too - Awesome MLSS Newsletter

15th Edition

Autoregressive language models remind me of the sloth scene from Zootopia: slow, one word at a time, forced to reprocess the whole sequence with every new token. It's not how humans think, and as prior papers (and we) have noted, it paints the model into a corner, committed to a direction decided several steps ago.

Masked diffusion language modelling (MDLM) is the most notable effort to fix this. You know diffusion from image and video generators like Stable Diffusion or Sora — all those Ghibli images? Same idea.

A recent paper, Esoteric Language Models, combines the best of both worlds while tackling MDLM's biggest bottleneck: the lack of KV caching. More on that after a few quick updates.

Upcoming Summer School Announcements

Applications for some of the following summer schools are closing in the next 10 days. Make sure to apply to them before the application deadline!

Title | Deadline | Dates
Machine Learning Crash Course at MALGA 2026 — Genoa, Italy | Mar 01, 2026 | June 15 – June 19, 2026
OxML 2026 – MLx Cases ‘26 — Oxford, UK (Online) | April 7, 2026 | May 7 – May 16, 2026
UK Robotics Summer School 2026 — Edinburgh, UK | May 15, 2026 | June 22 – June 26, 2026
Oxford Machine Learning Summer School – MLx HEALTH & BIO 2026 — Oxford, UK | May 22, 2026 | July 10 – July 13, 2026
OxML 2026 – MLx Representation Learning & Generative AI 2026 — Oxford, UK | May 22, 2026 | July 15 – July 18, 2026

For the complete list, please visit our website.

Masked Diffusion Language Modelling - An Intuition

Autoregressive language models are trained to generate one token at a time. This is the standard across the majority of models: the GPT series, the Llama series, Mistral, Gemini, you name it.

MDLMs, by contrast, generate several tokens at once, in parallel, via a process called ‘denoising’: in a single forward pass, the model denoises some masked tokens to predict what they might be, repeating the process until the whole sequence is generated.

This allows the model to decode more than one token at a time, which both reduces the number of forward passes and gives it more flexibility, since it is not fixated on the tokens generated so far. In a way, it thinks about the whole latent sequence while generating; the intuition is similar to what we discussed in our VL-JEPA newsletter.

To be clear, the backbone in both cases is ultimately a transformer network; what changes is how we process the tokens.

Image Source: Generated with code

It does come with some caveats. 

  1. Because AR models decode one token at a time, they have much finer granularity and control, which helps them produce better-quality outputs. MDLMs are still catching up on this front.

  2. Because generation in MDLMs is not sequential, every token's representation — and hence its key/value states — changes at each forward pass. We therefore cannot store a KV cache, an important optimization for inference speed.
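To see why caching works for AR but not MDLM, here is a toy sketch (our own illustration, not the paper's code): in AR decoding, a token's key never changes after it is computed, so caching it is exact; in a bidirectional MDLM, every denoising step updates every token's representation, so keys and values would have to be recomputed each step.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wk = rng.normal(size=(d, d))       # toy key projection

# AR decoding: the key of token i never changes once computed,
# so it can be cached and reused at every later step.
tokens = rng.normal(size=(5, d))
cache = []
for i in range(5):
    cache.append(tokens[i] @ Wk)   # computed exactly once, then frozen

# The cached keys match a full recomputation: the cache is exact.
full = tokens @ Wk
assert np.allclose(np.stack(cache), full)

# In a bidirectional MDLM this invariant breaks: unmasking any token
# changes the hidden states of ALL tokens, so `tokens` (and therefore
# every key) would differ at each denoising step.
```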

Let’s take a quick look at the fundamental gap between AR and MDLM models.

Property | AR Models | MDLMs
Generation granularity | One token at a time — fine-grained control (better output quality) but constrained to a single left-to-right direction | Whole sequence at once — bidirectional context allows better generalization
KV caching | Full KV caching out of the box | No native KV caching; requires architectural modifications or approximations
Forward passes | One pass per token — scales linearly with sequence length | Fewer passes total — one pass per denoising step, each generating multiple tokens regardless of sequence length
Speed | Slower at inference due to sequential token generation | Faster in principle due to parallel token generation per step

Other Attempts at KV Caching

Several papers attempt to integrate KV caching into diffusion language models, but these are for the most part approximations and come with caveats. We highlight three here; there are several more.

Paper | How KV Cache Was Introduced | Why It's Not a Full KV Cache
dKV-Cache (Ma et al., 2025) | Caches KV states only for already-decoded tokens, with a one-step delay before committing to cache and periodic full refreshes every N steps | Masked tokens still require periodic recomputation since their representations keep evolving; the refresh mechanism is an explicit admission that the cache becomes stale
Fast-dLLM (Wu et al., NVLabs, 2025) | Introduces a block-wise "DualCache" that reuses KV states across adjacent denoising steps, exploiting high similarity between consecutive-step activations | Approximate by design — requires periodic global refreshes whenever KV drift exceeds a threshold, so the cache is never permanently fixed
BD3-LM (Arriola et al., ICLR 2025) | Adds causal AR structure between fixed-size blocks, allowing completed blocks' KV states to be cached and reused exactly like an AR model | Only inter-block KV states are truly cached; within each block, full bidirectional attention is still recomputed from scratch across all denoising steps

MDLM Loss: An Intuition

In the MDLM literature, decoding is referred to as denoising; we use the terms interchangeably, so if you see denoising, just think token generation. A single forward pass is referred to as a function evaluation, so if you see NFE (number of function evaluations), just think total forward passes.

PLEASE NOTE

One of the key structural issues in MDLM is that the loss cannot be calculated for just one step. For each token there are several denoising steps, so ideally we would accumulate the loss across all denoising steps for that token and sum it.

This is computationally intractable - the search space is exceedingly large

Therefore, for MDLM, we use an approximation based on the evidence lower bound (ELBO) on the log-likelihood. Using results from information theory and probability theory, the authors derive this lower bound and seek to maximize it, or equivalently to minimize its negative (the NELBO).

The exact formulation can be found here under section Objective for training: ELBO
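For reference, the continuous-time masked-diffusion NELBO commonly takes the following form (our rendering of the objective from the MDLM literature; notation may differ from the paper's):

```latex
% alpha_t : masking schedule, decreasing from 1 to 0 over t in [0, 1]
% z_t     : the partially masked sequence at noise level t
% x^l     : the l-th clean token; x_theta^l its predicted distribution
\mathcal{L}_{\text{NELBO}}
  = \mathbb{E}_{q}\int_{0}^{1}
    \frac{\alpha'_t}{1-\alpha_t}
    \sum_{\ell}\log\bigl\langle x_\theta^{\ell}(z_t),\, x^{\ell}\bigr\rangle\, dt
```

Intuitively, the integral over t averages the token-prediction loss over all noise levels, which sidesteps summing over every concrete denoising trajectory.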

Esoteric Language Models

Esoteric LMs combine the AR and MDLM paradigms. A single sequence generation has two phases:

MDM phase - A certain number of tokens are decoded using masked diffusion modelling

AR phase - The remaining tokens are decoded using sequential passes, same as regular LLMs

Most importantly, the structure is such that it allows for full KV Caching. Let us now see how this works. 

Generating Token Split

The number of tokens decoded via either paradigm is controlled by an input parameter α ∈ [0, 1]. Given a sequence of length L:

  • α = 1: all L tokens are decoded via MDM — fully parallel, maximum speedup

  • α = 0: all L tokens are decoded autoregressively — equivalent to a standard LLM

  • α ∈ (0, 1): ⌊αL⌋ tokens are decoded via MDM and the remaining L − ⌊αL⌋ tokens are decoded autoregressively

For example, α = 0.5 splits the sequence evenly between the two paradigms, with the MDM component providing considerable inference speedup and the AR component offering finer-grained control over the output.
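The split rule above can be sketched in a couple of lines (the function name is ours, not the paper's):

```python
import math

def token_split(alpha: float, L: int) -> tuple[int, int]:
    """Split a length-L sequence into (MDM-decoded, AR-decoded) token counts."""
    n_mdm = math.floor(alpha * L)   # floor(alpha * L) tokens go to the MDM phase
    return n_mdm, L - n_mdm         # the rest are decoded autoregressively

print(token_split(0.5, 8))    # -> (4, 4): even split
print(token_split(1.0, 8))    # -> (8, 0): pure diffusion
print(token_split(0.0, 8))    # -> (0, 8): pure autoregression
```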

Once we know how many tokens go to each phase, we must decide which positions are generated via MDM versus AR. These are drawn using a binomial distribution that depends on both the alpha above and the denoising step number; at each MDM step, we denoise one or more tokens according to the schedule produced by this distribution.

The remaining tokens, to be decoded via AR, are then appended sequentially in ascending order. More information can be found in appendix B.5 of the paper.

Let's consider a sequence [1, 2, 3, 4, 5, 6] where each value is the token's position index. We choose alpha = 0.67, so ⌊0.67 × 6⌋ = 4 tokens are generated via MDM and 2 via AR. Following the procedure above, we generate a denoising schedule:

Schedules:

  • S_MDM = ((5, 3), (6, 1))

  • S_AR = ((2), (4))

  • S = (S1, S2, S3, S4) where S1 = (5, 3), S2 = (6, 1), S3 = (2), S4 = (4)

As you can see, instead of needing six denoising steps, we only require 4 steps. The full schedule would look something like:

Step | Subset of tokens generated | Type | Sequence State
Start | — | — | [M, M, M, M, M, M]
S1 | (5, 3) | MDM | [M, M, 3, M, 5, M]
S2 | (6, 1) | MDM | [1, M, 3, M, 5, 6]
S3 | (2) | AR | [1, 2, 3, M, 5, 6]
S4 | (4) | AR | [1, 2, 3, 4, 5, 6]
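The schedule can be replayed mechanically; this toy sketch (our code, with 1-indexed positions as in the example) reproduces the sequence states step by step:

```python
M = "M"  # placeholder for a still-masked position

def apply_schedule(length, schedule):
    """Replay a denoising schedule; each entry is a tuple of 1-indexed
    original-sequence positions unmasked at that step."""
    seq = [M] * length
    states = []
    for positions in schedule:
        for pos in positions:
            seq[pos - 1] = pos      # "decode" the token at its original index
        states.append(list(seq))
    return states

states = apply_schedule(6, [(5, 3), (6, 1), (2,), (4,)])
print(len(states))   # -> 4 steps instead of 6
print(states[-1])    # -> [1, 2, 3, 4, 5, 6]
```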

For longer sequences, the speedup is much larger. For example, on a sequence of length 4096 at alpha = 0.7, decoding ~100 tokens per MDM step requires only 1258 total forward passes (29 MDM + 1229 AR), a 3.26x speedup over the 4096 passes required for pure AR.
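The arithmetic behind that claim, as a quick check:

```python
import math

L, alpha, per_step = 4096, 0.7, 100
n_mdm = math.floor(alpha * L)               # 2867 tokens via MDM
mdm_steps = math.ceil(n_mdm / per_step)     # 29 denoising steps
ar_steps = L - n_mdm                        # 1229 tokens, one pass each
total = mdm_steps + ar_steps                # 1258 forward passes
print(total, round(L / total, 2))           # -> 1258 3.26
```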

Loss Function for Training

The training loss is the sum of the MDM loss (NELBO) for the MDM phase and the AR loss (cross-entropy) for the sequential phase. For the permuted order, the authors formulate an order-independent NELBO, referred to as the Any-Order NELBO. This way, the model is trained on both objectives. The authors use a token split of alpha = 0.5. More details are in section 3.3 of the paper.

Enabling KV Cache

Above, we reduced the total steps needed for decoding while retaining both the fine granularity of AR and the generalization ability of MDM. However, we still haven’t seen how the KV cache is retained.

In Esoteric LMs, once the schedule is generated, we no longer process tokens in their original order. We permute them to fit the schedule, and since the schedule is known, we can restore the original order once the whole sequence is processed. Let us see this visually.

Consider the sequence ABCDEFGH. Let’s assume our denoising schedule is ((C, B), (A, H), (F, D), (E), (G))

We permute the tokens into schedule order. The key benefit is that this allows causal attention: even though the tokens are not in their original sequence order, they are perfectly left-to-right inside the transformer, keeping the KV cache constant for already-generated tokens and independent of the tokens being generated next.

The same logic then extends to the sequential phase - tokens are generated one step at a time, allowing KV cache reuse. 
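The permute-then-restore trick can be sketched with index arrays (a toy illustration, not the paper's implementation):

```python
# Flatten the schedule ((C, B), (A, H), (F, D), (E), (G)) into a decoding
# order, process tokens left-to-right in that order (so attention stays
# causal and the KV cache append-only), then invert the permutation.
original = list("ABCDEFGH")
order = list("CBAHFDEG")                    # the schedule, flattened

perm = [original.index(t) for t in order]   # original index of each slot
permuted = [original[i] for i in perm]      # what the transformer sees

# Invert: place each processed token back at its original position.
restored = [None] * len(original)
for slot, i in enumerate(perm):
    restored[i] = permuted[slot]

print("".join(restored))   # -> "ABCDEFGH"
```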

This brings up two important questions:

  1. How does the model learn that the positions are not in order?

    1. This is handled purely by positional embeddings: each token receives the positional embedding for its position in the original sequence, and the model learns to differentiate positions through training.

  2. How is attention applied here, since in causal language modelling, we cannot allow a token to attend to future tokens? 

    1. We apply attention masks to ensure no token attends to future tokens, which would violate causal language modelling. The mask is applied as an additive bias: 0 if the token may attend, -inf if not. In the softmax layer, e^(-inf) zeroes out that attention weight.

    2. During inference (or sampling in MDLM literature), we only pass the currently decoded tokens, and the tokens to be decoded in that step. 

    3. In every forward pass, a token can only attend to tokens that appear to its left in the order the transformer sees. This ensures causal attention: no token can attend to tokens in the future.
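The additive-bias mask described above can be sketched as follows (a minimal NumPy illustration, assuming a single head and pre-scaled logits):

```python
import numpy as np

def masked_attention_weights(scores, allowed):
    """Apply a 0 / -inf additive bias before softmax.

    scores:  (n, n) raw attention logits
    allowed: (n, n) boolean, True where a query may attend to a key
    """
    bias = np.where(allowed, 0.0, -np.inf)
    logits = scores + bias
    # Row-wise softmax; e^{-inf} = 0, so disallowed keys get zero weight.
    logits -= logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)

n = 4
scores = np.zeros((n, n))                      # uniform toy logits
causal = np.tril(np.ones((n, n), dtype=bool))  # attend only to the left
w = masked_attention_weights(scores, causal)
print(w[2])   # row 2: equal weight on keys 0-2, exactly 0 on key 3
```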

Experimental Results

Models were trained on the LM1B and OpenWebText datasets, using the transformer architecture with rotary embeddings. Sequence packing was applied to improve training efficiency. 

Perplexity

Eso-LMs achieved SOTA perplexities among all diffusion-based models, while continuing to offer fine-grained interpolation between the diffusion and autoregressive modalities. At alpha = 0 (pure AR), they are on par with autoregressive models.

However, at pure diffusion the model scores worse than MDLM (the original paper), which the authors attribute to the gap between full bidirectional attention in MDLM and the sparse causal attention in Eso-LM; they provide a detailed explanation in the paper.

Pareto Frontier of Generation Speed vs Quality

For diffusion models, the authors note that perplexity measures sample quality only under an infinite sampling budget (i.e., infinitely many denoising steps). In practice, we care about fixed time budgets, and there the authors show Eso-LMs consistently produce better outputs at smaller NFEs, with considerable speedups, establishing a new SOTA on the speed-quality frontier.

More details can be found in section 5.2 of the paper. 

Generation Latency at Long Contexts

At longer contexts, Eso-LMs are 3-4x faster than prior diffusion models with partial KV cache support, and 14-65x faster than MDLMs that don’t support KV caching. 

Training Speed 

Eso-LMs are also much faster to train. The authors report training-time comparisons against BD3-LMs and find that BD3-LMs take 2.67x longer to train.

Conclusion 

There were two key issues holding back diffusion language models: KV caching and generation quality. Eso-LMs provide a clear path forward on both. The KV cache is no longer an approximation but an exact, complete cache.

The gap between full bidirectional attention and sparse causal attention remains, but, as noted above, at fixed time budgets Eso-LMs continue to offer higher generation quality.

Scaling has not yet been studied: all the models involved were relatively small, so we cannot be certain how they will behave at larger sizes. Still, Eso-LMs provide a clear path toward both higher quality and higher speed.

Awesome Machine Learning Summer Schools is a non-profit organisation that keeps you updated on ML Summer Schools and their deadlines. Simple as that.

Have any questions or doubts? Drop us an email! We would be more than happy to talk to you.

With love, Awesome MLSS
