Speculative Decoding with Discrete Diffusion

Faster LLM inference via SpecDiff and SpecDiff-2.

Large language models are powerful but slow at inference time because they generate tokens one at a time. Speculative decoding speeds this up by letting a smaller “drafter” propose multiple tokens in parallel, then having the larger “verifier” accept or reject them without changing output quality. The bottleneck is that most drafters are themselves autoregressive, so drafting remains sequential, and draft-verifier mismatch can keep acceptance rates low.
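
To make the accept/reject step concrete, here is a minimal sketch of the standard speculative-decoding verification loop (rejection sampling against the verifier's distribution). The `drafter` and `verifier` objects and their `probs(prefix)` interface are hypothetical stand-ins, not any specific library's API.

```python
import numpy as np

def verify_draft(prefix, draft_tokens, drafter, verifier, rng):
    """Accept a prefix of the draft, then emit one corrected token on rejection."""
    accepted = []
    for tok in draft_tokens:
        context = prefix + accepted
        q = drafter.probs(context)    # drafter distribution at this position
        p = verifier.probs(context)   # verifier distribution (computed for all positions in one parallel pass in practice)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)      # accepted: matches what the verifier would sample
        else:
            # Rejected: resample from the normalized residual (p - q)^+,
            # which keeps the overall output distribution equal to the verifier's.
            residual = np.clip(p - q, 0.0, None)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            break
    return accepted
```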

SpecDiff: parallel drafting with discrete diffusion

Speculative Diffusion Decoding (SpecDiff) replaces the autoregressive drafter with a discrete diffusion model that can draft entire token blocks in parallel. Diffusion-style denoising updates all positions at once, so drafting cost depends on a fixed number of denoising steps rather than the draft length. The result is significantly higher throughput, with reported speedups up to 7.2x over vanilla decoding and up to 1.75x over prior speculative decoding baselines while preserving verifier quality.
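
A schematic sketch of the drafting step (not the authors' implementation): the whole draft block is refined in parallel for a fixed number of denoising steps, so drafting cost scales with the step count rather than the block length. The `diffusion_drafter` object, its `denoise_step` method, and `MASK_ID` are assumed interfaces for illustration.

```python
import torch

MASK_ID = 0  # hypothetical mask-token id used to initialize the block

def draft_block(diffusion_drafter, prefix_ids, gamma=8, num_steps=4):
    # Start from a fully masked block of gamma draft positions.
    block = torch.full((gamma,), MASK_ID, dtype=torch.long)
    for _ in range(num_steps):
        # One denoising step updates every position in the block at once,
        # conditioned on the prompt/prefix and the current block state.
        block = diffusion_drafter.denoise_step(prefix_ids, block)
    return block  # gamma draft tokens produced in num_steps parallel passes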

SpecDiff-2: alignment that scales acceptance

SpecDiff-2 builds on this idea and tackles a second bottleneck: draft-verifier misalignment. It introduces a dual alignment strategy that improves acceptance rates across the entire draft window. The method combines train-time “streak-distillation” with test-time self-selection, which scores multiple diffusion drafts and chooses the one most likely to be accepted. The paper reports an average 55% improvement in tokens per second over prior baselines and up to 5.5x speedups over standard decoding, without loss of accuracy.
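
The test-time self-selection idea can be sketched as follows. This is a simplified illustration, not the paper's exact scoring rule: here the drafter's own mean log-probability of its draft is used as a proxy for expected acceptance, and `draft_block` refers to the drafting sketch above; `log_prob` is a hypothetical interface.

```python
import torch

def self_select_draft(diffusion_drafter, prefix_ids, num_candidates=4, **draft_kwargs):
    best_block, best_score = None, float("-inf")
    for _ in range(num_candidates):
        block = draft_block(diffusion_drafter, prefix_ids, **draft_kwargs)
        # Score the candidate; the drafter's mean log-probability of its own
        # tokens stands in as a proxy for how likely the verifier is to accept it.
        logp = diffusion_drafter.log_prob(prefix_ids, block)  # hypothetical interface
        score = logp.mean().item()
        if score > best_score:
            best_block, best_score = block, score
    return best_block  # the candidate draft most likely to be accepted
```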

Why it matters

These results suggest that parallel, non-autoregressive drafting can be a practical path to faster LLM inference. Beyond latency, faster decoding enables more reasoning within a fixed wall-clock budget, improving task accuracy when time is constrained. The combination of diffusion-based drafting and alignment strategies defines a new, scalable design point for high-throughput LLM systems.

Read the papers