Constrained Discrete Diffusion for Language, Chemistry, and Code

01 · Start here

From tokens to masks—and back again

Discrete diffusion generates text, molecular strings, and code by repeatedly refining an entire sequence. That global, editable state is what makes constraint enforcement possible before the output is final.

Let $x_0=(x_0^1,\ldots,x_0^L)$ be a sequence of $L$ tokens drawn from a vocabulary $\mathcal{V}$. A masked diffusion model defines a forward process that progressively replaces clean tokens with a special [MASK] symbol. At noise level $t$, each position is retained with probability $\alpha_t$ and masked otherwise:

\[q(x_t^\ell\mid x_0^\ell)=\operatorname{Cat}\!\left(\alpha_t x_0^\ell+(1-\alpha_t)m\right),\]

where $m$ denotes the one-hot representation of [MASK]. As $t$ increases, more of the sequence disappears. The learned reverse process starts from a fully masked state and predicts clean-token distributions until a complete sequence is recovered.

Forward processTokens → masksA fixed corruption schedule progressively removes information.

⇄

Learned reverse processMasks → sequenceA transformer repeatedly predicts the clean token at every position.

What the denoiser learns

Given a partially masked sequence $x_t$, the denoiser predicts a distribution over the vocabulary for every position:

\[x_\theta(x_t,t)\in\Delta^{L\times|\mathcal{V}|}.\]

Unlike a left-to-right model, it does not commit permanently to one next token. It exposes a global proposal for the entire clean sequence at each reverse step. Tokens that remain uncertain can be reconsidered as information elsewhere in the sequence becomes available.

What the model sees

A draft of the whole output—not merely its prefix. That draft can be inspected for a distant word, a forbidden molecular substructure, a failed test, or another sequence-level requirement while there is still time to revise it.

Discrete diffusion language models: forward masking and backward denoising — Figure 1 The forward process masks a token sequence; the learned reverse process reconstructs all positions through iterative refinement.

Mathematical foundation · Sahoo et al., NeurIPS 2024 →

Why a coherent sequence is not necessarily a valid one

Discrete sequences—tokens in natural language, SMILES strings in chemistry, and source code in programming—carry rules that likelihood alone does not guarantee. A fluent sentence may violate a lexical instruction. A plausible molecular string may contain an unwanted structural alert. A program can look natural while failing its tests or introducing a vulnerability.

Autoregressive models generate left to right, so many non-local properties cannot be evaluated until the sequence is complete. Post-hoc filtering can catch failures, but only after the generation cost has been paid. Discrete diffusion offers a different intervention point: inspect and correct the global draft throughout denoising.

A

Prompt and hope

Ask the model for a rule, but receive no per-sample assurance that it will be followed.

B

Generate, then filter

Reject invalid outputs after completion, which becomes inefficient when feasible sequences are rare.

C

Constrain while denoising

Evaluate the full draft and minimally redirect token probabilities before the output is committed.

The thesis The global token state in discrete diffusion is more than a modeling choice—it is an interface for enforcing non-local rules during generation.

Our work turns that interface into training-free projection, symbolic feedback, and search mechanisms for language, molecules, reasoning, and code.

02 · Core idea

Project token distributions onto the feasible set

In continuous diffusion, a projection moves a point in Euclidean space. In discrete diffusion, the object being corrected is a probability distribution over possible tokens.

The continuous constraint-aware foundation views reverse diffusion as a sequence of optimization steps and augments each one with a feasibility correction. Constrained Discrete Diffusion (CDD) carries that idea to the probability simplex. At each reverse step, instead of sampling directly from the unconstrained prediction $x_t$, CDD solves a local proximal problem:

\[\operatorname{prox}_{\mathcal{C}}(x_t) = \arg\min_y\;D_{\mathrm{KL}}(y\,\|\,x_t) \quad\text{subject to}\quad \arg\max(y)\in\mathcal{C}.\]

Here $$y\in\Delta^{L\times

\mathcal{V}

}$is the corrected token distribution and$\mathcal{C}$$ is the feasible set. The KL objective changes the model’s proposal only as much as needed, while the constraint acts on the sequence that would be obtained by decoding the corrected probabilities.

Constraint-aware foundation · Christopher, Baek & Fioretto, NeurIPS 2024 →

Constraint-aware discrete diffusion: KL proximal projection on token logits — Figure 2 At every reverse step, CDD finds a nearby token distribution whose most likely decoded sequence satisfies the specified rule.

What projection changes

The denoiser still decides which sequences are plausible. The projection only redistributes enough probability mass to prevent the current draft from violating the encoded requirement.

Because the proximal step does not update model weights, CDD is a training-free wrapper for a pretrained masked-diffusion backbone. The same sampler can support different constraints by changing the evaluator and projection rather than retraining the generator.

Discrete constraint framework · Cardei et al., NeurIPS 2025 →

Beyond differentiable constraints

Many important checks are black boxes: SMARTS or BRENK substructure filters in chemistry, compilers and static analyzers in code, protein structure predictors, or formal logic solvers. Neuro-Symbolic Diffusion (NSD) interleaves denoising with symbolic optimization, while Search-Augmented Masked Diffusion (SearchDiff) explores the model’s high-probability token proposals with tree search when a local projection is insufficient.

Symbolic feedback

Neuro-Symbolic Diffusion

External logic and functional evaluators critique intermediate samples, allowing the sampler to enforce rules that are not naturally differentiable.

Christopher et al., NeuS 2025 →

Search-based correction

SearchDiff

Tree search evaluates alternative high-probability tokens and selects transitions that improve global property satisfaction.

Ta et al., 2026 →

03 · Research landscape

One sampling principle, several kinds of rules

The evaluator changes from one domain to another, but the recurring loop remains the same: propose a full sequence, evaluate it, and correct the token distribution before denoising continues.

Application 01Language safetyToxicity thresholds, counting, and lexical placement Application 02Molecular generationNovelty, structural alerts, and chemical validity Method extensionBlack-box searchProtein properties, logic, and arbitrary evaluators Application 03Code generationFunctionality, syntax, and security

0%toxicity-rule violations

392novel BRENK-screened molecules

65.2%HumanEval-X functionality with CDC

~50%backbone perplexity reduction

Application 01

Natural-language safety and exact lexical rules

Language constraints range from learned safety scores to exact symbolic instructions. CDD treats both as requirements on the complete token sequence rather than preferences attached to individual next-token choices.

For toxicity mitigation, the current draft is evaluated by a classifier and the token distribution is projected toward a specified threshold. For counting and lexical tasks, the feasible set encodes exact requirements such as how many times a character occurs or which word must appear at a particular position.

Across all evaluated toxicity thresholds ($\omega=0.25,0.50,0.75$), CDD produced zero constraint violations, compared with violation rates of 33.2%, 21.6%, and 13.1% for GPT-2 and 17–32% for the evaluated discrete-diffusion baselines. It also reached zero violations on counting and positional lexical constraints, versus 54.5% and 97.5% for unconstrained MDLM.

Toxicity thresholds0%violations across all evaluated settings

Counting rules0%vs. 54.5% violations for MDLM

Lexical placement0%vs. 97.5% violations for MDLM

What these results show Exact sequence-level control can be added to a masked diffusion model without retraining it for every new instruction.

The guarantee is tied to the supplied rule or classifier, so it should be interpreted as reliable benchmark compliance rather than a complete definition of safe language. Fluency and perplexity remain competitive, although constraint enforcement introduces task-dependent tradeoffs.

Language constraints · Cardei et al., NeurIPS 2025 →

Application 02

Molecular generation for drug discovery

A SMILES string is a discrete sequence, but its value depends on the chemical structure it represents. Useful generation therefore requires sequence validity, novelty, and domain filters to hold together.

Chemical molecule generation is a natural discrete-diffusion application: SMILES strings are token-based representations of molecular structure. The benchmark combines novelty constraints with five BRENK substructure filters, ruling out selected aldehydes, three-membered heterocycles, and other reactive or undesirable fragments during generation rather than after it.

Molecular generation with novelty and toxicity constraints on SMILES — Figure 3 CDD evaluates novelty and selected medicinal-chemistry structural alerts throughout the SMILES denoising trajectory.

Our CDD for Molecular Generation applies KL-proximal corrections for novelty and BRENK constraints within a single training-free inference procedure. The generated sequence remains close to the model’s proposal while being redirected away from encoded violations.

Molecular design results: valid, novel, and non-toxic molecule generation — Figure 4 CDD produces more novel candidates that pass the selected BRENK screens than the evaluated autoregressive and unconstrained diffusion baselines.

Model	Novel ↑	Novel & screened ↑	Novelty violations ↓	BRENK violations ↓
AR	10.3 ± 2.3	5.3 ± 1.4	99.0%	40.2%
MDLM	260.7 ± 16.4	108.0 ± 9.7	53.9%	35.3%
UDLM	279.7 ± 22.7	132.3 ± 3.7	70.8%	38.1%
CDD + BRENK ours	451.7 ± 19.5	392.0 ± 16.7	51.2%	0.0%

Novel & screened392vs. 108 for MDLM and 5 for AR

BRENK violations0%for the encoded structural alerts

Model retrainingNoneconstraints are imposed at inference

What these results show Constraint-aware denoising can focus molecular exploration on novel candidates that pass specified structural screens.

BRENK filters are useful early screens, not a complete assessment of toxicity, synthesizability, stability, or biological activity. The outputs remain computational candidates that require broader cheminformatics and experimental validation.

Molecular constraints · Cardei et al., AI4D3 at NeurIPS 2025 — Best Paper →

Method extension

Search-augmented discrete diffusion

Projection works well when feasibility can be evaluated and corrected locally. SearchDiff addresses global black-box objectives for which the useful correction is not available as a gradient.

At each denoising step, the model’s token distribution defines a proposal set. Search explores high-probability alternatives, evaluates them with a user-provided property function, and selects a transition that balances likelihood and constraint satisfaction.

SearchDiff: tree search augments the denoising process for constrained generation — Figure 5 SearchDiff uses the denoiser to propose likely tokens and tree search to identify transitions that better satisfy a global property.

The key architectural advantage is that the full candidate sequence is accessible at every reverse step. A protein structure predictor, logic solver, test harness, or other black-box evaluator can score the current draft and return feedback long before generation is complete.

Discrete diffusion enables blackbox constraint evaluators at every step — Figure 6 Global intermediate states allow arbitrary evaluators to critique a sequence and guide subsequent denoising.

Across biological design and symbolic-reasoning benchmarks, SearchDiff substantially improves property adherence over unconstrained discrete diffusion and the evaluated autoregressive baselines.

Search-based constraints · Ta et al., 2026 →

Application 03

Constrained code generation

Programs must satisfy syntax, functionality, and security requirements simultaneously. Diffusion's editable global state provides a place to localize a failure and revise only the responsible region.

Constrained Diffusion for Code (CDC) combines mathematical optimization with program analysis. At each denoising step, it decodes an intermediate program, evaluates the relevant requirement, identifies the tokens most responsible for a violation, and locally adjusts or remasks that region while keeping the remainder close to the base model’s proposal.

For functional correctness, CDC uses differentiable surrogate execution feedback to guide generations toward programs that pass unit tests. For security, static analysis identifies vulnerability paths and supplies targeted symbolic feedback. The same localized corrections also improve syntactic validity.

Constrained code generation results: functional correctness, security, and syntax — Figure 7 CDC improves benchmark pass rates for functionality, syntax, and security across DiffuCoder 7B and Dream-Coder 7B.

With Dream-Coder 7B, CDC raises HumanEval-X C++ functionality from 34.1% to 65.2% and MBPP-C++ functionality from 27.7% to 59.2%. Syntax correctness rises from 67.1% to 79.2% and from 41.1% to 72.0%, respectively. On CWEval, joint functionality-and-security success increases from 12.04% to 34.26%.

HumanEval-X function65.2%vs. 34.1% for the base model

HumanEval-X syntax79.2%vs. 67.1% for the base model

CWEval joint success34.26%vs. 12.04% for the baseline

What these results show Program-level feedback can be integrated during denoising to improve correctness and security with focused edits rather than whole-program regeneration.

The result is only as complete as the tests, surrogate, and static analyzer used to evaluate the program. Passing these benchmarks does not rule out untested behavior or unknown vulnerabilities, so consequential code still requires review and execution in an appropriate validation environment.

Program constraints · Shao et al., 2026 →

04 · Current and ongoing work

Improving the masked-diffusion backbone

Constraint enforcement depends on the quality of the base model’s clean-token predictions. Standard masked diffusion models infer still-masked positions largely from scratch at every step, discarding useful information from earlier predictions.

Simple Self-Conditioning Adaptation for Masked Diffusion Models (SCMDM) conditions each denoising step on the model’s previous clean-state prediction. This allows uncertain positions to accumulate information across steps without introducing an additional recurrent latent state or increasing the number of denoiser evaluations at inference.

SCMDM reduces generative perplexity on OpenWebText from 42.89 to 23.72, alongside improvements in molecular generation and genomic distribution modeling. Better clean-state predictions also give constraint projections a stronger proposal from which to start.

OWT perplexity23.72down from 42.89

Reduction~50%in generative perplexity

Extra denoiser calls0at inference time

Backbone adaptation · Cardei, Ta & Fioretto, 2026 →

05 · Results overview

Where constrained discrete diffusion helps

Domain	Method	Result to remember
Toxicity mitigation	CDD NeurIPS 2025	0% violations at all thresholds vs. 13–33% for GPT-2 and 17–32% for MDLM/UDLM.
Lexical constraints	CDD NeurIPS 2025	0% violations on counting and lexical constraints vs. 54.5–97.5% for MDLM.
Molecular generation	CDD + BRENK AI4D3 2025	392 novel BRENK-screened molecules vs. 108 for MDLM, with 0% BRENK violations.
Symbolic reasoning	SearchDiff 2026	Improves property satisfaction over autoregressive and discrete-diffusion baselines.
Code generation	CDC 2026	Improves functionality, security, and syntax through localized denoising-time edits.
MDM backbone	SCMDM 2026	About 50% perplexity reduction, from 42.89 to 23.72, with no extra inference calls.
Neuro-symbolic	NSD NeuS 2025	Symbolic constraint feedback across continuous and discrete domains.

The broader principle Discrete diffusion turns sequence generation into an editable process in which global requirements can be checked before the final output is committed.

Projection is one implementation of that idea. Symbolic feedback, search, and self-conditioned backbones extend it to constraints and domains that require richer evaluators.

06 · Papers in this guide

Follow the research program

NeurIPS 2025Constrained Discrete DiffusionCardei, Christopher, Hartvigsen, Kailkhura & Fioretto AI4D3 2025 · Best PaperConstrained Molecular Generation with Discrete Diffusion for Drug DiscoveryCardei, Christopher, Hartvigsen, Kailkhura & Fioretto 2026Search-Augmented Masked Diffusion Models for Constrained GenerationTa, Cardei, Velasquez & Fioretto 2026Constrained Code Generation with Discrete DiffusionShao, Cardei, Xie, Fioretto & Wang 2026Simple Self-Conditioning Adaptation for Masked Diffusion ModelsCardei, Ta & Fioretto NeuS 2025Neuro-Symbolic Generative Diffusion ModelsChristopher, Cardei, Liang & Fioretto NeurIPS 2024Simple and Effective Masked Diffusion Language ModelsSahoo et al.