Constrained Discrete Diffusion for Language, Chemistry, and Code
Enforcing hard constraints on tokens, molecules, and programs — training-free, inside the denoising loop.
Problem Statement and Motivation
Discrete sequences — tokens in natural language, SMILES strings in chemistry, source code in programming — hold immense potential for scientific and engineering discovery. Yet deploying large language models and discrete generative models in these high-stakes domains exposes a fundamental gap: the outputs of these models routinely violate critical constraints, whether toxicity guidelines, chemical validity rules, functional correctness requirements, or security properties.
Autoregressive models generate text left-to-right and are structurally constrained by this causal factorization: it is impossible to verify non-local properties (e.g., “does this molecule contain an aldehyde fragment?” or “does this code compile?”) until generation is complete, when corrections are expensive or impossible. Discrete diffusion models break this bottleneck. Because they expose the full, partially denoised sequence at every reverse step, they offer something autoregressive models cannot: a global program or sequence state that can be inspected, scored, and corrected before committing to any final token.
Our work turns this structural advantage into rigorous constraint-enforcement machinery, without retraining the underlying model.
The Framework: Projecting Token Distributions onto the Feasible Set
Standard masked diffusion models (MDMs) operate over sequences of \(L\) tokens drawn from a vocabulary \(\mathcal{V}\). Each position \(\ell\) in the sequence \(x_t \in \mathbb{R}^{L \times \lvert\mathcal{V}\rvert}\) holds a probability distribution over vocabulary items. The forward process corrupts tokens by progressively replacing them with a special [MASK] token. The learned reverse process — a transformer network — predicts clean-state logits and iteratively unmasks the sequence.
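As a rough illustration of the forward corruption described above, the sketch below masks each token independently with probability `t`. The independent per-token masking, the function name `forward_mask`, and the literal `[MASK]` string are assumptions made for this example; the actual noise schedule depends on the backbone.

```python
import random
from typing import List

MASK = "[MASK]"  # placeholder mask symbol, assumed for illustration


def forward_mask(tokens: List[str], t: float, rng: random.Random) -> List[str]:
    """Replace each token with MASK independently with probability t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [MASK if rng.random() < t else tok for tok in tokens]


rng = random.Random(0)
clean = ["C", "C", "(", "=", "O", ")", "O"]
half_masked = forward_mask(clean, 0.5, rng)   # a partially corrupted sequence
fully_masked = forward_mask(clean, 1.0, rng)  # every position is masked
```

At `t = 0` the sequence is returned unchanged and at `t = 1` every position is masked; the reverse process starts from the latter and unmasks step by step.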
Constrained Discrete Diffusion (CDD) (Cardei et al., NeurIPS 2025) adapts the constrained diffusion principle to this discrete setting. At each reverse step \(t\), instead of directly sampling from the unconstrained model prediction, CDD solves a local proximal problem over the probability simplex:
\[\text{prox}_{\mathcal{C}}(x_t) = \arg\min_{y} \; \mathrm{KL}(y \,\|\, x_t) \quad \text{s.t.} \quad \arg\max(y) \in \mathcal{C}\]
Here \(y\) is the projected distribution, with each position's row lying in the simplex \(\Delta^{\lvert\mathcal{V}\rvert}\), and \(\mathcal{C}\) encodes the constraint on the most likely decoded tokens. This ensures two things simultaneously: (1) the projection stays close to the model’s intended distribution (preserving fluency), and (2) the sequence that would result from greedily decoding the projected distribution satisfies the constraint at the sequence level, not just position by position.
Because the KL proximal step requires no gradient through model weights and no retraining, CDD is a training-free, plug-and-play wrapper for any masked diffusion backbone.
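To make the proximal step concrete, consider a deliberately simplified special case: a single position whose constraint is that a set of banned tokens receives zero probability. This is an assumption chosen for illustration; it is simpler than the arg max constraint above and is not the CDD algorithm itself. Under it, the KL-closest feasible distribution is the original one with the banned entries zeroed and the rest renormalized, and the KL cost equals the negative log of the allowed mass. The names `project_onto_allowed` and `kl` are hypothetical.

```python
import math
from typing import List, Set


def kl(y: List[float], x: List[float]) -> float:
    """KL(y || x), treating 0 * log(0 / x_i) as 0."""
    return sum(yi * math.log(yi / xi) for yi, xi in zip(y, x) if yi > 0)


def project_onto_allowed(x: List[float], banned: Set[int]) -> List[float]:
    """Minimize KL(y || x) over distributions y with zero mass on banned indices."""
    allowed_mass = sum(p for i, p in enumerate(x) if i not in banned)
    if allowed_mass <= 0:
        raise ValueError("no allowed token has positive probability")
    return [0.0 if i in banned else p / allowed_mass for i, p in enumerate(x)]


x = [0.5, 0.3, 0.2]                      # model prediction at one position
y = project_onto_allowed(x, banned={0})  # token 0 is forbidden
cost = kl(y, x)                          # equals -log(0.3 + 0.2) = log 2
```

The relative order of the allowed tokens is unchanged, which is the sense in which the projection departs from the model's prediction only as much as the constraint requires.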
Handling Non-Differentiable Constraints: Neuro-Symbolic Diffusion
Many real-world constraints involve non-differentiable checkers — SMARTS substructure filters in chemistry, compiler parse trees in code, or formal logic evaluators in reasoning. Our Neuro-Symbolic Diffusion (NSD) framework (Christopher et al., NeuS 2025, DARPA Disruptive Idea Award) extends this principle by interleaving diffusion steps with symbolic optimization, enabling certifiably consistent generation under arbitrary functional and logic constraints — for both continuous and discrete domains.
Search-Augmented Masked Diffusion (SearchDiff) (Ta et al., 2026) takes a complementary approach: rather than projecting logits, it introduces a tree search at each denoising step that optimizes over the model’s predicted proposal set under user-specified property satisfaction, yielding a modified reverse transition that steers sampling toward probable and feasible solutions.
Application 1: Natural Language Safety — Toxicity Mitigation and Lexical Constraints
The most direct application is enforcing content-safety constraints on text generation. Large language models trained on internet-scale data can generate harmful, toxic, or factually incorrect content. CDD addresses this by treating toxicity and lexical rules as hard constraints on the token sequence.
At each reverse step, the KL-proximal projection gates out tokens whose presence would increase the probability of toxic continuations (evaluated by a classifier) or violate counting/lexical rules (e.g., “do not include the word ‘kill’ in the output”). Key results from (Cardei et al., 2025):
- Zero constraint violations on toxicity mitigation across all threshold levels (ω = 0.25, 0.50, 0.75), versus 33.2%, 21.6%, and 13.1% violation rates for GPT-2 — and 17–32% for MDLM and UDLM discrete diffusion baselines.
- Zero violations on character-level and sequence-level lexical constraints, versus 54.5% and 97.5% for MDLM without constraint enforcement.
- Minimal computational overhead compared to post-hoc alternatives: PPLM incurs an 85× slowdown, FUDGE a 134–143× slowdown; CDD adds negligible cost over standard discrete diffusion sampling.
Crucially, constraint satisfaction is achieved while preserving semantic coherence and fluency — perplexity remains competitive with unconstrained generation, as CDD only departs from the model’s intent by the minimum required to satisfy constraints.
Application 2: Molecular Generation for Drug Discovery
Chemical molecule generation is a natural discrete-diffusion application: SMILES strings are sequential, token-based representations of molecular structure. Generating valid drug candidates requires satisfying multiple simultaneous constraints — novelty (the molecule must not be in the training set), toxicity avoidance (substructure filters from medicinal chemistry), and chemical validity (SMILES grammar rules).
Our CDD for Molecular Generation (Cardei et al., AI4D3 NeurIPS 2025, Best Paper Award) applies the KL proximal projection to enforce five BRENK substructure filters — ruling out aldehydes, three-membered heterocycles, and other toxicity-linked fragments — jointly with novelty constraints, all within a single training-free inference procedure.
Quantitative results from the molecule generation benchmark:
Novel and non-toxic molecule discovery
| Model | Novel ↑ | Novel & non-toxic ↑ | Novelty violations ↓ | BRENK violations ↓ |
|---|---|---|---|---|
| AR | 10.3 ± 2.3 | 5.3 ± 1.4 | 99.0% | 40.2% |
| MDLM | 260.7 ± 16.4 | 108.0 ± 9.7 | 53.9% | 35.3% |
| UDLM | 279.7 ± 22.7 | 132.3 ± 3.7 | 70.8% | 38.1% |
| CDD + BRENK (ours) | 451.7 ± 19.5 | 392.0 ± 16.7 | 51.2% | 0.0% |
By combining novelty enforcement with BRENK filters, our method produces 392 novel non-toxic molecules per evaluation run — versus 108 for MDLM and 5 for autoregressive generation — while reducing BRENK violations to exactly zero.
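The columns in the table can be read as simple counts and rates over a batch of generated SMILES strings. The sketch below shows one plausible way to compute them; it is not the benchmark's evaluation code. The function `evaluate_samples`, the exact-string novelty test (which assumes canonicalized SMILES), and the toy predicate `crude_aldehyde_flag` are all assumptions. In particular, the predicate is a crude string check standing in for a real substructure filter, which would normally come from a cheminformatics toolkit.

```python
from typing import Callable, Dict, Iterable, Set


def evaluate_samples(
    samples: Iterable[str],
    training_set: Set[str],
    violates_filter: Callable[[str], bool],
) -> Dict[str, float]:
    """Count novel and filter-passing samples and report violation rates."""
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one sample")
    n = len(samples)
    novel = [s for s in samples if s not in training_set]
    flagged = [s for s in samples if violates_filter(s)]
    novel_and_clean = [s for s in novel if not violates_filter(s)]
    return {
        "novel": len(novel),
        "novel_and_clean": len(novel_and_clean),
        "novelty_violation_rate": (n - len(novel)) / n,
        "filter_violation_rate": len(flagged) / n,
    }


def crude_aldehyde_flag(smiles: str) -> bool:
    """Toy placeholder, NOT a real BRENK filter: flags a trailing 'C=O'."""
    return smiles.endswith("C=O")


training = {"CCO", "CCN"}
generated = ["CCO", "CC=O", "CCCO", "c1ccccc1O"]
report = evaluate_samples(generated, training, crude_aldehyde_flag)
```

Here one sample repeats a training molecule and one is flagged by the toy filter, so two of the four samples count as both novel and passing.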
Search-Augmented Discrete Diffusion
Projection-based methods like CDD work well when constraint satisfaction can be efficiently evaluated and projected locally. But some constraints require evaluating global sequence properties — protein secondary structure, symbolic logic satisfiability, or pass/fail test suite execution — where local projections are insufficient.
SearchDiff (Ta et al., 2026) addresses this by replacing the projection step with a search procedure: at each denoising step, the model’s prediction defines a proposal set of high-probability candidate tokens, and a search algorithm (informed by the constraint evaluator) selects the candidate that best satisfies the target property while remaining within the model’s distribution.
A key structural advantage exploited by SearchDiff is that discrete diffusion makes the full candidate sequence accessible at every step, even early in generation. This allows arbitrary blackbox evaluators — protein structure predictors, formal logic solvers, test suite harnesses, financial risk models — to score and critique a draft output at intermediate steps, enabling much earlier constraint correction than is possible with autoregressive decoding.
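The idea of scoring a partially masked draft with a black-box evaluator can be sketched as follows. This is a greedy, one-position-at-a-time simplification and not the tree search used by SearchDiff. The function names, the fixed `proposals` table standing in for model predictions, and the toy digit-sum constraint are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

MASK = "[MASK]"  # placeholder mask symbol, assumed for illustration


def search_step(
    seq: List[str],
    proposals: Dict[int, Dict[str, float]],
    is_feasible: Callable[[List[str]], bool],
    top_k: int = 3,
) -> List[str]:
    """Fill the first masked position with the most probable candidate
    whose resulting draft the evaluator still accepts."""
    pos = seq.index(MASK)  # raises ValueError if nothing is masked
    ranked = sorted(proposals[pos].items(), key=lambda kv: kv[1], reverse=True)
    for token, _prob in ranked[:top_k]:
        candidate = seq[:pos] + [token] + seq[pos + 1:]
        if is_feasible(candidate):
            return candidate
    raise ValueError(f"no feasible candidate among top-{top_k} at position {pos}")


def digits_can_sum_to(target: int) -> Callable[[List[str]], bool]:
    """Black-box check: can the masked digits still bring the total to target?"""
    def check(seq: List[str]) -> bool:
        committed = sum(int(tok) for tok in seq if tok != MASK)
        n_masked = sum(1 for tok in seq if tok == MASK)
        return committed <= target <= committed + 9 * n_masked
    return check


# Hypothetical per-position model predictions (token -> probability).
proposals = {1: {"9": 0.6, "4": 0.3, "1": 0.1}, 2: {"5": 0.7, "0": 0.2, "2": 0.1}}
draft = ["3", MASK, MASK]
check = digits_can_sum_to(7)
while MASK in draft:
    draft = search_step(draft, proposals, check)
```

In this toy run the most probable token is rejected at both positions because it would make the target sum unreachable, so the draft is corrected before any infeasible token is committed.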
Across biological design (protein sequence) and symbolic reasoning benchmarks, SearchDiff substantially improves constraint satisfaction and property adherence, consistently outperforming both unconstrained discrete diffusion and autoregressive baselines.
Application 3: Constrained Code Generation
Code generation imposes some of the most demanding constraint combinations: programs must simultaneously be syntactically valid, functionally correct with respect to a specification, and free of security vulnerabilities. Autoregressive code models (e.g., Codex, DeepSeek-Coder) frequently produce outputs that pass one criterion while failing another, because left-to-right generation cannot “look ahead” and globally reason about program structure.
Constrained Diffusion for Code (CDC) (Shao et al., 2026) applies the discrete diffusion constraint-enforcement principle to this domain. At each denoising step, CDC combines mathematical optimization with static program analysis to identify constraint-relevant regions of the intermediate program state — the specific tokens that, if modified, would most improve constraint satisfaction — and locally adjusts the denoising trajectory toward feasible programs.
CDC provides:
- Functional correctness: Enforcing that programs pass unit tests by guiding generation toward syntactically and semantically consistent completions.
- Security constraints: Ruling out code patterns (e.g., unsafe buffer operations, SQL injection vulnerabilities) via rule-based program analysis applied at each step.
- Syntax validity: Grounding token predictions in parse-tree structure, eliminating syntactically invalid intermediate states before they propagate.
Across all three constraint categories, CDC consistently outperforms both discrete diffusion and autoregressive baselines — with less corrective computation (more localized edits) than post-hoc repair methods.
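As a small illustration of the kind of static checks that could locate constraint-relevant regions of a draft program, the sketch below uses Python's standard `ast` module to report the line of a syntax error and the lines of calls to a banned set of functions. It is not CDC's analysis. The function names and the choice of `eval` and `exec` as the banned set are assumptions for this example.

```python
import ast
from typing import FrozenSet, List, Optional


def syntax_violation_line(source: str) -> Optional[int]:
    """Return the line number of the first syntax error, or None if it parses."""
    try:
        ast.parse(source)
    except SyntaxError as err:
        return err.lineno
    return None


def banned_call_lines(
    source: str, banned: FrozenSet[str] = frozenset({"eval", "exec"})
) -> List[int]:
    """Return sorted line numbers of direct calls to any banned function name."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id in banned
    )


draft = "x = input()\ny = eval(x)\nprint(y)\n"
broken = "def f(:\n    return 1\n"

syntax_line = syntax_violation_line(broken)  # the error is on the first line
risky_lines = banned_call_lines(draft)       # the eval call is on the second line
```

Line-level reports like these are one way to restrict corrections to the tokens that matter, rather than regenerating the whole program.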
Improving the Backbone: Self-Conditioned Masked Diffusion
Constraint enforcement quality depends directly on the quality of the underlying discrete diffusion backbone — how accurately the model predicts clean-state logits from partially masked inputs. A key limitation of standard MDMs is that masked positions are inferred from scratch at each step, discarding useful intermediate predictions.
Simple Self-Conditioning Adaptation for Masked Diffusion Models (SCMDM) (Cardei et al., 2026) addresses this with a simple but effective post-training adaptation: each denoising step is conditioned on the model’s own previous clean-state predictions. This allows still-masked positions to accumulate refinement signal across steps, without introducing a recurrent latent pathway or requiring retraining from scratch.
SCMDM reduces generative perplexity on language modeling benchmarks by about 45% (42.89 → 23.72 on OWT), alongside strong improvements in molecular generation and genomic distribution modeling, with no additional denoiser evaluations at inference time. Better backbone predictions translate directly into higher-quality starting points for CDD’s constraint projection steps.
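The control flow of self-conditioning can be sketched schematically: the denoiser receives its own previous clean-state prediction as an extra input, and is still called exactly once per step. Everything below is an assumption for illustration, including the confidence-ordered unmasking rule and the toy denoiser that sharpens its previous prediction; it does not reproduce the SCMDM architecture or training procedure.

```python
from typing import Callable, Dict, List, Optional

MASK = "[MASK]"  # placeholder mask symbol, assumed for illustration
Dist = Dict[str, float]
Denoiser = Callable[[List[str], Optional[List[Dist]]], List[Dist]]


def sample_self_conditioned(denoise: Denoiser, length: int, num_steps: int) -> List[str]:
    """Unmask a sequence while feeding each step the previous prediction."""
    x = [MASK] * length
    prev: Optional[List[Dist]] = None  # nothing to condition on at the first step
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(x) if tok == MASK]
        if not masked:
            break
        pred = denoise(x, prev)  # one denoiser call per step
        # Unmask an equal share of the remaining positions, most confident first.
        n_unmask = -(-len(masked) // (num_steps - step))  # ceiling division
        masked.sort(key=lambda i: max(pred[i].values()), reverse=True)
        for i in masked[:n_unmask]:
            x[i] = max(pred[i], key=pred[i].get)
        prev = pred  # still-masked positions carry their prediction forward
    return x


calls: List[int] = []  # records one entry per denoiser call


def toy_denoise(x: List[str], prev: Optional[List[Dist]]) -> List[Dist]:
    """Toy stand-in for a model: multiplies a fixed prediction by the previous one."""
    calls.append(1)
    base = [{"a": 0.6, "b": 0.4}, {"a": 0.45, "b": 0.55}, {"a": 0.5, "b": 0.5}]
    if prev is None:
        return base
    out = []
    for b, p in zip(base, prev):
        mixed = {tok: b[tok] * p[tok] for tok in b}
        total = sum(mixed.values())
        out.append({tok: v / total for tok, v in mixed.items()})
    return out


result = sample_self_conditioned(toy_denoise, length=3, num_steps=3)
```

The number of denoiser calls equals the number of steps, which mirrors the claim that self-conditioning adds no denoiser evaluations at inference time.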
Key Results
Where constrained discrete diffusion helps
| Domain | Method | Result to remember |
|---|---|---|
| Toxicity mitigation | CDD (NeurIPS 2025) | 0% violations at all thresholds vs. 13–33% for GPT-2 and 17–32% for MDLM/UDLM. |
| Lexical constraints | CDD (NeurIPS 2025) | 0% violations on counting and lexical constraints vs. 54.5–97.5% for MDLM. |
| Molecular generation | CDD + BRENK (AI4D3 2025) | 392 novel non-toxic molecules vs. 108 for MDLM, with 0% BRENK violations. |
| Symbolic reasoning | SearchDiff (2026) | Substantially improves satisfaction and outperforms autoregressive and discrete diffusion baselines. |
| Code generation | CDC (2026) | Improves correctness, security, and syntax with less computation than post-hoc repair. |
| MDM backbone | SCMDM (2026) | About 45% perplexity reduction, from 42.89 to 23.72, with no additional denoiser evaluations. |
| Neuro-symbolic | NSD (NeuS 2025) | Certifiable constraint satisfaction across continuous and discrete domains; DARPA Disruptive Idea Award. |
This body of work establishes a general principle: discrete diffusion is the right architecture for constrained sequence generation. The global token state at each step is not just a modeling choice — it is the mechanism that makes training-free, non-local constraint enforcement tractable. Autoregressive models, by design, cannot offer this.
Relevant Citations
- Cardei, M., Christopher, J. K., Hartvigsen, T., Kailkhura, B., Fioretto, F. (2025). Constrained Discrete Diffusion. NeurIPS 2025. arXiv:2503.09790
- Cardei, M., Christopher, J. K., Hartvigsen, T., Kailkhura, B., Fioretto, F. (2025). Constrained Molecular Generation with Discrete Diffusion for Drug Discovery. AI4D3 Workshop, NeurIPS 2025. Best Paper Award.
- Ta, H. B., Cardei, M., Velasquez, A., Fioretto, F. (2026). Search-Augmented Masked Diffusion Models for Constrained Generation. arXiv:2602.02727
- Shao, L., Cardei, M., Xie, Z., Fioretto, F., Wang, W. (2026). Constrained Code Generation with Discrete Diffusion. arXiv:2605.16829
- Cardei, M., Ta, H. B., Fioretto, F. (2026). Simple Self-Conditioning Adaptation for Masked Diffusion Models. arXiv:2604.26985
- Christopher, J. K., Cardei, M., Liang, J., Fioretto, F. (2025). Neuro-Symbolic Generative Diffusion Models for Physically Grounded, Robust, and Safe Generation. NeuS 2025. DARPA Disruptive Idea Award. arXiv:2506.01121
- Sahoo, S., et al. (2024). Simple and Effective Masked Diffusion Language Models. NeurIPS 2024. arXiv:2406.07524