Preprint 2026 · Multi-Task RL for Diffusion

DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models

Quanhao Li¹*, Junqiu Yu¹*, Kaixun Jiang¹, Yujie Wei¹, Zhen Xing²‡, Pandeng Li², Ruihang Chu², Shiwei Zhang²‡, Yu Liu², Zuxuan Wu¹†

¹Fudan University  ²Wan Team, Alibaba Group

*Equal contribution  †Corresponding author  ‡Project leader

DiffusionOPD teaser: faster convergence and higher ceiling across multiple tasks.
Figure 1. (a) DiffusionOPD exhibits significantly faster convergence and a higher performance ceiling than all multi-task reinforcement learning baselines. (b) DiffusionOPD consistently outperforms all baselines across multiple domains, including GenEval, OCR, and aesthetics.

Abstract

Reinforcement learning has emerged as a powerful tool for improving diffusion-based text-to-image models, but existing methods are largely limited to single-task optimization. Extending RL to multiple tasks is challenging: joint optimization suffers from cross-task interference and imbalance, while cascade RL is cumbersome and prone to catastrophic forgetting. We propose DiffusionOPD, a new multi-task training paradigm for diffusion models based on On-Policy Distillation (OPD). DiffusionOPD first trains task-specific teachers independently, then distills their capabilities into a unified student along the student's own rollout trajectories. This decouples single-task exploration from multi-task integration and avoids the optimization burden of solving all tasks jointly from scratch.

Theoretically, we lift the OPD framework from discrete tokens to continuous-state Markov processes, deriving a closed-form per-step KL objective that unifies both stochastic SDE and deterministic ODE refinement via mean-matching. We formally and empirically demonstrate that this analytic gradient has lower variance and broader applicability than conventional PPO-style policy gradients. Extensive experiments show that DiffusionOPD consistently surpasses both multi-reward RL and cascade RL baselines in training efficiency and final performance, while achieving state-of-the-art results on all evaluated benchmarks.

Key Contributions

1. On-Policy Distillation for Multi-Task Diffusion
   A new paradigm where domain-specific teachers supervise a unified student along the student's own on-policy rollout trajectories, decoupling exploration from capability integration.

2. Closed-Form KL across SDE & ODE
   We derive a unified closed-form per-step reverse-KL objective that covers both stochastic SDE and deterministic ODE samplers, enabling lower-variance optimization than PPO-style gradients.

3. State-of-the-art Multi-Task Results
   Consistent gains over prior baselines in both training efficiency and final performance, achieving SOTA on aesthetics, OCR, and GenEval simultaneously.

Method

Lifting OPD from LLMs to Continuous-State Markov Chains

On-Policy Distillation (OPD), originally proposed for autoregressive language models, lets the student generate a full trajectory from its own policy and trains it to match the teacher on the prefixes it visits. We generalize this idea to any discrete-time Markov chain in which the student and teacher share the same state space and transition-kernel structure:

$$\mathcal{L}_{\text{OPD}}(\theta) \;=\; \mathbb{E}_{x_{0:N}\sim p_{S}}\!\left[\sum_{j=0}^{N-1}\;\mathrm{KL}\!\left(p_S(\cdot\mid x_{t_j})\,\big\|\,p_T(\cdot\mid x_{t_j})\right)\right].$$

Per-Step Gaussian Transitions and a Closed-Form KL

For a flow-matching model with reverse-time SDE discretization (Euler–Maruyama), each denoising step induces a Gaussian one-step kernel $\;p_{S}(x_{t_{j+1}}\!\mid\!x_{t_j})=\mathcal{N}(\mu_S(x_{t_j}),\bar\sigma_j^2 \mathbf{I})\;$ and similarly for the teacher. Crucially, the per-step covariance $\bar\sigma_j^2\mathbf{I}$ depends only on the scheduler, so it is identical for student and teacher. The reverse KL therefore admits a closed form:

$$\mathrm{KL}\!\left(p_S\,\|\,p_T\right) \;=\; \frac{\lVert\mu_S(x_{t_j};\theta)-\mu_T(x_{t_j})\rVert_2^{\,2}}{2\,\bar\sigma_j^{\,2}}.$$
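This is the standard KL divergence between multivariate Gaussians, specialized to the case of a shared covariance $\bar\sigma_j^2\mathbf{I}$; a one-line check, with $d$ the latent dimensionality:

$$\mathrm{KL}\!\left(\mathcal{N}(\mu_S,\bar\sigma_j^2\mathbf{I})\,\big\|\,\mathcal{N}(\mu_T,\bar\sigma_j^2\mathbf{I})\right)=\frac{1}{2}\left[\underbrace{\operatorname{tr}(\mathbf{I})-d}_{=\,0}+\underbrace{\log\frac{\det(\bar\sigma_j^2\mathbf{I})}{\det(\bar\sigma_j^2\mathbf{I})}}_{=\,0}+\frac{\lVert\mu_S-\mu_T\rVert_2^{\,2}}{\bar\sigma_j^{\,2}}\right]=\frac{\lVert\mu_S-\mu_T\rVert_2^{\,2}}{2\,\bar\sigma_j^{\,2}}.$$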

Plugging this into the OPD objective gives the diffusion-domain DiffusionOPD loss — an analytic, per-step mean-matching objective optimized by direct backpropagation:

$$\mathcal{L}^{\text{diffusion}}_{\text{OPD}}(\theta) \;=\; \mathbb{E}_{x_{0:N}\sim p_{S,\theta}}\!\left[\sum_{j=0}^{N-1}\frac{\lVert\mu_S(x_{t_j};\theta)-\mu_T(x_{t_j})\rVert_2^{\,2}}{2\,\bar\sigma_j^{\,2}}\right].$$
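As a concrete illustration, below is a minimal PyTorch-style sketch of this objective along a single student rollout. The `student.step_mean`, `teacher.step_mean`, `scheduler.sample_prior`, and `scheduler.step_std` interfaces are hypothetical placeholders for whatever exposes the per-step means $\mu_S,\mu_T$ and the scheduler noise scale $\bar\sigma_j$; this is a sketch of the training principle, not the released implementation. Setting `stochastic=False` drops the injected noise and the $1/(2\bar\sigma_j^2)$ weight, recovering the ODE variant derived next.

```python
import torch

def diffusion_opd_loss(student, teacher, scheduler, prompt, stochastic=True):
    """Per-step closed-form KL along one on-policy student rollout (illustrative sketch).

    Assumed (hypothetical) interfaces:
      student.step_mean(x, j, prompt) -> mu_S(x_{t_j}; theta)   (tracks gradients)
      teacher.step_mean(x, j, prompt) -> mu_T(x_{t_j})           (frozen teacher)
      scheduler.sample_prior(prompt)  -> x_{t_0}
      scheduler.step_std(j)           -> \bar\sigma_j  (scheduler-only, shared by S and T)
    """
    x = scheduler.sample_prior(prompt)
    loss = 0.0
    for j in range(scheduler.num_steps):
        mu_s = student.step_mean(x, j, prompt)
        with torch.no_grad():
            mu_t = teacher.step_mean(x, j, prompt)
        sigma = scheduler.step_std(j)

        # Closed-form reverse KL for the SDE kernel; plain 1/2 * L2 for the ODE case.
        weight = 1.0 / (2.0 * sigma ** 2) if stochastic else 0.5
        diff = (mu_s - mu_t).flatten(1)                 # sum over latent dims, mean over batch
        loss = loss + weight * diff.pow(2).sum(-1).mean()

        # Advance the rollout with the student's *own* transition (on-policy states);
        # the next state carries no gradient, so supervision stays per step.
        with torch.no_grad():
            noise = torch.randn_like(mu_s) if stochastic else torch.zeros_like(mu_s)
            x = mu_s + sigma * noise
    return loss
```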

A Unified View: SDE and ODE in One Objective

In the deterministic ODE regime, the student and teacher each induce a unique next-state target $\mu_S(x_{t_j};\theta)$ and $\mu_T(x_{t_j})$, so distribution matching collapses to pointwise transition matching — a clean squared-$L_2$ loss:

$$\mathcal{L}^{\text{diffusion-ODE}}_{\text{OPD}}(\theta) \;=\; \mathbb{E}_{x_{0:N}\sim p_{S,\theta}}\!\left[\sum_{j=0}^{N-1}\frac{1}{2}\lVert\mu_S(x_{t_j};\theta)-\mu_T(x_{t_j})\rVert_2^{\,2}\right].$$

The closed-form KL and the deterministic $L_2$ loss are two faces of the same training principle, providing a unified view of on-policy distillation across SDE and ODE samplers.
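In the sketch above, this ODE case corresponds to calling the hypothetical `diffusion_opd_loss` with `stochastic=False`: the $1/(2\bar\sigma_j^2)$ weight becomes the fixed $\tfrac12$ and the rollout advances without injected noise.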

Closed-Form KL vs. PPO-Style Policy Gradient

A natural alternative is to treat the teacher as a process reward model and optimize a PPO-style surrogate with the per-step KL as a dense reward. Under gradient accumulation the policy is not updated within a round, so the importance ratio $\rho_j(\theta)=1$ and the PPO gradient decomposes as

$$\nabla_\theta\!\left(\rho_j(\theta)\,\Delta_j(\theta)\right) \;=\; \underbrace{\nabla_\theta\Delta_j(\theta)}_{\text{pathwise term}} \;+\; \underbrace{\Delta_j(\theta)\,\nabla_\theta\log\pi_\theta(a_j\mid x_{t_j})}_{\text{score-function term}}.$$

The score-function term is unbiased in expectation, but for a Gaussian transition $a_j=\mu_S(x_{t_j};\theta)+\bar\sigma_j\,\epsilon_j$ it equals $\frac{\epsilon_j}{\bar\sigma_j}\cdot\nabla_\theta\mu_S$, injecting noise proportional to $\epsilon_j$. In contrast, the closed-form KL is a deterministic function of $\mu_S$, so its pathwise gradient has strictly lower variance and remains valid under both SDE and ODE samplers.
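For completeness, the $\epsilon_j/\bar\sigma_j$ scaling follows in one line from the Gaussian log-density of the transition, treating the sampled action $a_j$ as fixed:

$$\log\pi_\theta(a_j\mid x_{t_j})=-\frac{\lVert a_j-\mu_S(x_{t_j};\theta)\rVert_2^{\,2}}{2\,\bar\sigma_j^{\,2}}+\mathrm{const}\quad\Longrightarrow\quad\nabla_\theta\log\pi_\theta(a_j\mid x_{t_j})=\frac{\big(a_j-\mu_S\big)^{\!\top}}{\bar\sigma_j^{\,2}}\,\nabla_\theta\mu_S=\frac{\epsilon_j^{\top}}{\bar\sigma_j}\,\nabla_\theta\mu_S,$$

so the score-function term carries the realized sampling noise $\epsilon_j$ (amplified by $1/\bar\sigma_j$), whereas the pathwise term $\nabla_\theta\Delta_j(\theta)$ does not.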

Take-away. Direct closed-form KL minimization and PPO-style policy gradients have the same expected gradient, but the closed-form version has lower variance and applies uniformly to SDE and ODE samplers within a single training principle.

Two-Stage Training Recipe

Stage 1

Per-Task Teacher Training

Decompose the multi-task problem into $M$ individual tasks and train a specialized teacher for each reward using off-the-shelf diffusion RL (e.g., DiffusionNFT for GenEval, GRPO-Guard for OCR and Aesthetics). Each teacher can fully exploit its own reward without inter-task interference.

Stage 2

Multi-Task On-Policy Distillation

Initialize the student from the pretrained policy. In a round-robin manner, sample prompts for each task, roll out the current student to obtain an on-policy trajectory, and supervise it with the corresponding teacher via the closed-form SDE objective or its ODE counterpart above. Losses are accumulated across all tasks and a single optimizer step is taken per round, yielding stable multi-task updates.
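A minimal sketch of this Stage-2 loop, reusing the hypothetical `diffusion_opd_loss` from the Method section; the per-task teachers, prompt samplers, learning rate, and batch sizes below are illustrative placeholders rather than the paper's training configuration:

```python
import torch

def train_stage2(student, teachers, prompt_samplers, scheduler,
                 num_rounds=1000, prompts_per_task=4, lr=1e-5):
    """Stage 2: round-robin multi-task on-policy distillation (illustrative sketch).

    `teachers` and `prompt_samplers` are dicts keyed by task,
    e.g. {"geneval": ..., "ocr": ..., "aes": ...}.
    """
    optimizer = torch.optim.AdamW(student.parameters(), lr=lr)
    num_tasks = len(teachers)
    for _ in range(num_rounds):
        optimizer.zero_grad()
        for task, teacher in teachers.items():            # visit every task each round
            for prompt in prompt_samplers[task](prompts_per_task):
                # Roll out the *current* student and supervise with this task's teacher.
                loss = diffusion_opd_loss(student, teacher, scheduler, prompt)
                (loss / (num_tasks * prompts_per_task)).backward()   # accumulate across tasks
        optimizer.step()                                   # single optimizer step per round
    return student
```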

Quantitative Results

DiffusionOPD attains the best Average score (0.929), surpassing Multi-Task RL and Cascade RL baselines with a competitive wall-clock budget. Gray columns mark in-domain rewards used during teacher training.

| Model | Wall-clock (hours) | GenEval | OCR | PickScore | ClipScore | HPSv2.1 | Aesthetic | ImgRwd | UniRwd | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| SD-XL | – | 0.55 | 0.14 | 22.42 | 0.287 | 0.280 | 5.60 | 0.76 | 2.93 | 0.390 |
| SD3.5-L | – | 0.71 | 0.68 | 22.91 | 0.289 | 0.288 | 5.50 | 0.96 | 3.25 | 0.601 |
| FLUX.1-Dev | – | 0.66 | 0.59 | 22.84 | 0.295 | 0.274 | 5.71 | 0.96 | 3.27 | 0.599 |
| SD3.5-M (w/o CFG) | – | 0.24 | 0.12 | 20.51 | 0.237 | 0.204 | 5.13 | -0.58 | 2.02 | 0.000 |
| SD3.5-M + CFG | – | 0.63 | 0.59 | 22.34 | 0.285 | 0.279 | 5.36 | 0.85 | 3.03 | 0.484 |
| GenEval Teacher | 46.92 | 0.96 | 0.40 | 22.04 | 0.274 | 0.248 | 5.24 | 0.59 | 2.97 | 0.473 |
| OCR Teacher | 33.17 | 0.65 | 0.93 | 22.27 | 0.290 | 0.272 | 5.26 | 0.90 | 3.09 | 0.550 |
| Aes Teacher | 85.75 | 0.49 | 0.59 | 24.02 | 0.295 | 0.346 | 6.22 | 1.498 | 3.48 | 0.698 |
| Multi-Task GRPO-Guard | 129.86 | 0.89 | 0.94 | 23.12 | 0.296 | 0.307 | 5.61 | 1.31 | 3.33 | 0.763 |
| Multi-Task NFT | 128.42 | 0.95 | 0.96 | 22.59 | 0.288 | 0.282 | 5.41 | 1.08 | 3.23 | 0.715 |
| Cascade NFT | 148.49* | 0.94 | 0.91 | 23.80 | 0.293 | 0.331 | 6.01 | 1.49 | 3.49 | 0.851 |
| DiffusionOPD (Ours) | 85.75 + 11.26 | 0.96 | 0.94 | 23.99 | 0.297 | 0.342 | 6.15 | 1.50 | 3.50 | 0.929 |

GenEval and OCR are rule-based rewards; PickScore, ClipScore, HPSv2.1, Aesthetic, ImgRwd, and UniRwd are model-based.

Gray-shaded metrics are in-domain rewards; bold marks the best result and underline the second best. Evaluated at 1024×1024. *Approximate training time. For DiffusionOPD, wall-clock = max teacher time + OPD training time.

Convergence Curves vs. Multi-Task RL

Convergence curves comparing DiffusionOPD with multi-task RL baselines on GenEval, OCR and PickScore.
Figure 2. DiffusionOPD reaches higher reward with substantially fewer GPU hours than multi-task RL baselines on all three benchmarks, confirming both faster convergence and a higher performance ceiling.

Qualitative Comparisons

vs. Multi-Task RL Baselines

Qualitative comparison versus multi-task RL baselines and single-task teachers.
Figure 3. Each case is presented in two rows. Top row, left → right: DiffusionOPD (Ours), Multi-Task GRPO-Guard, Multi-Task NFT, Cascade NFT. Bottom row: input prompt, our Aes Teacher, GenEval Teacher, OCR Teacher. DiffusionOPD produces images that are simultaneously compositionally accurate, text-faithful, and aesthetic.

vs. Other Distillation Objectives

Qualitative comparison versus DMD, TDM and SFT distillation objectives.
Figure 4. Left → right: DiffusionOPD (Ours), DMD, TDM, SFT. Under the same set of specialized teachers, DiffusionOPD yields the cleanest text rendering, the most accurate object compositionality, and the most coherent aesthetics.

Ablation Studies

Convergence curves comparing distillation objectives.
(a) Distillation objectives. Under the same set of specialized teachers, DiffusionOPD reaches the fastest convergence and the highest ceiling among DMD, TDM, and SFT.
Loss formulation and sampler noise level ablation.
(b) Loss & sampler noise. PPO-style gradients under-perform their closed-form KL counterpart; lower noise levels (with ODE being noise-0) yield faster convergence and higher performance ceilings.

BibTeX

If you find DiffusionOPD useful, please consider citing:

@article{li2026diffusionopd,
  title         = {DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models},
  author        = {Li, Quanhao and Yu, Junqiu and Jiang, Kaixun and Wei, Yujie and
                   Xing, Zhen and Li, Pandeng and Chu, Ruihang and Zhang, Shiwei and
                   Liu, Yu and Wu, Zuxuan},
  journal       = {arXiv preprint arXiv:2605.15055},
  year          = {2026},
  eprint        = {2605.15055},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG}
}