Few-step generated videos guided by box trajectory maps. (For all three trajectory conditions, the first frame shows the segmentation masks of the foreground objects.)
We present FlashMotion, a few-step controllable image-to-video generation model that couples efficient video synthesis with precise trajectory control.
As shown in Figure 1, FlashMotion adopts a three-stage training pipeline for few-step, trajectory-controllable image-to-video generation. First, we train a SlowAdapter on the SlowGenerator using diffusion loss. Next, we distill a FastGenerator from the SlowGenerator under a distribution-matching objective. Finally, we fine-tune the SlowAdapter to align with the FastGenerator through a hybrid strategy that combines adversarial and diffusion losses.
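Below is a minimal sketch of the three-stage pipeline, written to make the data flow concrete. All module and loss names (SlowGenerator, SlowAdapter, FastGenerator, the MSE stand-ins for the diffusion and distribution-matching objectives) are illustrative placeholders under assumed shapes, not the paper's actual implementation.

\begin{verbatim}
import torch
import torch.nn as nn

class SlowGenerator(nn.Module):
    """Stand-in for the multi-step I2V backbone (frozen in stages 1 and 2)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(16, 16)
    def forward(self, z):
        return self.net(z)

class SlowAdapter(nn.Module):
    """Stand-in for the trajectory-conditioning adapter."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(16, 16)
    def forward(self, z, traj):
        return self.net(z + traj)

class FastGenerator(nn.Module):
    """Stand-in for the few-step student distilled from the SlowGenerator."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(16, 16)
    def forward(self, z):
        return self.net(z)

slow_gen, adapter, fast_gen = SlowGenerator(), SlowAdapter(), FastGenerator()
z, traj = torch.randn(2, 16), torch.randn(2, 16)  # toy video latent and trajectory encoding

# Stage 1: train the adapter on the SlowGenerator with a diffusion (denoising) loss.
stage1_loss = (slow_gen(adapter(z, traj)) - z).pow(2).mean()

# Stage 2: distill the FastGenerator from the SlowGenerator under a
# distribution-matching objective (an MSE stand-in here).
stage2_loss = (fast_gen(z) - slow_gen(z).detach()).pow(2).mean()

# Stage 3: fine-tune the adapter against the FastGenerator with a hybrid of
# diffusion and adversarial losses (the adversarial term is sketched below Figure 2).
stage3_diffusion_loss = (fast_gen(adapter(z, traj)) - z).pow(2).mean()
\end{verbatim}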
As shown in Figure 2, FlashMotion introduces a diffusion discriminator to guide the optimization of the trajectory adapter, bridging the gap between generated and real video distributions. Specifically, we fine-tune the SlowAdapter using a hybrid training strategy that jointly optimizes diffusion and adversarial objectives. The diffusion discriminator is trained to distinguish noisy real video latents from generated ones, thereby aligning their underlying data distributions. Meanwhile, the diffusion loss provides pixel-level supervision, encouraging the model to produce trajectory-aligned videos.
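The following sketch illustrates one way such a hybrid (diffusion + adversarial) fine-tuning step could be organized, under assumed shapes and names: the discriminator architecture, the fixed noise scale, and the weight \texttt{lambda\_adv} are illustrative assumptions rather than the paper's actual configuration.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

discriminator = nn.Sequential(nn.Linear(16, 64), nn.SiLU(), nn.Linear(64, 1))
adapter = nn.Linear(32, 16)          # stand-in trajectory adapter
fast_gen = nn.Linear(16, 16)         # stand-in few-step generator (kept fixed)
for p in fast_gen.parameters():
    p.requires_grad_(False)
opt_a = torch.optim.Adam(adapter.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

real_latent = torch.randn(4, 16)     # latent of a real video clip
traj_map = torch.randn(4, 16)        # encoded box-trajectory condition
lambda_adv = 0.1                     # weight of the adversarial term (assumed)

for _ in range(2):                   # toy training iterations
    # Generate a video latent conditioned on the trajectory.
    fake_latent = fast_gen(adapter(torch.cat([real_latent, traj_map], dim=-1)))

    # Noise both real and generated latents so the discriminator compares
    # them at a matched diffusion noise level.
    noise = torch.randn_like(real_latent)
    noisy_real = real_latent + 0.5 * noise
    noisy_fake = fake_latent + 0.5 * noise

    # Discriminator: distinguish noisy real latents from noisy generated ones.
    d_loss = F.softplus(-discriminator(noisy_real)).mean() \
           + F.softplus(discriminator(noisy_fake.detach())).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Adapter: diffusion (reconstruction) term plus adversarial term.
    fake_latent = fast_gen(adapter(torch.cat([real_latent, traj_map], dim=-1)))
    diff_loss = F.mse_loss(fake_latent, real_latent)
    adv_loss = F.softplus(
        -discriminator(fake_latent + 0.5 * torch.randn_like(fake_latent))).mean()
    loss = diff_loss + lambda_adv * adv_loss
    opt_a.zero_grad(); loss.backward(); opt_a.step()
\end{verbatim}

In this sketch only the adapter and discriminator receive gradient updates; the few-step generator stays fixed, so the adversarial term pushes the adapter's trajectory-conditioned outputs toward the real-video latent distribution while the diffusion term keeps them aligned with the conditioning signal.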