FlashMotion

Few-Step Controllable Video Generation with Trajectory Guidance

CVPR 2026

Quanhao Li1    Zhen Xing1    Rui Wang1    Haidong Cao1    Qi Dai2    Daoguo Dong1    Zuxuan Wu1✉   
1Fudan University    2Microsoft Research Asia
✉ Corresponding Author
TL;DR: We present FlashMotion, a few-step controllable image-to-video generation model that achieves efficient video synthesis and enables precise trajectory control.
Gallery

Few-step generated videos guided by box-trajectory maps. (The first frame of each of the three trajectory conditions is a segmentation mask of the foreground objects.)

Demos

Prompt: "A cat chasing a butterfly in a beautiful garden."

Prompt: "A koala wearing a leaf backpack is playing with a parachute."

Prompt: "A cartoon rocket launches upward from a desk."

Prompt: "In Chinese ink-painting style, a small bird jumps from a tree onto a cat's head, and the cat slowly lies down on the ground."

Prompt: "A paper-style bus drives along a road, with a paper sun and clouds nearby."

Prompt: "A race car speeds along the track, with the grandstand packed with spectators."

Prompt: "SpongeBob and Patrick are jumping in an underwater world."

Prompt: "A cartoon-style hamster wearing a pistachio-shell helmet drives a bread-made tractor in the kitchen, cleaning up candy from the floor."

Prompt: "A little mouse wearing a pink sweater and hat glides on the ice, with snowflakes drifting around."

Prompt: "The camera gradually moves into a museum interior, where purple crystals, mammoth remains, and dinosaur specimens are on display."

Prompt: "A cartoon rabbit drives a bread-made car across the living room floor."

Prompt: "The Flash holds up a sign reading 'FlashMotion' and swings it around once."

Our Method


Training Stages

As shown in Figure 1, FlashMotion adopts a three-stage training pipeline for few-step, trajectory-controllable image-to-video generation. First, we train a SlowAdapter on the SlowGenerator using diffusion loss. Next, we distill a FastGenerator from the SlowGenerator under a distribution-matching objective. Finally, we fine-tune the SlowAdapter to align with the FastGenerator through a hybrid strategy that combines adversarial and diffusion losses.
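The three stages above can be summarized as a small configuration table: each stage trains one module against a frozen one under a specific loss. This is an illustrative sketch only; the module and loss names mirror the text, but the structure is our own and not from a released codebase.

```python
# Hypothetical sketch of FlashMotion's three-stage schedule, following the
# description in the text: which module is trained, which is frozen, and
# which objective drives each stage.
STAGES = {
    1: {"train": "SlowAdapter",   "frozen": "SlowGenerator",
        "losses": ["diffusion"]},
    2: {"train": "FastGenerator", "frozen": "SlowGenerator",
        "losses": ["distribution_matching"]},
    3: {"train": "SlowAdapter",   "frozen": "FastGenerator",
        "losses": ["adversarial", "diffusion"]},
}

def stage_config(stage: int) -> dict:
    """Return the module/loss configuration for a given training stage."""
    return STAGES[stage]
```

For example, `stage_config(3)` shows that the final stage fine-tunes the SlowAdapter against a frozen FastGenerator under the hybrid adversarial-plus-diffusion objective.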

Figure 1: Overview of the FlashMotion training pipeline. FlashMotion is trained in three stages: (1) a SlowAdapter is first optimized on the SlowGenerator with diffusion loss; (2) a FastGenerator is distilled from the SlowGenerator under a distribution-matching objective; and (3) the SlowAdapter is fine-tuned to align with the FastGenerator using a hybrid training strategy that combines adversarial and diffusion losses.
Model Architecture

As shown in Figure 2, FlashMotion introduces a diffusion discriminator to guide the optimization of the trajectory adapter, bridging the gap between the generated and real video distributions. Specifically, we fine-tune the SlowAdapter using a hybrid training strategy that jointly optimizes diffusion and adversarial objectives. The diffusion discriminator is trained to distinguish noisy real video latents from generated ones, thereby aligning their underlying data distributions. Meanwhile, the diffusion loss provides pixel-level supervision, encouraging the model to produce trajectory-aligned videos.
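The hybrid objective can be sketched as a weighted sum of the two terms described above. This is a minimal numpy illustration, not the paper's implementation: the non-saturating adversarial form and the weight `lam_adv` are our assumptions.

```python
import numpy as np

def diffusion_loss(pred_noise, true_noise):
    # Pixel-level MSE between predicted and ground-truth noise
    # (the "trajectory-aligned" supervision term).
    return float(np.mean((pred_noise - true_noise) ** 2))

def adversarial_loss(disc_logits_fake):
    # Non-saturating generator loss: -log sigmoid(D(fake)),
    # computed stably as log(1 + exp(-logit)).
    return float(np.mean(np.log1p(np.exp(-disc_logits_fake))))

def hybrid_loss(pred_noise, true_noise, disc_logits_fake, lam_adv=0.1):
    # Hybrid strategy: diffusion term + weighted adversarial term.
    # lam_adv is a hypothetical balancing weight, not a reported value.
    return diffusion_loss(pred_noise, true_noise) \
        + lam_adv * adversarial_loss(disc_logits_fake)
```

When the predicted noise matches the target and the discriminator is confidently fooled (large positive logits), the hybrid loss approaches zero, which is the intended fixed point of this kind of training.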

Figure 2: (a) Architecture of FlashMotion. The trajectory adapter is fine-tuned on the FastGenerator with a hybrid strategy that combines diffusion and adversarial objectives. (b) Detailed illustration of our diffusion discriminator architecture. The discriminator adopts a DiT backbone cloned from the SlowGenerator, while intermediate features from its DiT blocks are fed into an attention-based classifier to distinguish real videos from generated ones.
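The attention-based classifier head in Figure 2(b) can be sketched as a learned-query attention pool over the intermediate DiT-block features, followed by a linear real/fake logit. The shapes and parameter names below are illustrative assumptions, not the paper's actual head.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool_classifier(features, w_query, w_out):
    """Sketch of an attention-based real/fake head.

    features : (num_tokens, dim) intermediate features from DiT blocks.
    w_query  : (dim,) hypothetical learned query vector.
    w_out    : (dim,) hypothetical linear classifier weights.
    Returns a scalar logit (positive -> "real").
    """
    scores = softmax(features @ w_query)  # (num_tokens,) attention weights
    pooled = scores @ features            # (dim,) attention-pooled feature
    return float(pooled @ w_out)          # scalar real/fake logit
```

Pooling with a learned query lets the head weigh informative tokens (e.g. the moving foreground) more heavily than a plain mean over tokens would.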
More Results

Ablation Studies

Comparisons