MagicMotion

Controllable Video Generation with Dense-to-Sparse Trajectory Guidance

Quanhao Li1*    Zhen Xing1*   
Rui Wang1    Hui Zhang1    Qi Dai2    Zuxuan Wu1✉   
1Fudan University    2Microsoft Research Asia
*Equal Contribution    ✉Corresponding Author
TL;DR: We present MagicMotion, a controllable image-to-video generation model that enables trajectory control through three levels of conditions from dense to sparse: masks, boxes, and sparse boxes.
Gallery

Generated videos guided by different trajectory conditions. (For all three trajectory conditions, the first frame shows the segmentation masks of the foreground objects.)

Mask Guided Results

Prompt: "A royal camel walking in a palace."

Prompt: "An astronaut walking on a planet."

Prompt: "A red crystal mammoth and a blue crystal rhino walking on ice."

Prompt: "A cat jumping over the bowl."

Box Guided Results

Prompt: "A cartoon bear walking in the forest."

Prompt: "A child riding a horse in the sky."

Prompt: "A duck walking on the grass."

Prompt: "A priestess lifting a ball to her head."

Prompt: "A moon moving in the sky."

Prompt: "A cartoon monster jumping with another one moving to the right."

Sparse Box Guided Results

Prompt: "A man with suit standing against a robot in a palace."

Prompt: "Man and woman kissing with hightech style."

Prompt: "A man slowly sinks his head into the water."

Prompt: "A full moon moves across the night sky with a castle and a bridge below."

Our Method

We present MagicMotion, a controllable image-to-video generation model that enables trajectory control through three levels of conditions from dense to sparse: masks, bounding boxes, and sparse boxes.

Overview: As shown in Figure 1, MagicMotion builds upon pre-trained image-to-video generation models and extends them with a Trajectory ControlNet. This design effectively encodes different kinds of trajectory information into the video generation model, enabling trajectory-controllable video generation.

Progressive Training Procedure: MagicMotion uses a dense-to-sparse training procedure that trains the model with progressively sparser trajectory conditions: masks, boxes, and sparse boxes (boxes provided for fewer than 10 frames). Experiments show that the model can leverage knowledge learned in the previous stage to achieve better performance than training from scratch.
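As a minimal sketch of how the three condition levels relate, the snippet below derives boxes and sparse boxes from per-frame object masks. The function names, the random frame-selection heuristic, and the data layout are illustrative assumptions, not MagicMotion's exact implementation.

```python
import numpy as np

def mask_to_box(mask: np.ndarray) -> np.ndarray:
    """Convert a binary object mask (H, W) into an [x0, y0, x1, y1] box."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return np.zeros(4, dtype=np.float32)
    return np.array([xs.min(), ys.min(), xs.max() + 1, ys.max() + 1], dtype=np.float32)

def build_trajectory(masks: np.ndarray, level: str, num_sparse: int = 10,
                     rng: np.random.Generator | None = None) -> dict:
    """Derive a trajectory condition from per-frame masks of shape (T, H, W).

    level = "mask"       -> dense masks for every frame
    level = "box"        -> one bounding box per frame
    level = "sparse_box" -> boxes on at most `num_sparse` frames (first frame always kept)
    """
    rng = rng or np.random.default_rng()
    boxes = np.stack([mask_to_box(m) for m in masks])  # (T, 4)
    if level == "mask":
        return {"masks": masks}
    if level == "box":
        return {"boxes": boxes, "frame_ids": np.arange(len(masks))}
    # sparse_box: keep the first frame plus a small random subset of later frames
    extra = rng.choice(np.arange(1, len(masks)),
                       size=min(num_sparse - 1, len(masks) - 1), replace=False)
    keep = np.sort(np.concatenate([[0], extra]))
    return {"boxes": boxes[keep], "frame_ids": keep}
```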

Latent Segment Loss: As shown in Figure 1, MagicMotion employs a novel latent segment loss that helps the video generation model better capture the fine-grained shape of objects with minimal computational overhead.
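The sketch below illustrates one way such a loss could be computed: a small trainable head maps DiT features to per-token mask logits, supervised against segmentation masks encoded at latent resolution. The head architecture and the binary cross-entropy formulation are assumptions for illustration; the paper only states that a trainable segment head predicts latent segmentation masks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentHead(nn.Module):
    """Lightweight head mapping concatenated DiT features to per-token mask logits.
    The 2-layer MLP design is an assumption, not the paper's exact architecture."""
    def __init__(self, feat_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),  # one logit per latent token
        )

    def forward(self, dit_features: torch.Tensor) -> torch.Tensor:
        # dit_features: (B, N_tokens, feat_dim), gathered from DiT blocks
        return self.proj(dit_features).squeeze(-1)  # (B, N_tokens)

def latent_segment_loss(seg_logits: torch.Tensor, latent_masks: torch.Tensor) -> torch.Tensor:
    """Supervise predicted latent masks with ground-truth masks at latent resolution.
    Binary cross-entropy is used here for illustration; the paper's exact loss may differ."""
    target = latent_masks.flatten(1).float()  # (B, N_tokens), values in [0, 1]
    return F.binary_cross_entropy_with_logits(seg_logits, target)
```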

Figure 1: Overview of the MagicMotion architecture (text prompt and text encoder are omitted for simplicity). MagicMotion employs a pretrained 3D VAE to encode the input trajectory, first-frame image, and training video into latent space. It has two separate branches: the video branch processes video and image tokens, while the trajectory branch uses Trajectory ControlNet to fuse trajectory and image tokens, which are then injected into the video branch through a zero-initialized convolution layer. In addition, diffusion features from the DiT blocks are concatenated and processed by a trainable segment head to predict latent segmentation masks, which contribute to our latent segment loss.
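The zero-initialized convolution described above can be sketched as follows: because its weights start at zero, the trajectory branch contributes nothing at the start of training and the pretrained video branch is preserved, a common ControlNet-style design. Tensor shapes and the simple additive fusion shown here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ZeroConv3d(nn.Module):
    """1x1x1 3D convolution initialized to zero, so the trajectory branch
    initially adds nothing to the pretrained video branch."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x)

def fuse_trajectory(video_feat: torch.Tensor,
                    traj_feat: torch.Tensor,
                    zero_conv: ZeroConv3d) -> torch.Tensor:
    """Add zero-conv-projected trajectory features to the video-branch features.
    Both tensors are assumed to share shape (B, C, T, H, W) in latent space."""
    return video_feat + zero_conv(traj_feat)
```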
Data Pipeline

We present a comprehensive and general data pipeline for generating high-quality video data with both dense (mask) and sparse (bounding box) annotations.

The data pipeline consists of two main stages: a Curation Pipeline and a Filtering Pipeline. The Curation Pipeline constructs trajectory information from a video-text dataset, while the Filtering Pipeline removes unsuitable videos before training.
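The exact filtering criteria are not detailed here; as a hypothetical illustration, the sketch below shows the kind of checks such a filter might apply, discarding videos whose annotated objects are very small or nearly static. Both checks and their thresholds are assumptions, not the paper's actual rules.

```python
import numpy as np

def keep_video(boxes: np.ndarray, frame_area: float,
               min_area_ratio: float = 0.01, min_motion_ratio: float = 0.05) -> bool:
    """Decide whether a video with per-frame boxes (T, 4) is kept for training.
    Illustrative checks (assumed, not from the paper):
      1. the object should not be vanishingly small on average;
      2. the object should actually move across the video."""
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    mean_area_ratio = float(np.mean(widths * heights) / frame_area)

    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)  # (T, 2)
    travel = float(np.linalg.norm(centers[-1] - centers[0]))
    scale = float(np.sqrt(frame_area))  # rough length scale for normalizing motion

    return mean_area_ratio >= min_area_ratio and travel / scale >= min_motion_ratio
```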

Figure 2: Overview of the dataset pipeline. The Curation Pipeline constructs trajectory annotations, while the Filtering Pipeline filters out videos unsuitable for training.
More Results

Same Input Image with Different Trajectories

Comparisons with Other Approaches