MagicMotion

Controllable Video Generation with Dense-to-Sparse Trajectory Guidance

Quanhao Li1*    Zhen Xing1*   
Rui Wang1    Hui Zhang1    Qi Dai2    Zuxuan Wu1✉   
1Fudan University    2Microsoft Research Asia
*Equal Contribution    ✉Corresponding Author
TL;DR: We present MagicMotion, a controllable image-to-video generation model that enables trajectory control through three levels of conditions from dense to sparse: masks, boxes, and sparse boxes.
Gallery

Generated videos guided by different trajectory conditions. (For all three trajectory conditions, the first frame shows the segmentation masks of the foreground objects.)

Mask Guided Results

Prompt: "A royal camel walking in a palace."

Prompt: "An astronaut walking on a planet."

Prompt: "A red crystal mammoth and a blue crystal rhino walking on ice."

Prompt: "A cat jumping over the bowl."

Box Guided Results

Prompt: "A cartoon bear walking in the forest."

Prompt: "A child riding a horse in the sky."

Prompt: "A duck walking on the grass."

Prompt: "A priestess lifting a ball to her head."

Prompt: "A moon moving in the sky."

Prompt: "A cartoon monster jumping with another one moving to the right."

Sparse Box Guided Results

Prompt: "A man with suit standing against a robot in a palace."

Prompt: "Man and woman kissing with hightech style."

Prompt: "A man slowly sinks his head into the water."

Prompt: "A full moon moves across the night sky with a castle and a bridge below."

Our Method

We present MagicMotion, a controllable image-to-video generation model that enables trajectory control through three levels of conditions from dense to sparse: masks, bounding boxes, and sparse boxes.

Overview: As shown in Figure 1, MagicMotion builds upon pre-trained image-to-video generation models and extends them with a Trajectory ControlNet. This design effectively encodes different kinds of trajectory information into the video generation model, enabling trajectory-controllable video generation.

Progressive Training Procedure: MagicMotion uses a dense-to-sparse training procedure that trains the model with progressively sparser trajectory conditions: masks, boxes, and sparse boxes (boxes provided for fewer than 10 frames). Experiments show that the model can leverage knowledge learned in the previous stage to achieve better performance than training from scratch.
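As a minimal sketch of how the three condition levels relate, the snippet below derives boxes and sparse boxes from per-frame object masks. The function names, the random frame-selection heuristic, and the data layout are illustrative assumptions, not MagicMotion's exact implementation.

```python
import numpy as np

def mask_to_box(mask: np.ndarray) -> np.ndarray:
    """Convert a binary object mask (H, W) into an [x0, y0, x1, y1] box."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return np.zeros(4, dtype=np.float32)
    return np.array([xs.min(), ys.min(), xs.max() + 1, ys.max() + 1], dtype=np.float32)

def build_trajectory(masks: np.ndarray, level: str, num_sparse: int = 10,
                     rng: np.random.Generator | None = None) -> dict:
    """Derive a trajectory condition from per-frame masks of shape (T, H, W).

    level = "mask"       -> dense masks for every frame
    level = "box"        -> one bounding box per frame
    level = "sparse_box" -> boxes on at most `num_sparse` frames (first frame always kept)
    """
    rng = rng or np.random.default_rng()
    boxes = np.stack([mask_to_box(m) for m in masks])  # (T, 4)
    if level == "mask":
        return {"masks": masks}
    if level == "box":
        return {"boxes": boxes, "frame_ids": np.arange(len(masks))}
    # sparse_box: keep the first frame plus a small random subset of later frames
    extra = rng.choice(np.arange(1, len(masks)),
                       size=min(num_sparse - 1, len(masks) - 1), replace=False)
    keep = np.sort(np.concatenate([[0], extra]))
    return {"boxes": boxes[keep], "frame_ids": keep}
```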

Latent Segment Loss: As shown in Figure 1, MagicMotion employs a novel latent segment loss that helps the video generation model better capture the fine-grained shape of objects with minimal computational overhead.
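The sketch below illustrates one way such a loss could be computed: a small trainable head maps DiT features to per-token mask logits, supervised against segmentation masks encoded at latent resolution. The head architecture and the binary cross-entropy formulation are assumptions for illustration; the paper only states that a trainable segment head predicts latent segmentation masks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentHead(nn.Module):
    """Lightweight head mapping concatenated DiT features to per-token mask logits.
    The 2-layer MLP design is an assumption, not the paper's exact architecture."""
    def __init__(self, feat_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),  # one logit per latent token
        )

    def forward(self, dit_features: torch.Tensor) -> torch.Tensor:
        # dit_features: (B, N_tokens, feat_dim), gathered from DiT blocks
        return self.proj(dit_features).squeeze(-1)  # (B, N_tokens)

def latent_segment_loss(seg_logits: torch.Tensor, latent_masks: torch.Tensor) -> torch.Tensor:
    """Supervise predicted latent masks with ground-truth masks at latent resolution.
    Binary cross-entropy is used here for illustration; the paper's exact loss may differ."""
    target = latent_masks.flatten(1).float()  # (B, N_tokens), values in [0, 1]
    return F.binary_cross_entropy_with_logits(seg_logits, target)
```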

Figure 1: Overview of the MagicMotion architecture (text prompt and text encoder are omitted for simplicity). MagicMotion employs a pretrained 3D VAE to encode the input trajectory, first-frame image, and training video into latent space. It has two separate branches: the video branch processes video and image tokens, while the trajectory branch uses Trajectory ControlNet to fuse trajectory and image tokens, which are then injected into the video branch through a zero-initialized convolution layer. In addition, diffusion features from the DiT blocks are concatenated and processed by a trainable segment head to predict latent segmentation masks, which contribute to our latent segment loss.
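The zero-initialized convolution described above can be sketched as follows: because its weights start at zero, the trajectory branch contributes nothing at the start of training and the pretrained video branch is preserved, a common ControlNet-style design. Tensor shapes and the simple additive fusion shown here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ZeroConv3d(nn.Module):
    """1x1x1 3D convolution initialized to zero, so the trajectory branch
    initially adds nothing to the pretrained video branch."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x)

def fuse_trajectory(video_feat: torch.Tensor,
                    traj_feat: torch.Tensor,
                    zero_conv: ZeroConv3d) -> torch.Tensor:
    """Add zero-conv-projected trajectory features to the video-branch features.
    Both tensors are assumed to share shape (B, C, T, H, W) in latent space."""
    return video_feat + zero_conv(traj_feat)
```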
Data Pipeline

We present a comprehensive and general data pipeline for generating high-quality video data with both dense (mask) and sparse (bounding box) annotations.

The data pipeline consists of two main stages: a Curation Pipeline and a Filtering Pipeline. The Curation Pipeline constructs trajectory information from a video-text dataset, while the Filtering Pipeline removes unsuitable videos before training.
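The exact filtering criteria are not detailed here; as a hypothetical illustration, the sketch below shows the kind of checks such a filter might apply, discarding videos whose annotated objects are very small or nearly static. Both checks and their thresholds are assumptions, not the paper's actual rules.

```python
import numpy as np

def keep_video(boxes: np.ndarray, frame_area: float,
               min_area_ratio: float = 0.01, min_motion_ratio: float = 0.05) -> bool:
    """Decide whether a video with per-frame boxes (T, 4) is kept for training.
    Illustrative checks (assumed, not from the paper):
      1. the object should not be vanishingly small on average;
      2. the object should actually move across the video."""
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    mean_area_ratio = float(np.mean(widths * heights) / frame_area)

    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)  # (T, 2)
    travel = float(np.linalg.norm(centers[-1] - centers[0]))
    scale = float(np.sqrt(frame_area))  # rough length scale for normalizing motion

    return mean_area_ratio >= min_area_ratio and travel / scale >= min_motion_ratio
```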

Figure 2: Overview of the dataset pipeline. The Curation Pipeline constructs trajectory annotations, while the Filtering Pipeline filters out videos unsuitable for training.
More Results

Same Input Image with Different Trajectories

Comparisons with Other Approaches