Generated videos guided by different trajectory conditions. (For all three trajectory conditions, the first frame is given as segmentation masks of the foreground objects.)
We present MagicMotion, a controllable image-to-video generation model that enables trajectory control through three levels of conditions from dense to sparse: masks, bounding boxes, and sparse boxes.
Overview: As shown in Figure 1, MagicMotion builds upon pre-trained image-to-video generation models, extending them with a Trajectory ControlNet. This design encodes different kinds of trajectory information into the video generation model and enables trajectory-controllable video generation.
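The ControlNet idea above can be sketched in a few lines: a trainable copy of the backbone's early blocks encodes the rendered trajectory frames, and each block's output passes through a zero-initialized projection before being added back to the backbone, so at initialization the branch contributes nothing and the pre-trained model is unchanged. This is a minimal illustration with made-up layer shapes, not the actual MagicMotion architecture.

```python
import torch
import torch.nn as nn

class TrajectoryControlNet(nn.Module):
    """ControlNet-style trajectory branch (illustrative sizes only).

    Encodes a trajectory video (per-frame masks or box maps) and returns
    one residual per block, intended to be added to the matching backbone
    blocks. Zero-initialized projections make every residual exactly zero
    at the start of training, preserving the pre-trained backbone.
    """

    def __init__(self, channels: int = 64, num_blocks: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Conv3d(channels, channels, 3, padding=1) for _ in range(num_blocks)
        )
        self.zero_projs = nn.ModuleList()
        for _ in range(num_blocks):
            proj = nn.Conv3d(channels, channels, 1)
            nn.init.zeros_(proj.weight)  # output starts at zero
            nn.init.zeros_(proj.bias)
            self.zero_projs.append(proj)

    def forward(self, traj_feats: torch.Tensor) -> list[torch.Tensor]:
        residuals = []
        h = traj_feats  # (B, C, T, H, W) encoded trajectory frames
        for block, proj in zip(self.blocks, self.zero_projs):
            h = torch.relu(block(h))
            residuals.append(proj(h))
        return residuals
```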
Progressive Training Procedure: MagicMotion uses a dense-to-sparse training procedure to train the model with progressively sparser trajectory conditions: mask, box, and sparse box (boxes provided on fewer than 10 frames). Experiments show that the model leverages knowledge learned in the previous stage to achieve better performance than training from scratch.
Latent Segment Loss: As shown in Figure 1, MagicMotion utilizes a novel latent segment loss that helps the video generation model better capture the fine-grained shapes of objects at minimal computational cost.
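One way such a loss can work, sketched below under assumptions: a lightweight head predicts segmentation logits directly from the video latents, supervised by ground-truth masks downsampled to latent resolution. Operating in latent space, rather than decoding to pixels, is what keeps the extra computation small. The head and loss choice here are placeholders, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def latent_segment_loss(latents: torch.Tensor,
                        gt_masks: torch.Tensor,
                        seg_head: nn.Module) -> torch.Tensor:
    """Segmentation loss computed in latent space (illustrative sketch).

    latents:  (B, C, t, h, w) video latents from the diffusion model.
    gt_masks: (B, 1, T, H, W) binary foreground masks at pixel resolution.
    seg_head: small network mapping latents to (B, 1, t, h, w) logits.
    """
    pred = seg_head(latents)
    # Downsample ground-truth masks to the latent grid.
    target = F.interpolate(gt_masks, size=pred.shape[-3:], mode="nearest")
    return F.binary_cross_entropy_with_logits(pred, target)
```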
We present a comprehensive and general data pipeline for generating high-quality video data with both dense (mask) and sparse (bounding box) annotations.
The data pipeline consists of two main stages: a Curation Pipeline and a Filtering Pipeline. The Curation Pipeline constructs trajectory information from a video-text dataset, while the Filtering Pipeline removes unsuitable videos before training.
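The two stages compose naturally as a curate-then-filter skeleton. The function names below are placeholders: the actual annotation tools (e.g. which segmentation or tracking models produce the masks and boxes) and the filtering criteria are not specified here.

```python
def build_training_set(video_text_pairs, curate, keep):
    """Two-stage data pipeline skeleton (names are hypothetical).

    video_text_pairs: iterable of (video, caption) pairs.
    curate: video -> (masks, boxes), the Curation Pipeline producing
            dense (mask) and sparse (bounding box) trajectory annotations.
    keep:   annotated sample -> bool, the Filtering Pipeline dropping
            unsuitable videos before training.
    """
    curated = []
    for video, caption in video_text_pairs:
        masks, boxes = curate(video)  # Curation Pipeline
        curated.append({
            "video": video,
            "caption": caption,
            "masks": masks,
            "boxes": boxes,
        })
    # Filtering Pipeline: only suitable samples reach training.
    return [sample for sample in curated if keep(sample)]
```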