Diffusion-Based One-Shot Video Generation via Pose-Guided Temporal Consistency and Spatial Alignment

Temporal-consistent · One-shot video generation · Pose-guided

A research project I architected and executed at Irreverent Labs. For technical details, please see our paper.

Abstract

We propose a novel framework that integrates human pose information as a guiding mechanism for the text-to-video (T2V) synthesis process. Our model leverages pose estimation and motion priors to ensure anatomically plausible and contextually relevant human movements, bridging the gap between textual semantics and visual dynamics.

We also introduce two novel pose-specific evaluation metrics, the Frame-to-Frame Pose Distance (FFPD) and the Optical Flow Difference Score (OFDS), designed to quantitatively evaluate the alignment of generated poses with the reference pose sequence and the temporal realism of motion sequences, respectively. Extensive experiments on benchmark datasets demonstrate that our approach outperforms existing methods in both qualitative and quantitative evaluations, achieving superior control over human pose dynamics while maintaining high video quality. This work not only advances the state of the art in T2V synthesis but also provides a robust framework for applications requiring precise human motion generation, such as virtual avatars, animation, and human-computer interaction.

Previous Challenges

  • The requirement for a model to generalize across all possible scenes while maintaining high-quality output generation based on input text prompts. This poses a significant challenge due to the substantial divergence between the inference-time data distribution and the training-time distribution, leading to suboptimal generalization performance.
  • The underutilization of critical prior information inherently present in the input data. Despite its availability, the model fails to adequately emphasize or leverage this information, resulting in degraded performance, particularly in tasks requiring temporal consistency. This highlights a gap in the model's ability to effectively integrate and prioritize salient features from the input domain.
Hypothesis

    1. A one-shot diffusion model, designed to moderately overfit to a single video-text prompt pair during training, demonstrates the capability to generate contextually faithful video content when provided with a semantically similar text prompt during inference.

    2. By maintaining consistent pose priors across both the training and inference phases, the model generates video content with significantly improved motion coherence and temporal consistency.

Figure: Training input: a video + pose feature + text prompt. Inference input: pose feature + text prompt

Dataset


Since the model is designed to overfit to a single video-pose-text triplet, only a single data point is required (the sketch after this list shows how such a triplet might be assembled).

  • Train set input: a 24-frame video + a text prompt + human pose
  • Human pose: Inferred with OpenPose on the same input video
  • Diversity: 24 videos in total
  • Input dimension: 128 x 128 x 3 x 24
  • Inference input: a text prompt "Mickey Mouse is doing yoga in a room" + human pose
Experiments

Setup

  • 1 H100 GPU + 100 epochs of training (about 5 min)
  • Text prompts at inference:

  • "A lady is doing yoga in a room (<- same as the training input)"
  • "Micky mouse is practising yoga in a room"
  • "Spider man is doing yoga on the beach, cartoon style."
  • "Wonder woman is doing yoga with a hat."
Figure: Generated results based on the input pose and prompts (please refresh this page if the animation stops)

Method

1. The workflow and the main building blocks

Figure: Overall structure of the proposed model
Figure: UNet structure
Figure: The pose features (named adapted features) are merged with the video representation (named hidden states) in the latent space
Figure: A complete block that takes the timestep, the latent video variable, and the pose features


    2. Classifier-Free Guidance with Pose Information

Figure: The standard classifier-free guidance (CFG) formula
Figure: Sampling with pose-guided CFG
Figure: Pose-guided CFG implementation

    3. Optical Flow Difference Score

Figure: A graphic illustration of the Optical Flow Difference Score (OFDS)

Occasionally, the generated video sequences exhibit artifacts such as the presence of additional limbs or heads in certain frames, indicating inconsistencies in the synthesis process. To quantitatively detect and measure such anomalies, we introduce a novel metric termed the Optical Flow Difference Score (OFDS). This metric computes the optical flow vectors between two consecutive frames \( f_1 \) and \( f_2 \) in the generated video. Using these vectors, we warp \( f_1 \) to produce a predicted frame \( f_{2}' \). The \( L_2 \) norm between \( f_2 \) and \( f_{2}' \) is then calculated to quantify the discrepancy. A low OFDS indicates smooth temporal transitions with minimal pixel-level inconsistencies, while a high OFDS signifies the presence of unexpected artifacts, such as unnatural pixel displacements or structural anomalies. This metric provides a robust mechanism for evaluating the temporal coherence and structural integrity of generated video sequences.



4. Frame-to-Frame Pose Distance (FFPD)

Figure: A graphic illustration of the Frame-to-Frame Pose Distance (FFPD)

To quantitatively evaluate the alignment between generated videos and the original training video, we introduce a novel metric termed the Frame-to-Frame Pose Distance (FFPD). Specifically, we employ an off-the-shelf pose estimation model to extract pose keypoints from each frame of both the generated video \( V_{gen} \) and the reference video \( V_{ref} \). Let \( P_{gen}^t \) and \( P_{ref}^t \) denote the sets of pose joint coordinates at frame \( t \) for \( V_{gen} \) and \( V_{ref} \), respectively. The FFPD is then the \( L_2 \) distance between corresponding pose keypoints, averaged over all frames:

\[ \text{FFPD} = \frac{1}{T} \sum_{t=1}^{T} \left\lVert P_{gen}^{t} - P_{ref}^{t} \right\rVert_2 \]

    where \(T \) represents the total number of frames. A high FFPD score indicates significant misalignment between the generated and reference videos, suggesting deviations in pose dynamics. Conversely, a low FFPD score reflects strong alignment, demonstrating that the generated video faithfully preserves the pose structure of the reference video. This metric provides a robust measure for assessing the structural consistency of generated videos in relation to the original training data.

Analysis

Figure: Optical Flow Difference Score (OFDS) computed on generations with and without pose guidance
Figure: The face is more closely aligned in the pose-guided generated video
Figure: The arm is more closely aligned in the pose-guided generated video
Figure: A quantitative comparison between generated results with and without pose guidance; pose-guided results are more closely aligned to the original training video.

References

    1. Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
    2. T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models