Recursive Temporal-Consistent Content Generation on Latent Variables via Alpha Diffusion Framework: Integrating Global and Local Contextual Modeling for 30-Second Sequences

article image

Animation above: 893 frames recursively generated from only 2 input frames used as conditioning

Tags: temporal consistency, alpha-blend diffusion, recursive generation

A technical research project I architected and executed at Irreverent Labs. For technical details, please see our paper.

Objective

This paper introduces a novel framework for recursively generating temporally consistent content sequences of 30-second duration using an Alpha Diffusion architecture. By integrating global and local contextual modeling, our approach ensures coherence across temporal scales while maintaining high fidelity in content generation. The global context captures overarching structural patterns, while the local context refines fine-grained details, enabling seamless transitions and long-term consistency. Experimental results demonstrate the effectiveness of the proposed method in producing realistic and temporally stable outputs, outperforming existing baselines in both qualitative and quantitative evaluations. This work advances the state-of-the-art in generative modeling for sequential data, with potential applications in video synthesis, dynamic scene generation, and time-series forecasting.
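The "Alpha Diffusion" backbone follows the iterative α-(de)blending idea from reference 1 below: a sample is a straight blend between noise and data, and the network learns the direction from one to the other. The sketch below is a minimal, unconditional version of that formulation as I read it; the model signature, step count, and tensor shapes are illustrative assumptions, and in the full framework the network is additionally conditioned on the local and global context described later.

```python
import torch

def alpha_blend(x0, x1, alpha):
    # Blend a noise sample x0 toward a data sample x1 at level alpha in [0, 1].
    return (1.0 - alpha) * x0 + alpha * x1

def iadb_training_loss(model, x1):
    # The network learns to predict the deblending direction x1 - x0 from the
    # blended sample and the blend level alpha. x1 is assumed to be (B, C, H, W).
    x0 = torch.randn_like(x1)
    alpha = torch.rand(x1.shape[0], 1, 1, 1, device=x1.device)
    x_alpha = alpha_blend(x0, x1, alpha)
    return torch.mean((model(x_alpha, alpha) - (x1 - x0)) ** 2)

@torch.no_grad()
def iadb_sample(model, shape, steps=128, device="cpu"):
    # Deterministic sampling: start from pure noise (alpha = 0) and walk alpha
    # toward 1, following the predicted direction at each step.
    x = torch.randn(shape, device=device)
    for i in range(steps):
        alpha = torch.full((shape[0], 1, 1, 1), i / steps, device=device)
        x = x + (1.0 / steps) * model(x, alpha)
    return x
```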

Challenges

  • Prediction errors propagate and amplify iteratively during the sampling process, leading to suboptimal generative outcomes.
  • Existing frameworks fail to fully exploit available informational priors, resulting in inefficient utilization of contextual data.
Hypothesis

We can simulate the error-amplification phenomenon by training a model to adapt to its own errors and iteratively refine its outputs, while leveraging multi-scale, fine-grained contextual information to enhance corrective learning.
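The error-amplification problem can be made concrete with a toy experiment (purely illustrative, unrelated to the actual model): a one-step predictor with a tiny per-step error, when fed its own outputs recursively, drifts further and further from the ground truth as the rollout grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_identity_predictor(x):
    # Stand-in for a learned one-step predictor with a small per-step error.
    return x + rng.normal(scale=0.01, size=x.shape)

state = np.zeros(4)   # ground truth stays at zero throughout
errors = []
for step in range(893):                      # same horizon as the 893-frame rollout
    state = noisy_identity_predictor(state)  # the predictor consumes its own output
    errors.append(np.abs(state).mean())

print(f"error after 1 step:    {errors[0]:.4f}")
print(f"error after 893 steps: {errors[-1]:.4f}")  # the drift has accumulated
```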

Dataset

Examples from the train and validation sets

GIF 1 GIF 2 GIF 3
  • Train data: 60 seconds x 256 videos, synthetically generated with Unity
  • Diversity: 24 basic fighting animations (e.g., left punch, heavy kick)
  • Input dimension: 128 x 128 x 3 x frame_number
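For context, a minimal sketch of how such clips could be served to the trainer; the directory layout, the .npy format, and `clip_len` are illustrative assumptions, not the exact data pipeline used.

```python
import glob
import numpy as np
import torch
from torch.utils.data import Dataset

class UnityFightingClips(Dataset):
    """Serves fixed-length clips from pre-extracted frame tensors.

    Assumes each of the 256 synthetic videos was dumped to a .npy file of
    shape (frame_number, 128, 128, 3) with pixel values in [0, 255].
    """

    def __init__(self, root="data/train", clip_len=16):
        self.paths = sorted(glob.glob(f"{root}/*.npy"))
        self.clip_len = clip_len

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        video = np.load(self.paths[idx])                      # (T, 128, 128, 3)
        start = np.random.randint(0, video.shape[0] - self.clip_len + 1)
        clip = video[start:start + self.clip_len]             # (clip_len, 128, 128, 3)
        clip = torch.from_numpy(clip).float() / 127.5 - 1.0   # scale to [-1, 1]
        return clip.permute(0, 3, 1, 2)                       # (clip_len, 3, 128, 128)
```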
Experiments

Setup: 4 H100 GPUs

Inputs: 2 conditioning frames from an unseen test set

GIF 1 GIF 2 GIF 3

Outputs: 893 generated frames

GIF 1 GIF 2 GIF 3

Simplified Diagrams (in 3 steps)

1. How local and global context information between two consecutive time steps is forwarded to the diffusion framework

diagram1
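One way to picture step 1 (the names, shapes, and clip-by-clip granularity are my own illustrative assumptions): at each time step the diffusion framework receives a pair of tensors, the fine-grained local context and the coarse global summary, alongside the noise to be denoised.

```python
from dataclasses import dataclass
import torch

@dataclass
class ContextBundle:
    local_frames: torch.Tensor   # fine-grained: the k most recent frames, (k, 3, 128, 128)
    global_feat: torch.Tensor    # coarse: summary vector of the sequence so far, (d,)

def denoise_next_clip(diffusion_model, ctx: ContextBundle, clip_len=16):
    # At time step t, both context tensors are forwarded to the diffusion
    # framework together with the noise that will become the next clip.
    noise = torch.randn(clip_len, 3, 128, 128)
    return diffusion_model(noise, ctx.local_frames, ctx.global_feat)
```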

2. How local and global context information is consumed by the diffusion framework

diagram2
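The exact conditioning mechanism is detailed in the paper; as a generic stand-in, the sketch below shows two common routes by which a denoiser can consume such contexts: channel-wise concatenation for the local frames and FiLM-style modulation for the global feature. This is an illustrative module, not the architecture used in the work.

```python
import torch
import torch.nn as nn

class ContextConditionedDenoiser(nn.Module):
    """Illustrative denoiser: local frames are concatenated channel-wise with
    the noisy input, and the global feature modulates intermediate activations
    FiLM-style (scale and shift)."""

    def __init__(self, k_local=2, global_dim=256, hidden=64):
        super().__init__()
        in_ch = 3 * (1 + k_local)                      # noisy frame + k local frames
        self.encode = nn.Conv2d(in_ch, hidden, 3, padding=1)
        self.film = nn.Linear(global_dim, 2 * hidden)  # scale and shift from global context
        self.decode = nn.Conv2d(hidden, 3, 3, padding=1)

    def forward(self, noisy_frame, local_frames, global_feat):
        # noisy_frame: (B, 3, H, W); local_frames: (B, k, 3, H, W); global_feat: (B, global_dim)
        b = noisy_frame.shape[0]
        local_flat = local_frames.reshape(b, -1, *noisy_frame.shape[-2:])
        h = self.encode(torch.cat([noisy_frame, local_flat], dim=1))
        scale, shift = self.film(global_feat).chunk(2, dim=1)
        h = h * (1 + scale[..., None, None]) + shift[..., None, None]
        return self.decode(torch.relu(h))
```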

3. How new local and global context is recursively computed between two consecutive time steps

diagram3
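Putting the three steps together, the recursive rollout might look like the following sketch. The encoder, the clip length, and the simple running-average update of the global context are assumptions; only the 2 conditioning frames and the 893-frame horizon come from the experiments above.

```python
import torch

@torch.no_grad()
def recursive_rollout(diffusion_model, encoder, cond_frames, total_frames=893, clip_len=9):
    # cond_frames: (2, 3, 128, 128) -- the only ground-truth input to the rollout.
    # encoder(frames) -> summary vector used as the global context (an assumption).
    local_ctx = cond_frames
    global_ctx = encoder(cond_frames)
    frames = list(cond_frames)

    while len(frames) - cond_frames.shape[0] < total_frames:
        noise = torch.randn(clip_len, *cond_frames.shape[1:])
        clip = diffusion_model(noise, local_ctx, global_ctx)  # steps 1 and 2
        frames.extend(clip)                                   # append the new frames
        local_ctx = clip[-cond_frames.shape[0]:]              # step 3: refresh local context
        global_ctx = 0.5 * (global_ctx + encoder(clip))       # step 3: fold clip into global context

    return torch.stack(frames[:cond_frames.shape[0] + total_frames])
```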

Training Logic

main logic

Key points:

  • Each training step is broken down into 2 separate stages
  • The video predicted in stage 1 is used as a prior in stage 2
  • Within each training step, the model updates its learnable parameters twice!
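A minimal sketch of such a training step under my reading of those key points; the loss, the optimizer handling, and the way the stage-1 prediction re-enters as a prior are illustrative assumptions.

```python
import torch

def two_stage_training_step(model, optimizer, gt_context, gt_clip, loss_fn):
    # Stage 1: predict the clip from ground-truth context and update the weights.
    pred_stage1 = model(gt_context)
    loss1 = loss_fn(pred_stage1, gt_clip)
    optimizer.zero_grad()
    loss1.backward()
    optimizer.step()                       # first parameter update

    # Stage 2: the (imperfect) stage-1 prediction now serves as the prior,
    # so the model learns to correct its own errors.
    pred_stage2 = model(pred_stage1.detach())
    loss2 = loss_fn(pred_stage2, gt_clip)
    optimizer.zero_grad()
    loss2.backward()
    optimizer.step()                       # second parameter update

    return loss1.item(), loss2.item()
```

Detaching the stage-1 prediction keeps the second update from backpropagating through the first forward pass; whether the actual implementation shares gradients across the two stages is not specified here.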
Schrödinger's cat

Inputs: 2 conditioning frames from an unseen test set + different random seeds

cond frames

Outputs: 3 different generations of 893 frames

GIF 1 GIF 2 GIF 3

The same inputs yield multiple plausible outputs because of the different random seeds.
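Mechanically, this is just re-running the sampler with fixed conditioning frames but a different seed for the initial noise; a minimal sketch, where `generate_fn` stands in for the full recursive sampler:

```python
import torch

def sample_variants(generate_fn, cond_frames, seeds=(0, 1, 2)):
    # The conditioning frames stay fixed; only the seed driving the initial
    # noise changes, so each run yields a different plausible continuation.
    variants = []
    for seed in seeds:
        torch.manual_seed(seed)
        variants.append(generate_fn(cond_frames))
    return variants
```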

Train and validation losses

train_and_valid

References

1. Iterative α-(de)Blending: a Minimalist Deterministic Diffusion Model
2. Video Diffusion Models with Local-Global Context Guidance