Causality — referring to temporal, uni-directional cause-effect relationships between components — underlies many complex generative processes, including videos, language, and robot trajectories. Current causal diffusion models entangle temporal reasoning with iterative denoising, applying causal attention across all layers, at every denoising step, and over the entire context. In this paper, we show that the causal reasoning in these models is separable from the multi-step denoising process. Through systematic probing of autoregressive video diffusers, we uncover two key regularities: (1) early layers produce highly similar features across denoising steps, indicating redundant computation along the diffusion trajectory; and (2) deeper layers exhibit sparse cross-frame attention and primarily perform intra-frame rendering. Motivated by these findings, we introduce Separable Causal Diffusion (SCD), a new architecture that explicitly decouples once-per-frame temporal reasoning, via a causal transformer encoder, from multi-step frame-wise rendering, via a lightweight diffusion decoder. Extensive experiments on both pretraining and post-training tasks across synthetic and real benchmarks show that SCD significantly improves throughput and per-frame latency while matching or surpassing the generation quality of strong causal diffusion baselines.
When generating a frame, early and middle layers of video diffusion models produce nearly identical features across the entire diffusion trajectory. As shown in the cosine similarity matrices below, the 15th-layer features remain highly correlated across all 50 denoising steps. PCA visualizations further confirm that these layers capture global layout and motion from context frames, yet redundantly recompute them at every denoising step. This suggests that early-layer computation can be amortized and shared across steps without loss of information.
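The redundancy measurement above can be sketched as follows. This is a minimal illustration with synthetic stand-in features, not the paper's actual probing code: we treat one layer's features at each denoising step as a vector, and compute the pairwise cosine similarity matrix across steps. The toy features (a shared "layout" direction plus small step-dependent noise) mimic the observed behavior, where similarity stays near 1 along the whole trajectory.

```python
import numpy as np

def cosine_similarity_matrix(feats: np.ndarray) -> np.ndarray:
    """feats: (num_steps, dim) features from one layer, one row per
    denoising step. Returns the (num_steps, num_steps) pairwise
    cosine similarity matrix."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return normed @ normed.T

# Toy stand-in for mid-layer features over 50 denoising steps:
# a shared global-layout direction plus small per-step noise.
rng = np.random.default_rng(0)
layout = rng.normal(size=256)
feats = layout + 0.05 * rng.normal(size=(50, 256))

sim = cosine_similarity_matrix(feats)
# All off-diagonal entries stay close to 1, i.e. the layer recomputes
# nearly the same features at every step.
```

A similarity matrix that is uniformly near 1 is exactly the signature that licenses caching: the layer's output can be computed once and reused across the remaining denoising steps.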
Deeper layers exhibit a markedly different behavior: they barely attend to past frames. Despite being trained with dense causal attention masks, the cross-frame attention weights in deeper layers become increasingly sparse with depth. Layers 28 and 29 focus almost exclusively on intra-frame tokens, performing per-frame rendering rather than temporal reasoning. This natural emergence of temporal sparsity motivates a clean architectural separation between cross-frame reasoning and per-frame denoising.
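The sparsity claim can be quantified with a simple statistic: for each query token, the fraction of attention mass placed on tokens belonging to other frames. The sketch below (illustrative only; the function name and the toy setup are our own, not the paper's code) shows the metric on a synthetic attention map that is nearly block-diagonal by frame, as deep layers were observed to be.

```python
import numpy as np

def cross_frame_attention_mass(attn: np.ndarray, frame_ids: np.ndarray) -> float:
    """attn: (num_queries, num_keys) row-stochastic attention weights.
    frame_ids: frame index of each token. Returns the average attention
    mass a query places on tokens outside its own frame."""
    cross = frame_ids[:, None] != frame_ids[None, :]
    return float((attn * cross).sum(axis=1).mean())

# Toy example: 2 frames of 4 tokens each. Boosting intra-frame logits
# yields a near block-diagonal map, mimicking a deep "rendering" layer.
rng = np.random.default_rng(0)
frame_ids = np.repeat(np.arange(2), 4)
logits = rng.normal(size=(8, 8))
logits[frame_ids[:, None] == frame_ids[None, :]] += 5.0
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

mass = cross_frame_attention_mass(attn, frame_ids)
# mass is small: almost all attention stays within the query's own frame.
```

Tracking this statistic per layer makes the trend concrete: it decays with depth, so the deepest layers can be conditioned on a per-frame context instead of full causal attention.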
Motivated by these observations, we propose Separable Causal Diffusion (SCD), which explicitly decouples temporal reasoning from iterative denoising. An autoregressive causal encoder operates once per frame, replacing the redundant early layers and producing a compact temporal context—analogous to the backbone of autoregressive language models. A lightweight diffusion decoder then renders each frame independently, conditioned on this context, with no access to past frames. This separation allows the encoder to be made larger without increasing per-step cost, while the decoder remains efficient across multiple denoising steps.
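The separation can be sketched as a generation loop, shown below with stand-in linear modules (all names and weights here are illustrative placeholders, not the SCD implementation). The structural point is the call pattern: the causal encoder runs once per frame over past frames, while the diffusion decoder runs for every denoising step but sees only the noisy frame and the compact context.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, STEPS, FRAMES = 64, 4, 3

# Stand-in weights; SCD's real modules are a causal transformer encoder
# and a lightweight diffusion decoder.
W_enc = rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)
W_dec = rng.normal(size=(2 * DIM, DIM)) / np.sqrt(2 * DIM)

def encode(past_frames):
    """Temporal reasoning: summarize past frames into a compact context."""
    h = np.mean(past_frames, axis=0) if past_frames else np.zeros(DIM)
    return np.tanh(h @ W_enc)

def decode_step(x_t, ctx):
    """One denoising step, conditioned only on the context vector
    (no access to past frames)."""
    return x_t - 0.1 * (np.concatenate([x_t, ctx]) @ W_dec)

enc_calls = dec_calls = 0
frames = []
for _ in range(FRAMES):
    ctx = encode(frames)              # once per frame
    enc_calls += 1
    x = rng.normal(size=DIM)          # start from noise
    for _ in range(STEPS):            # multi-step frame-wise rendering
        x = decode_step(x, ctx)
        dec_calls += 1
    frames.append(x)
```

Because the encoder contributes once per frame rather than once per step, its cost is amortized over all denoising steps, which is why it can be scaled up without increasing per-step latency.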
Our model generates high-quality 480p videos with an initial latency of ~0.29 seconds, after which frames are generated in a streaming fashion at ~11.1 FPS on a single H100 GPU. Below, we show text-to-video samples generated by our fine-tuned 1.3B Separable Causal Diffusion model with a 25-layer causal encoder and a 10-layer diffusion decoder. Click any video to view it at full size with its text prompt.
@article{bai2026causality,
  title={Causality in Video Diffusers is Separable from Denoising},
  author={Bai, Xingjian and He, Guande and Li, Zhengqi and Shechtman, Eli and Huang, Xun and Wu, Zongze},
  journal={arXiv preprint},
  year={2026}
}