Causality — referring to temporal, uni-directional cause-effect relationships between components — underlies many complex generative processes, including videos, language, and robot trajectories. Current causal diffusion models entangle temporal reasoning with iterative denoising, applying causal attention across all layers, at every denoising step, and over the entire context. In this paper, we show that the causal reasoning in these models is separable from the multi-step denoising process. Through systematic probing of autoregressive video diffusers, we uncover two key regularities: (1) early layers produce highly similar features across denoising steps, indicating redundant computation along the diffusion trajectory; and (2) deeper layers exhibit sparse cross-frame attention and primarily perform intra-frame rendering. Motivated by these findings, we introduce Separable Causal Diffusion (SCD), a new architecture that explicitly decouples once-per-frame temporal reasoning, via a causal transformer encoder, from multi-step frame-wise rendering, via a lightweight diffusion decoder. Extensive experiments on both pretraining and post-training tasks across synthetic and real benchmarks show that SCD significantly improves throughput and per-frame latency while matching or surpassing the generation quality of strong causal diffusion baselines.
When generating a frame, early and middle layers of video diffusion models produce nearly identical features across the entire diffusion trajectory. As shown in the cosine similarity matrices below, the 15th-layer features remain highly correlated across all 50 denoising steps. PCA visualizations further confirm that these layers capture global layout and motion from context frames, yet redundantly recompute them at every denoising step. This suggests that early-layer computation can be amortized and shared across steps without loss of information.
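The redundancy measurement above can be sketched as follows. This is a minimal illustration with synthetic stand-in features, not the paper's actual probing code: we treat one layer's features at each denoising step as a vector, and compute the pairwise cosine similarity matrix across steps. The toy features (a shared "layout" direction plus small step-dependent noise) mimic the observed behavior, where similarity stays near 1 along the whole trajectory.

```python
import numpy as np

def cosine_similarity_matrix(feats: np.ndarray) -> np.ndarray:
    """feats: (num_steps, dim) features from one layer, one row per
    denoising step. Returns the (num_steps, num_steps) pairwise
    cosine similarity matrix."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return normed @ normed.T

# Toy stand-in for mid-layer features over 50 denoising steps:
# a shared global-layout direction plus small per-step noise.
rng = np.random.default_rng(0)
layout = rng.normal(size=256)
feats = layout + 0.05 * rng.normal(size=(50, 256))

sim = cosine_similarity_matrix(feats)
# All off-diagonal entries stay close to 1, i.e. the layer recomputes
# nearly the same features at every step.
```

A similarity matrix that is uniformly near 1 is exactly the signature that licenses caching: the layer's output can be computed once and reused across the remaining denoising steps.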
Deeper layers exhibit a markedly different behavior: they barely attend to past frames. Despite being trained with dense causal attention masks, the cross-frame attention weights in deeper layers become increasingly sparse with depth. Layers 28 and 29 focus almost exclusively on intra-frame tokens, performing per-frame rendering rather than temporal reasoning. This natural emergence of temporal sparsity motivates a clean architectural separation between cross-frame reasoning and per-frame denoising.
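The sparsity claim can be quantified with a simple statistic: for each query token, the fraction of attention mass placed on tokens belonging to other frames. The sketch below (illustrative only; the function name and the toy setup are our own, not the paper's code) shows the metric on a synthetic attention map that is nearly block-diagonal by frame, as deep layers were observed to be.

```python
import numpy as np

def cross_frame_attention_mass(attn: np.ndarray, frame_ids: np.ndarray) -> float:
    """attn: (num_queries, num_keys) row-stochastic attention weights.
    frame_ids: frame index of each token. Returns the average attention
    mass a query places on tokens outside its own frame."""
    cross = frame_ids[:, None] != frame_ids[None, :]
    return float((attn * cross).sum(axis=1).mean())

# Toy example: 2 frames of 4 tokens each. Boosting intra-frame logits
# yields a near block-diagonal map, mimicking a deep "rendering" layer.
rng = np.random.default_rng(0)
frame_ids = np.repeat(np.arange(2), 4)
logits = rng.normal(size=(8, 8))
logits[frame_ids[:, None] == frame_ids[None, :]] += 5.0
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

mass = cross_frame_attention_mass(attn, frame_ids)
# mass is small: almost all attention stays within the query's own frame.
```

Tracking this statistic per layer makes the trend concrete: it decays with depth, so the deepest layers can be conditioned on a per-frame context instead of full causal attention.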
Motivated by these observations, we propose Separable Causal Diffusion (SCD), which explicitly decouples temporal reasoning from iterative denoising. An autoregressive causal encoder operates once per frame, replacing the redundant early layers and producing a compact temporal context—analogous to the backbone of autoregressive language models. A lightweight diffusion decoder then renders each frame independently, conditioned on this context, with no access to past frames. This separation allows the encoder to be made larger without increasing per-step cost, while the decoder remains efficient across multiple denoising steps.
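The separation can be sketched as a generation loop, shown below with stand-in linear modules (all names and weights here are illustrative placeholders, not the SCD implementation). The structural point is the call pattern: the causal encoder runs once per frame over past frames, while the diffusion decoder runs for every denoising step but sees only the noisy frame and the compact context.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, STEPS, FRAMES = 64, 4, 3

# Stand-in weights; SCD's real modules are a causal transformer encoder
# and a lightweight diffusion decoder.
W_enc = rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)
W_dec = rng.normal(size=(2 * DIM, DIM)) / np.sqrt(2 * DIM)

def encode(past_frames):
    """Temporal reasoning: summarize past frames into a compact context."""
    h = np.mean(past_frames, axis=0) if past_frames else np.zeros(DIM)
    return np.tanh(h @ W_enc)

def decode_step(x_t, ctx):
    """One denoising step, conditioned only on the context vector
    (no access to past frames)."""
    return x_t - 0.1 * (np.concatenate([x_t, ctx]) @ W_dec)

enc_calls = dec_calls = 0
frames = []
for _ in range(FRAMES):
    ctx = encode(frames)              # once per frame
    enc_calls += 1
    x = rng.normal(size=DIM)          # start from noise
    for _ in range(STEPS):            # multi-step frame-wise rendering
        x = decode_step(x, ctx)
        dec_calls += 1
    frames.append(x)
```

Because the encoder contributes once per frame rather than once per step, its cost is amortized over all denoising steps, which is why it can be scaled up without increasing per-step latency.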
Our model generates high-quality 480p videos with an initial latency of ~0.29 seconds, after which frames are generated in a streaming fashion at ~11.1 FPS on a single H100 GPU. Below, we show text-to-video samples generated by our fine-tuned 1.3B Separable Causal Diffusion model with a 25-layer causal encoder and a 10-layer diffusion decoder. Click any video to view it at full size with its text prompt.
@article{bai2026causality,
  title={Causality in Video Diffusers is Separable from Denoising},
  author={Bai, Xingjian and He, Guande and Li, Zhengqi and Shechtman, Eli and Huang, Xun and Wu, Zongze},
  journal={arXiv preprint},
  year={2026}
}