VJAM: the Video–Jacobian–Action Model
Anonymous Authors · under double-blind review
A 14B video generative model, paired with a faithful inverse dynamics model, solves a wide range of robotics challenges — from zero-shot pick-and-place on a real-world Panda arm to contact-rich re-orientation of a cube with a 16-DoF multi-fingered hand.
Video generative models have emerged as a promising robotics backbone, capable of generating videos that depict the completion of complex tasks across embodiments and environments. Many current approaches finetune video models with action-labeled data, turning them into robot foundation models that jointly predict future observations and actions.
In this paper, we study an alternative, underexplored route for transferring the capabilities of video models into robot control: leave the video planner as is, while training an embodiment-specific inverse dynamics model (IDM). This decoupling offers several advantages: the video planner can remain embodiment-agnostic; different video models can be interchanged easily without re-training the IDMs; and the inverse dynamics model can be trained with the more readily available autonomous self-play data.
With this in mind, we present a closed-loop, video-to-action policy that combines an action-free video world model with a carefully designed IDM based on the robot embodiment Jacobian. We find that such a structure yields a faithful video-to-action translator that is both data-efficient and scalable to high-dimensional action spaces. Our policy, which we call the Video–Jacobian–Action Model (VJAM), achieves strong performance across simulated and real-world benchmarks, including zero-shot Panda arm manipulation and 16-DoF Allegro-hand dexterous cube re-orientation. The same video planner can be used across multiple embodiments by pairing it with different embodiment-specific IDMs.
Our results show that decoupled video planning plus faithful video-to-action translation is a viable alternative route towards zero-shot, cross-embodiment, and generalizable robot control.
From a short observation history, a video generative model imagines a visual plan; the Jacobian-IDM translates each step of that plan into actions, the robot executes them, and the loop repeats.
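The loop described above can be sketched in a few lines. This is only an illustrative toy, not the authors' implementation: `closed_loop`, `PointRobot`, `toy_planner`, `toy_idm`, and the `GOALS` table are hypothetical names, a "frame" is reduced to a 2-D position instead of an image, and the choice to feed the IDM consecutive planned frames is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Tuple[float, float]   # toy stand-in for an image observation
Action = Tuple[float, float]  # toy stand-in for a robot command


def closed_loop(observe: Callable[[], Frame],
                plan_video: Callable[[Sequence[Frame], str], List[Frame]],
                idm: Callable[[Frame, Frame], Action],
                execute: Callable[[Action], None],
                prompt: str,
                history_len: int = 2,
                replans: int = 5) -> List[Action]:
    """Plan from recent observations, translate the plan, execute, repeat."""
    history = [observe() for _ in range(history_len)]
    actions: List[Action] = []
    for _ in range(replans):
        # 1) The planner imagines future frames from the short history.
        plan = plan_video(history[-history_len:], prompt)
        prev = history[-1]
        for frame in plan:
            # 2) The IDM turns each planned transition into an action.
            action = idm(prev, frame)
            # 3) The robot executes the action.
            execute(action)
            actions.append(action)
            prev = frame
        # 4) Observe the real outcome and replan from it.
        history.append(observe())
    return actions


class PointRobot:
    """Hypothetical 2-D point robot whose observation is its position."""

    def __init__(self) -> None:
        self.pos: Frame = (0.0, 0.0)

    def observe(self) -> Frame:
        return self.pos

    def execute(self, action: Action) -> None:
        self.pos = (self.pos[0] + action[0], self.pos[1] + action[1])


# Hypothetical prompt-to-goal table standing in for language conditioning.
GOALS = {"left cup": (-1.0, 0.0), "right cup": (1.0, 0.0)}


def toy_planner(history: Sequence[Frame], prompt: str,
                horizon: int = 4) -> List[Frame]:
    """Imagine frames that cover half the remaining distance to the goal."""
    (x, y), (gx, gy) = history[-1], GOALS[prompt]
    return [(x + 0.5 * (gx - x) * k / horizon,
             y + 0.5 * (gy - y) * k / horizon)
            for k in range(1, horizon + 1)]


def toy_idm(prev: Frame, nxt: Frame) -> Action:
    """Toy inverse dynamics: the action is the displacement between frames."""
    return (nxt[0] - prev[0], nxt[1] - prev[1])


robot = PointRobot()
actions = closed_loop(robot.observe, toy_planner, toy_idm,
                      robot.execute, "right cup")
print(len(actions), robot.pos)
```

Because the planner and the IDM are passed in as separate callables, swapping the prompt, the planner, or the embodiment-specific IDM leaves the loop itself unchanged, which is the decoupling the text argues for.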
A single action-free video model, post-trained on a mixture of DROID and Allegro robot videos, generates plans for both a 7-DoF Panda arm and a 16-DoF Allegro hand; we pair it with a separate embodiment-specific Jacobian-IDM for each robot. Both Panda manipulation and contact-rich in-hand reorientation are achieved zero-shot.
With identical initial observations, the prompt alone steers the planner — and therefore the robot — to different objects on the table. The video model's prompt-following capability flows directly through the faithful translator into the executed action chunk.
A controlled study isolates what makes faithful translation possible: gains come from constraining actions to enter through a learned tangent map.
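One way to read "actions enter through a tangent map" is that a small action a changes the observation by roughly J(o)·a, so the action can be recovered by inverting J. The sketch below illustrates that reading with damped least-squares inversion of a Jacobian. Everything in it is an assumption made for illustration: the analytic Jacobian of a planar two-link arm stands in for the learned map, the "observation" is the end-effector position rather than video, and `fk`, `jacobian`, and `jacobian_idm` are hypothetical names. The text does not confirm that the Jacobian-IDM uses this solver.

```python
import math


def fk(q, l1=1.0, l2=1.0):
    """End-effector position of a planar 2-link arm (toy 'observation')."""
    return (l1 * math.cos(q[0]) + l2 * math.cos(q[0] + q[1]),
            l1 * math.sin(q[0]) + l2 * math.sin(q[0] + q[1]))


def jacobian(q, l1=1.0, l2=1.0):
    """Analytic 2x2 Jacobian d(position)/d(joints); stands in for a learned map."""
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return ((-l1 * s1 - l2 * s12, -l2 * s12),
            (l1 * c1 + l2 * c12, l2 * c12))


def jacobian_idm(q, delta, damping=1e-9):
    """Recover the joint step a from an observed change delta ~= J(q) a.

    Solves the damped normal equations (J^T J + damping * I) a = J^T delta
    in closed form for the 2x2 case.
    """
    (j00, j01), (j10, j11) = jacobian(q)
    a11 = j00 * j00 + j10 * j10 + damping
    a12 = j00 * j01 + j10 * j11
    a22 = j01 * j01 + j11 * j11 + damping
    b1 = j00 * delta[0] + j10 * delta[1]
    b2 = j01 * delta[0] + j11 * delta[1]
    det = a11 * a22 - a12 * a12
    return ((a22 * b1 - a12 * b2) / det,
            (a11 * b2 - a12 * b1) / det)


# Usage: apply a small known joint step, then recover it from the observed change.
q = (0.3, 0.8)
true_step = (0.001, -0.002)
before = fk(q)
after = fk((q[0] + true_step[0], q[1] + true_step[1]))
recovered = jacobian_idm(q, (after[0] - before[0], after[1] - before[1]))
print(recovered)
```

The recovered step matches the true one only up to linearization error, so this kind of inversion is most accurate for small per-step changes; the damping term keeps the solve well-behaved near singular configurations.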
We cluttered the table with props and hid the target button behind a wall so that it was visible from only one of three camera views. To solve the task, the policy must integrate the language prompt with multi-view evidence and search around the occluder. DreamZero and $\pi_{0.5}$ struggle on this task; our model produces a coherent visual plan that finds the hidden button and a faithful translation that presses it. We attribute this to leaving the video branch intact, with no action head finetuned into it.
The same six-second VJAM rollout, decomposed into what the planner imagined, what the Jacobian-IDM read from it, and what the robot did.
The path from a better video model to a better robot runs through the inverse dynamics model. We have shown that our Jacobian-IDMs, one realization among many possible faithful IDMs, enable zero-shot, multi-embodiment robot control when paired with an action-free video planner. We hope our results motivate careful attention to the design and evaluation of future IDMs.
@inproceedings{vjam2026anonymous,
title = {Turning Video Models into Generalist Robot Policies},
author = {Anonymous},
booktitle = {Submitted to the Tenth Conference on Robot Learning (CoRL)},
year = {2026},
note = {Under double-blind review.}
}