VJAM — Turning Video Models into Generalist Robot Policies

Overview

A single video planner generalizes across embodiments

A 14B video generative model, paired with a faithful inverse dynamics model, solves a wide range of robotics challenges — from zero-shot pick-and-place on a real-world Panda arm to contact-rich re-orientation of a cube with a 16-DoF multi-fingered hand.

Controlling robots across embodiments, skills, and environments

Abstract

Video generative models have emerged as a promising robotics backbone, capable of generating videos that depict the completion of complex tasks across embodiments and environments. Many current approaches finetune video models with action-labeled data, turning them into robot foundation models that jointly predict future observations and actions.

In this paper, we study an alternative, underexplored route for transferring the capabilities of video models into robot control: leave the video planner as is, while training an embodiment-specific inverse dynamics model (IDM). This decoupling offers several advantages: the video planner can remain embodiment-agnostic; different video models can be interchanged easily without re-training the IDMs; and the inverse dynamics model can be trained with the more readily available autonomous self-play data.

With this in mind, we present a closed-loop, video-to-action policy that combines an action-free video world model with a carefully designed IDM based on the robot embodiment Jacobian. We find that such a structure yields a faithful video-to-action translator that is both data-efficient and scalable to high-dimensional action spaces. Our policy, which we call the Video–Jacobian–Action Model (VJAM), achieves strong performance across simulated and real-world benchmarks, including zero-shot Panda arm manipulation and 16-DoF Allegro-hand dexterous cube re-orientation. The same video planner can be used across multiple embodiments by pairing it with different embodiment-specific IDMs.

Our results show that decoupled video planning plus faithful video-to-action translation is a viable alternative route towards zero-shot, cross-embodiment, and generalizable robot control.

Method

A planner, a translator, a closed loop

From a short observation history, a video generative model imagines a visual plan; the Jacobian-IDM translates each step of that plan into actions, the robot executes them, and the loop repeats.

Step 1 · Observation history

The last few RGB frames from the robot's cameras, plus a language goal. They seed the next plan.
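The plan, translate, execute, repeat loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_closed_loop`, `plan_video`, `frames_to_actions`, `step_robot`, `history_len`, and `num_replans` are hypothetical names, and frames and actions are stood in for by plain lists of floats so the example runs without a robot or a video model.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Frame = List[float]   # stand-in for an RGB observation (assumption)
Action = List[float]  # stand-in for a robot command (assumption)


def run_closed_loop(
    plan_video: Callable[[Sequence[Frame], str], List[Frame]],
    frames_to_actions: Callable[[Frame, Frame], Action],
    step_robot: Callable[[Action], Frame],
    first_frame: Frame,
    prompt: str,
    history_len: int = 4,
    num_replans: int = 3,
) -> List[Action]:
    """Plan, translate, execute, then replan from fresh observations."""
    history: Deque[Frame] = deque([first_frame], maxlen=history_len)
    executed: List[Action] = []
    for _ in range(num_replans):
        # 1. The planner imagines future frames from recent frames and the prompt.
        plan = plan_video(list(history), prompt)
        # 2. The IDM translates each consecutive frame pair into one action.
        prev, chunk = history[-1], []
        for target in plan:
            chunk.append(frames_to_actions(prev, target))
            prev = target
        # 3. The robot executes the chunk; real observations refill the history.
        for action in chunk:
            history.append(step_robot(action))
            executed.append(action)
    return executed


# Toy 1-D world: a "frame" is just the current position.
state = [0.0]


def toy_planner(history: Sequence[Frame], prompt: str) -> List[Frame]:
    x = history[-1][0]
    return [[x + 0.5], [x + 1.0]]  # imagine two steps to the right


def toy_idm(prev: Frame, target: Frame) -> Action:
    return [target[0] - prev[0]]  # action = displacement between frames


def toy_robot(action: Action) -> Frame:
    state[0] += action[0]
    return [state[0]]


actions = run_closed_loop(
    toy_planner, toy_idm, toy_robot, [0.0], "move right", num_replans=2
)
print(len(actions), state[0])
```

Because the history is refilled with real observations rather than imagined frames, each new plan starts from where the robot actually is, which is what makes the loop closed.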

Main results · multi-embodiment

One video planner, two very different bodies

A single action-free video model, post-trained on a mixture of DROID and Allegro robot videos, generates plans for both a 7-DoF Panda arm and a 16-DoF Allegro hand. We pair it with a separate embodiment-specific Jacobian-IDM for each body. Both Panda manipulation and contact-rich in-hand reorientation are produced zero-shot.

Franka Panda

7-DoF arm · zero-shot prompts

Prompt: “Pick up the yellow cube and place it on the yellow brick.”

Wrist view

External view

Press-Hidden-Button: an occluded target is visible from only one camera view, so the policy must integrate the prompt with multi-view evidence.

Wrist view

External view

Allegro Hand

16-DoF dexterous reorientation

Real robot in-hand cube reorientation.

Jacobian field: visualization of the visuomotor Jacobian during reorientation.

Simulator counter-clockwise reorientation.

Zero-shot evaluation on Panda

Task setup, the Jacobian we learn, and dream–execution alignment

a. After training on DROID, we deploy zero-shot on a Panda arm in an unseen scene with ad-hoc camera placements. In Press-Hidden-Button, the robot must press the blue button specified by the language command; the button is visible from only one of the three views, and an orange distractor probes instruction grounding.

b. Visualization of the visuomotor Jacobian predicted by our model across a set of action channels. Columns correspond to action channels; rows correspond to viewpoints.

c. Generated frames and executed rollouts shown side by side, together with the text prompt used for zero-shot evaluation. The robot's actions remain closely aligned with the generated frames — the IDM is doing its job.
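As a reading aid for panel b, a visuomotor Jacobian can be written as a first-order map from actions to image motion. The notation below ($p$, $\Delta x$, $J$, $a$, $A$) is ours and is an assumption about the general form, not the paper's exact formulation.

```latex
\Delta x(p) \;\approx\; J(p)\, a,
\qquad J(p) \in \mathbb{R}^{2 \times A},
\quad a \in \mathbb{R}^{A}
```

Here $p$ is a pixel, $\Delta x(p)$ is its 2D image motion, and $a$ is a small action change with $A$ channels. Under this reading, column $k$ of $J(p)$ is the motion at $p$ induced by a unit change in action channel $k$, which matches the figure's layout of one column per action channel.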

Zero-shot evaluation on Allegro Hand

In-hand object reorientation

a. We train on cube-reorientation demonstrations with three language instructions — clockwise, counter-clockwise, and random — and evaluate prompt-conditioned in-hand reorientation.

b. Visualization of the visuomotor Jacobian predicted by our model across a set of action channels. Columns correspond to action channels; rows correspond to viewpoints.

c. Generated frames and executed rollouts shown side by side, together with the text prompt. The model carefully orchestrates dexterous finger motions to follow the visual plan.

Instruction following on basic tasks

Same scene, three different goals

With identical initial observations, the prompt alone steers the planner — and therefore the robot — to different objects on the table. The video model's prompt-following capability flows directly through the faithful translator into the executed action chunk.

“Approach the cup.”

“Approach the lego block.”

“Approach the tennis ball.”

Why this works · ablation

J-IDM scales with action dimensionality

A controlled study isolates what makes faithful translation possible: gains come from constraining actions to enter through a learned tangent map.

J-IDM scales more favorably with action dimensionality

a. Rollouts of recovered actions for direct-IDM baselines and J-IDM as DoFs grow.

b. At a fixed data budget, the J-IDM gap over direct-IDMs widens with DoF.

c. At high DoF, J-IDM has a better data–accuracy trade-off across training-set sizes.
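To make "actions enter through a learned tangent map" concrete, here is a minimal NumPy sketch. It assumes the Jacobian is given as an array of shape `(H, W, 2, A)` and that the action is recovered by ordinary least squares against the pixel motion between two plan frames. The page does not specify the exact inversion procedure, so both the tensor layout and the solver are assumptions, and `recover_action` is a hypothetical name.

```python
import numpy as np


def recover_action(jacobian: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Least-squares action from a per-pixel Jacobian and observed motion.

    jacobian: (H, W, 2, A) array mapping an A-dim action to 2D motion per pixel.
    flow:     (H, W, 2) pixel motion between two consecutive plan frames.
    """
    num_actions = jacobian.shape[-1]
    stacked_j = jacobian.reshape(-1, num_actions)  # (H*W*2, A)
    stacked_f = flow.reshape(-1)                   # (H*W*2,)
    # Solve min_a || stacked_j @ a - stacked_f ||^2 over all pixels at once.
    action, *_ = np.linalg.lstsq(stacked_j, stacked_f, rcond=None)
    return action


# Synthetic check with a 16-channel action: generate motion from a known
# action, then recover the action from the motion alone.
rng = np.random.default_rng(0)
J = rng.normal(size=(8, 8, 2, 16))
a_true = rng.normal(size=16)
flow = J @ a_true                 # (8, 8, 2): motion implied by a_true
a_hat = recover_action(J, flow)
print(np.allclose(a_hat, a_true))
```

In this formulation every pixel contributes two linear equations in the same `A` unknowns, so adding action channels adds columns to one linear system rather than new outputs to regress directly. That is one plausible intuition for the scaling trend in the panels above, offered here as an interpretation rather than a result from the study.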

VJAM vs robot foundation models

Reasoning through occlusions

We cluttered the table with props and hid the target button behind a wall so that it was visible from only one of three camera views. To solve the task, the policy must integrate the language prompt with multi-view evidence and search around the occluder. DreamZero and $\pi_{0.5}$ struggle on this task; our model produces a coherent visual plan that finds the hidden button and a faithful translation that presses it. We attribute this to leaving the video planner untouched: because no action head is finetuned into it, its planning ability is preserved.

VJAM (ours)

Prompt: “push the blue button.” Dream rollout (top) and matching execution (bottom): the planner imagines a path around the occluder, and the J-IDM translates it into an action chunk that presses the target.

DreamZero (baseline)

Same prompt, same scene. The baseline approaches the wall and never reaches the hidden target.

The same six-second VJAM rollout, decomposed into what the planner imagined, what the Jacobian-IDM read from it, and what the robot did.

Dream: generated video lookahead.

Jacobian: per-pixel action-to-motion map.

Execution: action chunk executed on the robot.

Conclusion

Through the inverse dynamics model

The path from a better video model to a better robot runs through the inverse dynamics model. We have shown that Jacobian-IDMs, one realization among many possible faithful IDMs, enable zero-shot, multi-embodiment robot control when paired with an action-free video planner. We hope our results motivate careful attention to the design and evaluation of future IDMs.

Cite

BibTeX

@inproceedings{vjam2026anonymous,
  title     = {Turning Video Models into Generalist Robot Policies},
  author    = {Anonymous},
  booktitle = {Submitted to the Tenth Conference on Robot Learning (CoRL)},
  year      = {2026},
  note      = {Under double-blind review.}
}