π0: A Vision-Language-Action Flow Model

General Robot Control via Flow Matching
Physical Intelligence · 2024 · VLA · Flow Matching · Robot Foundation Model

1 — What the Robot Actually Sees

Before understanding the model, you need to understand the inputs. π0 controls real physical robots — it sees through cameras, reads language instructions, and feels its own joint positions. Let's make each input concrete.

[Figure: what π0 receives at every timestep. Two to three camera images (overhead and wrist, 224×224 RGB), a language instruction ("Pick up the yellow ball and place it in the bowl") tokenized into a text sequence, and an 18-dim proprioceptive state qt (J1: +0.32 rad, J2: -1.47 rad, J3: +0.85 rad, ...). These are combined as ot = [I1, I2, ..., In, lt, qt] and fed to the π0 model.]

Camera Images (2–3 per robot)

[n, 224, 224, 3] RGB

Each robot has 2–3 cameras providing different viewpoints: typically an overhead/third-person view showing the workspace and a wrist-mounted camera for close-up manipulation detail. Some bimanual robots add a third camera. Images are 224×224 RGB, processed by a pre-trained SigLIP vision encoder.

Language Instruction

tokenized text → embeddings

Natural language commands like "fold the shirt," "pick up the egg and place it in the carton," or "clear the table." These can be high-level task descriptions or fine-grained segment labels (~2-second sub-instructions). Tokenized by the PaliGemma tokenizer and processed through the VLM backbone.

Proprioceptive State qt

[18] joint angles (zero-padded)

The robot's own joint angles — its sense of where its arms are. Standardized to 18 dimensions across all 7 robot embodiments via zero-padding. A 7-DoF single arm uses 8 values (7 joints + gripper); a bimanual 14-DoF system uses 16 values; remaining dimensions are zeros.
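As a concrete sketch of that standardization (assuming a plain numpy representation; `pad_state` is an illustrative helper, not the paper's code):

```python
import numpy as np

MAX_STATE_DIM = 18  # unified state/action dimension across all embodiments

def pad_state(q: np.ndarray) -> np.ndarray:
    """Zero-pad a robot's joint state into the shared 18-dim space."""
    padded = np.zeros(MAX_STATE_DIM, dtype=np.float32)
    padded[: q.shape[0]] = q
    return padded

# A 7-DoF single arm: 7 joint angles + 1 gripper value = 8 active dims.
q_single = np.array([0.32, -1.47, 0.85, 0.10, -0.55, 1.20, 0.00, 0.73])
print(pad_state(q_single))  # 8 real values followed by 10 zeros
```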

2 — What the Model Outputs

π0 doesn't output a single action — it outputs an entire action chunk of H=50 future steps. This is critical for smooth, coordinated motion.

[Figure: action chunk output, 50 steps into the future. The model predicts at, at+1, ..., at+49; the first 16 are executed (at 20 Hz, 0.8 s) and the remaining 34 are discarded before re-planning with a new observation. Each action at+i contains Δjoint1, Δjoint2, ..., Δjoint7 plus a gripper command, all continuous values produced via flow matching.]
Open-loop chunking: The robot executes 16 of the 50 predicted actions (0.8 seconds at 20Hz), then re-plans from scratch with a new observation. This provides reactive closed-loop behavior at the re-planning frequency while maintaining smooth, coordinated motion within each chunk. Mobile robots at 50Hz execute 25 steps (0.5s) before re-planning.
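The control loop this implies is short. A sketch, assuming hypothetical `policy` and `robot` interfaces (not part of any released API):

```python
# Receding-horizon execution: predict 50 actions, execute 16, re-plan.
CHUNK_LEN = 50      # actions predicted per inference call
EXEC_STEPS = 16     # actions actually executed (0.8 s at 20 Hz);
                    # mobile robots at 50 Hz would use 25 steps instead

def control_loop(policy, robot):
    while not robot.task_done():
        obs = robot.get_observation()        # images + language + q_t
        chunk = policy.predict_chunk(obs)    # shape [50, action_dim]
        for action in chunk[:EXEC_STEPS]:    # execute the first 16 steps
            robot.apply(action)              # 20 Hz control
        # remaining 34 actions discarded; loop re-plans from a fresh observation
```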

3 — Architecture Overview

π0 combines a pre-trained Vision-Language Model (PaliGemma, 3B params) with a dedicated Action Expert module (300M params). The VLM understands what it sees and reads; the Action Expert generates smooth motor commands via flow matching.

[Figure: π0 full architecture. Images I1..n, language lt, state qt, and noisy actions Atτ (50 steps × action_dim) feed a single transformer with blockwise causal attention. Images and language pass through the frozen SigLIP encoder into the 3B-param PaliGemma VLM backbone (bidirectional attention); state and the 50 action tokens, together with the flow-timestep embedding φ(τ), pass through the 300M-param Action Expert (width 1024, full attention to all inputs). The expert outputs the velocity field vθ(Atτ, ot), which Euler integration over 10 steps (Aτ+δ = Aτ + δ·vθ, δ=0.1) turns into clean 50-step actions At.]

4 — The VLM Backbone (PaliGemma)

π0 builds on PaliGemma, a 3B-parameter vision-language model from Google. This gives π0 a rich understanding of what it sees and reads — pre-trained on billions of image-text pairs from the internet before ever touching a robot.

SigLIP Vision Encoder (per camera)

[B, 3, 224, 224] → [B, 256, 1152]

Each 224×224 image is split into a 16×16 grid of 14×14-pixel patches (256 tokens) and encoded by the frozen SigLIP-So400m ViT. For n cameras the outputs are concatenated into a [B, n·256, 1152] token sequence (typically 512–768 tokens total). A learned projection maps these to the 2048-dim width of the VLM backbone: [B, n·256, 2048].
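The patch arithmetic is easy to sanity-check (a trivial snippet, nothing π0-specific):

```python
# Token-count arithmetic for SigLIP patching: 224 / 14 = 16 patches per side.
image_size, patch_size = 224, 14
tokens_per_image = (image_size // patch_size) ** 2   # 16 × 16 = 256
for n_cams in (2, 3):
    print(n_cams, n_cams * tokens_per_image)          # 512 and 768 total tokens
```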

Gemma Language Model

[B, L, 2048] → [B, L, 2048]

PaliGemma's 2B Gemma backbone (width = 2048, MLP dim = 8192, 18 layers) jointly processes the projected image tokens and the text tokens. Sequence length L is typically ~600–800 (n·256 image tokens + language tokens). Attention is bidirectional within this block (see §5).

Why start from a VLM? Robot demonstration data is scarce (~10,000 hours). Internet data is effectively infinite. By initializing from PaliGemma, π0 already knows what a "shirt" looks like, understands "left of," and can parse complex language instructions — all before seeing a single robot trajectory. This is the same pre-training/fine-tuning paradigm that made LLMs successful.

5 — The Action Expert

The Action Expert is π0's secret weapon — a separate 300M-parameter module that lives inside the transformer and is dedicated exclusively to generating smooth, continuous robot actions.

Why a Separate Expert?

300M params, width = 1024, 50 action tokens

Previous VLA models like RT-2 and OpenVLA discretized actions into text tokens (e.g., "move left 0.3"). This is lossy and slow — you're forcing a language model to output numbers as words. π0's Action Expert operates in continuous space via flow matching, producing precise floating-point joint commands at much higher bandwidth. Its tokens attend to the VLM's image and language tokens through the shared attention operation, but its own residual stream and MLPs run at a narrower 1024-dim width, giving it dedicated capacity for motor control without bloating the whole model.

Token Shapes Inside the Transformer

image + language + state + action tokens

The fused token stream at each transformer layer has roughly this structure:

  • Image tokens: [B, n·256, 2048] — from SigLIP, ~512–768 total
  • Language tokens: [B, L_lang, 2048] — up to ~100 tokens
  • State token: [B, 1, 1024] — one token encoding qt (18 dims projected up)
  • Action tokens: [B, 50, 1024] — one per future step in the chunk, plus sinusoidal flow-time embedding φ(τ)

State and action tokens live in the 1024-dim Action Expert subspace; image/language tokens live in the 2048-dim VLM subspace. The blockwise causal mask (below) controls which tokens see which.

Blockwise Causal Attention

The transformer uses three attention blocks with a specific causal structure:

[Figure: blockwise causal attention mask over three blocks. Block 1 (images+language) attends bidirectionally within itself but is blocked from state and actions. Block 2 (state) attends to block 1 and itself, but not to actions. Block 3 (actions) attends to everything: full attention.]
Why blockwise causal? The VLM tokens (images + language) never attend to action tokens — this lets us cache the VLM computation across flow matching steps. The action tokens attend to everything (they need full context). The state block bridges the two: it can see the VLM outputs but not the actions, maintaining a clean information flow.
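A minimal sketch of that mask as a boolean matrix (True means the query row may attend to the key column), using the approximate token counts from above:

```python
import numpy as np

def blockwise_causal_mask(n_prefix: int, n_state: int, n_action: int) -> np.ndarray:
    """Build the three-block attention mask.

    Block order along both axes: [images+language | state | actions].
    """
    n = n_prefix + n_state + n_action
    mask = np.zeros((n, n), dtype=bool)
    p, s = n_prefix, n_prefix + n_state
    mask[:p, :p] = True   # images+language: bidirectional, blind to the rest
    mask[p:s, :s] = True  # state: sees VLM tokens and itself, not actions
    mask[s:, :] = True    # actions: full attention to all inputs
    return mask

m = blockwise_causal_mask(n_prefix=768, n_state=1, n_action=50)
print(m.shape)  # (819, 819)
```

Because the top-left block never attends rightward, the VLM's keys and values can be computed once per observation and reused across all 10 flow matching steps.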

Action Expert Dimensions

Component        | VLM Backbone                 | Action Expert
Parameters       | 3B                           | 300M
Hidden width     | 2048                         | 1024
MLP dim          | 8192                         | 4096
Tokens handled   | image + language (~600–800)  | 1 state + 50 action (51)
Token shape      | [B, L_vlm, 2048]             | [B, 51, 1024]
Processes        | images + language            | state + noisy actions
Initialized from | PaliGemma (internet)         | random

6 — Flow Matching: From Noise to Actions

The core generative mechanism in π0 is flow matching — a modern alternative to diffusion that learns to transform Gaussian noise into coherent action sequences via a continuous velocity field.

[Figure: flow matching, noise to actions in 10 steps. Starting from pure noise A0 ~ N(0, I) at τ = 0 (50 × action_dim), vθ predicts a velocity at each τ (0.1, 0.2, 0.3, ..., 0.9) until clean actions (50 joint commands) emerge at τ = 1.0. At each step: Aτ+0.1 = Aτ + 0.1 · vθ(Aτ, observation), i.e. Euler integration. The model runs 10 forward passes through the Action Expert, each refining the action sequence.]

Training Loss: Conditional Flow Matching

tensors: At, Atτ, v, u all [B, 50, 18]

L(θ) = E[ || vθ(Atτ, ot) − u(Atτ | At) ||² ]

During training, we know the ground-truth clean actions At ∈ [B, 50, 18] from demonstrations. We sample flow time τ ∼ U(0, 1) and Gaussian noise ε ∈ [B, 50, 18], then form the noisy input Atτ = τ At + (1−τ) ε. The model predicts the velocity field vθ ∈ [B, 50, 18], and the MSE loss matches it against the target velocity u = At − ε, the direction that carries the noisy sample toward the clean actions under this τ convention. Both tensors have the same shape, so the loss is a plain elementwise squared error.
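As a concrete illustration, a numpy sketch of one training step under the conventions above; `v_theta` is a stand-in for the Action Expert (the real model also conditions on the full observation ot):

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, D = 32, 50, 18                       # batch, chunk length, padded action dim

def v_theta(A_tau, tau):
    """Stand-in for the Action Expert's velocity prediction."""
    return np.zeros_like(A_tau)            # a real model would condition on o_t

A   = rng.normal(size=(B, H, D))           # ground-truth action chunks A_t
eps = rng.normal(size=(B, H, D))           # Gaussian noise ε
tau = rng.uniform(size=(B, 1, 1))          # flow time τ, broadcast over (H, D)

A_tau = tau * A + (1.0 - tau) * eps        # noisy actions A_t^τ
u     = A - eps                            # target velocity: noise toward data
loss  = np.mean((v_theta(A_tau, tau) - u) ** 2)  # conditional flow matching MSE
print(loss)
```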

Flow matching vs diffusion: Both transform noise into structured outputs. Diffusion uses a discrete Markov chain with many steps (50–1000). Flow matching defines a continuous ODE that can be integrated with fewer steps (just 10 in π0). Inference is 10 forward passes through the Action Expert — total inference time is ~73ms on an RTX 4090.
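Inference is then a 10-step Euler loop. A minimal sketch, assuming a `v_theta(A, obs, tau)` callable that wraps the Action Expert:

```python
import numpy as np

def sample_actions(v_theta, obs, H=50, D=18, steps=10, seed=0):
    """Integrate the learned ODE from noise (τ=0) to clean actions (τ=1)."""
    A = np.random.default_rng(seed).normal(size=(H, D))  # A^0 ~ N(0, I)
    delta = 1.0 / steps                                  # δ = 0.1
    for i in range(steps):
        tau = i * delta
        A = A + delta * v_theta(A, obs, tau)             # A^{τ+δ} = A^τ + δ·v_θ
    return A                                             # 50-step action chunk

# Example with a dummy velocity field that pulls every value toward zero:
chunk = sample_actions(lambda A, obs, tau: -A, obs=None)
print(chunk.shape)  # (50, 18)
```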

Timestep Embedding

Each noisy action token is embedded with both its own value and the flow timestep τ:

Input embedding: W3 · swish(W2 · concat(W1 · a'tτ, φ(τ))) — where φ(τ) is a sinusoidal encoding of the flow timestep. This tells the Action Expert how noisy the current actions are, so it can calibrate the appropriate denoising velocity.
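A minimal sketch of that embedding, assuming a 256-dim sinusoidal encoding and randomly initialized projection matrices (the exact widths are internal details not spelled out here):

```python
import numpy as np

def phi(tau: float, dim: int = 256) -> np.ndarray:
    """Sinusoidal encoding of the flow timestep τ ∈ [0, 1]."""
    half = dim // 2
    freqs = np.exp(-np.log(10_000.0) * np.arange(half) / half)
    return np.concatenate([np.sin(tau * freqs), np.cos(tau * freqs)])

def swish(x):
    return x / (1.0 + np.exp(-x))          # x · sigmoid(x)

def embed_action(a_tau, tau, W1, W2, W3):
    """W3 · swish(W2 · concat(W1 · a_tau, φ(τ)))."""
    h = np.concatenate([W1 @ a_tau, phi(tau)])
    return W3 @ swish(W2 @ h)

rng = np.random.default_rng(0)
width = 1024                               # Action Expert width
W1 = rng.normal(size=(width, 18)) * 0.02   # projects one 18-dim noisy action
W2 = rng.normal(size=(width, width + 256)) * 0.02
W3 = rng.normal(size=(width, width)) * 0.02
tok = embed_action(rng.normal(size=18), tau=0.3, W1=W1, W2=W2, W3=W3)
print(tok.shape)  # (1024,) — one action token for the expert
```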

7 — Cross-Embodiment: 7 Robots, 1 Model

π0 trains a single model across 7 different robot configurations simultaneously — from single-arm table robots to mobile bimanual platforms.

[Figure: 7 robot embodiments, 1 unified model. UR5e (7-DoF, 2 cams), Franka (8-DoF, 2 cams), bimanual UR5e (14-DoF, 3 cams), bimanual ALOHA (14-DoF, 3 cams), and mobile bimanual platforms (16–17 DoF, 3 cams), all mapped into a unified 18-dim action/state space by zero-padding robots with fewer joints. Example, 7-DoF single arm: [j1, j2, j3, j4, j5, j6, j7, grip, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], active joints zero-padded to 18.]
Robot                    | DoF | Cameras | Action Dim | Tasks
UR5e (single)            | 7   | 2       | 8          | Table tasks
Franka (single)          | 8   | 2       | 8          | Dexterous manipulation
Bimanual UR5e            | 14  | 3       | 16         | Folding, assembly
Bimanual ALOHA (Trossen) | 14  | 3       | 14         | Dual-arm tasks
Bimanual ARX/AgileX      | 14  | 3       | 14         | Dual-arm tasks
Mobile Trossen/ARX       | 14  | 3       | 16         | Mobile manipulation
Mobile Fibocom           | 14  | 3       | 17         | Holonomic mobile base

8 — Pre-training & Post-training

π0 follows the LLM playbook: massive pre-training on diverse data, then task-specific fine-tuning (which they call "post-training").

Pre-training

903M timesteps, 700K gradient steps

Data: ~10,000 hours of robot demonstrations across all 7 embodiments. 106M timesteps from single-arm, 797M from dual-arm. Plus 9.1% open-source data (OXE, Bridge v2, DROID).

Task weighting: Data from each task-robot combination is weighted by n^0.43, where n is that combination's sample count, to prevent over-representation of common tasks.
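A quick illustration of the effect, with made-up sample counts n per task-robot combination:

```python
import numpy as np

counts = np.array([200_000, 50_000, 8_000, 1_000])  # hypothetical n per combo
weights = counts ** 0.43
probs = weights / weights.sum()
print(probs.round(3))  # ≈ [0.525, 0.289, 0.132, 0.054]
# Naive proportional sampling would give ≈ [0.772, 0.193, 0.031, 0.004];
# the n^0.43 weighting flattens the distribution considerably.
```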

Language labels: Both high-level task names ("fold the shirt") and fine-grained segment annotations (~2-second sub-instructions like "grasp the right sleeve").

Post-training (Fine-tuning)

5–100+ hours per task

Specializes the pre-trained model to specific downstream tasks using curated, high-quality demonstrations. The key finding: pre-training provides recovery behaviors (what to do when things go wrong) that transfer even to novel tasks, while post-training data teaches efficient task-specific strategies.

9 — Results

Task                 | π0   | OpenVLA (7B) | Octo (93M)
Shirt folding        | ~95% | ~40%         | <10%
Table bussing (easy) | ~90% | ~50%         | ~20%
Table bussing (hard) | ~70% | ~30%         | <10%
Grocery bagging      | ~85% | ~45%         | <10%
Toast extraction     | ~80% | ~35%         | <10%
Complex multi-stage tasks (10+ minutes): π0 achieves >50% success on laundry folding from arbitrary crumpled configurations, box assembly with deformable cardboard, egg packing (delicate objects), and mobile manipulation — tasks that require dozens of sequential sub-skills coordinated over minutes.

Inference Speed

Stage                        | Time
Image encoding (SigLIP)      | 14 ms
Observation processing (VLM) | 32 ms
Flow matching (10 steps)     | 27 ms
Total on-board (RTX 4090)    | 73 ms

10 — References

Black, K., et al. (2024). π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164.

Brohan, A., et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv:2307.15818.

Kim, M.J., et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246.

Lipman, Y., et al. (2023). Flow Matching for Generative Modeling. ICLR 2023.

Chi, C., et al. (2024). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. RSS 2024.

Beyer, L., et al. (2024). PaliGemma: A versatile 3B VLM for transfer. arXiv:2407.07726.