Gemini Robotics: Bringing AI into the Physical World

Google DeepMind 2025
Tags: VLA · Foundation Model · Embodied AI · Dexterous Manipulation · Humanoid · Google DeepMind

1 — The Problem

Building robots that can operate as general-purpose assistants in unstructured human environments remains one of the defining challenges in AI. A robot folding laundry, assembling furniture, or preparing a meal must weave together three fundamentally different capabilities: understanding natural language instructions, perceiving the three-dimensional physical world, and generating precise, continuous motor actions in real time.

Prior approaches have attacked these capabilities in isolation. Large language models and vision-language models (VLMs) excel at semantic understanding and can reason about what a robot should do, but they produce discrete text tokens — not the smooth, high-frequency torque commands a physical arm demands. Conversely, task-specific control policies trained via reinforcement learning or imitation learning can achieve remarkable dexterity on narrow benchmarks, but they struggle to generalize across tasks, objects, or embodiments. The field has lacked a unified model that spans the full perception-reasoning-action loop.

Core tension: Language models understand what to do but cannot produce continuous actions. Control policies know how to move but lack broad reasoning. Gemini Robotics attempts to bridge this gap by extending a frontier VLM with native action generation.

Why Is This Hard?

Robotic manipulation introduces constraints absent from pure perception tasks. Actions must be generated at 50 Hz to maintain smooth contact dynamics. The action space is continuous and high-dimensional — a bimanual setup spans 14 degrees of freedom, all of which must be commanded at every timestep. Long-horizon tasks like origami folding require dozens of sequential sub-skills coordinated over minutes. And the real world is unforgiving: a single missed grasp or misaligned fold can cascade into task failure.

Furthermore, deploying a large model on a robot creates a systems challenge. Frontier VLMs require cloud-scale compute, but control loops demand sub-20ms latency on the robot itself. Any viable architecture must split computation between cloud and edge while maintaining coherent behavior.

2 — Model Family Overview

Gemini Robotics is not a single model but a family of three complementary systems, each targeting a different level of the robotics stack. All three build on the Gemini 2.0 foundation model, inheriting its vision-language capabilities while adding specialized modules for spatial reasoning, action generation, or navigation.

Design principle: Rather than training a monolithic end-to-end model, DeepMind decomposes the problem into specialized variants that share a common backbone. This allows each model to be optimized for its particular output modality while benefiting from shared pre-training.
| Model | Input | Output | Primary Use |
|---|---|---|---|
| Gemini Robotics-ER | Images + Language | Structured spatial understanding | Embodied reasoning, scene analysis |
| Gemini Robotics | Images + Language + Proprioception | Continuous 7-DoF actions | Dexterous manipulation |
| Gemini Robotics-Nav | Images + Language + Odometry | Navigation commands | Mobile robot navigation |

The three models can be composed: Robotics-ER provides high-level scene understanding and task planning, Robotics generates fine-grained manipulation actions, and Robotics-Nav handles locomotion for mobile platforms. This layered approach mirrors how humans decompose tasks — first understanding the scene, then planning actions, then executing movements.

[Figure: The Gemini Robotics family. A shared Gemini 2.0 backbone (web-scale vision + language pre-training) feeds three variants — Robotics-ER (in: images + language; out: 3D boxes, grasps, waypoints, pointing) for perception and planning, Robotics VLA (in: images + language + proprioception; out: 7-DoF actions at 50 Hz) for dexterous control, and Robotics-Nav (in: images + language + odometry; out: navigation commands) for mobile locomotion. Composable: ER plans → Robotics executes → Nav moves the base.]

3 — Gemini Robotics-ER (Embodied Reasoning)

1 Gemini Robotics-ER

Images + Language → Spatial Structures

Robotics-ER extends Gemini 2.0 with spatial reasoning capabilities designed for robotic perception. Given one or more camera images and a natural language query, it produces structured spatial outputs: 3D bounding boxes, 6-DoF grasp pose predictions, pointing and reaching targets, and trajectory sketches.

Unlike standard object detectors that output 2D bounding boxes, Robotics-ER reasons in three dimensions. It predicts oriented 3D bounding boxes with full position and rotation, enabling downstream planners to reason about object geometry, clearances, and stacking. Grasp predictions include both the gripper pose and an approach vector, directly usable by motion planners.

The model also supports trajectory planning: given a task description like "pour water from the pitcher into the glass," it outputs a sequence of waypoints that sketch the desired end-effector path. These waypoints serve as high-level plans that a lower-level controller (such as the full Gemini Robotics VLA) can refine into smooth, executable actions.

Spatial reasoning capabilities: 3D object detection, 6-DoF grasp prediction, reach/point target estimation, trajectory waypoint planning, and scene-level spatial relationship reasoning — all from natural language queries over RGB images.
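To make these outputs concrete, here is a minimal sketch of how the structured spatial predictions could be represented in code. All field names, types, and frames are illustrative assumptions — the actual Gemini Robotics-ER output format is not specified at this level of detail.

```python
# Hypothetical schema for Robotics-ER style structured outputs.
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class Box3D:
    center_xyz: Vec3        # object center in meters
    size_xyz: Vec3          # (width, height, depth) in meters
    rotation_rpy: Vec3      # orientation as roll/pitch/yaw in radians
    label: str              # e.g. "red cup"

@dataclass
class Grasp6DoF:
    position_xyz: Vec3      # gripper target position
    rotation_rpy: Vec3      # gripper target orientation
    approach_xyz: Vec3      # unit approach vector for the motion planner

@dataclass
class SpatialQueryResult:
    boxes: List[Box3D] = field(default_factory=list)        # oriented 3D boxes
    grasps: List[Grasp6DoF] = field(default_factory=list)   # candidate grasps
    waypoints: List[Vec3] = field(default_factory=list)     # end-effector path sketch
    points: List[Vec3] = field(default_factory=list)        # pointing/reach targets
```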

Why Embodied Reasoning Matters

Many robotic failures stem not from poor motor control but from incorrect perception. A robot that misjudges the height of a shelf, the orientation of a handle, or the deformability of a material will fail regardless of how good its control policy is. By building spatial reasoning directly into the foundation model, Robotics-ER provides a perception backbone that understands physical affordances — not just object categories.

4 — Gemini Robotics (Vision-Language-Action)

[Figure: Cloud-edge VLA architecture. Camera images [B, n, 3, H, W], instruction tokens [B, L_txt], and proprioception q_t ([B, 7] or [B, 14]) feed the cloud backbone (Gemini 2.0 / Robotics-ER joint vision-language transformer, ~0.5–1 Hz reasoning cadence), which fuses image and text tokens into a latent [B, L_vlm, d_model] and emits latent conditioning tokens. A small on-robot action decoder runs at 50 Hz and outputs an action chunk [B, H, 7]: Δxyz + Δrpy + gripper. Cloud = big and slow; edge = small and fast.]

2 Gemini Robotics VLA

Images + Language + Proprioception → 7-DoF Actions @ 50 Hz

The full Gemini Robotics model is a vision-language-action (VLA) system that closes the loop from perception to physical action. It takes as input camera images, a natural language task instruction, and proprioceptive state (joint positions, gripper state), and outputs continuous 7-DoF action commands: 3D translation, 3D rotation, and gripper open/close.

The model operates at 50 Hz, producing action chunks of multiple future timesteps in each forward pass. This action chunking strategy — predicting a short horizon of future actions rather than a single next action — smooths trajectories and reduces the effective latency of the cloud-edge communication pipeline.

On its evaluation suite of long-horizon dexterous manipulation tasks, Gemini Robotics achieves a 79% average success rate, substantially outperforming prior VLA models. Tasks include origami folding (requiring precise crease formation), box assembly (deformable cardboard manipulation), and multi-object rearrangement with specific spatial constraints.

Action chunking: Rather than predicting one action per forward pass, the model outputs a chunk of H future actions. The robot executes these while the next chunk is being computed in the cloud, effectively hiding network latency behind execution time.
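As a rough illustration of what executing one chunk involves, the sketch below assumes each action is a delta pose plus a gripper command applied to the current end-effector state; the paper's exact action parameterization may differ, and `send_to_robot` is a hypothetical callback.

```python
import numpy as np

def execute_chunk(ee_pose, chunk, send_to_robot, dt=1.0 / 50.0):
    """Integrate a [H, 7] chunk of delta actions at 50 Hz.

    ee_pose:       current end-effector pose [x, y, z, roll, pitch, yaw]
    chunk:         array of shape [H, 7] = (dx, dy, dz, droll, dpitch, dyaw, grip)
    send_to_robot: hypothetical callback that streams one pose + gripper command
    """
    pose = np.asarray(ee_pose, dtype=float)
    for action in np.asarray(chunk, dtype=float):
        pose = pose + action[:6]   # apply the delta (adding Euler angles is a
                                   # simplification; real systems compose rotations)
        send_to_robot(pose, float(action[6]))   # one command per 20 ms control tick
        # time.sleep(dt) on a real robot; omitted in this sketch
    return pose
```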

Bimanual Control

Gemini Robotics natively supports bimanual control for dual-arm setups. For a bimanual ALOHA configuration, the action space doubles to 14-DoF (7 per arm). The model learns coordinated bimanual strategies — one arm stabilizes an object while the other manipulates it, or both arms collaborate on symmetric tasks like folding cloth. This coordination emerges from the training data without explicit bimanual planning modules.

Instruction Following and Safety

Because the model inherits Gemini 2.0's language understanding, it can follow nuanced natural language instructions: "pick up the red cup but avoid the glass one," "stack the blocks from largest to smallest," or "fold the napkin into a triangle." The model also respects safety constraints expressed in language, such as "do not touch the knife" or "move slowly near the edge of the table."

5 — Gemini Robotics-Nav

3 Gemini Robotics-Nav

Images + Language + Odometry → Navigation Commands

Robotics-Nav is a navigation-specialized variant designed for mobile robots operating in indoor and outdoor environments. It takes camera images, a language goal (e.g., "go to the kitchen" or "find the red chair"), and odometry readings, and produces navigation commands: linear and angular velocities for differential-drive or omnidirectional platforms.
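As an illustration of what such commands mean downstream (this is standard mobile-robot kinematics, not part of Gemini Robotics itself), a differential-drive base converts the (linear, angular) velocity pair into wheel speeds; the wheel base and radius below are made-up parameters.

```python
def diff_drive_wheel_speeds(v, omega, wheel_base=0.4, wheel_radius=0.08):
    """Convert a nav command (v in m/s, omega in rad/s) to wheel angular speeds (rad/s)."""
    v_left = v - omega * wheel_base / 2.0    # left wheel linear speed
    v_right = v + omega * wheel_base / 2.0   # right wheel linear speed
    return v_left / wheel_radius, v_right / wheel_radius

# e.g. curve gently to the left: forward at 0.5 m/s while turning at 0.3 rad/s
print(diff_drive_wheel_speeds(0.5, 0.3))
```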

The model handles the full navigation stack: semantic goal interpretation, visual place recognition, obstacle avoidance, and path planning. It can navigate to semantically specified locations ("the room with the whiteboard"), follow natural language route instructions ("go down the hallway, turn left at the elevator"), and adapt to dynamic obstacles like people walking through the scene.

Robotics-Nav is designed to compose with the manipulation models. A mobile manipulator can use Robotics-Nav to navigate to a target location, then switch to Robotics-ER for scene understanding and Gemini Robotics for manipulation — enabling full pick-and-place workflows across large environments.

6 — Cloud-Edge Split Architecture

The defining architectural innovation of Gemini Robotics is its cloud-edge split design. Running a frontier-scale VLM (billions of parameters) on a robot's onboard compute is infeasible — the model would be too slow for real-time control. But running the entire pipeline in the cloud introduces network latency that breaks the tight control loops manipulation requires. The solution is to split the model across both.

[Figure: Cloud-edge split. Cloud: the Gemini 2.0 backbone (billions of parameters) performs semantic reasoning and extracts spatial features; an action token head compresses the action intent into a compact latent representation that is transmitted to the robot across the network boundary. On-robot: a lightweight transformer decoder combines the action tokens with proprioceptive state and decodes action chunks at 50 Hz into 7-DoF / 14-DoF commands (translation + rotation + gripper) for real-time execution.]

How the Split Works

1 Cloud backbone: Camera images and the language instruction are sent to the cloud, where the full Gemini 2.0 model processes them. The backbone produces rich semantic and spatial features that capture the scene layout, object identities, task semantics, and intended action plan.

2 Action token head: A lightweight head on top of the backbone compresses the action-relevant information into a compact set of action tokens — a latent representation that encodes what the robot should do next, without the full detail of the backbone's internal state. These tokens are small enough to transmit over a network with minimal latency.

3 On-robot decoder: A small transformer decoder running on the robot's local GPU receives the action tokens, combines them with real-time proprioceptive feedback (joint positions, forces), and decodes them into executable action chunks. This decoder is lightweight enough to run at 50 Hz on embedded hardware.

4 Action execution: The decoded action chunk is a sequence of future waypoints. The robot executes these while the cloud simultaneously computes the next chunk, creating a pipelined system that hides network round-trip latency.

Latency budget: The cloud forward pass takes ~100–200ms. The on-robot decoder runs in <5ms. By predicting action chunks of ~10 timesteps (200ms at 50 Hz), the system overlaps cloud computation with execution, achieving effective real-time control despite the network round trip.
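The sketch below illustrates this pipelining pattern with stand-in functions (`query_cloud_backbone`, `decode_on_robot`, and the robot interface are all hypothetical placeholders, since the real interfaces are not public): the next cloud pass is launched before the current chunk is executed, so as long as the round trip stays under the chunk duration the control loop never stalls.

```python
import time
from concurrent.futures import ThreadPoolExecutor

CHUNK_STEPS = 10            # H = 10 actions per chunk
CONTROL_DT = 1.0 / 50.0     # 50 Hz control -> one chunk covers 200 ms

# --- Stand-ins for components that are not public; purely illustrative. ---
def query_cloud_backbone(obs):
    time.sleep(0.15)                          # pretend cloud forward pass (~150 ms)
    return {"action_tokens": obs}             # compressed action intent

def decode_on_robot(tokens, proprio):
    return [(0.0,) * 7] * CHUNK_STEPS         # pretend decoded [H, 7] action chunk

def observe():            return "images + instruction"
def proprioception():     return [0.0] * 7
def apply_action(action): pass
# --------------------------------------------------------------------------

def control_loop(n_chunks=5):
    with ThreadPoolExecutor(max_workers=1) as cloud:
        pending = cloud.submit(query_cloud_backbone, observe())
        for _ in range(n_chunks):
            tokens = pending.result()                                # blocks only at startup
            pending = cloud.submit(query_cloud_backbone, observe())  # launch the next pass now
            chunk = decode_on_robot(tokens, proprioception())        # <5 ms on the robot
            for action in chunk:                                     # 10 steps x 20 ms = 200 ms
                apply_action(action)
                time.sleep(CONTROL_DT)                               # cloud computes meanwhile

control_loop()
```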

7 — Training Pipeline

A Pre-training: Internet-Scale Vision-Language

Web data → Gemini 2.0

The foundation is Gemini 2.0, pre-trained on internet-scale multimodal data: text, images, video, code, and structured documents. This massive pre-training provides the model with broad world knowledge, spatial understanding from images and video, physical intuition from watching how objects move and interact, and language grounding for instruction following.

Critically, the vision-language pre-training implicitly teaches physics: the model has seen millions of videos of objects falling, liquids pouring, fabrics deforming, and hands manipulating tools. This gives the robotics variants a prior on how the physical world behaves — something that would take billions of robot interactions to learn from scratch.

B Robotics Fine-tuning: Demonstration Data

Robot demonstrations → Action prediction

The pre-trained model is then fine-tuned on robot demonstration data to learn the mapping from perception to continuous actions. The primary data source is ALOHA bimanual teleoperation, where human operators control dual robot arms through a leader-follower setup. The resulting demonstrations capture expert-level dexterous manipulation with full 14-DoF action trajectories.

Fine-tuning adapts the model's output head to produce continuous action values instead of discrete text tokens. The training objective is behavioral cloning: predict the expert's action given the current observation and instruction. Action chunking is used during training — the model predicts the next H actions simultaneously, which provides a richer training signal and encourages temporally coherent behavior.
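A minimal sketch of the chunked behavioral-cloning objective is shown below, assuming a mean-squared-error regression onto the expert's next H actions; the paper does not specify the exact loss, so treat this as one plausible instantiation.

```python
import torch
import torch.nn.functional as F

def chunked_bc_loss(model, batch, horizon=10):
    """Behavioral cloning over action chunks (one plausible instantiation).

    batch["obs"]:     observation features for the policy, e.g. [B, ...]
    batch["actions"]: expert action trajectory, shape [B, >=horizon, action_dim]
    """
    pred_chunk = model(batch["obs"])              # [B, horizon, action_dim]
    target = batch["actions"][:, :horizon]        # expert's next H actions
    return F.mse_loss(pred_chunk, target)         # regress continuous actions
```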

C Safety and Constraint Alignment

Constraints → Aligned behavior

A final alignment stage ensures the model respects safety constraints and follows instructions faithfully. The model is trained to recognize and obey constraint specifications in the language instruction: "do not touch the knife," "keep the cup upright," "move slowly near the table edge." This leverages Gemini 2.0's existing instruction-following capabilities, extended to physical action constraints.

The safety alignment also includes workspace boundary enforcement, force limits for delicate object handling, and emergency stop behaviors triggered by anomalous states.

Data efficiency: By building on Gemini 2.0's pre-training, the robotics fine-tuning requires orders of magnitude less demonstration data than training a robot policy from scratch. The pre-trained model already understands objects, spatial relationships, and physical dynamics — the fine-tuning primarily teaches the output format (continuous actions) and embodiment-specific control strategies.

8 — Key Results

Long-Horizon Dexterous Manipulation

The headline result is 79% success on a suite of long-horizon dexterous tasks, each requiring multiple sequential sub-skills coordinated over extended time horizons. These tasks were specifically chosen to stress-test both precision and persistence.

| Task | Gemini Robotics | RT-2 | Octo | Diffusion Policy |
|---|---|---|---|---|
| Origami folding | 72% | 18% | 5% | 34% |
| Box assembly | 76% | 22% | 8% | 40% |
| Multi-object rearrangement | 85% | 45% | 20% | 55% |
| Deformable object manipulation | 78% | 25% | 10% | 42% |
| Tool use (pouring, scooping) | 82% | 38% | 12% | 48% |
| Average (long-horizon) | 79% | 30% | 11% | 44% |
Why the gap? RT-2 uses discrete action tokenization, limiting its precision. Octo is a smaller generalist model without the reasoning depth. Diffusion Policy is strong on single tasks but lacks the language-conditioned generalization. Gemini Robotics combines frontier-scale reasoning with continuous action decoding, enabling both precision and generalization.

Spatial Reasoning (Robotics-ER)

| Capability | Metric | Gemini Robotics-ER | Best Prior |
|---|---|---|---|
| 3D bounding box prediction | IoU@0.5 | 74.2% | 58.1% |
| Grasp pose prediction | Success rate | 88.5% | 76.3% |
| Trajectory waypoint planning | Task completion | 71.8% | 52.4% |
| Spatial relationship QA | Accuracy | 91.3% | 79.6% |

Cross-Embodiment Generalization

A critical test of a foundation model approach is whether it transfers across different robot platforms. Gemini Robotics was evaluated on multiple embodiments beyond its primary ALOHA training platform:

| Embodiment | DoF | Adaptation Method | Success Rate |
|---|---|---|---|
| ALOHA bimanual (primary) | 14 | Full fine-tuning | 79% |
| Single UR5e arm | 7 | Adapter fine-tuning | 71% |
| Franka Panda | 7 | Adapter fine-tuning | 68% |
| Humanoid upper body | 22 | Adapter fine-tuning | 54% |
Humanoid transfer: Adapting to a 22-DoF humanoid upper body with only adapter fine-tuning (freezing the backbone) achieves 54% success — remarkable given the significant morphological difference from the bimanual training platform. This suggests the backbone has learned embodiment-agnostic manipulation strategies.

9 — Component Deep Dive

Vision Encoder

Multi-Camera Image Encoding

[N, H, W, 3] → [N, L, D]

The vision encoder processes images from multiple cameras (typically 2–4 views for manipulation tasks). Each image is independently encoded into a sequence of visual tokens using the Gemini 2.0 vision encoder, then concatenated along the sequence dimension. This multi-view representation gives the model implicit 3D awareness through geometric consistency across viewpoints.

Image resolution is 480 × 640 per camera. The vision encoder produces L tokens per image, each of dimension D. For a 3-camera setup, the total visual token count is 3L, which is concatenated with language tokens and proprioceptive tokens before entering the backbone transformer.
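A shape-level sketch of this token assembly is shown below; the camera count, token count, and embedding width are illustrative placeholders rather than the model's real values.

```python
import numpy as np

N_CAMERAS, L_TOKENS, D_MODEL = 3, 256, 1024      # illustrative sizes only

def build_input_sequence(images, text_tokens, proprio_tokens, encode_image):
    """Concatenate per-camera visual tokens with language and proprio tokens.

    images:         [N_CAMERAS, H, W, 3]
    text_tokens:    [T, D_MODEL]
    proprio_tokens: [P, D_MODEL]
    returns:        [N_CAMERAS * L_TOKENS + T + P, D_MODEL]
    """
    visual = [encode_image(img) for img in images]   # each view -> [L_TOKENS, D_MODEL]
    visual = np.concatenate(visual, axis=0)          # [N_CAMERAS * L_TOKENS, D_MODEL]
    return np.concatenate([visual, text_tokens, proprio_tokens], axis=0)

# Dummy usage with a stand-in encoder:
seq = build_input_sequence(
    np.zeros((N_CAMERAS, 480, 640, 3)),
    np.zeros((12, D_MODEL)),                         # 12 instruction tokens
    np.zeros((1, D_MODEL)),                          # 1 proprioceptive token
    encode_image=lambda img: np.zeros((L_TOKENS, D_MODEL)),
)
print(seq.shape)                                     # (3*256 + 12 + 1, 1024)
```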

Language Conditioning

Instruction Tokenization

Text → [T, D]

Natural language instructions are tokenized using Gemini 2.0's standard text tokenizer and embedded into the same D-dimensional space as visual tokens. The unified token space means the model attends jointly over visual and language tokens, enabling fine-grained cross-modal reasoning — for example, grounding the phrase "the red cup near the edge" to specific visual features.

Proprioceptive Input

Robot State Encoding

[joint_positions, gripper_state] → [P, D]

The robot's proprioceptive state — joint angles, joint velocities, gripper opening width, and optionally force/torque readings — is projected into the token embedding space via a small MLP. These proprioceptive tokens are appended to the visual and language tokens, giving the backbone full context about the robot's current physical configuration.

Proprioceptive input is critical for closed-loop control: without knowing where its joints currently are, the model cannot produce accurate delta actions to reach a target pose.
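A minimal sketch of such a projection is shown below; the layer sizes, state dimension, and number of proprioceptive tokens are assumptions rather than values from the paper.

```python
import torch
import torch.nn as nn

class ProprioEncoder(nn.Module):
    """Project raw robot state into P tokens in the backbone's embedding space."""

    def __init__(self, state_dim=16, d_model=1024, n_tokens=2):
        super().__init__()
        self.n_tokens, self.d_model = n_tokens, d_model
        self.mlp = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.GELU(),
            nn.Linear(256, n_tokens * d_model),
        )

    def forward(self, state):                              # state: [B, state_dim]
        out = self.mlp(state)                              # [B, n_tokens * d_model]
        return out.view(-1, self.n_tokens, self.d_model)   # [B, P, D]

# Joint angles + velocities + gripper width packed into one 16-dim state vector.
tokens = ProprioEncoder()(torch.zeros(4, 16))
print(tokens.shape)                                        # torch.Size([4, 2, 1024])
```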

Action Decoder

On-Robot Action Chunk Decoder

Action tokens + Proprioception → [H, 7] or [H, 14]

The on-robot decoder is a lightweight transformer (a few layers, small hidden dimension) that takes the compressed action tokens from the cloud and the current proprioceptive state, and outputs an action chunk: a sequence of H future actions. Each action is a 7-dimensional vector for single-arm control (dx, dy, dz, droll, dpitch, dyaw, gripper) or 14-dimensional for bimanual control.

The decoder is trained jointly with the backbone. During training, the action tokens serve as an information bottleneck: the backbone must compress all action-relevant information into a fixed number of tokens, and the decoder must reconstruct precise actions from this compressed representation plus real-time proprioception.
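A rough sketch of a decoder in this spirit appears below: a few transformer layers cross-attend from H learned query tokens to the action tokens plus a proprioceptive token. All sizes and the exact layer arrangement are guesses, not the published architecture.

```python
import torch
import torch.nn as nn

class ActionChunkDecoder(nn.Module):
    """Tiny transformer decoder: cloud action tokens + proprio -> [H, action_dim] chunk."""

    def __init__(self, d_model=512, n_layers=4, horizon=10, action_dim=7, state_dim=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(horizon, d_model))  # one query per future step
        self.proprio_proj = nn.Linear(state_dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, action_tokens, proprio):
        # action_tokens: [B, K, d_model] from the cloud; proprio: [B, state_dim]
        memory = torch.cat([action_tokens, self.proprio_proj(proprio)[:, None]], dim=1)
        queries = self.queries.expand(action_tokens.shape[0], -1, -1)   # [B, H, d_model]
        return self.action_head(self.decoder(queries, memory))          # [B, H, action_dim]

chunk = ActionChunkDecoder()(torch.randn(2, 8, 512), torch.randn(2, 16))
print(chunk.shape)                                                      # torch.Size([2, 10, 7])
```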

10 — Action Chunking and Temporal Smoothing

Action chunking is a key design choice that addresses multiple challenges simultaneously. Instead of predicting a single action per observation, the model predicts a chunk of H consecutive actions (typically H = 10, corresponding to 200ms at 50 Hz).

Benefits of Action Chunking

1 Latency hiding: While the robot executes the current chunk, the cloud computes the next one. As long as the chunk duration exceeds the network round-trip time, the robot never stalls waiting for actions.

2 Temporal coherence: Predicting multiple future actions encourages the model to plan smooth trajectories rather than making greedy single-step decisions. This is especially important for dynamic motions like pouring or folding.

3 Richer training signal: Each training example provides H action labels instead of one, giving the model more supervision per forward pass and improving sample efficiency.

4 Implicit planning: To predict the next 10 actions accurately, the model must internally plan a short-horizon strategy, not just react to the current frame. This implicit planning improves performance on tasks requiring multi-step manipulation sequences.

Chunk overlap: At inference time, the robot does not wait for a chunk to finish before requesting the next one. Chunks are requested at a higher frequency than their duration, and overlapping predictions are blended using exponential weighting (recent predictions weighted more heavily). This ensures smooth transitions and allows the model to correct for deviations from the predicted trajectory.
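A minimal sketch of this blending rule is shown below; the exact weighting scheme and temperature are not given in the source, so these are assumptions.

```python
import numpy as np

def blend_overlapping_actions(predictions, temperature=0.1):
    """Blend all still-valid predictions for the current control step.

    predictions: list of (age, action) pairs, where `age` counts how many control
    steps ago that chunk was produced (0 = newest) and `action` is the vector that
    chunk predicted for the current step.
    """
    ages = np.array([age for age, _ in predictions], dtype=float)
    actions = np.stack([a for _, a in predictions])        # [K, action_dim]
    weights = np.exp(-temperature * ages)                  # newer chunks weigh more
    weights /= weights.sum()
    return weights @ actions                               # weighted-average action

# Three overlapping chunks (0, 10, and 20 steps old) all predict an action for "now".
print(blend_overlapping_actions([(0, np.array([0.010, 0.0])),
                                 (10, np.array([0.012, 0.0])),
                                 (20, np.array([0.020, 0.0]))]))
```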

11 — Comparison with Prior Work

| Feature | Gemini Robotics | π0 | RT-2 | OpenVLA |
|---|---|---|---|---|
| Base model scale | Gemini 2.0 (frontier) | PaliGemma 3B | PaLI-X 55B | Llama-2 7B |
| Action output | Continuous (chunk) | Continuous (flow matching) | Discrete tokens | Discrete tokens |
| Control frequency | 50 Hz | 50 Hz | 3 Hz | 5 Hz |
| Bimanual support | Native (14-DoF) | Native (14-DoF) | Single arm | Single arm |
| Cloud-edge split | Yes | No (on-device) | Cloud only | On-device |
| Spatial reasoning | ER variant (3D) | Limited | 2D only | 2D only |
| Navigation variant | Yes (Nav) | No | No | No |
| Cross-embodiment | 4+ platforms | 5+ platforms | 1 platform | Multiple (Open X) |
Key differentiator: Gemini Robotics is the first system to combine frontier-scale VLM reasoning with real-time continuous action generation via a cloud-edge split. Prior cloud-based systems (RT-2) ran at only 3 Hz with discrete actions. Prior on-device systems (π0, OpenVLA) sacrifice model scale for on-board inference. Gemini Robotics gets both: frontier reasoning and real-time control.

12 — Practical Implications

For Robotics Researchers

The Gemini Robotics approach validates the hypothesis that scaling up the perception-reasoning backbone — even if it requires cloud inference — can substantially improve manipulation performance. This shifts the bottleneck from model capability to data collection: with a powerful enough backbone, the marginal value of each demonstration increases dramatically. Researchers should consider whether their task requires more model or more data.

For System Designers

The cloud-edge split is an architectural pattern that will likely become standard. Key considerations for deployment include network reliability (what happens during a connectivity dropout?), privacy (camera images are sent to the cloud), and latency variability. The action chunking mechanism provides a natural buffer against transient latency spikes, but sustained connectivity loss requires a graceful fallback — such as a simpler on-device policy that takes over during outages.

For the Field

Gemini Robotics represents a consolidation of the robotics model landscape around the VLA paradigm. Rather than separate models for perception, planning, and control, a single pre-trained backbone (with specialized heads) handles the full stack. This mirrors the consolidation that occurred in NLP (from pipelines of specialized models to unified transformers) and suggests that the future of robot learning is foundation model adaptation, not task-specific policy training.

The foundation model thesis for robotics: Pre-train on internet-scale multimodal data to learn world knowledge and physical intuition. Fine-tune on robot demonstrations to learn the action output format and embodiment-specific strategies. Adapt to new tasks and platforms with minimal additional data. Gemini Robotics is the strongest evidence yet that this thesis holds.

13 — Limitations and Open Questions

Despite impressive results, several important limitations remain:

Cloud dependency. The reliance on cloud inference for the backbone means the system requires reliable, low-latency network connectivity. This limits deployment in environments with poor connectivity (outdoor field robots, disaster response scenarios, remote facilities). While action chunking buffers against short latency spikes, sustained connectivity loss would halt the robot.

Demonstration data bottleneck. The system is trained on human teleoperation data, which is expensive to collect at scale. While the pre-trained backbone provides strong priors, each new embodiment or task domain still requires a non-trivial amount of demonstration data for fine-tuning. Scaling the data collection pipeline remains a challenge.

Long-horizon compounding errors. On the longest tasks (>5 minutes, >100 sub-skills), success rates drop significantly. Small errors compound over time, and the model lacks an explicit mechanism for detecting and recovering from failures mid-task. Integrating explicit error detection and re-planning remains an open problem.

Sim-to-real gap. The paper focuses on real-world evaluation, but the training pipeline does not heavily leverage simulation. Incorporating simulated data (with domain randomization) could potentially reduce the need for expensive real-world demonstrations, but bridging the sim-to-real gap for dexterous manipulation remains challenging.

Open question: As on-device compute improves (custom ASIC accelerators, model distillation), will the cloud-edge split remain necessary? Or will future models run entirely on-robot? The answer likely depends on how quickly model compression techniques can close the gap between frontier-scale and edge-deployable models.

14 — Key Takeaways

Summary

1 Foundation model approach to robotics works. Building on Gemini 2.0's internet-scale pre-training, Gemini Robotics achieves state-of-the-art manipulation performance with substantially less robot-specific training data than prior approaches. The pre-trained backbone provides world knowledge, spatial understanding, and physical intuition that transfers directly to robotic tasks.

2 Cloud-edge architecture enables frontier-scale real-time control. By splitting computation between a cloud backbone (reasoning) and an on-robot decoder (action execution), the system achieves both the reasoning depth of a frontier VLM and the 50 Hz control frequency that dexterous manipulation demands.

3 A family of models covers the full robotics stack. Robotics-ER for spatial reasoning, Robotics for manipulation, and Robotics-Nav for navigation can be composed to build complete autonomous systems. This modular-yet-unified approach is more practical than a single monolithic model.

4 Cross-embodiment generalization is real. Adapting to new robot platforms (including humanoids) via lightweight adapter fine-tuning demonstrates that the backbone learns embodiment-agnostic manipulation strategies. This is a critical step toward truly general-purpose robot foundation models.

5 Action chunking is more than an engineering trick. It simultaneously solves latency hiding, temporal smoothing, richer training signals, and implicit planning. It is likely to become a standard component of future VLA architectures.

15 — References

Google DeepMind. (2025). Gemini Robotics: Bringing AI into the Physical World. arXiv:2503.20020.

Brohan, A., et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv:2307.15818.

Black, K., et al. (2024). π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164.

Kim, M.J., et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246.

Chi, C., et al. (2023). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. RSS 2023.

Zhao, T.Z., et al. (2023). Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. RSS 2023.

Driess, D., et al. (2023). PaLM-E: An Embodied Multimodal Language Model. ICML 2023.