Behavior Cloning: From PilotNet to GR00T N1
1 — What Is Behavior Cloning?
Behavior Cloning (BC) is the simplest form of imitation learning: given a dataset of expert demonstrations — sequences of observations paired with the actions the expert took — train a neural network to predict the expert's action from the observation. That's it. No reward function, no environment interaction during training, no planning, no value estimation. Just supervised learning on (state, action) pairs.
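The whole recipe fits in a few lines. A minimal numpy sketch, using a hypothetical linear expert so the fit can be checked in closed form (the expert gains `K` and all data here are assumptions for the demo, not from any real system):

```python
import numpy as np

# Toy behavior cloning: fit a linear policy a = W·s to expert (state, action)
# pairs by plain supervised least squares. The "expert" is a hypothetical
# linear controller a* = K·s plus a little label noise.
rng = np.random.default_rng(0)
K = np.array([[0.5, -1.2]])                     # unknown expert gains
states = rng.normal(size=(500, 2))              # states the expert visited
actions = states @ K.T + 0.01 * rng.normal(size=(500, 1))

# "Training" = minimize MSE over (state, action) pairs; the linear case
# has a closed-form least-squares solution.
W, *_ = np.linalg.lstsq(states, actions, rcond=None)
print(np.allclose(W.T, K, atol=0.05))           # recovered gains ≈ expert gains
```

Swapping the linear map for a deep network and the closed-form solve for SGD changes nothing conceptually; the objective is still per-pair MSE.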
Why BC Is Appealing

BC has an irresistible simplicity. Every advance in supervised learning — bigger models, better optimizers, transfer learning from vision and language pretraining — immediately lifts BC performance with no algorithmic change required. There is no exploration problem, no sparse reward to shape, no distribution over rollouts to sample from. Given enough good data, BC works. The catch is that those three words — "enough good data" — hide one of the field's oldest and deepest failure modes.
2 — PilotNet: The Canonical Case Study
The modern revival of BC traces to Pomerleau's ALVINN (1989), which used a shallow neural net to steer a Chevy van along a road. The ideas lay largely dormant until NVIDIA's PilotNet (Bojarski et al., 2016) dusted them off with modern ConvNets — a 9-layer CNN trained end-to-end on 72 hours of human driving video, producing steering angles directly from a single front-facing camera.
1 — Pure End-to-End Pipeline
PilotNet's radical claim was that a single network could replace the entire conventional self-driving stack for highway lane-keeping — no hand-coded lane detection, no path planner, no low-level controller. The network learns all the relevant cues (lane markings, road edges, other cars) implicitly from a direct pixels-to-steering regression. At roughly 250k parameters, the model is tiny by modern standards.
2 — Data Augmentation for Off-Center Recovery
A critical trick: NVIDIA recorded from three cameras simultaneously (left, center, right) and used the side cameras as synthetic off-center observations with hand-computed steering corrections (e.g., "if the left camera saw this, the correct action is to steer slightly right"). They also applied random shifts and rotations with analytically correct steering adjustments. This tiny bit of augmentation taught the network what to do when it drifted — and foreshadowed the central problem that every subsequent BC method would have to address.
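The correction itself is simple geometry. A sketch of the idea — the lookahead distance and sign convention here are illustrative assumptions; NVIDIA's exact constants were not published:

```python
import math

def corrected_steering(center_label_rad, camera_offset_m, lookahead_m=20.0):
    """Steering label for a shifted viewpoint (simplified geometry).

    A camera mounted `camera_offset_m` to the LEFT of center sees the road
    as if the car had drifted left, so the synthetic label adds a rightward
    correction aiming to rejoin the lane center `lookahead_m` ahead.
    Sign convention here: positive = steer right.
    """
    return center_label_rad + math.atan2(camera_offset_m, lookahead_m)

left = corrected_steering(0.0, 0.6)    # left camera → positive (rightward) label
right = corrected_steering(0.0, -0.6)  # right camera → negative (leftward) label
print(left > 0 and right < 0)
```

The same formula, applied with continuously varying offsets, covers the random shift-and-rotate augmentation as well.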
3 — The Covariate Shift Problem
BC has one fundamental, structural flaw. The training data is drawn from the expert's state distribution, but at deployment the policy drives the robot into states from its own distribution — a distribution that includes small errors, drifts, and edge cases the expert never visited. On those unseen states, the policy has no label to imitate. Its errors compound with time.
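A toy model makes the compounding concrete. Assume the cloned policy slips off-distribution with probability ε per step and, once off, never recovers and pays cost 1 every remaining step (a deliberately pessimistic assumption, matching the worst case in the theory):

```python
import numpy as np

def expected_bc_cost(eps, T, runs=20000, seed=0):
    """Monte-Carlo cost of a cloned policy that slips off-distribution
    with probability eps per step and never recovers afterward."""
    rng = np.random.default_rng(seed)
    slip = rng.geometric(eps, size=runs)      # first off-distribution step
    return np.clip(T - slip, 0, None).mean()  # cost 1 per step after slipping

c100 = expected_bc_cost(0.01, 100)
c200 = expected_bc_cost(0.01, 200)
print(c200 / c100)   # > 2: doubling the horizon more than doubles the cost
```

The superlinear growth in the horizon T is exactly the O(εT²) behavior that DAgger, below, was designed to cut down.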
4 — DAgger: Dataset Aggregation
DAgger (Ross, Gordon & Bagnell, AISTATS 2011) is the classical answer: instead of training once on expert data, iteratively roll out the policy, ask the expert to label the states the policy visited, append those new labels to the dataset, and retrain. Over iterations the dataset comes to cover the policy's own state distribution, and the compounding error bound tightens from O(εT²) to O(εT).
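The roll-out / label / aggregate / retrain loop can be sketched in a toy 1-D setting (the dynamics, expert gain, and noise level are all assumptions chosen so the fit is easy to verify):

```python
import numpy as np

# DAgger sketch: state x, dynamics x' = x + a + noise, expert policy
# a* = -0.8·x (drives x to 0). The learner is a scalar gain fit by
# least squares on the aggregated dataset.
rng = np.random.default_rng(0)
expert = lambda x: -0.8 * x

def rollout(gain, T=50):
    xs, x = [], 2.0
    for _ in range(T):
        xs.append(x)
        x = x + gain * x + 0.05 * rng.normal()    # act with the LEARNER's policy
    return np.array(xs)

data_x, data_a, gain = [], [], 0.0                # start untrained
for it in range(5):                               # DAgger iterations
    xs = rollout(gain)                            # 1. roll out current policy
    data_x.extend(xs); data_a.extend(expert(xs))  # 2. expert labels visited states
    X, A = np.array(data_x), np.array(data_a)     # 3. aggregate the dataset
    gain = float(X @ A / (X @ X))                 # 4. retrain (1-D least squares)

print(abs(gain + 0.8) < 1e-6)   # learner converges to the expert gain
```

The key difference from plain BC is step 1: the states being labeled come from the learner's own distribution, not the expert's.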
DAgger's Trade-Off
DAgger tightens the theoretical bound at the cost of requiring an interactive expert who can label arbitrary states on demand. For a video game this is often cheap; for a human pilot in a real airplane it is prohibitive. This practical limitation is why most modern robotics BC does not use DAgger and instead tackles covariate shift through other means: much larger demonstration datasets, cross-embodiment data mixing, multimodal action predictions, and — the focus of the next two sections — action chunking and diffusion.
5 — Action Chunking (ACT / ALOHA)
Action Chunking with Transformers (Zhao et al., RSS 2023), introduced alongside the low-cost ALOHA bimanual teleoperation platform, proposed a simple but powerful modification to BC: instead of predicting a single next action a_t, predict a chunk of k consecutive actions (a_t, a_{t+1}, …, a_{t+k−1}). The policy then executes the whole chunk (or, with temporal ensembling, a running average of overlapping chunks) before re-planning.
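Temporal ensembling is worth spelling out. The ACT paper weights the i-th overlapping prediction by exp(−m·i) with i = 0 the oldest, so earlier commitments count slightly more than fresh ones; the chunk values and m below are illustrative:

```python
import numpy as np

def temporal_ensemble(chunks, t, m=0.1):
    """Blend every chunk that covers timestep t.

    chunks[s] is the k-vector predicted at step s; weights follow ACT's
    exp(-m*i) scheme with i = 0 for the oldest covering prediction.
    """
    covering = [(s, c) for s, c in enumerate(chunks) if s <= t < s + len(c)]
    preds = np.array([c[t - s] for s, c in covering])
    w = np.exp(-m * np.arange(len(covering)))   # oldest prediction first
    return float(w @ preds / w.sum())

chunks = [np.full(4, 1.0), np.full(4, 3.0)]  # two overlapping 4-step predictions
print(temporal_ensemble(chunks, t=1))        # a blend strictly between 1.0 and 3.0
```

Averaging over overlapping chunks smooths out discontinuities at chunk boundaries without sacrificing the commitment benefit described next.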
1 — Why Chunks Reduce Compounding
If the policy is queried every timestep, every timestep is a chance to drift. If it commits to 100 actions at a time, it has 100 fewer chances to compound. More importantly, chunks encode temporal structure in the demonstrations: fine-grained gripper control, multi-step insertion primitives, smooth bimanual coordination. Predicting chunks as a unit captures this structure in a way that per-step BC cannot.
2 — CVAE Style Latent
ACT wraps the chunk predictor in a conditional VAE. A small "style encoder" (trained alongside the decoder) maps the full ground-truth action chunk to a latent z; at inference time z is sampled from the prior. This gives the decoder a cheap way to express multimodal human behavior — e.g., demos where the human sometimes grasps from the left and sometimes from the right — without the averaging collapse that plagues plain MSE BC on multimodal data.
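The shape of the objective, sketched with stand-in arrays in place of the real encoder and decoder (the ACT paper uses an L1 reconstruction term and a KL weight of about 10; the MSE reconstruction and all tensor values below are simplifications):

```python
import numpy as np

# CVAE objective sketch: reconstruction + beta * KL(q(z|chunk) || N(0, I)).
# The style encoder outputs (mu, logvar) of z from the ground-truth chunk;
# the decoder reconstructs the chunk from (obs, z). At test time z = 0.
rng = np.random.default_rng(0)
chunk = rng.normal(size=(100, 8))       # ground-truth 100-step, 8-DoF chunk
mu = rng.normal(size=32) * 0.1          # stand-in encoder outputs
logvar = rng.normal(size=32) * 0.1
z = mu + np.exp(0.5 * logvar) * rng.normal(size=32)   # reparameterization trick
recon = chunk + 0.01 * rng.normal(size=chunk.shape)   # stand-in decoder output

recon_loss = np.mean((recon - chunk) ** 2)
kl = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))  # closed-form Gaussian KL
loss = recon_loss + 10.0 * kl
print(loss > 0)
```

Because z is sampled at inference, two runs on the same observation can commit to different modes — left grasp or right grasp — instead of averaging them.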
3 — ALOHA: Cheap Teleop, Dense Demos
The method paper is inseparable from the ALOHA hardware: a pair of leader-follower arms that lets one person smoothly teleoperate a 14-DoF bimanual system. Cheap, dense, smooth teleoperation produces demonstration data that is well-suited to chunk prediction. ACT + ALOHA could learn tasks like threading a zip tie, inserting a battery, or slotting a socket from only ~50 demonstrations each.
6 — Diffusion Policy
Diffusion Policy (Chi et al., RSS 2023) takes a different tack on the multimodality problem: model the full distribution over action sequences using a denoising diffusion model, conditioned on the current observation. The resulting policy naturally handles multimodal demonstrations, produces smooth high-dimensional action trajectories, and sets new state-of-the-art on dozens of manipulation benchmarks.
1 — Why Diffusion Helps
A plain MSE regression policy trained on multimodal demonstrations (go left or go right) averages the modes and produces a policy that goes straight into the wall. Diffusion models the data distribution implicitly via score matching: at each denoising step the network predicts the noise that was added, which is equivalent to predicting the gradient of the log-density of the action distribution. The resulting policy samples from the demonstration distribution rather than averaging over it.
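One training step of the noise-prediction objective, sketched with a stand-in denoiser (the schedule constants and chunk shape are illustrative, not the paper's exact hyperparameters):

```python
import numpy as np

# One DDPM-style training step for an action chunk: pick a random noise
# level, corrupt the expert chunk, and regress the added noise.
# `denoiser(noisy, t)` stands in for the obs-conditioned network eps_theta.
rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 100))   # noise schedule

def training_loss(denoiser, chunk):
    t = rng.integers(len(alpha_bar))
    eps = rng.normal(size=chunk.shape)
    noisy = np.sqrt(alpha_bar[t]) * chunk + np.sqrt(1 - alpha_bar[t]) * eps
    return np.mean((denoiser(noisy, t) - eps) ** 2)          # noise-prediction MSE

chunk = rng.normal(size=(16, 7))                             # 16 steps × 7 DoF
perfect = lambda noisy, t: (noisy - np.sqrt(alpha_bar[t]) * chunk) / np.sqrt(1 - alpha_bar[t])
print(training_loss(perfect, chunk))                         # ≈ 0 for a perfect denoiser
```

Regressing the noise is what makes the objective a score-matching one: the predicted noise is an affine function of the gradient of the log-density at the corrupted point.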
2 — Receding-Horizon Execution
Like ACT, Diffusion Policy predicts a chunk — typically 16 timesteps — then executes a prefix (e.g., 8 steps) before re-planning. This receding-horizon strategy combines the benefits of chunk commitment (smooth trajectories, fewer compounding points) with closed-loop responsiveness (new observations re-enter the prediction often enough to matter).
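The execution loop itself is tiny. A sketch with hypothetical stand-ins for the environment and the chunk predictor:

```python
# Receding-horizon execution: predict a `horizon`-step chunk, execute only
# the first `execute` actions, then re-observe and re-predict.
def run_policy(env_step, predict_chunk, obs, horizon=16, execute=8, total=32):
    executed = []
    while len(executed) < total:
        chunk = predict_chunk(obs, horizon)   # full 16-step plan
        for a in chunk[:execute]:             # commit to the first 8 steps only
            obs = env_step(a)                 # fresh observations re-enter here
            executed.append(a)
    return executed

# Toy stand-ins: identity env, policy that always outputs ones.
acts = run_policy(lambda a: a, lambda obs, h: [1.0] * h, obs=0.0)
print(len(acts))   # → 32
```

Setting `execute = horizon` recovers pure open-loop chunk execution; `execute = 1` recovers per-step closed-loop control; the 16/8 split sits deliberately between the two.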
3 — Denoising Networks
The denoiser itself can be a 1D temporal CNN-UNet (for purely low-dim action sequences) or a transformer (when conditioning on rich visual embeddings). Both work; the CNN-UNet is lighter and easier to train. The original paper trains with 100 denoising iterations and uses the DDIM sampler to decouple the inference step count from training; later work (consistency models, distillation) has pushed inference to 10 or even 1 step with minimal quality loss.
7 — GR00T N1: NVIDIA's Humanoid Foundation Model
GR00T N1 (NVIDIA, March 2025) is the culmination of everything above: a foundation model for humanoid robots that inherits BC's supervised-learning simplicity, borrows diffusion's multimodal action generation, and wraps both inside a dual-system architecture inspired by Kahneman's System 1 / System 2 psychology. It is trained on the largest, most heterogeneous humanoid manipulation dataset assembled to date, and it is open-source.
1 — Dual-System Architecture
The central architectural choice in GR00T N1 is a separation of concerns modeled on Kahneman's psychology of decision-making. System 2 is a Vision-Language Model (NVIDIA's Eagle-2, ~2B parameters) that processes vision and language at a comfortable ~10 Hz — enough for deliberative scene parsing and instruction following but far too slow for continuous control. System 1 is a Diffusion Transformer (DiT) that takes Eagle-2's latent tokens as conditioning context and produces action chunks at ~120 Hz, fast enough for reactive humanoid control.
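The two rates compose naturally in a single control loop. A schematic sketch (the 12:1 tick ratio mirrors the quoted ~10 Hz vs ~120 Hz; the function names are stand-ins, not GR00T's API):

```python
# Dual-rate control loop: the VLM refreshes its latent tokens every 12
# control ticks (~10 Hz), while the action head runs every tick (~120 Hz)
# conditioned on the most recent latents.
def control_loop(vlm, action_head, get_obs, ticks=120, vlm_every=12):
    latents, actions = None, []
    for t in range(ticks):
        if t % vlm_every == 0:
            latents = vlm(get_obs())                     # slow deliberative pass
        actions.append(action_head(latents, get_obs()))  # fast reactive pass
    return actions

calls = {"vlm": 0}
def fake_vlm(obs):
    calls["vlm"] += 1
    return "latents"

acts = control_loop(fake_vlm, lambda latents, obs: 0.0, lambda: None)
print(calls["vlm"], len(acts))   # → 10 120
```

In one simulated second the expensive model runs 10 times while the controller emits 120 actions — the decoupling that makes a 2B-parameter VLM usable for reactive control.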
2 — Eagle-2 Vision-Language Backbone
Eagle-2 is a compact multimodal model combining a SigLIP vision encoder with an LLM backbone. For GR00T it ingests egocentric stereo RGB, a language instruction ("pick up the red mug and place it on the shelf"), and a brief history of robot state. Its output — a short sequence of latent tokens — compresses its scene-and-task understanding into conditioning signal for the action generator. Eagle-2 is trained jointly with the rest of GR00T during post-training but can also be used separately as a reasoning-only model.
3 — Diffusion Transformer Action Head
The action head is a Diffusion Transformer trained with flow matching (a continuous-time variant of diffusion popularized by Stable Diffusion 3 and adopted by π₀). The action space covers the humanoid's full body: arm joint angles, hand finger joints, torso and waist. Chunks are typically 16–32 steps at the robot's control frequency. By running only the action head at 120 Hz and keeping the VLM at 10 Hz, GR00T sidesteps the compute bottleneck of large-model-every-timestep VLA architectures.
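Flow matching replaces the discrete noise schedule with a continuous-time regression onto a velocity field. A sketch of one training step (the chunk shape and stand-in network are assumptions):

```python
import numpy as np

# Flow-matching training step: sample tau ~ U(0,1), interpolate linearly
# between noise and the expert chunk, and regress the constant velocity
# (chunk - noise) of that interpolation path.
rng = np.random.default_rng(0)

def flow_matching_loss(velocity_net, chunk):
    tau = rng.uniform()
    noise = rng.normal(size=chunk.shape)
    x_tau = (1 - tau) * noise + tau * chunk   # point on the straight-line path
    target = chunk - noise                    # the path's velocity
    return np.mean((velocity_net(x_tau, tau) - target) ** 2)

chunk = rng.normal(size=(32, 24))             # 32-step chunk, 24-DoF body
zero_net = lambda x, tau: np.zeros_like(x)    # untrained stand-in network
print(flow_matching_loss(zero_net, chunk) > 0)
```

At inference the learned velocity field is integrated from noise to an action chunk in a handful of ODE steps, which is part of how the head sustains ~120 Hz.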
4 — The Data Pyramid
GR00T N1 introduces a three-tier data hierarchy:
- Real teleoperation data — the gold standard, but expensive to collect; a few thousand hours of humanoid teleop across NVIDIA's partners
- Synthetic data — millions of episodes generated in Isaac Sim with domain randomization and retargeted from human motion capture; cheap to produce, imperfect in physics fidelity
- Human egocentric video — public datasets (Ego4D, EgoExo4D, HOI4D) in which a human wears a head-cam while performing manipulation tasks; enormous in scale, but has no robot actions
The human-video tier uses neural trajectories as pseudo-labels: a vision-based hand-pose estimator converts the human's hand trajectory into a synthetic action sequence that can be used for pretraining as if it were a robot demonstration.
5 — Cross-Embodiment Training
Like RT-X before it (and more aggressively), GR00T N1 is trained on data from multiple humanoid embodiments: Fourier GR-1, Unitree H1, 1X NEO, and others. A learned embodiment embedding tells the model which body it is controlling; the shared backbone transfers manipulation competence across platforms. This was infeasible for prior, per-robot BC systems and is a direct descendant of the dataset-aggregation lineage: collect from as many distributions as possible, and let a large model stitch them together.
8 — Evolution: BC Across Four Decades
| Method | Year | Covariate shift strategy | Action representation | Scale |
|---|---|---|---|---|
| ALVINN | 1989 | None — recovery unreliable | Per-step scalar | ~few hr driving |
| DAgger | 2011 | Iterative expert queries | Per-step | Task-specific |
| PilotNet | 2016 | Side-camera augmentation | Per-step scalar | 72 hr driving |
| ACT | 2023 | Action chunking + CVAE | 100-step chunk | 50 demos / task |
| Diffusion Policy | 2023 | Chunking + multimodal dist. | 16-step chunk | 100s of demos |
| GR00T N1 | 2025 | Scale + data pyramid + cross-embodiment | DiT action chunks | thousands of hrs |
9 — Key Takeaways
BC Is Deceptively Simple
On paper, behavior cloning is the shortest algorithm in robotics: collect demos, minimize MSE, deploy. In practice the field has spent 35 years finding ways to cope with the single failure mode — covariate shift — that the simplicity creates. Every real-world BC system in deployment today uses one or more of: data augmentation (PilotNet), iterative aggregation (DAgger), larger and more diverse demonstration pools (RT-X, GR00T), chunked action prediction (ACT), or distributional action models (Diffusion Policy).
Chunking Is a Cheap, Enormous Win
Predicting action chunks instead of single actions is perhaps the highest-leverage change in modern BC: it reduces compounding decision points, captures temporal structure in demonstrations, and makes downstream features (smoothing, receding horizon, multimodal latent codes) easy to implement. Both ACT and Diffusion Policy use chunks; GR00T N1 does too. Chunks are now close to universal in manipulation policies.
Diffusion Is the Default Action Head
For manipulation tasks with multimodal human demonstrations, diffusion or flow-matching action heads outperform unimodal regression across nearly every benchmark. GR00T N1's System 1 is a DiT; π₀ uses flow matching; RT-2 — which predicted discrete action tokens — has largely been superseded by continuous-action diffusion variants. This is the single most consequential trend in BC since action chunking.
Data Is the Final Constraint
After you have a good architecture (chunks + diffusion + big pretrained VLM), what is left is data. GR00T N1's contribution is as much about its three-tier data pyramid — real teleop + synthetic + human video with neural trajectories — as it is about its architecture. Every SOTA BC system after 2024 is explicitly a data engineering story, not only an algorithmic one.
Dual-System Is the New Blueprint
The VLM-as-System-2 + DiT-as-System-1 split is emerging as the dominant pattern for VLA architectures. It decouples the frequency of reasoning from the frequency of control, which had been the bottleneck in prior single-model VLAs. Expect this pattern to show up in every major robotics foundation-model release for the next several years.
10 — References & Further Reading
- ALVINN: An Autonomous Land Vehicle in a Neural Network — Pomerleau — NeurIPS 1988
- End to End Learning for Self-Driving Cars (PilotNet) — Bojarski, Del Testa, Dworakowski, Firner, Flepp, Goyal, Jackel, Monfort, Muller, Zhang, Zhang, Zhao, Zieba — NVIDIA 2016
- A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning (DAgger) — Ross, Gordon, Bagnell — AISTATS 2011
- Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ACT / ALOHA) — Zhao, Kumar, Levine, Finn — RSS 2023
- Diffusion Policy: Visuomotor Policy Learning via Action Diffusion — Chi, Feng, Du, Xu, Cousineau, Burchfiel, Song — RSS 2023
- GR00T N1: An Open Foundation Model for Generalist Humanoid Robots — NVIDIA GEAR — 2025
- NVIDIA Isaac GR00T Project Page — includes open model weights and simulation tools
- Our π₀ Walkthrough — companion flow-matching VLA from Physical Intelligence
- Our Gemini Robotics Walkthrough — Google's related VLA family
- Our SAFE Walkthrough — failure detection for VLA policies