Model Walkthroughs

Papers with Beginner-Friendly Explanations

Deep learning papers are dense. These walkthroughs break landmark models down layer by layer — with diagrams, example inputs, and plain-language explanations of what each component actually does. Inspired by Papers With Code, but built for the student seeing these architectures for the first time.

Philosophy

Every walkthrough follows the same structure: start with the problem the model solves, show the architecture as a whole, then walk through each layer explaining what goes in, what comes out, and why. No hand-waving — if a tensor changes shape, we show it. If a design choice matters, we explain the alternative that was rejected.

Walkthroughs

ResNet: Deep Residual Learning for Image Recognition

Image Classification CNN Microsoft 2015

The paper that proved depth is achievable with the right structural prior. We trace a 224×224 image through ResNet-50's skip connections, bottleneck blocks, and projection shortcuts — explaining why simply adding the input to the output eases gradient flow, fixes the degradation problem that stalls plain deep networks, and enables 152-layer networks.
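
The residual trick is small enough to show inline. A minimal PyTorch sketch of a bottleneck block with an identity (or projection) shortcut, using ResNet-50's stage-2 channel sizes; the real network adds a stem and four stages of these:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, plus the shortcut addition."""
    def __init__(self, in_ch, mid_ch, stride=1):
        super().__init__()
        out_ch = mid_ch * 4
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        # Projection shortcut only when the shape changes; identity otherwise.
        if stride == 1 and in_ch == out_ch:
            self.shortcut = nn.Identity()
        else:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        return torch.relu(self.body(x) + self.shortcut(x))  # the residual addition

x = torch.randn(1, 256, 56, 56)       # a stage-2 feature map in ResNet-50
print(Bottleneck(256, 64)(x).shape)   # torch.Size([1, 256, 56, 56])
```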

Read walkthrough →

Vision Transformer (ViT): An Image is Worth 16x16 Words

Image Classification Transformer Google 2020

How patch embeddings, positional encodings, and multi-head self-attention replace convolutions entirely. We trace a 224×224 image through ViT-Base/16 — splitting into 196 patches, the [CLS] token, 12 encoder layers, and why ViT needs massive data to beat CNNs.
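
The patchify step is the part that surprises people, so here it is as a shape trace. A PyTorch sketch (a strided conv is the standard way to implement 16×16 patch projection; the [CLS] token and positional encodings are learned parameters in the real model):

```python
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)  # 16x16 patches -> 768-d

img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img).flatten(2).transpose(1, 2)  # (1, 196, 768): 14*14 patches
cls_token = torch.zeros(1, 1, 768)                    # learned in the real model
tokens = torch.cat([cls_token, tokens], dim=1)        # (1, 197, 768) with [CLS]
pos_embed = torch.zeros(1, 197, 768)                  # learned positional encodings
x = tokens + pos_embed                                # input to the 12 encoder layers
print(x.shape)                                        # torch.Size([1, 197, 768])
```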

Read walkthrough →

YOLOX: Exceeding YOLO Series in 2021

Object Detection Anchor-Free Megvii 2021

A layer-by-layer walkthrough of YOLOX — the anchor-free object detector that introduced decoupled heads and SimOTA label assignment. We trace a 640×640 image from raw pixels through the CSPDarknet backbone, PAFPN neck, and decoupled detection heads, showing tensor shapes and feature maps at every stage.
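
The decoupled head is the easiest piece to show in code. A schematic single-level head (channel widths and activation here are assumptions; YOLOX stacks more conv layers per branch):

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Shared stem, then separate classification and regression branches."""
    def __init__(self, in_ch=256, num_classes=80):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, 256, 1)
        self.cls_branch = nn.Sequential(nn.Conv2d(256, 256, 3, padding=1), nn.SiLU(),
                                        nn.Conv2d(256, num_classes, 1))
        self.reg_branch = nn.Sequential(nn.Conv2d(256, 256, 3, padding=1), nn.SiLU(),
                                        nn.Conv2d(256, 4 + 1, 1))  # box (4) + objectness

    def forward(self, feat):
        feat = self.stem(feat)
        return self.cls_branch(feat), self.reg_branch(feat)

p3 = torch.randn(1, 256, 80, 80)     # the stride-8 feature map for a 640x640 input
cls_out, reg_out = DecoupledHead()(p3)
print(cls_out.shape, reg_out.shape)  # (1, 80, 80, 80) and (1, 5, 80, 80)
```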

Read walkthrough →

The DETR Family: From Transformers to Real-Time Detection

Object Detection Transformer DETR → RT-DETR → RF-DETR

Three generations of detection Transformers in one walkthrough. DETR (2020) eliminated anchors and NMS with set prediction. RT-DETR (2023) made it real-time with hybrid encoders. RF-DETR (2025) pushed past 60 AP on COCO using a DINOv2 backbone and neural architecture search.
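
DETR's "no NMS" claim rests on one-to-one bipartite matching during training. A sketch with a deliberately simplified cost (the real matcher adds generalized IoU); the `match` function and its signature are ours, not the paper's code:

```python
import torch
from scipy.optimize import linear_sum_assignment

def match(pred_boxes, pred_logits, gt_boxes, gt_labels):
    prob = pred_logits.softmax(-1)                     # (N, num_classes)
    cost_cls = -prob[:, gt_labels]                     # (N, M) class cost
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)  # (N, M) L1 box distance
    cost = (cost_cls + cost_box).detach().numpy()
    return linear_sum_assignment(cost)                 # optimal one-to-one assignment

pred_boxes = torch.rand(100, 4)     # 100 queries, cxcywh in [0, 1]
pred_logits = torch.randn(100, 92)
rows, cols = match(pred_boxes, pred_logits, torch.rand(3, 4), torch.tensor([1, 5, 7]))
print(rows, cols)                   # 3 matched (query, ground-truth) pairs
```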

Read walkthrough →

Segment Anything: SAM & SAM 2

Segmentation Foundation Model Meta 2023/2024

Meta's promptable segmentation foundation models. SAM (2023) segments any object from a point click using a ViT-H encoder and lightweight mask decoder. SAM 2 (2024) extends to video with memory attention and streaming architecture — prompt a single frame and the model tracks the object through the entire video.
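
Point-prompting is a three-line affair with Meta's released `segment_anything` package. A usage sketch (the checkpoint path and the dummy image are placeholders):

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a real RGB image
predictor.set_image(image)                       # heavy ViT-H encoder runs once

masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),         # one foreground click
    point_labels=np.array([1]),                  # 1 = foreground
    multimask_output=True,                       # decoder proposes 3 candidate masks
)
print(masks.shape, scores)                       # (3, 480, 640) boolean masks
```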

Read walkthrough →

FoundationStereo & Fast-FoundationStereo

Stereo Depth Foundation Model NVIDIA CVPR 2025/2026

NVIDIA's foundation model for stereo depth estimation — and its 10× faster successor. FoundationStereo uses a side-tuning adapter pairing a CNN with a frozen DepthAnythingV2 ViT, trained on 1M synthetic pairs. Fast-FoundationStereo compresses it via knowledge distillation, NAS, and structured pruning for real-time robotics.
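
Side-tuning is easier to grasp in code than in prose. A heavily simplified sketch of the general pattern (shapes, the fusion layer, and the ViT's output format are our assumptions, not FoundationStereo's actual design): a frozen monocular depth ViT supplies priors while a small trainable CNN adapts them.

```python
import torch
import torch.nn as nn

class SideTunedEncoder(nn.Module):
    def __init__(self, frozen_vit, vit_dim=1024, cnn_dim=128):
        super().__init__()
        self.vit = frozen_vit.eval()
        for p in self.vit.parameters():
            p.requires_grad = False              # DepthAnythingV2 stays frozen
        self.cnn = nn.Sequential(nn.Conv2d(3, cnn_dim, 7, stride=4, padding=3), nn.ReLU(),
                                 nn.Conv2d(cnn_dim, cnn_dim, 3, padding=1))
        self.fuse = nn.Conv2d(vit_dim + cnn_dim, cnn_dim, 1)  # trainable adapter

    def forward(self, img):
        with torch.no_grad():
            vit_feat = self.vit(img)             # (B, vit_dim, H/4, W/4), assumed
        return self.fuse(torch.cat([vit_feat, self.cnn(img)], dim=1))
```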

Read walkthrough →

FastVLM: Efficient Vision Encoding for Vision Language Models

Vision-Language Efficient Inference Apple CVPR 2025

Apple's fast vision-language model achieves 85× faster time-to-first-token than LLaVA-OneVision through FastViTHD — a hybrid convolutional-transformer encoder that natively produces fewer tokens without pruning. With a 125M-parameter encoder that is 3.4× smaller than comparable alternatives, FastVLM enables real-time on-device VLM inference on iPhone, iPad, and Mac.

Read walkthrough →

JEPA Family: Joint-Embedding Predictive Architectures (I-JEPA, V-JEPA, MC-JEPA)

Self-Supervised Predictive Architecture Meta 2023–2024

Yann LeCun’s predict-in-feature-space recipe, in three flavors. I-JEPA (2023) masks image blocks and predicts their embeddings from a context block — no augmentations, no pixel reconstruction. V-JEPA (2024) extends this to spatio-temporal video tubes, matching or beating pixel-reconstruction MAE on frozen-backbone action recognition. MC-JEPA (2023) shares a single encoder across content (dense JEPA) and motion (self-supervised optical flow) for a model that does both at once.
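
The shared recipe across all three fits in a few lines. A sketch of the I-JEPA objective (the encoder and predictor signatures are our assumptions; the key point is that the regression target lives in feature space, not pixel space):

```python
import torch
import torch.nn.functional as F

def ijepa_loss(context_encoder, target_encoder, predictor, x, ctx_mask, tgt_mask):
    z_ctx = context_encoder(x, ctx_mask)        # embeddings of the visible context
    with torch.no_grad():                       # EMA teacher sees the full image
        z_tgt = target_encoder(x)[:, tgt_mask]  # embeddings of the masked blocks
    z_pred = predictor(z_ctx, tgt_mask)         # predict them from context alone
    return F.mse_loss(z_pred, z_tgt)            # regression in feature space, no pixels
```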

Read walkthrough →

DINO Family: Self-Supervised Vision Transformers (v1 → v2 → v3)

Self-Supervised Vision Foundation Meta 2021–2025

Three generations of self-distillation without labels. DINOv1 (2021) discovered that ViT attention maps naturally segment objects. DINOv2 (2023) scaled to 142M curated images with a combined DINO+iBOT loss. DINOv3 (2025) pushed to 7B parameters and 1.7B images with Gram Anchoring — producing universal visual features that work across classification, segmentation, depth, and detection without fine-tuning.
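
The v1 loss that started it all is compact enough to show. A sketch (temperatures match the paper's defaults; in the real recipe the center is a running EMA statistic and the teacher weights are an EMA of the student):

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, t_s=0.1, t_t=0.04):
    s = F.log_softmax(student_out / t_s, dim=-1)
    t = F.softmax((teacher_out - center) / t_t, dim=-1).detach()  # sharpened, centered
    return -(t * s).sum(dim=-1).mean()   # cross-entropy between teacher and student

student_out = torch.randn(8, 65536)      # 8 crops through the projection head
teacher_out = torch.randn(8, 65536)
center = teacher_out.mean(dim=0)         # an EMA in the real recipe
print(dino_loss(student_out, teacher_out, center))
```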

Read walkthrough →

Robotics & Embodied AI

Behavior Cloning: From PilotNet to GR00T N1

Behavior Cloning Imitation Learning NVIDIA 2025

The four-decade arc of behavior cloning — the simplest idea in robotics, and the one the field has spent the longest learning to do well. We start with what BC is (PilotNet on 72 hours of driving data), explain why it fails naively (covariate shift, O(ε T²) compounding), and trace the fixes: DAgger's iterative expert queries, ACT's action chunking, and Diffusion Policy's multimodal action distributions — culminating in NVIDIA's GR00T N1, a dual-system humanoid foundation model combining an Eagle-2 VLM with a Diffusion Transformer action head.
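
Before the walkthrough's fixes, it helps to see how little vanilla BC is. A sketch with ACT-style chunking bolted on (dimensions are arbitrary): predicting the next k expert actions instead of one is the entire chunking idea.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, chunk = 64, 7, 16
policy = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                       nn.Linear(256, act_dim * chunk))  # one forward pass, k actions
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

obs = torch.randn(32, obs_dim)              # batch of expert observations
expert = torch.randn(32, chunk, act_dim)    # next 16 expert actions for each

pred = policy(obs).view(32, chunk, act_dim)
loss = F.mse_loss(pred, expert)             # plain supervised regression on actions
loss.backward()
opt.step()
```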

Read walkthrough →

π₀ (Pi-Zero): A Vision-Language-Action Flow Model for General Robot Control

VLA Flow Matching Physical Intelligence 2024

A 3.3B-parameter generalist robot policy that combines PaliGemma's vision-language understanding with flow matching for continuous action generation. Trained on 10,000+ hours across 7 robot platforms and 68 tasks, π₀ generates 50-timestep action chunks at up to 50 Hz — enabling dexterous manipulation tasks like laundry folding, box assembly, and egg packing from a single model.
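
Flow matching sounds exotic, but the training loss is tiny. A schematic version for action chunks (the straight-line path and uniform time sampling are the textbook variant; `velocity_net` and its conditioning interface are our assumptions, not π₀'s exact design):

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_net, obs_emb, actions):  # actions: (B, 50, act_dim)
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, 1)               # random time in [0, 1]
    x_t = (1 - t) * noise + t * actions                  # point on the straight path
    target_v = actions - noise                           # constant velocity along it
    pred_v = velocity_net(x_t, t, obs_emb)               # conditioned on observations
    return F.mse_loss(pred_v, target_v)                  # regress the velocity field
```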

Read walkthrough →

Gemini Robotics: Bringing AI into the Physical World

VLA Foundation Model Google DeepMind 2025

Google's family of embodied AI models built on Gemini. The Robotics-ER model adds spatial reasoning, 3D detection, and grasp prediction to vision-language understanding. The Robotics VLA uses a cloud backbone with an on-robot action decoder for 50 Hz control — achieving 79% success on long-horizon dexterous tasks like origami folding and adapting to new embodiments including humanoid robots.

Read walkthrough →

SAFE: Multitask Failure Detection for Vision-Language-Action Models

Safety Failure Detection 2025

When robots fail, you need to know — fast. SAFE monitors VLA internal features to detect task failures without task-specific detectors. Using MLP or LSTM heads on hidden states plus conformal prediction for calibrated thresholds, SAFE generalizes from seen to unseen tasks — achieving 84% detection on novel tasks across π₀, OpenVLA, and π₀-FAST on both simulated and real Franka Panda robots.
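
The conformal piece is the most transferable idea, so here it is in isolation. A sketch of split conformal calibration for a failure score (variable names are ours): the threshold is a corrected quantile of scores from held-out successful rollouts, which bounds the false-alarm rate at roughly alpha.

```python
import numpy as np

def conformal_threshold(calib_scores, alpha=0.1):
    n = len(calib_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n        # finite-sample correction
    return np.quantile(calib_scores, min(q, 1.0))

calib = np.random.rand(200)         # failure scores on successful calibration rollouts
tau = conformal_threshold(calib)
print("flag failure:", 0.97 > tau)  # at test time, score > tau raises the alarm
```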

Read walkthrough →

Reinforcement Learning

PPO: Proximal Policy Optimization

Policy Gradient On-Policy OpenAI 2017

The workhorse of modern RL. PPO simplifies TRPO's constrained optimization with a clipped surrogate objective that prevents destructive policy updates. We trace the actor-critic architecture, GAE advantage estimation, and the elegant clipping mechanism that makes PPO both stable and easy to implement.
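
The clipping mechanism in isolation, as a sketch (per-sample log-probs and advantages are assumed precomputed):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)            # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()      # pessimistic bound, negated

logp_new = torch.randn(64, requires_grad=True)
loss = ppo_clip_loss(logp_new, torch.randn(64), torch.randn(64))
loss.backward()
```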

Read walkthrough →

TD3: Twin Delayed Deep Deterministic Policy Gradient

Actor-Critic Off-Policy Fujimoto 2018

Three tricks to fix DDPG's overestimation bias: twin critics with clipped double Q-learning, delayed policy updates, and target policy smoothing. We trace state through the actor and twin critic networks, showing how each technique stabilizes continuous control.
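
All three tricks meet in the target computation. A sketch (hyperparameters follow the paper's defaults, but the function interface and network handles are ours):

```python
import torch

def td3_target(r, s2, done, actor_tgt, q1_tgt, q2_tgt,
               gamma=0.99, sigma=0.2, noise_clip=0.5, act_limit=1.0):
    a2 = actor_tgt(s2)
    noise = (torch.randn_like(a2) * sigma).clamp(-noise_clip, noise_clip)
    a2 = (a2 + noise).clamp(-act_limit, act_limit)  # target policy smoothing
    q = torch.min(q1_tgt(s2, a2), q2_tgt(s2, a2))   # clipped double Q-learning
    return r + gamma * (1 - done) * q               # delayed updates happen outside
```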

Read walkthrough →

SAC: Soft Actor-Critic

Maximum Entropy Off-Policy Haarnoja 2018

Entropy-regularized RL for robust continuous control. SAC maximizes both reward and policy entropy — encouraging exploration and learning multi-modal behaviors. We walk through the stochastic actor, twin critics, reparameterization trick, and automatic temperature tuning that make SAC the go-to off-policy algorithm.
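
The entropy bonus shows up in exactly one place in the critic update. A sketch of the soft Bellman target (`policy.sample` returning a reparameterized action and its log-prob is an assumed interface):

```python
import torch

def sac_target(r, s2, done, policy, q1_tgt, q2_tgt, alpha=0.2, gamma=0.99):
    a2, logp2 = policy.sample(s2)                  # reparameterized action + log-prob
    q = torch.min(q1_tgt(s2, a2), q2_tgt(s2, a2))  # twin critics, pessimistic min
    return r + gamma * (1 - done) * (q - alpha * logp2)  # entropy bonus in the target
```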

Read walkthrough →

Applied Transformers

Multimodal Transformers: Cross-Attention, Unified Embeddings & Contrastive Learning

Multimodal CLIP / LLaVA / Gemini Applied Guide

Three paradigms for fusing vision, language, audio, and actions. Contrastive two-tower models (CLIP/SigLIP) for retrieval, cross-attention fusion (Flamingo/LLaVA) for grounded generation, and unified embeddings (Gemini/GPT-4V) for any-to-any reasoning — with decision frameworks for when to use each.
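
The contrastive two-tower objective is the simplest of the three paradigms, so here it is as a sketch (embedding dims and temperature are illustrative):

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(len(logits))             # diagonal = matching pairs
    return (F.cross_entropy(logits, labels) +      # image -> text and
            F.cross_entropy(logits.t(), labels)) / 2  # text -> image, averaged

print(clip_loss(torch.randn(8, 512), torch.randn(8, 512)))
```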

Read walkthrough →

Encoder-Only Transformers: BERT, DeBERTa & Bidirectional Understanding

Encoder-Only Classification / NLU Applied Guide

When bidirectional attention is all you need. Encoder-only architectures for classification, NER, semantic similarity, and retrieval — with visual architecture diagrams, pretraining objective breakdowns (MLM, RTD), and practical guidance on when BERT-family models beat LLMs.
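
MLM target construction, as a sketch (token id 103 is [MASK] in BERT's uncased vocab; the real recipe also keeps or randomizes a fraction of the selected tokens):

```python
import torch

MASK_ID, IGNORE = 103, -100                 # [MASK] id; cross-entropy ignore index
tokens = torch.randint(1000, 30000, (1, 128))
selected = torch.rand(tokens.shape) < 0.15  # pick ~15% of positions

labels = tokens.clone()
labels[~selected] = IGNORE                  # loss is computed only on masked slots
inputs = tokens.clone()
inputs[selected] = MASK_ID                  # replace with [MASK]

# later: F.cross_entropy(logits.view(-1, vocab), labels.view(-1), ignore_index=-100)
```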

Read walkthrough →

Decoder-Only Transformers: GPT, LLaMA & Autoregressive Generation

Decoder-Only Generation / Reasoning Applied Guide

The architecture behind modern LLMs. Causal attention masks, KV caching, GQA, RoPE, SwiGLU, and Mixture of Experts — with the full pretraining → SFT → RLHF pipeline and guidance on when decoder-only models dominate.
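
The causal mask is one `tril` call. A sketch of how future positions receive exactly zero attention weight:

```python
import torch

T = 6
mask = torch.tril(torch.ones(T, T)).bool()         # True where attention is allowed
scores = torch.randn(T, T)                         # raw attention scores, one head
scores = scores.masked_fill(~mask, float("-inf"))  # block every j > i
attn = scores.softmax(dim=-1)                      # future tokens get zero weight
print(attn[0])                                     # token 0 attends only to itself
```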

Read walkthrough →

Transformer Distillation: Student-Teacher Methods for Compressing LLMs

Distillation Compression / Deployment Applied Guide

Compressing large language models into smaller, faster students. Logit distillation, chain-of-thought distillation (Orca), instruction tuning as distillation (Alpaca/Vicuna), progressive distillation, on-policy GKD — with practical recipes and hyperparameter guidance.
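
Logit distillation is the baseline every other method builds on. A sketch of the Hinton-style loss (the T² factor keeps gradient magnitudes comparable across temperatures):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=2.0):
    s = F.log_softmax(student_logits / T, dim=-1)  # softened student
    t = F.softmax(teacher_logits / T, dim=-1)      # softened teacher target
    return F.kl_div(s, t, reduction="batchmean") * T * T

print(kd_loss(torch.randn(4, 32000), torch.randn(4, 32000)))
```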

Read walkthrough →