Model Walkthroughs
Deep learning papers are dense. These walkthroughs break landmark models down layer by layer — with diagrams, example inputs, and plain-language explanations of what each component actually does. Inspired by Papers With Code, but built for the student seeing these architectures for the first time.
Walkthroughs
ResNet: Deep Residual Learning for Image Recognition
The paper that proved extreme depth is trainable given the right structural prior. We trace a 224×224 image through ResNet-50's skip connections, bottleneck blocks, and projection shortcuts, explaining why simply adding the input to the output keeps gradients flowing and enables 152-layer networks.
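The identity shortcut is simple enough to demonstrate numerically. A minimal sketch, with dense layers standing in for the 1×1/3×3/1×1 convolutions and channel widths borrowed from ResNet-50's first stage (the near-zero weights are an assumption chosen to make the point):

```python
import numpy as np

rng = np.random.default_rng(0)

def bottleneck(x, w1, w2, w3):
    """Toy bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand, modeled as dense maps."""
    h = np.maximum(w1 @ x, 0)   # 1x1 conv (reduce 256 -> 64 channels), ReLU
    h = np.maximum(w2 @ h, 0)   # 3x3 conv (64 -> 64)
    return w3 @ h               # 1x1 conv (expand 64 -> 256)

x = rng.standard_normal(256)                 # input features (256 channels)
w1 = rng.standard_normal((64, 256)) * 0.01   # tiny weights, so F(x) is near zero
w2 = rng.standard_normal((64, 64)) * 0.01
w3 = rng.standard_normal((256, 64)) * 0.01

y = bottleneck(x, w1, w2, w3) + x   # residual addition: y = F(x) + x
# With F(x) ~ 0 the block is close to the identity, so stacking many of
# them cannot make the network worse than its shallower counterpart.
print(np.allclose(y, x, atol=1e-2))
```

The shortcut also means the local gradient is F′(x) + I, so the "+ I" term carries the signal backward even when F′(x) is tiny.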
Read walkthrough →
Vision Transformer (ViT): An Image is Worth 16x16 Words
How patch embeddings, positional encodings, and multi-head self-attention replace convolutions entirely. We trace a 224×224 image through ViT-Base/16 — splitting into 196 patches, the [CLS] token, 12 encoder layers, and why ViT needs massive data to beat CNNs.
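The patch arithmetic is easy to check directly. A numpy sketch of the patchify step (shapes only, no learned projection):

```python
import numpy as np

img = np.zeros((224, 224, 3))     # input image (H, W, C)
P = 16                            # ViT-Base/16 patch size

H, W, C = img.shape
n_patches = (H // P) * (W // P)   # 14 * 14 = 196 patches

# Split into non-overlapping P x P patches and flatten each one.
patches = (img.reshape(H // P, P, W // P, P, C)
              .swapaxes(1, 2)
              .reshape(n_patches, P * P * C))
print(patches.shape)              # (196, 768): 196 tokens of dim 16*16*3

# Each flattened patch is linearly projected to the model width (768 for
# ViT-Base), and a learned [CLS] token is prepended: 197 tokens per image.
```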
Read walkthrough →
YOLOX: Exceeding YOLO Series in 2021
A layer-by-layer walkthrough of YOLOX — the anchor-free object detector that introduced decoupled heads and SimOTA label assignment. We trace a 640×640 image from raw pixels through the CSPDarknet backbone, PAFPN neck, and decoupled detection heads, showing tensor shapes and feature maps at every stage.
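The grid shapes follow directly from the FPN strides. A quick sketch for a 640×640 input:

```python
# Feature-map sizes for a 640x640 input at YOLOX's three FPN strides.
strides = (8, 16, 32)
size = 640
grids = [(size // s, size // s) for s in strides]
print(grids)        # [(80, 80), (40, 40), (20, 20)]

# Anchor-free: each grid cell makes exactly one prediction (the decoupled
# head splits it into class and box branches), so the total prediction
# count is just the sum of the grid areas.
n_preds = sum(h * w for h, w in grids)
print(n_preds)      # 8400
```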
Read walkthrough →
The DETR Family: From Transformers to Real-Time Detection
Three generations of detection Transformers in one walkthrough. DETR (2020) eliminated anchors and NMS with set prediction. RT-DETR (2023) made it real-time with hybrid encoders. RF-DETR (2025) pushed past 60 AP on COCO using a DINOv2 backbone and neural architecture search.
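DETR's set prediction hinges on one-to-one bipartite matching between queries and ground-truth boxes. A brute-force sketch on a made-up 3×3 cost matrix (DETR itself solves this with the Hungarian algorithm via scipy's `linear_sum_assignment`):

```python
from itertools import permutations

# Toy matching cost between 3 predicted queries (rows) and 3 ground-truth
# boxes (columns); in DETR this combines a class-probability term with
# L1 and GIoU box terms. Values here are illustrative.
cost = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.6],
    [0.5, 0.4, 0.1],
]

# Brute-force bipartite matching: the one-to-one assignment of queries to
# targets with minimum total cost (fine for tiny examples only).
best = min(permutations(range(3)),
           key=lambda p: sum(cost[i][p[i]] for i in range(3)))
print(best)   # (1, 0, 2): query 0 -> box 1, query 1 -> box 0, query 2 -> box 2
```

Because every query is matched to at most one target, duplicate detections are penalized during training and NMS becomes unnecessary at inference.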
Read walkthrough →
Segment Anything: SAM & SAM 2
Meta's promptable segmentation foundation models. SAM (2023) segments any object from a point click using a ViT-H encoder and a lightweight mask decoder. SAM 2 (2024) extends it to video with memory attention and a streaming architecture: a prompt on a single frame tracks objects through the entire video.
Read walkthrough →
FoundationStereo & Fast-FoundationStereo
NVIDIA's foundation model for stereo depth estimation — and its 10× faster successor. FoundationStereo uses a side-tuning adapter pairing a CNN with a frozen DepthAnythingV2 ViT, trained on 1M synthetic pairs. Fast-FoundationStereo compresses it via knowledge distillation, NAS, and structured pruning for real-time robotics.
Read walkthrough →
Reinforcement Learning
PPO: Proximal Policy Optimization
The workhorse of modern RL. PPO simplifies TRPO's constrained optimization with a clipped surrogate objective that prevents destructive policy updates. We trace the actor-critic architecture, GAE advantage estimation, and the elegant clipping mechanism that makes PPO both stable and easy to implement.
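The clipped surrogate fits in a few lines. A numpy sketch with illustrative log-probabilities and advantages:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """PPO clipped surrogate objective (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)              # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    # Taking the elementwise min removes any incentive to push the ratio
    # outside [1 - eps, 1 + eps]: this is what prevents destructive updates.
    return np.minimum(unclipped, clipped).mean()

# Illustrative batch: one action became much likelier, one much less likely,
# and both had positive advantage.
logp_old = np.array([-1.0, -1.0])
logp_new = np.array([-0.2, -1.8])
adv = np.array([1.0, 1.0])

loss = ppo_clip_loss(logp_new, logp_old, adv)
print(loss)
```

For the first action the ratio (≈2.23) is clipped to 1.2, capping the gain; for the second the min keeps the pessimistic unclipped value (≈0.45).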
Read walkthrough →
TD3: Twin Delayed Deep Deterministic Policy Gradient
Three tricks to fix DDPG's overestimation bias: twin critics with clipped double Q-learning, delayed policy updates, and target policy smoothing. We trace state through the actor and twin critic networks, showing how each technique stabilizes continuous control.
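Two of the three tricks can be sketched in a few lines of numpy (the reward, discount, actions, and Q-values below are illustrative, not from any real critic):

```python
import numpy as np

rng = np.random.default_rng(0)

def td3_target(r, gamma, q1_next, q2_next):
    """Clipped double Q-learning: bootstrap from the smaller twin critic."""
    return r + gamma * np.minimum(q1_next, q2_next)

# Target policy smoothing: clipped Gaussian noise is added to the target
# action before querying the critics, so the value estimate is smoothed
# over a small neighborhood of actions.
a_next = 0.3
noise = np.clip(rng.normal(0, 0.2), -0.5, 0.5)
a_smoothed = np.clip(a_next + noise, -1.0, 1.0)

# Twin target critics evaluated at (s', a_smoothed); suppose one overestimates.
q1, q2 = 5.0, 4.2
y = td3_target(r=1.0, gamma=0.99, q1_next=q1, q2_next=q2)
print(y)   # 1.0 + 0.99 * 4.2 = 5.158: the optimistic critic is ignored
```

The third trick, delayed policy updates, is just a schedule: the actor and target networks update once for every d critic updates (d = 2 in the paper).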
Read walkthrough →
SAC: Soft Actor-Critic
Entropy-regularized RL for robust continuous control. SAC maximizes both reward and policy entropy — encouraging exploration and learning multi-modal behaviors. We walk through the stochastic actor, twin critics, reparameterization trick, and automatic temperature tuning that make SAC the go-to off-policy algorithm.
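The soft Bellman target and the reparameterized action both fit in a short numpy sketch (all numbers below are illustrative):

```python
import numpy as np

def sac_target(r, gamma, q1_next, q2_next, logp_next, alpha=0.2):
    """Soft Bellman target: the entropy bonus -alpha * log pi is added
    to the (clipped double-Q) value of the next state."""
    soft_v = np.minimum(q1_next, q2_next) - alpha * logp_next
    return r + gamma * soft_v

# Reparameterization trick: a = tanh(mu + sigma * eps) with eps ~ N(0, 1),
# so gradients flow through the sampled action back to mu and sigma.
rng = np.random.default_rng(0)
mu, sigma = 0.1, 0.5
a = np.tanh(mu + sigma * rng.standard_normal())

y = sac_target(r=1.0, gamma=0.99, q1_next=3.0, q2_next=2.5, logp_next=-1.2)
print(round(y, 3))   # 1.0 + 0.99 * (2.5 + 0.2 * 1.2) = 3.713
```

In the full algorithm the temperature alpha is itself learned, tuned automatically so the policy's entropy tracks a target value.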
Read walkthrough →