Model Walkthroughs

Papers with Beginner-Friendly Explanations

Deep learning papers are dense. These walkthroughs break landmark models down layer by layer — with diagrams, example inputs, and plain-language explanations of what each component actually does. Inspired by Papers With Code, but built for the student seeing these architectures for the first time.

Philosophy

Every walkthrough follows the same structure: start with the problem the model solves, show the architecture as a whole, then walk through each layer explaining what goes in, what comes out, and why. No hand-waving — if a tensor changes shape, we show it. If a design choice matters, we explain the alternative that was rejected.

Walkthroughs

ResNet: Deep Residual Learning for Image Recognition

Image Classification CNN Microsoft 2015

The paper that proved depth is achievable with the right structural prior. We trace a 224×224 image through ResNet-50's skip connections, bottleneck blocks, and projection shortcuts — explaining why simply adding the input to the output counters the degradation problem that cripples very deep plain networks and enables 152-layer models.
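
The shape bookkeeping can be sketched in a few lines. This is an illustrative sketch using standard convolution arithmetic, not the walkthrough's own code; `conv_out` is a hypothetical helper:

```python
# Trace the spatial size of a 224x224 input through ResNet-50's stages.
# Shapes only -- the residual add (y = F(x) + x) requires these to line up,
# which is why stride-2 stages use a 1x1 projection shortcut.
def conv_out(size, kernel, stride, pad):
    return (size + 2 * pad - kernel) // stride + 1

h = conv_out(224, 7, 2, 3)   # stem: 7x7 conv, stride 2 -> 112
h = conv_out(h, 3, 2, 1)     # 3x3 max pool, stride 2 -> 56
stages = []
for channels, stride in [(256, 1), (512, 2), (1024, 2), (2048, 2)]:
    if stride == 2:
        h = conv_out(h, 3, 2, 1)   # first block of the stage downsamples
    stages.append((h, channels))
print(stages)  # [(56, 256), (28, 512), (14, 1024), (7, 2048)]
```

The 7×7 final feature map is what global average pooling reduces before the classifier.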

Read walkthrough →

Vision Transformer (ViT): An Image is Worth 16x16 Words

Image Classification Transformer Google 2020

How patch embeddings, positional encodings, and multi-head self-attention replace convolutions entirely. We trace a 224×224 image through ViT-Base/16 — splitting into 196 patches, the [CLS] token, 12 encoder layers, and why ViT needs massive data to beat CNNs.
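
The patchify step is just a reshape. A toy numpy sketch (a zero image and zero weights stand in for real data and the learned projection):

```python
import numpy as np

# Split a 224x224x3 image into 16x16 patches, as in ViT-Base/16.
img = np.zeros((224, 224, 3))
P = 16
patches = img.reshape(224 // P, P, 224 // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * 3)   # (196, 768): one vector per patch
# For ViT-Base/16, the flattened patch size (16*16*3 = 768) happens to equal
# the embedding dimension; a learned linear projection maps between them.
W = np.zeros((768, 768))                   # stand-in for the learned projection
cls_token = np.zeros((1, 768))             # stand-in for the learned [CLS] token
tokens = np.concatenate([cls_token, patches @ W])
print(tokens.shape)  # (197, 768): 196 patches + [CLS], fed to 12 encoder layers
```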

Read walkthrough →

YOLOX: Exceeding YOLO Series in 2021

Object Detection Anchor-Free Megvii 2021

A layer-by-layer walkthrough of YOLOX — the anchor-free object detector that introduced decoupled heads and SimOTA label assignment. We trace a 640×640 image from raw pixels through the CSPDarknet backbone, PAFPN neck, and decoupled detection heads, showing tensor shapes and feature maps at every stage.
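
The three FPN levels follow directly from the strides. A quick sketch of the shape arithmetic (illustrative only):

```python
# Feature map sizes for a 640x640 input at YOLOX's three pyramid levels.
strides = [8, 16, 32]
sizes = [640 // s for s in strides]        # [80, 40, 20]
# Anchor-free: each grid cell is one prediction point, so the decoupled
# heads emit box regression, objectness, and class scores per cell.
num_preds = sum(s * s for s in sizes)
print(sizes, num_preds)  # [80, 40, 20] 8400 predictions per image
```

Those 8400 candidate points are what SimOTA assigns labels to during training.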

Read walkthrough →

The DETR Family: From Transformers to Real-Time Detection

Object Detection Transformer DETR → RT-DETR → RF-DETR

Three generations of detection Transformers in one walkthrough. DETR (2020) eliminated anchors and NMS with set prediction. RT-DETR (2023) made it real-time with hybrid encoders. RF-DETR (2025) pushed past 60 AP on COCO using a DINOv2 backbone and neural architecture search.

Read walkthrough →

Segment Anything: SAM & SAM 2

Segmentation Foundation Model Meta 2023/2024

Meta's promptable segmentation foundation models. SAM (2023) segments any object from a point click using a ViT-H encoder and lightweight mask decoder. SAM 2 (2024) extends to video with memory attention and a streaming architecture — a prompt on a single frame is enough to track objects through an entire video.

Read walkthrough →

FoundationStereo & Fast-FoundationStereo

Stereo Depth Foundation Model NVIDIA CVPR 2025/2026

NVIDIA's foundation model for stereo depth estimation — and its 10× faster successor. FoundationStereo uses a side-tuning adapter pairing a CNN with a frozen DepthAnythingV2 ViT, trained on 1M synthetic pairs. Fast-FoundationStereo compresses it via knowledge distillation, NAS, and structured pruning for real-time robotics.

Read walkthrough →

Reinforcement Learning

PPO: Proximal Policy Optimization

Policy Gradient On-Policy OpenAI 2017

The workhorse of modern RL. PPO simplifies TRPO's constrained optimization with a clipped surrogate objective that prevents destructive policy updates. We trace the actor-critic architecture, generalized advantage estimation (GAE), and the elegant clipping mechanism that makes PPO both stable and easy to implement.
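
The clipped objective fits in a few lines. A toy sketch on made-up numbers (not a full trainer; `eps=0.2` is the clip range from the paper):

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s), per sample
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    # Taking the minimum removes the incentive to push the ratio
    # outside [1 - eps, 1 + eps]; negate because optimizers minimize.
    return -np.minimum(unclipped, clipped).mean()

ratio = np.array([0.8, 1.0, 1.5])   # the last sample's policy moved a lot
adv = np.array([1.0, 1.0, 1.0])
print(ppo_clip_loss(ratio, adv))    # 1.5 is clipped to 1.2 -> loss = -1.0
```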

Read walkthrough →

TD3: Twin Delayed Deep Deterministic Policy Gradient

Actor-Critic Off-Policy Fujimoto 2018

Three tricks to fix DDPG's overestimation bias: twin critics with clipped double Q-learning, delayed policy updates, and target policy smoothing. We trace state through the actor and twin critic networks, showing how each technique stabilizes continuous control.
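
The first trick, clipped double Q-learning, is a one-line change to the Bellman target. A toy sketch with made-up critic values:

```python
import numpy as np

def td3_target(reward, q1_next, q2_next, gamma=0.99, done=False):
    # Take the minimum of the twin target critics: a pessimistic estimate
    # that counters DDPG's systematic overestimation.
    q_min = np.minimum(q1_next, q2_next)
    return reward + gamma * (1.0 - float(done)) * q_min

# Critic 1 overestimates (10.0); the min keeps the target grounded at 8.0.
print(td3_target(1.0, q1_next=10.0, q2_next=8.0))  # 1.0 + 0.99 * 8.0 = 8.92
```

Delayed policy updates and target smoothing then keep the actor from exploiting whatever error remains in that estimate.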

Read walkthrough →

SAC: Soft Actor-Critic

Maximum Entropy Off-Policy Haarnoja 2018

Entropy-regularized RL for robust continuous control. SAC maximizes both reward and policy entropy — encouraging exploration and learning multi-modal behaviors. We walk through the stochastic actor, twin critics, reparameterization trick, and automatic temperature tuning that make SAC the go-to off-policy algorithm.
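
Two of those pieces fit in a few lines of toy numpy. The numbers below (`mu`, `log_std`, `alpha`, `q_min`, `log_pi`) are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Reparameterization trick: sample noise outside the graph, so the
# squashed action stays differentiable with respect to mu and log_std.
mu, log_std = 0.1, -0.5
eps = rng.standard_normal()
action = np.tanh(mu + np.exp(log_std) * eps)   # bounded to (-1, 1)

# Soft value target: subtract alpha * log pi so the critic rewards
# entropy (uncertain, exploratory policies) alongside return.
alpha, q_min, log_pi = 0.2, 5.0, -1.3
soft_value = q_min - alpha * log_pi            # 5.0 - 0.2 * (-1.3) = 5.26
print(action, soft_value)
```

Automatic temperature tuning then adjusts `alpha` online to hold the policy near a target entropy instead of fixing it by hand.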

Read walkthrough →