Deep Residual Learning for Image Recognition
1 — The Problem to Solve
Image classification is the task of assigning a single label to an image: "this is a cat," "this is a car," "this is a mushroom." Before ResNet, the computer vision community had a clear recipe: stack more convolutional layers to learn more complex features. But there was a surprising problem.
The Degradation Problem
By 2015, researchers had found that simply making networks deeper could make accuracy worse, and not because of overfitting: training itself broke down. A 56-layer plain network had higher error than a 20-layer network even on the training set. A major culprit is gradient flow: gradients either vanish (shrink to near-zero) or explode as they propagate back through dozens of layers.
What the Model Receives and Returns
Input: An RGB image resized to 224 × 224 × 3.
Output: A probability distribution over 1,000 ImageNet classes. The highest-scoring class is the prediction.
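As a concrete sketch of the output format, here is the final softmax step with random numbers standing in for real model logits (NumPy assumed; no actual network involved):

```python
import numpy as np

# Toy stand-in for the model's final layer output: 1,000 scores, one per ImageNet class.
rng = np.random.default_rng(0)
logits = rng.normal(size=1000)

# Softmax turns logits into a probability distribution.
# Subtracting the max first is the standard numerical-stability trick.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

prediction = int(np.argmax(probs))  # highest-scoring class index is the prediction
print(prediction, probs.sum())
```

Because softmax is monotonic, the arg-max of the probabilities is the same as the arg-max of the raw logits.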
2 — Architecture Overview
ResNet comes in several depths (18, 34, 50, 101, 152 layers), but they all share the same skeleton: an initial convolution and pooling, then four groups of residual blocks, then global average pooling and a classifier.
3 — Layer-by-Layer Walkthrough
Let's trace a single 224 × 224 × 3 image through ResNet-50.
1 Initial Convolution and Pooling
Conv1 + BatchNorm + ReLU + MaxPool
The first layer is a large 7×7 convolution with 64 filters and stride 2, followed by batch normalization and ReLU. This immediately halves spatial dimensions to 112×112×64. A 3×3 max pooling with stride 2 then halves again to 56×56×64.
This aggressive 4× spatial downsampling happens before any residual blocks — it reduces computation for all subsequent stages.
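The 224 → 112 → 56 shrinkage follows directly from the standard convolution output-size formula. A quick sanity check in plain Python (assuming the usual padding of 3 for the 7×7 conv and 1 for the 3×3 pool):

```python
def conv_out(size, kernel, stride, padding):
    """Standard convolution/pooling output-size formula (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

# Conv1: 7x7, stride 2, padding 3, on a 224-pixel side
after_conv1 = conv_out(224, kernel=7, stride=2, padding=3)  # 112
# MaxPool: 3x3, stride 2, padding 1
after_pool = conv_out(after_conv1, kernel=3, stride=2, padding=1)  # 56

print(after_conv1, after_pool)  # 112 56
```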
2 The Residual Block (The Core Innovation)
Basic Block (ResNet-18/34)
In the shallower ResNets (18 and 34), each residual block contains two 3×3 convolutions. The input is added directly to the output — this is the skip connection.
Why does this help? Without the skip, the stacked layers must learn the entire desired mapping H(x) from scratch. With a skip connection, they only need to learn the residual F(x) = H(x) - x; if the best thing a block can do is nothing, that means learning F(x) = 0, which is much easier to optimize. The network defaults to "change nothing" and only learns deviations from that.
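The "defaults to identity" behavior can be seen in a minimal NumPy sketch, with two small dense layers standing in for the block's two convolutions (a simplification, not the real architecture). With near-zero weights, F(x) is tiny and the block behaves like the identity (up to the final ReLU):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Stand-ins for the block's two weight layers, initialized near zero
# so the residual F(x) starts out close to zero.
d = 8
W1 = rng.normal(scale=0.01, size=(d, d))
W2 = rng.normal(scale=0.01, size=(d, d))

def residual_block(x):
    f = relu(x @ W1) @ W2  # F(x): the residual the layers must learn
    return relu(f + x)     # skip connection: add the input straight back

x = rng.normal(size=d)
y = residual_block(x)
# F(x) is near zero, so the block output is approximately relu(x): identity-like.
print(np.abs(y - relu(x)).max())
```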
Bottleneck Block (ResNet-50/101/152)
For deeper ResNets, a "bottleneck" design reduces computation. Instead of two 3×3 convolutions, it uses three layers: a 1×1 conv to reduce channels, a 3×3 conv to process spatial features, and a 1×1 conv to expand channels back. The skip connection still adds the input to the output.
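The computational saving is easy to verify by counting weights. Comparing a hypothetical two-3×3-conv block at 256 channels against the actual 256 → 64 → 256 bottleneck (biases and batch-norm parameters ignored):

```python
def conv_params(c_in, c_out, k):
    """Weight count of a k x k convolution, ignoring biases."""
    return c_in * c_out * k * k

# Hypothetical basic block at this width: two 3x3 convs, 256 -> 256 -> 256.
basic = conv_params(256, 256, 3) + conv_params(256, 256, 3)

# Bottleneck: 1x1 reduce to 64, 3x3 at 64 channels, 1x1 expand back to 256.
bottleneck = (conv_params(256, 64, 1)
              + conv_params(64, 64, 3)
              + conv_params(64, 256, 1))

print(basic, bottleneck)  # 1179648 69632
```

The bottleneck uses roughly 17× fewer weights at this width, which is what makes 50+ layer networks affordable.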
Projection Shortcut (When Dimensions Change)
When moving between stages, spatial dimensions halve and channels double. The skip connection can't simply add 56×56×256 to 28×28×512 — they don't match. The solution: a 1×1 convolution with stride 2 on the skip path to project the input to the right shape.
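A 1×1 convolution with stride 2 decomposes into two simple operations: sample every other spatial position, then mix channels with a single matrix. A NumPy sketch of the stage-2 → stage-3 projection (channels-last layout, random weights standing in for the learned projection):

```python
import numpy as np

rng = np.random.default_rng(0)

# Input feature map in (H, W, C) layout, as at the end of stage 2.
x = rng.normal(size=(56, 56, 256))

# 1x1 conv, stride 2: keep every other spatial position,
# then apply one channel-mixing matrix at each remaining position.
W = rng.normal(scale=0.01, size=(256, 512))  # 1x1 conv weights, 256 -> 512 channels
shortcut = x[::2, ::2, :] @ W                # stride-2 sampling + channel projection

print(shortcut.shape)  # (28, 28, 512)
```

The projected tensor now matches the block's main-path output shape, so the element-wise addition is well-defined.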
3 The Four Stages
Stage 2 (conv2_x)
Three bottleneck blocks. The first block expands channels from 64 to 256 (using a projection shortcut). The remaining two blocks maintain 56×56×256. No spatial downsampling within this stage — that happened in the initial max pool.
Stage 3 (conv3_x)
Four bottleneck blocks. The first block uses stride 2 to halve spatial dimensions and a projection shortcut to double channels. Features at this stage capture mid-level patterns — object parts, textures, and spatial relationships.
Stage 4 (conv4_x)
Six bottleneck blocks — the most blocks of any stage. Again, stride 2 on the first block. This is where most of the network's capacity lives. Features represent high-level object concepts.
Stage 5 (conv5_x)
Three bottleneck blocks with stride 2 on the first. The output is a highly compressed 7×7 feature map with 2,048 channels: each spatial position summarizes roughly a 32×32-pixel patch of the original image (and, because convolutions overlap, its effective receptive field is larger still).
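The four stages can be summarized in a few lines of Python that replay the shape bookkeeping above (block counts and widths as given for ResNet-50):

```python
# Spatial side length and channel width after the initial conv + max pool.
size, channels = 56, 64

stages = [  # (name, number of blocks, output channels, stride of first block)
    ("conv2_x", 3, 256, 1),
    ("conv3_x", 4, 512, 2),
    ("conv4_x", 6, 1024, 2),
    ("conv5_x", 3, 2048, 2),
]

for name, blocks, out_ch, stride in stages:
    size //= stride      # only the first block of a stage downsamples
    channels = out_ch
    print(f"{name}: {blocks} blocks -> {size}x{size}x{channels}")
```

The final line reproduces the 7×7×2048 tensor handed to the classification head.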
4 Classification Head
Global Average Pooling + Fully Connected
Global Average Pooling takes the 7×7×2048 feature map and averages each channel across all spatial positions, producing a 2048-dimensional vector. This is far more parameter-efficient than the fully connected layers used in earlier networks like VGG.
A single fully connected layer maps the 2,048 features to 1,000 class logits. Softmax converts these to probabilities.
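Both head operations are one line each in NumPy (random weights standing in for the learned fully connected layer):

```python
import numpy as np

rng = np.random.default_rng(0)

features = rng.normal(size=(7, 7, 2048))       # stage-5 output, channels-last
pooled = features.mean(axis=(0, 1))            # global average pool -> (2048,)

W = rng.normal(scale=0.01, size=(2048, 1000))  # fully connected layer weights
logits = pooled @ W                            # 1,000 class logits

print(pooled.shape, logits.shape)  # (2048,) (1000,)
```

Note the parameter economy: this head has about 2M weights, whereas flattening 7×7×2048 into a VGG-style fully connected layer would need 7 × 7 × 2048 × 1000 ≈ 100M.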
4 — Why Skip Connections Work
Gradient Flow
During backpropagation, the skip connection creates a "gradient highway." Without it, gradients must flow through every weight layer, and they shrink multiplicatively — 50 layers of 0.9× multiplication gives 0.005 (effectively zero). With skip connections, gradients can flow directly through the addition operation, reaching early layers intact.
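The multiplicative shrinkage is easy to verify, and the skip-connection fix is visible in the calculus: for y = x + F(x), the local Jacobian is I + dF/dx, so the identity term carries gradient through untouched no matter how small dF/dx gets.

```python
# Fifty multiplications by 0.9, as when each layer attenuates the gradient slightly.
g = 1.0
for _ in range(50):
    g *= 0.9
print(round(g, 5))  # 0.00515 -- effectively zero

# With a skip connection y = x + F(x), backprop through the block gives
#   dy/dx = 1 + dF/dx
# The "1" is the gradient highway: it survives even if dF/dx is tiny.
```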
Ensemble Effect
A ResNet can be viewed as an ensemble of many shorter networks. Each block can either transform the features or pass them through unchanged. With n blocks, there are 2^n possible paths through the network — it implicitly trains an exponential number of sub-networks simultaneously.
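For ResNet-50 specifically, the path count works out as follows (each residual block contributes a binary choice: take the transform or take the skip):

```python
# ResNet-50's residual blocks per stage: conv2_x through conv5_x.
n_blocks = 3 + 4 + 6 + 3            # 16 blocks total
n_paths = 2 ** n_blocks
print(n_blocks, n_paths)  # 16 65536
```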
5 — Complete Tensor Shape Summary
For ResNet-50 with a 224×224 input:
| Stage | Layer | Output Shape | Blocks | Notes |
|---|---|---|---|---|
| Input | Image | 224×224×3 | — | RGB, normalized |
| Conv1 | 7×7 conv, stride 2 | 112×112×64 | — | + BN + ReLU |
| Pool | 3×3 max pool, stride 2 | 56×56×64 | — | |
| Stage 2 | conv2_x | 56×56×256 | 3 | Bottleneck, no downsample |
| Stage 3 | conv3_x | 28×28×512 | 4 | Stride 2 on first block |
| Stage 4 | conv4_x | 14×14×1024 | 6 | Stride 2 on first block |
| Stage 5 | conv5_x | 7×7×2048 | 3 | Stride 2 on first block |
| Head | AvgPool + FC | 1000 | — | Softmax probabilities |
6 — Results and the ResNet Family
| Model | Layers | Params | Top-1 Error (%) | Top-5 Error (%) |
|---|---|---|---|---|
| ResNet-18 | 18 | 11.7M | 27.9 | 9.6 |
| ResNet-34 | 34 | 21.8M | 25.0 | 7.8 |
| ResNet-50 | 50 | 25.6M | 22.9 | 6.7 |
| ResNet-101 | 101 | 44.5M | 21.8 | 6.0 |
| ResNet-152 | 152 | 60.2M | 21.4 | 5.7 |
An ensemble of ResNets won 1st place in the ILSVRC 2015 classification challenge with a 3.57% top-5 error rate. Perhaps more importantly, ResNet became the default backbone for virtually every downstream task — object detection (Faster R-CNN), segmentation (Mask R-CNN, DeepLab), and even as the image encoder in models like CLIP and DETR.
7 — References & Further Reading
- Deep Residual Learning for Image Recognition — He et al., 2015 (original paper)
- Identity Mappings in Deep Residual Networks — He et al., 2016 (ResNet v2)
- PyTorch Official ResNet Implementation
- Papers With Code: ResNet