Deep Residual Learning for Image Recognition

Skip Connections That Changed Everything
Tags: Image Classification · CNN · Residual Learning · Skip Connections · Microsoft Research · 2015

1 — The Problem to Solve

Image classification is the task of assigning a single label to an image: "this is a cat," "this is a car," "this is a mushroom." Before ResNet, the computer vision community had a clear recipe: stack more convolutional layers to learn more complex features. But there was a surprising problem.

The Degradation Problem

By 2015, researchers found that simply making networks deeper actually made accuracy worse — not because of overfitting, but because the training itself broke down. A 56-layer network performed worse than a 20-layer network, even on the training set. Gradients either vanished (shrank to near-zero) or exploded as they propagated back through dozens of layers.

[Figure: the degradation problem — a shallow 20-layer network reaches 8.5% error while a deeper 56-layer network reaches 10.3% (worse!); more layers give higher error until skip connections fix it, with ResNet-152 reaching 3.6% error.]
Key Insight: If a deeper network should be at least as good as a shallower one (the extra layers could just learn identity mappings), then the problem isn't capacity — it's optimization. He et al. proposed making the identity mapping the default by adding skip connections, so layers only need to learn the residual (the difference from identity).

What the Model Receives and Returns

Input: An RGB image resized to 224 × 224 × 3.

Output: A probability distribution over 1,000 ImageNet classes. The highest-scoring class is the prediction.

[Figure: a 224×224×3 RGB image enters ResNet, which outputs 1,000 class probabilities — e.g. golden retriever: 0.92, labrador: 0.04, tennis ball: 0.01, …]

2 — Architecture Overview

ResNet comes in several depths (18, 34, 50, 101, 152 layers), but they all share the same skeleton: an initial convolution and pooling, then four groups of residual blocks, then global average pooling and a classifier.

[Figure: ResNet-50 pipeline — Input 224×224×3 → Conv1 (7×7, stride 2, giving 112×112×64) + MaxPool to 56×56 → Stage 2 (56×56, 3 blocks) → Stage 3 (28×28, 4 blocks) → Stage 4 (14×14, 6 blocks) → Stage 5 (7×7, 3 blocks) → AvgPool (7×7 → 1×1) → FC-1000 → Softmax. Each block within a stage contains a skip connection (the key innovation). Total: 50 weighted layers in ResNet-50 (1 conv + (3+4+6+3)×3 bottleneck convs + 1 FC).]

3 — Layer-by-Layer Walkthrough

Let's trace a single 224 × 224 × 3 image through ResNet-50.

1 Initial Convolution and Pooling

Conv1 + BatchNorm + ReLU + MaxPool

224×224×3 → 56×56×64

The first layer is a large 7×7 convolution with 64 filters and stride 2, followed by batch normalization and ReLU. This immediately halves spatial dimensions to 112×112×64. A 3×3 max pooling with stride 2 then halves again to 56×56×64.

[Figure: 224×224×3 input → 7×7 conv, stride 2, 64 filters, + BN + ReLU → 112×112×64 → 3×3 max pool, stride 2, pad 1 → 56×56×64 into Stage 2.]

This aggressive 4× spatial downsampling happens before any residual blocks — it reduces computation for all subsequent stages.
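The stem can be sketched in PyTorch (a minimal sketch; the padding values are chosen to reproduce the shapes in the text):

```python
import torch
import torch.nn as nn

# ResNet stem: 7x7/2 conv + BN + ReLU, then 3x3/2 max pool (224 -> 112 -> 56).
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

x = torch.randn(1, 3, 224, 224)
print(stem[:3](x).shape)  # after conv+BN+ReLU: (1, 64, 112, 112)
print(stem(x).shape)      # after max pool:     (1, 64, 56, 56)
```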

2 The Residual Block (The Core Innovation)

Basic Block (ResNet-18/34)

Two 3×3 convolutions + skip

In the shallower ResNets (18 and 34), each residual block contains two 3×3 convolutions. The input is added directly to the output — this is the skip connection.

[Figure: basic residual block — the input x passes through 3×3 conv + BN + ReLU, then 3×3 conv + BN, while an identity skip connection simply copies x; the two are added (F(x) + x, where F(x) is the learned residual and x the identity) and passed through a final ReLU.]
Why this works: If the optimal transformation is close to identity, a regular network must learn to produce x from scratch. With a skip connection, the layers only need to learn F(x) = 0 (the residual), which is much easier to optimize. The network defaults to "change nothing" and only learns deviations from that.
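A minimal PyTorch sketch of the basic block (channels are assumed constant within the block; the stride-2/projection variant is covered below):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convs plus an identity skip, as in ResNet-18/34 (sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # first half of F(x)
        out = self.bn2(self.conv2(out))           # second half (no ReLU yet)
        return self.relu(out + x)                 # F(x) + x, then final ReLU

block = BasicBlock(64)
y = block(torch.randn(1, 64, 56, 56))
print(y.shape)  # (1, 64, 56, 56) — shape preserved, so the addition is valid
```

Note that the second ReLU is applied after the addition, so the skip path itself stays a pure identity.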

Bottleneck Block (ResNet-50/101/152)

1×1 → 3×3 → 1×1 + skip

For deeper ResNets, a "bottleneck" design reduces computation. Instead of two 3×3 convolutions, it uses three layers: a 1×1 conv to reduce channels, a 3×3 conv to process spatial features, and a 1×1 conv to expand channels back. The skip connection still adds the input to the output.

[Figure: bottleneck residual block (ResNet-50+) — a 256-channel input x passes through a 1×1 conv reducing 256→64, a 3×3 conv on 64 channels, and a 1×1 conv expanding 64→256, plus an identity shortcut (256ch → 256ch); output is ReLU(F(x) + x). The 256→64→64→256 body uses ≈70K parameters versus ≈1.18M for two 3×3 convs on 256 channels.]
Bottleneck savings: By squeezing 256 channels down to 64, doing the expensive 3×3 convolution on only 64 channels, then expanding back to 256, the bottleneck body uses roughly 17× fewer parameters (≈70K) than two 3×3 convolutions on 256 channels directly (≈1.18M).
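The parameter arithmetic can be checked directly (counting only conv weights; BN parameters omitted for clarity):

```python
import torch.nn as nn

def conv_params(m):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in m.parameters())

# Bottleneck body: 1x1 reduce -> 3x3 on 64 channels -> 1x1 expand.
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, 1, bias=False),            # 256*64      = 16,384
    nn.Conv2d(64, 64, 3, padding=1, bias=False),  # 64*64*9     = 36,864
    nn.Conv2d(64, 256, 1, bias=False),            # 64*256      = 16,384
)
# What a basic block at this width would cost: two 3x3 convs on 256 channels.
plain = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1, bias=False),  # 256*256*9 = 589,824
    nn.Conv2d(256, 256, 3, padding=1, bias=False),  # 256*256*9 = 589,824
)

print(conv_params(bottleneck))  # 69632
print(conv_params(plain))       # 1179648 — roughly 17x more
```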

Projection Shortcut (When Dimensions Change)

1×1 conv to match dimensions

When moving between stages, spatial dimensions halve and channels double. The skip connection can't simply add 56×56×256 to 28×28×512 — they don't match. The solution: a 1×1 convolution with stride 2 on the skip path to project the input to the right shape.

[Figure: downsampling transition — the 56×56×256 input goes through a bottleneck block with stride 2 on its 3×3 conv, while the skip path uses a 1×1 conv with stride 2 and 512 filters (a projection shortcut, which changes the shape); the two 28×28×512 outputs are added.]
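A minimal sketch of the projection shortcut's shape math (the 512-filter width follows the Stage 3 transition described above):

```python
import torch
import torch.nn as nn

# The main path halves spatial size and doubles channels, so the skip path
# needs a 1x1 stride-2 projection to produce an addable tensor.
x = torch.randn(1, 256, 56, 56)

projection = nn.Conv2d(256, 512, kernel_size=1, stride=2, bias=False)
shortcut = projection(x)

print(shortcut.shape)  # (1, 512, 28, 28) — now matches the main path's output
```

A plain identity skip would fail here: adding a 56×56×256 tensor to a 28×28×512 one is a shape error.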

3 The Four Stages

Stage 2 (conv2_x)

56×56×64 → 56×56×256

Three bottleneck blocks. The first block expands channels from 64 to 256 (using a projection shortcut). The remaining two blocks maintain 56×56×256. No spatial downsampling within this stage — that happened in the initial max pool.

Stage 3 (conv3_x)

56×56×256 → 28×28×512

Four bottleneck blocks. The first block uses stride 2 to halve spatial dimensions and a projection shortcut to double channels. Features at this stage capture mid-level patterns — object parts, textures, and spatial relationships.

Stage 4 (conv4_x)

28×28×512 → 14×14×1024

Six bottleneck blocks — the most blocks of any stage. Again, stride 2 on the first block. This is where most of the network's capacity lives. Features represent high-level object concepts.

Stage 5 (conv5_x)

14×14×1024 → 7×7×2048

Three bottleneck blocks with stride 2 on the first. The output is a highly compressed 7×7 feature map with 2,048 channels — each spatial position corresponds to a 32×32-pixel patch of the input grid (224 / 7 = 32), and its effective receptive field spans far more of the image than that.
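The shape progression of the four stages can be verified with a few lines of arithmetic (a sketch; `conv_out` is the standard convolution output-size formula):

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

# Stem: 7x7/2 conv (pad 3), then 3x3/2 max pool (pad 1): 224 -> 112 -> 56.
size = conv_out(224, 7, 2, 3)   # 112
size = conv_out(size, 3, 2, 1)  # 56

# Stages 2-5: (blocks, stride of first block). Only stages 3-5 downsample.
for stage, (blocks, stride) in enumerate([(3, 1), (4, 2), (6, 2), (3, 2)], start=2):
    size //= stride                    # stride 2 on the first block halves H, W
    channels = 256 * 2 ** (stage - 2)  # bottleneck output channels: 256..2048
    print(f"Stage {stage}: {size}x{size}x{channels} ({blocks} blocks)")
# Stage 2: 56x56x256 (3 blocks)
# Stage 3: 28x28x512 (4 blocks)
# Stage 4: 14x14x1024 (6 blocks)
# Stage 5: 7x7x2048 (3 blocks)
```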

4 Classification Head

Global Average Pooling + Fully Connected

7×7×2048 → 1000

Global Average Pooling takes the 7×7×2048 feature map and averages each channel across all spatial positions, producing a 2048-dimensional vector. This is far more parameter-efficient than the fully connected layers used in earlier networks like VGG.

A single fully connected layer maps the 2,048 features to 1,000 class logits. Softmax converts these to probabilities.

[Figure: Stage 5 output (7×7×2048) → global average pool (7×7 → 1×1, a 2048-d vector) → FC layer 2048→1000 → softmax over 1,000 class probabilities.]
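A sketch of the head in PyTorch (global average pooling written as a plain mean over the spatial axes):

```python
import torch
import torch.nn as nn

features = torch.randn(1, 2048, 7, 7)  # Stage 5 output
pooled = features.mean(dim=(2, 3))     # global average pool -> (1, 2048)

fc = nn.Linear(2048, 1000)             # the only FC layer: ~2.05M parameters
logits = fc(pooled)                    # (1, 1000) class scores
probs = torch.softmax(logits, dim=1)   # probabilities summing to 1

print(pooled.shape, logits.shape)      # (1, 2048) (1, 1000)
```

The entire head costs about 2048 × 1000 + 1000 ≈ 2.05M parameters; VGG-16's fully connected layers alone account for over 100M.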

4 — Why Skip Connections Work

Gradient Flow

During backpropagation, the skip connection creates a "gradient highway." Without it, gradients must flow through every weight layer, and they shrink multiplicatively — 50 layers of 0.9× multiplication gives 0.005 (effectively zero). With skip connections, gradients can flow directly through the addition operation, reaching early layers intact.

[Figure: gradient flow comparison — without skip connections, each of 50 layers multiplies the gradient by ≈0.9, giving 0.9^50 ≈ 0.005: the gradient vanishes; with skip connections, the identity path carries a factor of 1.0 and the gradient flows freely.]
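A toy experiment (not from the paper) illustrating the gradient-highway claim: 50 small linear layers, with and without residual additions. The layer width and the 0.05 weight scale are arbitrary choices for the demo.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 50 linear layers with small weights. Without skips, the input gradient is a
# product of 50 small Jacobians; with skips, each Jacobian is (I + W).
layers = [nn.Linear(16, 16) for _ in range(50)]
for layer in layers:
    nn.init.normal_(layer.weight, std=0.05)
    nn.init.zeros_(layer.bias)

def input_grad_norm(use_skip):
    x = torch.randn(1, 16, requires_grad=True)
    h = x
    for layer in layers:
        h = h + layer(h) if use_skip else layer(h)
    h.sum().backward()
    return x.grad.norm().item()

print(f"plain: {input_grad_norm(False):.2e}")  # vanishingly small
print(f"skip:  {input_grad_norm(True):.2e}")   # healthy, order-one magnitude
```

In a real ResNet the effect is less extreme (batch norm keeps per-layer scales near 1), but the identity term in each Jacobian is what keeps early layers trainable.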

Ensemble Effect

A ResNet can be viewed as an ensemble of many shorter networks. Each block can either transform the features or pass them through unchanged. With n blocks, there are 2^n possible paths through the network — it implicitly trains an exponential number of sub-networks simultaneously.

5 — Complete Tensor Shape Summary

For ResNet-50 with a 224×224 input:

| Stage | Layer | Output Shape | Blocks | Notes |
|---|---|---|---|---|
| Input | Image | 224×224×3 | — | RGB, normalized |
| Conv1 | 7×7 conv, stride 2 | 112×112×64 | — | + BN + ReLU |
| Pool | 3×3 max pool, stride 2 | 56×56×64 | — | |
| Stage 2 | conv2_x | 56×56×256 | 3 | Bottleneck, no downsample |
| Stage 3 | conv3_x | 28×28×512 | 4 | Stride 2 on first block |
| Stage 4 | conv4_x | 14×14×1024 | 6 | Stride 2 on first block |
| Stage 5 | conv5_x | 7×7×2048 | 3 | Stride 2 on first block |
| Head | AvgPool + FC | 1000 | — | Softmax probabilities |

6 — Results and the ResNet Family

| Model | Layers | Params | Top-1 Error (%) | Top-5 Error (%) |
|---|---|---|---|---|
| ResNet-18 | 18 | 11.7M | 27.9 | 9.6 |
| ResNet-34 | 34 | 21.8M | 25.0 | 7.8 |
| ResNet-50 | 50 | 25.6M | 22.9 | 6.7 |
| ResNet-101 | 101 | 44.5M | 21.8 | 6.0 |
| ResNet-152 | 152 | 60.2M | 21.4 | 5.7 |

An ensemble of these residual networks (built around ResNet-152) won 1st place in the ILSVRC 2015 classification challenge with a 3.57% top-5 error rate. Perhaps more importantly, ResNet became the default backbone for virtually every downstream task — object detection (Faster R-CNN), segmentation (Mask R-CNN, DeepLab), and even as the image encoder in models like CLIP and DETR.

Legacy: Skip connections are now a universal pattern in deep learning — they appear in DenseNet, U-Net, Transformers (layer norm + residual), GPT, and virtually every modern architecture. ResNet didn't just solve image classification — it proved that depth is achievable with the right structural prior.

7 — References & Further Reading