Deep Residual Learning for Image Recognition

Skip Connections That Changed Everything
Tags: Image Classification · CNN · Residual Learning · Skip Connections · Microsoft Research · 2015

1 — The Problem to Solve

Image classification is the task of assigning a single label to an image: "this is a cat," "this is a car," "this is a mushroom." Before ResNet, the computer vision community had a clear recipe: stack more convolutional layers to learn more complex features. But there was a surprising problem.

The Degradation Problem

By 2015, researchers found that simply making networks deeper actually made accuracy worse — not because of overfitting, but because the training itself broke down. A 56-layer network performed worse than a 20-layer network, even on the training set. Gradients either vanished (shrank to near-zero) or exploded as they propagated back through dozens of layers.

[Figure: the degradation problem — a shallow 20-layer network reaches 8.5% error while a deeper 56-layer network reaches 10.3% (worse!); more layers give higher error until skip connections fix it, with ResNet-152 reaching 3.6% error.]
Key Insight: If a deeper network should be at least as good as a shallower one (the extra layers could just learn identity mappings), then the problem isn't capacity — it's optimization. He et al. proposed making the identity mapping the default by adding skip connections, so layers only need to learn the residual (the difference from identity).

What the Model Receives and Returns

Input: An RGB image resized to 224 × 224 × 3.

Output: A probability distribution over 1,000 ImageNet classes. The highest-scoring class is the prediction.

[Figure: a 224×224×3 RGB image enters ResNet, which outputs 1,000 class probabilities — e.g. golden retriever: 0.92, labrador: 0.04, tennis ball: 0.01, …]

2 — Architecture Overview

ResNet comes in several depths (18, 34, 50, 101, 152 layers), but they all share the same skeleton: an initial convolution and pooling, then four groups of residual blocks, then global average pooling and a classifier.

[Figure: ResNet-50 pipeline — Input 224×224×3 → Conv1 (7×7, stride 2, giving 112×112×64) + MaxPool to 56×56 → Stage 2 (56×56, 3 blocks) → Stage 3 (28×28, 4 blocks) → Stage 4 (14×14, 6 blocks) → Stage 5 (7×7, 3 blocks) → AvgPool (7×7 → 1×1) → FC-1000 → Softmax. Each block within a stage contains a skip connection (the key innovation). Total: 50 weighted layers in ResNet-50 (1 conv + (3+4+6+3)×3 bottleneck convs + 1 FC).]

3 — Layer-by-Layer Walkthrough

Let's trace a single 224 × 224 × 3 image through ResNet-50.

1 Initial Convolution and Pooling

Conv1 + BatchNorm + ReLU + MaxPool

224×224×3 → 56×56×64

The first layer is a large 7×7 convolution with 64 filters and stride 2, followed by batch normalization and ReLU. This immediately halves spatial dimensions to 112×112×64. A 3×3 max pooling with stride 2 then halves again to 56×56×64.

[Figure: 224×224×3 input → 7×7 conv, stride 2, 64 filters, + BN + ReLU → 112×112×64 → 3×3 max pool, stride 2, pad 1 → 56×56×64 into Stage 2.]

This aggressive 4× spatial downsampling happens before any residual blocks — it reduces computation for all subsequent stages.
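The stem can be sketched in PyTorch (a minimal sketch; the padding values are chosen to reproduce the shapes in the text):

```python
import torch
import torch.nn as nn

# ResNet stem: 7x7/2 conv + BN + ReLU, then 3x3/2 max pool (224 -> 112 -> 56).
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

x = torch.randn(1, 3, 224, 224)
print(stem[:3](x).shape)  # after conv+BN+ReLU: (1, 64, 112, 112)
print(stem(x).shape)      # after max pool:     (1, 64, 56, 56)
```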

2 The Residual Block (The Core Innovation)

Basic Block (ResNet-18/34)

Two 3×3 convolutions + skip

In the shallower ResNets (18 and 34), each residual block contains two 3×3 convolutions. The input is added directly to the output — this is the skip connection.

[Figure: basic residual block — the input x passes through 3×3 conv + BN + ReLU, then 3×3 conv + BN, while an identity skip connection simply copies x; the two are added (F(x) + x, where F(x) is the learned residual and x the identity) and passed through a final ReLU.]
Why this works: If the optimal transformation is close to identity, a regular network must learn to produce x from scratch. With a skip connection, the layers only need to learn F(x) = 0 (the residual), which is much easier to optimize. The network defaults to "change nothing" and only learns deviations from that.
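A minimal PyTorch sketch of the basic block (channels are assumed constant within the block; the stride-2/projection variant is covered below):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convs plus an identity skip, as in ResNet-18/34 (sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # first half of F(x)
        out = self.bn2(self.conv2(out))           # second half (no ReLU yet)
        return self.relu(out + x)                 # F(x) + x, then final ReLU

block = BasicBlock(64)
y = block(torch.randn(1, 64, 56, 56))
print(y.shape)  # (1, 64, 56, 56) — shape preserved, so the addition is valid
```

Note that the second ReLU is applied after the addition, so the skip path itself stays a pure identity.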

Bottleneck Block (ResNet-50/101/152)

1×1 → 3×3 → 1×1 + skip

For deeper ResNets, a "bottleneck" design reduces computation. Instead of two 3×3 convolutions, it uses three layers: a 1×1 conv to reduce channels, a 3×3 conv to process spatial features, and a 1×1 conv to expand channels back. The skip connection still adds the input to the output.

[Figure: bottleneck residual block (ResNet-50+) — a 256-channel input x passes through a 1×1 conv reducing 256→64, a 3×3 conv on 64 channels, and a 1×1 conv expanding 64→256, plus an identity shortcut (256ch → 256ch); output is ReLU(F(x) + x). The 256→64→64→256 body uses ≈70K parameters versus ≈1.18M for two 3×3 convs on 256 channels.]
Bottleneck savings: By squeezing 256 channels down to 64, doing the expensive 3×3 convolution on only 64 channels, then expanding back to 256, the bottleneck body uses roughly 17× fewer parameters (≈70K) than two 3×3 convolutions on 256 channels directly (≈1.18M).
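The parameter arithmetic can be checked directly (counting only conv weights; BN parameters omitted for clarity):

```python
import torch.nn as nn

def conv_params(m):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in m.parameters())

# Bottleneck body: 1x1 reduce -> 3x3 on 64 channels -> 1x1 expand.
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, 1, bias=False),            # 256*64      = 16,384
    nn.Conv2d(64, 64, 3, padding=1, bias=False),  # 64*64*9     = 36,864
    nn.Conv2d(64, 256, 1, bias=False),            # 64*256      = 16,384
)
# What a basic block at this width would cost: two 3x3 convs on 256 channels.
plain = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1, bias=False),  # 256*256*9 = 589,824
    nn.Conv2d(256, 256, 3, padding=1, bias=False),  # 256*256*9 = 589,824
)

print(conv_params(bottleneck))  # 69632
print(conv_params(plain))       # 1179648 — roughly 17x more
```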

Projection Shortcut (When Dimensions Change)

1×1 conv to match dimensions

When moving between stages, spatial dimensions halve and channels double. The skip connection can't simply add 56×56×256 to 28×28×512 — they don't match. The solution: a 1×1 convolution with stride 2 on the skip path to project the input to the right shape.

[Figure: downsampling transition — the 56×56×256 input goes through a bottleneck block with stride 2 on its 3×3 conv, while the skip path uses a 1×1 conv with stride 2 and 512 filters (a projection shortcut, which changes the shape); the two 28×28×512 outputs are added.]
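A minimal sketch of the projection shortcut's shape math (the 512-filter width follows the Stage 3 transition described above):

```python
import torch
import torch.nn as nn

# The main path halves spatial size and doubles channels, so the skip path
# needs a 1x1 stride-2 projection to produce an addable tensor.
x = torch.randn(1, 256, 56, 56)

projection = nn.Conv2d(256, 512, kernel_size=1, stride=2, bias=False)
shortcut = projection(x)

print(shortcut.shape)  # (1, 512, 28, 28) — now matches the main path's output
```

A plain identity skip would fail here: adding a 56×56×256 tensor to a 28×28×512 one is a shape error.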

3 The Four Stages

Stage 2 (conv2_x)

56×56×64 → 56×56×256

Three bottleneck blocks. The first block expands channels from 64 to 256 (using a projection shortcut). The remaining two blocks maintain 56×56×256. No spatial downsampling within this stage — that happened in the initial max pool.

Stage 3 (conv3_x)

56×56×256 → 28×28×512

Four bottleneck blocks. The first block uses stride 2 to halve spatial dimensions and a projection shortcut to double channels. Features at this stage capture mid-level patterns — object parts, textures, and spatial relationships.

Stage 4 (conv4_x)

28×28×512 → 14×14×1024

Six bottleneck blocks — the most blocks of any stage. Again, stride 2 on the first block. This is where most of the network's capacity lives. Features represent high-level object concepts.

Stage 5 (conv5_x)

14×14×1024 → 7×7×2048

Three bottleneck blocks with stride 2 on the first. The output is a highly compressed 7×7 feature map with 2,048 channels — each spatial position corresponds to a 32×32-pixel patch of the input grid (224 / 7 = 32), and its effective receptive field spans far more of the image than that.
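The shape progression of the four stages can be verified with a few lines of arithmetic (a sketch; `conv_out` is the standard convolution output-size formula):

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

# Stem: 7x7/2 conv (pad 3), then 3x3/2 max pool (pad 1): 224 -> 112 -> 56.
size = conv_out(224, 7, 2, 3)   # 112
size = conv_out(size, 3, 2, 1)  # 56

# Stages 2-5: (blocks, stride of first block). Only stages 3-5 downsample.
for stage, (blocks, stride) in enumerate([(3, 1), (4, 2), (6, 2), (3, 2)], start=2):
    size //= stride                    # stride 2 on the first block halves H, W
    channels = 256 * 2 ** (stage - 2)  # bottleneck output channels: 256..2048
    print(f"Stage {stage}: {size}x{size}x{channels} ({blocks} blocks)")
# Stage 2: 56x56x256 (3 blocks)
# Stage 3: 28x28x512 (4 blocks)
# Stage 4: 14x14x1024 (6 blocks)
# Stage 5: 7x7x2048 (3 blocks)
```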

4 Classification Head

Global Average Pooling + Fully Connected

7×7×2048 → 1000

Global Average Pooling takes the 7×7×2048 feature map and averages each channel across all spatial positions, producing a 2048-dimensional vector. This is far more parameter-efficient than the fully connected layers used in earlier networks like VGG.

A single fully connected layer maps the 2,048 features to 1,000 class logits. Softmax converts these to probabilities.

[Figure: Stage 5 output (7×7×2048) → global average pool (7×7 → 1×1, a 2048-d vector) → FC layer 2048→1000 → softmax over 1,000 class probabilities.]
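A sketch of the head in PyTorch (global average pooling written as a plain mean over the spatial axes):

```python
import torch
import torch.nn as nn

features = torch.randn(1, 2048, 7, 7)  # Stage 5 output
pooled = features.mean(dim=(2, 3))     # global average pool -> (1, 2048)

fc = nn.Linear(2048, 1000)             # the only FC layer: ~2.05M parameters
logits = fc(pooled)                    # (1, 1000) class scores
probs = torch.softmax(logits, dim=1)   # probabilities summing to 1

print(pooled.shape, logits.shape)      # (1, 2048) (1, 1000)
```

The entire head costs about 2048 × 1000 + 1000 ≈ 2.05M parameters; VGG-16's fully connected layers alone account for over 100M.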

4 — Why Skip Connections Work

Gradient Flow

During backpropagation, the skip connection creates a "gradient highway." Without it, gradients must flow through every weight layer, and they shrink multiplicatively — 50 layers of 0.9× multiplication gives 0.005 (effectively zero). With skip connections, gradients can flow directly through the addition operation, reaching early layers intact.

[Figure: gradient flow comparison — without skip connections, each of 50 layers multiplies the gradient by ≈0.9, giving 0.9^50 ≈ 0.005: the gradient vanishes; with skip connections, the identity path carries a factor of 1.0 and the gradient flows freely.]
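A toy experiment (not from the paper) illustrating the gradient-highway claim: 50 small linear layers, with and without residual additions. The layer width and the 0.05 weight scale are arbitrary choices for the demo.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 50 linear layers with small weights. Without skips, the input gradient is a
# product of 50 small Jacobians; with skips, each Jacobian is (I + W).
layers = [nn.Linear(16, 16) for _ in range(50)]
for layer in layers:
    nn.init.normal_(layer.weight, std=0.05)
    nn.init.zeros_(layer.bias)

def input_grad_norm(use_skip):
    x = torch.randn(1, 16, requires_grad=True)
    h = x
    for layer in layers:
        h = h + layer(h) if use_skip else layer(h)
    h.sum().backward()
    return x.grad.norm().item()

print(f"plain: {input_grad_norm(False):.2e}")  # vanishingly small
print(f"skip:  {input_grad_norm(True):.2e}")   # healthy, order-one magnitude
```

In a real ResNet the effect is less extreme (batch norm keeps per-layer scales near 1), but the identity term in each Jacobian is what keeps early layers trainable.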

Ensemble Effect

A ResNet can be viewed as an ensemble of many shorter networks. Each block can either transform the features or pass them through unchanged. With n blocks, there are 2^n possible paths through the network — it implicitly trains an exponential number of sub-networks simultaneously.

5 — Complete Tensor Shape Summary

For ResNet-50 with a 224×224 input:

| Stage | Layer | Output Shape | Blocks | Notes |
|---|---|---|---|---|
| Input | Image | 224×224×3 | — | RGB, normalized |
| Conv1 | 7×7 conv, stride 2 | 112×112×64 | — | + BN + ReLU |
| Pool | 3×3 max pool, stride 2 | 56×56×64 | — | |
| Stage 2 | conv2_x | 56×56×256 | 3 | Bottleneck, no downsample |
| Stage 3 | conv3_x | 28×28×512 | 4 | Stride 2 on first block |
| Stage 4 | conv4_x | 14×14×1024 | 6 | Stride 2 on first block |
| Stage 5 | conv5_x | 7×7×2048 | 3 | Stride 2 on first block |
| Head | AvgPool + FC | 1000 | — | Softmax probabilities |

6 — Results and the ResNet Family

| Model | Layers | Params | Top-1 Error (%) | Top-5 Error (%) |
|---|---|---|---|---|
| ResNet-18 | 18 | 11.7M | 27.9 | 9.6 |
| ResNet-34 | 34 | 21.8M | 25.0 | 7.8 |
| ResNet-50 | 50 | 25.6M | 22.9 | 6.7 |
| ResNet-101 | 101 | 44.5M | 21.8 | 6.0 |
| ResNet-152 | 152 | 60.2M | 21.4 | 5.7 |

An ensemble of these residual networks (built around ResNet-152) won 1st place in the ILSVRC 2015 classification challenge with a 3.57% top-5 error rate. Perhaps more importantly, ResNet became the default backbone for virtually every downstream task — object detection (Faster R-CNN), segmentation (Mask R-CNN, DeepLab), and even as the image encoder in models like CLIP and DETR.

Legacy: Skip connections are now a universal pattern in deep learning — they appear in DenseNet, U-Net, Transformers (layer norm + residual), GPT, and virtually every modern architecture. ResNet didn't just solve image classification — it proved that depth is achievable with the right structural prior.

7 — References & Further Reading