An Image is Worth 16x16 Words
1 — The Problem to Solve
By 2020, Transformers had conquered NLP — BERT, GPT, and friends dominated every language benchmark. But computer vision still ran on CNNs. The question was simple: can we apply the Transformer architecture directly to images?
The challenge: Transformers process sequences of tokens, but images are 2D grids of pixels. A 224×224 image has 50,176 pixels, and self-attention over that many tokens is prohibitively expensive, since its cost scales quadratically with sequence length. ViT's solution: cut the image into 16×16 patches and treat each patch as a "word," shrinking the sequence to just 196 tokens.
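The arithmetic behind that trade is easy to verify with a quick back-of-the-envelope check:

```python
# Tokens if every pixel were a token vs. one token per 16x16 patch
pixels = 224 * 224             # 50,176 pixel "tokens"
patches = (224 // 16) ** 2     # 196 patch tokens (a 14x14 grid)

# Self-attention builds an n x n interaction matrix, so cost scales as n^2
savings = pixels**2 / patches**2
print(patches, savings)        # 196 patches, 65,536x fewer attention entries
```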
What the Model Receives and Returns
Input: An RGB image resized to 224 × 224 × 3.
Output: A probability distribution over 1,000 ImageNet classes (same task as ResNet).
2 — Architecture Overview
ViT is remarkably simple compared to CNNs. There are no pooling layers, no convolutions (except one for patch embedding), and no multi-scale feature maps. It's just: split into patches, embed, add position info, and run through a standard Transformer encoder.
3 — Layer-by-Layer Walkthrough
Let's trace a single 224 × 224 × 3 image through ViT-Base/16.
1 Patch Embedding
The 224 × 224 image is cut into a 14 × 14 grid of 16 × 16 patches (196 in total), and each flattened patch is linearly projected to a 768-dimensional embedding, implemented in practice as a 16 × 16 convolution with stride 16.
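A minimal NumPy sketch of the patchify-and-project step, using random stand-in weights (the real model learns this projection; the stride-16 conv in the shape table computes exactly the same thing):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.standard_normal((3, 224, 224))       # C, H, W

P = 16                                         # patch size
C, H, W = img.shape
# Carve the image into a 14x14 grid of 16x16 patches, then flatten each
patches = img.reshape(C, H // P, P, W // P, P)
patches = patches.transpose(1, 3, 0, 2, 4).reshape(-1, C * P * P)  # (196, 768)

# Linear projection to the embedding dimension (random stand-in weights);
# a 16x16 conv with stride 16 over the image is equivalent to this matmul
W_embed = rng.standard_normal((C * P * P, 768)) * 0.02
tokens = patches @ W_embed                     # (196, 768)
```

Note the coincidence: a flattened patch has 16·16·3 = 768 values, the same as ViT-Base's embedding width, so the projection here maps 768 → 768.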
2 Transformer Encoder
Multi-Head Self-Attention (MSA)
Each token (patch embedding) attends to every other token. This is where the magic happens: a patch showing a dog's ear can "look at" a patch showing the dog's nose and a patch showing the tail, connecting distant image regions that a CNN, whose receptive field grows only gradually with depth, might miss.
With 12 attention heads, each head operates on 768/12 = 64 dimensions. Each head projects the tokens into query (Q), key (K), and value (V) matrices, then computes attention(Q, K, V) = softmax(QKᵀ/√64)·V. The outputs of all 12 heads are concatenated and projected back to 768 dimensions.
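The head-splitting, scaled dot-product attention, and output projection can be sketched in NumPy; this is a minimal version with no biases and random weights supplied by the caller:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, Wq, Wk, Wv, Wo, n_heads=12):
    n, d = x.shape                                    # (197, 768)
    dh = d // n_heads                                 # 64 dims per head
    # Project, then split the 768 channels into 12 heads of 64
    q = (x @ Wq).reshape(n, n_heads, dh).transpose(1, 0, 2)
    k = (x @ Wk).reshape(n, n_heads, dh).transpose(1, 0, 2)
    v = (x @ Wv).reshape(n, n_heads, dh).transpose(1, 0, 2)
    # Scaled dot-product attention per head: softmax(QK^T / sqrt(64)) V
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))
    out = attn @ v                                    # (12, 197, 64)
    # Concatenate heads and project back to 768
    return out.transpose(1, 0, 2).reshape(n, d) @ Wo
```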
MLP (Feed-Forward Network)
After self-attention, each token independently passes through a two-layer MLP: expand to 4× the embedding dimension (768 → 3072), apply GELU activation, then project back (3072 → 768). This is where the model learns non-linear feature transformations — self-attention mixes information between tokens, and the MLP processes it.
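The expand-activate-project pattern is a few lines in NumPy; this sketch uses the common tanh approximation of GELU and takes its weights as arguments:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in many implementations
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp_block(x, W1, b1, W2, b2):
    # Expand 768 -> 3072, apply GELU, project back 3072 -> 768;
    # applied to every token independently
    return gelu(x @ W1 + b1) @ W2 + b2
```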
Residual Connections and Layer Normalization
Just like ResNet's skip connections, ViT adds the input of each sub-layer to its output: x = x + MSA(LN(x)) and x = x + MLP(LN(x)). ViT uses Pre-Norm (Layer Norm before the sub-layer), which was found to train more stably than Post-Norm used in the original Transformer.
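The Pre-Norm wiring of one encoder block, sketched with the sub-layers passed in as callables (LayerNorm shown without its learned scale and shift, for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token across its 768 channels
    # (learned scale/shift omitted in this sketch)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(x, msa, mlp):
    # Pre-Norm: LayerNorm is applied *before* each sub-layer,
    # and the residual connection skips around it
    x = x + msa(layer_norm(x))
    x = x + mlp(layer_norm(x))
    return x
```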
3 Classification Head
Extract [CLS] Token + MLP Head
After 12 encoder layers, the [CLS] token (position 0) has attended to all 196 patch tokens repeatedly. It now encodes a global representation of the entire image. A final Layer Norm and an MLP head (one hidden layer during pre-training, a single linear layer during fine-tuning) map this 768-dimensional vector to class logits.
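A sketch of the fine-tuning head (single linear layer, weights supplied by the caller; LayerNorm again without its learned parameters):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def classification_head(tokens, W_head, b_head):
    cls = layer_norm(tokens)[0]           # the [CLS] token sits at position 0
    logits = cls @ W_head + b_head        # (768,) @ (768, 1000) -> (1000,)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()            # softmax over the 1,000 classes
```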
4 — CNN vs. ViT: What Changed
Receptive Field
A CNN's receptive field grows gradually, layer by layer, as convolutions stack. In ViT, every token can attend to every other token, so the effective receptive field is global from the very first encoder layer.
Inductive Bias Trade-off
CNNs have strong inductive biases: locality (nearby pixels matter more), translation equivariance (a cat is a cat regardless of position), and hierarchical features (edges → textures → parts → objects). ViT has almost none of these — it must learn spatial structure entirely from data.
This means ViT needs much more data to train well. On ImageNet alone (1.2M images), ViT underperforms ResNet. But when pre-trained on JFT-300M (300M images) or ImageNet-21K (14M images), ViT surpasses comparable CNNs across the benchmarks tested, while requiring substantially fewer computational resources to pre-train.
Attention Patterns: What ViT Learns to See
The authors visualized attention maps from different heads and layers. Early layers attend to nearby patches (learning local features like edges — similar to CNN conv layers). Later layers attend to semantically related patches across the entire image (a dog's face attends to its tail). Different heads specialize: some track horizontal patterns, others vertical, others diagonal.
Learned Position Embeddings: ViT Discovers 2D Structure
Even though ViT's position embeddings are 1D learned vectors (not explicitly 2D), the model discovers spatial structure on its own. Visualizing cosine similarity between position embeddings reveals a clear 2D grid pattern — nearby patches in the image end up with similar position embeddings.
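The computation behind that visualization is just cosine similarity between embedding rows. This sketch uses random stand-in embeddings, so no grid pattern emerges here; it only shows the mechanics that, applied to trained embeddings, reveal the 2D structure:

```python
import numpy as np

rng = np.random.default_rng(0)
pos = rng.standard_normal((196, 768))  # stand-in for learned pos embeddings

# Cosine similarity of every position embedding with every other
unit = pos / np.linalg.norm(pos, axis=1, keepdims=True)
sim = unit @ unit.T                    # (196, 196), values in [-1, 1]

# Similarity of the center patch (row 7, col 7) to all patches, as a 14x14
# map; with *trained* embeddings this map is brightest around (7, 7)
center = 7 * 14 + 7
heatmap = sim[center].reshape(14, 14)
```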
4.5 — ViT Model Variants
ViT comes in three sizes: Base, Large, and Huge. The naming convention is ViT-{Size}/{Patch Size}, so ViT-B/16 means the Base model with 16×16 patches.
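The three configurations from the paper, summarized as a config dict (parameter counts match the results table below; patch size is chosen separately, e.g. /16 or /14):

```python
# ViT model variants (depth, width, heads, MLP hidden size)
VIT_CONFIGS = {
    "ViT-Base":  dict(layers=12, dim=768,  heads=12, mlp_dim=3072),  # ~86M params
    "ViT-Large": dict(layers=24, dim=1024, heads=16, mlp_dim=4096),  # ~307M
    "ViT-Huge":  dict(layers=32, dim=1280, heads=16, mlp_dim=5120),  # ~632M
}
```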
5 — Tensor Shape Summary
For ViT-Base/16 with a 224×224 input:
| Stage | Operation | Output Shape | Notes |
|---|---|---|---|
| Input | Image | 224×224×3 | RGB, normalized |
| Patch Embed | 16×16 conv, stride 16 | 196×768 | 14×14 patches |
| Prepend [CLS] | Concat learned token | 197×768 | +1 classification token |
| + Pos Embed | Element-wise add | 197×768 | Learned positions |
| Encoder (×12) | Multi-Head Self-Attention | 197×768 | 12 heads, 64 dim each |
| MLP | Two-layer FFN | 197×768 | 768→3072→768 |
| Head | Extract [CLS] + Linear | 1000 | Softmax probabilities |
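The shape trace in the table can be checked end to end with NumPy; all weights here are random stand-ins, and the 12 encoder blocks are elided since they leave the (197, 768) shape unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal((3, 224, 224))                       # input image
p = x.reshape(3, 14, 16, 14, 16).transpose(1, 3, 0, 2, 4)
p = p.reshape(196, 768)          # flattened patches; the learned 768-dim
                                 # projection preserves this shape
tok = np.concatenate([rng.standard_normal((1, 768)), p])     # + [CLS] -> (197, 768)
tok = tok + rng.standard_normal((197, 768))                  # + position embeddings
# ... 12 encoder blocks would go here; each maps (197, 768) -> (197, 768) ...
logits = tok[0] @ (rng.standard_normal((768, 1000)) * 0.02)  # head -> (1000,)
```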
6 — Results
| Model | Pre-training Data | ImageNet Top-1 | Params |
|---|---|---|---|
| ResNet-152 (baseline) | ImageNet-1K | 78.3% | 60M |
| ViT-B/16 | ImageNet-1K | 77.9% | 86M |
| ViT-B/16 | ImageNet-21K | 84.0% | 86M |
| ViT-L/16 | JFT-300M | 87.8% | 307M |
| ViT-H/14 | JFT-300M | 88.6% | 632M |
7 — ViT's Descendants: The Vision Transformer Family Tree
ViT's impact was enormous — it launched an entire subfield. Here's how the key descendants address ViT's limitations:
| Model | Key Innovation | ViT Limitation Addressed |
|---|---|---|
| DeiT | Distillation token + heavy augmentation | Data hunger — trains on ImageNet-1K alone |
| Swin | Shifted window attention + hierarchical stages | Quadratic cost — O(n) local attention + multi-scale |
| MAE | Mask 75% of patches, reconstruct | Need for labels — self-supervised pre-training |
| DINO/DINOv2 | Self-distillation without labels | Label dependency — learns without any annotation |
| Hiera | Hierarchical ViT, MAE pre-trained | Speed — multi-scale features, 6× faster than ViT-H |
8 — References & Further Reading
- An Image is Worth 16x16 Words — Dosovitskiy et al., 2020 (original paper)
- DeiT: Training Data-Efficient Image Transformers — Touvron et al., 2020
- Swin Transformer: Hierarchical Vision Transformer — Liu et al., 2021
- Official Google Research Implementation