An Image is Worth 16x16 Words

Vision Transformer — Transformers Replace Convolutions
Tags: Image Classification · Transformer · Patch Embeddings · Self-Attention · Google Brain · 2020

1 — The Problem to Solve

By 2020, Transformers had conquered NLP — BERT, GPT, and friends dominated every language benchmark. But computer vision still ran on CNNs. The question was simple: can we apply the Transformer architecture directly to images?

The challenge: Transformers process sequences of tokens, but images are 2D grids of pixels. A 224×224 image has 50,176 pixels — self-attention over that many tokens is computationally infeasible, since its cost scales quadratically with sequence length. ViT's solution: cut the image into patches and treat each patch as a "word."

Key Insight: Split a 224×224 image into a grid of 16×16 patches. Each patch becomes a token. Now you have 196 tokens instead of 50,176 pixels — a manageable sequence length for a Transformer.
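The arithmetic behind this insight is worth making concrete. A minimal sketch (plain Python, no framework):

```python
# Sketch: how patching turns a 224x224 image into a short token sequence.
image_size, patch_size, channels = 224, 16, 3

patches_per_side = image_size // patch_size        # 14
num_patches = patches_per_side ** 2                # 196 tokens
patch_dim = patch_size * patch_size * channels     # 768 values per patch

print(num_patches, patch_dim)  # 196 768

# Self-attention cost grows with the square of sequence length:
pixel_tokens = image_size * image_size             # 50,176 if every pixel were a token
print(pixel_tokens ** 2 // num_patches ** 2)       # patching cuts attention cost 65,536x
```

The 65,536× factor is exactly (patch area)² = 16⁴ — each factor of 16 in sequence length saves a factor of 256 in attention cost.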

What the Model Receives and Returns

Input: An RGB image resized to 224 × 224 × 3.

Output: A probability distribution over 1,000 ImageNet classes (same task as ResNet).

[Diagram: 224×224 image → split into 14×14 patches → 196 patch tokens + 1 [CLS] token = 197 tokens → Transformer (12 encoder layers of self-attention + MLP) → [CLS] output → MLP head → 1,000 classes, e.g. "golden retriever".]

2 — Architecture Overview

ViT is remarkably simple compared to CNNs. There are no pooling layers, no convolutions (except one for patch embedding), and no multi-scale feature maps. It's just: split into patches, embed, add position info, and run through a standard Transformer encoder.

[Diagram: Patch embedding (224×224 input → linear projection of 16×16 patches to 768-d; [CLS] token prepended and position embeddings added → 197×768) → Transformer encoder ×12 (Layer Norm → 12-head self-attention + residual; Layer Norm → MLP 768→3072→768 + residual) → classification head (Layer Norm → [CLS] 768-d → MLP head 768→1000 + softmax).]

3 — Layer-by-Layer Walkthrough

Let's trace a single 224 × 224 × 3 image through ViT-Base/16.

1 Patch Embedding

Split Image into Patches

224×224×3 → 196×768

The image is divided into a grid of non-overlapping 16×16 patches. Each patch is 16×16×3 = 768 values. With a 224×224 image, this gives 14×14 = 196 patches.

Each 768-value patch is linearly projected to a 768-dimensional embedding using a learned weight matrix. In practice, this is implemented as a single 16×16 convolution with stride 16 and 768 output channels — mathematically identical to flattening + linear projection, but more efficient.

[Diagram: a 224×224 image is split into 14×14 = 196 patches; each 16×16×3 patch is flattened into 768 values (patch 1: [r,g,b, r,g,b, …], patch 2: …, patch 196: …), linearly projected (768×768) to 768-dim embeddings e1…e196 (196×768); the [CLS] token is prepended (197×768) and positional embeddings (197×768) are added element-wise.]
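The patchify-and-flatten step can be sketched in plain Python (no framework; the image here is a dummy nested list, and the learned linear projection is omitted):

```python
# Sketch: extract non-overlapping 16x16 patches and flatten each into a
# 768-value vector, as the patch-embedding step does before projection.
H = W = 224
P = 16  # patch size

# Dummy image as a nested list: image[y][x] = (r, g, b)
image = [[(y % 256, x % 256, 0) for x in range(W)] for y in range(H)]

patches = []
for py in range(0, H, P):          # 14 rows of patches
    for px in range(0, W, P):      # 14 columns of patches
        flat = []
        for y in range(py, py + P):
            for x in range(px, px + P):
                flat.extend(image[y][x])   # append r, g, b
        patches.append(flat)

print(len(patches), len(patches[0]))  # 196 768
```

In a real implementation this whole loop is the single strided convolution described above — same output, one fused operation.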

[CLS] Token and Positional Embeddings

196×768 → 197×768

A special learnable [CLS] token is prepended to the sequence (borrowed from BERT). After passing through all Transformer layers, the [CLS] token's output is used as the image representation for classification — it aggregates information from all patches via self-attention.

Positional embeddings — 197 learnable 768-dimensional vectors — are added element-wise to tell the Transformer where each patch came from. Without these, the model wouldn't know that patch 1 is top-left and patch 196 is bottom-right (Transformers are permutation-invariant by default).

Why learned positions, not fixed? ViT uses learned positional embeddings rather than the sinusoidal encodings from the original Transformer. The authors found learned positions worked equally well and are simpler. Interestingly, the learned positions end up encoding a 2D spatial pattern — nearby patches learn similar position embeddings.
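The prepend-and-add bookkeeping is simple; a minimal sketch with stand-in constant vectors (in ViT both the [CLS] token and the positional embeddings are learned parameters):

```python
# Sketch: prepend a [CLS] token and add positional embeddings element-wise.
# The 0.5 / 0.01 values are dummy stand-ins for learned parameters.
D, N = 768, 196

patch_embeddings = [[0.0] * D for _ in range(N)]     # 196 x 768, from patch projection
cls_token = [0.5] * D                                # 1 x 768, learned
tokens = [cls_token] + patch_embeddings              # 197 x 768

pos_embeddings = [[0.01] * D for _ in range(N + 1)]  # 197 x 768, learned
tokens = [[t + p for t, p in zip(tok, pos)]
          for tok, pos in zip(tokens, pos_embeddings)]

print(len(tokens), len(tokens[0]))  # 197 768
```

Note that the [CLS] token gets its own positional embedding (position 0), which is why there are 197 position vectors, not 196.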

2 Transformer Encoder

Multi-Head Self-Attention (MSA)

197×768 → 197×768

Each token (patch embedding) attends to every other token. This is where the magic happens — a patch showing a dog's ear can "look at" a patch showing a dog's nose and a patch showing a tail, connecting distant image regions that a CNN with its limited receptive field might miss.

With 12 attention heads, each head operates on 768/12 = 64 dimensions. Each head computes Q (query), K (key), and V (value) matrices, then attention weights = softmax(QK^T / √64). The outputs of all 12 heads are concatenated and projected back to 768 dimensions.

[Diagram: multi-head self-attention — 197×768 tokens projected to Q, K, V (197×64 per head); softmax(QK^T / √64) gives 197×197 attention weights applied to V; the 12 head outputs (12×64 = 768) are concatenated and linearly projected. Every patch can attend to every other patch — a global receptive field from layer 1.]
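Scaled dot-product attention for a single head can be written out in plain Python. This is a sketch on a tiny 3-token example, not the vectorized implementation a real framework would use:

```python
import math

# Sketch: scaled dot-product attention for one head (pure Python).
# ViT-Base runs 12 such heads over 197 tokens of width 64 each.
def attention(Q, K, V):
    d = len(K[0])
    out = []
    for q in Q:
        # scores: q . k / sqrt(d) against every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]        # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        # output: attention-weighted sum of value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Tiny example: 3 tokens, head dimension 2, self-attention (Q = K = V)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(Q, Q, Q)
print(len(out), len(out[0]))  # 3 2
```

Because each row of weights is a softmax, every output token is a convex combination of the value vectors — which is exactly why a patch can blend information from anywhere in the image.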

MLP (Feed-Forward Network)

197×768 → 197×768

After self-attention, each token independently passes through a two-layer MLP: expand to 4× the embedding dimension (768 → 3072), apply GELU activation, then project back (3072 → 768). This is where the model learns non-linear feature transformations — self-attention mixes information between tokens, and the MLP processes it.

[Diagram: 768 → Linear 768→3072 → GELU → Linear 3072→768 → 768, applied independently to each of the 197 tokens.]
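The expand-GELU-project pattern, sketched in plain Python with tiny stand-in weights (4→16→4 instead of 768→3072→768, keeping the same 4× ratio):

```python
import math

# Sketch: the per-token MLP — expand 4x, apply GELU, project back.
def gelu(x):
    # exact GELU via the Gaussian CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def mlp(token, W1, b1, W2, b2):
    hidden = [gelu(sum(t * w for t, w in zip(token, row)) + b)
              for row, b in zip(W1, b1)]                 # expand: d -> h
    return [sum(x * w for x, w in zip(hidden, row)) + b
            for row, b in zip(W2, b2)]                   # project back: h -> d

d, h = 4, 16  # same 4x ratio as 768 -> 3072
token = [0.1, -0.2, 0.3, 0.0]
W1 = [[0.1] * d for _ in range(h)]; b1 = [0.0] * h
W2 = [[0.1] * h for _ in range(d)]; b2 = [0.0] * d
out = mlp(token, W1, b1, W2, b2)
print(len(out))  # 4: same width as the input token, as required by the residual
```

The output width must match the input width so the residual connection (next subsection) can add the two element-wise.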

Residual Connections and Layer Normalization

Applied around both MSA and MLP

Just like ResNet's skip connections, ViT adds the input of each sub-layer to its output: x = x + MSA(LN(x)) and x = x + MLP(LN(x)). ViT uses Pre-Norm (Layer Norm before the sub-layer), which was found to train more stably than Post-Norm used in the original Transformer.
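The Pre-Norm ordering is easy to get wrong, so here is the control flow as a sketch, with identity/scaling stand-ins for the real LayerNorm, attention, and MLP:

```python
# Sketch: Pre-Norm encoder block ordering — x = x + MSA(LN(x)), x = x + MLP(LN(x)).
# layer_norm, msa, and mlp are stand-ins, used only to show the wiring.
def layer_norm(x):
    return x                      # stand-in: identity

def msa(x):
    return [v * 0.1 for v in x]   # stand-in for multi-head self-attention

def mlp(x):
    return [v * 0.1 for v in x]   # stand-in for the feed-forward network

def encoder_block(x):
    # normalize *before* each sub-layer, add the residual *after* it
    x = [a + b for a, b in zip(x, msa(layer_norm(x)))]
    x = [a + b for a, b in zip(x, mlp(layer_norm(x)))]
    return x

x = [1.0, 2.0]
for _ in range(12):               # ViT-Base stacks 12 identical blocks
    x = encoder_block(x)
print(x)
```

With these stand-ins each block multiplies the input by 1.1 twice; the point is only the ordering: the residual stream is never normalized in place, which is what makes deep stacks train stably.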

3 Classification Head

Extract [CLS] Token + MLP Head

768 → 1000

After 12 encoder layers, the [CLS] token (position 0) has attended to all 196 patch tokens repeatedly. It now encodes a global representation of the entire image. A final Layer Norm and an MLP head (one hidden layer during pre-training, single linear layer during fine-tuning) maps this 768-dimensional vector to class logits.

4 — CNN vs. ViT: What Changed

Receptive Field

[Diagram: a CNN's 3×3 kernel sees only its neighbors and needs many layers for global context; ViT's self-attention sees all patches at once — global context from the very first layer.]

Inductive Bias Trade-off

CNNs have strong inductive biases: locality (nearby pixels matter more), translation equivariance (a cat is a cat regardless of position), and hierarchical features (edges → textures → parts → objects). ViT has almost none of these — it must learn spatial structure entirely from data.

This means ViT needs much more data to train well. On ImageNet alone (1.2M images), ViT underperforms ResNet. But when pre-trained on JFT-300M (300M images) or ImageNet-21K (14M images), ViT surpasses CNNs on every benchmark — with fewer computational resources at inference.

[Plot: ImageNet top-1 accuracy vs. pre-training dataset size (ImageNet-1K → ImageNet-21K → JFT-300M) for ResNet and ViT — CNNs win at small data scale; ViT crosses over and wins at large scale.]

Attention Patterns: What ViT Learns to See

The authors visualized attention maps from different heads and layers. Early layers attend to nearby patches (learning local features like edges — similar to CNN conv layers). Later layers attend to semantically related patches across the entire image (a dog's face attends to its tail). Different heads specialize: some track horizontal patterns, others vertical, others diagonal.

[Figure: mean attention distance by layer — layer 1 is mostly local (~8 px average distance), layer 6 mixed local/global (~40 px), layer 12 global (~100 px). Deeper layers → larger attention distance → more global reasoning; some heads attend globally even in layer 1, as heads specialize for different spatial patterns. This emergence of CNN-like local processing in early layers shows ViT "reinvents" convolutional behavior when given enough data.]

Learned Position Embeddings: ViT Discovers 2D Structure

Even though ViT's position embeddings are 1D learned vectors (not explicitly 2D), the model discovers spatial structure on its own. Visualizing cosine similarity between position embeddings reveals a clear 2D grid pattern — nearby patches in the image end up with similar position embeddings.

[Figure: cosine similarity between each position embedding and that of one reference patch — nearby patches show high similarity, distant patches low. Key observation: 1D position embeddings spontaneously learn 2D spatial structure.]

4.5 — ViT Model Variants

ViT comes in three sizes. The naming convention is ViT-{Size}/{Patch Size} — so ViT-B/16 means Base model with 16×16 patches.

ViT Model Family:

| Model | Layers | Heads | Embed Dim | MLP Dim | Params | ImageNet Top-1 |
|---|---|---|---|---|---|---|
| ViT-Base | 12 | 12 | 768 | 3072 | 86M | B/16: 84.0%, B/32: 80.7% (ImageNet-21K pre-train) |
| ViT-Large | 24 | 16 | 1024 | 4096 | 307M | L/16: 87.8% (JFT-300M pre-train) |
| ViT-Huge | 32 | 16 | 1280 | 5120 | 632M | H/14: 88.6%, SOTA (JFT-300M pre-train) |
Patch size matters: Smaller patches (14×14 vs 16×16) produce more tokens (256 vs 196 for a 224×224 image), giving finer spatial resolution but costing quadratically more in self-attention. ViT-H/14 uses 14×14 patches — the extra resolution helps it hit the best accuracy but increases compute by ~1.7× vs /16.
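The token-count and cost arithmetic behind that ~1.7× figure, as a quick sketch:

```python
# Sketch: token count and attention-cost ratio for /16 vs /14 patches.
def num_tokens(image_size, patch_size, cls=1):
    # patches per side, squared, plus the [CLS] token
    return (image_size // patch_size) ** 2 + cls

t16 = num_tokens(224, 16)   # 196 patches + [CLS] = 197
t14 = num_tokens(224, 14)   # 256 patches + [CLS] = 257
print(t16, t14)  # 197 257

# Self-attention cost scales with (sequence length)^2
print(round((t14 / t16) ** 2, 2))  # 1.7
```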

5 — Tensor Shape Summary

For ViT-Base/16 with a 224×224 input:

| Stage | Operation | Output Shape | Notes |
|---|---|---|---|
| Input | Image | 224×224×3 | RGB, normalized |
| Patch Embed | 16×16 conv, stride 16 | 196×768 | 14×14 patches |
| Prepend [CLS] | Concat learned token | 197×768 | +1 classification token |
| + Pos Embed | Element-wise add | 197×768 | Learned positions |
| Encoder (×12) | Multi-head self-attention | 197×768 | 12 heads, 64 dim each |
| Encoder (×12) | MLP | 197×768 | 768→3072→768 |
| Head | Extract [CLS] + Linear | 1000 | Softmax probabilities |
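The same shape bookkeeping can be generated programmatically; a sketch (the function name and stage labels are illustrative, not from any library):

```python
# Sketch: trace tensor shapes through ViT-Base/16 for one image.
def vit_base16_shapes(image_hw=224, patch=16, dim=768, layers=12, classes=1000):
    """Return (stage, shape) pairs for a single image pass."""
    n = (image_hw // patch) ** 2                       # 196 patches
    stages = [("input", (image_hw, image_hw, 3)),
              ("patch_embed", (n, dim)),
              ("add_cls", (n + 1, dim))]
    for i in range(layers):
        # each encoder block (MSA + MLP) preserves the sequence shape
        stages.append((f"encoder_{i + 1}", (n + 1, dim)))
    stages.append(("cls_out", (dim,)))                 # extract [CLS] token
    stages.append(("logits", (classes,)))              # MLP head output
    return stages

shapes = vit_base16_shapes()
print(shapes[1], shapes[2], shapes[-1])
```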

6 — Results

| Model | Pre-training Data | ImageNet Top-1 | Params |
|---|---|---|---|
| ResNet-152 (baseline) | ImageNet-1K | 78.3% | 60M |
| ViT-B/16 | ImageNet-1K | 77.9% | 86M |
| ViT-B/16 | ImageNet-21K | 84.0% | 86M |
| ViT-L/16 | JFT-300M | 87.8% | 307M |
| ViT-H/14 | JFT-300M | 88.6% | 632M |
Legacy: ViT proved that Transformers can match or beat CNNs on vision tasks — given enough data. It spawned an explosion of vision Transformer variants (DeiT, Swin, BEiT, MAE) and became the de facto image encoder in multimodal models like CLIP, DALL-E, SAM, and Stable Diffusion.

7 — ViT's Descendants: The Vision Transformer Family Tree

ViT's impact was enormous — it launched an entire subfield. Here's how the key descendants address ViT's limitations:

[Diagram: ViT family tree — ViT (2020) → DeiT (2020: train on ImageNet-1K only via knowledge distillation + augmentation), Swin Transformer (2021: shifted-window local attention, hierarchical multi-scale features), MAE / BEiT (2021: self-supervised mask-and-reconstruct pre-training), DINO / DINOv2 (2021/23: self-distillation with no labels, emergent object segmentation). Downstream: CLIP, DALL-E, SAM, Stable Diffusion, GPT-4V, LLaVA, RF-DETR. ViT is now the default vision backbone for multimodal foundation models.]
| Model | Key Innovation | ViT Limitation Addressed |
|---|---|---|
| DeiT | Distillation token + heavy augmentation | Data hunger — trains on ImageNet-1K alone |
| Swin | Shifted window attention + hierarchical stages | Quadratic cost — O(n) local attention + multi-scale |
| MAE | Mask 75% of patches, reconstruct | Need for labels — self-supervised pre-training |
| DINO/DINOv2 | Self-distillation without labels | Label dependency — learns without any annotation |
| Hiera | Hierarchical ViT, MAE pre-trained | Speed — multi-scale features, 6× faster than ViT-H |

8 — References & Further Reading