An Image is Worth 16x16 Words

Vision Transformer — Transformers Replace Convolutions
Tags: Image Classification · Transformer · Patch Embeddings · Self-Attention · Google Brain · 2020

1 — The Problem to Solve

By 2020, Transformers had conquered NLP — BERT, GPT, and friends dominated every language benchmark. But computer vision still ran on CNNs. The question was simple: can we apply the Transformer architecture directly to images?

The challenge: Transformers process sequences of tokens, but images are 2D grids of pixels. A 224×224 image has 50,176 pixels — self-attention over that many tokens is computationally infeasible, since its cost scales quadratically with sequence length. ViT's solution: cut the image into patches and treat each patch as a "word."

Key Insight: Split a 224×224 image into a grid of 16×16 patches. Each patch becomes a token. Now you have 196 tokens instead of 50,176 pixels — a manageable sequence length for a Transformer.
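The arithmetic behind this insight is worth making concrete. A minimal sketch (plain Python, no framework):

```python
# Sketch: how patching turns a 224x224 image into a short token sequence.
image_size, patch_size, channels = 224, 16, 3

patches_per_side = image_size // patch_size        # 14
num_patches = patches_per_side ** 2                # 196 tokens
patch_dim = patch_size * patch_size * channels     # 768 values per patch

print(num_patches, patch_dim)  # 196 768

# Self-attention cost grows with the square of sequence length:
pixel_tokens = image_size * image_size             # 50,176 if every pixel were a token
print(pixel_tokens ** 2 // num_patches ** 2)       # patching cuts attention cost 65,536x
```

The 65,536× factor is exactly (patch area)² = 16⁴ — each factor of 16 in sequence length saves a factor of 256 in attention cost.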

What the Model Receives and Returns

Input: An RGB image resized to 224 × 224 × 3.

Output: A probability distribution over 1,000 ImageNet classes (same task as ResNet).

[Diagram: 224×224 image → split into 14×14 patches → 196 patch tokens + 1 [CLS] token = 197 tokens → Transformer (12 encoder layers of self-attention + MLP) → [CLS] output → MLP head → 1,000 classes, e.g. "golden retriever".]

2 — Architecture Overview

ViT is remarkably simple compared to CNNs. There are no pooling layers, no convolutions (except one for patch embedding), and no multi-scale feature maps. It's just: split into patches, embed, add position info, and run through a standard Transformer encoder.

[Diagram: Patch embedding (224×224 input → linear projection of 16×16 patches to 768-d; [CLS] token prepended and position embeddings added → 197×768) → Transformer encoder ×12 (Layer Norm → 12-head self-attention + residual; Layer Norm → MLP 768→3072→768 + residual) → classification head (Layer Norm → [CLS] 768-d → MLP head 768→1000 + softmax).]

3 — Layer-by-Layer Walkthrough

Let's trace a single 224 × 224 × 3 image through ViT-Base/16.

1 Patch Embedding

Split Image into Patches

224×224×3 → 196×768

The image is divided into a grid of non-overlapping 16×16 patches. Each patch is 16×16×3 = 768 values. With a 224×224 image, this gives 14×14 = 196 patches.

Each 768-value patch is linearly projected to a 768-dimensional embedding using a learned weight matrix. In practice, this is implemented as a single 16×16 convolution with stride 16 and 768 output channels — mathematically identical to flattening + linear projection, but more efficient.

[Diagram: a 224×224 image is split into 14×14 = 196 patches; each 16×16×3 patch is flattened into 768 values (patch 1: [r,g,b, r,g,b, …], patch 2: …, patch 196: …), linearly projected (768×768) to 768-dim embeddings e1…e196 (196×768); the [CLS] token is prepended (197×768) and positional embeddings (197×768) are added element-wise.]
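The patchify-and-flatten step can be sketched in plain Python (no framework; the image here is a dummy nested list, and the learned linear projection is omitted):

```python
# Sketch: extract non-overlapping 16x16 patches and flatten each into a
# 768-value vector, as the patch-embedding step does before projection.
H = W = 224
P = 16  # patch size

# Dummy image as a nested list: image[y][x] = (r, g, b)
image = [[(y % 256, x % 256, 0) for x in range(W)] for y in range(H)]

patches = []
for py in range(0, H, P):          # 14 rows of patches
    for px in range(0, W, P):      # 14 columns of patches
        flat = []
        for y in range(py, py + P):
            for x in range(px, px + P):
                flat.extend(image[y][x])   # append r, g, b
        patches.append(flat)

print(len(patches), len(patches[0]))  # 196 768
```

In a real implementation this whole loop is the single strided convolution described above — same output, one fused operation.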

[CLS] Token and Positional Embeddings

196×768 → 197×768

A special learnable [CLS] token is prepended to the sequence (borrowed from BERT). After passing through all Transformer layers, the [CLS] token's output is used as the image representation for classification — it aggregates information from all patches via self-attention.

Positional embeddings — 197 learnable 768-dimensional vectors — are added element-wise to tell the Transformer where each patch came from. Without these, the model wouldn't know that patch 1 is top-left and patch 196 is bottom-right (Transformers are permutation-invariant by default).

Why learned positions, not fixed? ViT uses learned positional embeddings rather than the sinusoidal encodings from the original Transformer. The authors found learned positions worked equally well and are simpler. Interestingly, the learned positions end up encoding a 2D spatial pattern — nearby patches learn similar position embeddings.
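The prepend-and-add bookkeeping is simple; a minimal sketch with stand-in constant vectors (in ViT both the [CLS] token and the positional embeddings are learned parameters):

```python
# Sketch: prepend a [CLS] token and add positional embeddings element-wise.
# The 0.5 / 0.01 values are dummy stand-ins for learned parameters.
D, N = 768, 196

patch_embeddings = [[0.0] * D for _ in range(N)]     # 196 x 768, from patch projection
cls_token = [0.5] * D                                # 1 x 768, learned
tokens = [cls_token] + patch_embeddings              # 197 x 768

pos_embeddings = [[0.01] * D for _ in range(N + 1)]  # 197 x 768, learned
tokens = [[t + p for t, p in zip(tok, pos)]
          for tok, pos in zip(tokens, pos_embeddings)]

print(len(tokens), len(tokens[0]))  # 197 768
```

Note that the [CLS] token gets its own positional embedding (position 0), which is why there are 197 position vectors, not 196.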

2 Transformer Encoder

Multi-Head Self-Attention (MSA)

197×768 → 197×768

Each token (patch embedding) attends to every other token. This is where the magic happens — a patch showing a dog's ear can "look at" a patch showing a dog's nose and a patch showing a tail, connecting distant image regions that a CNN with its limited receptive field might miss.

With 12 attention heads, each head operates on 768/12 = 64 dimensions. Each head computes Q (query), K (key), and V (value) matrices, then attention weights = softmax(QK^T / √64). The outputs of all 12 heads are concatenated and projected back to 768 dimensions.

[Diagram: multi-head self-attention — 197×768 tokens projected to Q, K, V (197×64 per head); softmax(QK^T / √64) gives 197×197 attention weights applied to V; the 12 head outputs (12×64 = 768) are concatenated and linearly projected. Every patch can attend to every other patch — a global receptive field from layer 1.]
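Scaled dot-product attention for a single head can be written out in plain Python. This is a sketch on a tiny 3-token example, not the vectorized implementation a real framework would use:

```python
import math

# Sketch: scaled dot-product attention for one head (pure Python).
# ViT-Base runs 12 such heads over 197 tokens of width 64 each.
def attention(Q, K, V):
    d = len(K[0])
    out = []
    for q in Q:
        # scores: q . k / sqrt(d) against every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]        # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        # output: attention-weighted sum of value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Tiny example: 3 tokens, head dimension 2, self-attention (Q = K = V)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(Q, Q, Q)
print(len(out), len(out[0]))  # 3 2
```

Because each row of weights is a softmax, every output token is a convex combination of the value vectors — which is exactly why a patch can blend information from anywhere in the image.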

MLP (Feed-Forward Network)

197×768 → 197×768

After self-attention, each token independently passes through a two-layer MLP: expand to 4× the embedding dimension (768 → 3072), apply GELU activation, then project back (3072 → 768). This is where the model learns non-linear feature transformations — self-attention mixes information between tokens, and the MLP processes it.

[Diagram: 768 → Linear 768→3072 → GELU → Linear 3072→768 → 768, applied independently to each of the 197 tokens.]
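The expand-GELU-project pattern, sketched in plain Python with tiny stand-in weights (4→16→4 instead of 768→3072→768, keeping the same 4× ratio):

```python
import math

# Sketch: the per-token MLP — expand 4x, apply GELU, project back.
def gelu(x):
    # exact GELU via the Gaussian CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def mlp(token, W1, b1, W2, b2):
    hidden = [gelu(sum(t * w for t, w in zip(token, row)) + b)
              for row, b in zip(W1, b1)]                 # expand: d -> h
    return [sum(x * w for x, w in zip(hidden, row)) + b
            for row, b in zip(W2, b2)]                   # project back: h -> d

d, h = 4, 16  # same 4x ratio as 768 -> 3072
token = [0.1, -0.2, 0.3, 0.0]
W1 = [[0.1] * d for _ in range(h)]; b1 = [0.0] * h
W2 = [[0.1] * h for _ in range(d)]; b2 = [0.0] * d
out = mlp(token, W1, b1, W2, b2)
print(len(out))  # 4: same width as the input token, as required by the residual
```

The output width must match the input width so the residual connection (next subsection) can add the two element-wise.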

Residual Connections and Layer Normalization

Applied around both MSA and MLP

Just like ResNet's skip connections, ViT adds the input of each sub-layer to its output: x = x + MSA(LN(x)) and x = x + MLP(LN(x)). ViT uses Pre-Norm (Layer Norm before the sub-layer), which was found to train more stably than Post-Norm used in the original Transformer.
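The Pre-Norm ordering is easy to get wrong, so here is the control flow as a sketch, with identity/scaling stand-ins for the real LayerNorm, attention, and MLP:

```python
# Sketch: Pre-Norm encoder block ordering — x = x + MSA(LN(x)), x = x + MLP(LN(x)).
# layer_norm, msa, and mlp are stand-ins, used only to show the wiring.
def layer_norm(x):
    return x                      # stand-in: identity

def msa(x):
    return [v * 0.1 for v in x]   # stand-in for multi-head self-attention

def mlp(x):
    return [v * 0.1 for v in x]   # stand-in for the feed-forward network

def encoder_block(x):
    # normalize *before* each sub-layer, add the residual *after* it
    x = [a + b for a, b in zip(x, msa(layer_norm(x)))]
    x = [a + b for a, b in zip(x, mlp(layer_norm(x)))]
    return x

x = [1.0, 2.0]
for _ in range(12):               # ViT-Base stacks 12 identical blocks
    x = encoder_block(x)
print(x)
```

With these stand-ins each block multiplies the input by 1.1 twice; the point is only the ordering: the residual stream is never normalized in place, which is what makes deep stacks train stably.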

3 Classification Head

Extract [CLS] Token + MLP Head

768 → 1000

After 12 encoder layers, the [CLS] token (position 0) has attended to all 196 patch tokens repeatedly. It now encodes a global representation of the entire image. A final Layer Norm and an MLP head (one hidden layer during pre-training, single linear layer during fine-tuning) maps this 768-dimensional vector to class logits.

4 — CNN vs. ViT: What Changed

Receptive Field

[Diagram: a CNN's 3×3 kernel sees only its neighbors and needs many layers for global context; ViT's self-attention sees all patches at once — global context from the very first layer.]

Inductive Bias Trade-off

CNNs have strong inductive biases: locality (nearby pixels matter more), translation equivariance (a cat is a cat regardless of position), and hierarchical features (edges → textures → parts → objects). ViT has almost none of these — it must learn spatial structure entirely from data.

This means ViT needs much more data to train well. On ImageNet alone (1.2M images), ViT underperforms ResNet. But when pre-trained on JFT-300M (300M images) or ImageNet-21K (14M images), ViT surpasses CNNs on every benchmark — with fewer computational resources at inference.

[Plot: ImageNet top-1 accuracy vs. pre-training dataset size (ImageNet-1K → ImageNet-21K → JFT-300M) for ResNet and ViT — CNNs win at small data scale; ViT crosses over and wins at large scale.]

Attention Patterns: What ViT Learns to See

The authors visualized attention maps from different heads and layers. Early layers attend to nearby patches (learning local features like edges — similar to CNN conv layers). Later layers attend to semantically related patches across the entire image (a dog's face attends to its tail). Different heads specialize: some track horizontal patterns, others vertical, others diagonal.

[Figure: mean attention distance by layer — layer 1 is mostly local (~8 px average distance), layer 6 mixed local/global (~40 px), layer 12 global (~100 px). Deeper layers → larger attention distance → more global reasoning; some heads attend globally even in layer 1, as heads specialize for different spatial patterns. This emergence of CNN-like local processing in early layers shows ViT "reinvents" convolutional behavior when given enough data.]

Learned Position Embeddings: ViT Discovers 2D Structure

Even though ViT's position embeddings are 1D learned vectors (not explicitly 2D), the model discovers spatial structure on its own. Visualizing cosine similarity between position embeddings reveals a clear 2D grid pattern — nearby patches in the image end up with similar position embeddings.

[Figure: cosine similarity between each position embedding and that of one reference patch — nearby patches show high similarity, distant patches low. Key observation: 1D position embeddings spontaneously learn 2D spatial structure.]

4.5 — ViT Model Variants

ViT comes in three sizes. The naming convention is ViT-{Size}/{Patch Size} — so ViT-B/16 means Base model with 16×16 patches.

ViT Model Family:

| Model | Layers | Heads | Embed Dim | MLP Dim | Params | ImageNet Top-1 |
|---|---|---|---|---|---|---|
| ViT-Base | 12 | 12 | 768 | 3072 | 86M | B/16: 84.0%, B/32: 80.7% (ImageNet-21K pre-train) |
| ViT-Large | 24 | 16 | 1024 | 4096 | 307M | L/16: 87.8% (JFT-300M pre-train) |
| ViT-Huge | 32 | 16 | 1280 | 5120 | 632M | H/14: 88.6%, SOTA (JFT-300M pre-train) |
Patch size matters: Smaller patches (14×14 vs 16×16) produce more tokens (256 vs 196 for a 224×224 image), giving finer spatial resolution but costing quadratically more in self-attention. ViT-H/14 uses 14×14 patches — the extra resolution helps it hit the best accuracy but increases compute by ~1.7× vs /16.
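The token-count and cost arithmetic behind that ~1.7× figure, as a quick sketch:

```python
# Sketch: token count and attention-cost ratio for /16 vs /14 patches.
def num_tokens(image_size, patch_size, cls=1):
    # patches per side, squared, plus the [CLS] token
    return (image_size // patch_size) ** 2 + cls

t16 = num_tokens(224, 16)   # 196 patches + [CLS] = 197
t14 = num_tokens(224, 14)   # 256 patches + [CLS] = 257
print(t16, t14)  # 197 257

# Self-attention cost scales with (sequence length)^2
print(round((t14 / t16) ** 2, 2))  # 1.7
```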

5 — Tensor Shape Summary

For ViT-Base/16 with a 224×224 input:

| Stage | Operation | Output Shape | Notes |
|---|---|---|---|
| Input | Image | 224×224×3 | RGB, normalized |
| Patch Embed | 16×16 conv, stride 16 | 196×768 | 14×14 patches |
| Prepend [CLS] | Concat learned token | 197×768 | +1 classification token |
| + Pos Embed | Element-wise add | 197×768 | Learned positions |
| Encoder (×12) | Multi-head self-attention | 197×768 | 12 heads, 64 dim each |
| Encoder (×12) | MLP | 197×768 | 768→3072→768 |
| Head | Extract [CLS] + Linear | 1000 | Softmax probabilities |
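The same shape bookkeeping can be generated programmatically; a sketch (the function name and stage labels are illustrative, not from any library):

```python
# Sketch: trace tensor shapes through ViT-Base/16 for one image.
def vit_base16_shapes(image_hw=224, patch=16, dim=768, layers=12, classes=1000):
    """Return (stage, shape) pairs for a single image pass."""
    n = (image_hw // patch) ** 2                       # 196 patches
    stages = [("input", (image_hw, image_hw, 3)),
              ("patch_embed", (n, dim)),
              ("add_cls", (n + 1, dim))]
    for i in range(layers):
        # each encoder block (MSA + MLP) preserves the sequence shape
        stages.append((f"encoder_{i + 1}", (n + 1, dim)))
    stages.append(("cls_out", (dim,)))                 # extract [CLS] token
    stages.append(("logits", (classes,)))              # MLP head output
    return stages

shapes = vit_base16_shapes()
print(shapes[1], shapes[2], shapes[-1])
```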

6 — Results

| Model | Pre-training Data | ImageNet Top-1 | Params |
|---|---|---|---|
| ResNet-152 (baseline) | ImageNet-1K | 78.3% | 60M |
| ViT-B/16 | ImageNet-1K | 77.9% | 86M |
| ViT-B/16 | ImageNet-21K | 84.0% | 86M |
| ViT-L/16 | JFT-300M | 87.8% | 307M |
| ViT-H/14 | JFT-300M | 88.6% | 632M |
Legacy: ViT proved that Transformers can match or beat CNNs on vision tasks — given enough data. It spawned an explosion of vision Transformer variants (DeiT, Swin, BEiT, MAE) and became the de facto image encoder in multimodal models like CLIP, DALL-E, SAM, and Stable Diffusion.

7 — ViT's Descendants: The Vision Transformer Family Tree

ViT's impact was enormous — it launched an entire subfield. Here's how the key descendants address ViT's limitations:

[Diagram: ViT family tree — ViT (2020) → DeiT (2020: train on ImageNet-1K only via knowledge distillation + augmentation), Swin Transformer (2021: shifted-window local attention, hierarchical multi-scale features), MAE / BEiT (2021: self-supervised mask-and-reconstruct pre-training), DINO / DINOv2 (2021/23: self-distillation with no labels, emergent object segmentation). Downstream: CLIP, DALL-E, SAM, Stable Diffusion, GPT-4V, LLaVA, RF-DETR. ViT is now the default vision backbone for multimodal foundation models.]
| Model | Key Innovation | ViT Limitation Addressed |
|---|---|---|
| DeiT | Distillation token + heavy augmentation | Data hunger — trains on ImageNet-1K alone |
| Swin | Shifted window attention + hierarchical stages | Quadratic cost — O(n) local attention + multi-scale |
| MAE | Mask 75% of patches, reconstruct | Need for labels — self-supervised pre-training |
| DINO/DINOv2 | Self-distillation without labels | Label dependency — learns without any annotation |
| Hiera | Hierarchical ViT, MAE pre-trained | Speed — multi-scale features, 6× faster than ViT-H |

8 — References & Further Reading