YOLOX: Exceeding YOLO Series in 2021

Anchor-Free Object Detection, Layer by Layer
Tags: Object Detection · Anchor-Free · CSPDarknet · Decoupled Head · Megvii · 2021

1 — The Problem to Solve

Object detection is the task of finding every object in an image and drawing a tight bounding box around it while also labeling what it is. Unlike image classification (one label per image), detection must answer two questions simultaneously for potentially dozens of objects: what is it? and where is it?

A Concrete Example

Imagine a self-driving car's front camera captures a busy intersection. The model needs to find every pedestrian, car, bicycle, traffic light, and stop sign — and report each one's bounding box coordinates and class label — all in under 10 milliseconds so the car can react in real time.

Why YOLOX? Previous YOLO models used hand-designed "anchor boxes" — predefined bounding box templates the model would adjust. YOLOX removes anchors entirely, adds a decoupled head that handles classification and localization separately, and introduces a smarter label assignment strategy called SimOTA. The result: simpler design, fewer hyperparameters, better accuracy.

What the Model Receives and Returns

Input: An RGB image resized to 640 × 640 × 3. The three channels are red, green, and blue pixel intensities, typically normalized to [0, 1].

Output: A list of detections, each containing:

  • Bounding box coordinates (x, y, w, h)
  • Class label (e.g. person, car, dog) with a confidence score (e.g. 0.95)
  • Objectness score (e.g. 0.87) — "is there an object here at all?"

2 — Architecture Overview

YOLOX follows the classic three-stage detector pattern: Backbone extracts features, Neck fuses them across scales, and Head makes predictions. Here's the full pipeline:

Pipeline diagram: Input (640×640×3) → Backbone (CSPDarknet): Focus (320×320×12) → Dark2 (160×160×128) → Dark3/C3 (80×80×256) → Dark4/C4 (40×40×512) → Dark5+SPP/C5 (20×20×1024) → Neck (PAFPN): upsample and downsample paths fuse these into P3 (80×80×256), P4 (40×40×256), P5 (20×20×256) → Head (Decoupled): each scale gets a 1×1 conv stem and separate Cls and Reg+IoU branches, at strides 8 (small objects), 16 (medium), and 32 (large).

The image flows left to right. The backbone produces feature maps at three scales (80×80, 40×40, 20×20). The PAFPN neck fuses these features both top-down and bottom-up. Each fused scale feeds into its own decoupled head, which independently predicts classes, bounding boxes, and objectness.

3 — Example Inputs

YOLOX expects images resized to 640 × 640 pixels. During training, aggressive augmentations (Mosaic + MixUp) are applied. At inference, images are letterbox-resized to preserve aspect ratio with gray padding.

Preprocessing diagram: letterbox resize takes an image of any size (e.g. 1920×1080) to 640×640 with gray padding; Mosaic augmentation combines four training images into one 640×640 tile.
Mosaic augmentation: During training, four random images are stitched into a single 640×640 tile. This forces the model to see objects at varied scales and positions in a single pass — one of the key reasons YOLOX doesn't need ImageNet pretraining.
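The letterbox step is simple enough to sketch directly. The following is a minimal, dependency-free version (nearest-neighbor resize via index sampling stands in for the interpolation a real pipeline would get from cv2 or PIL; the bottom/right padding placement and the gray value 114 follow common YOLOX practice, but treat the details as an illustration):

```python
import numpy as np

def letterbox(img, size=640, pad_value=114):
    """Resize `img` (H, W, 3) to fit inside a size×size square, preserving
    aspect ratio, and pad the remainder with a constant gray value.
    Returns the padded image and the scale factor, which is needed later
    to map predicted boxes back to original image coordinates."""
    h, w = img.shape[:2]
    scale = min(size / h, size / w)           # shrink so the longer side fits
    new_h, new_w = round(h * scale), round(w * scale)

    # nearest-neighbor sampling grid (illustrative; real code interpolates)
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[rows[:, None], cols]

    canvas = np.full((size, size, 3), pad_value, dtype=img.dtype)
    canvas[:new_h, :new_w] = resized          # content top-left, padding below/right
    return canvas, scale

# a 1920×1080 frame scales by 1/3 → 640×360 content + gray padding below
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
padded, s = letterbox(frame)
```

The returned scale matters: a box predicted at (x, y) in the 640×640 frame maps back to (x / s, y / s) in the original image.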

4 — Layer-by-Layer Walkthrough

Let's trace a single 640 × 640 × 3 image through every stage. We'll track the tensor shape at each step so you always know what's happening to the data.

Backbone: CSPDarknet

The backbone's job is to convert raw pixels into rich feature representations at multiple scales. Think of it as the model's "eyes" — it learns to see edges, textures, parts, and eventually whole objects as you go deeper.

Focus Layer

640×640×3 → 320×320×64

The Focus layer is YOLOX's entry point. Instead of a standard convolution, it takes every other pixel to create four sub-images (like looking at even rows/even cols, even rows/odd cols, etc.), then stacks them along the channel dimension. This turns a 640×640×3 image into a 320×320×12 tensor without losing any information — the spatial resolution halves but the channels multiply by 4.

A single 3×3 convolution with 64 filters then compresses this to 320×320×64.

Focus diagram: a 4×4 single-channel example with pixels A–P is sliced into four 2×2 sub-images (even rows/even cols, even/odd, odd/even, odd/odd) and stacked along channels; on the real input this yields 320×320×12, which a 3×3 conv with 64 filters turns into 320×320×64. By contrast, a stride-2 conv would sample only 4 of the 16 pixels, discarding 75% of the information — Focus keeps all 16, trading spatial size for channels with zero pixel loss.
Why not just a strided conv? The Focus layer preserves every pixel — a strided convolution would skip pixels and lose information at the very first layer. In later YOLO versions this was replaced by a large-kernel conv, but in YOLOX it's a core design choice.
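The slicing itself is four strided views stacked along the channel axis. A minimal NumPy sketch (the real layer operates on batched CHW tensors, but the indexing pattern is identical):

```python
import numpy as np

def focus_slice(x):
    """Space-to-depth slicing used by the Focus layer.
    (H, W, C) -> (H/2, W/2, 4C); no pixel values are discarded.
    A 3×3 conv (not shown) then mixes the stacked channels."""
    return np.concatenate(
        [
            x[0::2, 0::2],  # even rows, even cols
            x[0::2, 1::2],  # even rows, odd cols
            x[1::2, 0::2],  # odd rows, even cols
            x[1::2, 1::2],  # odd rows, odd cols
        ],
        axis=-1,
    )

img = np.random.rand(640, 640, 3)
out = focus_slice(img)          # (320, 320, 12)
```

Every value in `img` appears exactly once in `out` — the operation is a lossless rearrangement, not a downsampling.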

CBS Module (Conv + BatchNorm + SiLU)

Used throughout

The fundamental building block. Every convolution in CSPDarknet is followed by Batch Normalization (stabilizes training by normalizing activations) and the SiLU activation function (a smooth version of ReLU that allows small negative gradients).

SiLU (also called Swish): f(x) = x · sigmoid(x). Unlike ReLU, it doesn't hard-kill negative values — this helps gradient flow in deep networks.
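The formula is worth seeing numerically — note how negative inputs are damped rather than zeroed:

```python
import math

def silu(x):
    """SiLU / Swish: x * sigmoid(x). Smooth, and unlike ReLU it never
    hard-zeroes negative inputs, so some gradient always flows."""
    return x * (1.0 / (1.0 + math.exp(-x)))

silu(-1.0)   # ≈ -0.269 — small negative signal survives (ReLU would give 0)
silu(0.0)    # 0.0
silu(5.0)    # ≈ 4.967 — approaches the identity for large positive x
```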

CBS diagram: feature map in → Conv (k×k, learned filters) → BatchNorm (normalize activations) → SiLU (x · σ(x)) → feature map out. This triplet is used as a single unit everywhere in CSPDarknet.

Dark2: CSP Block

320×320×64 → 160×160×128

A 3×3 convolution with stride 2 halves the spatial dimensions and doubles the channels. Then the CSP (Cross-Stage Partial) structure splits the feature map into two halves along the channel dimension: one half passes through a series of residual blocks, the other skips ahead. The two halves are concatenated back together.

Dark2 diagram: 320×320×64 → stride-2 conv → split into a skip path (64 ch) and a transform path (64 ch, two residual blocks of 1×1 + 3×3 convs) → concat → CBS → 160×160×128.
What CSP does: By splitting channels, CSP lets one path learn complex transformations while the other preserves the original signal. This reduces computation by ~50% compared to processing all channels through the residual blocks, while maintaining accuracy. Think of it as: "let half the channels do the hard work, then share notes."
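The channel-split pattern can be sketched at the array level (a toy illustration of the data flow only — `transform` stands in for the residual blocks, and the real block follows the concat with a 1×1 fusion conv):

```python
import numpy as np

def csp_block(x, transform):
    """Cross-Stage Partial pattern: split channels in half, run only one
    half through the expensive transform path, then concatenate with the
    untouched half. Shape is preserved; compute is roughly halved."""
    c = x.shape[-1] // 2
    skip, work = x[..., :c], x[..., c:]          # channel split
    return np.concatenate([skip, transform(work)], axis=-1)

feat = np.random.rand(160, 160, 128)
out = csp_block(feat, transform=lambda t: t * 2.0)   # toy "residual" path
```

The first 64 channels of `out` are the untouched skip path — "half the channels do the hard work, then share notes."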

Dark3: CSP Block — Output C3

160×160×128 → 80×80×256

Same structure as Dark2 — downsample by 2×, then CSP residual processing. The output at this stage, called C3, captures fine-grained features: edges, textures, small parts. This is the first feature map sent to the neck, and it's responsible for detecting small objects (stride 8 — each cell covers an 8×8 pixel region).


Dark4: CSP Block — Output C4

80×80×256 → 40×40×512

Another downsample + CSP stage. C4 captures mid-level features — object parts, shapes, and spatial relationships. Each grid cell now "sees" a 16×16 pixel region. This feature map handles medium-sized objects.

Dark5: CSP Block + SPP — Output C5

40×40×512 → 20×20×1024

The deepest stage. After the CSP block, a Spatial Pyramid Pooling (SPP) module is applied. SPP runs max pooling at three different kernel sizes (5×5, 9×9, 13×13) in parallel and concatenates the results. This lets the network aggregate context from different spatial extents without resizing the feature map.

C5 captures high-level semantic features — the model "understands" what objects are, not just their textures. Each cell covers a 32×32 pixel region. This handles large objects.

SPP diagram: the 20×20×1024 map is fed through parallel 5×5, 9×9, and 13×13 max pools, concatenated (together with the unpooled input), and fused by a CBS block back to the 20×20×1024 C5 output. Three receptive fields capture local, medium, and global context simultaneously.
SPP intuition: Imagine looking at a photo through windows of three different sizes. The small window sees fine details. The large window sees the big picture. SPP gives the model all three perspectives simultaneously, enriching the feature representation at the deepest layer.
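A shape-level sketch of the parallel pooling (small channel count for illustration; the fusion conv that follows the concat is omitted, and `maxpool_same` is a naive stride-1 pool, not an optimized kernel):

```python
import numpy as np

def maxpool_same(x, k):
    """Stride-1 max pool with 'same' padding: output keeps (H, W, C)."""
    p = k // 2
    padded = np.pad(x, ((p, p), (p, p), (0, 0)), constant_values=-np.inf)
    h, w, _ = x.shape
    # max over all k*k shifted views of the padded array
    windows = [padded[i:i + h, j:j + w] for i in range(k) for j in range(k)]
    return np.max(windows, axis=0)

def spp(x, kernels=(5, 9, 13)):
    """SPP sketch: identity plus three pooled views, concatenated along
    channels (4× channels before the fusion conv brings them back down)."""
    return np.concatenate([x] + [maxpool_same(x, k) for k in kernels], axis=-1)

feat = np.random.rand(20, 20, 8)   # small stand-in for the 20×20×1024 map
out = spp(feat)                    # (20, 20, 32) before fusion
```

Because every pooling window contains its own center pixel, each pooled channel is pointwise ≥ the input — pooling can only widen what each position "sees", never lose the local peak.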

Neck: PAFPN (Path Aggregation Feature Pyramid Network)

The backbone gave us three feature maps at different scales (C3, C4, C5). But there's a problem: C3 has great spatial detail but weak semantics, while C5 has strong semantics but poor localization. The PAFPN neck fixes this by fusing information in both directions.

Top-Down Path (FPN): C5 → C4 → C3

Deep semantics flow to shallow layers

Step 1: Take C5 (20×20×1024), reduce channels with a 1×1 conv to 20×20×512, then upsample (nearest-neighbor interpolation) to 40×40×512. Concatenate with C4 and pass through a CSP block to produce P4 (40×40×256).

Step 2: Upsample P4 to 80×80×256, concatenate with C3, and pass through another CSP block to produce P3 (80×80×256).

Top-down diagram: C5 (20×20×1024) → 1×1 conv → 2× upsample → concat with C4 (40×40×512) → CSP → P4 (40×40×256); then P4 → upsample → concat with C3 → CSP → P3 (80×80×256). Semantic information flows from deep → shallow.
What this achieves: The shallow layers (which are good at detecting small objects) now have access to the deep network's "understanding" of what objects actually look like. A small blob at 80×80 that looked ambiguous now has semantic context from C5 telling it "that's probably a person."

Bottom-Up Path (PAN): P3 → P4 → P5

Fine-grained details flow to deep layers

Step 3: Take P3, downsample with a stride-2 convolution to 40×40×256, concatenate with P4 (from the top-down path), and refine through a CSP block. This produces the final 40×40 feature map.

Step 4: Downsample again to 20×20×256, concatenate with P5, and refine to produce the final 20×20 feature map.

Bottom-up diagram: P3 (80×80) → stride-2 conv → concat with top-down P4 → CSP → final P4 (40×40); another stride-2 conv → concat with P5 → CSP → final P5 (20×20). Fine spatial details flow from shallow → deep.
Why two passes? The top-down pass gives shallow layers semantic power. The bottom-up pass gives deep layers spatial precision. After both passes, every scale has both strong semantics and fine-grained localization — the best of both worlds.
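The four fusion steps above can be traced purely at the shape level. This sketch uses the article's simplified channel convention (all fused maps end at 256 channels); the reference implementation's channel bookkeeping differs in detail, so read it as shape arithmetic, not as the exact layer graph:

```python
# Each tuple is (H, W, C).
def up(shape):
    h, w, c = shape
    return (h * 2, w * 2, c)

def down(shape):
    h, w, c = shape
    return (h // 2, w // 2, c)

def fuse(a, b, out_c):
    """Concat stand-in: maps must match spatially; CSP sets output channels."""
    assert a[:2] == b[:2], "can only concatenate maps of equal H×W"
    return (a[0], a[1], out_c)

C3, C4, C5 = (80, 80, 256), (40, 40, 512), (20, 20, 1024)

# top-down (FPN): a 1×1 conv first halves C5's channels, then upsample
P4_td = fuse(up((20, 20, 512)), C4, 256)
P3 = fuse(up(P4_td), C3, 256)

# bottom-up (PAN): stride-2 convs push spatial detail back down the pyramid
P4 = fuse(down(P3), P4_td, 256)
P5 = fuse(down(P4), (20, 20, 256), 256)   # concat with the reduced C5
```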

After the PAFPN, we have three fused feature maps, all with 256 channels:

Feature Map   Shape           Stride   Best For
P3            80 × 80 × 256   8        Small objects (pedestrians far away)
P4            40 × 40 × 256   16       Medium objects (cars, people nearby)
P5            20 × 20 × 256   32       Large objects (trucks, buildings)
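The stride column is just a coordinate mapping between grid cells and pixels; a tiny (hypothetical, not from the YOLOX codebase) helper makes the correspondence concrete:

```python
def cell_to_pixel_region(i, j, stride):
    """Pixel region (x1, y1, x2, y2) covered by grid cell at row i, col j."""
    return (j * stride, i * stride, (j + 1) * stride, (i + 1) * stride)

cell_to_pixel_region(0, 0, 8)    # → (0, 0, 8, 8): top-left cell of P3
cell_to_pixel_region(10, 5, 32)  # → (160, 320, 192, 352): one P5 cell
```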

Head: Decoupled Detection Head

This is YOLOX's signature innovation. Previous YOLOs used a single "coupled" head — one set of convolutions predicted both the class and the bounding box. YOLOX found that these two tasks conflict: classification wants features that are invariant to position, while regression wants features that are highly position-sensitive. Decoupling them improves both.

1×1 Conv: Channel Reduction

H×W×256 → H×W×256

Each scale's feature map first passes through a 1×1 convolution. This acts as a per-pixel channel mixer — it recombines the 256 channels into a new 256-dimensional representation optimized for the prediction tasks. This is where the "shared stem" ends and the two branches diverge.

Decoupled head diagram: P3/P4/P5 (H×W×256) → shared 1×1 conv stem → two branches, each with two 3×3 CBS blocks: a classification branch (H×W×80 — "what is it?") and a regression branch producing box coordinates (H×W×4 — "where?") plus an objectness/IoU score (H×W×1 — "is something there?").

Classification Branch

H×W×256 → H×W×num_classes

Two 3×3 convolutions (each followed by BatchNorm + SiLU) process the features specifically for classification. The final layer outputs a score for each of the 80 COCO classes at every spatial position. The output shape for the 80×80 scale would be 80×80×80 (one score per class per grid cell).

What it learns: "Is this a person? A car? A dog?" — position-invariant pattern matching.

Regression Branch + IoU

H×W×256 → H×W×4 + H×W×1

A separate pair of 3×3 convolutions specializes in predicting where objects are. It outputs:

  • 4 regression values per cell: (x_center, y_center, width, height) — the bounding box
  • 1 objectness score (IoU branch): "Is there actually an object here, or is this background?"

The IoU (Intersection over Union) branch is attached to regression rather than classification because it's fundamentally about spatial accuracy — how well does the predicted box overlap with the true box?
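IoU itself is a short computation — intersection area divided by the area of the union:

```python
def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

iou((0, 0, 10, 10), (0, 0, 10, 10))   # → 1.0, perfect overlap
iou((0, 0, 10, 10), (5, 0, 15, 10))   # → 1/3, half-overlapping boxes
iou((0, 0, 1, 1), (2, 2, 3, 3))       # → 0.0, disjoint boxes
```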

Combining Predictions Across Scales

8,400 total predictions

The three scales produce predictions at every grid cell:

  • 80 × 80 = 6,400 predictions (stride 8, small objects)
  • 40 × 40 = 1,600 predictions (stride 16, medium objects)
  • 20 × 20 = 400 predictions (stride 32, large objects)

Total: 8,400 candidate detections. Each one has 80 class scores + 4 box coordinates + 1 objectness score = 85 values. Non-Maximum Suppression (NMS) then filters these down to the final clean set of detections — typically a few dozen per image.
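The NMS filtering step is classic greedy suppression. A minimal NumPy sketch (the 0.45 threshold is a typical default, not a YOLOX-specific constant; real pipelines apply it per class):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop any
    remaining box that overlaps it by more than `iou_thresh`.
    `boxes` is (N, 4) in (x1, y1, x2, y2) form; returns kept indices."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        iw = np.clip(np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]), 0, None)
        ih = np.clip(np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]), 0, None)
        iou = iw * ih / (areas[i] + areas[rest] - iw * ih)
        order = rest[iou <= iou_thresh]       # suppress heavy overlaps
    return keep

# two near-duplicate boxes and one distant box → two survivors
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
nms(boxes, scores)   # → [0, 2]
```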

5 — Key Innovations

Anchor-Free Detection

Older YOLO models placed multiple "anchor boxes" (predefined bounding box templates of different aspect ratios) at each grid cell and predicted adjustments to these templates. This required careful tuning of anchor sizes per dataset. YOLOX simply predicts the box directly — each grid cell outputs one (x, y, w, h) prediction with no templates. Fewer hyperparameters, simpler code, faster inference.
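Concretely, each cell's four raw outputs are decoded relative to the cell and its stride — (x, y) as offsets from the cell's corner, (w, h) in log-space so sizes stay positive. A sketch of this decoding scheme (matching my reading of YOLOX's output decoding; treat the exact conventions as illustrative):

```python
import math

def decode(pred, grid_x, grid_y, stride):
    """Decode one raw anchor-free prediction into an image-space box."""
    tx, ty, tw, th = pred
    cx = (grid_x + tx) * stride      # box center in pixels
    cy = (grid_y + ty) * stride
    w = math.exp(tw) * stride        # exp keeps width/height positive
    h = math.exp(th) * stride
    return cx, cy, w, h

# cell (x=10, y=5) on the stride-32 map, raw outputs (0.5, 0.5, 0.0, 0.0):
decode((0.5, 0.5, 0.0, 0.0), grid_x=10, grid_y=5, stride=32)
# → center (336.0, 176.0), size 32.0 × 32.0
```

No anchor templates appear anywhere — the only geometric prior is the stride itself.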

Anchor-based (YOLOv3/v5): three anchors of predefined sizes per cell; the model predicts Δx, Δy, Δw, Δh relative to each anchor. Problems: anchor sizes must be tuned per dataset, and every cell makes 3× more predictions. Anchor-free (YOLOX): one direct (x, y, w, h) prediction per cell, measured from the cell center. Benefits: no anchor tuning, simpler code, fewer parameters.

SimOTA: Dynamic Label Assignment

During training, the model needs to decide which grid cells are "responsible" for each ground-truth object. Older methods used fixed rules (e.g., the center cell). SimOTA assigns labels dynamically by treating the problem as a simplified optimal transport: it finds a globally good matching between predictions and ground-truth boxes based on both classification and localization quality. The number of positive cells each object receives adapts to how many high-quality candidate predictions overlap it, rather than being fixed in advance.
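The dynamic-k idea can be sketched in a heavily simplified form. This toy version keeps only two ingredients — estimate k per object from the summed IoU of its best candidates, then take the k lowest-cost predictions — and omits the center-prior filtering and conflict resolution the real algorithm performs (everything here, including the cost definition, is an illustrative simplification):

```python
import numpy as np

def simota_assign(cost, ious, topk=10):
    """Simplified SimOTA sketch. Rows = ground-truth objects,
    columns = candidate predictions. For each object, a *dynamic* number
    k of positives is estimated from its top-`topk` IoUs, and the k
    cheapest predictions (by cost) become positives."""
    assignments = []
    for g in range(cost.shape[0]):
        top = np.sort(ious[g])[::-1][:topk]
        k = max(1, int(top.sum()))               # dynamic k per object
        positives = np.argsort(cost[g])[:k]      # cheapest k predictions
        assignments.append(sorted(positives.tolist()))
    return assignments

# toy: 2 objects × 6 predictions; object 0 has several strong overlaps,
# so it earns more positives (k=2) than the poorly-covered object 1 (k=1)
ious = np.array([[0.9, 0.8, 0.7, 0.1, 0.0, 0.0],
                 [0.3, 0.0, 0.0, 0.2, 0.1, 0.0]])
cost = 1.0 - ious                                # toy cost: lower = better
simota_assign(cost, ious)
```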

Strong Data Augmentation (No Pretraining)

YOLOX uses Mosaic (4 images stitched together) and MixUp (two images blended) augmentation so aggressively that the model can train from scratch on COCO — no ImageNet pretraining needed. This simplifies the training pipeline and avoids domain mismatch between pretraining and fine-tuning datasets.
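A minimal Mosaic sketch shows the tiling idea (the real augmentation picks a random center point, randomly scales and crops each image, and remaps box labels accordingly — none of that is shown here, and the equal-quadrant layout is a simplification):

```python
import numpy as np

def mosaic(imgs, size=640):
    """Tile four images into the quadrants of one size×size canvas.
    Assumes each input is already (size//2, size//2, 3)."""
    half = size // 2
    canvas = np.zeros((size, size, 3), dtype=imgs[0].dtype)
    canvas[:half, :half] = imgs[0]   # top-left
    canvas[:half, half:] = imgs[1]   # top-right
    canvas[half:, :half] = imgs[2]   # bottom-left
    canvas[half:, half:] = imgs[3]   # bottom-right
    return canvas

tiles = [np.full((320, 320, 3), v, np.uint8) for v in (10, 20, 30, 40)]
combined = mosaic(tiles)             # one 640×640 training image
```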

6 — Complete Tensor Shape Summary

For YOLOX-L with a 640×640 input:

Stage          Layer              Output Shape   Notes
Input          Image              640×640×3      RGB, normalized
Backbone       Focus              320×320×64     Slice + 3×3 conv
               Dark2              160×160×128    CSP block
               Dark3 (C3)         80×80×256      → sent to Neck
               Dark4 (C4)         40×40×512      → sent to Neck
               Dark5 + SPP (C5)   20×20×1024     → sent to Neck
Neck (FPN ↓)   P4 (top-down)      40×40×256      Upsample C5 + C4
               P3 (top-down)      80×80×256      Upsample P4 + C3
Neck (PAN ↑)   P4 (bottom-up)     40×40×256      Downsample P3 + P4
               P5 (bottom-up)     20×20×256      Downsample P4 + P5
Head           Cls branch (×3)    H×W×80         80 COCO classes
               Reg branch (×3)    H×W×4          x, y, w, h
               Obj branch (×3)    H×W×1          Objectness / IoU

7 — Results

On the COCO val2017 benchmark:

Model        AP (%)   Params   FPS (V100)
YOLOX-Nano   25.3     0.91M    —
YOLOX-S      40.5     9.0M     102
YOLOX-M      46.9     25.3M    81
YOLOX-L      49.7     54.2M    69
YOLOX-X      51.1     99.1M    58

YOLOX-L exceeded YOLOv5-L by 1.8% AP while running at comparable speed. The team also won 1st place in the Streaming Perception Challenge at CVPR 2021's Workshop on Autonomous Driving.

8 — References & Further Reading