YOLOX: Exceeding YOLO Series in 2021

Anchor-Free Object Detection, Layer by Layer
Tags: Object Detection · Anchor-Free · CSPDarknet · Decoupled Head · Megvii · 2021

1 — The Problem to Solve

Object detection is the task of finding every object in an image and drawing a tight bounding box around it while also labeling what it is. Unlike image classification (one label per image), detection must answer two questions simultaneously for potentially dozens of objects: what is it? and where is it?

A Concrete Example

Imagine a self-driving car's front camera captures a busy intersection. The model needs to find every pedestrian, car, bicycle, traffic light, and stop sign — and report each one's bounding box coordinates and class label — all in under 10 milliseconds so the car can react in real time.

Why YOLOX? Previous YOLO models used hand-designed "anchor boxes" — predefined bounding box templates the model would adjust. YOLOX removes anchors entirely, adds a decoupled head that handles classification and localization separately, and introduces a smarter label assignment strategy called SimOTA. The result: simpler design, fewer hyperparameters, better accuracy.

What the Model Receives and Returns

Input: An RGB image resized to 640 × 640 × 3. The three channels are red, green, and blue pixel intensities, typically normalized to [0, 1].

Output: A list of detections, each containing:

  • Bounding box coordinates (x, y, w, h)
  • Class label (e.g. person, car, dog) with a confidence score (e.g. 0.95)
  • Objectness score (e.g. 0.87) — "is there an object here at all?"

2 — Architecture Overview

YOLOX follows the classic three-stage detector pattern: Backbone extracts features, Neck fuses them across scales, and Head makes predictions. Here's the full pipeline:

Pipeline diagram: Input (640×640×3) → Backbone (CSPDarknet): Focus (320×320×12) → Dark2 (160×160×128) → Dark3/C3 (80×80×256) → Dark4/C4 (40×40×512) → Dark5+SPP/C5 (20×20×1024) → Neck (PAFPN): upsample and downsample paths fuse these into P3 (80×80×256), P4 (40×40×256), P5 (20×20×256) → Head (Decoupled): each scale gets a 1×1 conv stem and separate Cls and Reg+IoU branches, at strides 8 (small objects), 16 (medium), and 32 (large).

The image flows left to right. The backbone produces feature maps at three scales (80×80, 40×40, 20×20). The PAFPN neck fuses these features both top-down and bottom-up. Each fused scale feeds into its own decoupled head, which independently predicts classes, bounding boxes, and objectness.

3 — Example Inputs

YOLOX expects images resized to 640 × 640 pixels. During training, aggressive augmentations (Mosaic + MixUp) are applied. At inference, images are letterbox-resized to preserve aspect ratio with gray padding.

Preprocessing diagram: letterbox resize takes an image of any size (e.g. 1920×1080) to 640×640 with gray padding; Mosaic augmentation combines four training images into one 640×640 tile.
Mosaic augmentation: During training, four random images are stitched into a single 640×640 tile. This forces the model to see objects at varied scales and positions in a single pass — one of the key reasons YOLOX doesn't need ImageNet pretraining.
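The letterbox step is simple enough to sketch directly. The following is a minimal, dependency-free version (nearest-neighbor resize via index sampling stands in for the interpolation a real pipeline would get from cv2 or PIL; the bottom/right padding placement and the gray value 114 follow common YOLOX practice, but treat the details as an illustration):

```python
import numpy as np

def letterbox(img, size=640, pad_value=114):
    """Resize `img` (H, W, 3) to fit inside a size×size square, preserving
    aspect ratio, and pad the remainder with a constant gray value.
    Returns the padded image and the scale factor, which is needed later
    to map predicted boxes back to original image coordinates."""
    h, w = img.shape[:2]
    scale = min(size / h, size / w)           # shrink so the longer side fits
    new_h, new_w = round(h * scale), round(w * scale)

    # nearest-neighbor sampling grid (illustrative; real code interpolates)
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[rows[:, None], cols]

    canvas = np.full((size, size, 3), pad_value, dtype=img.dtype)
    canvas[:new_h, :new_w] = resized          # content top-left, padding below/right
    return canvas, scale

# a 1920×1080 frame scales by 1/3 → 640×360 content + gray padding below
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
padded, s = letterbox(frame)
```

The returned scale matters: a box predicted at (x, y) in the 640×640 frame maps back to (x / s, y / s) in the original image.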

4 — Layer-by-Layer Walkthrough

Let's trace a single 640 × 640 × 3 image through every stage. We'll track the tensor shape at each step so you always know what's happening to the data.

Backbone: CSPDarknet

The backbone's job is to convert raw pixels into rich feature representations at multiple scales. Think of it as the model's "eyes" — it learns to see edges, textures, parts, and eventually whole objects as you go deeper.

Focus Layer

640×640×3 → 320×320×64

The Focus layer is YOLOX's entry point. Instead of a standard convolution, it takes every other pixel to create four sub-images (like looking at even rows/even cols, even rows/odd cols, etc.), then stacks them along the channel dimension. This turns a 640×640×3 image into a 320×320×12 tensor without losing any information — the spatial resolution halves but the channels multiply by 4.

A single 3×3 convolution with 64 filters then compresses this to 320×320×64.

Focus diagram: a 4×4 single-channel example with pixels A–P is sliced into four 2×2 sub-images (even rows/even cols, even/odd, odd/even, odd/odd) and stacked along channels; on the real input this yields 320×320×12, which a 3×3 conv with 64 filters turns into 320×320×64. By contrast, a stride-2 conv would sample only 4 of the 16 pixels, discarding 75% of the information — Focus keeps all 16, trading spatial size for channels with zero pixel loss.
Why not just a strided conv? The Focus layer preserves every pixel — a strided convolution would skip pixels and lose information at the very first layer. In later YOLO versions this was replaced by a large-kernel conv, but in YOLOX it's a core design choice.
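The slicing itself is four strided views stacked along the channel axis. A minimal NumPy sketch (the real layer operates on batched CHW tensors, but the indexing pattern is identical):

```python
import numpy as np

def focus_slice(x):
    """Space-to-depth slicing used by the Focus layer.
    (H, W, C) -> (H/2, W/2, 4C); no pixel values are discarded.
    A 3×3 conv (not shown) then mixes the stacked channels."""
    return np.concatenate(
        [
            x[0::2, 0::2],  # even rows, even cols
            x[0::2, 1::2],  # even rows, odd cols
            x[1::2, 0::2],  # odd rows, even cols
            x[1::2, 1::2],  # odd rows, odd cols
        ],
        axis=-1,
    )

img = np.random.rand(640, 640, 3)
out = focus_slice(img)          # (320, 320, 12)
```

Every value in `img` appears exactly once in `out` — the operation is a lossless rearrangement, not a downsampling.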

CBS Module (Conv + BatchNorm + SiLU)

Used throughout

The fundamental building block. Every convolution in CSPDarknet is followed by Batch Normalization (stabilizes training by normalizing activations) and the SiLU activation function (a smooth version of ReLU that allows small negative gradients).

SiLU (also called Swish): f(x) = x · sigmoid(x). Unlike ReLU, it doesn't hard-kill negative values — this helps gradient flow in deep networks.
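The formula is worth seeing numerically — note how negative inputs are damped rather than zeroed:

```python
import math

def silu(x):
    """SiLU / Swish: x * sigmoid(x). Smooth, and unlike ReLU it never
    hard-zeroes negative inputs, so some gradient always flows."""
    return x * (1.0 / (1.0 + math.exp(-x)))

silu(-1.0)   # ≈ -0.269 — small negative signal survives (ReLU would give 0)
silu(0.0)    # 0.0
silu(5.0)    # ≈ 4.967 — approaches the identity for large positive x
```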

CBS diagram: feature map in → Conv (k×k, learned filters) → BatchNorm (normalize activations) → SiLU (x · σ(x)) → feature map out. This triplet is used as a single unit everywhere in CSPDarknet.

Dark2: CSP Block

320×320×64 → 160×160×128

A 3×3 convolution with stride 2 halves the spatial dimensions and doubles the channels. Then the CSP (Cross-Stage Partial) structure splits the feature map into two halves along the channel dimension: one half passes through a series of residual blocks, the other skips ahead. The two halves are concatenated back together.

Dark2 diagram: 320×320×64 → stride-2 conv → split into a skip path (64 ch) and a transform path (64 ch, two residual blocks of 1×1 + 3×3 convs) → concat → CBS → 160×160×128.
What CSP does: By splitting channels, CSP lets one path learn complex transformations while the other preserves the original signal. This reduces computation by ~50% compared to processing all channels through the residual blocks, while maintaining accuracy. Think of it as: "let half the channels do the hard work, then share notes."
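The channel-split pattern can be sketched at the array level (a toy illustration of the data flow only — `transform` stands in for the residual blocks, and the real block follows the concat with a 1×1 fusion conv):

```python
import numpy as np

def csp_block(x, transform):
    """Cross-Stage Partial pattern: split channels in half, run only one
    half through the expensive transform path, then concatenate with the
    untouched half. Shape is preserved; compute is roughly halved."""
    c = x.shape[-1] // 2
    skip, work = x[..., :c], x[..., c:]          # channel split
    return np.concatenate([skip, transform(work)], axis=-1)

feat = np.random.rand(160, 160, 128)
out = csp_block(feat, transform=lambda t: t * 2.0)   # toy "residual" path
```

The first 64 channels of `out` are the untouched skip path — "half the channels do the hard work, then share notes."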

Dark3: CSP Block — Output C3

160×160×128 → 80×80×256

Same structure as Dark2 — downsample by 2×, then CSP residual processing. The output at this stage, called C3, captures fine-grained features: edges, textures, small parts. This is the first feature map sent to the neck, and it's responsible for detecting small objects (stride 8 — each cell covers an 8×8 pixel region).


Dark4: CSP Block — Output C4

80×80×256 → 40×40×512

Another downsample + CSP stage. C4 captures mid-level features — object parts, shapes, and spatial relationships. Each grid cell now "sees" a 16×16 pixel region. This feature map handles medium-sized objects.

Dark5: CSP Block + SPP — Output C5

40×40×512 → 20×20×1024

The deepest stage. After the CSP block, a Spatial Pyramid Pooling (SPP) module is applied. SPP runs max pooling at three different kernel sizes (5×5, 9×9, 13×13) in parallel and concatenates the results. This lets the network aggregate context from different spatial extents without resizing the feature map.

C5 captures high-level semantic features — the model "understands" what objects are, not just their textures. Each cell covers a 32×32 pixel region. This handles large objects.

SPP diagram: the 20×20×1024 map is fed through parallel 5×5, 9×9, and 13×13 max pools, concatenated (together with the unpooled input), and fused by a CBS block back to the 20×20×1024 C5 output. Three receptive fields capture local, medium, and global context simultaneously.
SPP intuition: Imagine looking at a photo through windows of three different sizes. The small window sees fine details. The large window sees the big picture. SPP gives the model all three perspectives simultaneously, enriching the feature representation at the deepest layer.
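A shape-level sketch of the parallel pooling (small channel count for illustration; the fusion conv that follows the concat is omitted, and `maxpool_same` is a naive stride-1 pool, not an optimized kernel):

```python
import numpy as np

def maxpool_same(x, k):
    """Stride-1 max pool with 'same' padding: output keeps (H, W, C)."""
    p = k // 2
    padded = np.pad(x, ((p, p), (p, p), (0, 0)), constant_values=-np.inf)
    h, w, _ = x.shape
    # max over all k*k shifted views of the padded array
    windows = [padded[i:i + h, j:j + w] for i in range(k) for j in range(k)]
    return np.max(windows, axis=0)

def spp(x, kernels=(5, 9, 13)):
    """SPP sketch: identity plus three pooled views, concatenated along
    channels (4× channels before the fusion conv brings them back down)."""
    return np.concatenate([x] + [maxpool_same(x, k) for k in kernels], axis=-1)

feat = np.random.rand(20, 20, 8)   # small stand-in for the 20×20×1024 map
out = spp(feat)                    # (20, 20, 32) before fusion
```

Because every pooling window contains its own center pixel, each pooled channel is pointwise ≥ the input — pooling can only widen what each position "sees", never lose the local peak.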

Neck: PAFPN (Path Aggregation Feature Pyramid Network)

The backbone gave us three feature maps at different scales (C3, C4, C5). But there's a problem: C3 has great spatial detail but weak semantics, while C5 has strong semantics but poor localization. The PAFPN neck fixes this by fusing information in both directions.

Top-Down Path (FPN): C5 → C4 → C3

Deep semantics flow to shallow layers

Step 1: Take C5 (20×20×1024), reduce channels with a 1×1 conv to 20×20×512, then upsample (nearest-neighbor interpolation) to 40×40×512. Concatenate with C4 and pass through a CSP block to produce P4 (40×40×256).

Step 2: Upsample P4 to 80×80×256, concatenate with C3, and pass through another CSP block to produce P3 (80×80×256).

Top-down diagram: C5 (20×20×1024) → 1×1 conv → 2× upsample → concat with C4 (40×40×512) → CSP → P4 (40×40×256); then P4 → upsample → concat with C3 → CSP → P3 (80×80×256). Semantic information flows from deep → shallow.
What this achieves: The shallow layers (which are good at detecting small objects) now have access to the deep network's "understanding" of what objects actually look like. A small blob at 80×80 that looked ambiguous now has semantic context from C5 telling it "that's probably a person."

Bottom-Up Path (PAN): P3 → P4 → P5

Fine-grained details flow to deep layers

Step 3: Take P3, downsample with a stride-2 convolution to 40×40×256, concatenate with P4 (from the top-down path), and refine through a CSP block. This produces the final 40×40 feature map.

Step 4: Downsample again to 20×20×256, concatenate with P5, and refine to produce the final 20×20 feature map.

Bottom-up diagram: P3 (80×80) → stride-2 conv → concat with top-down P4 → CSP → final P4 (40×40); another stride-2 conv → concat with P5 → CSP → final P5 (20×20). Fine spatial details flow from shallow → deep.
Why two passes? The top-down pass gives shallow layers semantic power. The bottom-up pass gives deep layers spatial precision. After both passes, every scale has both strong semantics and fine-grained localization — the best of both worlds.
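The four fusion steps above can be traced purely at the shape level. This sketch uses the article's simplified channel convention (all fused maps end at 256 channels); the reference implementation's channel bookkeeping differs in detail, so read it as shape arithmetic, not as the exact layer graph:

```python
# Each tuple is (H, W, C).
def up(shape):
    h, w, c = shape
    return (h * 2, w * 2, c)

def down(shape):
    h, w, c = shape
    return (h // 2, w // 2, c)

def fuse(a, b, out_c):
    """Concat stand-in: maps must match spatially; CSP sets output channels."""
    assert a[:2] == b[:2], "can only concatenate maps of equal H×W"
    return (a[0], a[1], out_c)

C3, C4, C5 = (80, 80, 256), (40, 40, 512), (20, 20, 1024)

# top-down (FPN): a 1×1 conv first halves C5's channels, then upsample
P4_td = fuse(up((20, 20, 512)), C4, 256)
P3 = fuse(up(P4_td), C3, 256)

# bottom-up (PAN): stride-2 convs push spatial detail back down the pyramid
P4 = fuse(down(P3), P4_td, 256)
P5 = fuse(down(P4), (20, 20, 256), 256)   # concat with the reduced C5
```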

After the PAFPN, we have three fused feature maps, all with 256 channels:

Feature Map   Shape           Stride   Best For
P3            80 × 80 × 256   8        Small objects (pedestrians far away)
P4            40 × 40 × 256   16       Medium objects (cars, people nearby)
P5            20 × 20 × 256   32       Large objects (trucks, buildings)
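The stride column is just a coordinate mapping between grid cells and pixels; a tiny (hypothetical, not from the YOLOX codebase) helper makes the correspondence concrete:

```python
def cell_to_pixel_region(i, j, stride):
    """Pixel region (x1, y1, x2, y2) covered by grid cell at row i, col j."""
    return (j * stride, i * stride, (j + 1) * stride, (i + 1) * stride)

cell_to_pixel_region(0, 0, 8)    # → (0, 0, 8, 8): top-left cell of P3
cell_to_pixel_region(10, 5, 32)  # → (160, 320, 192, 352): one P5 cell
```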

Head: Decoupled Detection Head

This is YOLOX's signature innovation. Previous YOLOs used a single "coupled" head — one set of convolutions predicted both the class and the bounding box. YOLOX found that these two tasks conflict: classification wants features that are invariant to position, while regression wants features that are highly position-sensitive. Decoupling them improves both.

1×1 Conv: Channel Reduction

H×W×256 → H×W×256

Each scale's feature map first passes through a 1×1 convolution. This acts as a per-pixel channel mixer — it recombines the 256 channels into a new 256-dimensional representation optimized for the prediction tasks. This is where the "shared stem" ends and the two branches diverge.

Decoupled head diagram: P3/P4/P5 (H×W×256) → shared 1×1 conv stem → two branches, each with two 3×3 CBS blocks: a classification branch (H×W×80 — "what is it?") and a regression branch producing box coordinates (H×W×4 — "where?") plus an objectness/IoU score (H×W×1 — "is something there?").

Classification Branch

H×W×256 → H×W×num_classes

Two 3×3 convolutions (each followed by BatchNorm + SiLU) process the features specifically for classification. The final layer outputs a score for each of the 80 COCO classes at every spatial position. The output shape for the 80×80 scale would be 80×80×80 (one score per class per grid cell).

What it learns: "Is this a person? A car? A dog?" — position-invariant pattern matching.

Regression Branch + IoU

H×W×256 → H×W×4 + H×W×1

A separate pair of 3×3 convolutions specializes in predicting where objects are. It outputs:

  • 4 regression values per cell: (x_center, y_center, width, height) — the bounding box
  • 1 objectness score (IoU branch): "Is there actually an object here, or is this background?"

The IoU (Intersection over Union) branch is attached to regression rather than classification because it's fundamentally about spatial accuracy — how well does the predicted box overlap with the true box?
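IoU itself is a short computation — intersection area divided by the area of the union:

```python
def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

iou((0, 0, 10, 10), (0, 0, 10, 10))   # → 1.0, perfect overlap
iou((0, 0, 10, 10), (5, 0, 15, 10))   # → 1/3, half-overlapping boxes
iou((0, 0, 1, 1), (2, 2, 3, 3))       # → 0.0, disjoint boxes
```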

Combining Predictions Across Scales

8,400 total predictions

The three scales produce predictions at every grid cell:

  • 80 × 80 = 6,400 predictions (stride 8, small objects)
  • 40 × 40 = 1,600 predictions (stride 16, medium objects)
  • 20 × 20 = 400 predictions (stride 32, large objects)

Total: 8,400 candidate detections. Each one has 80 class scores + 4 box coordinates + 1 objectness score = 85 values. Non-Maximum Suppression (NMS) then filters these down to the final clean set of detections — typically a few dozen per image.
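The NMS filtering step is classic greedy suppression. A minimal NumPy sketch (the 0.45 threshold is a typical default, not a YOLOX-specific constant; real pipelines apply it per class):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop any
    remaining box that overlaps it by more than `iou_thresh`.
    `boxes` is (N, 4) in (x1, y1, x2, y2) form; returns kept indices."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        iw = np.clip(np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]), 0, None)
        ih = np.clip(np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]), 0, None)
        iou = iw * ih / (areas[i] + areas[rest] - iw * ih)
        order = rest[iou <= iou_thresh]       # suppress heavy overlaps
    return keep

# two near-duplicate boxes and one distant box → two survivors
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
nms(boxes, scores)   # → [0, 2]
```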

5 — Key Innovations

Anchor-Free Detection

Older YOLO models placed multiple "anchor boxes" (predefined bounding box templates of different aspect ratios) at each grid cell and predicted adjustments to these templates. This required careful tuning of anchor sizes per dataset. YOLOX simply predicts the box directly — each grid cell outputs one (x, y, w, h) prediction with no templates. Fewer hyperparameters, simpler code, faster inference.
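Concretely, each cell's four raw outputs are decoded relative to the cell and its stride — (x, y) as offsets from the cell's corner, (w, h) in log-space so sizes stay positive. A sketch of this decoding scheme (matching my reading of YOLOX's output decoding; treat the exact conventions as illustrative):

```python
import math

def decode(pred, grid_x, grid_y, stride):
    """Decode one raw anchor-free prediction into an image-space box."""
    tx, ty, tw, th = pred
    cx = (grid_x + tx) * stride      # box center in pixels
    cy = (grid_y + ty) * stride
    w = math.exp(tw) * stride        # exp keeps width/height positive
    h = math.exp(th) * stride
    return cx, cy, w, h

# cell (x=10, y=5) on the stride-32 map, raw outputs (0.5, 0.5, 0.0, 0.0):
decode((0.5, 0.5, 0.0, 0.0), grid_x=10, grid_y=5, stride=32)
# → center (336.0, 176.0), size 32.0 × 32.0
```

No anchor templates appear anywhere — the only geometric prior is the stride itself.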

Anchor-based (YOLOv3/v5): three anchors of predefined sizes per cell; the model predicts Δx, Δy, Δw, Δh relative to each anchor. Problems: anchor sizes must be tuned per dataset, and every cell makes 3× more predictions. Anchor-free (YOLOX): one direct (x, y, w, h) prediction per cell, measured from the cell center. Benefits: no anchor tuning, simpler code, fewer parameters.

SimOTA: Dynamic Label Assignment

During training, the model needs to decide which grid cells are "responsible" for each ground-truth object. Older methods used fixed rules (e.g., the center cell). SimOTA assigns labels dynamically by treating the problem as a simplified optimal transport: it finds a globally good matching between predictions and ground-truth boxes based on both classification and localization quality. The number of positive cells each object receives adapts to how many high-quality candidate predictions overlap it, rather than being fixed in advance.
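The dynamic-k idea can be sketched in a heavily simplified form. This toy version keeps only two ingredients — estimate k per object from the summed IoU of its best candidates, then take the k lowest-cost predictions — and omits the center-prior filtering and conflict resolution the real algorithm performs (everything here, including the cost definition, is an illustrative simplification):

```python
import numpy as np

def simota_assign(cost, ious, topk=10):
    """Simplified SimOTA sketch. Rows = ground-truth objects,
    columns = candidate predictions. For each object, a *dynamic* number
    k of positives is estimated from its top-`topk` IoUs, and the k
    cheapest predictions (by cost) become positives."""
    assignments = []
    for g in range(cost.shape[0]):
        top = np.sort(ious[g])[::-1][:topk]
        k = max(1, int(top.sum()))               # dynamic k per object
        positives = np.argsort(cost[g])[:k]      # cheapest k predictions
        assignments.append(sorted(positives.tolist()))
    return assignments

# toy: 2 objects × 6 predictions; object 0 has several strong overlaps,
# so it earns more positives (k=2) than the poorly-covered object 1 (k=1)
ious = np.array([[0.9, 0.8, 0.7, 0.1, 0.0, 0.0],
                 [0.3, 0.0, 0.0, 0.2, 0.1, 0.0]])
cost = 1.0 - ious                                # toy cost: lower = better
simota_assign(cost, ious)
```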

Strong Data Augmentation (No Pretraining)

YOLOX uses Mosaic (4 images stitched together) and MixUp (two images blended) augmentation so aggressively that the model can train from scratch on COCO — no ImageNet pretraining needed. This simplifies the training pipeline and avoids domain mismatch between pretraining and fine-tuning datasets.
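A minimal Mosaic sketch shows the tiling idea (the real augmentation picks a random center point, randomly scales and crops each image, and remaps box labels accordingly — none of that is shown here, and the equal-quadrant layout is a simplification):

```python
import numpy as np

def mosaic(imgs, size=640):
    """Tile four images into the quadrants of one size×size canvas.
    Assumes each input is already (size//2, size//2, 3)."""
    half = size // 2
    canvas = np.zeros((size, size, 3), dtype=imgs[0].dtype)
    canvas[:half, :half] = imgs[0]   # top-left
    canvas[:half, half:] = imgs[1]   # top-right
    canvas[half:, :half] = imgs[2]   # bottom-left
    canvas[half:, half:] = imgs[3]   # bottom-right
    return canvas

tiles = [np.full((320, 320, 3), v, np.uint8) for v in (10, 20, 30, 40)]
combined = mosaic(tiles)             # one 640×640 training image
```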

6 — Complete Tensor Shape Summary

For YOLOX-L with a 640×640 input:

Stage          Layer              Output Shape   Notes
Input          Image              640×640×3      RGB, normalized
Backbone       Focus              320×320×64     Slice + 3×3 conv
               Dark2              160×160×128    CSP block
               Dark3 (C3)         80×80×256      → sent to Neck
               Dark4 (C4)         40×40×512      → sent to Neck
               Dark5 + SPP (C5)   20×20×1024     → sent to Neck
Neck (FPN ↓)   P4 (top-down)      40×40×256      Upsample C5 + C4
               P3 (top-down)      80×80×256      Upsample P4 + C3
Neck (PAN ↑)   P4 (bottom-up)     40×40×256      Downsample P3 + P4
               P5 (bottom-up)     20×20×256      Downsample P4 + P5
Head           Cls branch (×3)    H×W×80         80 COCO classes
               Reg branch (×3)    H×W×4          x, y, w, h
               Obj branch (×3)    H×W×1          Objectness / IoU

7 — Results

On the COCO val2017 benchmark:

Model        AP (%)   Params   FPS (V100)
YOLOX-Nano   25.3     0.91M    —
YOLOX-S      40.5     9.0M     102
YOLOX-M      46.9     25.3M    81
YOLOX-L      49.7     54.2M    69
YOLOX-X      51.1     99.1M    58

YOLOX-L exceeded YOLOv5-L by 1.8% AP while running at comparable speed. The team also won 1st place in the Streaming Perception Challenge at CVPR 2021's Workshop on Autonomous Driving.

8 — References & Further Reading