YOLOX: Exceeding YOLO Series in 2021
1 — The Problem to Solve
Object detection is the task of finding every object in an image and drawing a tight bounding box around it while also labeling what it is. Unlike image classification (one label per image), detection must answer two questions simultaneously for potentially dozens of objects: what is it? and where is it?
A Concrete Example
Imagine a self-driving car's front camera captures a busy intersection. The model needs to find every pedestrian, car, bicycle, traffic light, and stop sign — and report each one's bounding box coordinates and class label — all in under 10 milliseconds so the car can react in real time.
What the Model Receives and Returns
Input: An RGB image resized to 640 × 640 × 3. The three channels are red, green, and blue pixel intensities, typically normalized to [0, 1].
Output: A list of detections, each containing:
- A bounding box: (x_center, y_center, width, height)
- A class label (e.g., "person", "car", "dog") with a confidence score
- An objectness score — how confident the model is that something is there at all
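To make the output concrete, here's a minimal Python sketch of what one detection might look like as a data structure. The field names and the final-confidence convention (class score × objectness) are illustrative, not part of any official API:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    # Box center, width, and height in input-image pixels
    x_center: float
    y_center: float
    width: float
    height: float
    class_name: str       # e.g. "person"
    class_score: float    # confidence for the predicted class
    objectness: float     # confidence that any object is present at all

# A plausible detection of a person; the final confidence is commonly
# taken as the product of the two scores.
det = Detection(320.0, 240.0, 80.0, 160.0, "person", 0.92, 0.88)
confidence = det.class_score * det.objectness
```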
2 — Architecture Overview
YOLOX follows the classic three-stage detector pattern: Backbone extracts features, Neck fuses them across scales, and Head makes predictions. Here's the full pipeline:
The image flows left to right. The backbone produces feature maps at three scales (80×80, 40×40, 20×20). The PAFPN neck fuses these features both top-down and bottom-up. Each fused scale feeds into its own decoupled head, which independently predicts classes, bounding boxes, and objectness.
3 — Example Inputs
YOLOX expects images resized to 640 × 640 pixels. During training, aggressive augmentations (Mosaic + MixUp) are applied. At inference, images are letterbox-resized to preserve aspect ratio with gray padding.
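A minimal numpy sketch of letterbox resizing. The official implementation uses OpenCV; here nearest-neighbor index sampling stands in for `cv2.resize`, and the gray pad value 114 follows the YOLOX repo's preprocessing:

```python
import numpy as np

def letterbox(img: np.ndarray, size: int = 640, pad_value: int = 114) -> np.ndarray:
    """Resize keeping aspect ratio, pad the remainder with gray."""
    h, w = img.shape[:2]
    scale = min(size / h, size / w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    # Nearest-neighbor resize via index sampling (stand-in for cv2.resize)
    ys = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[ys[:, None], xs]
    canvas = np.full((size, size, img.shape[2]), pad_value, dtype=img.dtype)
    canvas[:new_h, :new_w] = resized  # image in the top-left, padding bottom/right
    return canvas

img = np.zeros((480, 640, 3), dtype=np.uint8)  # a 4:3 input image
out = letterbox(img)
print(out.shape)  # (640, 640, 3): image rows 0-479, gray padding below
```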
4 — Layer-by-Layer Walkthrough
Let's trace a single 640 × 640 × 3 image through every stage. We'll track the tensor shape at each step so you always know what's happening to the data.
Backbone: CSPDarknet53
The backbone's job is to convert raw pixels into rich feature representations at multiple scales. Think of it as the model's "eyes" — it learns to see edges, textures, parts, and eventually whole objects as you go deeper.
Focus Layer
The Focus layer is YOLOX's entry point. Instead of a standard convolution, it takes every other pixel to create four sub-images (like looking at even rows/even cols, even rows/odd cols, etc.), then stacks them along the channel dimension. This turns a 640×640×3 image into a 320×320×12 tensor without losing any information — the spatial resolution halves but the channels multiply by 4.
A single 3×3 convolution with 64 filters then compresses this to 320×320×64.
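The slicing trick can be sketched in a few lines of numpy. The channel ordering of the four phases is illustrative — implementations differ on which slice comes first:

```python
import numpy as np

def focus_slice(img: np.ndarray) -> np.ndarray:
    """Space-to-depth: sample every other pixel in 4 phases, stack on channels."""
    return np.concatenate(
        [img[0::2, 0::2],   # even rows, even cols
         img[1::2, 0::2],   # odd rows,  even cols
         img[0::2, 1::2],   # even rows, odd cols
         img[1::2, 1::2]],  # odd rows,  odd cols
        axis=-1,
    )

x = np.random.rand(640, 640, 3)
y = focus_slice(x)
print(y.shape)  # (320, 320, 12): spatial halved, channels x4, no pixels lost
```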
CBS Module (Conv + BatchNorm + SiLU)
The fundamental building block. Every convolution in CSPDarknet is followed by Batch Normalization (which stabilizes training by normalizing activations) and the SiLU activation function (a smooth alternative to ReLU).
SiLU (also called Swish): f(x) = x · sigmoid(x). Unlike ReLU, it doesn't hard-kill negative values — small negative inputs produce small negative outputs, which keeps gradients flowing in deep networks.
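For reference, SiLU in plain Python:

```python
import math

def silu(x: float) -> float:
    """SiLU / Swish activation: x * sigmoid(x)."""
    return x * (1.0 / (1.0 + math.exp(-x)))

# Unlike ReLU, a small negative input gives a small negative output
print(silu(-1.0))  # ≈ -0.2689
print(silu(0.0))   # 0.0
```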
Dark2: CSP Block
A 3×3 convolution with stride 2 halves the spatial dimensions and doubles the channels. Then the CSP (Cross-Stage Partial) structure splits the feature map into two halves along the channel dimension: one half passes through a series of residual blocks, the other skips ahead. The two halves are concatenated back together.
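A shape-only sketch of the CSP idea. The real block adds 1×1 transition convolutions before and after the split; `residual_fn` here is a stand-in for the stack of residual blocks:

```python
import numpy as np

def csp_block(x: np.ndarray, residual_fn) -> np.ndarray:
    """Cross-Stage Partial: process half the channels, let the other half skip."""
    c = x.shape[-1] // 2
    part1, part2 = x[..., :c], x[..., c:]   # split along the channel dimension
    part1 = residual_fn(part1)              # heavy path: residual blocks
    return np.concatenate([part1, part2], axis=-1)  # merge the halves back

x = np.random.rand(160, 160, 128)
y = csp_block(x, residual_fn=lambda t: t + 0.1 * t)  # toy residual transform
print(y.shape)  # (160, 160, 128): channels unchanged, half untouched
```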
Dark3: CSP Block — Output C3
Same structure as Dark2 — downsample by 2×, then CSP residual processing. The output at this stage, called C3, captures fine-grained features: edges, textures, small parts. This is the first feature map sent to the neck, and it's responsible for detecting small objects (stride 8 — each cell covers an 8×8 pixel region).
Dark4: CSP Block — Output C4
Another downsample + CSP stage. C4 captures mid-level features — object parts, shapes, and spatial relationships. Each grid cell now "sees" a 16×16 pixel region. This feature map handles medium-sized objects.
Dark5: CSP Block + SPP — Output C5
The deepest stage. After the CSP block, a Spatial Pyramid Pooling (SPP) module is applied. SPP runs max pooling at three different kernel sizes (5×5, 9×9, 13×13) in parallel and concatenates the results. This lets the network aggregate context from different spatial extents without resizing the feature map.
C5 captures high-level semantic features — the model "understands" what objects are, not just their textures. Each cell covers a 32×32 pixel region. This handles large objects.
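The SPP channel math can be verified with a naive numpy sketch. Edge padding and a small channel count are chosen only to keep the demo simple and fast; the real module uses strided max-pool layers:

```python
import numpy as np

def maxpool_same(x: np.ndarray, k: int) -> np.ndarray:
    """Stride-1 max pooling with 'same' output size (naive loop version)."""
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)), mode="edge")
    h, w, _ = x.shape
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[i, j] = xp[i:i + k, j:j + k].max(axis=(0, 1))
    return out

def spp(x: np.ndarray) -> np.ndarray:
    """Concatenate the input with max pools at 5, 9, 13 — channels grow 4x."""
    return np.concatenate([x] + [maxpool_same(x, k) for k in (5, 9, 13)], axis=-1)

x = np.random.rand(20, 20, 8)  # small channel count keeps the demo quick
y = spp(x)
print(y.shape)  # (20, 20, 32)
```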
Neck: PAFPN (Path Aggregation Feature Pyramid Network)
The backbone gave us three feature maps at different scales (C3, C4, C5). But there's a problem: C3 has great spatial detail but weak semantics, while C5 has strong semantics but poor localization. The PAFPN neck fixes this by fusing information in both directions.
Top-Down Path (FPN): C5 → C4 → C3
Step 1: Take C5 (20×20×1024), reduce channels with a 1×1 conv to 20×20×512, then upsample (nearest-neighbor interpolation) to 40×40×512. Concatenate with C4 and pass through a CSP block to produce P4 (40×40×256).
Step 2: Upsample P4 to 80×80×256, concatenate with C3, and pass through another CSP block to produce P3 (80×80×256).
Bottom-Up Path (PAN): P3 → P4 → P5
Step 3: Take P3, downsample with a stride-2 convolution to 40×40×256, concatenate with P4 (from the top-down path), and refine through a CSP block. This produces the final 40×40 feature map.
Step 4: Downsample again to 20×20×256, concatenate with the reduced C5 from Step 1 (the 20×20×512 lateral map, which plays the role of P5), and refine to produce the final 20×20 feature map.
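The whole PAFPN shape flow can be traced with stand-in operations — channel slicing plays the role of the 1×1 and CSP convolutions here, so only the shapes are meaningful, not the values:

```python
import numpy as np

def upsample2x(x):    # nearest-neighbor upsampling, as in the YOLOX neck
    return x.repeat(2, axis=0).repeat(2, axis=1)

def downsample2x(x):  # stand-in for a stride-2 conv: keep every other pixel
    return x[::2, ::2]

c3 = np.zeros((80, 80, 256))
c4 = np.zeros((40, 40, 512))
c5 = np.zeros((20, 20, 1024))

# Top-down (FPN): lateral channel reduction + upsample + concat
lat5 = c5[..., :512]                                             # "1x1 conv"
p4 = np.concatenate([upsample2x(lat5), c4], axis=-1)[..., :256]  # "CSP" -> 256 ch
p3 = np.concatenate([upsample2x(p4), c3], axis=-1)[..., :256]
# Bottom-up (PAN): downsample + concat
n4 = np.concatenate([downsample2x(p3), p4], axis=-1)[..., :256]
n5 = np.concatenate([downsample2x(n4), lat5], axis=-1)[..., :256]
print(p3.shape, n4.shape, n5.shape)  # (80,80,256) (40,40,256) (20,20,256)
```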
After the PAFPN, we have three fused feature maps, all with 256 channels:
| Feature Map | Shape | Stride | Best For |
|---|---|---|---|
| P3 | 80 × 80 × 256 | 8 | Small objects (pedestrians far away) |
| P4 | 40 × 40 × 256 | 16 | Medium objects (cars, people nearby) |
| P5 | 20 × 20 × 256 | 32 | Large objects (trucks, buildings) |
Head: Decoupled Detection Head
This is YOLOX's signature innovation. Previous YOLOs used a single "coupled" head — one set of convolutions predicted both the class and the bounding box. YOLOX found that these two tasks conflict: classification wants features that are invariant to position, while regression wants features that are highly position-sensitive. Decoupling them improves both.
1×1 Conv: Channel Reduction
Each scale's feature map first passes through a 1×1 convolution. This acts as a per-pixel channel mixer — it recombines the 256 channels into a new 256-dimensional representation optimized for the prediction tasks. This is where the "shared stem" ends and the two branches diverge.
Classification Branch
Two 3×3 convolutions (each followed by BatchNorm + SiLU) process the features specifically for classification. The final layer outputs a score for each of the 80 COCO classes at every spatial position. The output shape for the 80×80 scale would be 80×80×80 (one score per class per grid cell).
What it learns: "Is this a person? A car? A dog?" — position-invariant pattern matching.
Regression Branch + IoU
A separate pair of 3×3 convolutions specializes in predicting where objects are. It outputs:
- 4 regression values per cell: (x_center, y_center, width, height) — the bounding box
- 1 objectness score (IoU branch): "Is there actually an object here, or is this background?"
The IoU (Intersection over Union) branch is attached to regression rather than classification because it's fundamentally about spatial accuracy — how well does the predicted box overlap with the true box?
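Since a 1×1 convolution is just a per-pixel matrix multiply over channels, the decoupled head's output shapes can be sketched directly. Weights here are random, and the real branches use two 3×3 convs with BN + SiLU before the final 1×1 — this shows only the shape bookkeeping:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x: np.ndarray, out_ch: int) -> np.ndarray:
    """A 1x1 convolution is a per-pixel matrix multiply over channels."""
    w = rng.standard_normal((x.shape[-1], out_ch)) * 0.01
    return x @ w

feat = rng.standard_normal((80, 80, 256))  # P3 feature map from the neck
stem = conv1x1(feat, 256)                  # shared stem, then branches diverge
cls_out = conv1x1(stem, 80)                # classification branch: 80 classes
reg_out = conv1x1(stem, 4)                 # regression branch: x, y, w, h
obj_out = conv1x1(stem, 1)                 # objectness / IoU branch
print(cls_out.shape, reg_out.shape, obj_out.shape)
```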
Combining Predictions Across Scales
The three scales produce predictions at every grid cell:
- 80 × 80 = 6,400 predictions (stride 8, small objects)
- 40 × 40 = 1,600 predictions (stride 16, medium objects)
- 20 × 20 = 400 predictions (stride 32, large objects)
Total: 8,400 candidate detections. Each one has 80 class scores + 4 box coordinates + 1 objectness score = 85 values. Non-Maximum Suppression (NMS) then filters these down to the final clean set of detections — typically a few dozen per image.
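Greedy NMS is simple enough to sketch in full. Boxes here are in (x1, y1, x2, y2) corner format, and the 0.45 IoU threshold is a common default rather than a fixed YOLOX constant:

```python
import numpy as np

def iou(box: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """IoU between one (x1, y1, x2, y2) box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.45):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        order = order[1:][iou(boxes[i], boxes[order[1:]]) < iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] — box 1 overlaps box 0 and is suppressed
```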
5 — Key Innovations
Anchor-Free Detection
Older YOLO models placed multiple "anchor boxes" (predefined bounding box templates of different aspect ratios) at each grid cell and predicted adjustments to these templates. This required careful tuning of anchor sizes per dataset. YOLOX simply predicts the box directly — each grid cell outputs one (x, y, w, h) prediction with no templates. Fewer hyperparameters, simpler code, faster inference.
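The anchor-free decoding — offsets relative to the grid cell, exponentiated width/height, both scaled by the stride — can be sketched following the decoding used in the official repo:

```python
import numpy as np

def decode(preds: np.ndarray, stride: int) -> np.ndarray:
    """Decode raw head outputs (H, W, 4) into absolute boxes, anchor-free.
    x/y are offsets from each grid cell; w/h are exponentiated, all scaled
    by the stride."""
    h, w = preds.shape[:2]
    gy, gx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    cx = (preds[..., 0] + gx) * stride
    cy = (preds[..., 1] + gy) * stride
    bw = np.exp(preds[..., 2]) * stride
    bh = np.exp(preds[..., 3]) * stride
    return np.stack([cx, cy, bw, bh], axis=-1)

preds = np.zeros((80, 80, 4))        # all-zero raw outputs for illustration
decoded = decode(preds, stride=8)
print(decoded[5, 3])  # cell (row 5, col 3) -> center (24, 40), size (8, 8)
```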
SimOTA: Dynamic Label Assignment
During training, the model needs to decide which grid cells are "responsible" for each ground-truth object. Older methods used fixed rules (e.g., the center cell). SimOTA dynamically assigns labels by treating it as a simplified optimal transport problem — it finds a globally good matching between predictions and ground-truth boxes based on both classification and localization quality. Each object's number of positive assignments (its dynamic k) is derived from how many high-quality candidate predictions it has, so well-covered objects receive more positives and poorly-covered ones fewer.
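A heavily simplified sketch of the dynamic-k idea. The real SimOTA also restricts candidates to a center prior and resolves predictions claimed by multiple objects; the top-10 candidate count follows the paper's simplification:

```python
import numpy as np

def dynamic_k_assign(cost: np.ndarray, ious: np.ndarray, top_candidates: int = 10):
    """Per ground truth: k = sum of its top candidate IoUs (at least 1),
    then assign the k lowest-cost predictions to it.
    cost, ious: (num_gt, num_preds) matrices."""
    num_gt, num_preds = cost.shape
    matching = np.zeros((num_gt, num_preds), dtype=bool)
    for g in range(num_gt):
        topk_ious = np.sort(ious[g])[::-1][:top_candidates]
        k = max(1, int(topk_ious.sum()))      # dynamic k per object
        idx = np.argsort(cost[g])[:k]         # cheapest predictions win
        matching[g, idx] = True
    return matching

rng = np.random.default_rng(0)
cost = rng.random((2, 8))   # toy combined cls + loc cost
ious = rng.random((2, 8))   # toy prediction/ground-truth IoUs
matching = dynamic_k_assign(cost, ious)
print(matching.sum(axis=1))  # number of positive assignments per object
```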
Strong Data Augmentation (No Pretraining)
YOLOX uses Mosaic (4 images stitched together) and MixUp (two images blended) augmentation so aggressively that the model can train from scratch on COCO — no ImageNet pretraining needed. This simplifies the training pipeline and avoids domain mismatch between pretraining and fine-tuning datasets.
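Both augmentations are conceptually simple; here is a toy numpy version of each. Real Mosaic randomly scales, crops, and jitters the four images around a random center point, and both augmentations also merge the box labels — this shows only the tiling and blending ideas:

```python
import numpy as np

def mixup(img_a: np.ndarray, img_b: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """MixUp: blend two images; labels of both images are kept."""
    return (lam * img_a + (1 - lam) * img_b).astype(img_a.dtype)

def mosaic(imgs, size: int = 640) -> np.ndarray:
    """Mosaic (simplified): tile four images into the quadrants of one canvas."""
    h = size // 2
    canvas = np.zeros((size, size, 3), dtype=imgs[0].dtype)
    canvas[:h, :h] = imgs[0][:h, :h]   # top-left
    canvas[:h, h:] = imgs[1][:h, :h]   # top-right
    canvas[h:, :h] = imgs[2][:h, :h]   # bottom-left
    canvas[h:, h:] = imgs[3][:h, :h]   # bottom-right
    return canvas

imgs = [np.full((320, 320, 3), i * 60, dtype=np.uint8) for i in range(4)]
m = mosaic(imgs)
print(m.shape)  # (640, 640, 3)
```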
6 — Complete Tensor Shape Summary
For YOLOX-L with a 640×640 input:
| Stage | Layer | Output Shape | Notes |
|---|---|---|---|
| Input | Image | 640×640×3 | RGB, normalized |
| Backbone | Focus | 320×320×64 | Slice + 3×3 conv |
| Backbone | Dark2 | 160×160×128 | CSP block |
| Backbone | Dark3 (C3) | 80×80×256 | → sent to Neck |
| Backbone | Dark4 (C4) | 40×40×512 | → sent to Neck |
| Backbone | Dark5 + SPP (C5) | 20×20×1024 | → sent to Neck |
| Neck (FPN ↓) | P4 (top-down) | 40×40×256 | Upsample C5 + C4 |
| Neck (FPN ↓) | P3 (top-down) | 80×80×256 | Upsample P4 + C3 |
| Neck (PAN ↑) | P4 (bottom-up) | 40×40×256 | Downsample P3 + P4 |
| Neck (PAN ↑) | P5 (bottom-up) | 20×20×256 | Downsample P4 + P5 |
| Head | Cls branch (×3) | H×W×80 | 80 COCO classes |
| Head | Reg branch (×3) | H×W×4 | x, y, w, h |
| Head | Obj branch (×3) | H×W×1 | Objectness / IoU |
7 — Results
On the COCO val2017 benchmark:
| Model | AP (%) | Params | FPS (V100) |
|---|---|---|---|
| YOLOX-Nano | 25.3 | 0.91M | — |
| YOLOX-S | 40.5 | 9.0M | 102 |
| YOLOX-M | 46.9 | 25.3M | 81 |
| YOLOX-L | 49.7 | 54.2M | 69 |
| YOLOX-X | 51.1 | 99.1M | 58 |
YOLOX-L exceeded YOLOv5-L by 1.8% AP while running at comparable speed. The team also won 1st place in the Streaming Perception Challenge at CVPR 2021's Workshop on Autonomous Driving.
8 — References & Further Reading
- YOLOX: Exceeding YOLO Series in 2021 — Ge et al., 2021 (original paper)
- Official GitHub Repository — Megvii-BaseDetection
- YOLOX: Boosting Object Detection Performance — viso.ai overview
- Papers With Code: YOLOX