The DETR Family: From Transformers to Real-Time Detection
1 — The Problem & Why DETR Matters
Object detection traditionally requires a messy pipeline: a backbone extracts features, a neck fuses scales, region proposals or anchor boxes generate candidates, and Non-Maximum Suppression (NMS) removes duplicates. Every stage has hand-designed hyperparameters.
DETR's radical idea: treat object detection as a direct set prediction problem. Use a Transformer to predict a fixed set of detections in parallel — no anchors, no NMS, no hand-designed post-processing. The model learns to output exactly one prediction per object through a bipartite matching loss.
2 — DETR (Facebook AI, 2020)
Carion et al. — arXiv:2005.12872
1 Backbone: CNN Feature Extraction
ResNet-50 Backbone
A standard ResNet-50 (see our ResNet walkthrough) extracts features. The output from the last stage is a 25×25×2048 feature map (for an 800×800 input). A 1×1 convolution reduces channels to 256, giving 25×25×256.
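The shapes above follow directly from ResNet-50's total stride of 32. A framework-free arithmetic sketch (800×800 input assumed, as in the text):

```python
# Shape arithmetic for DETR's backbone output (sketch; 800x800 input assumed).
H = W = 800
stride = 32                      # total downsampling at ResNet-50's last stage
h, w = H // stride, W // stride  # spatial size of the feature map
c_backbone, d_model = 2048, 256  # channels before / after the 1x1 conv
n_tokens = h * w                 # tokens once the map is flattened
print(h, w, n_tokens)            # 25 25 625
```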
2 Positional Encoding + Flatten
Spatial Positional Encoding
The 2D feature map is flattened to a sequence of 625 tokens (25×25). Fixed sinusoidal 2D positional encodings are added so the Transformer knows where each token sits spatially. This is analogous to positional embeddings in ViT, but using the original Transformer's sine/cosine formulation extended to 2D.
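One common way to build such an encoding is to devote half the channels to the y coordinate and half to the x coordinate, each filled with sine/cosine pairs at geometrically spaced frequencies. A minimal NumPy sketch (details like the temperature and normalization vary between implementations; `sine_pos_2d` is an illustrative name, not DETR's actual function):

```python
import numpy as np

def sine_pos_2d(h, w, d_model=256):
    """2D sinusoidal encoding: half the channels encode y, half encode x."""
    d = d_model // 2                                   # channels per axis
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))  # (d/2,) frequencies
    ys = np.arange(h)[:, None] * freqs                 # (h, d/2)
    xs = np.arange(w)[:, None] * freqs                 # (w, d/2)
    emb_y = np.concatenate([np.sin(ys), np.cos(ys)], axis=-1)  # (h, d)
    emb_x = np.concatenate([np.sin(xs), np.cos(xs)], axis=-1)  # (w, d)
    pos = np.concatenate([
        np.broadcast_to(emb_y[:, None, :], (h, w, d)), # y part, tiled over x
        np.broadcast_to(emb_x[None, :, :], (h, w, d)), # x part, tiled over y
    ], axis=-1)                                        # (h, w, d_model)
    return pos.reshape(h * w, d_model)                 # flatten to tokens

pos = sine_pos_2d(25, 25)
print(pos.shape)   # (625, 256)
```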
3 Transformer Encoder
6 Encoder Layers
Standard Transformer encoder with multi-head self-attention and feed-forward networks. Each image feature token attends to all others, building global context. After 6 layers, every spatial position has information from the entire image — crucial for reasoning about object relationships and removing duplicate detections.
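The global mixing can be sketched as one softmax self-attention step in NumPy (single head, random weights, no FFN or residuals; real DETR stacks six multi-head layers):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens = 256, 625
tokens = rng.standard_normal((n_tokens, d))  # flattened features + pos encoding

# One self-attention step: every token attends to all 625 others.
Wq, Wk, Wv = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
scores = Q @ K.T / np.sqrt(d)                # (625, 625): all-to-all
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)     # rows are softmax distributions
out = attn @ V                               # each row now mixes global context
print(out.shape)   # (625, 256)
```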
4 Transformer Decoder + Object Queries
6 Decoder Layers with 100 Object Queries
This is DETR's most novel component. 100 learned object queries, a fixed-size set of 256-dimensional vectors, attend to the encoder output via cross-attention. Each query "specializes" in detecting one object. Through 6 decoder layers, queries compete for objects and learn to avoid predicting the same one.
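The shapes make the decoder's asymmetry clear: 100 queries read from 625 image tokens. A single-head cross-attention sketch (random stand-ins, no learned projections; purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_queries, n_tokens = 256, 100, 625
queries = rng.standard_normal((n_queries, d))  # stand-in for learned queries
memory = rng.standard_normal((n_tokens, d))    # stand-in for encoder output

# Cross-attention: each query reads from every image token.
scores = queries @ memory.T / np.sqrt(d)       # (100, 625): query x image token
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
decoded = attn @ memory                        # (100, 256): one slot per query
print(decoded.shape)   # (100, 256)
```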
5 Prediction Heads (FFN)
Class + Box FFN per Query
Each decoder output token independently passes through two feed-forward networks: one predicts the class label (91 COCO categories + "no object") and one predicts normalized box coordinates (center_x, center_y, width, height). The "no object" class (∅) is predicted for most queries — it means that query slot found nothing.
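A shape-level sketch of the two heads (random weights; DETR's real box head is a 3-layer MLP, shortened to two layers here):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_queries, n_classes = 256, 100, 91
decoded = rng.standard_normal((n_queries, d))   # decoder output tokens

# Class head: linear map to 91 COCO class slots + 1 "no object" slot.
W_cls = rng.standard_normal((d, n_classes + 1)) * 0.01
class_logits = decoded @ W_cls                  # (100, 92)

# Box head: small MLP to (cx, cy, w, h), squashed into [0, 1] by a sigmoid.
W1 = rng.standard_normal((d, d)) * 0.01
W2 = rng.standard_normal((d, 4)) * 0.01
boxes = 1.0 / (1.0 + np.exp(-(np.maximum(decoded @ W1, 0) @ W2)))
print(class_logits.shape, boxes.shape)   # (100, 92) (100, 4)
```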
Hungarian Matching: How DETR Trains Without NMS
Bipartite Matching Loss
During training, DETR uses the Hungarian algorithm to find the optimal one-to-one matching between its 100 predictions and the ground-truth objects. The matching cost combines classification probability, L1 box distance, and GIoU. Each ground truth is assigned exactly one prediction; unmatched predictions are trained to predict ∅.
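SciPy ships the same optimal-assignment algorithm as `linear_sum_assignment`. A toy sketch with a random cost matrix (in real DETR the costs combine class probability, L1 box distance, and GIoU):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: 5 predictions x 2 ground-truth objects.
rng = np.random.default_rng(0)
cost = rng.random((5, 2))

# Optimal one-to-one matching: each ground truth gets exactly one prediction.
pred_idx, gt_idx = linear_sum_assignment(cost)
unmatched = [i for i in range(5) if i not in set(pred_idx)]
print(pred_idx, gt_idx, unmatched)  # the 3 unmatched slots train toward "no object"
```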
DETR Limitations
For all its elegance, DETR converges very slowly (roughly 500 training epochs), runs well below real time, and its single-scale features make small objects hard to detect. These are the pain points the later models in this family address.
Decoder Cross-Attention: Where Queries Look
Attention Visualization Across Decoder Layers
A fascinating property of DETR's decoder: in early layers, queries attend broadly across the image (exploring). In later layers, attention sharpens to focus precisely on the object boundaries. This progressive refinement is why the decoder needs multiple layers.
3 — RT-DETR (Baidu, 2023)
Zhao et al. — arXiv:2304.08069 — First real-time DETR
RT-DETR asked: can we make DETR real-time? The answer required three changes: a better backbone, an efficient hybrid encoder, and IoU-aware query selection.
HGNetv2 Backbone (or ResNet)
RT-DETR uses a fast CNN backbone (HGNetv2) that outputs multi-scale feature maps — unlike DETR's single-scale approach. This gives access to fine-grained features for small objects (S3 at stride 8) alongside high-level features for large objects (S5 at stride 32).
Hybrid Encoder: Intra-Scale + Cross-Scale
Instead of DETR's expensive all-to-all attention, RT-DETR uses a two-stage encoder: the Attention-based Intra-scale Feature Interaction (AIFI) module applies self-attention only on the coarsest scale S5, where tokens are fewest and most semantic (much cheaper), then the CNN-based Cross-scale Feature-fusion Module (CCFM) merges information across scales using efficient convolutions rather than attention.
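Self-attention cost grows quadratically with token count, so where the attention runs matters enormously. A back-of-envelope comparison (640×640 input assumed for illustration):

```python
# Token counts per scale for a 640x640 input (sketch; strides from the text).
scales = {"S3": 8, "S4": 16, "S5": 32}
tokens = {k: (640 // s) ** 2 for k, s in scales.items()}
print(tokens)   # {'S3': 6400, 'S4': 1600, 'S5': 400}

# Self-attention is O(n^2) in sequence length, so attending over the
# concatenated scales is far costlier than attending over S5 alone.
all_scales = sum(tokens.values()) ** 2
s5_only = tokens["S5"] ** 2
print(all_scales // s5_only)   # 441
```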
| Model | Backbone | COCO AP | FPS (T4) |
|---|---|---|---|
| RT-DETR-R50 | ResNet-50 | 53.1 | 108 |
| RT-DETR-L | HGNetv2 | 53.0 | 114 |
| RT-DETR-X | HGNetv2 | 54.8 | 74 |
4 — RF-DETR (Roboflow, 2025)
Robinson et al. — arXiv:2511.09554 — ICLR 2026 — First real-time detector to exceed 60 AP on COCO
RF-DETR's key insight: CNN backbones don't benefit from large-scale pre-training the way Transformers do. By swapping in a pre-trained DINOv2 ViT backbone (which learned rich visual representations from massive unsupervised data), and using Neural Architecture Search to find optimal sub-networks, RF-DETR pushes accuracy far beyond CNN-based detectors.
DINOv2 ViT Backbone (Pre-trained)
Unlike RT-DETR's CNN backbone that naturally produces multi-scale features, RF-DETR uses a single-scale DINOv2 Vision Transformer (see our ViT walkthrough). A projector module bilinearly interpolates the ViT output to create multi-scale feature maps. The backbone interleaves windowed attention (efficient, local) with global attention (expensive, full-image) blocks to balance accuracy and speed.
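A sketch of the projector idea with made-up sizes (a real projector uses learned convolutions as well; here, just bilinear resampling of one hypothetical 40×40 ViT feature map into a small pyramid):

```python
import numpy as np
from scipy.ndimage import zoom

# Hypothetical single-scale ViT output: 40x40 spatial grid, 64 channels.
feat = np.random.default_rng(0).standard_normal((40, 40, 64))

# Resample into a pyramid of finer/coarser scales (order=1 = bilinear).
pyramid = {name: zoom(feat, (f, f, 1), order=1)
           for name, f in {"fine": 2.0, "same": 1.0, "coarse": 0.5}.items()}
for name, p in pyramid.items():
    print(name, p.shape)   # fine (80, 80, 64) / same (40, 40, 64) / coarse (20, 20, 64)
```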
LW-DETR Decoder with Deformable Cross-Attention
The decoder follows the LW-DETR framework, using deformable cross-attention (from Deformable DETR) instead of standard cross-attention. Rather than attending to all spatial positions, each query only attends to a small set of learned sampling points — dramatically reducing computation while maintaining accuracy.
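The sampling idea can be sketched for a single query in NumPy. This is heavily simplified: nearest-neighbour lookup replaces the real bilinear sampling, and the offsets and weights are fixed rather than predicted from the query:

```python
import numpy as np

rng = np.random.default_rng(0)
h = w = 25
feat = rng.standard_normal((h, w, 64))        # one feature scale (toy channels)

# Deformable sampling for one query: instead of attending to all 625
# positions, read K=4 offset points around a reference point.
ref = np.array([12.0, 12.0])                  # (y, x) reference point
offsets = rng.normal(scale=2.0, size=(4, 2))  # predicted per-query in the real model
weights = np.full(4, 0.25)                    # softmax-normalized in the real model

pts = np.clip(np.rint(ref + offsets).astype(int), 0, h - 1)   # (4, 2) indices
out = (weights[:, None] * feat[pts[:, 0], pts[:, 1]]).sum(axis=0)
print(out.shape)   # (64,)  — 4 lookups instead of a 625-wide attention row
```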
Neural Architecture Search (NAS)
RF-DETR's most distinctive contribution: weight-sharing NAS. After training a single large "supernet," thousands of sub-network configurations are evaluated without retraining. This discovers accuracy-latency Pareto curves — the nano, small, and medium variants are NAS-discovered, while base and large are hand-designed.
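The search loop itself is conceptually simple once weights are shared. A toy sketch: optional blocks with made-up latency/accuracy numbers, every subset enumerated without retraining, then filtered down to the accuracy-latency Pareto front (all block names and figures here are invented for illustration):

```python
import itertools

# Toy supernet: optional blocks, each with (latency ms, AP gain) — made up.
blocks = {"enc4": (2.1, 0.4), "enc5": (2.3, 0.3),
          "dec5": (1.8, 0.5), "dec6": (1.9, 0.4)}
base_latency, base_ap = 5.0, 50.0

# Enumerate every sub-network; shared weights mean no retraining per config.
candidates = []
for r in range(len(blocks) + 1):
    for subset in itertools.combinations(blocks, r):
        lat = base_latency + sum(blocks[b][0] for b in subset)
        ap = base_ap + sum(blocks[b][1] for b in subset)
        candidates.append((lat, ap, subset))

# Keep the Pareto front: drop any config that another beats on BOTH axes.
pareto = [c for c in candidates
          if not any(o[0] < c[0] and o[1] > c[1] for o in candidates)]
print(sorted(pareto))
```

Variants along this front are what become the deployable model sizes.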
| Model | Params | COCO AP | Design |
|---|---|---|---|
| RF-DETR Nano | 30.5M | 48.0 | NAS-discovered |
| RF-DETR Base | 29.0M | 53.3 | Hand-designed |
| RF-DETR Large | 129.0M | ~56 | Hand-designed |
| RF-DETR 2XL | — | 60+ | NAS + hand |
5 — Architecture Comparison
| Component | DETR (2020) | RT-DETR (2023) | RF-DETR (2025) |
|---|---|---|---|
| Backbone | ResNet-50 (single scale) | HGNetv2 (multi-scale S3/S4/S5) | DINOv2 ViT (pre-trained, multi-scale via projector) |
| Encoder | 6× standard self-attention | AIFI (intra-scale, S5 only) + CCFM (cross-scale CNN) | ViT backbone + windowed/global attention mix |
| Decoder | 6× cross-attention + self-attention | 6× standard decoder (flexible: 1-6 at inference) | LW-DETR decoder w/ deformable cross-attention |
| Queries | 100 fixed learned embeddings | IoU-aware top-K from encoder features | Content-based from encoder features |
| NMS | Not needed | Not needed | Not needed |
| Training | 500 epochs | 72 epochs | Fast convergence via pre-trained backbone |
| Best COCO AP | 43.3 | 54.8 | 60+ |
| Real-time? | No (~12 FPS) | Yes (~114 FPS) | Yes (NAS-optimized) |
6 — Evolution Summary
7 — References & Further Reading
- DETR: End-to-End Object Detection with Transformers — Carion et al., Facebook AI, 2020
- Deformable DETR — Zhu et al., 2020 (deformable attention)
- RT-DETR: DETRs Beat YOLOs on Real-time Object Detection — Zhao et al., Baidu, 2023
- RF-DETR: Neural Architecture Search for Real-Time Detection Transformers — Robinson et al., Roboflow, 2025
- RF-DETR GitHub Repository