The DETR Family: From Transformers to Real-Time Detection

DETR → RT-DETR → RF-DETR
Tags: Object Detection · Transformer · Set Prediction · No NMS · No Anchors

1 — The Problem & Why DETR Matters

Object detection traditionally requires a messy pipeline: a backbone extracts features, a neck fuses scales, region proposals or anchor boxes generate candidates, and Non-Maximum Suppression (NMS) removes duplicates. Every stage has hand-designed hyperparameters.

DETR's radical idea: treat object detection as a direct set prediction problem. Use a Transformer to predict a fixed set of detections in parallel — no anchors, no NMS, no hand-designed post-processing. The model learns to output exactly one prediction per object through a bipartite matching loss.

[Diagram: Traditional pipeline vs. DETR. Traditional (Faster R-CNN, YOLO): Backbone → FPN → Anchors → RPN → RoI Pool → Head → NMS (7+ stages). DETR (end-to-end): Backbone → Transformer → Set Prediction (3 stages, no NMS).]

2 — DETR (Facebook AI, 2020)

Carion et al. — arXiv:2005.12872

1 Backbone: CNN Feature Extraction

ResNet-50 Backbone

800×800×3 → 25×25×2048

A standard ResNet-50 (see our ResNet walkthrough) extracts features. The output from the last stage is a 25×25×2048 feature map (for an 800×800 input). A 1×1 convolution reduces channels to 256, giving 25×25×256.

2 Positional Encoding + Flatten

Spatial Positional Encoding

25×25×256 → 625×256

The 2D feature map is flattened to a sequence of 625 tokens (25×25). Fixed sinusoidal 2D positional encodings are added so the Transformer knows where each token sits spatially. This is analogous to positional embeddings in ViT, but using the original Transformer's sine/cosine formulation extended to 2D.
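A minimal sketch of the flatten-plus-encoding step; the `sine_pos_2d` helper is a simplified stand-in for DETR's exact formulation (half the channels encode the row, half the column):

```python
import torch

def sine_pos_2d(h, w, dim=256, temp=10000.0):
    """Simplified fixed 2D sine/cosine positional encoding, (h*w, dim)."""
    half = dim // 2
    freqs = temp ** (torch.arange(0, half, 2) / half)   # (half/2,) frequencies
    ys = torch.arange(h).float()[:, None] / freqs       # (h, half/2)
    xs = torch.arange(w).float()[:, None] / freqs       # (w, half/2)
    pos_y = torch.cat([ys.sin(), ys.cos()], dim=1)      # (h, half)
    pos_x = torch.cat([xs.sin(), xs.cos()], dim=1)      # (w, half)
    pos = torch.cat([
        pos_y[:, None, :].expand(h, w, half),           # row encoding
        pos_x[None, :, :].expand(h, w, half),           # column encoding
    ], dim=-1)                                          # (h, w, dim)
    return pos.flatten(0, 1)                            # (h*w, dim)

feat = torch.randn(25 * 25, 256)       # flattened 25x25 feature map
tokens = feat + sine_pos_2d(25, 25)    # position-aware encoder input
print(tokens.shape)                    # torch.Size([625, 256])
```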

3 Transformer Encoder

6 Encoder Layers

625×256 → 625×256

Standard Transformer encoder with multi-head self-attention and feed-forward networks. Each image feature token attends to all others, building global context. After 6 layers, every spatial position has information from the entire image — crucial for reasoning about object relationships and removing duplicate detections.
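Using PyTorch's stock layers (sizes from the paper: 6 layers, 8 heads, width 256, FFN width 2048), the encoder stage looks roughly like:

```python
import torch
import torch.nn as nn

# Stock Transformer encoder with the paper's sizes.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8,
                                   dim_feedforward=2048, dropout=0.1)
encoder = nn.TransformerEncoder(layer, num_layers=6)

tokens = torch.randn(625, 1, 256)   # (sequence, batch, dim): 625 image tokens
memory = encoder(tokens)            # same shape; every token saw the whole image
```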

4 Transformer Decoder + Object Queries

6 Decoder Layers with 100 Object Queries

100×256 (queries) + 625×256 (memory)

This is DETR's most novel component. 100 learned object queries — fixed-size set of 256-dimensional vectors — attend to the encoder output via cross-attention. Each query "specializes" in detecting one object. Through 6 decoder layers, queries compete for objects and learn to avoid predicting the same one.

[Diagram: 100 learned object queries ("find object #1", "find object #2", ..., 100 total) cross-attend to the 625-token encoder memory and emit parallel predictions, e.g. Q1: person (0.95) [x,y,w,h]; Q2: car (0.89) [x,y,w,h]; Q3: ∅ (no object). Most queries predict ∅; one query = one object, so no NMS is needed.]
Hungarian matching loss: During training, DETR uses the Hungarian algorithm to find the optimal one-to-one assignment between the 100 predictions and ground-truth objects. This bipartite matching ensures each ground-truth gets exactly one prediction — eliminating the need for NMS.
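A simplified sketch of the query-decoder interaction (real DETR feeds zeros as the target sequence and re-adds the query embeddings at every layer; here the embeddings are passed directly for brevity):

```python
import torch
import torch.nn as nn

# 100 learned query embeddings cross-attend to the encoder memory
# through 6 stock decoder layers.
num_queries, d = 100, 256
queries = nn.Embedding(num_queries, d)   # learned "object slots"
layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, dim_feedforward=2048)
decoder = nn.TransformerDecoder(layer, num_layers=6)

memory = torch.randn(625, 1, d)          # encoder output (seq, batch, dim)
tgt = queries.weight.unsqueeze(1)        # (100, 1, 256)
out = decoder(tgt, memory)               # (100, 1, 256): one token per query
```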

5 Prediction Heads (FFN)

Class + Box FFN per Query

100 × 256 → 100 × (class + 4 coords)

Each decoder output token independently passes through two feed-forward networks: one predicts the class label (91 COCO categories + "no object") and one predicts normalized box coordinates (center_x, center_y, width, height). The "no object" class (∅) is predicted for most queries — it means that query slot found nothing.

[Diagram: Per-query prediction heads. Each 256-dim query token feeds a class FFN (256 → 92, e.g. "person" at 0.95) and a box FFN (256 → 4, e.g. [0.4, 0.3, 0.5, 0.7]). Most queries predict ∅ (no object); typically only 5-20 of the 100 queries find objects.]
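The two heads can be sketched as follows (the 3-layer box MLP with a sigmoid output matches the paper's description; exact layer names are illustrative):

```python
import torch
import torch.nn as nn

# Class head: 92 logits (91 COCO categories + "no object").
class_head = nn.Linear(256, 92)
# Box head: 3-layer MLP, sigmoid so (cx, cy, w, h) land in [0, 1].
box_head = nn.Sequential(
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 4), nn.Sigmoid(),
)

decoder_out = torch.randn(100, 256)   # one token per query
logits = class_head(decoder_out)      # (100, 92)
boxes = box_head(decoder_out)         # (100, 4), normalized coordinates
```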

Hungarian Matching: How DETR Trains Without NMS

Bipartite Matching Loss

N predictions × M ground truths → 1-to-1 assignment

During training, DETR uses the Hungarian algorithm to find the optimal one-to-one matching between its 100 predictions and the ground-truth objects. The matching cost combines classification probability, L1 box distance, and GIoU. Each ground truth is assigned exactly one prediction; unmatched predictions are trained to predict ∅.

[Diagram: Hungarian matching, one prediction per object. The 100 predictions (P1: dog [0.4,0.3,0.5,0.6], P2: cat [0.2,0.5,0.3,0.4], P3-P100: ∅) and the ground truths (GT1: dog, GT2: cat) enter the Hungarian algorithm, which finds the min-cost 1-to-1 matching with L_match = class + L1 + GIoU. After matching, P1 ↔ GT1 and P2 ↔ GT2 are trained on their boxes, while P3-P100 are trained to predict ∅: no duplicates possible.]
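The matching step can be sketched with SciPy's Hungarian solver (`linear_sum_assignment`); for brevity this toy cost uses only the class probability and L1 box distance, omitting the GIoU term:

```python
import torch
from scipy.optimize import linear_sum_assignment

def match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """Min-cost 1-to-1 assignment of predictions to ground truths."""
    prob = pred_logits.softmax(-1)                       # (N, num_classes)
    cost_class = -prob[:, gt_labels]                     # (N, M): high prob = low cost
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)    # (N, M): L1 distance
    cost = (cost_class + cost_box).detach().numpy()
    pred_idx, gt_idx = linear_sum_assignment(cost)       # Hungarian algorithm
    return pred_idx, gt_idx   # matched pairs; the rest train toward "no object"

pred_logits = torch.randn(100, 92)
pred_boxes = torch.rand(100, 4)
gt_labels = torch.tensor([17, 3])   # two ground-truth objects (toy labels)
gt_boxes = torch.rand(2, 4)
pi, gi = match(pred_logits, pred_boxes, gt_labels, gt_boxes)
```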

DETR Limitations

Slow convergence: DETR requires 500 epochs to train (vs. 36 for Faster R-CNN). Weak on small objects: Using only the final backbone feature map loses fine-grained detail. These limitations motivated the next generation.

Decoder Cross-Attention: Where Queries Look

Attention Visualization Across Decoder Layers

A fascinating property of DETR's decoder: in early layers, queries attend broadly across the image (exploring). In later layers, attention sharpens to focus precisely on the object boundaries. This progressive refinement is why the decoder needs multiple layers.

[Diagram: Decoder attention sharpening. Layer 1: broad, diffuse attention. Layer 3: narrowing to objects. Layer 6: attending to object extremities (head, feet, tail) to precisely locate the box.]

3 — RT-DETR (Baidu, 2023)

Zhao et al. — arXiv:2304.08069 — First real-time DETR

RT-DETR asked: can we make DETR real-time? The answer required three changes: a better backbone, an efficient hybrid encoder, and IoU-aware query selection.

HGNetv2 Backbone (or ResNet)

Multi-scale features: S3, S4, S5

RT-DETR uses a fast CNN backbone (HGNetv2) that outputs multi-scale feature maps — unlike DETR's single-scale approach. This gives access to fine-grained features for small objects (S3 at stride 8) alongside high-level features for large objects (S5 at stride 32).

Hybrid Encoder: Intra-Scale + Cross-Scale

Efficient multi-scale fusion

Instead of DETR's expensive all-to-all attention, RT-DETR uses a two-stage encoder: Attention-based Intra-scale Feature Interaction (AIFI) applies self-attention within a scale rather than across the concatenated multi-scale sequence (much cheaper), then the CNN-based Cross-scale Feature Fusion Module (CCFM) merges information across scales using efficient convolutions rather than attention.

[Diagram: RT-DETR hybrid encoder. S3 (80×80), S4 (40×40), and S5 (20×20) pass through AIFI (self-attention within a scale only), then CCFM fuses across scales with CNNs rather than attention. IoU-aware query selection picks the top-K encoder features as decoder queries (not randomly initialized learned embeddings), which feed the 6-layer decoder.]
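The cost argument behind intra-scale attention can be illustrated with a toy sketch (shapes assume a 640×640 input; the module layout is illustrative, not RT-DETR's actual code). Attention within one scale costs O(n_i²) per scale, versus O((Σ n_i)²) for all-to-all attention over the concatenated scales:

```python
import torch
import torch.nn as nn

# One shared self-attention layer applied per scale (batch_first tensors).
attn = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)

scales = {                                # (batch, tokens, dim) per scale
    "S3": torch.randn(1, 80 * 80, 256),   # stride 8
    "S4": torch.randn(1, 40 * 40, 256),   # stride 16
    "S5": torch.randn(1, 20 * 20, 256),   # stride 32
}
out = {name: attn(x) for name, x in scales.items()}  # intra-scale only

# Quadratic cost comparison: per-scale vs. all-to-all.
per_scale = sum(x.shape[1] ** 2 for x in scales.values())
all_to_all = sum(x.shape[1] for x in scales.values()) ** 2
print(all_to_all / per_scale)   # all-to-all is several times more expensive
```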
Model       | Backbone  | COCO AP | FPS (T4)
RT-DETR-R50 | ResNet-50 | 53.1%   | 108
RT-DETR-L   | HGNetv2   | 53.0%   | 114
RT-DETR-X   | HGNetv2   | 54.8%   | 74
Key achievement: RT-DETR proved that DETR-family models can be real-time. It matched YOLO accuracy while keeping DETR's elegant NMS-free design. The flexible decoder also allows speed-accuracy trade-off by adjusting the number of decoder layers at inference time.

4 — RF-DETR (Roboflow, 2025)

Robinson et al. — arXiv:2511.09554 — ICLR 2026 — First real-time detector to exceed 60 AP on COCO

RF-DETR's key insight: CNN backbones don't benefit from large-scale pre-training the way Transformers do. By swapping in a pre-trained DINOv2 ViT backbone (which learned rich visual representations from massive unsupervised data), and using Neural Architecture Search to find optimal sub-networks, RF-DETR pushes accuracy far beyond CNN-based detectors.

DINOv2 ViT Backbone (Pre-trained)

Single-scale ViT → projected multi-scale features

Unlike RT-DETR's CNN backbone that naturally produces multi-scale features, RF-DETR uses a single-scale DINOv2 Vision Transformer (see our ViT walkthrough). A projector module bilinearly interpolates the ViT output to create multi-scale feature maps. The backbone interleaves windowed attention (efficient, local) with global attention (expensive, full-image) blocks to balance accuracy and speed.
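A toy sketch of the projector idea (all shapes hypothetical; the actual projector also applies learned convolutions, not just resampling):

```python
import torch
import torch.nn.functional as F

# A single-scale ViT feature map is resampled into a multi-scale pyramid.
vit_feat = torch.randn(1, 256, 40, 40)   # hypothetical ViT output, stride 16

pyramid = {
    "S3": F.interpolate(vit_feat, scale_factor=2, mode="bilinear",
                        align_corners=False),   # upsample  -> (1, 256, 80, 80)
    "S4": vit_feat,                             # identity  -> (1, 256, 40, 40)
    "S5": F.interpolate(vit_feat, scale_factor=0.5, mode="bilinear",
                        align_corners=False),   # downsample -> (1, 256, 20, 20)
}
```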

Why DINOv2? DINOv2 was pre-trained on 142M images with self-supervised learning. It learned to recognize objects, textures, and spatial relationships without any labels. RF-DETR inherits all this knowledge, so it converges faster and generalizes better than CNN backbones that must learn everything from detection data alone.

LW-DETR Decoder with Deformable Cross-Attention

Lightweight decoder from LW-DETR

The decoder follows the LW-DETR framework, using deformable cross-attention (from Deformable DETR) instead of standard cross-attention. Rather than attending to all spatial positions, each query only attends to a small set of learned sampling points — dramatically reducing computation while maintaining accuracy.
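A minimal single-head sketch of the deformable idea: each query predicts a few sampling offsets around a reference point, reads the feature map there with `grid_sample`, and takes an attention-weighted sum. All names and sizes here are illustrative, not RF-DETR's actual module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableCrossAttn(nn.Module):
    """Toy single-head deformable cross-attention over one feature map."""
    def __init__(self, dim=256, num_points=4):
        super().__init__()
        self.offsets = nn.Linear(dim, num_points * 2)   # (dx, dy) per point
        self.weights = nn.Linear(dim, num_points)       # per-point attn weights
        self.num_points = num_points

    def forward(self, query, ref_points, feat):
        # query: (N, dim); ref_points: (N, 2) in [-1, 1]; feat: (1, dim, H, W)
        n = query.shape[0]
        offs = self.offsets(query).view(n, self.num_points, 2).tanh() * 0.1
        grid = (ref_points[:, None, :] + offs).clamp(-1, 1)   # (N, K, 2)
        sampled = F.grid_sample(feat, grid[None], align_corners=False)
        sampled = sampled[0].permute(1, 2, 0)                 # (N, K, dim)
        w = self.weights(query).softmax(-1)                   # (N, K)
        return (w[..., None] * sampled).sum(1)                # (N, dim)

attn = DeformableCrossAttn()
out = attn(torch.randn(100, 256),          # 100 queries
           torch.rand(100, 2) * 2 - 1,     # reference points in [-1, 1]
           torch.randn(1, 256, 25, 25))    # encoder feature map
```

Each query touches only `num_points` locations instead of all 625, which is where the computational savings come from.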

Neural Architecture Search (NAS)

Discover optimal accuracy-latency trade-offs

RF-DETR's most distinctive contribution: weight-sharing NAS. After training a single large "supernet," thousands of sub-network configurations are evaluated without retraining. This discovers accuracy-latency Pareto curves — the nano, small, and medium variants are NAS-discovered, while base and large are hand-designed.

[Diagram: NAS. Train the supernet once, then evaluate thousands of sub-networks (fast, balanced, accurate) without retraining to trace the accuracy-latency Pareto curve from nano through base to 2XL.]
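The search loop can be illustrated with a toy Pareto filter (the latency and AP formulas are stand-ins; a real search benchmarks and evaluates each sub-net of the trained supernet):

```python
import itertools

def pareto_front(candidates):
    """Keep configs not dominated by another that is both faster and more accurate."""
    return [c for c in candidates
            if not any(o["latency"] <= c["latency"] and o["ap"] >= c["ap"]
                       and o != c for o in candidates)]

# Hypothetical search space: decoder depth x hidden width.
candidates = [
    {"layers": l, "dim": d,
     "latency": l * d / 128,          # stand-in for a measured benchmark
     "ap": 40 + 2 * l + d / 100}      # stand-in for a measured COCO AP
    for l, d in itertools.product([2, 4, 6], [128, 256])
]
front = pareto_front(candidates)      # the accuracy-latency trade-off curve
for cfg in front:
    print(cfg)
```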
Model         | Params | COCO AP | Design
RF-DETR Nano  | 30.5M  | 48.0    | NAS-discovered
RF-DETR Base  | 29.0M  | 53.3    | Hand-designed
RF-DETR Large | 129.0M | ~56     | Hand-designed
RF-DETR 2XL   | —      | 60+     | NAS + hand

5 — Architecture Comparison

Component    | DETR (2020)                         | RT-DETR (2023)                                   | RF-DETR (2025)
Backbone     | ResNet-50 (single scale)            | HGNetv2 (multi-scale S3/S4/S5)                   | DINOv2 ViT (pre-trained, multi-scale via projector)
Encoder      | 6× standard self-attention          | AIFI (intra-scale) + CCFM (cross-scale CNN)      | ViT backbone + windowed/global attention mix
Decoder      | 6× cross-attention + self-attention | 6× standard decoder (flexible: 1-6 at inference) | LW-DETR decoder w/ deformable cross-attention
Queries      | 100 fixed learned embeddings        | IoU-aware top-K from encoder features            | Content-based from encoder features
NMS          | Not needed                          | Not needed                                       | Not needed
Training     | 500 epochs                          | 72 epochs                                        | Fast convergence via pre-trained backbone
Best COCO AP | 43.3                                | 54.8                                             | 60+
Real-time?   | No (~12 FPS)                        | Yes (~114 FPS)                                   | Yes (NAS-optimized)
The DETR family trend: Each generation keeps the core idea (set prediction, no NMS, bipartite matching) while fixing a specific limitation: DETR fixed the pipeline complexity, RT-DETR fixed the speed, RF-DETR fixed the accuracy ceiling by leveraging self-supervised pre-training.

6 — Evolution Summary

[Diagram: DETR family evolution. DETR (2020): ResNet backbone, 43.3 AP, 500 epochs, not real-time; pioneered end-to-end detection with Transformers. RT-DETR (2023): CNN backbone, 54.8 AP, real-time, multi-scale; made it fast with a hybrid encoder, multi-scale features, and IoU-aware queries. RF-DETR (2025): DINOv2 ViT, 60+ AP, NAS; made it accurate with a pre-trained ViT backbone and NAS-discovered configs.]

7 — References & Further Reading