PPO: Proximal Policy Optimization
1 — The Problem: Unstable Policy Gradients
The vanilla policy gradient theorem gives us a beautiful result: we can improve a policy by estimating gradients from sampled trajectories. REINFORCE and its variants compute:

∇_θ J(θ) = E[ Σ_t ∇_θ log π_θ(a_t|s_t) · G_t ]

where G_t is the discounted return from timestep t.
The problem? Step size is critical and fragile. Policy gradient methods compute a direction to move in parameter space, but they say nothing about how far to move. Too small a step and training is painfully slow. Too large a step and the policy can catastrophically collapse — a single bad update can destroy a policy that took millions of timesteps to learn.
TRPO (Trust Region Policy Optimization) solved this with a hard constraint: never let the new policy deviate too far from the old one, measured by KL divergence. But TRPO requires computing second-order derivatives and solving a constrained optimization problem — complex to implement and expensive to compute.
2 — Actor-Critic Architecture
PPO uses an actor-critic framework where two functions work together: the actor (policy network) decides what to do, and the critic (value network) evaluates how good the current state is. In practice, these can share a backbone or be completely separate networks.
Actor Network (Policy)
state → action distribution. Maps a state to a probability distribution over actions. For discrete actions (e.g., Atari), outputs a softmax over action logits. For continuous actions (e.g., MuJoCo), outputs a mean and standard deviation for a Gaussian distribution.
Typical architecture: 2–3 fully-connected layers with 64 or 256 hidden units and tanh/ReLU activations. Input is the state vector [B, state_dim], output is [B, action_dim] (discrete) or [B, 2 × action_dim] for mean + log_std (continuous).
Critic Network (Value Function)
state → scalar value. Estimates V(s), the expected cumulative discounted reward from state s under the current policy. Used to compute advantages but not used during action selection at test time.
Same architecture as the actor but with a single scalar output: [B, state_dim] → [B, 1]. When sharing a backbone, only the final layer differs.
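As a concrete sketch of the two heads, here is a minimal NumPy forward pass for a discrete-action task. Layer sizes follow the typical architecture above; all function names are illustrative, not from any particular library:

```python
import numpy as np

def mlp_init(sizes, rng):
    # One (W, b) pair per layer; sizes like [state_dim, 64, 64, out_dim].
    return [(rng.standard_normal((i, o)) * np.sqrt(2.0 / i), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    # tanh on hidden layers, linear final layer.
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b

def actor_forward(actor_params, states):
    # [B, state_dim] -> [B, action_dim]: softmax over action logits.
    logits = mlp_forward(actor_params, states)
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    return z / z.sum(axis=-1, keepdims=True)

def critic_forward(critic_params, states):
    # [B, state_dim] -> [B]: scalar state-value estimates.
    return mlp_forward(critic_params, states)[:, 0]

rng = np.random.default_rng(0)
actor = mlp_init([4, 64, 64, 2], rng)    # e.g. CartPole: 4-dim state, 2 actions
critic = mlp_init([4, 64, 64, 1], rng)   # same body shape, scalar head
probs = actor_forward(actor, rng.standard_normal((8, 4)))     # [8, 2]
values = critic_forward(critic, rng.standard_normal((8, 4)))  # [8]
```

With a shared backbone, `mlp_forward` up to the last hidden layer would be computed once and fed to both output heads.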
3 — The Clipped Surrogate Objective
This is the core innovation of PPO. Instead of constraining the KL divergence (TRPO), PPO clips the objective function itself so the policy can never move too far in a single update.
The Probability Ratio
Define the probability ratio between the new and old policy:

r(θ) = π_θ(a|s) / π_θold(a|s)

At the start of each update θ = θold, so r(θ) = 1; it drifts away from 1 as the policy is optimized on the same batch.
The Clipped Objective
The PPO-Clip objective is:
L^CLIP(θ) = E[ min( r(θ)·Â, clip(r(θ), 1−ε, 1+ε)·Â ) ]    (ε = 0.2 by default)
The min() takes the more pessimistic (lower) of two terms: the unclipped surrogate and the clipped surrogate. This creates a "trust region" without ever computing KL divergence.
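A minimal NumPy sketch of this objective, written as a loss to minimize (`ppo_clip_loss` is an illustrative name):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    # Probability ratio r(θ), computed in log space for numerical stability.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # min() keeps the pessimistic term; negate to get a loss to minimize.
    return -np.mean(np.minimum(unclipped, clipped))

# A good action (Â > 0) whose probability already grew 50% contributes only
# the flat clipped value 1.2, so no gradient pushes the ratio past 1 + ε.
loss = ppo_clip_loss(np.log([1.5]), np.log([1.0]), np.array([1.0]))
```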
Why Does This Work?
When Â > 0 (good action)
Encourage, but not too much. The objective wants to increase r(θ), making the good action more likely. But once r(θ) exceeds 1+ε, the clipped term becomes flat. The min() picks the flat clipped value, so there's no gradient to push r further. The policy can increase the action's probability, but only up to 20% more than before.
When Â < 0 (bad action)
Discourage, but not too much. The objective wants to decrease r(θ), making the bad action less likely. But once r(θ) drops below 1−ε, the clipped term is again flat. The policy can decrease the action's probability, but only down to 20% less than before.
4 — Generalized Advantage Estimation (GAE)
The advantage Â(s,a) measures how much better action a is compared to the average action in state s. Computing good advantage estimates is critical — noisy advantages lead to noisy gradients and poor updates.
TD Residuals
The one-step TD residual at time t is:

δ_t = r_t + γ·V(s_{t+1}) − V(s_t)

It measures how much better (or worse) the observed one-step outcome was than the critic's current estimate.
The Bias-Variance Tradeoff
We could use just δt as the advantage (low variance, high bias) or a full Monte Carlo return (no bias, high variance). GAE provides a smooth interpolation controlled by λ ∈ [0, 1]:
GAE(γ, λ)
Â_t^GAE = Σ_{l=0}^{∞} (γλ)^l · δ_{t+l}    (γ = 0.99, λ = 0.95 typical)
This is an exponentially-weighted sum of TD residuals. When λ = 0, we get the one-step TD advantage (high bias, low variance). When λ = 1, we get the Monte Carlo advantage (no bias, high variance). The default λ = 0.95 provides a good balance.
| λ Value | Estimator | Bias | Variance | Behavior |
|---|---|---|---|---|
| λ = 0 | One-step TD | High | Low | Only uses immediate reward + next value |
| λ = 0.95 | GAE (default) | Low | Moderate | Good balance for most tasks |
| λ = 1 | Monte Carlo | Zero | High | Full episode return minus baseline |
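The GAE sum is usually computed backward over the rollout, using the recursion Â_t = δ_t + γλ·Â_{t+1}. A minimal NumPy sketch, with illustrative names:

```python
import numpy as np

def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    # rewards, values, dones: arrays of length T; last_value bootstraps V(s_T).
    T = len(rewards)
    adv = np.zeros(T)
    gae, next_value = 0.0, last_value
    for t in reversed(range(T)):               # backward recursion:
        nonterminal = 1.0 - dones[t]           # A_t = δ_t + γλ · A_{t+1}
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        adv[t] = gae
        next_value = values[t]
    returns = adv + values                     # targets for the value loss
    return adv, returns
```

Zeroing the recursion at episode boundaries (`dones`) keeps advantages from leaking across episodes.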
5 — The PPO Training Loop
PPO alternates between collecting data and optimizing the policy. Unlike off-policy methods (SAC, TD3), PPO uses each batch of data for only a few epochs before discarding it — the data is "on-policy" and becomes stale as the policy changes.
Step 1: Collect Trajectories
Run the current policy πθold in N parallel environments for T timesteps each, giving N × T timesteps per iteration. Store (s, a, r, s′, log π(a|s), V(s)) for each transition. Typical: N = 8 environments × T = 2048 steps = 16,384 transitions per batch.
Step 2: Compute Advantages
Using the collected rewards and value estimates, compute the GAE advantage Â_t for every timestep, sweeping the batch in reverse order since each Â_t depends on Â_{t+1}. Then normalize advantages to zero mean and unit variance across the batch; this is a critical implementation detail that significantly improves stability.
Step 3: K Epochs of Mini-Batch Updates
Shuffle the collected data and split it into M mini-batches (K = 3–10 epochs, mini-batch size 64–4096). For each mini-batch, compute the clipped surrogate loss and update the policy; repeat for K epochs on the same data. The clipping ensures that even after K passes, the policy has not moved too far from πold.
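Step 3's epoch and mini-batch scheduling can be sketched as follows (`minibatch_schedule` is an illustrative name, not a library API):

```python
import numpy as np

def minibatch_schedule(n_transitions, minibatch_size, epochs, rng):
    # K epochs over the same rollout, reshuffled at the start of each epoch.
    for _ in range(epochs):
        idx = rng.permutation(n_transitions)
        for start in range(0, n_transitions, minibatch_size):
            yield idx[start:start + minibatch_size]

rng = np.random.default_rng(0)
batches = list(minibatch_schedule(16_384, 4_096, epochs=4, rng=rng))
# 4 epochs × 4 mini-batches = 16 gradient steps per rollout
```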
6 — Implementation Details That Matter
PPO's simplicity is deceptive. The algorithm pseudocode fits on a napkin, but the implementation details can make a 10× difference in performance. Here are the ones that matter most:
Entropy Bonus
Add an entropy bonus to the loss: L = L^CLIP − c1·L^VF + c2·H(π), with c2 = 0.01 typical. Entropy regularization prevents the policy from collapsing to a deterministic action too early. Without it, PPO can get stuck in suboptimal deterministic policies, especially in environments with sparse rewards.
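A minimal NumPy sketch of the entropy term for a discrete policy (names are illustrative):

```python
import numpy as np

def entropy_bonus(probs, c2=0.01):
    # H(π) = -Σ_a π(a|s) log π(a|s), averaged over the batch.
    ent = -np.sum(probs * np.log(probs + 1e-8), axis=-1)
    return c2 * float(np.mean(ent))

uniform = np.full((1, 4), 0.25)                  # maximum entropy: log 4
peaked = np.array([[0.97, 0.01, 0.01, 0.01]])    # nearly deterministic
```

The uniform distribution earns a larger bonus than the peaked one, so the bonus pushes back against premature determinism.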
Advantage Normalization
Normalize advantages to zero mean and unit standard deviation, computed per mini-batch: Â ← (Â − μ) / (σ + 10⁻⁸). Without normalization, the magnitude of advantages varies wildly across training, making the effective learning rate inconsistent.
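As a sketch:

```python
import numpy as np

def normalize_advantages(adv, eps=1e-8):
    # Zero mean, unit standard deviation within the mini-batch.
    return (adv - adv.mean()) / (adv.std() + eps)

adv = np.array([10.0, 20.0, 30.0, 40.0])
norm = normalize_advantages(adv)
```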
Value Function Clipping
Optional and debated. Some implementations clip the value-function loss similarly to the policy loss: L^VF = max((V − V_target)², (V_clipped − V_target)²). This is controversial: the original implementation includes it, but ablation studies (Engstrom et al., 2020) suggest it sometimes hurts. Many modern implementations skip it.
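A sketch of the clipped value loss, assuming `v_old` is the value predicted at rollout time (names illustrative):

```python
import numpy as np

def clipped_value_loss(v_new, v_old, v_target, eps=0.2):
    # The new prediction may move at most ±eps from the rollout-time value.
    v_clipped = v_old + np.clip(v_new - v_old, -eps, eps)
    unclipped = (v_new - v_target) ** 2
    clipped = (v_clipped - v_target) ** 2
    # max(), not min(): keep the pessimistic (larger) squared error.
    return 0.5 * float(np.mean(np.maximum(unclipped, clipped)))

loss = clipped_value_loss(np.array([1.0]), np.array([0.0]), np.array([1.0]))
```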
Learning Rate Annealing
Linearly decay the learning rate from its initial value (typically 3×10⁻⁴) to zero over the course of training. Combined with the fixed clip range, this smoothly reduces update magnitude as training progresses.
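As a one-line sketch:

```python
def linear_lr(step, total_steps, lr0=3e-4):
    # Fraction of training remaining scales the initial learning rate.
    return lr0 * (1.0 - step / total_steps)
```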
Gradient Clipping
Clip the global gradient norm to max_grad_norm = 0.5. While PPO's objective clipping prevents catastrophic policy updates, gradient clipping adds an additional safety net against rare pathological mini-batches that produce extremely large gradients.
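A minimal sketch of global-norm clipping, mirroring what e.g. `torch.nn.utils.clip_grad_norm_` does (names illustrative):

```python
import numpy as np

def clip_grad_norm(grads, max_norm=0.5):
    # Global L2 norm across all parameter gradients, then a uniform rescale.
    total = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    scale = min(1.0, max_norm / (total + 1e-8))
    return [g * scale for g in grads], total

grads = [np.array([3.0]), np.array([4.0])]       # global norm 5.0
clipped, norm = clip_grad_norm(grads)
```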
Orthogonal Initialization
Initialize network weights with orthogonal matrices, with a gain that varies by layer: gain = √2 for hidden layers (works well with ReLU), gain = 0.01 for the policy output layer (so the initial action distribution is near-uniform), and gain = 1.0 for the value output layer.
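A minimal NumPy sketch of orthogonal initialization via a QR decomposition of a Gaussian matrix (the standard construction; names illustrative):

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, seed=0):
    # QR factorization of a Gaussian matrix gives an orthonormal factor.
    rng = np.random.default_rng(seed)
    rows, cols = shape
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))  # fix column signs for a uniform distribution
    if rows < cols:
        q = q.T
    return gain * q

W = orthogonal_init((64, 4), gain=np.sqrt(2))    # hidden layer: gain √2
```

With gain √2, the columns satisfy WᵀW = 2I, which preserves activation scale under ReLU.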
| Hyperparameter | Atari (Discrete) | MuJoCo (Continuous) |
|---|---|---|
| Parallel environments N | 8 | 1–64 |
| Rollout length T | 128 | 2048 |
| Mini-batch size | 256 | 64 |
| Epochs K | 4 | 10 |
| Clip ε | 0.1 | 0.2 |
| Learning rate | 2.5×10⁻⁴ | 3×10⁻⁴ |
| Discount γ | 0.99 | 0.99 |
| GAE λ | 0.95 | 0.95 |
| Entropy coeff c2 | 0.01 | 0.0 |
7 — PPO in Practice
PPO has become the default RL algorithm across an extraordinary range of applications. Its combination of stability, simplicity, and competitive performance makes it the first algorithm most practitioners try.
RLHF: PPO for Language Models
PPO's most high-profile application is Reinforcement Learning from Human Feedback (RLHF), used to align large language models like ChatGPT. The setup: the LLM is the policy, generating text is the action sequence, and a trained reward model scores the output. PPO fine-tunes the LLM to maximize the reward model's score while staying close to the original model (via a KL penalty). The clip mechanism is perfect here — it prevents the LLM from diverging too far into reward-hacked outputs.
Algorithm Comparison
| Property | REINFORCE | TRPO | PPO |
|---|---|---|---|
| Year | 1992 | 2015 | 2017 |
| Update rule | Vanilla gradient | KL-constrained | Clipped surrogate |
| Trust region | None | Hard KL constraint | Implicit (clipping) |
| Optimization | First-order | Second-order (conjugate gradient) | First-order (Adam) |
| Implementation complexity | Low | High | Low |
| Data reuse per batch | 1 epoch | 1 step | K epochs (3–10) |
| Stability | Low | High | High |
| Sample efficiency | Low | Low–Medium | Medium |
| Wall-clock speed | Fast per update | Slow (Hessian-vector products) | Fast per update |
8 — References
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust Region Policy Optimization. ICML 2015.
Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR 2016.
Williams, R.J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3), 229–256.
Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022.
Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., & Madry, A. (2020). Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO. ICLR 2020.
Huang, S., et al. (2022). The 37 Implementation Details of Proximal Policy Optimization. ICLR Blog Track 2022.