PPO: Proximal Policy Optimization

Stable Policy Gradient Methods for Reinforcement Learning

1 — The Problem: Unstable Policy Gradients

The vanilla policy gradient theorem gives us a beautiful result: we can improve a policy by estimating gradients from sampled trajectories. REINFORCE and its variants compute:

Vanilla Policy Gradient: ∇J(θ) = E[∇ log πθ(a|s) · Â(s,a)] — move in the direction that increases the probability of actions with positive advantage.

The problem? Step size is critical and fragile. Policy gradient methods compute a direction to move in parameter space, but they say nothing about how far to move. Too small a step and training is painfully slow. Too large a step and the policy can catastrophically collapse — a single bad update can destroy a policy that took millions of timesteps to learn.

[Figure: return vs. training steps — a small step size is slow but stable, PPO is fast and stable, and a large step collapses after a single bad update.]

TRPO (Trust Region Policy Optimization) solved this with a hard constraint: never let the new policy deviate too far from the old one, measured by KL divergence. But TRPO requires computing second-order derivatives and solving a constrained optimization problem — complex to implement and expensive to compute.

PPO's insight: Instead of a hard KL constraint (TRPO), use a simple clipped objective that automatically prevents large updates. Same stability, first-order optimization only, trivial to implement.

2 — Actor-Critic Architecture

PPO uses an actor-critic framework where two functions work together: the actor (policy network) decides what to do, and the critic (value network) evaluates how good the current state is. In practice, these can share a backbone or be completely separate networks.

[Diagram: state s feeds an optional shared backbone; the actor head πθ(a|s) outputs action probabilities from which a ~ π is sampled, and the critic head Vφ(s) outputs the value V(s); the environment returns r, s'.]

Actor Network (Policy)

state → action distribution

Maps a state to a probability distribution over actions. For discrete actions (e.g., Atari), outputs a softmax over action logits. For continuous actions (e.g., MuJoCo), outputs a mean and standard deviation for a Gaussian distribution.

Typical architecture: 2–3 fully-connected layers with 64 or 256 hidden units and tanh/ReLU activations. Input is the state vector [B, state_dim], output is [B, action_dim] (discrete) or [B, 2 × action_dim] for mean + log_std (continuous).

Critic Network (Value Function)

state → scalar value

Estimates V(s) — the expected cumulative discounted reward from state s under the current policy. Used to compute advantages but not used during action selection at test time.

Same architecture as the actor but with a single scalar output: [B, state_dim] → [B, 1]. When sharing a backbone, only the final layer differs.

Shared vs. Separate: Atari PPO typically shares a CNN backbone between actor and critic (saves parameters). MuJoCo PPO typically uses separate MLPs (more stable for continuous control). The loss becomes L = LCLIP − c1LVF + c2S[π] when sharing.
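As a shape sketch only (illustrative NumPy with made-up weight names, not a trainable implementation), a shared-backbone forward pass for a discrete policy looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(state, params):
    """Tiny shared-backbone actor-critic forward pass (shapes only)."""
    h = np.tanh(state @ params["W_body"])            # [B, state_dim] -> [B, hidden]
    logits = h @ params["W_actor"]                   # [B, action_dim]
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)            # softmax over discrete actions
    value = (h @ params["W_critic"]).squeeze(-1)     # [B] scalar state values
    return probs, value

params = {"W_body": rng.normal(size=(4, 64)),
          "W_actor": rng.normal(size=(64, 2)),
          "W_critic": rng.normal(size=(64, 1))}
probs, value = forward(rng.normal(size=(8, 4)), params)
print(probs.shape, value.shape)                      # (8, 2) (8,)
```

Only the final `W_actor` / `W_critic` heads differ; everything before them is the shared backbone.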

3 — The Clipped Surrogate Objective

This is the core innovation of PPO. Instead of constraining the KL divergence (TRPO), PPO clips the objective function itself so the policy can never move too far in a single update.

The Probability Ratio

Define the probability ratio between the new and old policy:

r(θ) = πθ(a|s) / πθold(a|s) — How much more (or less) likely is action a under the new policy compared to the old one? When r = 1, the policies are identical for this state-action pair. r > 1 means the new policy is more likely to take this action; r < 1 means less likely.

The Clipped Objective

The PPO-Clip objective is:

LCLIP(θ) = E[ min( r(θ)·Â, clip(r(θ), 1−ε, 1+ε)·Â ) ], with clip range ε = 0.2 by default.

The min() takes the more pessimistic (lower) of two terms: the unclipped surrogate and the clipped surrogate. This creates a "trust region" without ever computing KL divergence.
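The min-and-clip can be written directly in a few lines. A minimal NumPy sketch (the function name and batch layout are assumptions; old log-probabilities come from the rollout buffer, new ones from the current policy):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate objective (to be maximized), averaged over the batch."""
    ratio = np.exp(logp_new - logp_old)           # r(theta) = pi_new / pi_old
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return np.minimum(unclipped, clipped).mean()  # pessimistic of the two terms

# A good action (adv > 0) whose ratio already exceeds 1+eps gains nothing extra:
print(ppo_clip_loss(np.log([1.5]), np.log([1.0]), np.array([2.0])))  # 2.4, not 3.0
```

Working with log-probabilities and exponentiating the difference is the numerically standard way to form the ratio.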

[Figure: LCLIP as a function of r(θ), with clip boundaries at 1−ε = 0.8 and 1+ε = 1.2. For Â > 0 (good action), the unclipped r·Â line is capped beyond r = 1.2 — no more benefit. For Â < 0 (bad action), it is capped below r = 0.8. The min() operator selects the lower (more pessimistic) curve at each point.]

Why Does This Work?

When Â > 0 (good action)

Encourage, but not too much

The objective wants to increase r(θ) — make the good action more likely. But once r(θ) exceeds 1+ε, the clipped term becomes flat. The min() picks the flat clipped value, so there's no gradient to push r further. The policy can increase the action's probability, but only up to 20% more than before.

When Â < 0 (bad action)

Discourage, but not too much

The objective wants to decrease r(θ) — make the bad action less likely. But once r(θ) drops below 1−ε, the clipped term is again flat. The policy can decrease the action's probability, but only down to 20% less than before.

The elegance of PPO: The clipping mechanism creates an implicit trust region. No second-order optimization, no KL penalty tuning, no line search. Just a simple min() and clip() in the loss function. This is why PPO became the default algorithm — it's almost as stable as TRPO but trivial to implement.

4 — Generalized Advantage Estimation (GAE)

The advantage Â(s,a) measures how much better action a is compared to the average action in state s. Computing good advantage estimates is critical — noisy advantages lead to noisy gradients and poor updates.

TD Residuals

The one-step TD residual at time t is:

δt = rt + γV(st+1) − V(st) — The actual reward plus the estimated future value, minus what we expected. Positive δ means things went better than expected.

The Bias-Variance Tradeoff

We could use just δt as the advantage (low variance, high bias) or a full Monte Carlo return (no bias, high variance). GAE provides a smooth interpolation controlled by λ ∈ [0, 1]:

GAE(γ, λ): ÂGAEt = Σl=0…∞ (γλ)^l · δt+l, with γ = 0.99 and λ = 0.95 as typical defaults.

This is an exponentially-weighted sum of TD residuals. When λ = 0, we get the one-step TD advantage (high bias, low variance). When λ = 1, we get the Monte Carlo advantage (no bias, high variance). The default λ = 0.95 provides a good balance.

[Diagram: GAE as an exponentially-weighted sum of TD residuals — δt weighted by 1, δt+1 by γλ, δt+2 by (γλ)², and so on: Ât = δt + γλ·δt+1 + (γλ)²·δt+2 + …]
λ value | Estimator | Bias | Variance | Behavior
λ = 0 | One-step TD | High | Low | Only uses immediate reward + next value
λ = 0.95 | GAE (default) | Low | Moderate | Good balance for most tasks
λ = 1 | Monte Carlo | Zero | High | Full episode return minus baseline

5 — The PPO Training Loop

PPO alternates between collecting data and optimizing the policy. Unlike off-policy methods (SAC, TD3), PPO uses each batch of data for only a few epochs before discarding it — the data is "on-policy" and becomes stale as the policy changes.

[Diagram: the PPO training loop — 1. collect trajectories (run πold for T steps in N parallel environments); 2. compute advantages (GAE with γ = 0.99, λ = 0.95, then normalize); 3. K epochs of mini-batch updates (shuffle the data, split into M mini-batches, update π with LCLIP + entropy bonus); 4. update the value function (MSE loss against Rtarget); repeat. Typical: N = 8 parallel envs, T = 2048 steps each, K = 3–10 epochs, M = 32–64 mini-batches.]

Step 1: Collect Trajectories

N × T timesteps per iteration

Run the current policy πθold in N parallel environments for T timesteps each. Store (s, a, r, s', logπ(a|s), V(s)) for each transition. Typical: N=8 environments, T=2048 steps = 16,384 transitions per batch.

Step 2: Compute Advantages

GAE with normalization

Using the collected rewards and value estimates, compute GAE advantages Ât for every timestep (computed in reverse order for efficiency). Then normalize advantages to zero mean and unit variance across the batch — this is a critical implementation detail that significantly improves stability.
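A sketch of that reverse-order computation in NumPy (the function signature is an assumption; `dones` marks episode boundaries so the recursion resets there):

```python
import numpy as np

def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """GAE advantages, computed backwards so each step reuses the next one."""
    T = len(rewards)
    adv = np.zeros(T)
    next_adv, next_value = 0.0, last_value
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]              # cut the recursion at episode ends
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        adv[t] = next_adv = delta + gamma * lam * nonterminal * next_adv
        next_value = values[t]
    returns = adv + values                        # value-function regression targets
    adv = (adv - adv.mean()) / (adv.std() + 1e-8) # normalize: zero mean, unit std
    return adv, returns
```

The unnormalized advantages plus the value estimates give the return targets used later for the value loss.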

Step 3: K Epochs of Mini-Batch Updates

K=3–10, mini-batch size 64–4096

Shuffle the collected data and split into M mini-batches. For each mini-batch, compute the clipped surrogate loss and update the policy. Repeat for K epochs on the same data. The clipping ensures that even after K passes, the policy hasn't moved too far from πold.

Why K > 1 works: In vanilla policy gradient, you use each batch exactly once (one gradient step). PPO's clipping lets you reuse data for multiple epochs because it automatically stops the policy from diverging — once r(θ) hits the clip boundary, gradients become zero. This dramatically improves sample efficiency.
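The shuffle-and-split bookkeeping for step 3 can be sketched as (names are illustrative):

```python
import numpy as np

def minibatch_indices(n, num_minibatches, num_epochs, seed=0):
    """Yield index arrays for K epochs of M mini-batches over n transitions."""
    rng = np.random.default_rng(seed)
    for _ in range(num_epochs):
        perm = rng.permutation(n)                 # reshuffle every epoch
        for chunk in np.array_split(perm, num_minibatches):
            yield chunk                           # update policy/value on this chunk

batches = list(minibatch_indices(16384, num_minibatches=32, num_epochs=4))
print(len(batches), len(batches[0]))              # 128 mini-batches of 512 each
```

Each epoch visits every transition exactly once, just in a different random order.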

6 — Implementation Details That Matter

PPO's simplicity is deceptive. The algorithm pseudocode fits on a napkin, but the implementation details can make a 10× difference in performance. Here are the ones that matter most:

Entropy Bonus

c2 = 0.01 typical

Add an entropy bonus to the loss: L = LCLIP − c1LVF + c2 H(π). Entropy regularization prevents the policy from collapsing to a deterministic action too early. Without it, PPO can get stuck in suboptimal deterministic policies, especially in environments with sparse rewards.
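Sketched in NumPy for a discrete policy (c1 = 0.5 is a common but assumed value-loss coefficient; the outer sign flip turns the objective into a loss to minimize):

```python
import numpy as np

def entropy_bonus_loss(l_clip, l_vf, probs, c1=0.5, c2=0.01):
    """Combined PPO loss (to be minimized): -(L_CLIP - c1*L_VF + c2*H)."""
    h = -np.sum(probs * np.log(probs + 1e-8), axis=-1).mean()  # policy entropy
    return -(l_clip - c1 * l_vf + c2 * h)

# A near-deterministic policy earns almost no entropy bonus:
peaked  = np.array([[0.98, 0.01, 0.01]])
uniform = np.full((1, 3), 1 / 3)
print(entropy_bonus_loss(0.0, 0.0, peaked) > entropy_bonus_loss(0.0, 0.0, uniform))
```

The bonus is largest for the uniform distribution, so gradient descent is gently pushed away from premature determinism.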

Advantage Normalization

per mini-batch

Normalize advantages to zero mean and unit standard deviation: Â ← (Â − μ) / (σ + 10⁻⁸). This is done per mini-batch. Without normalization, the magnitude of advantages varies wildly across training, making the effective learning rate inconsistent.

Value Function Clipping

optional, debated

Some implementations clip the value function loss similarly to the policy loss: LVF = max((V − Vtarget)², (Vclipped − Vtarget)²). This is controversial — the original paper includes it, but ablation studies suggest it sometimes hurts. Most modern implementations skip this.
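For reference, the clipped value loss can be sketched as (assuming `v_old` holds the value predictions stored at rollout time):

```python
import numpy as np

def clipped_value_loss(v_new, v_old, v_target, eps=0.2):
    """Pessimistic (max) of clipped and unclipped squared value errors."""
    v_clipped = v_old + np.clip(v_new - v_old, -eps, eps)  # limit the value update
    return np.maximum((v_new - v_target) ** 2,
                      (v_clipped - v_target) ** 2).mean()

# The clipped branch dominates when the value net tries to jump by more than eps:
print(clipped_value_loss(np.array([1.0]), np.array([0.0]), np.array([1.0])))  # ~0.64
```

Skipping this and using a plain MSE against the return targets is the simpler, often equally good, alternative.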

Learning Rate Annealing

linear decay to 0

Linearly decay the learning rate from its initial value (typically 3×10⁻⁴) to zero over the course of training. Combined with the clip range, this provides a smooth reduction in update magnitude as training progresses.
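A sketch of the schedule (assuming `step` counts completed update iterations):

```python
def linear_lr(step, total_steps, lr_init=3e-4):
    """Linearly anneal the learning rate from lr_init down to zero."""
    frac = 1.0 - step / total_steps   # fraction of training remaining
    return lr_init * frac
```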

Gradient Clipping

max_grad_norm = 0.5

Clip the global gradient norm to 0.5. While PPO's objective clipping prevents catastrophic policy updates, gradient clipping adds an additional safety net against rare pathological mini-batches that produce extremely large gradients.
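Global-norm clipping over a list of per-parameter gradient arrays, sketched in NumPy:

```python
import numpy as np

def clip_grad_norm(grads, max_norm=0.5):
    """Rescale all gradients so their combined L2 norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-8))   # no-op when already small enough
    return [g * scale for g in grads], total
```

Because every gradient is scaled by the same factor, the update direction is preserved; only its magnitude shrinks.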

Orthogonal Initialization

gain varies by layer

Initialize network weights with orthogonal matrices. Use gain=√2 for hidden layers (works well with ReLU), gain=0.01 for the policy output layer (initially near-uniform action distribution), and gain=1.0 for the value output layer.
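A QR-based sketch of orthogonal initialization in NumPy (deep-learning frameworks ship their own versions; this mirrors the common recipe):

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, seed=0):
    """Orthogonal weight init: QR-decompose a Gaussian matrix, scale by gain."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=shape)
    q, r = np.linalg.qr(a if shape[0] >= shape[1] else a.T)
    q *= np.sign(np.diag(r))            # fix signs to make the decomposition unique
    q = q if shape[0] >= shape[1] else q.T
    return gain * q[:shape[0], :shape[1]]
```

A quick sanity check: with gain √2, the rows of a square layer satisfy W·Wᵀ = 2I.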

Hyperparameter | Atari (Discrete) | MuJoCo (Continuous)
Parallel environments N | 8 | 1–64
Rollout length T | 128 | 2048
Mini-batch size | 256 | 64
Epochs K | 4 | 10
Clip ε | 0.1 | 0.2
Learning rate | 2.5×10⁻⁴ | 3×10⁻⁴
Discount γ | 0.99 | 0.99
GAE λ | 0.95 | 0.95
Entropy coeff c2 | 0.01 | 0.0

7 — PPO in Practice

PPO has become the default RL algorithm across an extraordinary range of applications. Its combination of stability, simplicity, and competitive performance makes it the first algorithm most practitioners try.

[Diagram: the PPO application landscape since 2017 — game AI, robotics, RLHF / LLMs, autonomous systems, finance, and chip design.]

RLHF: PPO for Language Models

PPO's most high-profile application is Reinforcement Learning from Human Feedback (RLHF), used to align large language models like ChatGPT. The setup: the LLM is the policy, generating text is the action sequence, and a trained reward model scores the output. PPO fine-tunes the LLM to maximize the reward model's score while staying close to the original model (via a KL penalty). The clip mechanism is perfect here — it prevents the LLM from diverging too far into reward-hacked outputs.

PPO for RLHF: J(θ) = E[rreward(x, y)] − β·KL(πθ ‖ πref) — maximize the reward model's score while staying close to the reference (pre-RLHF) model. The KL term prevents "reward hacking", where the model finds nonsensical outputs that fool the reward model.

Algorithm Comparison

Property | REINFORCE | TRPO | PPO
Year | 1992 | 2015 | 2017
Update rule | Vanilla gradient | KL-constrained | Clipped surrogate
Trust region | None | Hard KL constraint | Implicit (clipping)
Optimization | First-order | Second-order (conjugate gradient) | First-order (Adam)
Implementation complexity | Low | High | Low
Data reuse per batch | 1 epoch | 1 step | K epochs (3–10)
Stability | Low | High | High
Sample efficiency | Low | Low–Medium | Medium
Wall-clock speed | Fast per update | Slow (Hessian-vector products) | Fast per update
Why PPO won: TRPO proved that trust regions work. PPO showed that you don't need the mathematical machinery — a simple clip achieves the same effect. Combined with parallel environments, GAE, and mini-batch updates, PPO gets TRPO-level stability at REINFORCE-level implementation complexity. That combination is why every major RL library implements PPO first.

8 — References

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust Region Policy Optimization. ICML 2015.

Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR 2016.

Williams, R.J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3), 229–256.

Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022.

Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., & Madry, A. (2020). Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO. ICLR 2020.

Huang, S., et al. (2022). The 37 Implementation Details of Proximal Policy Optimization. ICLR Blog Track 2022.