PPO: Proximal Policy Optimization

Stable Policy Gradient Methods for Reinforcement Learning

1 — The Problem: Unstable Policy Gradients

The vanilla policy gradient theorem gives us a beautiful result: we can improve a policy by estimating gradients from sampled trajectories. REINFORCE and its variants compute:

Vanilla Policy Gradient: ∇J(θ) = E[∇ log πθ(a|s) · Â(s,a)] — move in the direction that increases the probability of actions with positive advantage.

The problem? Step size is critical and fragile. Policy gradient methods compute a direction to move in parameter space, but they say nothing about how far to move. Too small a step and training is painfully slow. Too large a step and the policy can catastrophically collapse — a single bad update can destroy a policy that took millions of timesteps to learn.

[Figure: return vs. training steps — a small step size is slow but stable, PPO is fast and stable, and a large step collapses after a single bad update.]

TRPO (Trust Region Policy Optimization) solved this with a hard constraint: never let the new policy deviate too far from the old one, measured by KL divergence. But TRPO requires computing second-order derivatives and solving a constrained optimization problem — complex to implement and expensive to compute.

PPO's insight: Instead of a hard KL constraint (TRPO), use a simple clipped objective that automatically prevents large updates. Same stability, first-order optimization only, trivial to implement.

2 — Actor-Critic Architecture

PPO uses an actor-critic framework where two functions work together: the actor (policy network) decides what to do, and the critic (value network) evaluates how good the current state is. In practice, these can share a backbone or be completely separate networks.

[Diagram: state s feeds an optional shared backbone; the actor head πθ(a|s) outputs action probabilities from which a ~ π is sampled, and the critic head Vφ(s) outputs the value V(s); the environment returns r, s'.]

Actor Network (Policy)

state → action distribution

Maps a state to a probability distribution over actions. For discrete actions (e.g., Atari), outputs a softmax over action logits. For continuous actions (e.g., MuJoCo), outputs a mean and standard deviation for a Gaussian distribution.

Typical architecture: 2–3 fully-connected layers with 64 or 256 hidden units and tanh/ReLU activations. Input is the state vector [B, state_dim], output is [B, action_dim] (discrete) or [B, 2 × action_dim] for mean + log_std (continuous).

Critic Network (Value Function)

state → scalar value

Estimates V(s) — the expected cumulative discounted reward from state s under the current policy. Used to compute advantages but not used during action selection at test time.

Same architecture as the actor but with a single scalar output: [B, state_dim] → [B, 1]. When sharing a backbone, only the final layer differs.

Shared vs. Separate: Atari PPO typically shares a CNN backbone between actor and critic (saves parameters). MuJoCo PPO typically uses separate MLPs (more stable for continuous control). The loss becomes L = LCLIP − c1LVF + c2S[π] when sharing.
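As a shape sketch only (illustrative NumPy with made-up weight names, not a trainable implementation), a shared-backbone forward pass for a discrete policy looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(state, params):
    """Tiny shared-backbone actor-critic forward pass (shapes only)."""
    h = np.tanh(state @ params["W_body"])            # [B, state_dim] -> [B, hidden]
    logits = h @ params["W_actor"]                   # [B, action_dim]
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)            # softmax over discrete actions
    value = (h @ params["W_critic"]).squeeze(-1)     # [B] scalar state values
    return probs, value

params = {"W_body": rng.normal(size=(4, 64)),
          "W_actor": rng.normal(size=(64, 2)),
          "W_critic": rng.normal(size=(64, 1))}
probs, value = forward(rng.normal(size=(8, 4)), params)
print(probs.shape, value.shape)                      # (8, 2) (8,)
```

Only the final `W_actor` / `W_critic` heads differ; everything before them is the shared backbone.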

3 — The Clipped Surrogate Objective

This is the core innovation of PPO. Instead of constraining the KL divergence (TRPO), PPO clips the objective function itself so the policy can never move too far in a single update.

The Probability Ratio

Define the probability ratio between the new and old policy:

r(θ) = πθ(a|s) / πθold(a|s) — How much more (or less) likely is action a under the new policy compared to the old one? When r = 1, the policies are identical for this state-action pair. r > 1 means the new policy is more likely to take this action; r < 1 means less likely.

The Clipped Objective

The PPO-Clip objective is:

LCLIP(θ) = E[ min( r(θ)·Â, clip(r(θ), 1−ε, 1+ε)·Â ) ], with clip range ε = 0.2 by default.

The min() takes the more pessimistic (lower) of two terms: the unclipped surrogate and the clipped surrogate. This creates a "trust region" without ever computing KL divergence.
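The min-and-clip can be written directly in a few lines. A minimal NumPy sketch (the function name and batch layout are assumptions; old log-probabilities come from the rollout buffer, new ones from the current policy):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate objective (to be maximized), averaged over the batch."""
    ratio = np.exp(logp_new - logp_old)           # r(theta) = pi_new / pi_old
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return np.minimum(unclipped, clipped).mean()  # pessimistic of the two terms

# A good action (adv > 0) whose ratio already exceeds 1+eps gains nothing extra:
print(ppo_clip_loss(np.log([1.5]), np.log([1.0]), np.array([2.0])))  # 2.4, not 3.0
```

Working with log-probabilities and exponentiating the difference is the numerically standard way to form the ratio.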

[Figure: LCLIP as a function of r(θ), with clip boundaries at 1−ε = 0.8 and 1+ε = 1.2. For Â > 0 (good action), the unclipped r·Â line is capped beyond r = 1.2 — no more benefit. For Â < 0 (bad action), it is capped below r = 0.8. The min() operator selects the lower (more pessimistic) curve at each point.]

Why Does This Work?

When Â > 0 (good action)

Encourage, but not too much

The objective wants to increase r(θ) — make the good action more likely. But once r(θ) exceeds 1+ε, the clipped term becomes flat. The min() picks the flat clipped value, so there's no gradient to push r further. The policy can increase the action's probability, but only up to 20% more than before.

When Â < 0 (bad action)

Discourage, but not too much

The objective wants to decrease r(θ) — make the bad action less likely. But once r(θ) drops below 1−ε, the clipped term is again flat. The policy can decrease the action's probability, but only down to 20% less than before.

The elegance of PPO: The clipping mechanism creates an implicit trust region. No second-order optimization, no KL penalty tuning, no line search. Just a simple min() and clip() in the loss function. This is why PPO became the default algorithm — it's almost as stable as TRPO but trivial to implement.

4 — Generalized Advantage Estimation (GAE)

The advantage Â(s,a) measures how much better action a is compared to the average action in state s. Computing good advantage estimates is critical — noisy advantages lead to noisy gradients and poor updates.

TD Residuals

The one-step TD residual at time t is:

δt = rt + γV(st+1) − V(st) — The actual reward plus the estimated future value, minus what we expected. Positive δ means things went better than expected.

The Bias-Variance Tradeoff

We could use just δt as the advantage (low variance, high bias) or a full Monte Carlo return (no bias, high variance). GAE provides a smooth interpolation controlled by λ ∈ [0, 1]:

GAE(γ, λ): ÂGAEt = Σl=0…∞ (γλ)^l · δt+l, with γ = 0.99 and λ = 0.95 as typical defaults.

This is an exponentially-weighted sum of TD residuals. When λ = 0, we get the one-step TD advantage (high bias, low variance). When λ = 1, we get the Monte Carlo advantage (no bias, high variance). The default λ = 0.95 provides a good balance.

[Diagram: GAE as an exponentially-weighted sum of TD residuals — δt weighted by 1, δt+1 by γλ, δt+2 by (γλ)², and so on: Ât = δt + γλ·δt+1 + (γλ)²·δt+2 + …]
λ value | Estimator | Bias | Variance | Behavior
λ = 0 | One-step TD | High | Low | Only uses immediate reward + next value
λ = 0.95 | GAE (default) | Low | Moderate | Good balance for most tasks
λ = 1 | Monte Carlo | Zero | High | Full episode return minus baseline

5 — The PPO Training Loop

PPO alternates between collecting data and optimizing the policy. Unlike off-policy methods (SAC, TD3), PPO uses each batch of data for only a few epochs before discarding it — the data is "on-policy" and becomes stale as the policy changes.

[Diagram: the PPO training loop — 1. collect trajectories (run πold for T steps in N parallel environments); 2. compute advantages (GAE with γ = 0.99, λ = 0.95, then normalize); 3. K epochs of mini-batch updates (shuffle the data, split into M mini-batches, update π with LCLIP + entropy bonus); 4. update the value function (MSE loss against Rtarget); repeat. Typical: N = 8 parallel envs, T = 2048 steps each, K = 3–10 epochs, M = 32–64 mini-batches.]

Step 1: Collect Trajectories

N × T timesteps per iteration

Run the current policy πθold in N parallel environments for T timesteps each. Store (s, a, r, s', logπ(a|s), V(s)) for each transition. Typical: N=8 environments, T=2048 steps = 16,384 transitions per batch.

Step 2: Compute Advantages

GAE with normalization

Using the collected rewards and value estimates, compute GAE advantages Ât for every timestep (computed in reverse order for efficiency). Then normalize advantages to zero mean and unit variance across the batch — this is a critical implementation detail that significantly improves stability.
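A sketch of that reverse-order computation in NumPy (the function signature is an assumption; `dones` marks episode boundaries so the recursion resets there):

```python
import numpy as np

def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """GAE advantages, computed backwards so each step reuses the next one."""
    T = len(rewards)
    adv = np.zeros(T)
    next_adv, next_value = 0.0, last_value
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]              # cut the recursion at episode ends
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        adv[t] = next_adv = delta + gamma * lam * nonterminal * next_adv
        next_value = values[t]
    returns = adv + values                        # value-function regression targets
    adv = (adv - adv.mean()) / (adv.std() + 1e-8) # normalize: zero mean, unit std
    return adv, returns
```

The unnormalized advantages plus the value estimates give the return targets used later for the value loss.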

Step 3: K Epochs of Mini-Batch Updates

K=3–10, mini-batch size 64–4096

Shuffle the collected data and split into M mini-batches. For each mini-batch, compute the clipped surrogate loss and update the policy. Repeat for K epochs on the same data. The clipping ensures that even after K passes, the policy hasn't moved too far from πold.

Why K > 1 works: In vanilla policy gradient, you use each batch exactly once (one gradient step). PPO's clipping lets you reuse data for multiple epochs because it automatically stops the policy from diverging — once r(θ) hits the clip boundary, gradients become zero. This dramatically improves sample efficiency.
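The shuffle-and-split bookkeeping for step 3 can be sketched as (names are illustrative):

```python
import numpy as np

def minibatch_indices(n, num_minibatches, num_epochs, seed=0):
    """Yield index arrays for K epochs of M mini-batches over n transitions."""
    rng = np.random.default_rng(seed)
    for _ in range(num_epochs):
        perm = rng.permutation(n)                 # reshuffle every epoch
        for chunk in np.array_split(perm, num_minibatches):
            yield chunk                           # update policy/value on this chunk

batches = list(minibatch_indices(16384, num_minibatches=32, num_epochs=4))
print(len(batches), len(batches[0]))              # 128 mini-batches of 512 each
```

Each epoch visits every transition exactly once, just in a different random order.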

6 — Implementation Details That Matter

PPO's simplicity is deceptive. The algorithm pseudocode fits on a napkin, but the implementation details can make a 10× difference in performance. Here are the ones that matter most:

Entropy Bonus

c2 = 0.01 typical

Add an entropy bonus to the loss: L = LCLIP − c1LVF + c2 H(π). Entropy regularization prevents the policy from collapsing to a deterministic action too early. Without it, PPO can get stuck in suboptimal deterministic policies, especially in environments with sparse rewards.
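Sketched in NumPy for a discrete policy (c1 = 0.5 is a common but assumed value-loss coefficient; the outer sign flip turns the objective into a loss to minimize):

```python
import numpy as np

def entropy_bonus_loss(l_clip, l_vf, probs, c1=0.5, c2=0.01):
    """Combined PPO loss (to be minimized): -(L_CLIP - c1*L_VF + c2*H)."""
    h = -np.sum(probs * np.log(probs + 1e-8), axis=-1).mean()  # policy entropy
    return -(l_clip - c1 * l_vf + c2 * h)

# A near-deterministic policy earns almost no entropy bonus:
peaked  = np.array([[0.98, 0.01, 0.01]])
uniform = np.full((1, 3), 1 / 3)
print(entropy_bonus_loss(0.0, 0.0, peaked) > entropy_bonus_loss(0.0, 0.0, uniform))
```

The bonus is largest for the uniform distribution, so gradient descent is gently pushed away from premature determinism.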

Advantage Normalization

per mini-batch

Normalize advantages to zero mean and unit standard deviation: Â ← (Â − μ) / (σ + 10⁻⁸). This is done per mini-batch. Without normalization, the magnitude of advantages varies wildly across training, making the effective learning rate inconsistent.

Value Function Clipping

optional, debated

Some implementations clip the value function loss similarly to the policy loss: LVF = max((V − Vtarget)², (Vclipped − Vtarget)²). This is controversial — the original paper includes it, but ablation studies suggest it sometimes hurts. Most modern implementations skip this.
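For reference, the clipped value loss can be sketched as (assuming `v_old` holds the value predictions stored at rollout time):

```python
import numpy as np

def clipped_value_loss(v_new, v_old, v_target, eps=0.2):
    """Pessimistic (max) of clipped and unclipped squared value errors."""
    v_clipped = v_old + np.clip(v_new - v_old, -eps, eps)  # limit the value update
    return np.maximum((v_new - v_target) ** 2,
                      (v_clipped - v_target) ** 2).mean()

# The clipped branch dominates when the value net tries to jump by more than eps:
print(clipped_value_loss(np.array([1.0]), np.array([0.0]), np.array([1.0])))  # ~0.64
```

Skipping this and using a plain MSE against the return targets is the simpler, often equally good, alternative.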

Learning Rate Annealing

linear decay to 0

Linearly decay the learning rate from its initial value (typically 3×10⁻⁴) to zero over the course of training. Combined with the clip range, this provides a smooth reduction in update magnitude as training progresses.
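A sketch of the schedule (assuming `step` counts completed update iterations):

```python
def linear_lr(step, total_steps, lr_init=3e-4):
    """Linearly anneal the learning rate from lr_init down to zero."""
    frac = 1.0 - step / total_steps   # fraction of training remaining
    return lr_init * frac
```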

Gradient Clipping

max_grad_norm = 0.5

Clip the global gradient norm to 0.5. While PPO's objective clipping prevents catastrophic policy updates, gradient clipping adds an additional safety net against rare pathological mini-batches that produce extremely large gradients.
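Global-norm clipping over a list of per-parameter gradient arrays, sketched in NumPy:

```python
import numpy as np

def clip_grad_norm(grads, max_norm=0.5):
    """Rescale all gradients so their combined L2 norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-8))   # no-op when already small enough
    return [g * scale for g in grads], total
```

Because every gradient is scaled by the same factor, the update direction is preserved; only its magnitude shrinks.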

Orthogonal Initialization

gain varies by layer

Initialize network weights with orthogonal matrices. Use gain=√2 for hidden layers (works well with ReLU), gain=0.01 for the policy output layer (initially near-uniform action distribution), and gain=1.0 for the value output layer.
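A QR-based sketch of orthogonal initialization in NumPy (deep-learning frameworks ship their own versions; this mirrors the common recipe):

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, seed=0):
    """Orthogonal weight init: QR-decompose a Gaussian matrix, scale by gain."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=shape)
    q, r = np.linalg.qr(a if shape[0] >= shape[1] else a.T)
    q *= np.sign(np.diag(r))            # fix signs to make the decomposition unique
    q = q if shape[0] >= shape[1] else q.T
    return gain * q[:shape[0], :shape[1]]
```

A quick sanity check: with gain √2, the rows of a square layer satisfy W·Wᵀ = 2I.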

Hyperparameter | Atari (Discrete) | MuJoCo (Continuous)
Parallel environments N | 8 | 1–64
Rollout length T | 128 | 2048
Mini-batch size | 256 | 64
Epochs K | 4 | 10
Clip ε | 0.1 | 0.2
Learning rate | 2.5×10⁻⁴ | 3×10⁻⁴
Discount γ | 0.99 | 0.99
GAE λ | 0.95 | 0.95
Entropy coeff c2 | 0.01 | 0.0

7 — PPO in Practice

PPO has become the default RL algorithm across an extraordinary range of applications. Its combination of stability, simplicity, and competitive performance makes it the first algorithm most practitioners try.

[Diagram: the PPO application landscape since 2017 — game AI, robotics, RLHF / LLMs, autonomous systems, finance, and chip design.]

RLHF: PPO for Language Models

PPO's most high-profile application is Reinforcement Learning from Human Feedback (RLHF), used to align large language models like ChatGPT. The setup: the LLM is the policy, generating text is the action sequence, and a trained reward model scores the output. PPO fine-tunes the LLM to maximize the reward model's score while staying close to the original model (via a KL penalty). The clip mechanism is perfect here — it prevents the LLM from diverging too far into reward-hacked outputs.

PPO for RLHF: J(θ) = E[rreward(x, y)] − β·KL(πθ ‖ πref) — maximize the reward model's score while staying close to the reference (pre-RLHF) model. The KL term prevents "reward hacking", where the model finds nonsensical outputs that fool the reward model.

Algorithm Comparison

Property | REINFORCE | TRPO | PPO
Year | 1992 | 2015 | 2017
Update rule | Vanilla gradient | KL-constrained | Clipped surrogate
Trust region | None | Hard KL constraint | Implicit (clipping)
Optimization | First-order | Second-order (conjugate gradient) | First-order (Adam)
Implementation complexity | Low | High | Low
Data reuse per batch | 1 epoch | 1 step | K epochs (3–10)
Stability | Low | High | High
Sample efficiency | Low | Low–Medium | Medium
Wall-clock speed | Fast per update | Slow (Hessian-vector products) | Fast per update
Why PPO won: TRPO proved that trust regions work. PPO showed that you don't need the mathematical machinery — a simple clip achieves the same effect. Combined with parallel environments, GAE, and mini-batch updates, PPO gets TRPO-level stability at REINFORCE-level implementation complexity. That combination is why every major RL library implements PPO first.

8 — References

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust Region Policy Optimization. ICML 2015.

Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR 2016.

Williams, R.J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3), 229–256.

Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022.

Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., & Madry, A. (2020). Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO. ICLR 2020.

Huang, S., et al. (2022). The 37 Implementation Details of Proximal Policy Optimization. ICLR Blog Track 2022.