SAC: Soft Actor-Critic

Maximum Entropy Reinforcement Learning for Continuous Control (Haarnoja et al., 2018)

1 — Why Stochastic? Why Entropy?

Most RL algorithms aim to learn a single best action for each state — a deterministic policy. But deterministic policies have fundamental weaknesses:

Problems with Deterministic Policies

Brittle exploration: A deterministic policy explores only via added noise (e.g., Gaussian perturbation in DDPG/TD3). This noise is task-agnostic and doesn't adapt — it doesn't know where to explore more or less.

Mode collapse: When multiple action sequences lead to equally good outcomes, a deterministic policy collapses to one arbitrary mode. If that mode becomes suboptimal due to environment changes, recovery is difficult.

Fragility: Small perturbations in the environment (friction changes, sensor noise) can disproportionately affect a policy that commits to one exact action per state.

SAC's solution is elegant: instead of finding the single best action, learn a distribution over good actions. The policy should be as random as possible while still achieving high reward. This idea comes from the maximum entropy framework.

Standard RL vs. maximum entropy RL, side by side:
- Standard RL: J(π) = Σt E[r(st, at)]. Maximize reward only; a single action per state; fragile, poor exploration.
- Maximum entropy RL (SAC): J(π) = Σt E[r(st, at) + α H(π(·|st))]. Maximize reward AND entropy; a distribution over good actions; robust, explores automatically.
Maximum entropy principle: Among all policies that achieve a given expected reward, prefer the one with maximum entropy (maximum randomness). Entropy H(π) = −E[log π(a|s)] measures how spread out the action distribution is. High entropy = exploring many actions; low entropy = committing to few. SAC balances reward and entropy automatically.
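To make the entropy H(π) = −E[log π(a|s)] concrete, the sketch below computes it for a 1-D Gaussian policy two ways: via the closed form for a Gaussian, and via a Monte Carlo average of −log π over sampled actions. The mean and standard deviation are illustrative values, not from a trained policy.

```python
import math
import random

# Entropy of a 1-D Gaussian policy pi(.|s) = N(mu, sigma^2).
mu, sigma = 0.0, 0.5

def log_prob(a):
    return -0.5 * ((a - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

# Closed form for a Gaussian: H = 0.5 * log(2 * pi * e * sigma^2)
analytic = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

# Monte Carlo estimate of H = E[-log pi(a|s)]
random.seed(0)
samples = [random.gauss(mu, sigma) for _ in range(100_000)]
monte_carlo = -sum(log_prob(a) for a in samples) / len(samples)
```

A small σ shrinks both estimates (the policy commits to few actions); a large σ grows them (the policy spreads out), which is exactly the quantity SAC's bonus rewards.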

2 — The Maximum Entropy Framework

SAC optimizes a modified RL objective that includes an entropy bonus at every timestep:

Soft RL Objective

α controls entropy weight

J(π) = Σt E[ r(st, at) + α H(π(·|st)) ]

= Σt E[ r(st, at) − α log π(at|st) ]

The temperature parameter α > 0 controls the tradeoff: large α prioritizes entropy (exploration), small α prioritizes reward (exploitation). In the limit α → 0, the entropy term vanishes and the objective reduces to the standard expected-return objective.
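The soft objective can be evaluated by hand on a short trajectory. In this sketch the per-step rewards and entropies are made-up numbers; the point is only how α mixes the two sums.

```python
# Soft return J(pi) = sum_t E[r_t + alpha * H(pi(.|s_t))] on one short trajectory.
alpha = 0.2
rewards = [1.0, 0.5, 2.0]
entropies = [1.2, 0.9, 0.3]          # H(pi(.|s_t)) at each visited state (illustrative)

soft_return = sum(r + alpha * h for r, h in zip(rewards, entropies))
plain_return = sum(rewards)          # the alpha -> 0 limit: standard RL objective
```

Raising α makes the entropy column dominate; lowering it recovers the plain return.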

This modified objective gives rise to soft versions of standard RL quantities:

Soft Value Functions

Soft Q-function: Qsoft(s, a) = r(s,a) + γ E[Qsoft(s', a') − α log π(a'|s')]

The soft Q-value includes future entropy bonuses — an action is "valuable" if it leads to states where the agent can both get high reward and maintain randomness.

Soft V-function: Vsoft(s) = Ea~π[Qsoft(s,a) − α log π(a|s)]

Note: SAC v2 (2019) eliminates the explicit V-function network and computes the soft value implicitly from the twin Q-functions and the policy. SAC v1 trained a separate V network with its own target copy; v2 drops both and instead keeps target copies of the two critics.
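The implicit value computation can be sketched numerically: estimate Vsoft(s) by averaging Qsoft(s, a) − α log π(a|s) over actions sampled from the policy. The quadratic Q and the Gaussian policy below are toy stand-ins, not learned networks.

```python
import math
import random

# Monte Carlo estimate of V_soft(s) = E_{a~pi}[Q_soft(s, a) - alpha * log pi(a|s)].
alpha, mu, sigma = 0.2, 0.3, 0.4

def q_soft(a):
    return 1.0 - (a - 0.5) ** 2      # toy soft Q-function (assumption, not learned)

def log_pi(a):
    return -0.5 * ((a - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

random.seed(0)
actions = [random.gauss(mu, sigma) for _ in range(50_000)]
v_soft = sum(q_soft(a) - alpha * log_pi(a) for a in actions) / len(actions)
```

This is exactly the expectation in the soft V equation above, so no separate V network is required.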

Entropy benefits in practice: (1) Exploration: the entropy term drives the policy to try different actions, discovering potentially better strategies. (2) Multi-modality: if two equally good paths exist, SAC's policy can represent both. (3) Robustness: a stochastic policy is more resilient to perturbations. (4) Pre-training transfer: high-entropy policies preserve optionality, making fine-tuning to new tasks easier.

3 — Architecture Overview

SAC architecture (v2, 2019), in summary:
- Stochastic actor πθ: maps state s to μ(s) and logσ(s); samples a = tanh(μ + σ ⊙ ε), ε ~ N(0, I).
- Twin critics Q1(s, a) and Q2(s, a), evaluated on a ~ π(·|s).
- Temperature α, auto-tuned, which scales the entropy term.
- Target networks Q'1, Q'2, soft-updated with τ = 0.005. No target actor is needed.

SAC v2 uses 5 networks: a stochastic actor, two critics, and two target critics. Notably, there is no target actor — the current actor is used to sample next-state actions for target computation. This works because SAC's stochastic policy produces diverse actions naturally (no need for target policy smoothing like TD3).

SAC vs TD3 networks: TD3 has 6 networks (actor + 2 critics + 3 targets). SAC has 5 (actor + 2 critics + 2 target critics). SAC drops the target actor because its stochastic policy already provides the smoothing that TD3 achieves through explicit target noise.

4 — The Stochastic Actor & Reparameterization

SAC's actor outputs a Gaussian distribution, not a single action. The key challenge: how do you backpropagate through random sampling?

Squashed Gaussian Policy

state → (μ, logσ) → tanh(sample)

The actor network outputs two vectors: a mean μ(s) and log standard deviation logσ(s). An action is sampled from the resulting Gaussian, then squashed through tanh to bound it to [-1, 1]:

u = μ(s) + σ(s) ⊙ ε, where ε ~ N(0, I)

a = tanh(u)

The log-probability must account for the tanh squashing via the change-of-variables formula:

log π(a|s) = log N(u; μ, σ) − Σ log(1 − tanh²(ui))

The reparameterization pipeline: the network maps s to μ(s) and logσ(s); external noise ε ~ N(0, I) is combined as u = μ + σ ⊙ ε; tanh squashes u to a ∈ [-1, 1]. Gradients flow through the deterministic path from (μ, σ) to a, while ε itself needs no gradient: the randomness sits outside the computation graph.
The reparameterization trick: Instead of sampling a ~ π(·|s) directly (which blocks gradient flow), we sample ε from a fixed distribution and compute a = f(ε, s; θ) deterministically. Now ∇θa is well-defined, and we can backpropagate through the sampling operation. This is the same trick used in VAEs.

Actor Network Architecture

state → (μ, logσ)
State [B, s_dim] → FC+ReLU [B, 256] → FC+ReLU [B, 256] → two heads: FC (mean) → μ(s) [B, a_dim], and FC (log_std) → logσ(s) [B, a_dim], clamped to [-20, 2].

The network outputs both mean and log-standard-deviation. The log_std is clamped to [-20, 2] for numerical stability (prevents the variance from becoming zero or exploding). The actual standard deviation is σ = exp(log_std).
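The pieces above (clamp, reparameterized sample, tanh squash, change-of-variables correction) fit together as follows. This is a minimal 1-D sketch; a real actor would produce mu and log_std from a neural network, and the small constant added inside the log is a common numerical-stability convention.

```python
import math
import random

LOG_STD_MIN, LOG_STD_MAX = -20.0, 2.0

def sample_action(mu, log_std, eps):
    log_std = min(max(log_std, LOG_STD_MIN), LOG_STD_MAX)    # clamp for stability
    sigma = math.exp(log_std)
    u = mu + sigma * eps                                     # reparameterization trick
    a = math.tanh(u)                                         # bound action to [-1, 1]
    # log N(u; mu, sigma) with u = mu + sigma*eps reduces to the expression below
    log_prob_u = -0.5 * eps ** 2 - log_std - 0.5 * math.log(2 * math.pi)
    log_prob_a = log_prob_u - math.log(1.0 - a ** 2 + 1e-6)  # tanh correction
    return a, log_prob_a

random.seed(0)
eps = random.gauss(0.0, 1.0)
a, log_prob = sample_action(mu=0.1, log_std=-1.0, eps=eps)
```

Because eps is drawn outside the function, the mapping from (mu, log_std) to a is deterministic, which is what lets an autodiff framework backpropagate through it.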

5 — Soft Bellman Equation & Twin Critics

SAC's critics learn the soft Q-function, which accounts for future entropy bonuses. The target includes a log-probability penalty from the current policy:

Soft TD Target

includes entropy from next-state actions

y = r + γ ( min(Q'1(s', a'), Q'2(s', a')) − α log π(a'|s') )

where a' ~ π(·|s') is sampled from the current actor (not a target actor)

The −α log π term is the entropy bonus: a low-probability action drawn from a spread-out distribution has a very negative log π, so it contributes a larger bonus. The critic therefore assigns higher soft Q-values to states where the policy can remain uncertain while still collecting reward.

Soft Bellman backup, as a data flow: starting from (s, a), sample a' ~ π(·|s'), take the minimum of the two target critics Q'1(s', a') and Q'2(s', a'), subtract α log π(a'|s'), scale by γ, and add r.

Both critics Q1 and Q2 are trained independently on the same target (using the twin minimum from TD3). The critic loss for each is simply MSE:

Critic loss: L(Qi) = E[(Qi(s, a) − y)²] for i ∈ {1, 2}, where transitions (s, a, r, s') come from the replay buffer and a' is freshly sampled from the current policy.
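The target computation for a single transition can be written out directly. The target-critic outputs and next-action log-probability below are stand-in numbers, not network outputs; the done flag zeroes the bootstrap term at episode ends.

```python
# Soft TD target: y = r + gamma * (min(Q'1, Q'2) - alpha * log pi(a'|s')).
def soft_td_target(r, done, gamma, alpha, q1_targ, q2_targ, log_pi_next):
    bootstrap = min(q1_targ, q2_targ) - alpha * log_pi_next
    return r + gamma * (0.0 if done else bootstrap)

y = soft_td_target(r=1.0, done=False, gamma=0.99, alpha=0.2,
                   q1_targ=5.0, q2_targ=4.8, log_pi_next=-1.5)
# per-sample critic loss would then be (Q_i(s, a) - y) ** 2 for each critic
```

Note how the negative log-probability (−1.5) raises the target: the entropy bonus is baked into y before either critic is regressed toward it.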

6 — Automatic Temperature Tuning

The temperature α is the most critical hyperparameter in SAC — it determines the reward-entropy tradeoff. The 2019 follow-up paper showed how to tune it automatically, removing the burden from the practitioner entirely.

Constrained Optimization for α

target entropy H̄ = −dim(A)

Instead of fixing α, solve a constrained optimization problem: maximize expected return subject to the policy's entropy being at least H̄ (a target entropy). Using Lagrangian duality, this becomes:

α* = arg minα E[−α (log π(a|s) + H̄)]

In practice: if the policy's entropy falls below the target, α increases (more entropy weight); if it rises above, α decreases (more reward focus). The target entropy is heuristically set to H̄ = −dim(A), the negative of the number of action dimensions.
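One gradient-descent step on this loss shows the mechanism. Since dL/dα = −E[log π(a|s) + H̄] = H(π) − H̄, the gradient is negative exactly when entropy is below target, so α rises. The batch of log-probabilities is made up for illustration.

```python
# One gradient step on L(alpha) = E[-alpha * (log pi(a|s) + H_bar)].
h_bar = -6.0                         # target entropy H_bar = -dim(A) for 6-D actions
alpha, lr = 0.2, 3e-4
batch_log_pi = [7.0, 6.5, 7.5]       # very peaked policy: entropy ~ -7, below H_bar

grad = -sum(lp + h_bar for lp in batch_log_pi) / len(batch_log_pi)
alpha -= lr * grad                   # entropy below target -> grad < 0 -> alpha grows
```

Implementations often optimize log α instead of α to keep the temperature positive; the sketch above omits that detail.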

Automatic temperature regulation in practice: compare the current entropy H(π) = E[−log π(a|s)] against the target H̄ = −dim(A) (e.g., −6 for a 6-DoF arm). If H(π) < H̄, α increases (more exploration); if H(π) > H̄, α decreases (more exploitation).
Why this matters: Before auto-tuning, α required careful per-task tuning. Too high and the agent never converges (just explores). Too low and you lose entropy benefits (fragile, poor exploration). Auto-tuning removed the single most sensitive hyperparameter from the algorithm, making SAC dramatically more robust out of the box.
SAC Version | Temperature | Networks | Key Difference
SAC v1 (2018) | Fixed α (hyperparameter) | 5 (actor, V, Q1, Q2, V') | Separate value network V(s)
SAC v2 (2019) | Auto-tuned α (learned) | 5 (actor, Q1, Q2, Q'1, Q'2) | No V network, auto-temperature

7 — Training Algorithm

SAC training loop (repeated every environment step; no delayed updates):
1. Sample a mini-batch of 256 transitions (s, a, r, s', done) from the replay buffer.
2. Compute the soft target: sample a' ~ π(·|s'), compute log π(a'|s'), then y = r + γ(min(Q'1, Q'2) − α log π(a'|s')).
3. Update both critics: L(Qi) = E[(Qi(s, a) − y)²].
4. Update the actor: sample anew ~ π(·|s) and maximize E[min(Q1, Q2)(s, anew) − α log π(anew|s)], with the gradient flowing through the reparameterized sample.
5. Update the temperature: L(α) = E[−α(log π(a|s) + H̄)].
6. Soft-update the target critics: Q'i ← τQi + (1 − τ)Q'i.
No delayed updates: Unlike TD3, SAC updates the actor every step. The stochastic policy provides natural regularization (entropy prevents the actor from exploiting sharp Q-peaks), making delayed updates unnecessary. This simplifies the algorithm.
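The soft-update step of the loop is simple enough to spell out. Parameters are flat lists of floats here for illustration; real code iterates over network parameter tensors in the same way.

```python
# Polyak soft update of the target critics: Q'_i <- tau * Q_i + (1 - tau) * Q'_i.
tau = 0.005

def soft_update(target_params, online_params, tau):
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]

new_target = soft_update(target_params=[0.0, 0.0], online_params=[1.0, -1.0], tau=tau)
# each target parameter moves only 0.5% of the way toward the online value per step
```

With τ = 0.005 the targets trail the online critics slowly, which keeps the TD targets stable between updates.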

Key Hyperparameters

Hyperparameter | Default Value
Hidden layers | 2 × 256 (actor and critic)
Learning rate | 3 × 10⁻⁴ (Adam, all networks + α)
Discount γ | 0.99
Soft update τ | 0.005
Target entropy H̄ | −dim(A)
Initial α | 0.2 (then auto-tuned)
Replay buffer size | 10⁶
Batch size | 256
Warmup steps | 5,000–10,000 (random actions)
Updates per step | 1 (gradient step per environment step)

8 — SAC vs TD3 vs PPO

These three algorithms represent the dominant approaches to different RL problem settings. Choosing between them depends on your specific requirements.

Property | PPO | TD3 | SAC
On/off-policy | On-policy | Off-policy | Off-policy
Policy type | Stochastic | Deterministic | Stochastic
Action space | Discrete & continuous | Continuous only | Continuous (discrete variants exist)
Exploration | Policy entropy + bonus | Gaussian noise (manual) | Entropy-driven (automatic)
Sample efficiency | Low (on-policy) | High | High
Wall-clock efficiency | High (parallel envs) | Medium | Medium
Hyperparameter sensitivity | Medium | Low–Medium | Low (auto-tuned α)
Multi-modal behaviors | Yes (stochastic) | No (deterministic) | Yes (entropy-encouraged)
Replay buffer | No (uses rollout buffer) | Yes | Yes
Key innovation | Clipped surrogate | Twin critics + delay | Max entropy + auto-α
When to use SAC: SAC is the go-to off-policy algorithm for continuous control. It's more robust than TD3 (automatic exploration tuning, no noise schedule needed), more sample-efficient than PPO (off-policy replay), and handles multi-modal tasks better than either. Use it when: (1) your action space is continuous, (2) you want sample efficiency, (3) your task may have multiple valid strategies, or (4) you want minimal hyperparameter tuning.
When NOT to use SAC: (1) Discrete action spaces (use PPO). (2) You have cheap simulation and need maximum throughput (PPO with parallel envs may be faster in wall-clock time). (3) Your task provably has a single optimal deterministic policy and you want maximum performance (TD3 may slightly outperform on some benchmarks).
The evolution: DDPG (2016) showed continuous control was possible → TD3 (2018) fixed DDPG's instability with twin critics, delay, and smoothing → SAC (2018) replaced handcrafted fixes with a principled entropy framework. SAC adopted TD3's twin critics but replaced delay and smoothing with entropy regularization, which provides similar benefits more elegantly.

9 — References

Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML 2018.

Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., & Levine, S. (2019). Soft Actor-Critic Algorithms and Applications. arXiv:1812.05905.

Ziebart, B.D. (2010). Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Carnegie Mellon University.

Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. ICML 2018.

Kingma, D.P. & Welling, M. (2014). Auto-Encoding Variational Bayes. ICLR 2014. (Reparameterization trick origin.)

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.