SAC: Soft Actor-Critic

Maximum Entropy Reinforcement Learning for Continuous Control (Haarnoja et al., 2018)

1 — Why Stochastic? Why Entropy?

Most RL algorithms aim to learn a single best action for each state — a deterministic policy. But deterministic policies have fundamental weaknesses:

Problems with Deterministic Policies

Brittle exploration: A deterministic policy explores only via added noise (e.g., Gaussian perturbation in DDPG/TD3). This noise is task-agnostic and doesn't adapt — it doesn't know where to explore more or less.

Mode collapse: When multiple action sequences lead to equally good outcomes, a deterministic policy collapses to one arbitrary mode. If that mode becomes suboptimal due to environment changes, recovery is difficult.

Fragility: Small perturbations in the environment (friction changes, sensor noise) can disproportionately affect a policy that commits to one exact action per state.

SAC's solution is elegant: instead of finding the single best action, learn a distribution over good actions. The policy should be as random as possible while still achieving high reward. This idea comes from the maximum entropy framework.

Standard RL vs. maximum entropy RL, side by side:
- Standard RL: J(π) = Σt E[r(st, at)]. Maximize reward only; a single action per state; fragile, poor exploration.
- Maximum entropy RL (SAC): J(π) = Σt E[r(st, at) + α H(π(·|st))]. Maximize reward AND entropy; a distribution over good actions; robust, explores automatically.
Maximum entropy principle: Among all policies that achieve a given expected reward, prefer the one with maximum entropy (maximum randomness). Entropy H(π) = −E[log π(a|s)] measures how spread out the action distribution is. High entropy = exploring many actions; low entropy = committing to few. SAC balances reward and entropy automatically.
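To make the entropy H(π) = −E[log π(a|s)] concrete, the sketch below computes it for a 1-D Gaussian policy two ways: via the closed form for a Gaussian, and via a Monte Carlo average of −log π over sampled actions. The mean and standard deviation are illustrative values, not from a trained policy.

```python
import math
import random

# Entropy of a 1-D Gaussian policy pi(.|s) = N(mu, sigma^2).
mu, sigma = 0.0, 0.5

def log_prob(a):
    return -0.5 * ((a - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

# Closed form for a Gaussian: H = 0.5 * log(2 * pi * e * sigma^2)
analytic = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

# Monte Carlo estimate of H = E[-log pi(a|s)]
random.seed(0)
samples = [random.gauss(mu, sigma) for _ in range(100_000)]
monte_carlo = -sum(log_prob(a) for a in samples) / len(samples)
```

A small σ shrinks both estimates (the policy commits to few actions); a large σ grows them (the policy spreads out), which is exactly the quantity SAC's bonus rewards.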

2 — The Maximum Entropy Framework

SAC optimizes a modified RL objective that includes an entropy bonus at every timestep:

Soft RL Objective

α controls entropy weight

J(π) = Σt E[ r(st, at) + α H(π(·|st)) ]

= Σt E[ r(st, at) − α log π(at|st) ]

The temperature parameter α > 0 controls the tradeoff: large α prioritizes entropy (exploration), small α prioritizes reward (exploitation). In the limit α → 0, the entropy term vanishes and the objective reduces to the standard expected-return objective.
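The soft objective can be evaluated by hand on a short trajectory. In this sketch the per-step rewards and entropies are made-up numbers; the point is only how α mixes the two sums.

```python
# Soft return J(pi) = sum_t E[r_t + alpha * H(pi(.|s_t))] on one short trajectory.
alpha = 0.2
rewards = [1.0, 0.5, 2.0]
entropies = [1.2, 0.9, 0.3]          # H(pi(.|s_t)) at each visited state (illustrative)

soft_return = sum(r + alpha * h for r, h in zip(rewards, entropies))
plain_return = sum(rewards)          # the alpha -> 0 limit: standard RL objective
```

Raising α makes the entropy column dominate; lowering it recovers the plain return.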

This modified objective gives rise to soft versions of standard RL quantities:

Soft Value Functions

Soft Q-function: Qsoft(s, a) = r(s,a) + γ E[Qsoft(s', a') − α log π(a'|s')]

The soft Q-value includes future entropy bonuses — an action is "valuable" if it leads to states where the agent can both get high reward and maintain randomness.

Soft V-function: Vsoft(s) = Ea~π[Qsoft(s,a) − α log π(a|s)]

Note: SAC v2 (2019) eliminates the explicit V-function network and computes the soft value implicitly from the twin Q-functions and the policy. SAC v1 trained a separate V network with its own target copy; v2 drops both and instead keeps target copies of the two critics.
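The implicit value computation can be sketched numerically: estimate Vsoft(s) by averaging Qsoft(s, a) − α log π(a|s) over actions sampled from the policy. The quadratic Q and the Gaussian policy below are toy stand-ins, not learned networks.

```python
import math
import random

# Monte Carlo estimate of V_soft(s) = E_{a~pi}[Q_soft(s, a) - alpha * log pi(a|s)].
alpha, mu, sigma = 0.2, 0.3, 0.4

def q_soft(a):
    return 1.0 - (a - 0.5) ** 2      # toy soft Q-function (assumption, not learned)

def log_pi(a):
    return -0.5 * ((a - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

random.seed(0)
actions = [random.gauss(mu, sigma) for _ in range(50_000)]
v_soft = sum(q_soft(a) - alpha * log_pi(a) for a in actions) / len(actions)
```

This is exactly the expectation in the soft V equation above, so no separate V network is required.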

Entropy benefits in practice: (1) Exploration: the entropy term drives the policy to try different actions, discovering potentially better strategies. (2) Multi-modality: if two equally good paths exist, SAC's policy can represent both. (3) Robustness: a stochastic policy is more resilient to perturbations. (4) Pre-training transfer: high-entropy policies preserve optionality, making fine-tuning to new tasks easier.

3 — Architecture Overview

SAC architecture (v2, 2019), in summary:
- Stochastic actor πθ: maps state s to μ(s) and logσ(s); samples a = tanh(μ + σ ⊙ ε), ε ~ N(0, I).
- Twin critics Q1(s, a) and Q2(s, a), evaluated on a ~ π(·|s).
- Temperature α, auto-tuned, which scales the entropy term.
- Target networks Q'1, Q'2, soft-updated with τ = 0.005. No target actor is needed.

SAC v2 uses 5 networks: a stochastic actor, two critics, and two target critics. Notably, there is no target actor — the current actor is used to sample next-state actions for target computation. This works because SAC's stochastic policy produces diverse actions naturally (no need for target policy smoothing like TD3).

SAC vs TD3 networks: TD3 has 6 networks (actor + 2 critics + 3 targets). SAC has 5 (actor + 2 critics + 2 target critics). SAC drops the target actor because its stochastic policy already provides the smoothing that TD3 achieves through explicit target noise.

4 — The Stochastic Actor & Reparameterization

SAC's actor outputs a Gaussian distribution, not a single action. The key challenge: how do you backpropagate through random sampling?

Squashed Gaussian Policy

state → (μ, logσ) → tanh(sample)

The actor network outputs two vectors: a mean μ(s) and log standard deviation logσ(s). An action is sampled from the resulting Gaussian, then squashed through tanh to bound it to [-1, 1]:

u = μ(s) + σ(s) ⊙ ε, where ε ~ N(0, I)

a = tanh(u)

The log-probability must account for the tanh squashing via the change-of-variables formula:

log π(a|s) = log N(u; μ, σ) − Σ log(1 − tanh²(ui))

The reparameterization pipeline: the network maps s to μ(s) and logσ(s); external noise ε ~ N(0, I) is combined as u = μ + σ ⊙ ε; tanh squashes u to a ∈ [-1, 1]. Gradients flow through the deterministic path from (μ, σ) to a, while ε itself needs no gradient: the randomness sits outside the computation graph.
The reparameterization trick: Instead of sampling a ~ π(·|s) directly (which blocks gradient flow), we sample ε from a fixed distribution and compute a = f(ε, s; θ) deterministically. Now ∇θa is well-defined, and we can backpropagate through the sampling operation. This is the same trick used in VAEs.

Actor Network Architecture

state → (μ, logσ)
State [B, s_dim] → FC+ReLU [B, 256] → FC+ReLU [B, 256] → two heads: FC (mean) → μ(s) [B, a_dim], and FC (log_std) → logσ(s) [B, a_dim], clamped to [-20, 2].

The network outputs both mean and log-standard-deviation. The log_std is clamped to [-20, 2] for numerical stability (prevents the variance from becoming zero or exploding). The actual standard deviation is σ = exp(log_std).
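The pieces above (clamp, reparameterized sample, tanh squash, change-of-variables correction) fit together as follows. This is a minimal 1-D sketch; a real actor would produce mu and log_std from a neural network, and the small constant added inside the log is a common numerical-stability convention.

```python
import math
import random

LOG_STD_MIN, LOG_STD_MAX = -20.0, 2.0

def sample_action(mu, log_std, eps):
    log_std = min(max(log_std, LOG_STD_MIN), LOG_STD_MAX)    # clamp for stability
    sigma = math.exp(log_std)
    u = mu + sigma * eps                                     # reparameterization trick
    a = math.tanh(u)                                         # bound action to [-1, 1]
    # log N(u; mu, sigma) with u = mu + sigma*eps reduces to the expression below
    log_prob_u = -0.5 * eps ** 2 - log_std - 0.5 * math.log(2 * math.pi)
    log_prob_a = log_prob_u - math.log(1.0 - a ** 2 + 1e-6)  # tanh correction
    return a, log_prob_a

random.seed(0)
eps = random.gauss(0.0, 1.0)
a, log_prob = sample_action(mu=0.1, log_std=-1.0, eps=eps)
```

Because eps is drawn outside the function, the mapping from (mu, log_std) to a is deterministic, which is what lets an autodiff framework backpropagate through it.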

5 — Soft Bellman Equation & Twin Critics

SAC's critics learn the soft Q-function, which accounts for future entropy bonuses. The target includes a log-probability penalty from the current policy:

Soft TD Target

includes entropy from next-state actions

y = r + γ ( min(Q'1(s', a'), Q'2(s', a')) − α log π(a'|s') )

where a' ~ π(·|s') is sampled from the current actor (not a target actor)

The −α log π term is the entropy bonus: a low-probability action drawn from a spread-out distribution has a very negative log π, so it contributes a larger bonus. The critic therefore assigns higher soft Q-values to states where the policy can remain uncertain while still collecting reward.

Soft Bellman backup, as a data flow: starting from (s, a), sample a' ~ π(·|s'), take the minimum of the two target critics Q'1(s', a') and Q'2(s', a'), subtract α log π(a'|s'), scale by γ, and add r.

Both critics Q1 and Q2 are trained independently on the same target (using the twin minimum from TD3). The critic loss for each is simply MSE:

Critic loss: L(Qi) = E[(Qi(s, a) − y)²] for i ∈ {1, 2}, where transitions (s, a, r, s') come from the replay buffer and a' is freshly sampled from the current policy.
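The target computation for a single transition can be written out directly. The target-critic outputs and next-action log-probability below are stand-in numbers, not network outputs; the done flag zeroes the bootstrap term at episode ends.

```python
# Soft TD target: y = r + gamma * (min(Q'1, Q'2) - alpha * log pi(a'|s')).
def soft_td_target(r, done, gamma, alpha, q1_targ, q2_targ, log_pi_next):
    bootstrap = min(q1_targ, q2_targ) - alpha * log_pi_next
    return r + gamma * (0.0 if done else bootstrap)

y = soft_td_target(r=1.0, done=False, gamma=0.99, alpha=0.2,
                   q1_targ=5.0, q2_targ=4.8, log_pi_next=-1.5)
# per-sample critic loss would then be (Q_i(s, a) - y) ** 2 for each critic
```

Note how the negative log-probability (−1.5) raises the target: the entropy bonus is baked into y before either critic is regressed toward it.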

6 — Automatic Temperature Tuning

The temperature α is the most critical hyperparameter in SAC — it determines the reward-entropy tradeoff. The 2019 follow-up paper showed how to tune it automatically, removing the burden from the practitioner entirely.

Constrained Optimization for α

target entropy H̄ = −dim(A)

Instead of fixing α, solve a constrained optimization problem: maximize expected return subject to the policy's entropy being at least H̄ (a target entropy). Using Lagrangian duality, this becomes:

α* = arg minα E[−α (log π(a|s) + H̄)]

In practice: if the policy's entropy falls below the target, α increases (more entropy weight); if it rises above, α decreases (more reward focus). The target entropy is heuristically set to H̄ = −dim(A), the negative of the number of action dimensions.
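One gradient-descent step on this loss shows the mechanism. Since dL/dα = −E[log π(a|s) + H̄] = H(π) − H̄, the gradient is negative exactly when entropy is below target, so α rises. The batch of log-probabilities is made up for illustration.

```python
# One gradient step on L(alpha) = E[-alpha * (log pi(a|s) + H_bar)].
h_bar = -6.0                         # target entropy H_bar = -dim(A) for 6-D actions
alpha, lr = 0.2, 3e-4
batch_log_pi = [7.0, 6.5, 7.5]       # very peaked policy: entropy ~ -7, below H_bar

grad = -sum(lp + h_bar for lp in batch_log_pi) / len(batch_log_pi)
alpha -= lr * grad                   # entropy below target -> grad < 0 -> alpha grows
```

Implementations often optimize log α instead of α to keep the temperature positive; the sketch above omits that detail.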

Automatic temperature regulation in practice: compare the current entropy H(π) = E[−log π(a|s)] against the target H̄ = −dim(A) (e.g., −6 for a 6-DoF arm). If H(π) < H̄, α increases (more exploration); if H(π) > H̄, α decreases (more exploitation).
Why this matters: Before auto-tuning, α required careful per-task tuning. Too high and the agent never converges (just explores). Too low and you lose entropy benefits (fragile, poor exploration). Auto-tuning removed the single most sensitive hyperparameter from the algorithm, making SAC dramatically more robust out of the box.
SAC Version | Temperature | Networks | Key Difference
SAC v1 (2018) | Fixed α (hyperparameter) | 5 (actor, V, Q1, Q2, V') | Separate value network V(s)
SAC v2 (2019) | Auto-tuned α (learned) | 5 (actor, Q1, Q2, Q'1, Q'2) | No V network, auto-temperature

7 — Training Algorithm

SAC training loop (repeated every environment step; no delayed updates):
1. Sample a mini-batch of 256 transitions (s, a, r, s', done) from the replay buffer.
2. Compute the soft target: sample a' ~ π(·|s'), compute log π(a'|s'), then y = r + γ(min(Q'1, Q'2) − α log π(a'|s')).
3. Update both critics: L(Qi) = E[(Qi(s, a) − y)²].
4. Update the actor: sample anew ~ π(·|s) and maximize E[min(Q1, Q2)(s, anew) − α log π(anew|s)], with the gradient flowing through the reparameterized sample.
5. Update the temperature: L(α) = E[−α(log π(a|s) + H̄)].
6. Soft-update the target critics: Q'i ← τQi + (1 − τ)Q'i.
No delayed updates: Unlike TD3, SAC updates the actor every step. The stochastic policy provides natural regularization (entropy prevents the actor from exploiting sharp Q-peaks), making delayed updates unnecessary. This simplifies the algorithm.
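The soft-update step of the loop is simple enough to spell out. Parameters are flat lists of floats here for illustration; real code iterates over network parameter tensors in the same way.

```python
# Polyak soft update of the target critics: Q'_i <- tau * Q_i + (1 - tau) * Q'_i.
tau = 0.005

def soft_update(target_params, online_params, tau):
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]

new_target = soft_update(target_params=[0.0, 0.0], online_params=[1.0, -1.0], tau=tau)
# each target parameter moves only 0.5% of the way toward the online value per step
```

With τ = 0.005 the targets trail the online critics slowly, which keeps the TD targets stable between updates.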

Key Hyperparameters

Hyperparameter | Default Value
Hidden layers | 2 × 256 (actor and critic)
Learning rate | 3 × 10⁻⁴ (Adam, all networks + α)
Discount γ | 0.99
Soft update τ | 0.005
Target entropy H̄ | −dim(A)
Initial α | 0.2 (then auto-tuned)
Replay buffer size | 10⁶
Batch size | 256
Warmup steps | 5,000–10,000 (random actions)
Updates per step | 1 (gradient step per environment step)

8 — SAC vs TD3 vs PPO

These three algorithms represent the dominant approaches to different RL problem settings. Choosing between them depends on your specific requirements.

Property | PPO | TD3 | SAC
On/off-policy | On-policy | Off-policy | Off-policy
Policy type | Stochastic | Deterministic | Stochastic
Action space | Discrete & continuous | Continuous only | Continuous (discrete variants exist)
Exploration | Policy entropy + bonus | Gaussian noise (manual) | Entropy-driven (automatic)
Sample efficiency | Low (on-policy) | High | High
Wall-clock efficiency | High (parallel envs) | Medium | Medium
Hyperparameter sensitivity | Medium | Low–Medium | Low (auto-tuned α)
Multi-modal behaviors | Yes (stochastic) | No (deterministic) | Yes (entropy-encouraged)
Replay buffer | No (uses rollout buffer) | Yes | Yes
Key innovation | Clipped surrogate | Twin critics + delay | Max entropy + auto-α
When to use SAC: SAC is the go-to off-policy algorithm for continuous control. It's more robust than TD3 (automatic exploration tuning, no noise schedule needed), more sample-efficient than PPO (off-policy replay), and handles multi-modal tasks better than either. Use it when: (1) your action space is continuous, (2) you want sample efficiency, (3) your task may have multiple valid strategies, or (4) you want minimal hyperparameter tuning.
When NOT to use SAC: (1) Discrete action spaces (use PPO). (2) You have cheap simulation and need maximum throughput (PPO with parallel envs may be faster in wall-clock time). (3) Your task provably has a single optimal deterministic policy and you want maximum performance (TD3 may slightly outperform on some benchmarks).
The evolution: DDPG (2016) showed continuous control was possible → TD3 (2018) fixed DDPG's instability with twin critics, delay, and smoothing → SAC (2018) replaced handcrafted fixes with a principled entropy framework. SAC adopted TD3's twin critics but replaced delay and smoothing with entropy regularization, which provides similar benefits more elegantly.

9 — References

Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML 2018.

Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., & Levine, S. (2019). Soft Actor-Critic Algorithms and Applications. arXiv:1812.05905.

Ziebart, B.D. (2010). Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Carnegie Mellon University.

Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. ICML 2018.

Kingma, D.P. & Welling, M. (2014). Auto-Encoding Variational Bayes. ICLR 2014. (Reparameterization trick origin.)

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.