SAC: Soft Actor-Critic
Follow-up: Soft Actor-Critic Algorithms and Applications (Haarnoja et al., 2019)
1 — Why Stochastic? Why Entropy?
Most RL algorithms aim to learn a single best action for each state — a deterministic policy. But deterministic policies have fundamental weaknesses:
Problems with Deterministic Policies
Brittle exploration: A deterministic policy explores only via added noise (e.g., Gaussian perturbation in DDPG/TD3). This noise is task-agnostic and doesn't adapt — it doesn't know where to explore more or less.
Mode collapse: When multiple action sequences lead to equally good outcomes, a deterministic policy collapses to one arbitrary mode. If that mode becomes suboptimal due to environment changes, recovery is difficult.
Fragility: Small perturbations in the environment (friction changes, sensor noise) can disproportionately affect a policy that commits to one exact action per state.
SAC's solution is elegant: instead of finding the single best action, learn a distribution over good actions. The policy should be as random as possible while still achieving high reward. This idea comes from the maximum entropy framework.
2 — The Maximum Entropy Framework
SAC optimizes a modified RL objective that includes an entropy bonus at every timestep:
Soft RL Objective
J(π) = Σt E[ r(st, at) + α H(π(·|st)) ]
     = Σt E[ r(st, at) − α log π(at|st) ]
The temperature parameter α > 0 controls the tradeoff: large α prioritizes entropy (exploration), small α prioritizes reward (exploitation). When α → 0, SAC reduces to a standard RL algorithm.
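For a diagonal Gaussian policy (the family SAC uses, before squashing), the entropy term H(π(·|s)) has a closed form, so the per-timestep bonus α H can be computed directly from the network's logσ output. A minimal numpy sketch (the function name is illustrative):

```python
import numpy as np

def gaussian_entropy(log_std):
    """Closed-form entropy of a diagonal Gaussian:
    H = Σ_i [ 0.5 * (1 + log(2π)) + log σ_i ]."""
    log_std = np.asarray(log_std, dtype=float)
    return float(np.sum(0.5 * (1.0 + np.log(2.0 * np.pi)) + log_std))

# A wider policy (larger σ) earns a larger entropy bonus α·H.
```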
This modified objective gives rise to soft versions of standard RL quantities:
Soft Value Functions
Soft Q-function: Qsoft(s, a) = r(s,a) + γ E[Qsoft(s', a') − α log π(a'|s')]
The soft Q-value includes future entropy bonuses — an action is "valuable" if it leads to states where the agent can both get high reward and maintain randomness.
Soft V-function: Vsoft(s) = Ea~π[Qsoft(s,a) − α log π(a|s)]
Note: SAC v2 (2019) eliminates the explicit V-function network and computes the state value implicitly from the Q-functions and the policy's log-probabilities.
3 — Architecture Overview
SAC v2 uses 5 networks: a stochastic actor, two critics, and two target critics. Notably, there is no target actor — the current actor is used to sample next-state actions for target computation. This works because SAC's stochastic policy produces diverse actions naturally (no need for target policy smoothing like TD3).
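The two target critics track the online critics with a Polyak (soft) update rather than hard copies. A minimal sketch, using numpy arrays as stand-ins for parameter tensors:

```python
import numpy as np

def polyak_update(target_params, online_params, tau=0.005):
    """Soft target update: θ' ← τ·θ + (1 − τ)·θ' for each parameter tensor."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]

# Applied after every gradient step, to both target critics.
```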
4 — The Stochastic Actor & Reparameterization
SAC's actor outputs a Gaussian distribution, not a single action. The key challenge: how do you backpropagate through random sampling?
Squashed Gaussian Policy
state → (μ, logσ) → tanh(sample)

The actor network outputs two vectors: a mean μ(s) and a log standard deviation logσ(s). An action is sampled from the resulting Gaussian, then squashed through tanh to bound it to [−1, 1]:
u = μ(s) + σ(s) ⊙ ε, where ε ~ N(0, I)
a = tanh(u)
The log-probability must account for the tanh squashing via the change-of-variables formula:
log π(a|s) = log N(u; μ, σ) − Σ log(1 − tanh²(ui))
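A minimal numpy sketch of the reparameterized sample and its corrected log-probability (the 1e−6 term is a common numerical-stability epsilon; function and variable names are illustrative):

```python
import numpy as np

def squashed_gaussian_sample(mu, log_std, rng):
    """Reparameterized tanh-squashed Gaussian sample and its log-probability."""
    mu, log_std = np.asarray(mu, float), np.asarray(log_std, float)
    std = np.exp(log_std)
    eps = rng.standard_normal(mu.shape)          # ε ~ N(0, I)
    u = mu + std * eps                           # u = μ + σ ⊙ ε
    a = np.tanh(u)                               # squash to (−1, 1)

    # log N(u; μ, σ) for a diagonal Gaussian
    log_prob = np.sum(-0.5 * (eps ** 2 + 2.0 * log_std + np.log(2.0 * np.pi)))
    # change-of-variables correction: − Σ_i log(1 − tanh²(u_i))
    log_prob -= np.sum(np.log(1.0 - a ** 2 + 1e-6))
    return a, log_prob
```

Because ε is drawn outside the network's outputs, gradients flow through μ and σ deterministically — this is the reparameterization trick.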
Actor Network Architecture
state → (μ, logσ)

The network outputs both the mean and the log standard deviation. The log_std is clamped to [−20, 2] for numerical stability (it prevents the standard deviation from collapsing to zero or exploding). The actual standard deviation is σ = exp(log_std).
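A sketch of just this output step (the weight matrices here are hypothetical stand-ins for the actor's two output heads):

```python
import numpy as np

LOG_STD_MIN, LOG_STD_MAX = -20.0, 2.0

def actor_heads(features, w_mu, w_log_std):
    """Map shared features to the mean μ and a clamped standard deviation σ."""
    mu = features @ w_mu
    log_std = np.clip(features @ w_log_std, LOG_STD_MIN, LOG_STD_MAX)
    return mu, np.exp(log_std)
```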
5 — Soft Bellman Equation & Twin Critics
SAC's critics learn the soft Q-function, which accounts for future entropy bonuses. The target includes a log-probability penalty from the current policy:
Soft TD Target
y = r + γ ( min(Q'1(s', a'), Q'2(s', a')) − α log π(a'|s') )
where a' ~ π(·|s') is sampled from the current actor (not a target actor)
The −α log π(a'|s') term is the entropy bonus: an action drawn from a more spread-out distribution has a lower log-probability, so it contributes a larger bonus. This pushes the critic to assign higher soft values to states where the policy can afford to stay uncertain.
Both critics Q1 and Q2 are trained independently against the same target y (which uses the twin minimum from TD3). The critic loss for each is the mean squared error Li = E[(Qi(s, a) − y)²].
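Putting the target and the loss together, a minimal numpy sketch (scalar inputs for clarity; names are illustrative):

```python
import numpy as np

def soft_td_target(r, done, q1_next, q2_next, logp_next, gamma=0.99, alpha=0.2):
    """y = r + γ(1 − done)(min(Q'1, Q'2) − α log π(a'|s'))."""
    min_q = np.minimum(q1_next, q2_next)
    return r + gamma * (1.0 - done) * (min_q - alpha * logp_next)

def critic_loss(q_pred, y):
    """MSE between a critic's prediction and the shared soft TD target."""
    return np.mean((q_pred - y) ** 2)

# Example: r=1, non-terminal, twin minimum 4.5, next-action log-prob −1.2
y = soft_td_target(r=1.0, done=0.0, q1_next=5.0, q2_next=4.5, logp_next=-1.2)
```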
6 — Automatic Temperature Tuning
The temperature α is the most critical hyperparameter in SAC — it determines the reward-entropy tradeoff. The 2019 follow-up paper showed how to tune it automatically, largely removing this burden from the practitioner.
Constrained Optimization for α
target entropy: H̄ = −dim(A)

Instead of fixing α, solve a constrained optimization problem: maximize expected return subject to the policy's entropy being at least the target entropy H̄. Using Lagrangian duality, this becomes:
α* = arg minα E[−α (log π(a|s) + H̄)]
In practice: if the policy's entropy is below the target, α increases (more entropy weight). If it is above the target, α decreases (more reward focus). The target entropy is heuristically set to H̄ = −dim(A), i.e., the negative of the number of action dimensions.
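A sketch of the resulting gradient step, taken on log α so that α stays positive (names are illustrative):

```python
import numpy as np

def alpha_step(log_alpha, logp_batch, target_entropy, lr=3e-4):
    """One gradient-descent step on J(α) = E[−α (log π(a|s) + H̄)],
    parameterized by log α so that α = exp(log α) remains positive."""
    alpha = np.exp(log_alpha)
    # dJ/d(log α) = −α · mean(log π + H̄)
    grad = -alpha * np.mean(np.asarray(logp_batch) + target_entropy)
    return log_alpha - lr * grad

# Entropy below target (log π too high) → grad < 0 → log α, and hence α, grows.
```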
| SAC Version | Temperature | Networks | Key Difference |
|---|---|---|---|
| SAC v1 (2018) | Fixed α (hyperparameter) | 5 (actor, V, V', Q1, Q2) | Separate value network V(s) |
| SAC v2 (2019) | Auto-tuned α (learned) | 5 (actor, Q1, Q2, Q1', Q2') | No V network, auto-temperature |
7 — Training Algorithm
Key Hyperparameters
| Hyperparameter | Default Value |
|---|---|
| Hidden layers | 2 × 256 (actor and critic) |
| Learning rate | 3 × 10⁻⁴ (Adam, all networks + α) |
| Discount γ | 0.99 |
| Soft update τ | 0.005 |
| Target entropy H̄ | −dim(A) |
| Initial α | 0.2 (then auto-tuned) |
| Replay buffer size | 10⁶ |
| Batch size | 256 |
| Warmup steps | 5,000–10,000 (random actions) |
| Updates per step | 1 (gradient step per environment step) |
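The defaults above can be bundled into a small config object; a sketch mirroring the table (the class name is illustrative, and the target entropy is omitted because it depends on the environment's action dimension):

```python
from dataclasses import dataclass

@dataclass
class SACConfig:
    hidden_sizes: tuple = (256, 256)   # 2 × 256 for actor and critics
    lr: float = 3e-4                   # Adam, shared by all networks and α
    gamma: float = 0.99                # discount
    tau: float = 0.005                 # soft target-update rate
    init_alpha: float = 0.2            # then auto-tuned
    buffer_size: int = 1_000_000
    batch_size: int = 256
    warmup_steps: int = 10_000         # random actions before learning starts
    updates_per_step: int = 1          # gradient steps per environment step
```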
8 — SAC vs TD3 vs PPO
These three algorithms represent the dominant approaches to different RL problem settings. Choosing between them depends on your specific requirements.
| Property | PPO | TD3 | SAC |
|---|---|---|---|
| On/Off-Policy | On-policy | Off-policy | Off-policy |
| Policy type | Stochastic | Deterministic | Stochastic |
| Action space | Discrete & continuous | Continuous only | Continuous (discrete variants exist) |
| Exploration | Policy entropy + bonus | Gaussian noise (manual) | Entropy-driven (automatic) |
| Sample efficiency | Low (on-policy) | High | High |
| Wall-clock efficiency | High (parallel envs) | Medium | Medium |
| Hyperparameter sensitivity | Medium | Low–Medium | Low (auto-tuned α) |
| Multi-modal behaviors | Yes (stochastic) | No (deterministic) | Yes (entropy-encouraged) |
| Replay buffer | No (uses rollout buffer) | Yes | Yes |
| Key innovation | Clipped surrogate | Twin critics + delay | Max entropy + auto-α |
9 — References
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML 2018.
Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., & Levine, S. (2019). Soft Actor-Critic Algorithms and Applications. arXiv:1812.05905.
Ziebart, B.D. (2010). Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Carnegie Mellon University.
Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. ICML 2018.
Kingma, D.P. & Welling, M. (2014). Auto-Encoding Variational Bayes. ICLR 2014. (Reparameterization trick origin.)
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.