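Deep Learning-Driven Control Methods Explained (6): Deep Reinforcement Learning Control (3), Actor-Critic Architecture
Abstract: This article gives a detailed introduction to the Actor-Critic architecture in deep reinforcement learning and its representative algorithm, DDPG. Actor-Critic combines the strengths of value-function methods (low variance) and policy-gradient methods (support for continuous actions): an Actor network outputs the action policy and a Critic network evaluates state-action values. The focus is on the deterministic policy gradient (DPG) theorem and the DDPG algorithm, which achieves efficient continuous control through experience replay, target networks, and noise-based exploration. The article includes the complete DDPG algorithm and a PyTorch implementation.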
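Series guide: this installment covers DDPG, TD3, and SAC, efficient continuous-control algorithms that are among the most widely used deep reinforcement learning methods in robot control.
1. Introduction: Combining Value Functions and Policy Gradients
The previous two installments covered:
- Value-function methods (DQN): low variance, but limited to discrete actions
- Policy-gradient methods (PPO): support continuous actions, but with higher variance
The Actor-Critic architecture combines the strengths of both:
$$\underbrace{\text{Actor}}_{\text{policy network}} + \underbrace{\text{Critic}}_{\text{value network}} = \text{low-variance continuous control}$$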
2. Actor-Critic Basics
2.1 Architecture Overview
Actor: outputs an action $a = \pi_\theta(s)$ or a distribution $\pi_\theta(a|s)$
Critic: evaluates the value of a state-action pair, $Q_\phi(s, a)$
2.2 Update Rules
Critic update (TD learning):
$$\phi \leftarrow \phi - \alpha_Q \nabla_\phi \left( Q_\phi(s, a) - y \right)^2$$
where the target is $y = r + \gamma Q_{\phi'}(s', a')$, with $a'$ the action taken in the next state $s'$ and $\phi'$ the parameters used to evaluate the target.
Actor update (policy gradient):
$$\theta \leftarrow \theta + \alpha_\pi \nabla_\theta Q_\phi(s, \pi_\theta(s))$$
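The two updates can be traced by hand on a tiny example. The sketch below assumes a hypothetical scalar problem with a linear critic $Q_\phi(s,a) = \phi_0 s + \phi_1 a$ and a linear actor $\pi_\theta(s) = \theta s$; the transition, the initial parameters, and the learning rates are made up for illustration and are not taken from any benchmark.

```python
# Toy scalar Actor-Critic step (all numbers are illustrative assumptions).
gamma = 0.99
alpha_q, alpha_pi = 0.1, 0.1

phi = [0.5, -0.2]        # critic parameters: Q(s, a) = phi[0]*s + phi[1]*a
phi_target = list(phi)   # phi', used only inside the TD target
theta = 0.3              # actor parameter: pi(s) = theta * s


def q_value(params, s, a):
    return params[0] * s + params[1] * a


# One made-up transition (s, a, r, s')
s, a, r, s_next = 1.0, 0.4, 1.0, 0.8
a_next = theta * s_next                      # a' = pi_theta(s')

# Critic: TD target, then one gradient step on (Q - y)^2
y = r + gamma * q_value(phi_target, s_next, a_next)
td_error = q_value(phi, s, a) - y
loss_before = td_error ** 2
grad_phi = [2 * td_error * s, 2 * td_error * a]
phi = [p - alpha_q * g for p, g in zip(phi, grad_phi)]
loss_after = (q_value(phi, s, a) - y) ** 2

# Actor: gradient ascent on Q(s, pi_theta(s)); here dQ/dtheta = phi[1] * s
theta = theta + alpha_pi * phi[1] * s

print(f"TD loss before {loss_before:.4f}, after {loss_after:.4f}")
```

The critic step shrinks the squared TD error on this transition, and the actor moves in whichever direction the critic currently rates higher.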
2.3 Deterministic Policy Gradient
For a deterministic policy $\mu_\theta(s)$, the deterministic policy gradient (DPG) theorem states:
$$\nabla_\theta J = \mathbb{E}_{s \sim \rho} \left[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q(s, a)\big|_{a=\mu_\theta(s)} \right]$$
Key insight: the expectation runs over states only, with no integral over the action space, so the gradient is cheaper to compute.
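As a worked instance of the chain rule in the theorem, the sketch below assumes a known one-dimensional critic $Q(s,a) = -(a - 2s)^2$ and a linear actor $\mu_\theta(s) = \theta s$. Both are hypothetical choices made so that the optimum ($\theta = 2$) is known in advance; the fixed list of states stands in for samples from $\rho$.

```python
# Deterministic policy gradient on a 1-D toy problem (illustrative assumptions).
# Critic (assumed known): Q(s, a) = -(a - 2*s)^2, so the best action is a = 2*s.
# Actor: mu_theta(s) = theta * s.

def dq_da(s, a):
    return -2.0 * (a - 2.0 * s)       # gradient of Q with respect to the action


def dmu_dtheta(s):
    return s                           # gradient of mu_theta(s) with respect to theta


states = [0.5, 1.0, 1.5, 2.0]          # stand-in for samples from rho
theta, lr = 0.0, 0.05

for _ in range(200):
    # Sample average of d(mu)/d(theta) * dQ/da, evaluated at a = mu_theta(s)
    grad = sum(dmu_dtheta(s) * dq_da(s, theta * s) for s in states) / len(states)
    theta += lr * grad                 # gradient ascent on J

print(round(theta, 4))
```

No action is ever sampled or integrated over: each state contributes one product of two derivatives.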
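3. DDPG: Deep Deterministic Policy Gradient
3.1 Core Idea
DDPG (Deep Deterministic Policy Gradient) carries the techniques of DQN over to continuous control:
| DQN technique | Use in DDPG |
|---|---|
| Experience replay | ✅ Same |
| Target network | ✅ One each for the Actor and the Critic |
| Greedy (ε-greedy) policy | Deterministic policy + exploration noise |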
3.2 Algorithm Components
Network structure:
- Actor network $\mu_\theta(s)$: outputs a deterministic action
- Critic network $Q_\phi(s, a)$: estimates the Q-value
- Target Actor $\mu_{\theta'}$
- Target Critic $Q_{\phi'}$
Exploration mechanism:
$$a = \mu_\theta(s) + \mathcal{N}(0, \sigma)$$
Ornstein-Uhlenbeck (OU) noise or Gaussian noise is commonly used.
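For readers who have not met OU noise, here is a minimal sketch of a discretised Ornstein-Uhlenbeck process. The class name `OUNoise` and the default values for `theta`, `sigma`, and `dt` are assumptions chosen for illustration (note that `theta` here is the mean-reversion rate of the noise, not the policy parameter $\theta$).

```python
import math
import random


class OUNoise:
    """Discretised Ornstein-Uhlenbeck process (parameter defaults are assumptions)."""

    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=None):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = random.Random(seed)
        self.state = [mu] * dim

    def reset(self):
        """Restart the process at its mean, e.g. at the start of an episode."""
        self.state = [self.mu] * len(self.state)

    def sample(self):
        # x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        self.state = [
            x + self.theta * (self.mu - x) * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


noise = OUNoise(dim=2, seed=0)
samples = [noise.sample() for _ in range(5)]
print(samples[-1])
```

Unlike independent Gaussian noise, consecutive OU samples are correlated, which produces smoother exploration in systems with inertia.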
3.3 The DDPG Algorithm
Algorithm: DDPG
Input: Actor μ_θ, Critic Q_φ, target network parameters θ', φ'
Output: trained policy
1. Initialize replay buffer D
2. Initialize target networks: θ' ← θ, φ' ← φ
3. for episode = 1 to M:
4.     Initialize noise process N
5.     Observe initial state s
6.     for t = 1 to T:
7.         Select action: a = μ_θ(s) + N_t
8.         Execute a, observe r, s'
9.         Store (s, a, r, s') in D
10.        Sample a mini-batch from D
11.        Compute target: y = r + γ Q_{φ'}(s', μ_{θ'}(s'))
12.        Update Critic: φ ← φ - α_Q ∇_φ (Q_φ(s,a) - y)²
13.        Update Actor: θ ← θ + α_π ∇_θ Q_φ(s, μ_θ(s))
14.        Soft-update target networks:
               θ' ← τθ + (1-τ)θ'
               φ' ← τφ + (1-τ)φ'
15.        s ← s'
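The soft update in step 14 is an exponential moving average, so the target parameters trail the online ones with a time constant of about $1/\tau$. A small sketch, assuming plain Python lists as stand-ins for the parameter vectors and a hypothetical helper `soft_update`, with the online parameters held fixed:

```python
def soft_update(target, source, tau):
    """Return tau * source + (1 - tau) * target, element by element."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]


tau = 0.005
online = [1.0, -2.0]      # stand-in for theta (held fixed here)
target = [0.0, 0.0]       # stand-in for theta'

steps = 0
while abs(online[0] - target[0]) > 0.5:   # until half of the initial gap is closed
    target = soft_update(target, online, tau)
    steps += 1

print(steps)
```

With $\tau = 0.005$ the gap shrinks by a factor of $0.995$ per update, so halving it takes roughly $\ln 2 / \tau \approx 139$ updates. This slow tracking is what keeps the TD target stable.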
3.4 PyTorch Implementation

    import torch
    import torch.nn as nn
    import torch.optim as optim
    import numpy as np
    from collections import deque
    import random


    class Actor(nn.Module):
        """DDPG Actor network."""

        def __init__(self, state_dim, action_dim, hidden_dim=256, max_action=1.0):
            super().__init__()
            self.max_action = max_action
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, action_dim),
                nn.Tanh()
            )

        def forward(self, state):
            # Tanh output in [-1, 1], scaled to the action range
            return self.max_action * self.net(state)


    class Critic(nn.Module):
        """DDPG Critic network."""

        def __init__(self, state_dim, action_dim, hidden_dim=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 1)
            )

        def forward(self, state, action):
            return self.net(torch.cat([state, action], dim=-1))


    class DDPG:
        """DDPG algorithm."""

        def __init__(self, state_dim, action_dim, max_action=1.0,
                     lr_actor=1e-4, lr_critic=1e-3, gamma=0.99, tau=0.005):
            self.gamma = gamma
            self.tau = tau
            self.max_action = max_action
            # Networks and their target copies
            self.actor = Actor(state_dim, action_dim, max_action=max_action)
            self.actor_target = Actor(state_dim, action_dim, max_action=max_action)
            self.actor_target.load_state_dict(self.actor.state_dict())
            self.critic = Critic(state_dim, action_dim)
            self.critic_target = Critic(state_dim, action_dim)
            self.critic_target.load_state_dict(self.critic.state_dict())
            self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=lr_actor)
            self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=lr_critic)
            # Replay buffer of (state, action, reward, next_state, done) tuples
            self.buffer = deque(maxlen=100000)

        def select_action(self, state, noise=0.1):
            state = torch.FloatTensor(state).unsqueeze(0)
            action = self.actor(state).squeeze(0).detach().numpy()
            # Gaussian exploration noise, then clip to the valid action range
            action = action + np.random.normal(0, noise, size=action.shape)
            return np.clip(action, -self.max_action, self.max_action)

        def update(self, batch_size=64):
            if len(self.buffer) < batch_size:
                return None, None
            # Sample a mini-batch
            batch = random.sample(self.buffer, batch_size)
            states, actions, rewards, next_states, dones = zip(*batch)
            states = torch.FloatTensor(np.array(states))
            actions = torch.FloatTensor(np.array(actions))
            rewards = torch.FloatTensor(np.array(rewards)).unsqueeze(1)
            next_states = torch.FloatTensor(np.array(next_states))
            dones = torch.FloatTensor(np.array(dones)).unsqueeze(1)
            # Update the Critic
            with torch.no_grad():
                next_actions = self.actor_target(next_states)
                target_q = self.critic_target(next_states, next_actions)
                target_q = rewards + self.gamma * target_q * (1 - dones)
            current_q = self.critic(states, actions)
            critic_loss = nn.MSELoss()(current_q, target_q)
            self.critic_optimizer.zero_grad()
            critic_loss.backward()
            self.critic_optimizer.step()
            # Update the Actor (maximize Q by minimizing -Q)
            actor_loss = -self.critic(states, self.actor(states)).mean()
            self.actor_optimizer.zero_grad()
            actor_loss.backward()
            self.actor_optimizer.step()
            # Soft-update the target networks
            self._soft_update(self.actor_target, self.actor)
            self._soft_update(self.critic_target, self.critic)
            return actor_loss.item(), critic_loss.item()

        def _soft_update(self, target, source):
            for target_param, param in zip(target.parameters(), source.parameters()):
                target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)
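The implementation above keeps transitions in a plain `deque` and samples with `random.sample`. If a dedicated container is preferred, a minimal sketch might look like the following; `ReplayBuffer` and its method names are hypothetical and not part of the original code.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO buffer of (s, a, r, s', done) tuples (hypothetical helper)."""

    def __init__(self, capacity=100000, seed=None):
        self.storage = deque(maxlen=capacity)   # oldest entries are evicted first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, float(done)))

    def sample(self, batch_size):
        batch = self.rng.sample(list(self.storage), batch_size)
        # Transpose a list of transitions into five parallel tuples
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):                      # the two oldest transitions are evicted
    buffer.add([t], [0.0], float(t), [t + 1], t == 4)
states, actions, rewards, next_states, dones = buffer.sample(2)
print(len(buffer), rewards)
```

Sampling uniformly from old and new transitions breaks the temporal correlation between consecutive steps, which is the reason DDPG borrows this component from DQN.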
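3.5 Problems with DDPG
DDPG is effective, but it has several notable problems:
- Q-value overestimation: similar to DQN
- Hyperparameter sensitivity: noise scale, learning rates, and so on
- Brittleness: the policy can collapse suddenly
- Insufficient exploration: a deterministic policy plus noise explores inefficiently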
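4. TD3: Twin Delayed Deep Deterministic Policy Gradient
4.1 Three Improvements
TD3 (Twin Delayed DDPG) proposes three improvements that target DDPG's problems:
| Improvement | Problem addressed | Method |
|---|---|---|
| Twin Q-networks | Q-value overestimation | Take the minimum of two Q-networks |
| Delayed policy updates | Policy oscillation | Update the Actor less often than the Critic |
| Target policy smoothing | Variance of the target value | Add noise to the target action |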
4.2 Twin Q-Networks
Two independent Critic networks are used, and the target takes the minimum of the two:
$$y = r + \gamma \min_{i=1,2} Q_{\phi'_i}(s', \tilde{a}')$$
This effectively suppresses overestimation.
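Why the minimum helps can be seen in a small Monte Carlo experiment. The setup is an assumption made for illustration: ten candidate actions whose true value is exactly 0, and two critics whose estimates carry independent unit Gaussian noise. Picking the action that looks best under one critic and trusting that same critic's estimate is biased upward; taking the minimum with the second critic removes that bias.

```python
import random

rng = random.Random(0)
n_actions, n_trials = 10, 5000
true_q = 0.0                      # every action is worth exactly 0 in this toy

single_sum, clipped_sum = 0.0, 0.0
for _ in range(n_trials):
    # Two independent noisy critics evaluated on the same candidate actions
    q1 = [true_q + rng.gauss(0.0, 1.0) for _ in range(n_actions)]
    q2 = [true_q + rng.gauss(0.0, 1.0) for _ in range(n_actions)]
    best = max(range(n_actions), key=lambda i: q1[i])   # greedy w.r.t. critic 1
    single_sum += q1[best]                    # single-critic target
    clipped_sum += min(q1[best], q2[best])    # clipped double-Q target

single_bias = single_sum / n_trials
clipped_bias = clipped_sum / n_trials
print(f"single critic bias:  {single_bias:+.3f}")
print(f"clipped double bias: {clipped_bias:+.3f}")
```

In this toy the clipped estimate ends up slightly below the true value, so the trick trades overestimation for a mild underestimation, which tends to be less harmful because it is not reinforced by the policy update.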
4.3 Target Policy Smoothing
When computing the target value, clipped noise is added to the target action:
$$\tilde{a}' = \text{clip}\left(\mu_{\theta'}(s') + \text{clip}(\epsilon, -c, c),\ a_{low},\ a_{high}\right)$$
$$\epsilon \sim \mathcal{N}(0, \sigma)$$
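The two nested clips are easy to mix up, so here is a minimal sketch of the formula in plain Python. The function name `smoothed_target_action` is hypothetical; the defaults $\sigma = 0.2$ and $c = 0.5$ match the `policy_noise` and `noise_clip` defaults used in this article's TD3 code, and the action bounds of $\pm 1$ are an assumption.

```python
import random


def clip(x, low, high):
    return max(low, min(high, x))


def smoothed_target_action(mu_next, sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0, rng=random):
    """Target action with clipped Gaussian noise, one entry per action dimension."""
    # Inner clip: bound the noise itself to [-c, c]
    eps = [clip(rng.gauss(0.0, sigma), -c, c) for _ in mu_next]
    # Outer clip: keep the perturbed action inside the valid action range
    return [clip(m + e, a_low, a_high) for m, e in zip(mu_next, eps)]


rng = random.Random(0)
mu_next = [0.9, -0.2]            # stand-in for mu_theta'(s')
a_tilde = smoothed_target_action(mu_next, rng=rng)
print(a_tilde)
```

The inner clip keeps the smoothed action close to the target policy's action, while the outer clip guarantees the critic is never queried outside the action space.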
4.4 Delayed Updates
The Actor network is updated once every $d$ steps (typically $d=2$), while the Critic is updated at every step.
4.5 TD3 Implementation

    # Reuses Actor, Critic and the imports from the DDPG implementation in section 3.4
    class TD3:
        """TD3 algorithm."""

        def __init__(self, state_dim, action_dim, max_action=1.0,
                     lr=3e-4, gamma=0.99, tau=0.005, policy_noise=0.2,
                     noise_clip=0.5, policy_delay=2):
            self.gamma = gamma
            self.tau = tau
            self.max_action = max_action
            self.policy_noise = policy_noise
            self.noise_clip = noise_clip
            self.policy_delay = policy_delay
            self.total_it = 0
            # Actor
            self.actor = Actor(state_dim, action_dim, max_action=max_action)
            self.actor_target = Actor(state_dim, action_dim, max_action=max_action)
            self.actor_target.load_state_dict(self.actor.state_dict())
            self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=lr)
            # Twin Critics
            self.critic1 = Critic(state_dim, action_dim)
            self.critic2 = Critic(state_dim, action_dim)
            self.critic1_target = Critic(state_dim, action_dim)
            self.critic2_target = Critic(state_dim, action_dim)
            self.critic1_target.load_state_dict(self.critic1.state_dict())
            self.critic2_target.load_state_dict(self.critic2.state_dict())
            self.critic_optimizer = optim.Adam(
                list(self.critic1.parameters()) + list(self.critic2.parameters()),
                lr=lr
            )
            self.buffer = deque(maxlen=100000)

        def select_action(self, state, noise=0.1):
            state = torch.FloatTensor(state).unsqueeze(0)
            action = self.actor(state).squeeze(0).detach().numpy()
            if noise > 0:
                action = action + np.random.normal(0, noise, size=action.shape)
            return np.clip(action, -self.max_action, self.max_action)

        def update(self, batch_size=256):
            self.total_it += 1
            if len(self.buffer) < batch_size:
                return None, None
            # Sample a mini-batch
            batch = random.sample(self.buffer, batch_size)
            states, actions, rewards, next_states, dones = zip(*batch)
            states = torch.FloatTensor(np.array(states))
            actions = torch.FloatTensor(np.array(actions))
            rewards = torch.FloatTensor(np.array(rewards)).unsqueeze(1)
            next_states = torch.FloatTensor(np.array(next_states))
            dones = torch.FloatTensor(np.array(dones)).unsqueeze(1)
            with torch.no_grad():
                # Target policy smoothing
                noise = (torch.randn_like(actions) * self.policy_noise).clamp(
                    -self.noise_clip, self.noise_clip
                )
                next_actions = (self.actor_target(next_states) + noise).clamp(
                    -self.max_action, self.max_action
                )
                # Minimum of the twin target Q-values
                target_q1 = self.critic1_target(next_states, next_actions)
                target_q2 = self.critic2_target(next_states, next_actions)
                target_q = torch.min(target_q1, target_q2)
                target_q = rewards + self.gamma * target_q * (1 - dones)
            # Update both Critics
            current_q1 = self.critic1(states, actions)
            current_q2 = self.critic2(states, actions)
            critic_loss = nn.MSELoss()(current_q1, target_q) + \
                nn.MSELoss()(current_q2, target_q)
            self.critic_optimizer.zero_grad()
            critic_loss.backward()
            self.critic_optimizer.step()
            actor_loss = None
            # Delayed Actor update (actor_loss stays None on skipped iterations)
            if self.total_it % self.policy_delay == 0:
                actor_loss = -self.critic1(states, self.actor(states)).mean()
                self.actor_optimizer.zero_grad()
                actor_loss.backward()
                self.actor_optimizer.step()
                # Soft-update the target networks
                self._soft_update(self.actor_target, self.actor)
                self._soft_update(self.critic1_target, self.critic1)
                self._soft_update(self.critic2_target, self.critic2)
                actor_loss = actor_loss.item()
            return actor_loss, critic_loss.item()

        def _soft_update(self, target, source):
            # Same Polyak averaging as in the DDPG class
            for target_param, param in zip(target.parameters(), source.parameters()):
                target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)
5. SAC: Soft Actor-Critic
5.1 Maximum-Entropy Reinforcement Learning
SAC (Soft Actor-Critic) is built on the maximum-entropy framework: the objective is to maximize expected return plus policy entropy:
$$J(\pi) = \sum_t \mathbb{E} \left[ r_t + \alpha \mathcal{H}(\pi(\cdot|s_t)) \right]$$
where the entropy term is:
$$\mathcal{H}(\pi) = -\mathbb{E}_{a \sim \pi}[\log \pi(a|s)]$$
Intuition: earn high return while continuing to explore (the policy should not become too deterministic).
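To make the entropy term concrete, the sketch below assumes a one-dimensional Gaussian policy, for which the entropy has the closed form $\frac{1}{2}\ln(2\pi e\sigma^2)$, and compares it with a Monte Carlo estimate of $-\mathbb{E}[\log \pi(a|s)]$. The helper names are hypothetical.

```python
import math
import random


def gaussian_entropy(sigma):
    """Closed-form differential entropy of a 1-D Gaussian: 0.5 * ln(2*pi*e*sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


def gaussian_log_prob(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def mc_entropy(mu, sigma, n=20000, seed=0):
    """Monte Carlo estimate of -E[log pi(a|s)] from n sampled actions."""
    rng = random.Random(seed)
    total = sum(gaussian_log_prob(rng.gauss(mu, sigma), mu, sigma) for _ in range(n))
    return -total / n


for sigma in (0.1, 0.5, 1.0):
    print(sigma, round(gaussian_entropy(sigma), 3), round(mc_entropy(0.0, sigma), 3))
```

A wider policy has higher entropy, and for a narrow enough policy the differential entropy is negative, which is why negative target entropies are meaningful for continuous actions.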
5.2 Soft Bellman Equation
The soft Q-function satisfies:
$$Q^*(s, a) = r + \gamma \mathbb{E}_{s'} \left[ V^*(s') \right]$$
$$V^*(s) = \mathbb{E}_{a \sim \pi^*} \left[ Q^*(s, a) - \alpha \log \pi^*(a|s) \right]$$
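The soft value can be checked numerically in a simplified setting. The sketch assumes a discrete set of three actions with made-up Q-values and a policy of the Boltzmann form $\pi(a|s) \propto \exp(Q(s,a)/\alpha)$, which is the maximizer of the entropy-regularized objective in the discrete case. Under that assumption the expectation in the second equation collapses to $\alpha \log \sum_a \exp(Q(s,a)/\alpha)$.

```python
import math


def soft_value(q_values, alpha):
    """V(s) = E_{a~pi}[Q(s,a) - alpha*log pi(a|s)] for pi proportional to exp(Q/alpha)."""
    m = max(q_values)
    weights = [math.exp((q - m) / alpha) for q in q_values]   # shifted for stability
    z = sum(weights)
    pi = [w / z for w in weights]
    v = sum(p * (q - alpha * math.log(p)) for p, q in zip(pi, q_values))
    return v, pi


q_values = [1.0, 2.0, 0.5]           # made-up Q(s, a) for three discrete actions
alpha = 0.5
v, pi = soft_value(q_values, alpha)
log_sum_exp = alpha * math.log(sum(math.exp(q / alpha) for q in q_values))
print(round(v, 6), round(log_sum_exp, 6))
```

The soft value is never below $\max_a Q(s,a)$; the gap is the entropy bonus, and it shrinks toward zero as $\alpha \to 0$.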
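5.3 Key Properties of SAC
| Property | Description | Benefit |
|---|---|---|
| Stochastic policy | Outputs an action distribution | Natural exploration |
| Entropy regularization | Encourages policy diversity | Helps avoid local optima |
| Automatic temperature tuning | Adapts $\alpha$ automatically | Less hyperparameter tuning |
| Twin Q-networks | Inherited from TD3 | Suppresses overestimation |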
5.4 Reparameterization Trick
To backpropagate through sampled actions, SAC uses the reparameterization trick:
$$a = \tanh(\mu_\theta(s) + \sigma_\theta(s) \odot \epsilon), \quad \epsilon \sim \mathcal{N}(0, I)$$
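Because the action passes through $\tanh$, its log-density needs a change-of-variables correction: $\log \pi(a|s) = \log \mathcal{N}(u) - \log(1 - \tanh^2(u))$ for the pre-squash sample $u$. The scalar sketch below (hypothetical helper names, made-up $\mu = 0.3$, $\sigma = 0.5$) samples one action this way and checks by numerical integration that the corrected density is properly normalized.

```python
import math
import random


def sample_squashed(mu, sigma, rng):
    """Reparameterised sample a = tanh(mu + sigma * eps) and its log-density."""
    eps = rng.gauss(0.0, 1.0)
    u = mu + sigma * eps                        # differentiable in mu and sigma
    a = math.tanh(u)
    log_normal = -0.5 * eps ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    log_prob = log_normal - math.log(1.0 - a ** 2)    # change-of-variables term
    return a, log_prob


def squashed_density(a, mu, sigma):
    """Density of a = tanh(u), u ~ N(mu, sigma^2), evaluated at a in (-1, 1)."""
    u = math.atanh(a)
    normal = math.exp(-0.5 * ((u - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
    return normal / (1.0 - a ** 2)


# Midpoint-rule check that the corrected density integrates to about 1
n = 20000
h = 2.0 / n
total = sum(squashed_density(-1.0 + (i + 0.5) * h, 0.3, 0.5) for i in range(n)) * h

a, log_prob = sample_squashed(0.3, 0.5, random.Random(0))
print(round(total, 4), round(a, 4), round(log_prob, 4))
```

Dropping the correction term would make the entropy estimate wrong wherever the policy saturates near the action bounds.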
5.5 Automatic Temperature Tuning
The temperature parameter $\alpha$ is adjusted automatically by minimizing the following objective:
$$J(\alpha) = \mathbb{E}_{a \sim \pi} \left[ -\alpha \log \pi(a|s) - \alpha \bar{\mathcal{H}} \right]$$
where $\bar{\mathcal{H}}$ is the target entropy (usually set to $-\dim(\mathcal{A})$).
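Since $J(\alpha) = \alpha\,(\mathcal{H}(\pi) - \bar{\mathcal{H}})$, its gradient is simply the gap between the current and target entropy. The sketch below takes one gradient step in each regime. It assumes the $\log\alpha$ parameterization that this article's implementation also uses, and `alpha_step` is a hypothetical helper with a made-up learning rate.

```python
import math


def alpha_step(log_alpha, mean_log_prob, target_entropy, lr=0.01):
    """One gradient-descent step on the temperature loss, parameterised by log(alpha)."""
    # Loss = log_alpha * (entropy - target_entropy), with entropy = -mean_log_prob,
    # so the gradient with respect to log_alpha is -(mean_log_prob + target_entropy)
    grad = -(mean_log_prob + target_entropy)
    return log_alpha - lr * grad


target_entropy = -2.0                 # -dim(A) for a 2-D action space
log_alpha = math.log(0.2)

# Policy too deterministic: entropy -3.0 (mean log-prob +3.0) is below the target
too_low = alpha_step(log_alpha, mean_log_prob=3.0, target_entropy=target_entropy)
# Policy more random than required: entropy -1.0 is above the target
too_high = alpha_step(log_alpha, mean_log_prob=1.0, target_entropy=target_entropy)

print(math.exp(too_low) > 0.2, math.exp(too_high) < 0.2)
```

The temperature rises when the policy's entropy falls below the target (strengthening the entropy bonus) and falls when the policy is more random than required.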
5.6 Complete SAC Implementation

    import random
    from collections import deque

    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.distributions import Normal
    import numpy as np


    class SACPolicy(nn.Module):
        """SAC stochastic policy network."""

        def __init__(self, state_dim, action_dim, hidden_dim=256, max_action=1.0):
            super().__init__()
            self.max_action = max_action
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU()
            )
            self.mean = nn.Linear(hidden_dim, action_dim)
            self.log_std = nn.Linear(hidden_dim, action_dim)

        def forward(self, state):
            features = self.net(state)
            mean = self.mean(features)
            log_std = self.log_std(features).clamp(-20, 2)
            return mean, log_std

        def sample(self, state):
            mean, log_std = self.forward(state)
            std = log_std.exp()
            dist = Normal(mean, std)
            # Reparameterized sample
            x = dist.rsample()
            y = torch.tanh(x)
            action = y * self.max_action
            # Log-probability including the Jacobian of a = max_action * tanh(x)
            log_prob = dist.log_prob(x) - torch.log(self.max_action * (1 - y.pow(2)) + 1e-6)
            log_prob = log_prob.sum(dim=-1, keepdim=True)
            return action, log_prob

        def get_action(self, state, deterministic=False):
            mean, log_std = self.forward(state)
            if deterministic:
                return torch.tanh(mean) * self.max_action
            else:
                action, _ = self.sample(state)
                return action


    class SACCritic(nn.Module):
        """SAC twin Q-networks."""

        def __init__(self, state_dim, action_dim, hidden_dim=256):
            super().__init__()
            self.q1 = nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 1)
            )
            self.q2 = nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 1)
            )

        def forward(self, state, action):
            x = torch.cat([state, action], dim=-1)
            return self.q1(x), self.q2(x)


    class SAC:
        """SAC algorithm."""

        def __init__(self, state_dim, action_dim, max_action=1.0,
                     lr=3e-4, gamma=0.99, tau=0.005, alpha=0.2, auto_alpha=True):
            self.gamma = gamma
            self.tau = tau
            self.max_action = max_action
            self.auto_alpha = auto_alpha
            # Policy network
            self.policy = SACPolicy(state_dim, action_dim, max_action=max_action)
            self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=lr)
            # Critic networks
            self.critic = SACCritic(state_dim, action_dim)
            self.critic_target = SACCritic(state_dim, action_dim)
            self.critic_target.load_state_dict(self.critic.state_dict())
            self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=lr)
            # Temperature parameter (optimized in log space so alpha stays positive)
            if auto_alpha:
                self.target_entropy = -action_dim
                self.log_alpha = torch.zeros(1, requires_grad=True)
                self.alpha_optimizer = optim.Adam([self.log_alpha], lr=lr)
                self.alpha = self.log_alpha.exp().item()
            else:
                self.alpha = alpha
            self.buffer = deque(maxlen=100000)

        def select_action(self, state, deterministic=False):
            state = torch.FloatTensor(state).unsqueeze(0)
            action = self.policy.get_action(state, deterministic)
            return action.squeeze(0).detach().numpy()

        def update(self, batch_size=256):
            if len(self.buffer) < batch_size:
                return {}
            # Sample a mini-batch
            batch = random.sample(self.buffer, batch_size)
            states, actions, rewards, next_states, dones = zip(*batch)
            states = torch.FloatTensor(np.array(states))
            actions = torch.FloatTensor(np.array(actions))
            rewards = torch.FloatTensor(np.array(rewards)).unsqueeze(1)
            next_states = torch.FloatTensor(np.array(next_states))
            dones = torch.FloatTensor(np.array(dones)).unsqueeze(1)
            # Update the Critic with the soft Bellman target
            with torch.no_grad():
                next_actions, next_log_probs = self.policy.sample(next_states)
                target_q1, target_q2 = self.critic_target(next_states, next_actions)
                target_q = torch.min(target_q1, target_q2) - self.alpha * next_log_probs
                target_q = rewards + self.gamma * target_q * (1 - dones)
            current_q1, current_q2 = self.critic(states, actions)
            critic_loss = nn.MSELoss()(current_q1, target_q) + \
                nn.MSELoss()(current_q2, target_q)
            self.critic_optimizer.zero_grad()
            critic_loss.backward()
            self.critic_optimizer.step()
            # Update the policy
            new_actions, log_probs = self.policy.sample(states)
            q1, q2 = self.critic(states, new_actions)
            q = torch.min(q1, q2)
            policy_loss = (self.alpha * log_probs - q).mean()
            self.policy_optimizer.zero_grad()
            policy_loss.backward()
            self.policy_optimizer.step()
            # Update the temperature parameter
            alpha_loss = None
            if self.auto_alpha:
                alpha_loss = -(self.log_alpha * (log_probs + self.target_entropy).detach()).mean()
                self.alpha_optimizer.zero_grad()
                alpha_loss.backward()
                self.alpha_optimizer.step()
                self.alpha = self.log_alpha.exp().item()
            # Soft-update the target networks
            for param, target_param in zip(self.critic.parameters(),
                                           self.critic_target.parameters()):
                target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)
            return {
                'critic_loss': critic_loss.item(),
                'policy_loss': policy_loss.item(),
                'alpha': self.alpha,
                'entropy': -log_probs.mean().item()
            }
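6. Algorithm Comparison and Selection
6.1 Performance Comparison
| Algorithm | Sample efficiency | Stability | Hyperparameter sensitivity | Exploration |
|---|---|---|---|---|
| DDPG | High | Low | High | Low |
| TD3 | High | Medium | Medium | Low |
| SAC | High | High | Low | High |
| PPO | Medium | High | Low | Medium |
6.2 Selection Guide
6.3 Practical Advice
Default to SAC, because:
- Automatic temperature tuning reduces hyperparameter tuning
- The stochastic policy provides natural exploration
- It is relatively insensitive to hyperparameters
- It performs well on most continuous-control tasks
Choose TD3 when:
- A deterministic policy is sufficient
- Compute is limited (SAC is slightly slower)
- You already have a DDPG codebase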
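7. Control Application Examples
7.1 Robot Motion Control

    # Robot-arm control example
    # Assumes the gymnasium and gymnasium_robotics packages are installed;
    # importing gymnasium_robotics is expected to register the Fetch environments.
    import gymnasium as gym
    import gymnasium_robotics  # noqa: F401
    import numpy as np


    def flatten_obs(obs):
        # The goal changes every episode, so the policy must see it with the observation
        return np.concatenate([obs['observation'], obs['desired_goal']])


    def train_robot_arm():
        env = gym.make('FetchReach-v2')  # robot-arm reaching task
        state_dim = (env.observation_space['observation'].shape[0]
                     + env.observation_space['desired_goal'].shape[0])
        action_dim = env.action_space.shape[0]
        agent = SAC(state_dim, action_dim)
        for episode in range(1000):
            obs, _ = env.reset()
            state = flatten_obs(obs)
            episode_reward = 0
            for step in range(50):
                action = agent.select_action(state)
                obs, reward, terminated, truncated, info = env.step(action)
                next_state = flatten_obs(obs)
                # Store only true termination so time-limit truncation still bootstraps
                agent.buffer.append((state, action, reward, next_state, float(terminated)))
                agent.update()
                state = next_state
                episode_reward += reward
                if terminated or truncated:
                    break
            print(f"Episode {episode}, Reward: {episode_reward:.2f}")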
7.2 Continuous-Control Benchmarks
| Environment | SAC score | TD3 score | PPO score |
|---|---|---|---|
| HalfCheetah-v4 | ~12000 | ~10000 | ~8000 |
| Hopper-v4 | ~3500 | ~3300 | ~2500 |
| Walker2d-v4 | ~5500 | ~4500 | ~4000 |
| Ant-v4 | ~6000 | ~5000 | ~4500 |
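8. Practical Tips
8.1 General Tips
| Technique | Notes |
|---|---|
| Reward scaling | Scale rewards into a reasonable range such as [-10, 10] |
| State normalization | Normalize states with a RunningMeanStd estimator |
| Gradient clipping | clip_grad_norm_(params, 1.0) |
| Learning-rate scheduling | Lower the learning rate late in training |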
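The first two rows of the table can be sketched in a few lines. The `RunningMeanStd` below is a scalar illustration based on Welford's online algorithm, not the implementation from any particular library, and the reward scale factor in `scale_reward` is a task-specific assumption.

```python
import math


class RunningMeanStd:
    """Running mean and variance via Welford's algorithm (scalar version for brevity)."""

    def __init__(self, eps=1e-8):
        self.count, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)    # accumulates squared deviations

    @property
    def var(self):
        return self.m2 / self.count if self.count > 0 else 1.0

    def normalize(self, x):
        return (x - self.mean) / math.sqrt(self.var + self.eps)


def scale_reward(r, scale=0.1, low=-10.0, high=10.0):
    """Scale then clip a reward; the scale factor is a task-specific assumption."""
    return max(low, min(high, r * scale))


rms = RunningMeanStd()
for obs in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    rms.update(obs)

print(rms.mean, rms.var, round(rms.normalize(7.0), 3), scale_reward(500.0))
```

In practice the statistics are kept per state dimension and updated only with data from training rollouts, so that evaluation uses the same normalization.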
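8.2 SAC-Specific Tips

    # Target entropy setting
    target_entropy = -np.prod(action_dim)  # default: -dim(A)
    # Some tasks may need a different value
    target_entropy = -action_dim * 0.5  # higher target entropy: more exploration
    target_entropy = -action_dim * 1.5  # lower target entropy: less exploration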
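8.3 Debugging Checklist
- ✅ Check the reward scale
- ✅ Monitor Q-values (they should not diverge)
- ✅ Monitor policy entropy (it should not drop too quickly)
- ✅ Check that actions are clipped to the valid range
- ✅ Verify the environment's state and action dimensions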
9. Summary
This installment covered three major algorithms built on the Actor-Critic architecture:
Key points:
- DDPG: DQN + deterministic policy gradient, the pioneering work
- TD3: twin Q-networks + delayed updates + target smoothing, fixing DDPG's problems
- SAC: maximum-entropy framework + stochastic policy, the current best practice
Key formulas:
Deterministic policy gradient:
$$\nabla_\theta J = \mathbb{E} \left[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q(s, a)\big|_{a=\mu_\theta(s)} \right]$$
SAC objective:
$$J(\pi) = \sum_t \mathbb{E} \left[ r_t + \alpha \mathcal{H}(\pi(\cdot|s_t)) \right]$$
TD3 target value:
$$y = r + \gamma \min_{i=1,2} Q_{\phi'_i}(s', \tilde{a}')$$
Recommendations:
- Use SAC by default
- TD3 is fine for simple tasks
- DDPG is mainly useful for understanding the fundamentals
References
- Lillicrap, T. P., et al. (2016). Continuous control with deep reinforcement learning. ICLR.
- Fujimoto, S., et al. (2018). Addressing Function Approximation Error in Actor-Critic Methods. ICML.
- Haarnoja, T., et al. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning. ICML.
- Haarnoja, T., et al. (2019). Soft Actor-Critic Algorithms and Applications. arXiv.
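Next up: Deep Learning-Driven Control Methods Explained (7): Model-Based Deep Learning Control
We will look at how to combine neural-network dynamics models with model predictive control (MPC).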