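Deep-Learning-Driven Control Methods Explained (6): Deep Reinforcement Learning Control (3): Actor-Critic Architectures

Series guide: this installment covers DDPG, TD3, and SAC, efficient continuous-control algorithms that are among the most widely used deep reinforcement learning methods in robot control.


1. Introduction: Combining Value Functions and Policy Gradients

The previous two installments covered:

  • Value-function methods (DQN): low variance, but limited to discrete actions
  • Policy-gradient methods (PPO): support continuous actions, but with higher variance

Actor-Critic architectures combine the strengths of both:

$$\underbrace{\text{Actor}}_{\text{policy network}} + \underbrace{\text{Critic}}_{\text{value network}} = \text{low-variance continuous control}$$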


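2. Actor-Critic Basics

2.1 Architecture Overview

[Diagram: the state s feeds the Actor π_θ, which outputs the action a; the Critic Q_φ takes the state and the action and outputs the Q-value.]

Actor: outputs an action $a = \pi_\theta(s)$ or a distribution $\pi_\theta(a|s)$

Critic: estimates the value of a state-action pair, $Q_\phi(s, a)$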

2.2 Update Rules

Critic update (TD learning):

$$\phi \leftarrow \phi - \alpha_Q \nabla_\phi \left( Q_\phi(s, a) - y \right)^2$$

where the target is $y = r + \gamma Q_{\phi'}(s', a')$, with $a'$ the action taken in the next state $s'$ and $\phi'$ the target-network parameters.
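To make the critic update concrete, here is a minimal sketch with a one-parameter critic. The linear form Q_φ(s, a) = φ·s·a, the sample transition, and the step size are assumptions chosen only for illustration; a real critic is a neural network trained on mini-batches.

```python
def q_value(phi, s, a):
    """Hypothetical one-parameter critic: Q_phi(s, a) = phi * s * a."""
    return phi * s * a


def td_target(r, gamma, phi_target, s_next, a_next):
    """y = r + gamma * Q_{phi'}(s', a')."""
    return r + gamma * q_value(phi_target, s_next, a_next)


def critic_step(phi, phi_target, transition, gamma=0.99, lr=0.1):
    """One gradient step on (Q_phi(s, a) - y)^2 with y held fixed."""
    s, a, r, s_next, a_next = transition
    y = td_target(r, gamma, phi_target, s_next, a_next)
    td_error = q_value(phi, s, a) - y
    grad = 2.0 * td_error * s * a  # d/dphi of (phi * s * a - y)^2
    return phi - lr * grad, td_error


phi, phi_target = 0.0, 0.0
transition = (1.0, 1.0, 1.0, 1.0, 1.0)  # (s, a, r, s', a'), made-up numbers
for _ in range(3):
    phi, err = critic_step(phi, phi_target, transition)
    print(f"phi={phi:.3f}  td_error={err:.3f}")
```

Because the target parameter is held at 0 here, the target stays at y = 1 and φ closes a fixed fraction of the remaining TD error at each step.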

Actor update (policy gradient):

$$\theta \leftarrow \theta + \alpha_\pi \nabla_\theta Q_\phi(s, \pi_\theta(s))$$

2.3 Deterministic Policy Gradient

For a deterministic policy $\mu_\theta(s)$, the deterministic policy gradient (DPG) theorem states:

$$\nabla_\theta J = \mathbb{E}_{s \sim \rho} \left[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q(s, a)\big|_{a=\mu_\theta(s)} \right]$$

Key insight: the expectation is over states only, so no integral over the action space is needed, which makes the gradient cheaper to estimate.
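The chain rule in the DPG formula can be checked on a toy problem. The sketch below assumes a scalar linear policy μ_θ(s) = θ·s and a hand-written critic Q(s, a) = −(a − 2s)², both hypothetical; the best policy for this critic is a = 2s, so gradient ascent should drive θ toward 2.

```python
def mu(theta, s):
    """Hypothetical deterministic policy: a = theta * s."""
    return theta * s


def dq_da(s, a):
    """Gradient of the toy critic Q(s, a) = -(a - 2s)^2 with respect to a."""
    return -2.0 * (a - 2.0 * s)


def dpg_gradient(theta, states):
    """Sample average of grad_theta mu(s) * grad_a Q(s, a) at a = mu(s)."""
    total = 0.0
    for s in states:
        dmu_dtheta = s  # derivative of theta * s with respect to theta
        total += dmu_dtheta * dq_da(s, mu(theta, s))
    return total / len(states)


states = [0.5, 1.0, 1.5]  # stand-in for states drawn from rho
theta = 0.0
for _ in range(200):
    theta += 0.1 * dpg_gradient(theta, states)  # gradient ascent on J
print(round(theta, 4))
```

Note that the gradient estimate only averages over sampled states; the action is always the single value μ_θ(s).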


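3. DDPG: Deep Deterministic Policy Gradient

3.1 Core Idea

DDPG (Deep Deterministic Policy Gradient) carries DQN's techniques over to continuous control:

| DQN technique | Use in DDPG |
| --- | --- |
| Experience replay | ✅ Same |
| Target network | ✅ One each for the Actor and the Critic |
| Greedy policy | Deterministic policy + exploration noise |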

3.2 Algorithm Components

Network structure

  • Actor network $\mu_\theta(s)$: outputs a deterministic action
  • Critic network $Q_\phi(s, a)$: estimates the Q-value
  • Target Actor $\mu_{\theta'}$
  • Target Critic $Q_{\phi'}$

Exploration mechanism

$$a = \mu_\theta(s) + \mathcal{N}(0, \sigma)$$

Ornstein-Uhlenbeck (OU) noise or Gaussian noise is commonly used.
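Gaussian noise is independent at every step, whereas OU noise is correlated over time and drifts back toward a mean. A minimal sketch of a discretized OU process follows; the class name and the parameter values (theta, sigma, dt) are illustrative assumptions, not settings taken from this article.

```python
import random


class OUNoise:
    """Discretized Ornstein-Uhlenbeck process:
    x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=None):
        self.dim, self.mu, self.theta, self.sigma, self.dt = dim, mu, theta, sigma, dt
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Call at the start of each episode
        self.x = [self.mu] * self.dim

    def sample(self):
        self.x = [
            xi + self.theta * (self.mu - xi) * self.dt
            + self.sigma * (self.dt ** 0.5) * self.rng.gauss(0.0, 1.0)
            for xi in self.x
        ]
        return list(self.x)


noise = OUNoise(dim=2, seed=0)
samples = [noise.sample() for _ in range(3)]
print(samples)
```

Each sample would be added to the actor output before clipping to the action range, in the same place where the Gaussian term appears in the formula above.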

3.3 The DDPG Algorithm

Algorithm: DDPG
Input: Actor μ_θ, Critic Q_φ, target network parameters θ', φ'
Output: trained policy

Initialize replay buffer D
Initialize target networks: θ' ← θ, φ' ← φ
for episode = 1 to M:
    Initialize noise process N
    Get initial state s
    for t = 1 to T:
        Select action: a = μ_θ(s) + N_t
        Execute a, observe r, s'
        Store (s, a, r, s') in D
        Sample a mini-batch from D
        Compute target: y = r + γ Q_{φ'}(s', μ_{θ'}(s'))
        Update Critic: φ ← φ - α_Q ∇_φ (Q_φ(s,a) - y)²
        Update Actor: θ ← θ + α_π ∇_θ Q_φ(s, μ_θ(s))
        Soft-update target networks:
            θ' ← τθ + (1-τ)θ'
            φ' ← τφ + (1-τ)φ'
        s ← s'
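The soft update in the last step is an exponential moving average of the online parameters. The sketch below applies it to plain Python lists standing in for network weights (an assumption for illustration) to show how slowly the target tracks the online network when τ is small.

```python
def soft_update(target, source, tau):
    """Return tau * source + (1 - tau) * target, element by element."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]


online = [1.0, -2.0]   # stand-in for the online parameters (held fixed here)
target = [0.0, 0.0]    # stand-in for the target parameters
tau = 0.005

for step in range(1, 1001):
    target = soft_update(target, online, tau)
    if step in (1, 100, 1000):
        print(step, [round(t, 4) for t in target])
```

With the online parameters held fixed, the remaining gap shrinks by a factor of (1 − τ) per step, so after n steps the target has covered a fraction 1 − (1 − τ)ⁿ of the distance.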

3.4 PyTorch Implementation

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque
import random

class Actor(nn.Module):
    """DDPG Actor network."""
    def __init__(self, state_dim, action_dim, hidden_dim=256, max_action=1.0):
        super().__init__()
        self.max_action = max_action
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Tanh()
        )

    def forward(self, state):
        return self.max_action * self.net(state)

class Critic(nn.Module):
    """DDPG Critic network."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class DDPG:
    """DDPG algorithm."""
    def __init__(self, state_dim, action_dim, max_action=1.0,
                 lr_actor=1e-4, lr_critic=1e-3, gamma=0.99, tau=0.005):
        self.gamma = gamma
        self.tau = tau
        self.max_action = max_action

        # Networks
        self.actor = Actor(state_dim, action_dim, max_action=max_action)
        self.actor_target = Actor(state_dim, action_dim, max_action=max_action)
        self.actor_target.load_state_dict(self.actor.state_dict())

        self.critic = Critic(state_dim, action_dim)
        self.critic_target = Critic(state_dim, action_dim)
        self.critic_target.load_state_dict(self.critic.state_dict())

        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=lr_actor)
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=lr_critic)

        self.buffer = deque(maxlen=100000)

    def select_action(self, state, noise=0.1):
        state = torch.FloatTensor(state).unsqueeze(0)
        action = self.actor(state).squeeze(0).detach().numpy()
        action = action + np.random.normal(0, noise, size=action.shape)
        return np.clip(action, -self.max_action, self.max_action)

    def update(self, batch_size=64):
        if len(self.buffer) < batch_size:
            return None, None  # not enough transitions yet; same shape as the normal return

        # Sample a mini-batch
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)

        states = torch.FloatTensor(np.array(states))
        actions = torch.FloatTensor(np.array(actions))
        rewards = torch.FloatTensor(np.array(rewards)).unsqueeze(1)
        next_states = torch.FloatTensor(np.array(next_states))
        dones = torch.FloatTensor(np.array(dones)).unsqueeze(1)

        # Update the Critic
        with torch.no_grad():
            next_actions = self.actor_target(next_states)
            target_q = self.critic_target(next_states, next_actions)
            target_q = rewards + self.gamma * target_q * (1 - dones)

        current_q = self.critic(states, actions)
        critic_loss = nn.MSELoss()(current_q, target_q)

        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # Update the Actor
        actor_loss = -self.critic(states, self.actor(states)).mean()

        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # Soft-update the target networks
        self._soft_update(self.actor_target, self.actor)
        self._soft_update(self.critic_target, self.critic)

        return actor_loss.item(), critic_loss.item()

    def _soft_update(self, target, source):
        for target_param, param in zip(target.parameters(), source.parameters()):
            target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)
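The `update` method above unpacks five fields per transition, so the caller has to append `(state, action, reward, next_state, done)` tuples to `agent.buffer`, including the `done` flag that masks the bootstrap term at terminal states. The following sketch, with made-up transitions and a deliberately tiny capacity, shows how the `deque` evicts old entries and how `zip(*batch)` regroups a sampled batch by field.

```python
import random
from collections import deque

buffer = deque(maxlen=3)  # tiny capacity to show eviction of old transitions

# Each entry is (state, action, reward, next_state, done)
for t in range(5):
    state, next_state = [float(t)], [float(t + 1)]
    done = float(t == 4)
    buffer.append((state, [0.1 * t], 1.0, next_state, done))

batch = random.sample(list(buffer), 2)
states, actions, rewards, next_states, dones = zip(*batch)
print(len(buffer), states, dones)
```

Only the three most recent transitions survive, and each of `states`, `actions`, and so on is a tuple with one entry per sampled transition, ready to be stacked into a tensor.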

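3.5 Problems with DDPG

DDPG works, but it has several significant problems:

  1. Q-value overestimation: as in DQN
  2. Hyperparameter sensitivity: noise scale, learning rates, and so on
  3. Brittleness: the policy can collapse suddenly
  4. Insufficient exploration: a deterministic policy plus noise explores inefficiently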

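4. TD3: Twin Delayed Deep Deterministic Policy Gradient

4.1 Three Improvements

TD3 (Twin Delayed DDPG) addresses DDPG's problems with three changes:

| Improvement | Problem addressed | Method |
| --- | --- | --- |
| Twin Q-networks | Q-value overestimation | Take the minimum of two Q-networks |
| Delayed policy updates | Policy oscillation | Update the Actor less often than the Critic |
| Target policy smoothing | Variance of the target value | Add noise to the target action |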

4.2 Twin Q-Networks

Two independent Critic networks are used, and the target takes the smaller of the two estimates:

$$y = r + \gamma \min_{i=1,2} Q_{\phi'_i}(s', \tilde{a}')$$

Taking the minimum effectively suppresses overestimation.
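As a small numeric illustration of the target above, the sketch below uses made-up critic outputs for a single transition. The `done` mask is an addition that mirrors the `(1 - dones)` factor in the implementation of Section 4.5; it is not part of the formula as written.

```python
def clipped_double_q_target(r, done, q1_next, q2_next, gamma=0.99):
    """y = r + gamma * (1 - done) * min(Q1', Q2')."""
    return r + gamma * (1.0 - done) * min(q1_next, q2_next)


# Made-up target-critic outputs for one transition
y = clipped_double_q_target(r=1.0, done=0.0, q1_next=10.0, q2_next=8.0)
y_terminal = clipped_double_q_target(r=1.0, done=1.0, q1_next=10.0, q2_next=8.0)
print(y, y_terminal)
```

The more optimistic estimate (10.0) is ignored, and at a terminal transition the target reduces to the reward alone.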

4.3 Target Policy Smoothing

When computing the target value, clipped noise is added to the target action:

$$\tilde{a}' = \text{clip}\left(\mu_{\theta'}(s') + \text{clip}(\epsilon, -c, c),\ a_{low},\ a_{high}\right)$$

$$\epsilon \sim \mathcal{N}(0, \sigma)$$
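The two nested clips do different jobs: the inner one bounds the perturbation, the outer one keeps the result a valid action. A minimal per-dimension sketch, assuming σ = 0.2, c = 0.5, and an action range of [−1, 1] (the same defaults as the TD3 class in Section 4.5; the function names are hypothetical):

```python
import random


def clip(x, lo, hi):
    return max(lo, min(hi, x))


def smoothed_target_action(mu_next, sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0, rng=random):
    """clip(mu' + clip(eps, -c, c), a_low, a_high) per action dimension."""
    out = []
    for m in mu_next:
        eps = clip(rng.gauss(0.0, sigma), -c, c)   # bounded perturbation
        out.append(clip(m + eps, a_low, a_high))   # keep the action valid
    return out


rng = random.Random(0)
print(smoothed_target_action([0.9, -0.2], rng=rng))
```

Averaging the target over a small neighborhood of the target action makes it harder for the policy to exploit narrow peaks in the learned Q-function.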

4.4 Delayed Updates

The Actor network is updated once every $d$ steps (typically $d = 2$), while the Critic is updated at every step.
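The schedule can be written out explicitly. This sketch only counts iterations and performs no learning; the function name is hypothetical.

```python
def update_schedule(num_iterations, policy_delay=2):
    """Return which iterations update the critic and which update the actor."""
    critic_steps, actor_steps = [], []
    for it in range(1, num_iterations + 1):
        critic_steps.append(it)            # critic: every iteration
        if it % policy_delay == 0:
            actor_steps.append(it)         # actor: every d-th iteration
    return critic_steps, actor_steps


critic_steps, actor_steps = update_schedule(6, policy_delay=2)
print(critic_steps)
print(actor_steps)
```

With d = 2 the critic gets two gradient steps for every actor step, so the actor is always updated against a somewhat more settled value estimate.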

4.5 TD3 Implementation

class TD3:
    """TD3 algorithm."""
    def __init__(self, state_dim, action_dim, max_action=1.0,
                 lr=3e-4, gamma=0.99, tau=0.005, policy_noise=0.2,
                 noise_clip=0.5, policy_delay=2):
        self.gamma = gamma
        self.tau = tau
        self.max_action = max_action
        self.policy_noise = policy_noise
        self.noise_clip = noise_clip
        self.policy_delay = policy_delay
        self.total_it = 0

        # Actor
        self.actor = Actor(state_dim, action_dim, max_action=max_action)
        self.actor_target = Actor(state_dim, action_dim, max_action=max_action)
        self.actor_target.load_state_dict(self.actor.state_dict())
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=lr)

        # Twin Critics
        self.critic1 = Critic(state_dim, action_dim)
        self.critic2 = Critic(state_dim, action_dim)
        self.critic1_target = Critic(state_dim, action_dim)
        self.critic2_target = Critic(state_dim, action_dim)
        self.critic1_target.load_state_dict(self.critic1.state_dict())
        self.critic2_target.load_state_dict(self.critic2.state_dict())
        self.critic_optimizer = optim.Adam(
            list(self.critic1.parameters()) + list(self.critic2.parameters()),
            lr=lr
        )

        self.buffer = deque(maxlen=100000)

    def select_action(self, state, noise=0.1):
        state = torch.FloatTensor(state).unsqueeze(0)
        action = self.actor(state).squeeze(0).detach().numpy()
        if noise > 0:
            action = action + np.random.normal(0, noise, size=action.shape)
        return np.clip(action, -self.max_action, self.max_action)

    def update(self, batch_size=256):
        self.total_it += 1

        if len(self.buffer) < batch_size:
            return None, None

        # Sample a mini-batch
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)

        states = torch.FloatTensor(np.array(states))
        actions = torch.FloatTensor(np.array(actions))
        rewards = torch.FloatTensor(np.array(rewards)).unsqueeze(1)
        next_states = torch.FloatTensor(np.array(next_states))
        dones = torch.FloatTensor(np.array(dones)).unsqueeze(1)

        with torch.no_grad():
            # Target policy smoothing
            noise = (torch.randn_like(actions) * self.policy_noise).clamp(
                -self.noise_clip, self.noise_clip
            )
            next_actions = (self.actor_target(next_states) + noise).clamp(
                -self.max_action, self.max_action
            )

            # Minimum of the two target Q-values
            target_q1 = self.critic1_target(next_states, next_actions)
            target_q2 = self.critic2_target(next_states, next_actions)
            target_q = torch.min(target_q1, target_q2)
            target_q = rewards + self.gamma * target_q * (1 - dones)

        # Update the Critics
        current_q1 = self.critic1(states, actions)
        current_q2 = self.critic2(states, actions)
        critic_loss = nn.MSELoss()(current_q1, target_q) + \
                      nn.MSELoss()(current_q2, target_q)

        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        actor_loss = None

        # Delayed Actor update
        if self.total_it % self.policy_delay == 0:
            actor_loss = -self.critic1(states, self.actor(states)).mean()

            self.actor_optimizer.zero_grad()
            actor_loss.backward()
            self.actor_optimizer.step()

            # Soft-update the target networks
            self._soft_update(self.actor_target, self.actor)
            self._soft_update(self.critic1_target, self.critic1)
            self._soft_update(self.critic2_target, self.critic2)

            actor_loss = actor_loss.item()

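        return actor_loss, critic_loss.item()

    def _soft_update(self, target, source):
        # Polyak averaging of the target parameters, as in DDPG._soft_update
        for target_param, param in zip(target.parameters(), source.parameters()):
            target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)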

5. SAC: Soft Actor-Critic

5.1 Maximum-Entropy Reinforcement Learning

SAC (Soft Actor-Critic) is built on the maximum-entropy framework, whose objective is to maximize expected return plus policy entropy:

$$J(\pi) = \sum_t \mathbb{E} \left[ r_t + \alpha \mathcal{H}(\pi(\cdot|s_t)) \right]$$

where the entropy term is:

$$\mathcal{H}(\pi) = -\mathbb{E}_{a \sim \pi}[\log \pi(a|s)]$$

Intuition: obtain high return while continuing to explore (the policy should not become too deterministic).
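To see what the entropy term rewards, consider a one-dimensional Gaussian policy, which is an assumption for this example. Its entropy has the closed form ½·ln(2πe·σ²), and the sketch below compares that with a Monte Carlo estimate of −E[log π(a|s)] taken directly from the definition above.

```python
import math
import random


def gaussian_entropy(sigma):
    """Closed-form entropy of a 1-D Gaussian: 0.5 * ln(2 * pi * e * sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


def gaussian_log_prob(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def sampled_entropy(mu, sigma, n=20000, seed=0):
    """Monte Carlo estimate of -E[log pi(a|s)]."""
    rng = random.Random(seed)
    return -sum(gaussian_log_prob(rng.gauss(mu, sigma), mu, sigma) for _ in range(n)) / n


for sigma in (0.1, 1.0):
    print(sigma, round(gaussian_entropy(sigma), 3), round(sampled_entropy(0.0, sigma), 3))
```

A narrow policy (σ = 0.1) has lower entropy than a wide one (σ = 1.0), so the α-weighted term pushes against the policy collapsing to a near-deterministic output too early. For continuous distributions the entropy can be negative, which is why a negative target entropy is meaningful later on.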

5.2 Soft Bellman Equation

The soft Q-function satisfies:

$$Q^*(s, a) = r + \gamma \mathbb{E}_{s'} \left[ V^*(s') \right]$$

$$V^*(s) = \mathbb{E}_{a \sim \pi^*} \left[ Q^*(s, a) - \alpha \log \pi^*(a|s) \right]$$
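The soft value is easiest to inspect with a handful of discrete actions, which is an assumption made here for illustration (SAC itself targets continuous actions). In that case the entropy-regularized optimal policy is a softmax over Q/α, and the soft value equals α·log Σ exp(Q/α). The sketch below checks this numerically with made-up Q-values.

```python
import math


def softmax_policy(q_values, alpha):
    """pi(a) proportional to exp(Q(a) / alpha)."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(weights)
    return [w / z for w in weights]


def soft_value(q_values, policy, alpha):
    """V = E_{a ~ pi}[Q(a) - alpha * log pi(a)]."""
    return sum(p * (q - alpha * math.log(p)) for p, q in zip(policy, q_values))


q_values = [1.0, 2.0, 0.5]  # made-up Q(s, a) for three actions
alpha = 0.5
pi = softmax_policy(q_values, alpha)
v_soft = soft_value(q_values, pi, alpha)
v_logsumexp = alpha * math.log(sum(math.exp(q / alpha) for q in q_values))
print(round(v_soft, 6), round(v_logsumexp, 6), max(q_values))
```

The soft value exceeds max Q by an entropy bonus; as α shrinks toward 0 it approaches the ordinary greedy value.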

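5.3 Key Properties of SAC

| Property | Description | Benefit |
| --- | --- | --- |
| Stochastic policy | Outputs an action distribution | Natural exploration |
| Entropy regularization | Encourages policy diversity | Helps avoid poor local optima |
| Automatic temperature tuning | Adapts $\alpha$ automatically | Less hyperparameter tuning |
| Twin Q-networks | Inherited from TD3 | Suppresses overestimation |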

5.4 Reparameterization Trick

To backpropagate through sampled actions, SAC uses the reparameterization trick:

$$a = \tanh\left(\mu_\theta(s) + \sigma_\theta(s) \odot \epsilon\right), \quad \epsilon \sim \mathcal{N}(0, I)$$
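Because the action is a tanh-squashed Gaussian sample, its log-density needs a change-of-variables correction: for u = μ + σ·ε and a = tanh(u), log π(a) = log N(u; μ, σ) − log(1 − tanh²(u)). The scalar sketch below (function names are hypothetical) computes this with a rearrangement of the correction term that stays finite for large |u|, and compares it against the direct formula.

```python
import math


def normal_log_prob(u, mu, sigma):
    return -0.5 * ((u - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def softplus(x):
    # Numerically stable log(1 + exp(x))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def squashed_sample_log_prob(mu, sigma, eps):
    """a = tanh(mu + sigma * eps) and log pi(a) for one action dimension."""
    u = mu + sigma * eps
    a = math.tanh(u)
    # log(1 - tanh(u)^2) rewritten as 2 * (log 2 - u - softplus(-2u))
    log_det = 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))
    return a, normal_log_prob(u, mu, sigma) - log_det


a, logp = squashed_sample_log_prob(mu=0.3, sigma=0.5, eps=1.2)
naive = normal_log_prob(0.9, 0.3, 0.5) - math.log(1.0 - math.tanh(0.9) ** 2)
print(round(a, 4), round(logp, 6), round(naive, 6))
```

The randomness enters only through ε, so a is a differentiable function of μ and σ; that is what lets the policy loss be backpropagated through the sample.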

5.5 Automatic Temperature Tuning

The temperature parameter $\alpha$ is adjusted automatically by minimizing the following objective:

$$J(\alpha) = \mathbb{E}_{a \sim \pi} \left[ -\alpha \log \pi(a|s) - \alpha \bar{\mathcal{H}} \right]$$

where $\bar{\mathcal{H}}$ is the target entropy (usually set to $-\dim(\mathcal{A})$).
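The effect of this objective is a simple feedback rule: when the policy's entropy falls below the target, α grows and the entropy bonus gets stronger; when entropy is above the target, α shrinks. The sketch below takes one gradient-descent step on log α rather than α (a common parameterization that keeps α positive, and the one used by the implementation in Section 5.6). The log-probability values and the step size are made up.

```python
import math


def alpha_step(log_alpha, log_probs, target_entropy, lr=0.01):
    """One gradient-descent step on the surrogate
    -log_alpha * mean(log_prob + target_entropy)."""
    mean_term = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -mean_term  # derivative of the surrogate with respect to log_alpha
    return log_alpha - lr * grad


target_entropy = -2.0  # -dim(A) for a hypothetical 2-D action space

# Policy entropy (-mean log_prob = -3.0) is below the target: alpha should rise
low_entropy = alpha_step(0.0, log_probs=[3.1, 2.9], target_entropy=target_entropy)
# Policy entropy (+1.0) is above the target: alpha should fall
high_entropy = alpha_step(0.0, log_probs=[-1.1, -0.9], target_entropy=target_entropy)
print(math.exp(low_entropy), math.exp(high_entropy))
```

Starting from α = 1, the first case moves α above 1 and the second moves it below 1, matching the intended feedback direction.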

5.6 Full SAC Implementation

import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Normal

class SACPolicy(nn.Module):
    """SAC stochastic policy network."""
    def __init__(self, state_dim, action_dim, hidden_dim=256, max_action=1.0):
        super().__init__()
        self.max_action = max_action

        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        self.mean = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Linear(hidden_dim, action_dim)

    def forward(self, state):
        features = self.net(state)
        mean = self.mean(features)
        log_std = self.log_std(features).clamp(-20, 2)
        return mean, log_std

    def sample(self, state):
        mean, log_std = self.forward(state)
        std = log_std.exp()
        dist = Normal(mean, std)

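        # Reparameterized sample
        x = dist.rsample()
        y = torch.tanh(x)
        action = y * self.max_action

        # Log-probability with the change-of-variables correction for
        # a = max_action * tanh(x): da/dx = max_action * (1 - tanh(x)^2)
        log_prob = dist.log_prob(x) - torch.log(self.max_action * (1 - y.pow(2)) + 1e-6)
        log_prob = log_prob.sum(dim=-1, keepdim=True)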

        return action, log_prob

    def get_action(self, state, deterministic=False):
        mean, log_std = self.forward(state)
        if deterministic:
            return torch.tanh(mean) * self.max_action
        else:
            action, _ = self.sample(state)
            return action

class SACCritic(nn.Module):
    """SAC twin Q-networks."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.q1 = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
        self.q2 = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        return self.q1(x), self.q2(x)

class SAC:
    """SAC algorithm."""
    def __init__(self, state_dim, action_dim, max_action=1.0,
                 lr=3e-4, gamma=0.99, tau=0.005, alpha=0.2, auto_alpha=True):
        self.gamma = gamma
        self.tau = tau
        self.max_action = max_action
        self.auto_alpha = auto_alpha

        # Policy network
        self.policy = SACPolicy(state_dim, action_dim, max_action=max_action)
        self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=lr)

        # Critic networks
        self.critic = SACCritic(state_dim, action_dim)
        self.critic_target = SACCritic(state_dim, action_dim)
        self.critic_target.load_state_dict(self.critic.state_dict())
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=lr)

        # Temperature parameter
        if auto_alpha:
            self.target_entropy = -action_dim
            self.log_alpha = torch.zeros(1, requires_grad=True)
            self.alpha_optimizer = optim.Adam([self.log_alpha], lr=lr)
            self.alpha = self.log_alpha.exp().item()
        else:
            self.alpha = alpha

        self.buffer = deque(maxlen=100000)

    def select_action(self, state, deterministic=False):
        state = torch.FloatTensor(state).unsqueeze(0)
        action = self.policy.get_action(state, deterministic)
        return action.squeeze(0).detach().numpy()

    def update(self, batch_size=256):
        if len(self.buffer) < batch_size:
            return {}

        # Sample a mini-batch
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)

        states = torch.FloatTensor(np.array(states))
        actions = torch.FloatTensor(np.array(actions))
        rewards = torch.FloatTensor(np.array(rewards)).unsqueeze(1)
        next_states = torch.FloatTensor(np.array(next_states))
        dones = torch.FloatTensor(np.array(dones)).unsqueeze(1)

        # Update the Critic
        with torch.no_grad():
            next_actions, next_log_probs = self.policy.sample(next_states)
            target_q1, target_q2 = self.critic_target(next_states, next_actions)
            target_q = torch.min(target_q1, target_q2) - self.alpha * next_log_probs
            target_q = rewards + self.gamma * target_q * (1 - dones)

        current_q1, current_q2 = self.critic(states, actions)
        critic_loss = nn.MSELoss()(current_q1, target_q) + \
                      nn.MSELoss()(current_q2, target_q)

        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # Update the policy
        new_actions, log_probs = self.policy.sample(states)
        q1, q2 = self.critic(states, new_actions)
        q = torch.min(q1, q2)
        policy_loss = (self.alpha * log_probs - q).mean()

        self.policy_optimizer.zero_grad()
        policy_loss.backward()
        self.policy_optimizer.step()

        # Update the temperature parameter
        alpha_loss = None
        if self.auto_alpha:
            alpha_loss = -(self.log_alpha * (log_probs + self.target_entropy).detach()).mean()

            self.alpha_optimizer.zero_grad()
            alpha_loss.backward()
            self.alpha_optimizer.step()

            self.alpha = self.log_alpha.exp().item()

        # Soft-update the target network
        for param, target_param in zip(self.critic.parameters(),
                                       self.critic_target.parameters()):
            target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)

        return {
            'critic_loss': critic_loss.item(),
            'policy_loss': policy_loss.item(),
            'alpha': self.alpha,
            'entropy': -log_probs.mean().item()
        }

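6. Algorithm Comparison and Selection

6.1 Performance Comparison

| Algorithm | Sample efficiency | Stability | Hyperparameter sensitivity | Exploration ability |
| --- | --- | --- | --- | --- |
| DDPG | | | | |
| TD3 | | | | |
| SAC | | | | |
| PPO | | | | |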

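6.2 Selection Guide

[Flowchart: Choose an algorithm → Task characteristics? Simple, low-dimensional → TD3; hard exploration → SAC (recommended first choice); many parallel environments → PPO. A further decision node, "Hyperparameter-tuning capacity?", leads to either "may try DDPG" or "use TD3".]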

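6.3 Practical Advice

Default to SAC, because:

  1. Automatic temperature tuning reduces hyperparameter tuning
  2. The stochastic policy provides natural exploration
  3. It is comparatively insensitive to hyperparameters
  4. It performs well on most continuous-control tasks

Choose TD3 when:

  • A deterministic policy is sufficient
  • Compute is limited (SAC is slightly slower)
  • You already have a DDPG code base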

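7. Control Application Examples

7.1 Robot Motion Control

# Robot-arm control example
import gymnasium as gym
import gymnasium_robotics  # assumed installed; provides the Fetch environments
import numpy as np


def flatten_obs(obs):
    # FetchReach is goal-conditioned: the policy needs the target as well as the arm state
    return np.concatenate([obs['observation'], obs['desired_goal']])


def train_robot_arm():
    # Newer gymnasium releases may also need gym.register_envs(gymnasium_robotics)
    env = gym.make('FetchReach-v2')  # robot-arm reaching task
    state_dim = (env.observation_space['observation'].shape[0]
                 + env.observation_space['desired_goal'].shape[0])
    action_dim = env.action_space.shape[0]

    agent = SAC(state_dim, action_dim)

    for episode in range(1000):
        obs, _ = env.reset()
        state = flatten_obs(obs)
        episode_reward = 0

        for step in range(50):
            action = agent.select_action(state)
            obs, reward, terminated, truncated, info = env.step(action)

            next_state = flatten_obs(obs)
            done = terminated or truncated

            # Store `terminated` only: a time-limit truncation should still bootstrap
            agent.buffer.append((state, action, reward, next_state, float(terminated)))
            agent.update()

            state = next_state
            episode_reward += reward

            if done:
                break

        print(f"Episode {episode}, Reward: {episode_reward:.2f}")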

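7.2 Continuous-Control Benchmarks

| Environment | SAC score | TD3 score | PPO score |
| --- | --- | --- | --- |
| HalfCheetah-v4 | ~12000 | ~10000 | ~8000 |
| Hopper-v4 | ~3500 | ~3300 | ~2500 |
| Walker2d-v4 | ~5500 | ~4500 | ~4000 |
| Ant-v4 | ~6000 | ~5000 | ~4500 |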

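8. Practical Tips

8.1 General Tips

| Tip | Description |
| --- | --- |
| Reward scaling | Scale rewards into a reasonable range such as [-10, 10] |
| State normalization | Normalize states with a RunningMeanStd |
| Gradient clipping | clip_grad_norm_(params, 1.0) |
| Learning-rate scheduling | Lower the learning rate late in training |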
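State normalization with running statistics can be sketched as follows. This is a scalar, pure-Python stand-in written for illustration (using Welford's online update), not the `RunningMeanStd` class of any particular library; a practical version would track one mean and variance per state dimension.

```python
class RunningMeanStd:
    """Tracks a running mean and variance of scalar observations (Welford's method)."""

    def __init__(self, eps=1e-8):
        self.count, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        # Population variance; defaults to 1.0 before any data arrives
        return self.m2 / self.count if self.count > 0 else 1.0

    def normalize(self, x):
        return (x - self.mean) / (self.var + self.eps) ** 0.5


rms = RunningMeanStd()
for obs in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    rms.update(obs)
print(rms.mean, rms.var, rms.normalize(7.0))
```

The statistics are updated as observations arrive during training, and the same normalizer has to be applied at evaluation time so the policy sees inputs on the scale it was trained on.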

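8.2 SAC-Specific Tips

# Target entropy setting
target_entropy = -np.prod(action_dim)  # default: -dim(A)

# Some tasks may need a different value
target_entropy = -action_dim * 0.5  # higher target entropy: more exploration
target_entropy = -action_dim * 1.5  # lower target entropy: less exploration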

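8.3 Debugging Checklist

  1. ✅ Check the reward scale
  2. ✅ Monitor Q-values (they should not diverge)
  3. ✅ Monitor policy entropy (it should not drop too quickly)
  4. ✅ Check that actions are clipped to the valid range
  5. ✅ Verify the environment's state and action dimensions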

9. Summary

This installment covered three algorithms built on the Actor-Critic architecture:

Key points

  1. DDPG: DQN + deterministic policy gradient, the pioneering work
  2. TD3: twin Q-networks + delayed updates + target smoothing, addressing DDPG's problems
  3. SAC: maximum-entropy framework + stochastic policy, a common default in current practice

Key formulas

Deterministic policy gradient:
$$\nabla_\theta J = \mathbb{E} \left[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q(s, a)\big|_{a=\mu_\theta(s)} \right]$$

SAC objective:
$$J(\pi) = \sum_t \mathbb{E} \left[ r_t + \alpha \mathcal{H}(\pi(\cdot|s_t)) \right]$$

TD3 target value:
$$y = r + \gamma \min_{i=1,2} Q_{\phi'_i}(s', \tilde{a}')$$

Recommendations

  • Use SAC by default
  • TD3 is adequate for simple tasks
  • DDPG is mainly useful for understanding the fundamentals

References

  1. Lillicrap, T. P., et al. (2016). Continuous control with deep reinforcement learning. ICLR.
  2. Fujimoto, S., et al. (2018). Addressing Function Approximation Error in Actor-Critic Methods. ICML.
  3. Haarnoja, T., et al. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning. ICML.
  4. Haarnoja, T., et al. (2019). Soft Actor-Critic Algorithms and Applications. arXiv.

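Next installment: Deep-Learning-Driven Control Methods Explained (7): Model-Based Deep Learning Control

We will look at how to combine neural-network dynamics models with model predictive control (MPC).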

