Welcome to rlpack's documentation!¶
Introduction¶
rlpack is a TensorFlow-based reinforcement learning algorithm library. It decouples algorithms from environments, making them easy to use.
Features:
- Lightweight: depends only on TensorFlow and NumPy.
- Decouples environments and algorithms for easy use.
- Provides an example of multi-process interaction with environments for sampling.
Usage¶
The following shows how to use rlpack to run the PPO algorithm on a MuJoCo environment.
import argparse
import time
from collections import namedtuple
import gym
import numpy as np
import tensorflow as tf
from rlpack.algos import PPO
from rlpack.utils import mlp, mlp_gaussian_policy
parser = argparse.ArgumentParser(description="Run PPO on a MuJoCo environment.")
parser.add_argument('--env', type=str, default="Reacher-v2")
args = parser.parse_args()
Transition = namedtuple('Transition', ('state', 'action', 'reward', 'done', 'early_stop', 'next_state'))
class Memory(object):
    def __init__(self):
        self.memory = []

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self):
        return Transition(*zip(*self.memory))


def policy_fn(x, a):
    return mlp_gaussian_policy(x, a, hidden_sizes=[64, 64], activation=tf.tanh)


def value_fn(x):
    v = mlp(x, [64, 64, 1])
    return tf.squeeze(v, axis=1)


def run_main():
    env = gym.make(args.env)
    dim_obs = env.observation_space.shape[0]
    dim_act = env.action_space.shape[0]
    max_ep_len = 1000

    agent = PPO(dim_act=dim_act, dim_obs=dim_obs, policy_fn=policy_fn, value_fn=value_fn, save_path="./log/ppo")
    start_time = time.time()
    o, ep_ret, ep_len = env.reset(), 0, 0
    for epoch in range(50):
        memory, ep_ret_list, ep_len_list = Memory(), [], []
        for t in range(1000):
            a = agent.get_action(o[np.newaxis, :])[0]
            nexto, r, d, _ = env.step(a)
            ep_ret += r
            ep_len += 1

            memory.push(o, a, r, int(d), int(ep_len == max_ep_len or t == 1000 - 1), nexto)
            o = nexto

            terminal = d or (ep_len == max_ep_len)
            if terminal or (t == 1000 - 1):
                if not terminal:
                    print('Warning: trajectory cut off by epoch at %d steps.' % ep_len)
                if terminal:
                    # Record the episode result when a terminal state or the maximum episode length is reached.
                    ep_ret_list.append(ep_ret)
                    ep_len_list.append(ep_len)
                o, ep_ret, ep_len = env.reset(), 0, 0

        print(f"{epoch}th epoch. average_return={np.mean(ep_ret_list)}, average_len={np.mean(ep_len_list)}")

        # Update the policy.
        batch = memory.sample()
        agent.update([np.array(x) for x in batch])

    elapsed_time = time.time() - start_time
    print("elapsed time:", elapsed_time)


if __name__ == "__main__":
    run_main()
Installation¶
Python 3.6+ is required.
- Install dependencies
The required dependencies are listed in environment.yml. We recommend using Anaconda to set up the Python environment; it can be installed with the following commands.
$ git clone https://github.com/liber145/rlpack
$ cd rlpack
$ conda env create -f environment.yml
$ conda activate py36
- Install rlpack
$ python setup.py install
The steps above also install gym, a commonly used reinforcement learning environment suite. gym additionally supports more complex environments such as MuJoCo; see the gym documentation for details.
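To verify the installation, a minimal check such as the following can be run inside the activated environment (CartPole-v1 is just an illustrative task that does not require a MuJoCo license):
import gym
import rlpack.algos  # raises ImportError if rlpack was not installed correctly

env = gym.make("CartPole-v1")
print(env.observation_space, env.action_space)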
Benchmarks¶
MuJoCo Games¶
| Environment | DDPG | TRPO | PPO | TD3 |
| Ant-v2 | 609.61 | 969.08 | 1769.52 | |
| HalfCheetah-v2 | 667.06 | 2607.94 | 6108.17 | |
| Hopper-v2 | 1460.93 | 2100.74 | 2515.44 | |
| Humanoid-v2 | 339.35 | 458.59 | 278.14 | |
| HumanoidStandup-v2 | 58715.82 | 81282.21 | 84551.70 (1M: 90523.85) | |
| InvertedDoublePendulum-v2 | 8131.25 | 6606.13 | 8342.53 (1M: 8925.03) | |
| InvertedPendulum-v2 | 900.08 | 943.31 | 940.33 (1M: 972.17) | |
| Reacher-v2 | -13.96 | -10.08 | -10.34 | -9.94 |
| Swimmer-v2 | 38.05 | 44.17 | 43.55 | |
| Walker2d-v2 | 493.14 (1M: 1373.30) | 1138.25 | 1008.70 (1M: 3394.72) | |
Performance after 500,000 sample steps; entries marked "1M" report the result after 1,000,000 sample steps.
DQN¶
DQN is an off-policy algorithm.
Quick Review¶
DQN sparked the recent wave of deep reinforcement learning. It brings deep learning into Q-learning and proposes two key ideas to keep the learning from diverging [1].
The optimization objective of DQN at iteration \(i\) is
\(L_i(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim U(D)} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i) \right)^2 \right].\)
The two key ingredients appear in the above equation:
- \(U(D)\) means that experienced transitions \((s, a, r, s')\) are sampled uniformly from an experience replay buffer \(D\). This alleviates the correlations in the observed sequence and smooths over changes in the data distribution.
- \(Q(s', a'; \theta_i^-)\) is a target action-value function, which helps reduce correlations with the target.
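The sketch below shows, in NumPy, how the target values in this objective are typically formed from a sampled batch; the array names and the discount value are illustrative assumptions, not rlpack's internal API.
import numpy as np

def dqn_targets(rewards, dones, q_next_target, gamma=0.99):
    # rewards, dones: r and terminal flags of transitions sampled uniformly from the buffer D
    # q_next_target: Q(s', a'; theta_i^-) for every action a', from the frozen target network
    # y = r + gamma * max_a' Q(s', a'; theta_i^-), dropping the bootstrap at terminal states
    return rewards + gamma * (1.0 - dones) * q_next_target.max(axis=1)

# The online network Q(s, a; theta_i) is then regressed toward y with a squared-error loss.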
Reference¶
[1] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529.
A2C¶
Advantage Actor Critic (A2C) is an on-policy algorithm.
Quick Review¶
First, let's look at the REINFORCE algorithm, a Monte Carlo policy gradient algorithm. It follows the policy gradient
\(\nabla_\theta J(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi} \left[ q(s_t, a_t) \nabla_\theta \ln \pi(a_t | s_t; \theta) \right],\)
where the expected \(q(s_t, a_t)\) is estimated by the Monte Carlo return.
Note that \(\mathbb{E}_{(s_t, a_t) \sim \pi} \left[ b(s_t) \nabla_\theta \ln \pi(a_t | s_t; \theta) \right] = 0\), so we have
\(\nabla_\theta J(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi} \left[ \big( q(s_t, a_t) - b(s_t) \big) \nabla_\theta \ln \pi(a_t | s_t; \theta) \right].\)
The term \(b(s_t)\), called the baseline, is usually estimated by the state value \(v(s_t)\). The residual \(q(s_t, a_t) - v(s_t)\) is called the advantage. In general, the baseline leaves the expected value of the update unchanged, but it can have a large effect on reducing its variance [1].
Now let's turn to advantage actor critic (A2C). Instead of Monte Carlo returns, A2C uses an approximate state-value function to estimate \(v(s_t; \theta)\), and derives the action value by bootstrapping, \(q(s_t, a_t) = r_t + \gamma v(s_{t+1}; \theta)\). The critic updates the value function from the TD error; the actor updates the policy by the policy gradient.
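The following NumPy sketch puts these pieces together; the function and array names are illustrative assumptions rather than rlpack's API.
import numpy as np

def a2c_losses(rewards, values, next_values, dones, logp_actions, gamma=0.99):
    # one-step bootstrap estimate of the action value: q(s_t, a_t) = r_t + gamma * v(s_{t+1})
    q = rewards + gamma * (1.0 - dones) * next_values
    advantages = q - values                            # A(s_t, a_t) = q(s_t, a_t) - v(s_t)
    critic_loss = np.mean(advantages ** 2)             # critic: squared TD error
    actor_loss = -np.mean(logp_actions * advantages)   # actor: policy gradient with baseline v(s_t)
    return actor_loss, critic_loss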
Reference¶
[1] Sutton, Richard S., and Andrew G. Barto. "Reinforcement Learning: An Introduction." (1998).
TRPO¶
TRPO (Trust Region Policy Optimization) is a classic reinforcement learning algorithm. When a policy gradient method updates its policy, choosing a step size that guarantees the cumulative reward increases is a key problem. By restricting the new policy to a neighborhood of the old policy, TRPO offers
- a theoretical analysis of monotonic improvement of the cumulative reward,
- good empirical performance.
Optimization Objective¶
The cumulative reward of a policy \(\pi\) is defined as \(J(\pi) = \mathbb{E}_{s_0, a_0, ... \sim \pi} \sum_{t=0}^\infty \gamma^t r(s_t, a_t)\). Kakade and Langford (2002) analyzed the difference in cumulative reward between two policies \(\tilde{\pi}\) and \(\pi\),
\(J(\tilde{\pi}) - J(\pi) = \mathbb{E}_{s_0, a_0, ... \sim \tilde{\pi}} \sum_{t=0}^\infty \gamma^t A_\pi(s_t, a_t) = \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a|s) A_\pi(s, a),\)
where \(A_\pi(s_t, a_t)\) is the advantage function, \(A_\pi(s_t, a_t) = Q_\pi(s_t, a_t) - V_\pi(s_t)\), and \(\rho_{\tilde{\pi}}(s)\) is the discounted state visitation distribution under \(\tilde{\pi}\).
Therefore, given the current policy \(\pi\), we can improve it by increasing this difference term. In practice, the action distribution \(\tilde{\pi}(a|s)\) can be handled by importance sampling,
but the state distribution \(\rho_{\tilde{\pi}}(s)\) is hard to handle this way, because it depends on the whole decision sequence and the probability dependency runs deep. TRPO approximates it with the state distribution \(\rho_{\pi}(s)\) of the old policy. The objective thus becomes maximizing the following approximation of the difference in cumulative reward,
\(L_\pi(\tilde{\pi}) = \sum_s \rho_{\pi}(s) \sum_a \tilde{\pi}(a|s) A_\pi(s, a) = \mathbb{E}_{s \sim \rho_\pi, a \sim \pi} \left[ \frac{\tilde{\pi}(a|s)}{\pi(a|s)} A_\pi(s, a) \right].\)
This objective is the same as that of a vanilla actor critic, which shows that vanilla actor critic also optimizes an approximate objective. TRPO further adds a KL-divergence constraint on the policy update, giving the final optimization problem,
\(\max_{\tilde{\pi}} \; L_\pi(\tilde{\pi}) \quad \text{s.t.} \quad D_{KL}^{\max}(\pi, \tilde{\pi}) \leq \epsilon.\)
Theoretical Analysis¶
Optimizing an approximate objective raises two problems:
- we do not know whether the update direction is correct,
- we do not know how to pick a proper step size.
TRPO establishes the following bound [1],
\(J(\tilde{\pi}) - J(\pi) \geq L_\pi(\tilde{\pi}) - C \, D_{KL}^{\max}(\pi, \tilde{\pi}), \quad C = \frac{4 \gamma \max_{s,a} |A_\pi(s, a)|}{(1-\gamma)^2}.\)
This inequality links the gain in cumulative reward \(J(\tilde{\pi}) - J(\pi)\) to the approximate objective \(L_\pi(\tilde{\pi})\). As a result, we no longer need to worry about the two problems above: it suffices to optimize the right-hand side of the inequality. Note that when actually solving the optimization problem, we further approximate the policy constraint by replacing the maximum over the KL divergence with its mean.
Computation¶
To solve the optimization problem, we take a first-order Taylor approximation of the objective, obtaining
\(L_{\pi_{\theta_{old}}}(\pi_\theta) \approx g^\top (\theta - \theta_{old}) + K_0,\)
where \(g\) is the expectation of the derivative of \(A_{\pi_{\theta_{old}}}(s,a) \pi_\theta(a|s) / \pi_{\theta_{old}}(a|s)\) at \(\theta = \theta_{old}\), and \(K_0\) is a constant independent of \(\theta\). Applying a second-order Taylor approximation to the policy constraint gives
\(\bar{D}_{KL}(\pi_{\theta_{old}}, \pi_\theta) \approx \frac{1}{2} (\theta - \theta_{old})^\top H (\theta - \theta_{old}) + K_1,\)
where \(H\) is the second derivative of the left-hand side at \(\theta = \theta_{old}\) and \(K_1\) is a constant independent of \(\theta\). Note that the right-hand side has no first-order term, because the first-order term of the left-hand side vanishes at \(\theta = \theta_{old}\). In implementations, the expectations in these first- and second-order derivatives are approximated with sampled data.
Dropping the constants independent of \(\theta\), we obtain the following optimization problem,
\(\max_\theta \; g^\top (\theta - \theta_{old}) \quad \text{s.t.} \quad \frac{1}{2} (\theta - \theta_{old})^\top H (\theta - \theta_{old}) \leq \epsilon.\)
It can be converted into the equivalent min-max problem,
\(\min_{\lambda \geq 0} \max_\theta \; L(\theta, \lambda) = g^\top (\theta - \theta_{old}) - \lambda \left( \frac{1}{2} (\theta - \theta_{old})^\top H (\theta - \theta_{old}) - \epsilon \right).\)
Next we solve it using the KKT conditions. From the stationarity of \(L(\theta, \lambda)\) we get \(\partial L/\partial \theta = 0\), which yields \(\theta = \theta_{old} + \lambda^{-1} H^{-1} g\). Substituting this into \(\partial L/\partial \lambda = 0\) gives \(\lambda = \sqrt{ (g^\top H^{-1} g)/(2\epsilon) }\). The solution is therefore \(\theta = \theta_{old} + \sqrt{ 2\epsilon (g^\top H^{-1} g)^{-1} } H^{-1} g\).
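In practice, \(H^{-1} g\) is obtained with the conjugate gradient method using Hessian-vector products rather than by inverting \(H\). Below is a minimal NumPy sketch of this step; hvp (a callback returning \(Hv\)) and the trust-region radius epsilon are assumptions for illustration, not rlpack's internal API.
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    # Approximately solve H x = g using only Hessian-vector products hvp(v) = H v.
    x = np.zeros_like(g)
    r = g.copy()
    p = g.copy()
    rs = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def trpo_step(theta_old, g, hvp, epsilon=0.01):
    # theta = theta_old + sqrt(2 * epsilon / (g^T H^-1 g)) * H^-1 g
    Hinv_g = conjugate_gradient(hvp, g)
    return theta_old + np.sqrt(2.0 * epsilon / (g @ Hinv_g)) * Hinv_g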
Reference¶
[1] Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.
PPO¶
PPO (Proximal Policy Optimization) simplifies the complicated computation in TRPO, which lowers both the computational cost and the implementation difficulty.
Optimization Objective¶
PPO simplifies the optimization problem in TRPO into
\(\max_\theta \; \mathbb{E}_{(s, a) \sim \pi_{\theta_{old}}} \left[ \min\left( r(\theta) A_{\pi_{\theta_{old}}}(s, a), \; \mathrm{clip}\big(r(\theta), 1-\epsilon, 1+\epsilon\big) A_{\pi_{\theta_{old}}}(s, a) \right) \right], \quad r(\theta) = \frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}.\)
Following the idea in TRPO, this keeps the new policy within a neighborhood of the old policy: first, the clip operation bounds the ratio between the new and old action probabilities, which yields an approximate objective; then the min operation makes the final objective a lower bound of the true objective. Finally, solving the optimization problem raises this lower bound and thereby improves the objective.
Intuition¶
The advantage function is defined as \(A_{\pi_{old}}(s,a) = Q_{\pi_{old}}(s,a) - V_{\pi_{old}}(s)\); it measures how much better the sampled action is than the average action.
- When \(A > 0\), the advantage is positive: the current policy already acts correctly in this state, and there is no need to over-correct on this sample. The combination of the min and clip operations therefore truncates the ratio at \(1+\epsilon\) when it exceeds \(1+\epsilon\), and keeps it unchanged otherwise. This limits the size of the update.
- When \(A < 0\), the advantage is negative (the objective term becomes \(\max(r, \mathrm{clip}(r, 1-\epsilon, 1+\epsilon))\,A\), where \(r\) denotes the ratio): the current policy performs poorly in this state, and it should be corrected on this sample. The combination of the min and clip operations keeps the ratio at \(1-\epsilon\) when it falls below \(1-\epsilon\), and keeps it unchanged otherwise. This allows the update to be large.
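A minimal NumPy sketch of the clipped objective described above; the log-probability inputs and epsilon=0.2 are illustrative assumptions, not rlpack's internals.
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
    ratio = np.exp(logp_new - logp_old)                      # r = pi_theta(a|s) / pi_theta_old(a|s)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # elementwise min keeps the objective a lower bound of r * A
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))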
Reference¶
[1] Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
DDPG¶
DDPG is an off-policy algorithm.
Quick Review¶
DDPG is the deep learning version of the deterministic policy gradient (DPG) algorithm [2]. DPG considers the policy gradient in the setting of a deterministic policy.
Similar to the policy gradient theorem, [2] gives a deterministic policy gradient theorem,
\(\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^\mu} \left[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q^\mu(s, a) \big|_{a = \mu_\theta(s)} \right].\)
The action-value update minimizes the TD error between the target value and the current value, as usual.
Implementation¶
The policy update can be rewritten as \(\nabla_\theta Q(s, \mu_\theta(s))\). We can therefore write the policy loss as \(\mathbb{E}_s [-Q(s, \mu_\theta(s))]\), then pick an optimizer and perform gradient descent on the policy loss and the value loss alternately.
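A minimal NumPy sketch of the two losses; the target-network values and array names are illustrative assumptions, not rlpack's API.
import numpy as np

def ddpg_losses(rewards, dones, q_values, q_of_mu, q_next_target, gamma=0.99):
    # target: y = r + gamma * Q'(s', mu'(s')), computed with the target networks
    y = rewards + gamma * (1.0 - dones) * q_next_target
    value_loss = np.mean((y - q_values) ** 2)   # TD error between target value and current value
    policy_loss = -np.mean(q_of_mu)             # E_s[-Q(s, mu_theta(s))]
    return policy_loss, value_loss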
Given a state, directly taking the action inferred by the current policy provides no exploration. [1] adds noise from an Ornstein-Uhlenbeck process to generate temporally correlated exploration.
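A minimal sketch of such a noise process follows; the coefficients (theta=0.15, sigma=0.2) are common defaults assumed here for illustration, not values prescribed by rlpack.
import numpy as np

class OrnsteinUhlenbeckNoise:
    # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, I): samples are temporally correlated.
    def __init__(self, dim_act, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(dim_act, mu, dtype=np.float64)

    def sample(self):
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x

# exploration: a = deterministic action mu_theta(s) + noise.sample()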
Reference¶
[1] Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
[2] Silver, David, et al. "Deterministic policy gradient algorithms." ICML. 2014.