In previous posts (here and here), I have been covering policy gradient-based reinforcement learning methods. In this post, I will continue the series by covering another policy gradient-based method called Proximal Policy Optimization (PPO). The PPO technique was designed to make some improvements on the Trust Region Policy Optimization (TRPO) algorithm, which in turn was designed to improve on the Advantage Actor Critic (A2C) method. In this post, I'll explain the theory behind the development of Proximal Policy Optimization and provide a worked coding example using PPO to train an agent to play the Cartpole environment from the Open AI Gym.

All code contained in this post can be found on this site's Github repository here.

PPO theory

A policy gradient and A2C recap

A policy gradient-based method of reinforcement learning selects agent actions based on the output of a neural network, with each output corresponding to the probability that a certain action should be taken. This probability distribution is then sampled to produce actions during training.
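As a concrete illustration of that sampling step, here is a minimal sketch with made-up logits for a four-action toy problem (this is not part of the CartPole code later in the post, just an illustrative example):

import numpy as np

# Made-up policy network output (logits) for a single state with four possible actions
logits = np.array([0.5, 1.2, -0.3, 0.1])
probs = np.exp(logits) / np.sum(np.exp(logits))   # softmax -> action probabilities

# During training, the action is sampled from this distribution rather than always
# taking the most probable action, which gives the agent some exploration
action = np.random.choice(len(probs), p=probs)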

The A2C algorithm, within the family of policy gradient-based reinforcement learning methods, has a policy gradient term equal to:

$$\nabla_\theta J(\theta) \sim \left(\sum_{t=0}^{T-1} log P_{\pi_{\theta}}(a_t|s_t)\right)A(s_t, a_t)$$

It should be noted that the above term is for gradient ascent, not descent, i.e. this term should be maximized. Recall that the advantage is expressed as:

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$$

Which is a kind of "relative" or normalized benefit of taking action $a_t$ from state $s_t$, with the normalization arising from the overall value of being in that state $V(s_t)$ (for further discussion on the advantage and value estimates, see this post). This loss weights the log probability of taking the action with the advantage. The probability of taking the action is generated from the policy output of the neural network, with a softmax activation applied. So if an action is taken with high probability, let's say 0.8, and this results in a high advantage – then this policy output will be reinforced. If the advantage is low, alternatively, then this high probability policy output will be discouraged.
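To make the advantage weighting concrete, here is a minimal sketch with toy numbers of my own showing the contribution a single sample makes to the objective:

import numpy as np

prob = 0.8        # probability the policy assigned to the action that was taken
advantage = 2.0   # positive advantage: the action turned out better than the state's baseline value

# Per-sample policy gradient objective term (to be maximized)
objective = np.log(prob) * advantage

# With a positive advantage, gradient ascent increases log(prob) and reinforces the action;
# with a negative advantage the same update pushes the action's probability down.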

Via Bellman's equation, the advantage can actually be estimated purely from the value estimate:

$$A(s_t, a_t) = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

This enables us to create an advantage-weighted policy gradient method by also estimating the value of any given state $V(s_t)$. This requires a two-output neural network, one for estimating the action probabilities (the actor part) and the other for the state values (the critic part), with architecture as shown below:

A2C architecture

The architecture shown above is for a network whose state inputs are game pixels. For other state types, such as the state variables for the Cartpole environment, the input layers are usually densely connected. The loss function for the value output in the A2C architecture is usually just the mean-squared error between the predicted values and the actual values calculated from the discounted future rewards.

What's the disadvantage with the A2C method?

There are a couple of disadvantages of the A2C method. These can be summarised as follows:

  1. Large gradient steps can move the network weights into sub-optimal areas and wreck (or at least slow) the training progress
  2. There is poor sample efficiency – the batch samples can only inform the network training once

These disadvantages were addressed in the Trust Region Policy Optimization (TRPO) method, which ultimately led to the PPO method, and will be discussed in the following section in that context.

Trust Region Policy Optimization

There are some very interesting ideas in TRPO; however, in this post, I will only be covering them briefly without too much theory.

Improving sampling efficiency

The first problem to address is the poor sample efficiency of A2C (and other policy gradient methods). A set of training samples is collected by playing the game, which in reinforcement learning can be a relatively lengthy process, and they are only used to train the network once. This is not great from a computational perspective, especially given that reinforcement learning may require hundreds of thousands or millions of training steps to achieve good results. In policy gradient-based methods, ideally we would like to use samples generated via the rolling out of a trajectory using an older version of the policy, to train a new policy.

But how can we do this in a mathematically valid manner? It turns out we can use the concept of Importance Sampling (IS). Importance sampling allows us to compute the expected value of a function $f(x)$, where $x$ is sampled from a distribution $p(x)$, by instead sampling $x$ from another distribution $q(x)$. That's probably a bit confusing, but this is what it looks like mathematically:

$$\mathbb{E}_{x \sim p(x)} \left[f(x)\right] = \mathbb{E}_{x \sim q(x)} \left[\frac{p(x)}{q(x)} f(x) \right]$$

Where $x \sim p(x)$ in the expectation subscript refers to $x$ being sampled from $p(x)$, and the same goes for $x \sim q(x)$. We can apply this formula to the case where we might want to use actions, states and reward values sampled during the roll-out of an old policy $\pi_{\theta old}$ to train the current policy. First, recall that the policy gradient part of the loss for A2C, with the expectation still shown (before it is cashed out into an empirical sum sampled from a trajectory), looks like:

$$\nabla_\theta J(\theta) = \mathbb{E} \left[ \nabla_{\theta} log P_{\pi_{\theta}}(a_t|s_t) A(s_t, a_t)\right]$$

Now, using importance sampling, we can replace this with:

$$\nabla_\theta J(\theta) = \mathbb{E} \left[\frac{\nabla_{\theta} P_{\pi_{\theta}}(a_t|s_t)}{P_{\pi_{\theta old}}(a_t|s_t)} A_{\theta old}(s_t, a_t)\right]$$

You may be wondering where the $log$ went in the formula above; however, it turns out that the derivatives of both of these expressions are equivalent for small step sizes, as discussed in the original TRPO paper.
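The intuition comes from the standard log-derivative identity, sketched below (my own summary, not reproduced from the paper), which shows the two gradients coincide when evaluated at the old parameters:

$$\nabla_{\theta} log P_{\pi_{\theta}}(a_t|s_t) = \frac{\nabla_{\theta} P_{\pi_{\theta}}(a_t|s_t)}{P_{\pi_{\theta}}(a_t|s_t)} = \frac{\nabla_{\theta} P_{\pi_{\theta}}(a_t|s_t)}{P_{\pi_{\theta old}}(a_t|s_t)} \quad \textrm{when } \theta = \theta_{old}$$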

This use of importance sampling now allows the same batch of samples from a trajectory rollout to be used many times to train the current policy, making the policy gradient method much more sample efficient.
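As a quick numerical sanity check of the importance sampling identity itself (a toy example of my own, separate from the CartPole code), the sketch below compares a direct Monte Carlo estimate of $\mathbb{E}_{x \sim p(x)}[f(x)]$ with the importance-weighted estimate built from samples of $q(x)$:

import numpy as np

rng = np.random.default_rng(0)

# Toy distributions: p = N(0, 1), q = N(0.5, 1), and f(x) = x^2
def p(x): return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)
def q(x): return np.exp(-0.5 * (x - 0.5) ** 2) / np.sqrt(2 * np.pi)

x_q = rng.normal(0.5, 1.0, size=1_000_000)          # samples drawn from q
is_estimate = np.mean(p(x_q) / q(x_q) * x_q ** 2)   # E_q[(p/q) * f] ~ E_p[f]

x_p = rng.normal(0.0, 1.0, size=1_000_000)          # samples drawn from p
direct_estimate = np.mean(x_p ** 2)                 # E_p[f] estimated directly

print(is_estimate, direct_estimate)                 # both close to 1.0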

Improving step stability

As mentioned above, one of the issues with the A2C method of reinforcement learning is that it is unstable. In other words, the gradient steps during training can derail the agent's learning process. Consider the image below of the Hillary Step on Mount Everest:

Hillary Step – Mt Everest. By Debasish biswas kolkata – Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=46637445

Imagine the training process of the agent is to ascend the very narrow path to the summit, which represents optimal agent behavior. However, as can be observed, the path is very narrow, and a step too large in any direction will send the agent "over the cliff", perhaps permanently damaging the training process for the agent. This is maybe an extreme example, though possibly not – reinforcement learning optimization landscapes are infamously steep and "treacherous". In any case, this fact restricts both the speed of the training and the learning rates that are possible during training. If the learning rates or steps are too high, then there will be a greater risk of derailing the training process.

This is particularly a problem if we are hoping to improve sample efficiency and run multiple training steps over the same set of trajectory samples. This is where the "trust region" part of TRPO comes in. Trust region optimization involves creating a small area around the current point which specifies the maximum radius of a step that can be taken in the next gradient descent/ascent iteration. Consider the two diagrams below, from this good overview of trust-region methods:

First trust region example

Second trust region example

The dark circles around the points in the diagrams above are the trust regions, which define the boundaries of the maximum next step in whatever gradient method is being used. The trust region is either reduced or expanded depending on how well the last step performed – if it resulted in a significant improvement, the trust region size is expanded; if it resulted in a degradation in performance, the trust region size is reduced. However, the trust region size can also be a simple fixed hyper-parameter, as is the case in TRPO, as will be shown below.

The optimization problem solved in TRPO is approximately as follows:

$$J(\theta) = \mathbb{E} \left[\frac{P_{\pi_{\theta}}(a_t|s_t)}{P_{\pi_{\theta old}}(a_t|s_t)} A_{\theta old}(s_t, a_t)\right]$$

With an optimization constraint of:

$$D_{KL}(\theta | \theta_{old}) \leq \delta $$

Here $D_{KL}$ is the Kullback–Leibler divergence (KL divergence). The KL divergence is a way of measuring the distance between two probability distributions, so this constraint ensures that the new parameters of the policy aren't too different from the old parameters. In other words, this constraint enforces a trust region on the parameter / policy updates.
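As a rough illustration with made-up numbers (not taken from TRPO itself), the KL divergence between two similar categorical action distributions is small, which is why a constraint like $D_{KL} \leq \delta$ keeps successive policies close together:

import numpy as np

# Action probabilities under the old and new policies for a single state (toy values)
old_probs = np.array([0.25, 0.25, 0.25, 0.25])
new_probs = np.array([0.30, 0.20, 0.30, 0.20])

# KL(old || new): close to zero when the two policies barely differ
kl = np.sum(old_probs * np.log(old_probs / new_probs))
print(kl)   # roughly 0.02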

In the original paper, the conjugate gradient optimization method is used to solve this optimization problem with the aforementioned $D_{KL}(\theta | \theta_{old})$ based constraint. However, compared to the standard optimization methods commonly used in deep learning, the conjugate gradient is slow. Therefore, an improvement on TRPO that removes the need for strict constraints and the use of conjugate gradient optimization, while still keeping both the trust region concept and the improvement in sample efficiency, would clearly be desirable. Proximal Policy Optimization (PPO) is one way of achieving these goals.

Proximal Policy Optimization (PPO)

As shown above, the importance sampling version of the policy loss function, which enables better sample efficiency, is expressed as follows in the original PPO paper:

$$L^{CPI}(\theta) = \mathbb{E}_t \left[\frac{\pi_\theta (a_t | s_t)}{\pi_{\theta old}(a_t | s_t)}A_t\right] = \mathbb{E}_t\left[r_t(\theta)A_t\right]$$

Here the superscript CPI stands for "conservative policy iteration", and the $\pi_\theta (a_t | s_t)$ notation expresses the same probability as shown previously (i.e. $P_{\pi_{\theta}}(a_t|s_t)$). This loss function, while allowing greater sample efficiency by using importance sampling, as discussed above, will still be subject to large gradient updates derailing training progress. In order to introduce a quasi-trust region limitation, while not introducing conjugate gradient-based optimization or strict constraints, the authors of the PPO paper introduce the following alternative loss function:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[min(r_t(\theta)A_t, clip(r_t(\theta), 1 - \epsilon, 1 + \epsilon)A_t)\right]$$

Where $\epsilon$ is a hyperparameter recommended to be between 0.1 and 0.3. This CLIP loss function is essentially the same as the CPI loss function, apart from when the probability of an action under the new parameters ($\pi_\theta (a_t | s_t)$) is greater than under the old parameters by a certain degree (i.e. $r > 1 + \epsilon$), after which the loss value is clipped. This ensures that no large steps can occur in the gradient updates, which essentially enforces a trust region. The plots below for both positive and negative advantage values illustrate the behavior of $L^{CLIP}$ with respect to $r$:

PPO clipped loss function (from https://arxiv.org/pdf/1707.06347.pdf)
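The clipping behavior shown in these plots can be reproduced in a few lines of NumPy. The sketch below is my own illustration, using $\epsilon = 0.2$ (the same value used in the code later in this post):

import numpy as np

def l_clip(r, advantage, epsilon=0.2):
    # PPO clipped surrogate objective for a single sample
    return np.minimum(r * advantage, np.clip(r, 1 - epsilon, 1 + epsilon) * advantage)

# Positive advantage: increases in r beyond 1 + epsilon earn no extra objective
print(l_clip(1.5, advantage=1.0))    # 1.2, not 1.5
# Negative advantage: values of r below 1 - epsilon are likewise clipped
print(l_clip(0.5, advantage=-1.0))   # -0.8, not -0.5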

The full loss function for the PPO method consists of $L^{CLIP}$, the mean-square error loss of the value estimator, and a term to encourage higher entropy / greater exploration (with the latter two terms identical to the components in A2C). It can be expressed as follows, with signs introduced to convert a maximization function into a minimization function:

$$L_{TOTAL} = -L^{CLIP} + L^{VALUE} * k_1 - L^{ENTROPY} * k_2$$

Where the $k$ values are hyperparameters that weigh the various components of the loss against $L^{CLIP}$ (with $k_1$ usually on the order of 0.5, and $k_2$ on the order of 0.01). In the following section, I will present a code walk-through of PPO being used to train a CartPole agent.

PPO code walk-through

As stated above, all code in this walk-through can be found on this site's Github repository. The RL environment that will be used is Open AI's Cartpole environment. In this example, I'll be making use of TensorFlow's GradientTape functionality and the Keras model API.

The code below shows the establishment of some constants and the creation of the neural network model structure:

import tensorflow as tf
from tensorflow import keras
import tensorflow_probability as tfp
import numpy as np
import gym
import datetime as dt

STORE_PATH = 'C:\\Users\\andre\\TensorBoard\\PPOCartpole'
CRITIC_LOSS_WEIGHT = 0.5
ENTROPY_LOSS_WEIGHT = 0.01
ENT_DISCOUNT_RATE = 0.995
BATCH_SIZE = 64
GAMMA = 0.99
CLIP_VALUE = 0.2
LR = 0.001
NUM_TRAIN_EPOCHS = 10

env = gym.make("CartPole-v0")
state_size = 4
num_actions = env.action_space.n
ent_discount_val = ENTROPY_LOSS_WEIGHT


class Model(keras.Model):
    def __init__(self, num_actions):
        super().__init__()
        self.num_actions = num_actions
        # two common dense layers shared by the actor and critic heads
        self.dense1 = keras.layers.Dense(64, activation='relu',
                                         kernel_initializer=keras.initializers.he_normal())
        self.dense2 = keras.layers.Dense(64, activation='relu',
                                         kernel_initializer=keras.initializers.he_normal())
        self.value = keras.layers.Dense(1)
        self.policy_logits = keras.layers.Dense(num_actions)

    def call(self, inputs):
        x = self.dense1(inputs)
        x = self.dense2(x)
        return self.value(x), self.policy_logits(x)

    def action_value(self, state):
        # sample an action from the categorical distribution defined by the policy logits
        value, logits = self.predict_on_batch(state)
        dist = tfp.distributions.Categorical(logits=logits)
        action = dist.sample()
        return action, value

Most of the constants declared in the first part of the code above are familiar from the A2C tutorial – the critic and entropy loss weight values, gamma value (reward discounting factor), and so on. However, a few new values need to be discussed. First, I have included an entropy discounting rate (ENT_DISCOUNT_RATE) – after each episode, the entropy loss weight value will be discounted by this value to discourage exploration as the number of episodes played piles up. The constant CLIP_VALUE refers to the value $\epsilon$ in the $L^{CLIP}$ loss function ($\mathbb{E}_t\left[min(r_t(\theta)A_t, clip(r_t(\theta), 1 - \epsilon, 1 + \epsilon)A_t)\right]$). The value NUM_TRAIN_EPOCHS refers to the number of times a given trajectory of rewards, actions, states, etc. will be used to train the network, making use of the importance sampling configuration of the loss function.
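For a rough sense of scale, with the constants above the entropy / exploration bonus decays fairly gently, for example:

# Entropy loss weight remaining after 100 episodes of discounting
ENTROPY_LOSS_WEIGHT * ENT_DISCOUNT_RATE ** 100   # 0.01 * 0.995**100, roughly 0.006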

The Model class is a fairly standard Keras model API implementation, with two common dense layers, a separate value/critic output, and a policy logits output. For more details on this model implementation, see the A2C tutorial.

The loss functions

Next, we'll review the custom loss functions:

def critic_loss(discounted_rewards, value_est):
    return tf.cast(tf.reduce_mean(keras.losses.mean_squared_error(discounted_rewards, value_est)) * CRITIC_LOSS_WEIGHT,
                   tf.float32)

First the critic loss – this loss function calculates the mean squared error between the discounted rewards (extracted by discounting the trajectory of rewards collated by running the agent in the environment) and the predicted values (value_est) from the neural network for the trajectory of states.

Next, we look at the entropy loss:

def entropy_loss(policy_logits, ent_discount_val):
    probs = tf.nn.softmax(policy_logits)
    entropy_loss = -tf.reduce_mean(keras.losses.categorical_crossentropy(probs, probs))
    return entropy_loss * ent_discount_val

In this loss function, the policy logits directly from the neural network (and based on the trajectory of states sampled from the environment) and the discounted entropy weight (ent_discount_val) are passed as arguments. The policy logits are then converted to quasi-probabilities by applying the softmax function, then the entropy calculation is performed by applying the Keras categorical cross-entropy function on probs twice (again, for an explanation of how this works, see here). Notice the minus sign to invert the entropy calculation – in a minimization optimization calculation, this will act to raise the entropy / exploration of the agent. The discounted entropy weight value is then applied to the entropy and the product is returned.
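If it isn't obvious why passing probs as both arguments of the categorical cross-entropy function gives the entropy, a quick check with toy probabilities of my own makes it clear:

import tensorflow as tf

probs = tf.constant([[0.1, 0.2, 0.3, 0.4]])

# categorical_crossentropy(probs, probs) = -sum(p * log(p)), i.e. the entropy of the distribution
ce = tf.keras.losses.categorical_crossentropy(probs, probs)
entropy = -tf.reduce_sum(probs * tf.math.log(probs), axis=-1)

print(ce.numpy(), entropy.numpy())   # both approximately 1.28

Next comes the actor loss: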

def actor_loss(advantages, old_probs, action_inds, policy_logits):
    probs = tf.nn.softmax(policy_logits)
    new_probs = tf.gather_nd(probs, action_inds)
    ratio = new_probs / old_probs
    # PPO clipped surrogate objective, negated for minimization
    policy_loss = -tf.reduce_mean(tf.math.minimum(
        ratio * advantages,
        tf.clip_by_value(ratio, 1.0 - CLIP_VALUE, 1.0 + CLIP_VALUE) * advantages
    ))
    return policy_loss

The actor loss takes in the advantages (calculated in another function, which will be explained shortly), the old probability values (again, calculated elsewhere), the action indices, and the policy logits from the neural network model. The action indices are the array indices of the actions actually taken during the trajectory roll-out. In other words, if, for a certain state in the game, the action probabilities are [0.01, 0.3, 0.5, 0.2] and the 0.5 action is taken, the action index, in this case, is 2 (with a zero-based index system). The action_inds argument is a tensor of indices which correspond to the action taken for every record in the batch.

The first line of the function is where the policy logit values are converted to quasi-probabilities using the softmax function. These new policy logits are generated by using the most recent neural network parameters – in other words, after the softmax has been applied, they refer to the $\pi_\theta (a_t | s_t)$ values in the $L^{CLIP}$ loss function. The old_probs values refer to the $\pi_{\theta old}(a_t | s_t)$ values. We are only after the action probabilities that correspond to the actions actually taken during the agent's playing of the game. The tf.gather_nd function, therefore, takes the probs and extracts those probability outputs which correspond to the actions that were actually taken, specified by the action_inds. This action_inds based selection has already been performed on the probabilities from the neural network parameterized by the old network weight values $\theta_{old}$, as will be shown later.
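A small standalone example (with made-up probabilities, not real CartPole outputs) of how this action_inds selection works with tf.gather_nd:

import tensorflow as tf

# Softmax action probabilities for a batch of three states (toy values)
probs = tf.constant([[0.1, 0.9],
                     [0.7, 0.3],
                     [0.4, 0.6]])
actions = tf.constant([1, 0, 1])   # action actually taken for each state

# Pair each batch row index with its taken action: [[0, 1], [1, 0], [2, 1]]
action_inds = tf.stack([tf.range(actions.shape[0]), actions], axis=1)
taken_probs = tf.gather_nd(probs, action_inds)
print(taken_probs.numpy())   # [0.9, 0.7, 0.6]

Returning to the actor loss function: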

Then the r value (i.e. the $\frac{\pi_\theta (a_t | s_t)}{\pi_{\theta old}(a_t | s_t)}$ value) is computed. After this, the clipping function is applied as per the $L^{CLIP}$ calculation, shown again below for reference:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[min(r_t(\theta)A_t, clip(r_t(\theta), 1 - \epsilon, 1 + \epsilon)A_t)\right]$$

Notice the negative sign in the calculation of policy_loss – this ensures that $L^{CLIP}$ is maximized during the minimization optimization that will be performed in the TensorFlow environment.

Next, I introduce the train_model function which calls the various loss functions and makes the gradient step:

def train_model(action_inds, old_probs, states, advantages, discounted_rewards, optimizer, ent_discount_val):
    with tf.GradientTape() as tape:
        values, policy_logits = model.call(tf.stack(states))
        act_loss = actor_loss(advantages, old_probs, action_inds, policy_logits)
        ent_loss = entropy_loss(policy_logits, ent_discount_val)
        c_loss = critic_loss(discounted_rewards, values)
        tot_loss = act_loss + ent_loss + c_loss
    grads = tape.gradient(tot_loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return tot_loss, c_loss, act_loss, ent_loss

This function takes advantage of the tf.GradientTape() context manager – for more details, see this post. Within this context, the model is called on the list of states collected during the unrolling of a trajectory of an agent through the game. The states variable will be a list of input states, of length equal to BATCH_SIZE. The model returns value and policy logit estimates. These values are then passed through to the various loss functions and summed. The gradients of the trainable variables with respect to this total loss value are then calculated, and an optimization step is undertaken using the apply_gradients method.

The next part calculates the discounted rewards and the advantages:

def get_advantages(rewards, dones, values, next_value):
    discounted_rewards = np.array(rewards + [next_value[0]])
    for t in reversed(range(len(rewards))):
        discounted_rewards[t] = rewards[t] + GAMMA * discounted_rewards[t + 1] * (1 - dones[t])
    discounted_rewards = discounted_rewards[:-1]
    # advantages are bootstrapped discounted rewards - values, using Bellman's equation
    advantages = discounted_rewards - np.stack(values)[:, 0]
    # standardise advantages
    advantages -= np.mean(advantages)
    advantages /= (np.std(advantages) + 1e-10)
    # standardise rewards too
    discounted_rewards -= np.mean(discounted_rewards)
    discounted_rewards /= (np.std(discounted_rewards) + 1e-8)
    return discounted_rewards, advantages

This function has been discussed in previous tutorials, for instance, this post. In this case, both the advantages (used in the actor loss function) and the discounted rewards (used in the critic loss function) are normalized to improve training stability.
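As a quick sketch of how this function behaves, here is a toy call with made-up rewards and value estimates (not real CartPole data), assuming the constants and function definition above are in scope:

import numpy as np

# A 4-step toy trajectory that terminates on the final step
rewards = [1.0, 1.0, 1.0, 0.0]
dones = [0, 0, 0, 1]
values = [np.array([0.9]), np.array([0.8]), np.array([0.7]), np.array([0.1])]
next_value = [0.0]   # bootstrap value of the state following the batch

discounted_rewards, advantages = get_advantages(rewards, dones, values, next_value)
print(discounted_rewards.shape, advantages.shape)   # both (4,), standardised to roughly zero mean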

model = Model(num_actions)
optimizer = keras.optimizers.Adam(learning_rate=LR)

train_writer = tf.summary.create_file_writer(STORE_PATH + f"/PPO-CartPole_{dt.datetime.now().strftime('%d%m%Y%H%M')}")

num_steps = 10000000
episode_reward_sum = 0
state = env.reset()
episode = 1
total_loss = None
for step in range(num_steps):
    rewards = []
    actions = []
    values = []
    states = []
    dones = []
    probs = []
    # roll out BATCH_SIZE steps of the trajectory under the current policy
    for _ in range(BATCH_SIZE):
        _, policy_logits = model(state.reshape(1, -1))
        action, value = model.action_value(state.reshape(1, -1))
        new_state, reward, done, _ = env.step(action.numpy()[0])

        actions.append(action)
        values.append(value[0])
        states.append(state)
        dones.append(done)
        probs.append(policy_logits)
        episode_reward_sum += reward

        state = new_state
        if done:
            rewards.append(0.0)
            state = env.reset()
            if total_loss is not None:
                print(f"Episode: {episode}, latest episode reward: {episode_reward_sum}, "
                      f"total loss: {np.mean(total_loss)}, critic loss: {np.mean(c_loss)}, "
                      f"actor loss: {np.mean(act_loss)}, entropy loss {np.mean(ent_loss)}")
            with train_writer.as_default():
                tf.summary.scalar('rewards', episode_reward_sum, episode)
            episode_reward_sum = 0
            episode += 1
        else:
            rewards.append(reward)

    _, next_value = model.action_value(state.reshape(1, -1))
    discounted_rewards, advantages = get_advantages(rewards, dones, values, next_value[0])

    actions = tf.squeeze(tf.stack(actions))
    probs = tf.nn.softmax(tf.squeeze(tf.stack(probs)))
    action_inds = tf.stack([tf.range(0, actions.shape[0]), tf.cast(actions, tf.int32)], axis=1)

    total_loss = np.zeros((NUM_TRAIN_EPOCHS))
    act_loss = np.zeros((NUM_TRAIN_EPOCHS))
    c_loss = np.zeros((NUM_TRAIN_EPOCHS))
    ent_loss = np.zeros((NUM_TRAIN_EPOCHS))
    # train on the same batch NUM_TRAIN_EPOCHS times, using the old probabilities
    for epoch in range(NUM_TRAIN_EPOCHS):
        loss_tuple = train_model(action_inds, tf.gather_nd(probs, action_inds),
                                 states, advantages, discounted_rewards, optimizer,
                                 ent_discount_val)
        total_loss[epoch] = loss_tuple[0]
        c_loss[epoch] = loss_tuple[1]
        act_loss[epoch] = loss_tuple[2]
        ent_loss[epoch] = loss_tuple[3]
    ent_discount_val *= ENT_DISCOUNT_RATE

    with train_writer.as_default():
        tf.summary.scalar('tot_loss', np.mean(total_loss), step)
        tf.summary.scalar('critic_loss', np.mean(c_loss), step)
        tf.summary.scalar('actor_loss', np.mean(act_loss), step)
        tf.summary.scalar('entropy_loss', np.mean(ent_loss), step)

The majority of this trajectory roll-out / training loop has been covered in this post. The main difference is the loop over NUM_TRAIN_EPOCHS:

for epoch in range(NUM_TRAIN_EPOCHS):
    loss_tuple = train_model(action_inds, tf.gather_nd(probs, action_inds),
                             states, advantages, discounted_rewards, optimizer,
                             ent_discount_val)
    total_loss[epoch] = loss_tuple[0]
    c_loss[epoch] = loss_tuple[1]
    act_loss[epoch] = loss_tuple[2]
    ent_loss[epoch] = loss_tuple[3]

In this section of the code, the train_model function is run NUM_TRAIN_EPOCHS times. This can be performed in PPO because of the importance sampling form of the loss function, i.e.:

$$L^{CPI}(\theta) = \mathbb{E}_t \left[\frac{\pi_\theta (a_t | s_t)}{\pi_{\theta old}(a_t | s_t)}A_t\right] = \mathbb{E}_t\left[r_t(\theta)A_t\right]$$

In this case, probs are the probabilities extracted from the model over the rolling out of the trajectory by the agent. In other words, they are the probabilities generated from the neural network with the parameters of the network based on the previous round of training. Therefore, these are the old probabilities, or $\pi_{\theta old}(a_t | s_t)$ in the formula above. The new probabilities $\pi_{\theta}(a_t | s_t)$ are generated inside the gradient tape context of train_model – a new set of parameters for each training loop. However, through each of these iterations, the old probabilities, or probs, stay constant.

Running this code on the CartPole environment yields the following rewards during training:

PPO Cartpole rewards

As can be observed, the agent manages to score the maximum Cartpole reward of 200 quite early on in the training process; however, it takes some time to achieve this result consistently. Further fine-tuning of the entropy reduction factor, or other hyperparameters, would likely yield a more consistent high score sooner.

This concludes my introductory post on the Proximal Policy Optimization (PPO) method and its implementation in TensorFlow 2. I hope it was informative for you.