Yet Another Tutorial on PPO

Introduction to PPO

Proximal Policy Optimization (PPO) is easiest to understand by starting from its surrogate objective. Let \(\pi_{\theta_{\mathrm{old}}}\) be the policy that collected the trajectories and let \(\pi_{\theta}\) be the policy we want to optimize. The non-clipping version of the PPO objective (proposed in TRPO) is

\[ L^{\mathrm{PPO}}(\theta)= \mathbb{E}_{s_t,a_t \sim \pi_{\theta_{\mathrm{old}}}} \left[r_{t}(\theta)\hat{A}_{t}\right], \qquad r_{t}(\theta)=\frac{\pi_{\theta}(a_{t}\mid s_{t})}{\pi_{\theta_{\mathrm{old}}}(a_{t}\mid s_{t})}, \]

where \(\hat{A}_t\) is an estimator of the advantage function \(A_t\). The advantage of taking action \(a_t\) at state \(s_t\) under policy \(\pi_{\theta}\) is defined as

\[ A_t({\theta}) := A^{\pi_{\theta}}(s_t,a_t) =Q^{\pi_{\theta}}(s_t,a_t)-V^{\pi_{\theta}}(s_t), \]

where \(Q^{\pi_{\theta}}(s_t,a_t)\) is the expected return after taking \(a_t\) at \(s_t\) and following policy \(\pi_{\theta}\) afterwards, and \(V^{\pi_{\theta}}(s_t)\) is the expected return from \(s_t\) when actions are sampled from \(\pi_{\theta}\). \(\hat{A}_t\) is typically estimated with Generalized Advantage Estimation (GAE), together with a value network that predicts \(\hat{V}^{\pi_{\theta}}(s_t)\); we will discuss this later. Intuitively, \(\hat{A}_t>0\) means the sampled action was better than what the old policy usually does at that state, while \(\hat{A}_t<0\) means it was worse. The policy-gradient update remains valid if you replace \(\hat{A}_t\) with the raw return of the current trajectory, but the variance will be much larger.

The old policy \(\pi_{\theta_{\mathrm{old}}}\) is a frozen snapshot of the policy used to collect the current batch of trajectories. This matters in practical RL systems because the simulator loop and the optimization loop are not necessarily synchronized: to improve training efficiency, the optimizer can take several gradient steps on old trajectories while new batches are still rolling out.

\(L^{\mathrm{PPO}}\) follows naturally from importance sampling. For each state \(s_t\), we want to maximize the expected advantage under the new policy:

\[ \begin{align} \mathbb{E}_{s_t,a_t \sim \pi_{\theta}}\left[\hat{A}_t\right] &= \sum_{a_t}\pi_{\theta}(a_t \mid s_t)\hat{A}_t \\ &= \sum_{a_t}\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t) \frac{\pi_{\theta}(a_t \mid s_t)} {\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\hat{A}_t \\ &= \mathbb{E}_{a_t \sim \pi_{\theta_{\mathrm{old}}}} \left[r_t(\theta)\hat{A}_t\right]. \end{align} \]
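The unclipped surrogate drops out of this identity almost directly. Below is a minimal PyTorch sketch (the function name `ppo_surrogate` and the toy tensors are illustrative, not from the original text):

```python
import torch

def ppo_surrogate(logp_new, logp_old, advantages):
    """Unclipped surrogate L^PPO = E[r_t(theta) * A_hat_t].

    logp_new: log pi_theta(a_t | s_t), carries gradients.
    logp_old: log pi_theta_old(a_t | s_t), a frozen snapshot.
    """
    ratio = torch.exp(logp_new - logp_old.detach())  # r_t(theta)
    return (ratio * advantages).mean()
```

At \(\theta=\theta_{\mathrm{old}}\) the ratio is exactly 1, so the surrogate reduces to the mean advantage of the batch.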

Relationship to Policy Gradient

The above formulation looks a bit different from Policy Gradient, which is often the first RL algorithm people learn.

To make things clear, I will use three symbols:

  1. \(L(\theta)\) is the true policy-gradient objective, i.e., the expected discounted return. This is the quantity we actually care about, but it is not directly differentiable with backpropagation because the actions are sampled from the policy and the environment transition is outside the computation graph.
  2. \(L^{\mathrm{PG}}(\theta)\) is the regular policy-gradient surrogate. It uses \(\log \pi_{\theta}(a_t \mid s_t)\) and gives the policy-gradient update direction.
  3. \(L^{\mathrm{PPO}}(\theta)\) is the non-clipping version of the PPO objective. It uses the probability ratio \(r_t(\theta)\).

The true objective is

\[ L(\theta)=\mathbb{E}_{s_t,a_t \sim \pi_{\theta}}\left[G_t\right], \qquad G_t=\sum_{k=t}^{T}\gamma^{k-t}r(s_k,a_k). \]

Using the log-derivative trick, the classic policy-gradient result is

\[ \nabla_{\theta}L(\theta)=\mathbb{E}_{s_t,a_t \sim \pi_{\theta}} \left[\nabla_{\theta}\log \pi_{\theta}(a_t \mid s_t)\hat{A}_t\right]. \]

Derivation of Policy Gradient

Let \(\tau=(s_0,a_0,s_1,a_1,\dots)\) be a trajectory and let \(R(\tau)\) be its return. The true objective can be written as a sum over all trajectories:

\[ L(\theta)=\sum_{\tau}P_{\theta}(\tau)R(\tau). \]

The useful identity is

\[\nabla_{\theta}f(\theta)=f(\theta)\nabla_{\theta}\log f(\theta).\]

Therefore

\[ \begin{align} \nabla_{\theta}L(\theta) &= \sum_{\tau}\nabla_{\theta}P_{\theta}(\tau)R(\tau) \\ &= \sum_{\tau}P_{\theta}(\tau)\nabla_{\theta}\log P_{\theta}(\tau)R(\tau) \\ &= \mathbb{E}_{\tau \sim \pi_{\theta}} \left[\nabla_{\theta}\log P_{\theta}(\tau)R(\tau)\right]. \end{align} \]

The trajectory probability is

\[ P_{\theta}(\tau)=\rho(s_0)\prod_{t=0}^{T-1}\pi_{\theta}(a_t \mid s_t)p(s_{t+1}\mid s_t,a_t). \]

Since the initial-state distribution and environment dynamics do not depend on \(\theta\), only the policy terms remain in the gradient:

\[ \nabla_{\theta}\log P_{\theta}(\tau) =\sum_{t=0}^{T-1}\nabla_{\theta}\log \pi_{\theta}(a_t \mid s_t). \]

Substituting this back gives the usual policy-gradient estimator. Replacing the full-trajectory return \(R(\tau)\) with the reward-to-go and subtracting a state-dependent baseline gives the lower-variance advantage form:

\[ \nabla_{\theta}L(\theta)= \mathbb{E}_{s_t,a_t \sim \pi_{\theta}} \left[\nabla_{\theta}\log \pi_{\theta}(a_t \mid s_t)\hat{A}_t\right]. \]

To implement policy gradient in PyTorch, we often use a surrogate objective to compute this gradient:

\[ L^{\mathrm{PG}}(\theta)= \mathbb{E}_{s_t,a_t \sim \pi_{\theta_{\mathrm{old}}}} \left[\log \pi_{\theta}(a_t \mid s_t)\hat{A}_t\right]. \]

Taking the gradient of this surrogate gives the score-function estimator:

\[ \nabla_{\theta}L^{\mathrm{PG}}(\theta)= \mathbb{E}_{s_t,a_t \sim \pi_{\theta_{\mathrm{old}}}} \left[\nabla_{\theta}\log \pi_{\theta}(a_t \mid s_t)\hat{A}_t\right]. \]

At first glance, \(L^{\mathrm{PG}}\) and the non-clipping PPO objective \(L^{\mathrm{PPO}}\) look different:

\[ L^{\mathrm{PPO}}(\theta)= \mathbb{E}_{s_t,a_t \sim \pi_{\theta_{\mathrm{old}}}} \left[ \frac{\pi_{\theta}(a_t \mid s_t)} {\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} \hat{A}_t \right]. \]

However, their gradients are the same at \(\theta=\theta_{\mathrm{old}}\):

\[ \begin{align} \nabla_{\theta}L^{\mathrm{PPO}}(\theta)\big|_{\theta=\theta_{\mathrm{old}}} &= \mathbb{E}_{s_t,a_t \sim \pi_{\theta_{\mathrm{old}}}} \left[ \frac{\nabla_{\theta}\pi_{\theta}(a_t \mid s_t)} {\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} \hat{A}_t \right]_{\theta=\theta_{\mathrm{old}}} \\ &\stackrel{1}{=} \mathbb{E}_{s_t,a_t \sim \pi_{\theta_{\mathrm{old}}}} \left[ \underbrace{\frac{\pi_{\theta}(a_t \mid s_t)} {\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}}_{=1} \nabla_{\theta}\log \pi_{\theta}(a_t \mid s_t)\hat{A}_t \right]_{\theta=\theta_{\mathrm{old}}} \\ &= \mathbb{E}_{s_t,a_t \sim \pi_{\theta_{\mathrm{old}}}} \left[ \nabla_{\theta}\log \pi_{\theta}(a_t \mid s_t)\hat{A}_t \right]_{\theta=\theta_{\mathrm{old}}} \\ &= \nabla_{\theta}L^{\mathrm{PG}}(\theta)\big|_{\theta=\theta_{\mathrm{old}}}, \end{align} \]

where \(\stackrel{1}{=}\) uses the same identity \(\nabla_{\theta}f(\theta)=f(\theta)\nabla_{\theta}\log f(\theta)\) as the policy-gradient derivation.
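This equality is easy to verify numerically. The sketch below builds a toy single-state categorical policy from raw logits (all names and numbers here are illustrative assumptions) and compares the gradients of the two surrogates at \(\theta=\theta_{\mathrm{old}}\):

```python
import torch

torch.manual_seed(0)
logits_old = torch.randn(4)            # frozen snapshot theta_old
actions = torch.tensor([0, 2, 1])      # sampled actions
adv = torch.tensor([1.0, -0.5, 2.0])   # toy advantage estimates

def grad_of(surrogate):
    # Evaluate the surrogate's gradient at theta = theta_old.
    logits = logits_old.clone().requires_grad_(True)
    logp = torch.log_softmax(logits, dim=-1)[actions]
    logp_old = torch.log_softmax(logits_old, dim=-1)[actions]
    if surrogate == "pg":      # L^PG: log-prob surrogate
        obj = (logp * adv).mean()
    else:                      # L^PPO: unclipped ratio surrogate
        obj = (torch.exp(logp - logp_old) * adv).mean()
    obj.backward()
    return logits.grad

g_pg, g_ppo = grad_of("pg"), grad_of("ppo")
```

Both calls return the same gradient vector, matching the derivation above.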

PPO Importance Sampling Clipping

The major contribution of the original PPO paper is replacing

\[ L^{\mathrm{PPO}}(\theta)=\mathbb{E}_{t}\left[r_{t}(\theta)\hat{A}_{t}\right] \]

with

\[ L^{\mathrm{CLIP}}(\theta)=\mathbb{E}_{t}\left[ \min\left( r_t(\theta)\hat{A}_t,\; \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t \right)\right]. \]

The goal of clipping is to keep \(\pi_\theta\) from deviating too much from \(\pi_{\theta_{\mathrm{old}}}\). Consider the case \(\hat{A}_t>0\). If \(r_t(\theta) > 1+\epsilon\), the new policy already assigns this action a much higher probability than \(\pi_{\theta_{\mathrm{old}}}\) did, so PPO clips the ratio; the clipped term is constant in \(\theta\), so its gradient is zero and the probability is not pushed further away. If \(r_t(\theta) < 1-\epsilon\), the new policy assigns this action a lower probability than the old policy did, so PPO does not clip the ratio; the gradient makes the action more likely and moves the probability back toward \(\pi_{\theta_{\mathrm{old}}}\). In other words, clipping only kicks in when the update is moving away from \(\pi_{\theta_{\mathrm{old}}}\). The case \(\hat{A}_t<0\) is symmetric.
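A minimal PyTorch sketch of \(L^{\mathrm{CLIP}}\), written as a loss to minimize (the name `ppo_clip_loss` is illustrative):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """Negated L^CLIP, so that minimizing it maximizes the objective."""
    ratio = torch.exp(logp_new - logp_old.detach())      # r_t(theta)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -torch.min(unclipped, clipped).mean()
```

When the ratio has already moved past the clip range in the direction the advantage favors, the clipped branch wins the `min` and its gradient is zero, so the update stops pushing.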

Is Clipping Actually Working?

Clipping \(r_t(\theta)\) is the signature contribution of PPO, but it is also somewhat unusual: if the math says \(r_t(\theta)\) should be large, there may be a good reason. Why does PPO emphasize clipping the importance-sampling factor \(r_t(\theta)\) when deep learning already has similar general tools such as gradient clipping?

Indeed, empirical studies such as "Implementation Matters in Deep RL: A Case Study on PPO and TRPO" suggest that the performance gains reported in the original PPO paper come mainly from code-level optimizations such as advantage normalization, value-function fitting details, batch construction, and early stopping, rather than from the fancy importance-sampling clipping. My take is that you should treat clipping as one additional heuristic for improving stability when you need it, rather than as a must-have component.

Generalized Advantage Estimation (GAE)

The PPO objective assumes that we already have an advantage estimate \(\hat{A}_t\) for every sampled transition. A simple choice is the Monte Carlo return \(G_t\) minus a value estimate,

\[ \hat{A}_t = G_t - V_{\phi}^{\pi_{\theta}}(s_t), \qquad G_t = \sum_{k=t}^{T}\gamma^{k-t}r(s_k,a_k), \]

where \(V_{\phi}^{\pi_{\theta}}\) is the value network parameterized by \(\phi\) that approximates the value function of policy \(\pi_{\theta}\). This estimate is unbiased when we use the full Monte Carlo return, but it often has high variance because it depends on all future rewards in the trajectory. On the other extreme, we can use the one-step temporal-difference residual

\[ \delta_t = r_t + \gamma V_{\phi}^{\pi_{\theta}}(s_{t+1}) - V_{\phi}^{\pi_{\theta}}(s_t). \]

This one-step estimator has lower variance but higher bias because it trusts the value network's bootstrap estimate. Generalized Advantage Estimation (GAE) interpolates between these two extremes with a trace-decay parameter \(\lambda\):

\[ \hat{A}^{\mathrm{GAE}}_t = \sum_{l=0}^{T-t-1}(\gamma\lambda)^l\delta_{t+l}. \]

In implementation, the same equation is usually computed backward through the rollout:

\[ \hat{A}^{\mathrm{GAE}}_t = \delta_t + \gamma\lambda\hat{A}^{\mathrm{GAE}}_{t+1}, \qquad \hat{A}^{\mathrm{GAE}}_T = 0. \]

The parameter \(\lambda\) controls the bias-variance tradeoff. When \(\lambda=0\), GAE reduces to the one-step TD residual. When \(\lambda=1\), it becomes the Monte Carlo advantage estimate mentioned above: the intermediate \(V_{\phi}^{\pi_{\theta}}(\cdot)\) terms telescope away and the remaining sum of rewards is exactly \(G_t\). PPO implementations commonly use something like \(\lambda=0.95\), which keeps most of the long-horizon signal while reducing variance. After computing \(\hat{A}^{\mathrm{GAE}}_t\), it is also common to normalize advantages within the batch before using them in the PPO objective.
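The backward recursion above can be sketched as follows (a NumPy illustration; the function name and the assumption that the rollout ends in a terminal state are mine):

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Backward GAE recursion: A_t = delta_t + gamma * lam * A_{t+1}.

    values has length T+1: it includes a bootstrap value for the state
    after the last step (0 if that state is terminal).
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0  # A_hat_T = 0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv
```

With \(\gamma=\lambda=1\) and a zero value function, the recursion reduces to the reward-to-go; with \(\lambda=0\) it returns the one-step TD residuals.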

Training Value Network (Critic Network)

The value network is trained to predict the expected return from each state. Once we have \(\hat{A}^{\mathrm{GAE}}_t\), the corresponding return target is

\[ \hat{R}_t = \hat{A}^{\mathrm{GAE}}_t + V_{\phi_{\mathrm{old}}}^{\pi_{\theta_{\mathrm{old}}}}(s_t), \]

where \(V_{\phi_{\mathrm{old}}}^{\pi_{\theta_{\mathrm{old}}}}(s_t)\) is the value estimate used when the rollout batch was processed. The critic is then trained with a regression loss:

\[ L^{\mathrm{VF}}(\phi) = \mathbb{E}_{t}\left[ \left(V_{\phi}^{\pi_{\theta}}(s_t)-\hat{R}_t\right)^2 \right]. \]

The target \(\hat{R}_t\) is normally treated as a constant during this update. In most implementations, both \(\hat{A}^{\mathrm{GAE}}_t\) and \(\hat{R}_t\) are always detached tensors.

Some PPO implementations also clip the value-function update, analogous to policy-ratio clipping:

\[ V^{\mathrm{clip}}_{\phi}(s_t) = V_{\phi_{\mathrm{old}}}^{\pi_{\theta_{\mathrm{old}}}}(s_t) + \mathrm{clip}\left( V_{\phi}^{\pi_{\theta}}(s_t)-V_{\phi_{\mathrm{old}}}^{\pi_{\theta_{\mathrm{old}}}}(s_t), -\epsilon_v, \epsilon_v \right). \]

Then the critic loss becomes the larger of the unclipped and clipped squared errors:

\[ L^{\mathrm{VF,clip}}(\phi) = \mathbb{E}_{t}\left[ \max\left( \left(V_{\phi}^{\pi_{\theta}}(s_t)-\hat{R}_t\right)^2,\; \left(V^{\mathrm{clip}}_{\phi}(s_t)-\hat{R}_t\right)^2 \right) \right]. \]

This value clipping uses the same idea as importance-sampling clipping and is also not essential according to "Implementation Matters in Deep RL: A Case Study on PPO and TRPO".
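A sketch of the clipped critic loss under these definitions (the function name is illustrative; `v_old` and `returns` are assumed to be detached tensors stored from the rollout):

```python
import torch

def value_loss_clipped(v_new, v_old, returns, eps_v=0.2):
    """Max of unclipped and clipped squared errors, averaged over the batch."""
    v_clip = v_old + torch.clamp(v_new - v_old, -eps_v, eps_v)
    loss_unclipped = (v_new - returns) ** 2
    loss_clipped = (v_clip - returns) ** 2
    return torch.max(loss_unclipped, loss_clipped).mean()
```

Taking the `max` keeps the loss pessimistic: a value prediction that jumped far from the rollout-time estimate is still penalized by the unclipped term.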

The PPO Objective

Putting the actor and critic together, a typical PPO implementation optimizes a combined objective of the form

\[ L^{\mathrm{total}}(\theta,\phi) = -L^{\mathrm{CLIP}}(\theta)+ c_v L^{\mathrm{VF}}(\phi) - c_e H(\pi_{\theta}), \]

where \(H(\pi_{\theta})\) is an entropy bonus, \(c_v\) controls the critic loss weight, and \(c_e\) controls the entropy weight. The minus sign in front of \(L^{\mathrm{CLIP}}\) appears because most deep-learning optimizers minimize losses, while the PPO objective is written as a quantity to maximize.
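One possible shape of this combined loss in PyTorch (the coefficients and names are illustrative defaults, not prescribed by the text; in practice `entropy` usually comes from the policy distribution object, e.g. `dist.entropy()`):

```python
import torch

def ppo_total_loss(logp_new, logp_old, adv, v_pred, returns, entropy,
                   eps=0.2, c_v=0.5, c_e=0.01):
    """-L^CLIP + c_v * L^VF - c_e * H, as one scalar to minimize."""
    ratio = torch.exp(logp_new - logp_old.detach())
    clip_obj = torch.min(ratio * adv,
                         torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()
    vf_loss = ((v_pred - returns.detach()) ** 2).mean()
    return -clip_obj + c_v * vf_loss - c_e * entropy.mean()
```

At \(\theta=\theta_{\mathrm{old}}\), with a perfectly fit critic, zero-mean advantages, and zero entropy weight contribution, the total loss is zero, which is a handy sanity check.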

Yichao Zhou
Ph.D. in Computer Science