<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>robotics | Yichao Zhou</title>
    <link>https://yichaozhou.com/tags/robotics/</link>
      <atom:link href="https://yichaozhou.com/tags/robotics/index.xml" rel="self" type="application/rss+xml" />
    <description>robotics</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><copyright>© 2018-2026 Yichao Zhou All Rights Reserved</copyright><lastBuildDate>Sun, 26 Apr 2026 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://yichaozhou.com/images/icon_hu55475fe97c00cec85fe596a6a8c7f761_27704_512x512_fill_lanczos_center_2.png</url>
      <title>robotics</title>
      <link>https://yichaozhou.com/tags/robotics/</link>
    </image>
    
    <item>
      <title>Yet Another Tutorial on PPO</title>
      <link>https://yichaozhou.com/post/20260426ppo/</link>
      <pubDate>Sun, 26 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://yichaozhou.com/post/20260426ppo/</guid>
      <description>&lt;h2 id=&#34;introduction-to-ppo&#34;&gt;Introduction to PPO&lt;/h2&gt;

&lt;p&gt;Proximal Policy Optimization (PPO) is easiest to understand by starting from its surrogate
objective.  Let &lt;span  class=&#34;math&#34;&gt;\(\pi_{\theta_{\mathrm{old}}}\)&lt;/span&gt; be the policy that collected the trajectories and
let &lt;span  class=&#34;math&#34;&gt;\(\pi_{\theta}\)&lt;/span&gt; be the policy we want to optimize.  The &lt;strong&gt;non-clipping&lt;/strong&gt; version of the PPO
objective (proposed in TRPO) is&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
L^{\mathrm{PPO}}(\theta)=
\mathbb{E}_{s_t,a_t \sim \pi_{\theta_{\mathrm{old}}}}
\left[r_{t}(\theta)\hat{A}_{t}\right], \qquad
r_{t}(\theta)=\frac{\pi_{\theta}(a_{t}\mid s_{t})}{\pi_{\theta_{\mathrm{old}}}(a_{t}\mid s_{t})},
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;where &lt;span  class=&#34;math&#34;&gt;\(\hat{A}_t\)&lt;/span&gt; is an estimator of the advantage function &lt;span  class=&#34;math&#34;&gt;\(A_t\)&lt;/span&gt;.  The advantage of taking
action &lt;span  class=&#34;math&#34;&gt;\(a_t\)&lt;/span&gt; at state &lt;span  class=&#34;math&#34;&gt;\(s_t\)&lt;/span&gt; under policy &lt;span  class=&#34;math&#34;&gt;\(\pi_{\theta}\)&lt;/span&gt; is defined as&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
A_t({\theta}) := A^{\pi_{\theta}}(s_t,a_t)
=Q^{\pi_{\theta}}(s_t,a_t)-V^{\pi_{\theta}}(s_t),
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;where &lt;span  class=&#34;math&#34;&gt;\(Q^{\pi_{\theta}}(s_t,a_t)\)&lt;/span&gt; is the expected return after taking &lt;span  class=&#34;math&#34;&gt;\(a_t\)&lt;/span&gt; at &lt;span  class=&#34;math&#34;&gt;\(s_t\)&lt;/span&gt; and following
policy &lt;span  class=&#34;math&#34;&gt;\(\pi_{\theta}\)&lt;/span&gt; afterwards, and &lt;span  class=&#34;math&#34;&gt;\(V^{\pi_{\theta}}(s_t)\)&lt;/span&gt; is the expected return when
the action at &lt;span  class=&#34;math&#34;&gt;\(s_t\)&lt;/span&gt; is itself sampled from &lt;span  class=&#34;math&#34;&gt;\(\pi_{\theta}\)&lt;/span&gt;.  &lt;span  class=&#34;math&#34;&gt;\(\hat{A}_t\)&lt;/span&gt; is typically estimated with Generalized
Advantage Estimation (GAE), together with a value network that predicts
&lt;span  class=&#34;math&#34;&gt;\(\hat{V}^{\pi_{\theta}}(s_t)\)&lt;/span&gt;; we will discuss this later.  Intuitively,
&lt;span  class=&#34;math&#34;&gt;\(\hat{A}_t&gt;0\)&lt;/span&gt; means the sampled action was better than what the old policy usually does at that
state, while &lt;span  class=&#34;math&#34;&gt;\(\hat{A}_t&lt;0\)&lt;/span&gt; means it was worse.  The policy-gradient update is still valid if
you replace &lt;span  class=&#34;math&#34;&gt;\(\hat{A}_t\)&lt;/span&gt; with the return of the sampled trajectory, but the variance will be much
larger.&lt;/p&gt;

&lt;p&gt;The old policy &lt;span  class=&#34;math&#34;&gt;\(\pi_{\theta_{\mathrm{old}}}\)&lt;/span&gt; is a frozen snapshot of the policy used to collect
the current batch of trajectories.  This matters in practical RL systems because the simulator
loop and the optimization loop are not necessarily synchronized: to improve training efficiency,
the optimizer can take several gradient steps on old trajectories while new batches are still
rolling out.&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\(L^{\mathrm{PPO}}\)&lt;/span&gt; follows naturally from importance sampling.  For each state
&lt;span  class=&#34;math&#34;&gt;\(s_t\)&lt;/span&gt;, we want to maximize the expected advantage of the action sampled from the new policy:&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
\begin{align}
\mathbb{E}_{a_t \sim \pi_{\theta}}\left[\hat{A}_t\right]
&amp;= \sum_{a_t}\pi_{\theta}(a_t \mid s_t)\hat{A}_t \\
&amp;= \sum_{a_t}\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)
\frac{\pi_{\theta}(a_t \mid s_t)}
{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\hat{A}_t \\
&amp;= \mathbb{E}_{a_t \sim \pi_{\theta_{\mathrm{old}}}}
\left[r_t(\theta)\hat{A}_t\right].
\end{align}
\]&lt;/span&gt;&lt;/p&gt;

&lt;h2 id=&#34;relationship-to-policy-gradient&#34;&gt;Relationship to Policy Gradient&lt;/h2&gt;

&lt;p&gt;The above formulation looks a bit different from Policy Gradient, which is often the first RL
algorithm people learn.&lt;/p&gt;

&lt;p&gt;To make things clear, I will use three symbols:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;span  class=&#34;math&#34;&gt;\(L(\theta)\)&lt;/span&gt; is the true policy-gradient objective, i.e., the expected discounted return.
This is the quantity we actually care about, but it is not directly differentiable with
backpropagation because the actions are sampled from the policy and the environment transition
is outside the computation graph.&lt;/li&gt;
&lt;li&gt;&lt;span  class=&#34;math&#34;&gt;\(L^{\mathrm{PG}}(\theta)\)&lt;/span&gt; is the regular policy-gradient surrogate.  It uses
&lt;span  class=&#34;math&#34;&gt;\(\log \pi_{\theta}(a_t \mid s_t)\)&lt;/span&gt; and gives the policy-gradient update direction.&lt;/li&gt;
&lt;li&gt;&lt;span  class=&#34;math&#34;&gt;\(L^{\mathrm{PPO}}(\theta)\)&lt;/span&gt; is the non-clipping version of the PPO objective.  It uses the
probability ratio &lt;span  class=&#34;math&#34;&gt;\(r_t(\theta)\)&lt;/span&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The true objective is&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
L(\theta)=\mathbb{E}_{s_t,a_t \sim \pi_{\theta}}\left[G_t\right],
\qquad
G_t=\sum_{k=t}^{T}\gamma^{k-t}r(s_k,a_k).
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Using the log-derivative trick, the classic policy-gradient result is&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
\nabla_{\theta}L(\theta)=\mathbb{E}_{s_t,a_t \sim \pi_{\theta}}
\left[\nabla_{\theta}\log \pi_{\theta}(a_t \mid s_t)\hat{A}_t\right].
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;details&gt;
&lt;summary&gt;Derivation of Policy Gradient (Click to Unfold)&lt;/summary&gt;&lt;/p&gt;

&lt;p&gt;Let &lt;span  class=&#34;math&#34;&gt;\(\tau=(s_0,a_0,s_1,a_1,\dots)\)&lt;/span&gt; be a trajectory and let &lt;span  class=&#34;math&#34;&gt;\(R(\tau)\)&lt;/span&gt; be its return.  The true
objective can be written as a sum over all trajectories:&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
L(\theta)=\sum_{\tau}P_{\theta}(\tau)R(\tau).
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;The useful identity is&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[\nabla_{\theta}f(\theta)=f(\theta)\nabla_{\theta}\log f(\theta).\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Therefore&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
\begin{align}
\nabla_{\theta}L(\theta)
&amp;= \sum_{\tau}\nabla_{\theta}P_{\theta}(\tau)R(\tau) \\
&amp;= \sum_{\tau}P_{\theta}(\tau)\nabla_{\theta}\log P_{\theta}(\tau)R(\tau) \\
&amp;= \mathbb{E}_{\tau \sim \pi_{\theta}}
\left[\nabla_{\theta}\log P_{\theta}(\tau)R(\tau)\right].
\end{align}
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;The trajectory probability is&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
P_{\theta}(\tau)=\rho(s_0)\prod_{t=0}^{T-1}\pi_{\theta}(a_t \mid s_t)p(s_{t+1}\mid s_t,a_t).
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Since the initial-state distribution and environment dynamics do not depend on &lt;span  class=&#34;math&#34;&gt;\(\theta\)&lt;/span&gt;, only the
policy terms remain in the gradient:&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
\nabla_{\theta}\log P_{\theta}(\tau)
=\sum_{t=0}^{T-1}\nabla_{\theta}\log \pi_{\theta}(a_t \mid s_t).
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Substituting this back gives the usual policy-gradient estimator.  &lt;a href=&#34;https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#don-t-let-the-past-distract-you&#34;&gt;Replacing the full-trajectory return &lt;span  class=&#34;math&#34;&gt;\(R(\tau)\)&lt;/span&gt;
with the reward-to-go and subtracting a state baseline&lt;/a&gt; gives the lower-variance advantage form:&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
\nabla_{\theta}L(\theta)=
\mathbb{E}_{s_t,a_t \sim \pi_{\theta}}
\left[\nabla_{\theta}\log \pi_{\theta}(a_t \mid s_t)\hat{A}_t\right].
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;/details&gt;&lt;/p&gt;

&lt;p&gt;To implement policy gradient in PyTorch, we often use a surrogate objective to compute this
gradient:&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
L^{\mathrm{PG}}(\theta)=
\mathbb{E}_{s_t,a_t \sim \pi_{\theta_{\mathrm{old}}}}
\left[\log \pi_{\theta}(a_t \mid s_t)\hat{A}_t\right].
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Taking the gradient of this surrogate gives the score-function estimator:&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
\nabla_{\theta}L^{\mathrm{PG}}(\theta)=
\mathbb{E}_{s_t,a_t \sim \pi_{\theta_{\mathrm{old}}}}
\left[\nabla_{\theta}\log \pi_{\theta}(a_t \mid s_t)\hat{A}_t\right].
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;At first glance, &lt;span  class=&#34;math&#34;&gt;\(L^{\mathrm{PG}}\)&lt;/span&gt; and the non-clipping PPO objective &lt;span  class=&#34;math&#34;&gt;\(L^{\mathrm{PPO}}\)&lt;/span&gt; look
different:&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
L^{\mathrm{PPO}}(\theta)=
\mathbb{E}_{s_t,a_t \sim \pi_{\theta_{\mathrm{old}}}}
\left[
\frac{\pi_{\theta}(a_t \mid s_t)}
{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
\hat{A}_t
\right].
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;However, their gradients are the same at &lt;span  class=&#34;math&#34;&gt;\(\theta=\theta_{\mathrm{old}}\)&lt;/span&gt;:&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
\begin{align}
\nabla_{\theta}L^{\mathrm{PPO}}(\theta)\big|_{\theta=\theta_{\mathrm{old}}}
&amp;=
\mathbb{E}_{s_t,a_t \sim \pi_{\theta_{\mathrm{old}}}}
\left[
\frac{\nabla_{\theta}\pi_{\theta}(a_t \mid s_t)}
{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
\hat{A}_t
\right]_{\theta=\theta_{\mathrm{old}}} \\
&amp;\stackrel{1}{=}
\mathbb{E}_{s_t,a_t \sim \pi_{\theta_{\mathrm{old}}}}
\left[
\underbrace{\frac{\pi_{\theta}(a_t \mid s_t)}
{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}}_{=1}
\nabla_{\theta}\log \pi_{\theta}(a_t \mid s_t)\hat{A}_t
\right]_{\theta=\theta_{\mathrm{old}}} \\
&amp;=
\mathbb{E}_{s_t,a_t \sim \pi_{\theta_{\mathrm{old}}}}
\left[
\nabla_{\theta}\log \pi_{\theta}(a_t \mid s_t)\hat{A}_t
\right]_{\theta=\theta_{\mathrm{old}}} \\
&amp;=
\nabla_{\theta}L^{\mathrm{PG}}(\theta)\big|_{\theta=\theta_{\mathrm{old}}},
\end{align}
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;where &lt;span  class=&#34;math&#34;&gt;\(\stackrel{1}{=}\)&lt;/span&gt; uses the same identity
&lt;span  class=&#34;math&#34;&gt;\(\nabla_{\theta}f(\theta)=f(\theta)\nabla_{\theta}\log f(\theta)\)&lt;/span&gt; as the policy-gradient
derivation.&lt;/p&gt;
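
&lt;p&gt;This equivalence is easy to check numerically.  The following toy sketch (with made-up logits and
advantages) compares the gradients of the log-probability surrogate and the ratio surrogate at
&lt;span  class=&#34;math&#34;&gt;\(\theta=\theta_{\mathrm{old}}\)&lt;/span&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import torch

torch.manual_seed(0)
logits = torch.randn(5, 3, requires_grad=True)   # stand-in for policy parameters
actions = torch.randint(0, 3, (5,))
adv = torch.randn(5)                             # pretend advantage estimates

logp = torch.log_softmax(logits, dim=-1)[torch.arange(5), actions]
logp_old = logp.detach()                         # frozen snapshot at theta_old

pg_surrogate = (logp * adv).mean()
ppo_surrogate = ((logp - logp_old).exp() * adv).mean()

g_pg, = torch.autograd.grad(pg_surrogate, logits, retain_graph=True)
g_ppo, = torch.autograd.grad(ppo_surrogate, logits)
print(torch.allclose(g_pg, g_ppo, atol=1e-6))    # True
&lt;/code&gt;&lt;/pre&gt;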

&lt;h2 id=&#34;ppo-importance-sampling-clipping&#34;&gt;PPO Importance Sampling Clipping&lt;/h2&gt;

&lt;p&gt;The major contribution of the original PPO paper is replacing&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
L^{\mathrm{PPO}}(\theta)=\mathbb{E}_{t}\left[r_{t}(\theta)\hat{A}_{t}\right]
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;with&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
L^{\mathrm{CLIP}}(\theta)=\mathbb{E}_{t}\left[
\min\left(
r_t(\theta)\hat{A}_t,\;
\mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t
\right)\right].
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;The goal of clipping here is to make sure &lt;span  class=&#34;math&#34;&gt;\(\pi_\theta\)&lt;/span&gt; does not deviate too much from
&lt;span  class=&#34;math&#34;&gt;\(\pi_{\theta_{\mathrm{old}}}\)&lt;/span&gt;.  Consider the case &lt;span  class=&#34;math&#34;&gt;\(\hat{A}_t&gt;0\)&lt;/span&gt;.  If
&lt;span  class=&#34;math&#34;&gt;\(r_t(\theta) &gt; 1+\epsilon\)&lt;/span&gt;, the new policy already assigns this action a much higher probability
than &lt;span  class=&#34;math&#34;&gt;\(\pi_{\theta_{\mathrm{old}}}\)&lt;/span&gt; did, so PPO clips the ratio to prevent the probability from
deviating further.  If &lt;span  class=&#34;math&#34;&gt;\(r_t(\theta) &lt; 1-\epsilon\)&lt;/span&gt;, the new policy assigns this action a lower
probability than the old policy did, so PPO does not clip the ratio.  The gradient will make the
action more likely and move the probability back toward &lt;span  class=&#34;math&#34;&gt;\(\pi_{\theta_{\mathrm{old}}}\)&lt;/span&gt;.  &lt;strong&gt;In
other words, clipping is only needed when the update is moving away from
&lt;span  class=&#34;math&#34;&gt;\(\pi_{\theta_{\mathrm{old}}}\)&lt;/span&gt;.&lt;/strong&gt;  The case for &lt;span  class=&#34;math&#34;&gt;\(\hat{A}_t&lt;0\)&lt;/span&gt; can be treated
&lt;a href=&#34;https://stackoverflow.com/questions/46422845/what-is-the-way-to-understand-proximal-policy-optimization-algorithm-in-rl&#34;&gt;similarly&lt;/a&gt;.&lt;/p&gt;
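
&lt;p&gt;A minimal sketch of &lt;span  class=&#34;math&#34;&gt;\(L^{\mathrm{CLIP}}\)&lt;/span&gt; in PyTorch, assuming the log-probabilities under the
current and old policies and the advantages are already available (names are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import torch

def ppo_clip_loss(logp, logp_old, adv, eps=0.2):
    # r_t(theta), computed in log space for numerical stability
    ratio = torch.exp(logp - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    # The elementwise min implements L^CLIP; negate for a minimizable loss.
    return -torch.min(unclipped, clipped).mean()
&lt;/code&gt;&lt;/pre&gt;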

&lt;h3 id=&#34;is-clipping-actually-working&#34;&gt;Is Clipping Actually Working?&lt;/h3&gt;

&lt;p&gt;Clipping &lt;span  class=&#34;math&#34;&gt;\(r_t(\theta)\)&lt;/span&gt; is the signature contribution of PPO, but it is also somewhat unusual:
if the math says &lt;span  class=&#34;math&#34;&gt;\(r_t(\theta)\)&lt;/span&gt; should be large, there may be a good reason.  Why does PPO
emphasize clipping the importance-sampling factor &lt;span  class=&#34;math&#34;&gt;\(r_t(\theta)\)&lt;/span&gt; when deep learning already has
similar general tools such as &lt;a href=&#34;https://docs.pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html&#34;&gt;gradient
clipping&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;Indeed, empirical studies such as &lt;a href=&#34;https://arxiv.org/pdf/2005.12729&#34;&gt;&amp;quot;Implementation Matters in Deep RL: A Case Study on
PPO and TRPO&amp;quot;&lt;/a&gt; suggest that the performance gain from the original PPO paper
mainly comes from code-level optimizations such as advantage normalization, value-function fitting,
batch construction, and early stopping, &lt;strong&gt;but not from the fancy importance-sampling clipping&lt;/strong&gt;.
My take is that you should treat it as one additional heuristic for improving stability when you need it, rather
than as a must-have component.&lt;/p&gt;

&lt;h2 id=&#34;generalized-advantage-estimation-gae&#34;&gt;Generalized Advantage Estimation (GAE)&lt;/h2&gt;

&lt;p&gt;The PPO objective assumes that we already have an advantage estimate &lt;span  class=&#34;math&#34;&gt;\(\hat{A}_t\)&lt;/span&gt; for every
sampled transition.  A simple choice is the Monte Carlo return &lt;span  class=&#34;math&#34;&gt;\(G_t\)&lt;/span&gt; minus a value estimate,&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
\hat{A}_t = G_t - V_{\phi}^{\pi_{\theta}}(s_t), \qquad G_t = \sum_{k=t}^{T}\gamma^{k-t}r(s_k,a_k),
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;where &lt;span  class=&#34;math&#34;&gt;\(V_{\phi}^{\pi_{\theta}}\)&lt;/span&gt; is the value network parameterized by &lt;span  class=&#34;math&#34;&gt;\(\phi\)&lt;/span&gt; that approximates
the value function of policy &lt;span  class=&#34;math&#34;&gt;\(\pi_{\theta}\)&lt;/span&gt;.  This estimate is unbiased when we use the full
Monte Carlo return, but it often has high variance because it depends on all future rewards in the
trajectory.  On the other extreme, we can use the one-step temporal-difference residual&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
\delta_t = r_t + \gamma V_{\phi}^{\pi_{\theta}}(s_{t+1}) - V_{\phi}^{\pi_{\theta}}(s_t).
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;This one-step estimator has lower variance but higher bias because it trusts the value network&#39;s
bootstrap estimate.  Generalized Advantage Estimation (GAE) interpolates between these two
extremes with a trace-decay parameter &lt;span  class=&#34;math&#34;&gt;\(\lambda\)&lt;/span&gt;:&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
\hat{A}^{\mathrm{GAE}}_t
= \sum_{l=0}^{T-t-1}(\gamma\lambda)^l\delta_{t+l}.
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;In implementation, the same equation is usually computed backward through the rollout:&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
\hat{A}^{\mathrm{GAE}}_t
= \delta_t + \gamma\lambda\hat{A}^{\mathrm{GAE}}_{t+1},
\qquad
\hat{A}^{\mathrm{GAE}}_T = 0.
\]&lt;/span&gt;&lt;/p&gt;
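
&lt;p&gt;A sketch of this backward pass for a single rollout, assuming the value predictions include a
bootstrap value for the state after the last step (real implementations also mask across episode
boundaries, which is omitted here):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import torch

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    # rewards: [T] rewards of one rollout
    # values:  [T + 1] value predictions; values[T] bootstraps the final state
    T = rewards.shape[0]
    adv = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        gae = delta + gamma * lam * gae                          # backward recursion
        adv[t] = gae
    returns = adv + values[:T]   # regression targets for the value network
    return adv, returns
&lt;/code&gt;&lt;/pre&gt;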

&lt;p&gt;The parameter &lt;span  class=&#34;math&#34;&gt;\(\lambda\)&lt;/span&gt; controls the bias-variance tradeoff.  When &lt;span  class=&#34;math&#34;&gt;\(\lambda=0\)&lt;/span&gt;, GAE becomes the
one-step TD residual.  When &lt;span  class=&#34;math&#34;&gt;\(\lambda=1\)&lt;/span&gt;, it becomes the Monte Carlo advantage estimate mentioned above,
because the intermediate &lt;span  class=&#34;math&#34;&gt;\(V_{\phi}^{\pi_{\theta}}\)&lt;/span&gt; terms cancel in a telescoping sum and the discounted rewards add up to &lt;span  class=&#34;math&#34;&gt;\(G_t\)&lt;/span&gt;.
PPO implementations commonly use something like &lt;span  class=&#34;math&#34;&gt;\(\lambda=0.95\)&lt;/span&gt;, which keeps most of the
long-horizon signal while reducing variance.  After computing &lt;span  class=&#34;math&#34;&gt;\(\hat{A}^{\mathrm{GAE}}_t\)&lt;/span&gt;, it is also
common to normalize advantages within the batch before using them in the PPO objective.&lt;/p&gt;
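
&lt;p&gt;The per-batch normalization mentioned above is usually a one-liner applied to the advantage
tensor (the small epsilon is there to avoid division by zero):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;adv = (adv - adv.mean()) / (adv.std() + 1e-8)
&lt;/code&gt;&lt;/pre&gt;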

&lt;h2 id=&#34;training-value-network-critic-network&#34;&gt;Training Value Network (Critic Network)&lt;/h2&gt;

&lt;p&gt;The value network is trained to predict the expected return from each state.  Once we have
&lt;span  class=&#34;math&#34;&gt;\(\hat{A}^{\mathrm{GAE}}_t\)&lt;/span&gt;, the corresponding return target is&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
\hat{R}_t = \hat{A}^{\mathrm{GAE}}_t + V_{\phi_{\mathrm{old}}}^{\pi_{\theta_{\mathrm{old}}}}(s_t),
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;where &lt;span  class=&#34;math&#34;&gt;\(V_{\phi_{\mathrm{old}}}^{\pi_{\theta_{\mathrm{old}}}}(s_t)\)&lt;/span&gt; is the value estimate used
when the rollout batch was processed.  The critic is then trained with a regression loss:&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
L^{\mathrm{VF}}(\phi)
= \mathbb{E}_{t}\left[
\left(V_{\phi}^{\pi_{\theta}}(s_t)-\hat{R}_t\right)^2
\right].
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;The target &lt;span  class=&#34;math&#34;&gt;\(\hat{R}_t\)&lt;/span&gt; is normally treated as a constant during this update.  In most &lt;a href=&#34;https://stable-baselines3.readthedocs.io/en/v2.3.0/_modules/stable_baselines3/ppo/ppo.html&#34;&gt;implementations&lt;/a&gt;,
both &lt;span  class=&#34;math&#34;&gt;\(\hat{A}^{\mathrm{GAE}}_t\)&lt;/span&gt; and &lt;span  class=&#34;math&#34;&gt;\(\hat{R}_t\)&lt;/span&gt; are detached tensors.&lt;/p&gt;
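
&lt;p&gt;In PyTorch terms, the plain critic regression is a mean squared error against the detached
target (a sketch with illustrative names):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def value_loss(value_pred, returns_hat):
    # value_pred:  V_phi(s_t) predicted by the critic (requires grad)
    # returns_hat: targets R_hat_t; detached so no gradient flows into them
    return ((value_pred - returns_hat.detach()) ** 2).mean()
&lt;/code&gt;&lt;/pre&gt;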

&lt;p&gt;Some PPO implementations also clip the value-function update, analogous to policy-ratio clipping:&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
V^{\mathrm{clip}}_{\phi}(s_t)
= V_{\phi_{\mathrm{old}}}^{\pi_{\theta_{\mathrm{old}}}}(s_t) + \mathrm{clip}\left(
V_{\phi}^{\pi_{\theta}}(s_t)-V_{\phi_{\mathrm{old}}}^{\pi_{\theta_{\mathrm{old}}}}(s_t), -\epsilon_v, \epsilon_v
\right).
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Then the critic loss becomes the larger of the unclipped and clipped squared errors:&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
L^{\mathrm{VF,clip}}(\phi)
= \mathbb{E}_{t}\left[
\max\left(
\left(V_{\phi}^{\pi_{\theta}}(s_t)-\hat{R}_t\right)^2,\;
\left(V^{\mathrm{clip}}_{\phi}(s_t)-\hat{R}_t\right)^2
\right)
\right].
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;This value clipping uses the same idea as importance-sampling clipping and is also not essential according to &lt;a href=&#34;https://arxiv.org/pdf/2005.12729&#34;&gt;&amp;quot;Implementation Matters in Deep RL: A Case Study on
PPO and TRPO&amp;quot;&lt;/a&gt;.&lt;/p&gt;
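
&lt;p&gt;If you do want value clipping, a sketch of &lt;span  class=&#34;math&#34;&gt;\(L^{\mathrm{VF,clip}}\)&lt;/span&gt; looks like this, where the
old value predictions are the ones recorded at rollout time (names are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import torch

def value_loss_clipped(value_pred, value_old, returns_hat, eps_v=0.2):
    # Keep the new prediction within eps_v of the rollout-time value estimate.
    value_clip = value_old + torch.clamp(value_pred - value_old, -eps_v, eps_v)
    loss_unclipped = (value_pred - returns_hat) ** 2
    loss_clipped = (value_clip - returns_hat) ** 2
    # Take the elementwise maximum, i.e. the more pessimistic error.
    return torch.max(loss_unclipped, loss_clipped).mean()
&lt;/code&gt;&lt;/pre&gt;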

&lt;h2 id=&#34;the-ppo-objective&#34;&gt;The PPO Objective&lt;/h2&gt;

&lt;p&gt;Putting the actor and critic together, a typical PPO implementation optimizes a combined objective
of the form&lt;/p&gt;

&lt;p&gt;&lt;span  class=&#34;math&#34;&gt;\[
L^{\mathrm{total}}(\theta,\phi)
= -L^{\mathrm{CLIP}}(\theta)+ c_v L^{\mathrm{VF}}(\phi) - c_e H(\pi_{\theta}),
\]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;where &lt;span  class=&#34;math&#34;&gt;\(H(\pi_{\theta})\)&lt;/span&gt; is an entropy bonus, &lt;span  class=&#34;math&#34;&gt;\(c_v\)&lt;/span&gt; controls the critic loss weight, and
&lt;span  class=&#34;math&#34;&gt;\(c_e\)&lt;/span&gt; controls the entropy weight.  The minus sign in front of &lt;span  class=&#34;math&#34;&gt;\(L^{\mathrm{CLIP}}\)&lt;/span&gt; appears
because most deep-learning optimizers minimize losses, while the PPO objective is written as a
quantity to maximize.&lt;/p&gt;
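
&lt;p&gt;Putting these pieces together, one PPO update on a minibatch can be sketched as a single
function.  The distribution, actions, and cached old log-probabilities are assumed to come from
the rollout buffer; all names are illustrative rather than tied to a specific library:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import torch

def ppo_total_loss(dist, actions, logp_old, adv, value_pred, returns_hat,
                   eps=0.2, c_v=0.5, c_e=0.01):
    # dist: current policy distribution, e.g. torch.distributions.Categorical
    logp = dist.log_prob(actions)
    ratio = torch.exp(logp - logp_old)
    clip_obj = torch.min(ratio * adv,
                         torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv).mean()
    value_loss = ((value_pred - returns_hat) ** 2).mean()
    entropy = dist.entropy().mean()
    # L_total = -L_CLIP + c_v * L_VF - c_e * H
    return -clip_obj + c_v * value_loss - c_e * entropy
&lt;/code&gt;&lt;/pre&gt;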
</description>
    </item>
    
  </channel>
</rss>
