Recall that the policy gradients are given by:
$$ \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N\sum_{t=1}^T \nabla_\theta \log \pi_\theta(\mathbf{a}_{i,t}\vert\mathbf{s}_{i,t})\hat{A}^\pi_{i,t} $$
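As a quick illustration, here is a minimal sketch (in NumPy, with illustrative array names such as `logprob_grads` and `advantages` that are not tied to any particular library) of how this Monte Carlo estimate is assembled from sampled data:

```python
import numpy as np

# Hypothetical sampled data: N trajectories of length T, policy with D parameters.
# logprob_grads[i, t] holds grad_theta log pi_theta(a_{i,t} | s_{i,t});
# advantages[i, t] holds the advantage estimate A_hat_{i,t}.
N, T, D = 10, 50, 4
rng = np.random.default_rng(0)
logprob_grads = rng.normal(size=(N, T, D))
advantages = rng.normal(size=(N, T))

# Monte Carlo policy gradient: average over trajectories of the sum over time
# of (grad log-prob) * advantage.
policy_gradient = (logprob_grads * advantages[..., None]).sum(axis=1).mean(axis=0)
print(policy_gradient.shape)  # (D,)
```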
1. Policy Gradients as Policy Iteration
Recall that the discounted RL objective function is given by:
$$ J(\theta) = \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\sum_t\gamma^t r(\mathbf{s}_t, \mathbf{a}_t)\right] $$
and that it can be expressed as:
$$ J(\theta) = \mathbb{E}_{\mathbf{s}_0 \sim p(\mathbf{s}_0)}\left[V^{\pi_\theta}(\mathbf{s}_0) \right] $$
where \(\mathbf{s}_0\) is the state we start in. The objective here is to maximize \(J\). Suppose that we initialize our policy with parameters \(\theta\). Our goal is then to find parameters \(\theta'\) such that:
$$ \theta' = \text{argmax}_{\theta'}\left[J(\theta')-J(\theta)\right] $$
Consider:
$$ J(\theta')-J(\theta) = J(\theta')-\mathbb E_{\mathbf{s}_0\sim p(\mathbf{s}_0)} [V^{\pi_\theta}(\mathbf{s}_0)] $$
We may rewrite the expectation in the second term to be under \(p_{\theta'}(\tau)\). This is because \(V^{\pi_\theta}(\mathbf s_0)\) only depends on the initial state \(\mathbf s_0\) (and so is independent of the rest of the trajectory), and the distribution of \(\mathbf s_0\) does not depend on which policy we choose.
$$ \begin{align} J(\theta')-J(\theta) &= J(\theta')-\mathbb E_{\tau\sim p_{\theta'}(\tau)} [V^{\pi_\theta}(\mathbf{s}_0)]\\ &= J(\theta')-\mathbb E_{\tau\sim p_{\theta'}(\tau)} \left[\sum_{t=0}^\infty \gamma^tV^{\pi_\theta}(\mathbf{s}_t)-\sum_{t=1}^\infty \gamma^tV^{\pi_\theta}(\mathbf{s}_t)\right]\\ &= J(\theta')+\mathbb E_{\tau\sim p_{\theta'}(\tau)} \left[\sum_{t=0}^\infty \gamma^t\left(\gamma V^{\pi_\theta}(\mathbf{s}_{t+1})-V^{\pi_\theta}(\mathbf{s}_t)\right)\right]\\ &= \mathbb E_{\tau\sim p_{\theta'}(\tau)}\left[\sum_{t=0}^\infty \gamma^t r(\mathbf{s}_t, \mathbf{a}_t)\right] + \mathbb E_{\tau\sim p_{\theta'}(\tau)} \left[\sum_{t=0}^\infty \gamma^t\left(\gamma V^{\pi_\theta}(\mathbf{s}_{t+1})-V^{\pi_\theta}(\mathbf{s}_t)\right)\right]\\ &= \mathbb E_{\tau\sim p_{\theta'}(\tau)}\left[\sum_{t=0}^\infty \gamma^t\left( r(\mathbf{s}_t, \mathbf{a}_t) + \gamma V^{\pi_\theta}(\mathbf{s}_{t+1})-V^{\pi_\theta}(\mathbf{s}_t)\right)\right]\\ &= \mathbb E_{\tau\sim p_{\theta'}(\tau)}\left[\sum_{t=0}^\infty \gamma^t A^{\pi_\theta}(\mathbf s_t,\mathbf a _t)\right] \end{align} $$
The third step would not be exactly correct if the summation only ran up to a finite \(T\), since we would then be left with an extra term involving \(V^{\pi_\theta}(\mathbf{s}_{T+1})\). However, for sufficiently large \(T\) this term is multiplied by \(\gamma^{T+1}\), which is so small that we can safely ignore it.
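To build intuition for the telescoping step, here is a quick numerical check of the identity \(\sum_{t=0}^{T-1}\gamma^t\left(V(\mathbf s_t)-\gamma V(\mathbf s_{t+1})\right) = V(\mathbf s_0) - \gamma^T V(\mathbf s_T)\), using arbitrary stand-in values for \(V^{\pi_\theta}\):

```python
import numpy as np

# Telescoping identity: sum_t gamma^t (V(s_t) - gamma V(s_{t+1})) = V(s_0) - gamma^T V(s_T),
# so for gamma < 1 and large T the boundary term gamma^T V(s_T) is negligible.
rng = np.random.default_rng(0)
gamma, T = 0.99, 2000
V = rng.normal(size=T + 1)  # arbitrary stand-in values for V^{pi_theta}(s_t)

lhs = sum(gamma**t * (V[t] - gamma * V[t + 1]) for t in range(T))
print(lhs, V[0] - gamma**T * V[T])  # the two agree; gamma^T * V[T] is on the order of 1e-9 here
```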
We may rewrite this expression for \(J(\theta')-J(\theta)\) using importance sampling:
$$ \begin{align} J(\theta')-J(\theta) &= \sum_{t=0}^\infty\mathbb E_{\mathbf s_t\sim p_{\theta'}(\mathbf s_t)}\left[\mathbb E_{\mathbf a_t\sim \pi_{\theta'}(\mathbf a_t\vert \mathbf s_t)}\left[\gamma^t A^{\pi_\theta}(\mathbf s_t,\mathbf a _t)\right]\right]\\ &= \sum_{t=0}^\infty\mathbb E_{\mathbf s_t\sim p_{\theta'}(\mathbf s_t)}\left[ \mathbb E_{\mathbf a_t\sim \pi_\theta(\mathbf a_t\vert \mathbf s_t)} \left[\frac{\pi_{\theta'}(\mathbf a_t\vert \mathbf s_t)}{\pi_\theta(\mathbf a_t\vert \mathbf s_t)} \gamma^t A^{\pi_\theta}(\mathbf s_t,\mathbf a _t)\right]\right] \end{align} $$
Suppose that \(p_{\theta'}(\mathbf s_t) = p_\theta(\mathbf s_t)\). Then we can easily evaluate the expectations in the expression above by sampling trajectories with our current policy \(\pi_\theta\), and choose \(\theta'\) to maximize the resulting expression. If, on the other hand, \(p_{\theta'}(\mathbf s_t) \neq p_\theta(\mathbf s_t)\), which is true in general, the outer expectation becomes harder to evaluate: we cannot sample with the next policy (we do not yet know what it is), nor do we know the state probabilities under it. However, it turns out that we can ignore this distribution mismatch (i.e. assume \(p_{\theta'}(\mathbf s_t) = p_\theta(\mathbf s_t)\)) and bound the error incurred by doing so.
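Under the assumption \(p_{\theta'}(\mathbf s_t) \approx p_\theta(\mathbf s_t)\), the expression above can be estimated purely from samples collected with \(\pi_\theta\). A minimal sketch of such an estimator (function and argument names are illustrative, not a particular library's API):

```python
import numpy as np

def surrogate_objective(logp_new, logp_old, advantages, gamma=0.99):
    """Monte Carlo estimate of sum_t E[(pi_theta'/pi_theta) * gamma^t * A^{pi_theta}],
    using states and actions sampled with the old policy pi_theta.
    logp_new, logp_old, advantages have shape (N, T): log pi_theta'(a_t|s_t),
    log pi_theta(a_t|s_t), and advantage estimates for N trajectories of length T."""
    ratios = np.exp(logp_new - logp_old)          # importance weights pi_theta'/pi_theta
    discounts = gamma ** np.arange(advantages.shape[1])
    return (ratios * discounts * advantages).sum(axis=1).mean()
```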
2. Bounding the Distribution Change
In this section we will prove that \(p_\theta(\mathbf{s}_t)\) is close to \(p_{\theta'}(\mathbf{s}_t)\) when \(\pi_\theta\) is close to \(\pi_{\theta '}\).
A. Simple Case
Let us assume that \(\pi_\theta\) is a deterministic policy, i.e. \(\mathbf{a}_t = \pi_\theta({\mathbf{s}_t})\). We say that \(\pi_{\theta'}\) is close to \(\pi_\theta\) when:
$$ \pi_{\theta'}(\mathbf{a}_t \neq \pi_\theta(\mathbf{s}_t)\vert \mathbf{s}_t) \leq \epsilon $$
i.e. the probability that \(\pi_{\theta'}\) and \(\pi_\theta\) take different actions in some state \(\mathbf{s}_t\) is bounded by \(\epsilon\).
Let us do an analysis similar to the one done for DAgger. Suppose that we run both policies for \(t\) steps. Then with probability \((1-\epsilon)^t\), the two policies took exactly the same action (the probability of taking the same action on a single time step is \((1-\epsilon)\)). For the upper bound case, we assume that once the two policies take different actions in some state, they continue to take different actions for all subsequent states. Similar to the DAgger case we have:
$$ p_{\theta'}(\mathbf{s}_t) = (1-\epsilon)^t p_\theta(\mathbf{s}_t) + (1-(1-\epsilon)^t) p_{\text{mistake}}(\mathbf{s}_t) $$
where \(p_\text{mistake}\) is some (complex) distribution. As before, we can bound the total variation divergence between \(p_{\theta'}\) and \(p_\theta\) by:
$$ \vert p_{\theta'}(\mathbf{s}_t) - p_\theta(\mathbf{s}_t)\vert \leq 2\epsilon t $$
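A quick numerical sanity check of this bound, using the inequality \((1-\epsilon)^t \geq 1-\epsilon t\) that underlies it:

```python
import numpy as np

# From the mixture form, the summed absolute difference between p_theta' and p_theta is
# (1 - (1-eps)^t) * |p_mistake - p_theta| <= 2 * (1 - (1-eps)^t) <= 2*eps*t,
# using (1-eps)^t >= 1 - eps*t. Checking that last step numerically:
eps = 0.05
for t in [1, 5, 20, 100]:
    print(t, 2 * (1 - (1 - eps) ** t), 2 * eps * t)  # first value never exceeds the second
```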
B. General Case
Let \(\pi_\theta\) be some arbitrary distribution. We say \(\pi_{\theta'}\) is close to \(\pi_\theta\) if
$$ \vert \pi_{\theta'}(\mathbf{a}_t\vert \mathbf{s}_t) - \pi_\theta(\mathbf{a}_t\vert \mathbf{s}_t)\vert \leq \epsilon $$
for all \(\mathbf s_t\), i.e. the total variation divergence between \(\pi_{\theta'}\) and \(\pi_\theta\) is bounded by \(\epsilon\) for all states.
Lemma. Suppose \(p_X\) and \(p_Y\) are distributions such that \(\vert p_X(z) - p_Y(z) \vert = \epsilon\). Then there exists some joint distribution \(p(x,y)\) such that its marginals \(p(x)\) and \(p(y)\) are equal to \(p_X(x)\) and \(p_Y(y)\) respectively and \(p(x=y)=1-\epsilon\).
This means that if we assume that \(\pi_{\theta'}\) is close to \(\pi_\theta\) (as per our definition above), then the two policies choose the same action with probability (at least) \(1-\epsilon\). This is the same as what we had in the simple case above. Therefore:
$$ p_{\theta'}(\mathbf{s}_t) = (1-\epsilon)^t p_\theta(\mathbf{s}_t) + (1-(1-\epsilon)^t) p_{\text{mistake}}(\mathbf{s}_t) $$
from which it follows that:
$$ \vert p_{\theta'}(\mathbf{s}_t) - p_\theta(\mathbf{s}_t)\vert \leq 2\epsilon t $$
Note that for any function \(f\) that is always positive:
$$ \begin{align} \mathbb{E}_{\mathbf{s}_t\sim p_{\theta'}(\mathbf{s}_t)} \left[ f(\mathbf{s}_t)\right] &= \sum_{\mathbf{s}_t}p_{\theta'}(\mathbf{s}_t)f(\mathbf{s}_t)\\ &= \sum_{\mathbf{s}_t}\left(p_{\theta}(\mathbf{s}_t)f(\mathbf{s}_t) - (p_{\theta}(\mathbf{s}_t) - p_{\theta'}(\mathbf{s}_t))f(\mathbf{s}_t)\right)\\ &= \sum_{\mathbf{s}_t}p_{\theta}(\mathbf{s}_t)f(\mathbf{s}_t) - \sum_{\mathbf{s}_t}(p_{\theta}(\mathbf{s}_t) - p_{\theta'}(\mathbf{s}_t))f(\mathbf{s}_t)\\ &\geq \sum_{\mathbf{s}_t}p_{\theta}(\mathbf{s}_t)f(\mathbf{s}_t) - \sum_{\mathbf{s}_t}\vert p_{\theta}(\mathbf{s}_t) - p_{\theta'}(\mathbf{s}_t)\vert f(\mathbf{s}_t)\\ &\geq \sum_{\mathbf{s}_t}p_{\theta}(\mathbf{s}_t)f(\mathbf{s}_t) - \max_{\mathbf{s}_t} f(\mathbf{s}_t)\sum_{\mathbf{s}_t}\vert p_{\theta}(\mathbf{s}_t) - p_{\theta'}(\mathbf{s}_t)\vert\\ &\geq \sum_{\mathbf{s}_t}p_{\theta}(\mathbf{s}_t)f(\mathbf{s}_t) -2\epsilon t \max_{\mathbf{s}_t} f(\mathbf{s}_t)\\ &= \mathbb{E}_{\mathbf{s}_t\sim p_{\theta}(\mathbf{s}_t)}\left[f(\mathbf{s}_t)\right] -2\epsilon t \max_{\mathbf{s}_t} f(\mathbf{s}_t) \end{align} $$
Therefore:
$$ \begin{align} \sum_{t=0}^\infty&\mathbb E_{\mathbf s_t\sim p_{\theta'}(\mathbf s_t)}\left[ \mathbb E_{\mathbf a_t\sim \pi_\theta(\mathbf a_t\vert \mathbf s_t)} \left[\frac{\pi_{\theta'}(\mathbf a_t\vert \mathbf s_t)}{\pi_\theta(\mathbf a_t\vert \mathbf s_t)} \gamma^t A^{\pi_\theta}(\mathbf s_t,\mathbf a _t)\right]\right]\\ &\geq \sum_{t=0}^\infty\left(\mathbb E_{\mathbf s_t\sim p_{\theta}(\mathbf s_t)}\left[ \mathbb E_{\mathbf a_t\sim \pi_\theta(\mathbf a_t\vert \mathbf s_t)} \left[\frac{\pi_{\theta'}(\mathbf a_t\vert \mathbf s_t)}{\pi_\theta(\mathbf a_t\vert \mathbf s_t)} \gamma^t A^{\pi_\theta}(\mathbf s_t,\mathbf a _t)\right]\right] -2\epsilon t C\right) \end{align} $$
for some constant \(C\).
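The inequality used here, \(\mathbb E_{p_{\theta'}}[f] \geq \mathbb E_{p_\theta}[f] - \max_{\mathbf s_t} f(\mathbf s_t)\sum_{\mathbf s_t}\vert p_\theta(\mathbf s_t) - p_{\theta'}(\mathbf s_t)\vert\), can also be sanity-checked numerically on small random distributions (illustrative only):

```python
import numpy as np

# Sanity check: E_{q}[f] >= E_{p}[f] - max(f) * sum_s |p(s) - q(s)| for nonnegative f.
rng = np.random.default_rng(1)
for _ in range(5):
    p = rng.random(10); p /= p.sum()      # plays the role of p_theta
    q = rng.random(10); q /= q.sum()      # plays the role of p_theta'
    f = rng.random(10) * 5.0              # nonnegative function of the state
    lhs = (q * f).sum()
    rhs = (p * f).sum() - f.max() * np.abs(p - q).sum()
    assert lhs >= rhs - 1e-12
print("bound holds on random examples")
```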
C. A More Convenient Bound
Instead of dealing with the total variation divergence, it is usually easier to work with the KL divergence. The two are related via Pinsker's inequality:
$$ \vert \pi_{\theta'} (\mathbf{a}_t\vert\mathbf{s}_t)-\pi_{\theta} (\mathbf{a}_t \vert \mathbf{s}_t) \vert\leq \sqrt{\frac{1}{2}D_{KL}\left(\pi_{\theta'} (\mathbf{a}_t\vert \mathbf{s}_t) \vert\vert \pi_{\theta} (\mathbf{a}_t\vert\mathbf{s}_t)\right)} $$
where \(D_{KL}\) - the KL divergence - is given by:
$$ D_{KL}(p_1(x)\vert\vert p_2(x)) = \mathbb{E}_{x\sim p_1(x)}\left[\log\frac{p_1(x)}{p_2(x)}\right] $$
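A small numerical check of this relationship on random categorical distributions. Note that constant conventions differ depending on whether the factor of \(\frac{1}{2}\) is folded into the definition of total variation; the check below uses the standard statement of Pinsker's inequality with \(\text{TV}(p,q) = \frac{1}{2}\sum_x\vert p(x)-q(x)\vert\):

```python
import numpy as np

# Pinsker's inequality: TV(p, q) <= sqrt(0.5 * KL(p || q)),
# with TV(p, q) = 0.5 * sum_x |p(x) - q(x)|.
rng = np.random.default_rng(2)
for _ in range(5):
    p = rng.random(6); p /= p.sum()   # plays the role of pi_theta'
    q = rng.random(6); q /= q.sum()   # plays the role of pi_theta
    kl = np.sum(p * np.log(p / q))
    tv = 0.5 * np.abs(p - q).sum()
    assert tv <= np.sqrt(0.5 * kl) + 1e-12
print("Pinsker's inequality holds on random examples")
```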
Our objective function is thus given by:
$$ \theta' \leftarrow \text{argmax}_{\theta'}\left[\sum_{t=0}^\infty\mathbb E_{\mathbf s_t\sim p_{\theta}(\mathbf s_t)}\left[\mathbb E_{\mathbf a_t\sim \pi_\theta(\mathbf a_t\vert \mathbf s_t)} \left[\frac{\pi_{\theta'}(\mathbf a_t\vert \mathbf s_t)}{\pi_\theta(\mathbf a_t\vert \mathbf s_t)} \gamma^t A^{\pi_\theta}(\mathbf s_t,\mathbf a _t)\right]\right] \right]\\ \\ \text{such that}\; D_{KL}\left(\pi_{\theta'} (\mathbf{a}_t\vert \mathbf{s}_t) \vert\vert \pi_{\theta} (\mathbf{a}_t\vert\mathbf{s}_t)\right) \leq \epsilon $$
For small enough \(\epsilon\) this is guaranteed to improve \(J(\theta')-J(\theta)\).
3. Enforcing the Constraint
One way to optimize the objective function given in the previous section is to write down its Lagrangian:
$$ \mathcal{L}(\theta',\lambda) = \sum_{t=0}^\infty\mathbb E_{\mathbf s_t\sim p_{\theta}(\mathbf s_t)}\left[\mathbb E_{\mathbf a_t\sim \pi_\theta(\mathbf a_t\vert \mathbf s_t)} \left[\frac{\pi_{\theta'}(\mathbf a_t\vert \mathbf s_t)}{\pi_\theta(\mathbf a_t\vert \mathbf s_t)} \gamma^t A^{\pi_\theta}(\mathbf s_t,\mathbf a _t)\right]\right] - \lambda\left(D_{KL}\left(\pi_{\theta'} (\mathbf{a}_t\vert \mathbf{s}_t) \vert\vert \pi_{\theta} (\mathbf{a}_t\vert\mathbf{s}_t)\right)-\epsilon\right) $$
The dual gradient descent algorithm, which we will discuss in a later set of lecture notes, repeatedly performs the following two steps:
- Maximize \(\mathcal{L}(\theta', \lambda)\) with respect to \(\theta'\).
- \(\lambda \leftarrow \lambda + \alpha(D_{KL}\left(\pi_{\theta'} (\mathbf{a}_t\vert \mathbf{s}_t) \vert\vert \pi_{\theta} (\mathbf{a}_t\vert\mathbf{s}_t)\right)-\epsilon)\).
The second step essentially raises \(\lambda\) if the constraint is violated, i.e. if
$$ D_{KL}\left(\pi_{\theta'} (\mathbf{a}_t\vert \mathbf{s}_t) \vert\vert \pi_{\theta} (\mathbf{a}_t\vert\mathbf{s}_t)\right)-\epsilon > 0 $$
and lowers it otherwise.
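A minimal sketch of this alternating procedure, assuming the user supplies hypothetical callables `grad_lagrangian(theta_new, lam)` (the gradient of \(\mathcal{L}\) with respect to \(\theta'\)) and `kl(theta_new)` (the KL term); the inner maximization is shown as a few plain gradient ascent steps:

```python
import numpy as np

def dual_gradient_descent(theta, grad_lagrangian, kl, eps=0.01,
                          lam=1.0, alpha=0.1, lr=1e-2,
                          inner_steps=10, outer_steps=20):
    """Sketch of dual gradient descent for the KL-constrained objective.
    grad_lagrangian(theta_new, lam) and kl(theta_new) are hypothetical callables
    supplied by the user; they are not part of any particular library."""
    theta_new = np.array(theta, dtype=float)
    for _ in range(outer_steps):
        # Step 1: (approximately) maximize L(theta', lambda) with respect to theta'.
        for _ in range(inner_steps):
            theta_new = theta_new + lr * grad_lagrangian(theta_new, lam)
        # Step 2: raise lambda if the KL constraint is violated, lower it otherwise.
        lam = max(0.0, lam + alpha * (kl(theta_new) - eps))
    return theta_new, lam
```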
4. Taylor’s Approximation to the Objective Function
Let \(\bar{A}(\theta')\) denote:
$$ \sum_{t=0}^\infty\mathbb E_{\mathbf s_t\sim p_{\theta}(\mathbf s_t)}\left[\mathbb E_{\mathbf a_t\sim \pi_\theta(\mathbf a_t\vert \mathbf s_t)} \left[\frac{\pi_{\theta'}(\mathbf a_t\vert \mathbf s_t)}{\pi_\theta(\mathbf a_t\vert \mathbf s_t)} \gamma^t A^{\pi_\theta}(\mathbf s_t,\mathbf a _t)\right]\right] $$
Our objective function is thus given as:
$$ \theta^\prime \leftarrow \text{argmax}_{\theta^\prime} \bar{A}(\theta^\prime)\;\; \text{s.t.}\; D_{KL}\left(\pi_{\theta'} (\mathbf{a}_t\vert \mathbf{s}_t) \vert\vert \pi_{\theta} (\mathbf{a}_t\vert\mathbf{s}_t)\right) \leq \epsilon $$
We may use a first order Taylor approximation for \(\bar{A}(\theta^\prime)\) at our old parameter \(\theta\):
$$ \bar{A}(\theta^\prime) \approx \bar{A}(\theta) + \nabla_{\theta}\bar{A}(\theta)(\theta^\prime-\theta) $$
where \(\nabla_{\theta}\bar{A}(\theta)\) denotes the derivative of \(\bar{A}(\theta^\prime)\) with respect to \(\theta^\prime\), evaluated at \(\theta^\prime = \theta\). Note that:
$$ \text{argmax}_{\theta^\prime}\left[\bar{A}(\theta) + \nabla_{\theta}\bar{A}(\theta)(\theta^\prime-\theta)\right] = \text{argmax}_{\theta^\prime}\left[\nabla_{\theta}\bar{A}(\theta)(\theta^\prime-\theta)\right] $$
Also, we have:
$$ \begin{align} \nabla_{\theta^\prime} \bar{A}(\theta^\prime) &= \nabla_{\theta^\prime}\sum_{t=0}^\infty\int p_{\theta}(\mathbf s_t)\int\pi_\theta(\mathbf a_t\vert \mathbf s_t) \frac{\pi_{\theta'}(\mathbf a_t\vert \mathbf s_t)}{\pi_\theta(\mathbf a_t\vert \mathbf s_t)} \gamma^t A^{\pi_\theta}(\mathbf s_t,\mathbf a _t)\, d\mathbf a_t\, d\mathbf s_t\\ &= \sum_{t=0}^\infty\int p_{\theta}(\mathbf s_t)\int\pi_\theta(\mathbf a_t\vert \mathbf s_t) \frac{\nabla_{\theta^\prime}\pi_{\theta'}(\mathbf a_t\vert \mathbf s_t)}{\pi_\theta(\mathbf a_t\vert \mathbf s_t)} \gamma^t A^{\pi_\theta}(\mathbf s_t,\mathbf a _t)\, d\mathbf a_t\, d\mathbf s_t\\ &= \sum_{t=0}^\infty\int p_{\theta}(\mathbf s_t)\int\pi_\theta(\mathbf a_t\vert \mathbf s_t) \frac{\pi_{\theta'}(\mathbf a_t\vert \mathbf s_t)}{\pi_\theta(\mathbf a_t\vert \mathbf s_t)}\nabla_{\theta^\prime}\log\pi_{\theta'}(\mathbf a_t\vert \mathbf s_t) \gamma^t A^{\pi_\theta}(\mathbf s_t,\mathbf a _t)\, d\mathbf a_t\, d\mathbf s_t\\ &=\sum_{t=0}^\infty\mathbb E_{\mathbf s_t\sim p_{\theta}(\mathbf s_t)}\left[\mathbb E_{\mathbf a_t\sim \pi_\theta(\mathbf a_t\vert \mathbf s_t)}\left[\frac{\pi_{\theta'}(\mathbf a_t\vert \mathbf s_t)}{\pi_\theta(\mathbf a_t\vert \mathbf s_t)}\nabla_{\theta^\prime}\log\pi_{\theta'}(\mathbf a_t\vert \mathbf s_t) \gamma^t A^{\pi_\theta}(\mathbf s_t,\mathbf a _t)\right]\right] \end{align} $$
Therefore, evaluating at \(\theta^\prime = \theta\) (where the importance weight \(\pi_{\theta'}/\pi_\theta\) becomes \(1\)):
$$ \nabla_\theta\bar{A}(\theta) = \sum_{t=0}^\infty\mathbb E_{\mathbf s_t\sim p_{\theta}(\mathbf s_t)}\left[\mathbb E_{\mathbf a_t\sim \pi_\theta(\mathbf a_t\vert \mathbf s_t)}\left[\nabla_{\theta}\log\pi_{\theta}(\mathbf a_t\vert \mathbf s_t) \gamma^t A^{\pi_\theta}(\mathbf s_t,\mathbf a _t)\right]\right] $$
which is exactly the normal policy gradient \(\nabla_\theta J(\theta)\). The approximated objective function is thus given by:
$$ \theta^\prime \leftarrow \text{argmax}_{\theta^\prime} \nabla_\theta J(\theta)(\theta^\prime-\theta)\;\; \text{s.t.}\; D_{KL}\left(\pi_{\theta'} (\mathbf{a}_t\vert \mathbf{s}_t) \vert\vert \pi_{\theta} (\mathbf{a}_t\vert\mathbf{s}_t)\right) \leq \epsilon $$
It can be shown that gradient ascent optimizes the following objective:
$$ \theta^\prime \leftarrow \text{argmax}_{\theta^\prime} \nabla_\theta J(\theta)(\theta^\prime-\theta)\;\; \text{s.t.}\; \vert\vert \theta-\theta^\prime\vert\vert \leq \epsilon $$
i.e. it constrains \(\theta^\prime\) to be within \(\epsilon\) distance of \(\theta\) in Euclidean space.
Note that the approximated objective and plain gradient ascent optimize exactly the same linearized objective; they differ only in the constraint they impose on \(\theta^\prime\). However, it turns out that the second order Taylor expansion of \(D_{KL}\) is given by:
$$ D_{KL}\left(\pi_{\theta'} (\mathbf{a}_t\vert \mathbf{s}_t) \vert\vert \pi_{\theta} (\mathbf{a}_t\vert\mathbf{s}_t)\right) \approx \frac{1}{2}(\theta^\prime-\theta)^T\mathbf{F}(\theta^\prime-\theta) $$
where \(\mathbf{F}\) is the Fisher-information matrix given by:
$$ \mathbf{F} = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta\log\pi_\theta(\mathbf{a} \vert\mathbf{s})\nabla_\theta\log\pi_\theta(\mathbf{a}\vert\mathbf{s})^T\right] $$
Note that the expectation in \(\mathbf{F}\) can be estimated with samples. Expressing \(D_{KL}\) in this way allows us to write the solution of the approximated objective as a gradient-ascent-style update:
$$ \theta^\prime = \theta + \alpha \mathbf F^{-1}\nabla_\theta J(\theta) $$
where
$$ \alpha = \sqrt{\frac{2\epsilon}{\nabla_\theta J(\theta)^T\mathbf F^{-1} \nabla_\theta J(\theta)}} $$
This update is referred to as the natural gradient, partly because it performs gradient ascent in probability space, i.e. \(\theta^\prime\) is constrained such that \(\pi_{\theta^\prime}\) stays within \(\epsilon\) “distance” (measured by the KL divergence) of \(\pi_\theta\).
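A minimal sketch of this update, assuming we already have an estimate of \(\nabla_\theta J(\theta)\) and a batch of sampled score vectors \(\nabla_\theta\log\pi_\theta(\mathbf a\vert\mathbf s)\) with which to estimate \(\mathbf F\). The `damping` term is a common practical addition, not part of the derivation above:

```python
import numpy as np

def natural_gradient_step(theta, grad_J, score_samples, eps=0.01, damping=1e-3):
    """Sketch of the natural gradient update.
    grad_J: estimate of grad_theta J(theta), shape (D,).
    score_samples: array of shape (M, D) with sampled grad_theta log pi_theta(a|s),
    used to estimate the Fisher information matrix F = E[score score^T].
    (Hypothetical helper names; damping stabilizes the matrix solve in practice.)"""
    D = theta.shape[0]
    F = score_samples.T @ score_samples / score_samples.shape[0] + damping * np.eye(D)
    nat_grad = np.linalg.solve(F, grad_J)               # F^{-1} grad_J
    alpha = np.sqrt(2.0 * eps / (grad_J @ nat_grad))    # makes the quadratic KL model equal eps
    return theta + alpha * nat_grad
```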