1. Evaluating the RL Objective Function

Recall that the RL objective function is given by:

$$ \theta^* = \underset{\theta}{\text{argmax}}\; \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\sum_t r(\mathbf{s}_t,\mathbf{a}_t)\right] $$

Define:

$$ J(\theta) = \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\sum_t r(\mathbf{s}_t,\mathbf{a}_t)\right] $$

We may approximate \(J(\theta)\) by running our policy \(\pi_\theta\) and averaging over the total rewards:

$$ J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_t r(\mathbf{s}_t^{(i)}, \mathbf{a}_t^{(i)}) $$

We will make use of this approximation later.
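As a concrete sketch of this estimator, the snippet below averages total rewards over \(N\) rollouts. The `env` object with a Gym-style `reset`/`step` interface and the `policy` callable are hypothetical stand-ins used only for illustration:

```python
import numpy as np

def estimate_objective(env, policy, N=100, T=200):
    """Monte Carlo estimate of J(theta): the average total reward over N rollouts."""
    returns = []
    for _ in range(N):
        s = env.reset()
        total_reward = 0.0
        for _ in range(T):
            a = policy(s)                # sample a_t ~ pi_theta(a_t | s_t)
            s, r, done = env.step(a)     # assumed (next_state, reward, done) interface
            total_reward += r
            if done:
                break
        returns.append(total_reward)
    return np.mean(returns)              # (1/N) sum_i sum_t r(s_t^(i), a_t^(i))
```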

2. Direct Policy Differentiation

We may find \(\theta^*\) by directly taking the gradient of \(J(\theta)\) and then repeatedly applying:

$$ \theta := \theta + \alpha\nabla_\theta J(\theta) $$

which is just (stochastic) gradient ascent, since we are maximizing \(J(\theta)\).

Let \(r(\tau)\) denote \(\sum_t r(\mathbf{s}_t,\mathbf{a}_t)\). Then:

$$ \begin{align} \nabla_\theta J(\theta) &= \nabla_\theta \int_\tau p_\theta(\tau) r(\tau)d\tau\\ &= \int_\tau \nabla_\theta p_\theta(\tau) r(\tau)d\tau\\ &= \int_\tau p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) r(\tau)d\tau\\ &= \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau) r(\tau)\right] \end{align} $$

where the third line follows from the identity:

$$ \begin{align} p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) &= p_\theta(\tau) \left(\frac{1}{p_\theta(\tau)}\nabla_\theta p_\theta(\tau)\right)\\ &= \nabla_\theta p_\theta(\tau) \end{align} $$

Recall that:

$$ p_\theta(\tau) = p(\mathbf{s}_1)\prod_t \pi_\theta(\mathbf{a}_t\vert\mathbf{s}_t)p(\mathbf{s}_{t+1}\vert \mathbf{s}_t, \mathbf{a}_t) $$

Therefore:

$$ \log p_\theta(\tau) = \log p(\mathbf{s}_1)+\sum_t\left( \log\pi_\theta(\mathbf{a}_t\vert\mathbf{s}_t)+\log p(\mathbf{s}_{t+1}\vert \mathbf{s}_t, \mathbf{a}_t)\right) $$

As the initial-state term and the transition terms do not depend on \(\theta\), we have:

$$ \nabla_\theta \log p_\theta(\tau) = \sum_t\nabla_\theta\log\pi_\theta(\mathbf{a}_t \vert\mathbf{s}_t) $$

Hence:

$$ \nabla_\theta J(\theta) =\mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\left(\sum_t\nabla_\theta\log\pi_\theta(\mathbf{a}_t\vert \mathbf{s}_t)\right)\left(\sum_t r(\mathbf{s}_t,\mathbf{a}_t)\right)\right] $$

which can be approximated as:

$$ \nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \left(\sum_t\nabla_\theta \log\pi_\theta(\mathbf{a}_t^{(i)}\vert \mathbf{s}_t^{(i)})\right) \left(\sum_t r(\mathbf{s}_t^{(i)}, \mathbf{a}_t^{(i)})\right) $$

The REINFORCE algorithm works as follows (a minimal sketch of one iteration is given after the list):

  1. Sample \(\{\tau^{(i)}\}\) from \(\pi_\theta(\mathbf{a}_t\vert\mathbf{s}_t)\)
  2. Estimate \(\nabla_\theta J(\theta)\) using the approximation above for the sampled \(\{\tau^{(i)}\}\)
  3. Perform the update \(\theta := \theta+\alpha\nabla_\theta J(\theta)\)
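Below is a minimal NumPy sketch of one such iteration. The helpers `sample_trajectory` and `grad_log_pi` are hypothetical callables standing in for the environment rollout and the policy parameterization; they are not part of the notes:

```python
import numpy as np

def reinforce_update(theta, sample_trajectory, grad_log_pi, alpha=1e-2, N=50):
    """One REINFORCE iteration following the three steps above.

    sample_trajectory(theta) -> (states, actions, rewards) for one rollout,
    grad_log_pi(theta, s, a) -> grad_theta log pi_theta(a|s).
    """
    grad_J = np.zeros_like(theta)
    for _ in range(N):                                   # step 1: sample trajectories
        states, actions, rewards = sample_trajectory(theta)
        sum_grad_log = sum(grad_log_pi(theta, s, a) for s, a in zip(states, actions))
        grad_J += sum_grad_log * np.sum(rewards)         # step 2: estimate grad J
    grad_J /= N
    return theta + alpha * grad_J                        # step 3: gradient ascent
```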

Interestingly, compare this with maximum likelihood estimation, where the gradient of the log-likelihood of the sampled trajectories is:

$$ \nabla_\theta J_{\text{ML}}(\theta) \approx \frac{1}{N}\sum_{i=1}^N \left(\sum_t\nabla_\theta \log\pi_\theta(\mathbf{a}_t^{(i)}\vert \mathbf{s}_t^{(i)})\right) $$

In direct policy differentiation we instead weight each sample's gradient with its total reward. Thus, we are essentially favoring gradients coming from samples with higher rewards over those with lower rewards. This simply formalizes the notion of trial and error: run different trials, then make the good ones more likely and the bad ones less likely.

As an example of direct policy differentiation, consider a Gaussian policy:

$$ \pi_\theta(\mathbf{a}_t\vert\mathbf{s}_t) = \mathcal{N}\left(f_{\text{neural network}}(\mathbf{s}_t),\sigma^2\right) $$

Here a neural network outputs the mean of a Gaussian distribution and \(\sigma^2\) is some fixed variance. Then:

$$ \log \pi_\theta(\mathbf{a}_t\vert\mathbf{s}_t) = -\frac{1}{2\sigma^2}( f(\mathbf{s}_t)-\mathbf{a}_t)^2 + \text{const} $$

And:

$$ \nabla_\theta \log \pi_\theta(\mathbf{a}_t\vert\mathbf{s}_t) = -\frac{1}{\sigma^2} \left(f(\mathbf{s}_t)-\mathbf{a}_t\right)\frac{df}{d\theta} $$
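The same gradient can be obtained with automatic differentiation. Below is a minimal PyTorch sketch; the small network standing in for \(f_{\text{neural network}}\) and the state/action values are arbitrary choices for illustration:

```python
import torch

# Stand-in for f_neural_network: maps a 4-dimensional state to a scalar mean.
f = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
sigma = 0.5                          # fixed standard deviation (variance sigma^2)

s_t = torch.randn(4)                 # an arbitrary state
a_t = torch.randn(1)                 # an arbitrary action

mean = f(s_t)                        # f(s_t): the Gaussian mean
log_prob = torch.distributions.Normal(mean, sigma).log_prob(a_t).sum()
log_prob.backward()                  # populates grad log pi_theta(a_t | s_t)

# Each parameter p of f now holds p.grad = -(1/sigma^2) (f(s_t) - a_t) * df/dp,
# matching the analytic expression above.
```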

Finally, note that in deriving the policy gradients the only time we made use of the Markov assumption was when we replaced \(p(\mathbf{s}_{t+1}\vert \mathbf{s}_1,\mathbf{a}_1,\ldots, \mathbf{s}_t,\mathbf{a}_t)\) with \(p(\mathbf{s}_{t+1} \vert \mathbf{s}_t,\mathbf{a}_t)\) when showing that:

$$ p_\theta(\tau) = p(\mathbf{s}_1)\prod_t \pi_\theta(\mathbf{a}_t\vert\mathbf{s}_t)p(\mathbf{s}_{t+1}\vert \mathbf{s}_t, \mathbf{a}_t) $$

However, as the transition terms drop out when taking the gradient of \(\log p_\theta(\tau)\), the policy gradient is essentially independent of the Markov assumption. Hence, in the case of partial observability we can simply replace the states \(\mathbf{s}_t\) with the observations \(\mathbf{o}_t\) (recall that observations, unlike states, are not Markovian):

$$ \nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \left(\sum_t\nabla_\theta \log\pi_\theta(\mathbf{a}_t^{(i)}\vert \mathbf{o}_t^{(i)})\right) \left(\sum_t r(\mathbf{o}_t^{(i)}, \mathbf{a}_t^{(i)})\right) $$

3. Reducing Variance

One major problem with policy gradients is high variance. The gradient is estimated by repeatedly sampling trajectories under the current policy, and each sampled trajectory can follow a very different path depending on the states and actions encountered along the way. The per-trajectory gradient estimates therefore differ substantially from one another, resulting in a high-variance estimator.

A. Exploiting Causality

Let us rewrite the policy gradients as:

$$ \nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T\nabla_\theta \log\pi_\theta(\mathbf{a}_t^{(i)}\vert \mathbf{s}_t^{(i)})\left(\sum_{t'=1}^T r(\mathbf{s}_{t'}^{(i)}, \mathbf{a}_{t'}^{(i)})\right) $$

One way to reduce variance is to make use of causality, which just means that future actions and states cannot influence past ones:

$$ \nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T\nabla_\theta \log\pi_\theta(\mathbf{a}_t^{(i)}\vert \mathbf{s}_t^{(i)})\left(\sum_{t'=t}^T r(\mathbf{s}_{t'}^{(i)}, \mathbf{a}_{t'}^{(i)})\right) $$

We call \(\sum_{t'=t}^T r(\mathbf{s}_{t'}^{(i)}, \mathbf{a}_{t'}^{(i)})\) the reward-to-go. Because each gradient term is now multiplied by a sum of fewer reward terms, the quantity being averaged has smaller magnitude, which reduces the variance of the estimator.
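A minimal sketch of computing the reward-to-go for a single trajectory (no discounting, as in the notes):

```python
import numpy as np

def rewards_to_go(rewards):
    """Compute sum_{t'=t}^{T} r_{t'} for every timestep t of one trajectory."""
    # Reverse cumulative sum: flip, cumsum, flip back.
    return np.flip(np.cumsum(np.flip(np.asarray(rewards, dtype=float)))).copy()

# Example: rewards [1, 2, 3] -> rewards-to-go [6, 5, 3].
```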

B. Baselines

Suppose that we subtract a constant baseline \(b\) from the rewards in our policy gradients expression:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta (\tau)(r(\tau)-b)\right] $$

Note that doing so keeps the gradient estimator unbiased, because the subtracted term has zero expectation:

$$ \begin{align} \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta (\tau)b\right] &= \int_\tau p_\theta(\tau)\nabla_\theta \log p_\theta(\tau)b\;d\tau\\ &= b\int_\tau \nabla_\theta p_\theta (\tau)\;d\tau\\ &= b\nabla_\theta\int_\tau p_\theta (\tau)\;d\tau\\ &= b\nabla_\theta(1)\\ &= 0 \end{align} $$

One common choice of \(b\) which works pretty well is the average reward:

$$ b = \frac{1}{N}\sum_{i=1}^N r(\tau^{(i)}) $$

This choice of \(b\) gives better-than-average samples a positive weight \(r(\tau)-b\) and worse-than-average ones a negative weight.

Suppose that we do not subtract a baseline. Then it may happen that all samples have a positive total reward. The good samples will have a larger reward than the bad ones, but because every reward is positive, training will push the policy to make all of the sampled trajectories more probable (albeit to different extents), including the bad ones. This is clearly undesirable.

Note that subtracting a baseline makes \(\nabla_\theta \log p_\theta (\tau)(r(\tau)-b)\) smaller in magnitude, which typically leads to lower variance.
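As a small sketch of the average-reward baseline, the per-trajectory returns are centered by their batch mean before they multiply the summed log-probability gradients (the function and variable names here are illustrative):

```python
import numpy as np

def centered_returns(trajectory_returns):
    """Subtract the batch-average baseline b from the per-trajectory returns r(tau^(i))."""
    returns = np.asarray(trajectory_returns, dtype=float)
    b = returns.mean()                # b = (1/N) sum_i r(tau^(i))
    return returns - b                # weights used in place of the raw returns

# Example: returns [10.0, 12.0, 8.0] -> centered weights [0.0, 2.0, -2.0].
```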

C. The Optimal Baseline

Let us derive the optimal baseline. The variance of a random variable \(X\) is given by:

$$ \text{Var}[X] = \mathbb{E}[X^2]-\mathbb{E}[X]^2 $$

Therefore:

$$ \begin{align} \text{Var} &= \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\left(\nabla_\theta\log p_\theta(\tau) (r(\tau)-b)\right)^2\right] - \mathbb{E}_{\tau\sim p_\theta(\tau)} \left[ \nabla_\theta\log p_\theta(\tau) (r(\tau)-b)\right]^2\\ &= \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\left(\nabla_\theta\log p_\theta(\tau) (r(\tau)-b)\right)^2\right] - \mathbb{E}_{\tau\sim p_\theta(\tau)} \left[ \nabla_\theta\log p_\theta(\tau) r(\tau)\right]^2\\ &= \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\left(\nabla_\theta\log p_\theta(\tau) r(\tau)\right)^2 -2\left(\nabla_\theta\log p_\theta(\tau)\right)^2 r(\tau)b + \left(\nabla_\theta\log p_\theta(\tau) b\right)^2\right] - \mathbb{E}_{\tau\sim p_\theta(\tau)} \left[ \nabla_\theta\log p_\theta(\tau) r(\tau)\right]^2\\ \end{align} $$

Here the second equality uses the fact, shown above, that subtracting the baseline does not change the expectation. We need to minimize \(\text{Var}\) with respect to \(b\). Taking its derivative with respect to \(b\) and setting the result to \(0\) gives:

$$ \begin{align} 0&=\nabla_b\;\mathbb{E}_{\tau\sim p_\theta(\tau)}\left[-2\left(\nabla_\theta\log p_\theta(\tau)\right)^2 r(\tau)b + \left(\nabla_\theta\log p_\theta(\tau) b\right)^2\right] \\ 0&=-2\mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\left(\nabla_\theta\log p_\theta(\tau)\right)^2 r(\tau)\right] + 2b\mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\left(\nabla_\theta\log p_\theta(\tau) \right)^2\right]\\ b &= \frac{\mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\left(\nabla_\theta\log p_\theta(\tau)\right)^2 r(\tau)\right]}{\mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\left(\nabla_\theta\log p_\theta(\tau) \right)^2\right]} \end{align} $$

So the optimal (constant) baseline is just the expected reward weighted by squared gradient magnitudes.
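A sketch of estimating this baseline from a batch of sampled trajectories. The derivation above treats the gradient as a scalar; the version below applies the same formula independently to each gradient dimension, a common refinement rather than something derived in these notes:

```python
import numpy as np

def optimal_baseline(grad_log_probs, returns):
    """Estimate the optimal baseline from samples, one value per gradient dimension.

    grad_log_probs: array of shape (N, P) with grad_theta log p_theta(tau^(i))
    returns:        array of shape (N,)   with r(tau^(i))
    """
    g2 = grad_log_probs ** 2                        # (grad log p)^2, elementwise
    num = (g2 * returns[:, None]).mean(axis=0)      # E[(grad log p)^2 r(tau)]
    den = g2.mean(axis=0)                           # E[(grad log p)^2]
    return num / den
```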

4. Policy Gradient Is On-Policy

Consider the policy gradient expression:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta (\tau)r(\tau)\right] $$

The problem here is that the expectation is taken under the current policy's trajectory distribution \(p_\theta(\tau)\). Therefore, even a slight change in the policy means that we have to generate new samples. Since neural networks require a large number of iterations to converge and the policy only changes a little between successive iterations, having to resample after every iteration is quite inefficient and computationally expensive.

5. Off-Policy Learning

A. Importance Sampling

Suppose that we have some random variable \(x\). Then:

$$ \begin{align} \mathbb{E}_{x\sim p(x)}\left[f(x)\right] &= \int p(x)f(x)dx\\ &= \int q(x)\frac{p(x)}{q(x)}f(x)dx\\ &= \mathbb{E}_{x\sim q(x)}\left[\frac{p(x)}{q(x)}f(x)\right] \end{align} $$

This is known as importance sampling (IS).
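A small numerical sketch of importance sampling: we estimate \(\mathbb{E}_{x\sim p}[x^2]\) for \(p=\mathcal{N}(1,1)\) using samples drawn from \(q=\mathcal{N}(0,1)\). The distributions and test function are arbitrary choices for illustration; the true value is \(2\):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal_pdf(x, mu):
    """Log density of N(mu, 1)."""
    return -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)

x = rng.normal(0.0, 1.0, size=100_000)                         # x ~ q = N(0, 1)
w = np.exp(log_normal_pdf(x, 1.0) - log_normal_pdf(x, 0.0))    # p(x) / q(x)
estimate = np.mean(w * x ** 2)                                 # approx E_{x~p}[x^2] = 2
```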

B. Deriving the Policy Gradient with Importance Sampling

We now return to the problem with the on-policy policy gradient described above. Suppose that we do not wish to collect new samples each time the policy is modified a little. Let \(\bar{p}_\theta(\tau)\) denote the trajectory distribution induced by the older policy \(\bar{\pi}_\theta\) that was used to generate the samples. Then:

$$ J(\theta) = \mathbb{E}_{\tau\sim \bar{p}_\theta(\tau)}\left[\frac{p_\theta(\tau)}{\bar{p}_\theta(\tau)}r(\tau)\right] $$

Recall that:

$$ p_\theta(\tau) = p(\mathbf{s}_1)\prod_t\pi_\theta(\mathbf{a}_t\vert\mathbf{s}_t) p(\mathbf{s}_{t+1}\vert\mathbf{s}_t,\mathbf{a}_t) $$

Therefore:

$$ \begin{align} \frac{p_\theta(\tau)}{\bar{p}_\theta(\tau)} &= \frac{p(\mathbf{s}_1)\prod_t\pi_\theta(\mathbf{a}_t\vert\mathbf{s}_t) p(\mathbf{s}_{t+1}\vert\mathbf{s}_t,\mathbf{a}_t)}{p(\mathbf{s}_1)\prod_t\bar{\pi}_\theta(\mathbf{a}_t\vert\mathbf{s}_t) p(\mathbf{s}_{t+1}\vert\mathbf{s}_t,\mathbf{a}_t)}\\ &=\frac{\prod_t\pi_\theta(\mathbf{a}_t\vert\mathbf{s}_t)}{\prod_t\bar{\pi}_\theta (\mathbf{a}_t\vert\mathbf{s}_t)} \end{align} $$

Taking the gradient of \(J(\theta)\) with respect to the parameters \(\theta\) of the new distribution \(p_\theta\) (and not those of the old distribution \(\bar{p}_\theta\)) gives:

$$ \begin{align} \nabla_\theta J(\theta) &= \nabla_\theta\mathbb{E}_{\tau \sim \bar{p}_\theta(\tau)}\left[\frac{p_\theta(\tau)}{\bar{p}_\theta(\tau)}r(\tau)\right]\\ &= \mathbb{E}_{\tau \sim \bar{p}_\theta(\tau)}\left[\frac{\nabla_\theta p_\theta(\tau)}{\bar{p}_\theta(\tau)}r(\tau)\right]\\ &= \mathbb{E}_{\tau \sim \bar{p}_\theta(\tau)}\left[\frac{p_\theta(\tau)}{\bar{p}_\theta(\tau)}\nabla_\theta \log p_\theta(\tau)r(\tau)\right]\\ &= \mathbb{E}_{\tau \sim \bar{p}_\theta(\tau)}\left[ \left(\prod_{t=1}^T \frac{\pi_\theta(\mathbf{a}_t\vert\mathbf{s}_t)}{\bar{\pi}_\theta(\mathbf{a}_t \vert\mathbf{s}_t)} \right)\left(\sum_{t=1}^T\nabla_\theta \log \pi_\theta(\mathbf{a}_t \vert \mathbf{s}_t)\right) \left(\sum_{t=1}^T r(\mathbf{s}_t, \mathbf{a}_t)\right)\right]\\ \end{align} $$

We may make use of causality here:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \bar{p}_\theta(\tau)}\left[\sum_{t=1}^T\nabla_\theta \log \pi_\theta(\mathbf{a}_t\vert \mathbf{s}_t)\left(\prod_{t'=1}^t \frac{\pi_\theta (\mathbf{a}_{t'} \vert\mathbf{s}_{t'})}{\bar{\pi}_\theta(\mathbf{a}_{t'} \vert \mathbf{s}_{t'})} \right) \left(\sum_{t'=t}^T r(\mathbf{s}_{t'}, \mathbf{a}_{t'})\left(\prod_{t''=t}^{t'} \frac{\pi_\theta(\mathbf{a}_{t''}\vert\mathbf{s}_{t''})}{\bar{\pi}_\theta(\mathbf{a}_{t''} \vert \mathbf{s}_{t''})} \right) \right)\right] $$

One problem with the expression above is that the products of importance ratios \(\pi_\theta(\mathbf{a}_t\vert\mathbf{s}_t)/\bar{\pi}_\theta(\mathbf{a}_t\vert\mathbf{s}_t)\) are exponential in the horizon: if the ratios are consistently smaller or larger than one, the products can vanish or explode. One way to mitigate this is to remove the right-most product term:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \bar{p}_\theta(\tau)}\left[\sum_{t=1}^T\nabla_\theta \log \pi_\theta(\mathbf{a}_t\vert \mathbf{s}_t)\left(\prod_{t'=1}^t \frac{\pi_\theta (\mathbf{a}_{t'} \vert\mathbf{s}_{t'})}{\bar{\pi}_\theta(\mathbf{a}_{t'} \vert \mathbf{s}_{t'})} \right) \left(\sum_{t'=t}^T r(\mathbf{s}_{t'}, \mathbf{a}_{t'})\right)\right] $$

While doing so no longer gives us the policy gradients, the resulting form is used by a policy iteration algorithm that we will discuss in a later lecture.
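For completeness, the per-timestep products of importance ratios appearing above can be computed in log space as a cumulative sum. This is numerically safer, but the weights still grow or shrink exponentially with \(t\), which is exactly the issue noted above. The function below is an illustrative sketch:

```python
import numpy as np

def cumulative_is_weights(logp_new, logp_old):
    """Per-timestep products prod_{t'=1}^{t} pi_theta(a_{t'}|s_{t'}) / pi_bar(a_{t'}|s_{t'}).

    logp_new, logp_old: length-T arrays with log-probabilities of the taken actions
    under the new policy pi_theta and the old policy pi_bar, respectively.
    """
    log_ratios = np.asarray(logp_new) - np.asarray(logp_old)
    return np.exp(np.cumsum(log_ratios))   # exponential in t: can vanish or explode
```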

C. Preview: A First Order Approximation for Importance Sampling

Recall that the objective function may be written as:

$$ J(\theta) = \sum_{t=1}^T \mathbb{E}_{(\mathbf{s}_t,\mathbf{a}_t)\sim p_\theta(\mathbf{s}_t,\mathbf{a}_t)} \left[ r(\mathbf{s}_t,\mathbf{a}_t)\right] $$

Note that:

$$ \begin{align} J(\theta) &= \sum_{t=1}^T \int\int p_\theta(\mathbf{s}_t,\mathbf{a}_t)r (\mathbf{s}_t, \mathbf{a}_t) d\mathbf{s}_td\mathbf{a}_t\\ &= \sum_{t=1}^T \int\int p_\theta(\mathbf{s}_t)\pi_\theta(\mathbf{a}_t \vert \mathbf{s}_t)r (\mathbf{s}_t, \mathbf{a}_t) d\mathbf{s}_td\mathbf{a}_t\\ &= \sum_{t=1}^T \mathbb{E}_{\mathbf{s}_t\sim p_\theta(\mathbf{s}_t)}\left[ \mathbb{E}_{\mathbf{a}_t \sim \pi_\theta(\mathbf{a}_t \vert \mathbf{s}_t)}\left[r (\mathbf{s}_t, \mathbf{a}_t)\right]\right] \end{align} $$

Using importance sampling gives:

$$ J(\theta) = \sum_{t=1}^T \mathbb{E}_{\mathbf{s}_t\sim \bar{p}(\mathbf{s}_t)}\left[ \frac{p_\theta(\mathbf{s}_t)}{\bar{p}(\mathbf{s}_t)} \mathbb{E}_{\mathbf{a}_t \sim \bar{\pi}_\theta(\mathbf{a}_t \vert \mathbf{s}_t)}\left[\frac{\pi_\theta(\mathbf{a}_t \vert \mathbf{s}_t)}{\bar{\pi}_\theta(\mathbf{a}_t \vert \mathbf{s}_t)} r(\mathbf{s}_t,\mathbf{a}_t)\right]\right] $$

As we’ll see in a later lecture, we can ignore the state-marginal ratio \(p_\theta(\mathbf{s}_t)/\bar{p}(\mathbf{s}_t)\) and bound the error incurred by doing so.

6. Policy Gradients With Automatic Differentiation

Recall that the policy gradient is given by:

$$ \nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \left(\sum_t\nabla_\theta \log\pi_\theta(\mathbf{a}_t^{(i)}\vert \mathbf{s}_t^{(i)})\right) \left(\sum_t r(\mathbf{s}_t^{(i)}, \mathbf{a}_t^{(i)})\right) $$

We can simply construct a graph that implements the following ‘pseudo-loss’ function:

$$ J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \left(\sum_t \log \pi_\theta (\mathbf{a}_t^{(i)}\vert \mathbf{s}_t^{(i)})\right) \left(\sum_t r(\mathbf{s}_t^{(i)}, \mathbf{a}_t^{(i)})\right) $$

The log-probability may be computed using a categorical (cross-entropy) likelihood in the case of discrete actions or a Gaussian likelihood if the actions are continuous. One important detail is that the rewards must be treated as constants by the automatic differentiation package, i.e. gradients should not flow through the rewards.
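A minimal PyTorch sketch of this pseudo-loss for discrete actions; the `policy_net` that maps states to action logits and the single-trajectory tensor shapes are assumptions for illustration. Maximizing this quantity (e.g. by minimizing its negative with an optimizer) yields the policy gradient update:

```python
import torch

def pseudo_loss(policy_net, states, actions, rewards):
    """Surrogate objective whose gradient equals the policy gradient estimate
    for one sampled trajectory.

    states:  (T, state_dim) float tensor
    actions: (T,) long tensor
    rewards: (T,) float tensor
    policy_net(states) is assumed to return action logits of shape (T, num_actions).
    """
    logits = policy_net(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    total_reward = rewards.sum().detach()   # constant: no gradient flows through rewards
    return log_probs.sum() * total_reward   # average over N trajectories in practice
```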

7. Policy Gradients in Practice

In practice, the gradient estimates have very high variance (i.e. they are very noisy). One way to mitigate this is to use larger batches. Tuning learning rates can also be very hard for policy gradients; adaptive step-size rules such as Adam can work, and later lectures will cover learning-rate adjustment methods designed specifically for policy gradients.