Lecture 6
1. A Better Reward-To-Go?
Recall that our policy gradients are given by:

\[\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta \log \pi_\theta(\mathbf{a}_t^{(i)}\vert\mathbf{s}_t^{(i)})\left(\sum_{t'=t}^T r(\mathbf{s}_{t'}^{(i)}, \mathbf{a}_{t'}^{(i)})\right)\]
The sum \(\sum_{t'=t}^T r(\mathbf{s}_{t'}^{(i)}, \mathbf{a}_{t'}^{(i)})\) is known as the reward-to-go: the sum of rewards from the current time step to the end of the trajectory. Note that this uses only the current trajectory to estimate the reward obtained by taking action \(\mathbf{a}_t\) in state \(\mathbf{s}_t\) and then following the policy \(\pi\). However, because of the stochasticity of the system, we may get a different trajectory each time we take the same action \(\mathbf{a}_t\) in state \(\mathbf{s}_t\); what we really need is the expected reward. Now, while an estimate of an expectation using a single sample is unbiased (in expectation we get the right answer), it has high variance [1].
Recall that we had earlier defined the Q-function as:

\[Q^\pi(\mathbf{s}_t, \mathbf{a}_t) = \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}\left[r(\mathbf{s}_{t'}, \mathbf{a}_{t'})\,\vert\, \mathbf{s}_t, \mathbf{a}_t\right]\]
This is the true expected reward obtained by taking action \(\mathbf{a}_t\) in state \(\mathbf{s}_t\) and then following \(\pi\). Therefore, we modify our policy gradients to:

\[\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta \log \pi_\theta(\mathbf{a}_t^{(i)}\vert\mathbf{s}_t^{(i)})\left(Q^\pi(\mathbf{s}_t^{(i)}, \mathbf{a}_t^{(i)}) - V^\pi(\mathbf{s}_t^{(i)})\right)\]
We have also subtracted a baseline. Recall that the value function \(V^\pi(\mathbf{s}_t)\) is just:

\[V^\pi(\mathbf{s}_t) = \mathbb{E}_{\mathbf{a}_t\sim\pi_\theta(\mathbf{a}_t\vert\mathbf{s}_t)}\left[Q^\pi(\mathbf{s}_t, \mathbf{a}_t)\right]\]
i.e. it is just the expected value of the Q-function under the actions. Subtracting it has the intuition of assigning better-than-average actions a positive reward-to-go and worse-than-average ones a negative reward-to-go. Note that subtracting a state-dependent baseline is unbiased (as we showed in Homework 2).
We refer to \(Q^\pi(\mathbf{s}_t, \mathbf{a}_t)-V^\pi(\mathbf{s}_t)\) as the advantage and denote it by \(A^\pi(\mathbf{s}_t,\mathbf{a}_t)\). The better our estimate of \(A^\pi(\mathbf{s}_t,\mathbf{a}_t)\), the lower the variance of the policy gradient.
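To make the variance argument concrete, here is a small simulation (an illustrative toy, not from the lecture: the per-step reward of 1.0, the Gaussian noise, the horizon, and the sample counts are all made-up assumptions). It compares a single-sample reward-to-go against an average over many rollouts from the same \((\mathbf{s}_t, \mathbf{a}_t)\), which approximates \(Q^\pi(\mathbf{s}_t, \mathbf{a}_t)\):

```python
import numpy as np

rng = np.random.default_rng(0)

def reward_to_go(horizon=10, noise=1.0):
    """One rollout's reward-to-go from a fixed (s_t, a_t): a true mean reward of
    1.0 per step plus Gaussian noise, summed over the remaining horizon."""
    return np.sum(1.0 + noise * rng.standard_normal(horizon))

true_q = 10.0  # expected reward-to-go: 1.0 per step over 10 steps

# Single-sample estimates (what vanilla policy gradients use).
single = np.array([reward_to_go() for _ in range(1000)])
# Averaging many rollouts from the same (s_t, a_t) approximates Q^pi(s_t, a_t).
averaged = np.array([np.mean([reward_to_go() for _ in range(100)]) for _ in range(1000)])

print(f"single sample: mean={single.mean():.2f}, std={single.std():.2f}")
print(f"100 samples  : mean={averaged.mean():.2f}, std={averaged.std():.2f}")
```

Both estimators have a mean close to the true value of 10 (unbiased), but the single-sample estimate has a standard deviation roughly ten times larger.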
2. Value Function Fitting
Note that:
where the fifth line follows from the Markov assumption.
We can estimate this expectation using a single sample:

\[Q^\pi(\mathbf{s}_t, \mathbf{a}_t) \approx r(\mathbf{s}_t, \mathbf{a}_t) + V^\pi(\mathbf{s}_{t+1})\]
Therefore:

\[A^\pi(\mathbf{s}_t, \mathbf{a}_t) \approx r(\mathbf{s}_t, \mathbf{a}_t) + V^\pi(\mathbf{s}_{t+1}) - V^\pi(\mathbf{s}_t)\]
Hence, we only need to fit a function approximator to the value function in order to estimate the advantages.
In order to train a function approximator (say, a neural network) we need a training set:

\[\left\{\left(\mathbf{s}_t^{(i)},\; y_t^{(i)}\right)\right\}\]

where the label \(y_t^{(i)}\) should be an estimate of \(V^\pi(\mathbf{s}_t^{(i)})\).
One way to estimate \(V^\pi(\mathbf{s}_t)\) is through Monte Carlo evaluation (as in policy gradients):

\[V^\pi(\mathbf{s}_t) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t'=t}^T r(\mathbf{s}_{t'}^{(i)}, \mathbf{a}_{t'}^{(i)})\]
However, this requires us to reset our simulator to state \(\mathbf{s}_t\) multiple times, which may be expensive or even impossible. In practice, estimating \(V^\pi\) with a single sample works well:

\[V^\pi(\mathbf{s}_t) \approx \sum_{t'=t}^T r(\mathbf{s}_{t'}, \mathbf{a}_{t'})\]
Our training set is thus given by:

\[\left\{\left(\mathbf{s}_t^{(i)},\; y_t^{(i)}\right)\right\}, \qquad y_t^{(i)} = \sum_{t'=t}^T r(\mathbf{s}_{t'}^{(i)}, \mathbf{a}_{t'}^{(i)})\]
For supervised regression we would minimize:

\[\mathcal{L}(\phi) = \sum_i\left\vert\left\vert \hat{V}^{\pi}_{\phi}(\mathbf{s}^{(i)}) - y^{(i)}\right\vert\right\vert^2\]
where \(\hat{V}^{\pi}_{\phi}\) is the function approximator with parameters \(\phi\).
While this works fine in practice, we can do something even better:

\[y_t^{(i)} = r(\mathbf{s}_t^{(i)}, \mathbf{a}_t^{(i)}) + \hat{V}^{\pi}_{\phi}(\mathbf{s}_{t+1}^{(i)})\]
This estimate is often called the bootstrapped estimate. So we can modify our training set to be:

\[\left\{\left(\mathbf{s}_t^{(i)},\; r(\mathbf{s}_t^{(i)}, \mathbf{a}_t^{(i)}) + \hat{V}^{\pi}_{\phi}(\mathbf{s}_{t+1}^{(i)})\right)\right\}\]
i.e. we use our previous estimate of the value function to generate new labels. Since this estimate of the value function will (hopefully) be closer to the true value function than a single-sample estimate, it reduces variance, though at the expense of introducing some bias, because the estimate is, after all, not exact. We can now use this training set for supervised regression.
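As a concrete sketch of this regression step (a minimal PyTorch example; the network size, optimizer, learning rate, and the random tensors standing in for sampled transitions are all assumptions made for illustration), we build the bootstrapped targets \(y = r + \hat{V}^{\pi}_{\phi}(\mathbf{s}')\) with the current value network and minimize the squared error:

```python
import torch
import torch.nn as nn

# Toy batch of transitions (s, r, s'); shapes and values are placeholders.
obs_dim = 4
states = torch.randn(256, obs_dim)
rewards = torch.randn(256)
next_states = torch.randn(256, obs_dim)

# Value function approximator V_hat^pi_phi.
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

for _ in range(100):
    # Bootstrapped targets y = r + V_hat(s'); computed without gradients so
    # they are treated as fixed regression labels.
    with torch.no_grad():
        targets = rewards + value_net(next_states).squeeze(-1)
    values = value_net(states).squeeze(-1)
    loss = ((values - targets) ** 2).mean()  # supervised regression loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Recomputing the targets inside the loop means each regression step chases a slowly moving label, which is the bootstrapping described above.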
3. The Actor Critic Algorithm
The batch actor-critic algorithm repeats the following until convergence:
1. Sample \(\{\mathbf{s}^{(i)},\mathbf{a}^{(i)}\}\) from \(\pi(\mathbf{a}\vert\mathbf{s})\)
2. Fit \(\hat V^{\pi}_{\phi}\) to the sampled rewards
3. Evaluate \(A^\pi (\mathbf{s}_t,\mathbf{a}_t) \approx r(\mathbf{s}_t,\mathbf{a}_t) + \hat V^{\pi}_{\phi}(\mathbf{s}_{t+1}) - \hat V^{\pi}_{\phi}(\mathbf{s}_t)\)
4. Compute \(\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta \log \pi_\theta(\mathbf{a}_t^{(i)}\vert\mathbf{s}_t^{(i)})A^\pi (\mathbf{s}_t^{(i)},\mathbf{a}_t^{(i)})\)
5. Update \(\theta := \theta + \alpha\nabla_\theta J(\theta)\)
Step 2 boils down to two steps:
- Compute \(y^{(i)}_t=r(\mathbf{s}_t^{(i)},\mathbf{a}_t^{(i)})+\hat V^\pi_\phi(\mathbf{s}_{t+1}^{(i)})\) for all \(i\)
- Minimize \(\mathcal{L}(\phi) = \sum_i\vert\vert \hat{V}^{\pi}_{\phi}(\mathbf{s}^{(i)})-y^{(i)}\vert \vert^2\)
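Once the critic is fitted, steps 3-5 above can be written compactly in an autodiff framework by defining a surrogate loss whose gradient is the desired estimate. Below is a minimal PyTorch sketch (discrete actions, made-up dimensions, and random tensors standing in for the sampled batch; the network definitions are illustrative, not the lecture's):

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 3
policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))  # stands in for the fitted critic
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

# Random batch standing in for sampled (s_t, a_t, r_t, s_{t+1}) tuples.
states = torch.randn(256, obs_dim)
actions = torch.randint(n_actions, (256,))
rewards = torch.randn(256)
next_states = torch.randn(256, obs_dim)

# Step 3: A(s, a) ~= r(s, a) + V_hat(s') - V_hat(s).
with torch.no_grad():
    advantages = rewards + value_net(next_states).squeeze(-1) - value_net(states).squeeze(-1)

# Steps 4-5: minimizing -1/N sum log pi(a|s) * A takes one ascent step on J(theta).
log_probs = torch.distributions.Categorical(logits=policy_net(states)).log_prob(actions)
loss = -(log_probs * advantages).mean()
policy_opt.zero_grad()
loss.backward()
policy_opt.step()
```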
4. Discount Factors
Suppose we have an infinite horizon. If, for example, the rewards are all positive, then simply adding them up gives an infinitely large number. Discount factors remedy this. We modify our labels to be:

\[y_t^{(i)} = r(\mathbf{s}_t^{(i)}, \mathbf{a}_t^{(i)}) + \gamma\hat{V}^{\pi}_{\phi}(\mathbf{s}_{t+1}^{(i)})\]
where \(\gamma\in [0,1]\) is the discount factor. To compute the advantages we use:

\[A^\pi(\mathbf{s}_t, \mathbf{a}_t) \approx r(\mathbf{s}_t, \mathbf{a}_t) + \gamma\hat{V}^{\pi}_{\phi}(\mathbf{s}_{t+1}) - \hat{V}^{\pi}_{\phi}(\mathbf{s}_t)\]
For policy gradients we have two options. We can either use:

\[\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta \log \pi_\theta(\mathbf{a}_t^{(i)}\vert\mathbf{s}_t^{(i)})\left(\sum_{t'=t}^T \gamma^{t'-t} r(\mathbf{s}_{t'}^{(i)}, \mathbf{a}_{t'}^{(i)})\right)\]
or use:

\[\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(\mathbf{a}_t^{(i)}\vert\mathbf{s}_t^{(i)})\right)\left(\sum_{t=1}^T \gamma^{t-1} r(\mathbf{s}_{t}^{(i)}, \mathbf{a}_{t}^{(i)})\right)\]
which, after noting causality, gives:

\[\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \gamma^{t-1}\nabla_\theta \log \pi_\theta(\mathbf{a}_t^{(i)}\vert\mathbf{s}_t^{(i)})\left(\sum_{t'=t}^T \gamma^{t'-t} r(\mathbf{s}_{t'}^{(i)}, \mathbf{a}_{t'}^{(i)})\right)\]
The second option has the intuition (because of the \(\gamma^{t-1}\) term) that later steps matter less than earlier steps. However, the first option is the one used in practice. Discounts can also serve as a variance-reduction technique: they reduce the weight of rewards further in the future, which are usually more uncertain than those closer to the present.
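For the first option, the discounted reward-to-go \(\sum_{t'=t}^T \gamma^{t'-t} r(\mathbf{s}_{t'}, \mathbf{a}_{t'})\) for every \(t\) can be computed in a single backward pass over the trajectory. A small sketch (the function name and the example rewards are arbitrary):

```python
import numpy as np

def discounted_reward_to_go(rewards, gamma=0.99):
    """rtg[t] = sum_{t' >= t} gamma**(t' - t) * rewards[t']."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

print(discounted_reward_to_go(np.array([1.0, 0.0, 2.0]), gamma=0.5))
# [1.5, 1.0, 2.0]: e.g. the first entry is 1.0 + 0.5 * 0.0 + 0.25 * 2.0.
```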
With discounts in hand, we can also write an online version of the actor-critic algorithm:
- Take action \(\mathbf{a}\sim \pi(\mathbf{a}\vert\mathbf{s})\) and get \((\mathbf{s},\mathbf{a},\mathbf{s'},r)\)
- Update \(\hat V^{\pi}_{\phi}\) using target \(r+\gamma\hat V^{\pi}_{\phi}(\mathbf{s'})\)
- Evaluate \(A^\pi (\mathbf{s},\mathbf{a}) \approx r(\mathbf{s},\mathbf{a}) + \gamma\hat V^{\pi}_{\phi}(\mathbf{s'}) - \hat V^{\pi}_{\phi}(\mathbf{s})\)
- Compute \(\nabla_\theta J(\theta) \approx \nabla_\theta \log \pi_\theta(\mathbf{a}\vert\mathbf{s})A^\pi (\mathbf{s},\mathbf{a})\)
- Update \(\theta := \theta + \alpha\nabla_\theta J(\theta)\)
The online actor-critic algorithm, however, works best in batches. This can be done in two ways: synchronous and asynchronous parallel actor-critic mechanisms. We’ll discuss them in a later lecture note.
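A minimal sketch of one online step (PyTorch, discrete actions; the networks, learning rates, discount, and the dummy transition at the end are illustrative assumptions, not the lecture's implementation):

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 3, 0.99
policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def online_actor_critic_step(s, a, r, s_next):
    # Update V_hat toward the bootstrapped target r + gamma * V_hat(s').
    with torch.no_grad():
        target = r + gamma * value_net(s_next)
    v_loss = (value_net(s) - target).pow(2).mean()
    value_opt.zero_grad()
    v_loss.backward()
    value_opt.step()

    # Advantage A(s, a) ~= r + gamma * V_hat(s') - V_hat(s), then one policy step.
    with torch.no_grad():
        advantage = r + gamma * value_net(s_next) - value_net(s)
    log_prob = torch.distributions.Categorical(logits=policy_net(s)).log_prob(a)
    p_loss = -(log_prob * advantage).mean()
    policy_opt.zero_grad()
    p_loss.backward()
    policy_opt.step()

# Dummy transition (s, a, s', r) just to show the call.
online_actor_critic_step(torch.randn(1, obs_dim), torch.tensor([1]), torch.tensor(0.5), torch.randn(1, obs_dim))
```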
5. Architecture Design
For the Actor-Critic algorithm, we need to fit two neural networks: one to the value function and one to the policy. We can opt for a two-network design in which the two networks do not share weights. Such an architecture is simple and stable, but has the disadvantage that the actor and the critic cannot share features with each other. A shared network design allows this, but is much harder to optimize.
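As an illustration of the two designs (a hypothetical PyTorch sketch; the layer sizes are arbitrary), the two-network design keeps the actor and critic completely separate, while the shared design uses one trunk with a policy head and a value head:

```python
import torch.nn as nn

obs_dim, n_actions = 4, 3

# Two-network design: actor and critic share nothing (simple and stable).
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

# Shared design: features are shared between the heads, but the policy and
# value losses now interact through the trunk, which makes optimization harder.
class SharedActorCritic(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.policy_head = nn.Linear(64, n_actions)
        self.value_head = nn.Linear(64, 1)

    def forward(self, obs):
        features = self.trunk(obs)
        return self.policy_head(features), self.value_head(features)
```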
6. Critics as State-Dependent Baseline
Note that the policy gradient estimate was unbiased but had high variance. In contrast, the actor-critic estimate has lower variance but is no longer unbiased. The lower variance is primarily because of the critic. Suppose that we use the critic only as a state-dependent baseline:

\[\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta \log \pi_\theta(\mathbf{a}_t^{(i)}\vert\mathbf{s}_t^{(i)})\left(\sum_{t'=t}^T \gamma^{t'-t} r(\mathbf{s}_{t'}^{(i)}, \mathbf{a}_{t'}^{(i)}) - \hat{V}^{\pi}_{\phi}(\mathbf{s}_t^{(i)})\right)\]
Recall that using a state-dependent baseline is unbiased. So in this way, not only is our gradient unbiased but it also has lower variance.
7. Control Variates: Action-Dependent Baselines
We may also use baselines that depend on the actions (such as the Q-function):

\[A^\pi(\mathbf{s}_t, \mathbf{a}_t) = \sum_{t'=t}^T \gamma^{t'-t} r(\mathbf{s}_{t'}, \mathbf{a}_{t'}) - \hat{Q}^{\pi}_{\phi}(\mathbf{s}_t, \mathbf{a}_t)\]
Note that if the critic \(\hat Q^\pi_\phi\) were exactly equal to the true Q-function, the advantage would be zero in expectation (as the Q-function is the expected sum of rewards). In practice this is not achievable, and the advantage will instead be a small, albeit non-zero, number. Because smaller numbers have smaller variance, this estimate of the advantage results in a lower variance for the policy gradients. Unfortunately, in general:

\[\mathbb{E}_{\mathbf{a}_t\sim\pi_\theta(\mathbf{a}_t\vert\mathbf{s}_t)}\left[\hat{Q}^{\pi}_{\phi}(\mathbf{s}_t, \mathbf{a}_t)\right] \neq 0\]
which means that the policy gradients are no longer unbiased. One way to make them unbiased again is to add back this expectation:

\[\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta \log \pi_\theta(\mathbf{a}_t^{(i)}\vert\mathbf{s}_t^{(i)})\left(\sum_{t'=t}^T \gamma^{t'-t} r(\mathbf{s}_{t'}^{(i)}, \mathbf{a}_{t'}^{(i)}) - \hat{Q}^{\pi}_{\phi}(\mathbf{s}_t^{(i)}, \mathbf{a}_t^{(i)})\right) + \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta\, \mathbb{E}_{\mathbf{a}\sim\pi_\theta(\mathbf{a}\vert\mathbf{s}_t^{(i)})}\left[\hat{Q}^{\pi}_{\phi}(\mathbf{s}_t^{(i)}, \mathbf{a})\right]\]
If we choose our function approximator carefully (for example, quadratic in the actions under a Gaussian policy), the second term above has a closed-form solution. So, in this way, we have been able to reduce the variance of the first term by using an action-dependent baseline, and have corrected for the resulting bias through the second term.
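To see why a closed form is plausible (this is a standard Gaussian moment identity used here purely as an illustration; the quadratic parameterization below is an assumption, not the lecture's), suppose \(\hat Q^\pi_\phi(\mathbf{s},\mathbf{a}) = \mathbf{a}^\top A(\mathbf{s})\,\mathbf{a} + \mathbf{b}(\mathbf{s})^\top\mathbf{a} + c(\mathbf{s})\) and \(\pi_\theta(\mathbf{a}\vert\mathbf{s}) = \mathcal{N}\left(\mu_\theta(\mathbf{s}), \Sigma_\theta(\mathbf{s})\right)\). Then

\[\mathbb{E}_{\mathbf{a}\sim\pi_\theta(\mathbf{a}\vert\mathbf{s})}\left[\hat Q^\pi_\phi(\mathbf{s},\mathbf{a})\right] = \operatorname{tr}\left(A(\mathbf{s})\,\Sigma_\theta(\mathbf{s})\right) + \mu_\theta(\mathbf{s})^\top A(\mathbf{s})\,\mu_\theta(\mathbf{s}) + \mathbf{b}(\mathbf{s})^\top\mu_\theta(\mathbf{s}) + c(\mathbf{s})\]

which is an explicit, differentiable function of the policy parameters \(\theta\), so the gradient of the second term can be computed exactly.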
8. Eligibility Traces & N-Step Returns
For the infinite-horizon case, the actor-critic computes advantages using:

\[A_C^\pi(\mathbf{s}_t, \mathbf{a}_t) = r(\mathbf{s}_t, \mathbf{a}_t) + \gamma\hat{V}^{\pi}_{\phi}(\mathbf{s}_{t+1}) - \hat{V}^{\pi}_{\phi}(\mathbf{s}_t)\]
while Monte Carlo (policy gradients) uses:

\[A_{MC}^\pi(\mathbf{s}_t, \mathbf{a}_t) = \sum_{t'=t}^\infty \gamma^{t'-t} r(\mathbf{s}_{t'}, \mathbf{a}_{t'}) - \hat{V}^{\pi}_{\phi}(\mathbf{s}_t)\]
\(A_C^\pi\) has lower variance but is biased, while \(A^\pi_{MC}\) has higher variance but is unbiased. One way of controlling this bias-variance tradeoff is to use the \(n\)-step return:

\[A_n^\pi(\mathbf{s}_t, \mathbf{a}_t) = \sum_{t'=t}^{t+n-1} \gamma^{t'-t} r(\mathbf{s}_{t'}, \mathbf{a}_{t'}) + \gamma^n\hat{V}^{\pi}_{\phi}(\mathbf{s}_{t+n}) - \hat{V}^{\pi}_{\phi}(\mathbf{s}_t)\]
Here we sum up the rewards for the first \(n\) steps and then use our function approximator for the remaining steps. The intuition is that time steps further in the future are more uncertain (i.e. have higher variance) than nearer ones, so for those we use a lower-variance, though slightly biased, estimator (\(\hat V^\pi_\phi\)). For time steps in the near future, variance is not as big an issue, so it is better to use the unbiased estimator (the sum of rewards). In practice, choosing \(n>1\) works better.
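A small sketch of the \(n\)-step estimator for a single trajectory (NumPy; the function name, the terminal handling, and the toy numbers are assumptions for illustration):

```python
import numpy as np

def n_step_advantage(rewards, values, t, n, gamma=0.99):
    """sum_{t'=t}^{t+n-1} gamma^(t'-t) r_t'  +  gamma^n V_hat(s_{t+n})  -  V_hat(s_t).

    rewards[t] is r(s_t, a_t) and values[t] is V_hat(s_t); if fewer than n steps
    remain, the remaining rewards are used without bootstrapping.
    """
    T = len(rewards)
    end = min(t + n, T)
    discounts = gamma ** np.arange(end - t)
    n_step_return = np.sum(discounts * rewards[t:end])
    if t + n < T:
        n_step_return += gamma ** n * values[t + n]
    return n_step_return - values[t]

rewards = np.array([1.0, 1.0, 1.0, 1.0])
values = np.array([3.0, 2.5, 2.0, 1.0])  # stand-in for V_hat(s_t)
print(n_step_advantage(rewards, values, t=0, n=2, gamma=1.0))  # 1 + 1 + 2.0 - 3.0 = 1.0
```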
9. Generalized Advantage Estimation
We may use more than one value of \(n\) in the \(n\)-step returns above and weight the resulting advantages:

\[A_{GAE}^\pi(\mathbf{s}_t, \mathbf{a}_t) = \sum_{n=1}^\infty w_n A_n^\pi(\mathbf{s}_t, \mathbf{a}_t)\]
The weights can be chosen such that:

\[w_n \propto \lambda^{n-1}\]
where \(\lambda \in [0,1]\), i.e. \(n\)-step returns with smaller values of \(n\) are preferred. It can easily be shown [2] that:

\[A_{GAE}^\pi(\mathbf{s}_t, \mathbf{a}_t) = \sum_{t'=t}^\infty (\gamma\lambda)^{t'-t}\,\delta_{t'}\]
where:

\[\delta_{t'} = r(\mathbf{s}_{t'}, \mathbf{a}_{t'}) + \gamma\hat{V}^{\pi}_{\phi}(\mathbf{s}_{t'+1}) - \hat{V}^{\pi}_{\phi}(\mathbf{s}_{t'})\]
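Because \(A_{GAE}^\pi(\mathbf{s}_t,\mathbf{a}_t) = \delta_t + \gamma\lambda\, A_{GAE}^\pi(\mathbf{s}_{t+1},\mathbf{a}_{t+1})\), the whole estimate can be computed with one backward pass over the TD errors. A small sketch (NumPy, single finite trajectory, with the value after the final step assumed to be zero; the toy inputs are arbitrary):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory.

    rewards[t] = r(s_t, a_t), values[t] = V_hat(s_t); the value after the last
    step is treated as zero (i.e. the trajectory is assumed to end there).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    next_value = 0.0
    next_advantage = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]    # TD error delta_t
        next_advantage = delta + gamma * lam * next_advantage  # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages

print(gae(np.array([1.0, 1.0, 1.0]), np.array([2.0, 1.5, 1.0])))
```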
1. Consider a random variable \(y\) such that \(\mathbb{E}[y]=\mu\). Suppose that we estimate \(\mu\) using \(N\) samples: \(\bar{\mu} = \frac{1}{N}\sum_{i=1}^N y^{(i)}\). Then the expected value of \(\bar{\mu}\) is \(\mathbb{E}[\bar\mu] = \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^N y^{(i)}\right] = \frac{1}{N}\sum_{i=1}^N \mathbb{E}\left[y^{(i)}\right] = \frac{1}{N}\sum_{i=1}^N \mu = \mu\). Hence, estimating the expectation of \(y\) with any number of samples gives the right answer in expectation and so is unbiased. However, the individual samples \(y^{(i)}\) might be far away from \(\mu\) (and from each other), which leads to high variance. Of course, as \(N\) increases the estimate \(\bar{\mu}\) becomes better and better. One intuition is that samples to the left and right of \(\mu\) start to cancel out each other’s effect, bringing the result closer to \(\mu\).
2. See the original paper for this derivation: Schulman, Moritz, Levine, Jordan, Abbeel (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation (a batch-mode actor-critic with blended Monte Carlo and function-approximator returns). ICLR 2016.