Lecture 11
In the last set of lecture notes we assumed that we knew \(\mathbf{s}_{t+1}=f(\mathbf{s}_t,\mathbf{a}_t)\) (or \(p(\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t)\) in the stochastic case) and used it to find optimal trajectories. We will now discuss some ways in which we can learn \(f\) (or \(p\)). We begin with a very naïve way of accomplishing this task.
1. Model-Based RL v0.5
The simplest way to learn the dynamics is to run a base (e.g. random) policy in the environment, collect transitions, and fit a neural network to the resulting data.
- Run base policy \(\pi_0(\mathbf{a}_t\vert\mathbf{s}_t)\) to collect \(\mathcal{D} = \{(\mathbf{s}_t,\mathbf{a}_t,\mathbf{s}_{t+1})^{(i)}\}\).
- Train a neural network \(f'\) to minimize \(\sum_i \vert\vert f'(\mathbf{s}^{(i)}_t,\mathbf{a}^{(i)}_t)-\mathbf{s}^{(i)}_{t+1} \vert\vert^2\).
- Plan through \(f'(\mathbf{s}_t,\mathbf{a}_t)\) to choose actions (e.g. using LQR).
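To make this concrete, here is a minimal sketch of v0.5 in Python/PyTorch. The environment interface (`env.reset()`, `env.step()`), network size, and training hyperparameters are assumptions for illustration, and the planning step is left abstract.

```python
import torch
import torch.nn as nn

def collect_random_transitions(env, num_steps, action_dim):
    # Step 1: run a random base policy pi_0 and record (s_t, a_t, s_{t+1}) tuples.
    data, s = [], env.reset()
    for _ in range(num_steps):
        a = torch.randn(action_dim)                      # random action
        s_next, _, done, _ = env.step(a.numpy())         # hypothetical gym-style env
        data.append((torch.as_tensor(s, dtype=torch.float32), a,
                     torch.as_tensor(s_next, dtype=torch.float32)))
        s = env.reset() if done else s_next
    return data

def fit_dynamics(data, state_dim, action_dim, epochs=200, lr=1e-3):
    # Step 2: fit f'(s_t, a_t) -> s_{t+1} by minimizing the squared prediction error.
    model = nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
                          nn.Linear(128, state_dim))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    s, a, s_next = (torch.stack(x) for x in zip(*data))
    for _ in range(epochs):
        loss = ((model(torch.cat([s, a], dim=-1)) - s_next) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return model

# Step 3 (not shown): plan through the learned model, e.g. with LQR or a shooting method.
```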
This method can often work well if the base policy is sufficiently exploratory. It is particularly effective when one can hand-engineer a dynamics representation for the system using knowledge of physics and then only needs to fit a few unknown parameters.
In general, however, this technique does not work, primarily because of a distribution mismatch problem. Let \(\pi_{f'}\) denote the policy with which we choose actions in step 3. This policy is, obviously, different from the base policy (which may just be a random policy). Recall that in step 3, given an initial state, we take a certain action, which brings us to another state. However, this new state is sampled from the distribution \(p_{\pi_{f'}}\), whereas the training set for \(f'\) came from a different distribution \(p_{\pi_0}\). Therefore, we cannot make any guarantees about the accuracy of \(f'\) on such samples. Note that this is essentially the same problem that we had with the naïve version of imitation learning.
2. Model-Based RL v1.0
We can try to make \(p_{\pi_0} = p_{\pi_f}\) as we did with DAgger.
- Run base policy \(\pi_0(\mathbf{a}_t\vert\mathbf{s}_t)\) to collect \(\mathcal{D} = \{(\mathbf{s}_t,\mathbf{a}_t,\mathbf{s}_{t+1})^{(i)}\}\).
- Repeat:
  - Train a neural network \(f'\) to minimize \(\sum_i \vert\vert f'(\mathbf{s}^{(i)}_t,\mathbf{a}^{(i)}_t)-\mathbf{s}^{(i)}_{t+1} \vert\vert^2\).
  - Plan through \(f'(\mathbf{s}_t,\mathbf{a}_t)\) to choose actions (e.g. using LQR).
  - Execute these actions and add the resulting data \(\{(\mathbf{s}_t,\mathbf{a}_t,\mathbf{s}_{t+1})^{(j)}\}\) to \(\mathcal{D}\).
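A rough sketch of this loop, reusing the (hypothetical) `collect_random_transitions` and `fit_dynamics` helpers from the previous sketch; `plan_actions` stands in for whatever planner (e.g. LQR through \(f'\)) is used.

```python
import torch

def model_based_rl_v1(env, plan_actions, state_dim, action_dim,
                      num_iters=10, rollout_len=100):
    # Seed the dataset with the random base policy, then alternate fit/plan/collect.
    data = collect_random_transitions(env, rollout_len, action_dim)
    model = None
    for _ in range(num_iters):
        model = fit_dynamics(data, state_dim, action_dim)       # refit f' on all data
        s = torch.as_tensor(env.reset(), dtype=torch.float32)
        for a in plan_actions(model, s, horizon=rollout_len):   # plan through f'
            s_next, _, done, _ = env.step(a.numpy())            # execute the whole plan
            s_next = torch.as_tensor(s_next, dtype=torch.float32)
            data.append((s, a, s_next))                         # aggregate into D
            if done:
                break
            s = s_next
    return model
```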
3. Model-Based RL v1.5
Suppose that we train an algorithm for a self-driving car using model-based RL. Given an initial state, the algorithm is asked to plan its actions for the next \(T\) time steps. Suppose that the car is actually going straight but the algorithm thinks that it is drifting a little to the left. To correct for that, the plan turns the steering to the right; obviously, after a few time steps the car will crash into the roadside pavement. One way to solve this problem is to have the algorithm re-plan at every time step. In that case, even if the algorithm initially turns the steering to the right, it will soon realize that the car will ultimately crash into the roadside pavement and, as a corrective measure, turn the steering back to its initial position.
- Run base policy \(\pi_0(\mathbf{a}_t\vert\mathbf{s}_t)\) to collect \(\mathcal{D} = \{(\mathbf{s}_t,\mathbf{a}_t,\mathbf{s}_{t+1})^{(i)}\}\).
- Train a neural network \(f'\) to minimize \(\sum_i \vert\vert f'(\mathbf{s}^{(i)}_t,\mathbf{a}^{(i)}_t)-\mathbf{s}^{(i)}_{t+1} \vert\vert^2\).
- Repeat:
  - Plan through \(f'(\mathbf{s}_t,\mathbf{a}_t)\) to choose actions (e.g. using LQR).
  - Execute only the first action and observe the resulting state \(\mathbf{s}_{t+1}\).
  - Add \((\mathbf{s}_{t},\mathbf{a}_{t},\mathbf{s}_{t+1})\) to \(\mathcal{D}\).
  - Every \(N\) steps, retrain \(f'\).
Note that the more often we re-plan, the less perfect each individual plan needs to be. Also, since we only use the first action of each plan, we can plan over shorter horizons. However, re-planning at every time step is obviously more computationally expensive.
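A sketch of this re-planning (MPC-style) loop under the same assumed interfaces; only the first action of each plan is executed, and \(f'\) is retrained every `retrain_every` steps.

```python
import torch

def model_based_rl_v15(env, plan_actions, state_dim, action_dim,
                       total_steps=1000, horizon=15, retrain_every=50):
    data = collect_random_transitions(env, 100, action_dim)
    model = fit_dynamics(data, state_dim, action_dim)
    s = torch.as_tensor(env.reset(), dtype=torch.float32)
    for step in range(total_steps):
        plan = plan_actions(model, s, horizon=horizon)   # re-plan from the current state
        a = plan[0]                                      # execute only the first action
        s_next, _, done, _ = env.step(a.numpy())
        s_next = torch.as_tensor(s_next, dtype=torch.float32)
        data.append((s, a, s_next))
        s = torch.as_tensor(env.reset(), dtype=torch.float32) if done else s_next
        if (step + 1) % retrain_every == 0:              # retrain f' every N steps
            model = fit_dynamics(data, state_dim, action_dim)
    return model
```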
4. Model-Based RL v2.0
One thing that we can do is to use \(f'\) to directly train a policy \(\pi_\theta\) by backpropagating gradients through \(f'\).
- Run base policy \(\pi_0(\mathbf{a}_t\vert\mathbf{s}_t)\) to collect \(\mathcal{D} = \{(\mathbf{s}_t,\mathbf{a}_t,\mathbf{s}_{t+1})^{(i)}\}\).
- Repeat:
  - Train a neural network \(f'\) to minimize \(\sum_i \vert\vert f'(\mathbf{s}^{(i)}_t,\mathbf{a}^{(i)}_t)-\mathbf{s}^{(i)}_{t+1} \vert\vert^2\).
  - Backpropagate through \(f'(\mathbf{s}_t,\mathbf{a}_t)\) into the policy to maximize the reward obtained from running \(\pi_\theta(\mathbf{a}_t\vert\mathbf{s}_t)\).
  - Run \(\pi_\theta(\mathbf{a}_t\vert\mathbf{s}_t)\) and add the collected data to \(\mathcal{D}\).
This is similar to policy gradients (in the sense that we are taking derivatives of the reward obtained from running the policy) except that we now also backpropagate through (and learn) the dynamics model \(f'(\mathbf{s}_t,\mathbf{a}_t)\).
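The inner policy-improvement step might look like the following sketch, where `reward_fn` is an assumed differentiable reward and `policy` is any \(\pi_\theta\) network; the key point is that gradients flow through the learned model \(f'\) when unrolling it.

```python
import torch

def train_policy_through_model(model, policy, reward_fn, init_states,
                               horizon=20, lr=1e-3, iters=200):
    # Freeze the learned dynamics f'; only the policy parameters are updated here.
    for param in model.parameters():
        param.requires_grad_(False)
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(iters):
        s, total_reward = init_states, 0.0
        for _ in range(horizon):
            a = policy(s)
            total_reward = total_reward + reward_fn(s, a).mean()
            s = model(torch.cat([s, a], dim=-1))    # gradients flow through f'
        loss = -total_reward                        # maximize reward
        opt.zero_grad(); loss.backward(); opt.step()
    return policy
```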
5. Local Models
Up till now, we have talked about fitting a global model to the system dynamics. The trouble with such models is that if they are not accurate everywhere, the policy will seek out the regions where the model is erroneously optimistic. To prevent this, we would need a very good model over most of the state space, which may not always be possible because the dynamics may be too complicated.
Recall that for iterative LQR we really do not need \(f\) but rather its derivatives \(\nabla_{\mathbf{x}_t}f\) and \(\nabla_{\mathbf{u}_t}f\) around some trajectory. The idea behind local models is to approximate these derivatives around the trajectories obtained from running the current policy.
Let us assume that our system dynamics are almost deterministic with a small amount of Gaussian noise, i.e. \(p(\mathbf{x}_{t+1}\vert \mathbf{x}_{t},\mathbf{u}_{t}) = \mathcal{N}(f(\mathbf{x}_{t},\mathbf{u}_{t}),\Sigma)\). Given a set of trajectories obtained by running our policy multiple times, we can fit a time-varying linear model to \(f\) (e.g. via linear regression):
\[ f(\mathbf{x}_{t},\mathbf{u}_{t}) \approx \mathbf{A}_t\mathbf{x}_t + \mathbf{B}_t\mathbf{u}_t + \mathbf{c}_t \]
Note that:
\[ \mathbf{A}_t \approx \nabla_{\mathbf{x}_t}f, \qquad \mathbf{B}_t \approx \nabla_{\mathbf{u}_t}f \]
We can then use \(\mathbf{A}_t\) and \(\mathbf{B}_t\) to run, for example, iterative LQR.
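As a concrete illustration, a least-squares fit of time-varying \(\mathbf{A}_t\), \(\mathbf{B}_t\) (and an offset \(\mathbf{c}_t\)) from a batch of rollouts might look like this sketch; the array shapes are assumptions.

```python
import numpy as np

def fit_local_linear_dynamics(X, U):
    """Fit x_{t+1} ~ A_t x_t + B_t u_t + c_t separately at each time step.

    X: states of shape (num_rollouts, T + 1, state_dim)
    U: actions of shape (num_rollouts, T, action_dim)
    Needs enough rollouts (num_rollouts >= state_dim + action_dim + 1) per time step.
    """
    N, _, dx = X.shape
    du = U.shape[-1]
    A, B, c = [], [], []
    for t in range(U.shape[1]):
        Z = np.concatenate([X[:, t], U[:, t], np.ones((N, 1))], axis=1)  # regressors
        Y = X[:, t + 1]                                                  # targets
        W, *_ = np.linalg.lstsq(Z, Y, rcond=None)                        # (dx+du+1, dx)
        A.append(W[:dx].T)            # A_t, an estimate of grad_x f
        B.append(W[dx:dx + du].T)     # B_t, an estimate of grad_u f
        c.append(W[-1])               # constant offset
    return A, B, c
```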
The following algorithm summarizes the above discussion:
Repeat until convergence:
- Run \(p(\mathbf{u}_t\vert\mathbf{x}_t)\) in the environment to collect trajectories.
- Fit the dynamics \(p(\mathbf{x}_{t+1}\vert\mathbf{x}_t,\mathbf{u}_t)\) using the collected trajectories.
- Run iLQR using the new dynamics to obtain a better \(p(\mathbf{u}_t\vert\mathbf{x}_t)\).
There are two main issues with this algorithm, which we discuss in the next two sections.
6. Which Policy to Use?
Recall that the iLQR algorithm chooses actions according to \(\mathbf{u}_t = \mathbf{K}_t(\mathbf{x}_t-\hat{\mathbf{x}}_t)+\mathbf{k}_t+\hat{\mathbf{u}}_t\). Note that \(\mathbf{u}_t\) is deterministic. Moreover, we assumed earlier that the system dynamics are approximately deterministic. This means that when we run our (deterministic) policy, all of our trajectories will be very close to each other. Consider two scenarios: in one, the sampled points are spread out over the state space; in the other, they are tightly clustered.
Clearly, it is much easier to fit a linear model when the points are sufficiently spread out. If the points are all too close together, many different, equally good linear models can be constructed. A deterministic policy has exactly this effect. One way to counter this is to add some Gaussian noise to the policy, i.e.:
\[ p(\mathbf{u}_t\vert\mathbf{x}_t) = \mathcal{N}\left(\mathbf{K}_t(\mathbf{x}_t-\hat{\mathbf{x}}_t)+\mathbf{k}_t+\hat{\mathbf{u}}_t,\ \Sigma_t\right) \]
One neat trick is to set \(\Sigma_t\) equal to \(\mathbf{Q}_{\mathbf{u}_t,\mathbf{u}_t}^{-1}\). Recall that in the LQR algorithm the quadratic
\[ Q(\mathbf{x}_t,\mathbf{u}_t) = \text{const} + \frac{1}{2}\begin{bmatrix}\mathbf{x}_t\\\mathbf{u}_t\end{bmatrix}^T \mathbf{Q}_t \begin{bmatrix}\mathbf{x}_t\\\mathbf{u}_t\end{bmatrix} + \begin{bmatrix}\mathbf{x}_t\\\mathbf{u}_t\end{bmatrix}^T \mathbf{q}_t \]
is equal to \(\sum_{t'=t}^T c(\mathbf{x}_{t'},\mathbf{u}_{t'})\), i.e. it is the total cost-to-go from time step \(t\). Now note that \(\mathbf{Q}_{\mathbf{u}_t,\mathbf{u}_t}\) is the block of \(\mathbf{Q}_t\) that multiplies \(\mathbf{u}_t\) with itself. Therefore, if \(\mathbf{Q}_{\mathbf{u}_t,\mathbf{u}_t}\) is large, then changing \(\mathbf{u}_t\) (by adding noise) results in a large change in the cost-to-go. Setting \(\Sigma_t\) to be the inverse of \(\mathbf{Q}_{\mathbf{u}_t,\mathbf{u}_t}\) thus means that the policy only acts randomly (via the added noise) when doing so barely affects the cost-to-go. Interestingly, it can be shown that while the LQR algorithm minimizes
\[ \sum_{t=1}^{T} c(\mathbf{x}_t,\mathbf{u}_t), \]
this Linear-Gaussian version (where we add Gaussian noise to the policy) actually minimizes
\[ \sum_{t=1}^{T} \mathbb{E}_{(\mathbf{x}_t,\mathbf{u}_t)\sim p(\mathbf{x}_t,\mathbf{u}_t)}\left[c(\mathbf{x}_t,\mathbf{u}_t) - \mathcal{H}\left(p(\mathbf{u}_t\vert\mathbf{x}_t)\right)\right], \]
where \(\mathcal{H}\) is the entropy. In other words, the Linear-Gaussian version tries to find the most random policy that minimizes the total cost. We will make use of this shortly.
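For illustration, sampling an action from the noisy (linear-Gaussian) controller with \(\Sigma_t = \mathbf{Q}_{\mathbf{u}_t,\mathbf{u}_t}^{-1}\) could look like the following sketch; the gains and \(\mathbf{Q}_{\mathbf{u}_t,\mathbf{u}_t}\) are assumed to come from an iLQR backward pass.

```python
import numpy as np

def sample_action(x, x_hat, u_hat, K, k, Q_uu, rng):
    # Mean is the deterministic iLQR action; covariance Q_uu^{-1} makes the
    # exploration noise largest in directions that barely change the cost-to-go.
    mean = K @ (x - x_hat) + k + u_hat
    cov = np.linalg.inv(Q_uu)
    return rng.multivariate_normal(mean, cov)

rng = np.random.default_rng(0)
# u_t = sample_action(x_t, x_hat_t, u_hat_t, K_t, k_t, Q_uu_t, rng)
```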
7. How to Fit the Dynamics?
Instead of using linear regression to fit the dynamics, as we previously discussed, it is sometimes better to use Bayesian linear regression. In this case, a global model for the dynamics is used as a prior. However, in the following discussion, we will assume a simple linear regression model.
The issue with local models is that they are only accurate locally. However, the LQR algorithm is not aware of this and may therefore take actions that lead to states outside of the region where the model is accurate. We thus need to force LQR to remain within this locally accurate region. Note that the dynamics fed to LQR at a given iteration were learned from the trajectories produced by LQR at the previous iteration. We therefore need to constrain the trajectory distribution that LQR produces at the current iteration to be close to the one it produced at the previous iteration: if the trajectory distributions are close, then the local dynamics will be close too.
Let \(p(\tau)\) and \(\bar{p}(\tau)\) denote the distributions of the trajectories at the current and previous iterations respectively. By “close” we mean that:
\[ D_{KL}\left(p(\tau)\,\vert\vert\,\bar{p}(\tau)\right) \le \epsilon \]
Recall that:
\[ p(\tau) = p(\mathbf{x}_1)\prod_{t=1}^{T} p(\mathbf{u}_t\vert\mathbf{x}_t)\,p(\mathbf{x}_{t+1}\vert\mathbf{x}_t,\mathbf{u}_t) \]
Note that:
\[ \bar{p}(\tau) = p(\mathbf{x}_1)\prod_{t=1}^{T} \bar{p}(\mathbf{u}_t\vert\mathbf{x}_t)\,p(\mathbf{x}_{t+1}\vert\mathbf{x}_t,\mathbf{u}_t) \]
because the dynamics and the initial conditions are the same at all iterations.
Therefore:
\[ \log\frac{p(\tau)}{\bar{p}(\tau)} = \sum_{t=1}^{T}\log\frac{p(\mathbf{u}_t\vert\mathbf{x}_t)}{\bar{p}(\mathbf{u}_t\vert\mathbf{x}_t)} \]
We thus have:
\[ D_{KL}\left(p(\tau)\,\vert\vert\,\bar{p}(\tau)\right) = \mathbb{E}_{\tau\sim p(\tau)}\left[\log\frac{p(\tau)}{\bar{p}(\tau)}\right] = \sum_{t=1}^{T}\mathbb{E}_{(\mathbf{x}_t,\mathbf{u}_t)\sim p(\mathbf{x}_t,\mathbf{u}_t)}\left[\log p(\mathbf{u}_t\vert\mathbf{x}_t) - \log\bar{p}(\mathbf{u}_t\vert\mathbf{x}_t)\right] \]
Note that \(\mathbb{E}_{\mathbf{u}_t\sim p(\mathbf{u}_t\vert\mathbf{x}_t)}\left[\log p(\mathbf{u}_{t}\vert\mathbf{x}_t)\right]\) is equal to the negative of the entropy of \(p(\mathbf{u}_{t}\vert\mathbf{x}_t)\). Therefore:
\[ D_{KL}\left(p(\tau)\,\vert\vert\,\bar{p}(\tau)\right) = \sum_{t=1}^{T}\mathbb{E}_{(\mathbf{x}_t,\mathbf{u}_t)\sim p(\mathbf{x}_t,\mathbf{u}_t)}\left[-\log\bar{p}(\mathbf{u}_t\vert\mathbf{x}_t)\right] - \mathbb{E}_{\mathbf{x}_t\sim p(\mathbf{x}_t)}\left[\mathcal{H}\left(p(\mathbf{u}_t\vert\mathbf{x}_t)\right)\right] \]
where the last step follows from the fact that:
\[ \mathbb{E}_{(\mathbf{x}_t,\mathbf{u}_t)\sim p(\mathbf{x}_t,\mathbf{u}_t)}\left[\log p(\mathbf{u}_t\vert\mathbf{x}_t)\right] = \mathbb{E}_{\mathbf{x}_t\sim p(\mathbf{x}_t)}\left[\mathbb{E}_{\mathbf{u}_t\sim p(\mathbf{u}_t\vert\mathbf{x}_t)}\left[\log p(\mathbf{u}_t\vert\mathbf{x}_t)\right]\right] = -\mathbb{E}_{\mathbf{x}_t\sim p(\mathbf{x}_t)}\left[\mathcal{H}\left(p(\mathbf{u}_t\vert\mathbf{x}_t)\right)\right] \]
Note that \(\mathcal{H}\left(p(\mathbf{u}_t\vert\mathbf{x}_t)\right)\) is a function of \(\mathbf{x}_t\) only.
Before moving on, let us take a digression and talk about dual gradient descent.
8. Digression: Dual Gradient Descent (DGD)
Suppose that we have to solve the following constrained optimization problem:
\[ \min_{\mathbf{x}} f(\mathbf{x}) \quad \text{such that} \quad C(\mathbf{x}) = 0 \]
We may construct its Lagrangian as follows:
\[ \mathcal{L}(\mathbf{x},\lambda) = f(\mathbf{x}) + \lambda C(\mathbf{x}) \]
where \(\lambda\) is the Lagrange multiplier. Minimizing \(\mathcal{L}\) with respect to the primal variable, i.e. \(\mathbf{x}\), gives the dual function:
\[ g(\lambda) = \min_{\mathbf{x}} \mathcal{L}(\mathbf{x},\lambda) \]
Let \(\mathbf{x}^*\) denote the value of \(\mathbf{x}\) that minimizes \(\mathcal{L}\). Then:
\[ g(\lambda) = \mathcal{L}(\mathbf{x}^*,\lambda) \]
The optimal value of \(\lambda\) is given by:
\[ \lambda^* = \text{argmax}_{\lambda}\, g(\lambda) \]
This can be found by gradient ascent on \(g\), computing the gradient as:
\[ \frac{dg}{d\lambda} = \frac{d\mathcal{L}}{d\mathbf{x}^*}\frac{d\mathbf{x}^*}{d\lambda} + \frac{\partial\mathcal{L}}{\partial\lambda} = \frac{\partial\mathcal{L}}{\partial\lambda}\bigg\vert_{\mathbf{x}=\mathbf{x}^*} = C(\mathbf{x}^*) \]
where the second equality follows from the fact that \(\frac{d\mathcal{L}}{d\mathbf{x}^*} = 0\) because \(\mathbf{x}^*=\text{argmin}_{\mathbf{x}}\mathcal{L}(\mathbf{x},\lambda)\).
The dual gradient descent algorithm repeatedly performs the following steps until convergence:
- Find \(\mathbf{x}^* \leftarrow \text{argmin}_\mathbf{x} \mathcal{L}(\mathbf{x},\lambda)\)
- Compute \(\frac{dg}{d\lambda} = \frac{d\mathcal{L}(\mathbf{x}^*,\lambda)}{d\lambda}\)
- Update \(\lambda \leftarrow \lambda + \alpha\frac{dg}{d\lambda}\)
\(\lambda\) may be initialized randomly.
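A small numerical sketch of dual gradient descent on a toy problem (minimize \(x^2+y^2\) subject to \(x+y=1\), whose solution is \(x=y=0.5\)); the inner minimization is done approximately with a few gradient steps.

```python
import torch

def dual_gradient_descent(f, C, x0, alpha=0.1, outer_steps=50,
                          inner_steps=200, lr=1e-2):
    # Lagrangian: L(x, lam) = f(x) + lam * C(x).
    lam = torch.tensor(0.0)
    x = x0.clone().requires_grad_(True)
    for _ in range(outer_steps):
        opt = torch.optim.Adam([x], lr=lr)
        for _ in range(inner_steps):            # step 1: x* ~ argmin_x L(x, lam)
            loss = f(x) + lam * C(x)
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            lam = lam + alpha * C(x)            # steps 2-3: dg/dlam = C(x*), ascend
    return x.detach(), lam

x_star, lam_star = dual_gradient_descent(f=lambda v: (v ** 2).sum(),
                                         C=lambda v: v.sum() - 1.0,
                                         x0=torch.zeros(2))
# x_star converges to approximately [0.5, 0.5].
```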
9. DGD with iLQR
Recall that for our locally accurate model of the system dynamics, the optimization problem that we want to solve is:
\[ \min_{p} \sum_{t=1}^{T}\mathbb{E}_{(\mathbf{x}_t,\mathbf{u}_t)\sim p(\mathbf{x}_t,\mathbf{u}_t)}\left[c(\mathbf{x}_t,\mathbf{u}_t)\right] \quad \text{such that} \quad D_{KL}\left(p(\tau)\,\vert\vert\,\bar{p}(\tau)\right) \le \epsilon \]
Let us construct a Lagrangian for this:
\[ \mathcal{L}(p,\lambda) = \sum_{t=1}^{T}\mathbb{E}_{(\mathbf{x}_t,\mathbf{u}_t)\sim p(\mathbf{x}_t,\mathbf{u}_t)}\left[c(\mathbf{x}_t,\mathbf{u}_t) - \lambda\log\bar{p}(\mathbf{u}_t\vert\mathbf{x}_t) - \lambda\mathcal{H}\left(p(\mathbf{u}_t\vert\mathbf{x}_t)\right)\right] - \lambda\epsilon \]
where we have substituted in the expression for \(D_{KL}\left(p(\tau)\,\vert\vert\,\bar{p}(\tau)\right)\) that we derived above. We may use the dual gradient descent algorithm here:
Repeat until convergence:
- Find \(p^* \leftarrow \text{argmin}_p \mathcal{L}(p,\lambda)\)
- Compute \(\frac{dg}{d\lambda} = \frac{d\mathcal{L}(p^*,\lambda)}{d\lambda}\)
- Update \(\lambda \leftarrow \lambda + \alpha\frac{dg}{d\lambda}\)
Steps 2 and 3 are quite straightforward. For step 1, recall that the Linear-Gaussian version of iLQR minimizes the following with respect to \(p\):
\[ \sum_{t=1}^{T}\mathbb{E}_{(\mathbf{x}_t,\mathbf{u}_t)\sim p(\mathbf{x}_t,\mathbf{u}_t)}\left[c(\mathbf{x}_t,\mathbf{u}_t) - \mathcal{H}\left(p(\mathbf{u}_t\vert\mathbf{x}_t)\right)\right] \]
For step 1 we need to minimize the following with respect to \(p\):
\[ \mathcal{L}(p,\lambda) = \sum_{t=1}^{T}\mathbb{E}_{(\mathbf{x}_t,\mathbf{u}_t)\sim p(\mathbf{x}_t,\mathbf{u}_t)}\left[c(\mathbf{x}_t,\mathbf{u}_t) - \lambda\log\bar{p}(\mathbf{u}_t\vert\mathbf{x}_t) - \lambda\mathcal{H}\left(p(\mathbf{u}_t\vert\mathbf{x}_t)\right)\right] - \lambda\epsilon \]
for some known \(\lambda\). Dividing this expression throughout by \(\lambda\) and ignoring the term \(\lambda\epsilon\) (because it does not depend on \(p\)) gives:
\[ \sum_{t=1}^{T}\mathbb{E}_{(\mathbf{x}_t,\mathbf{u}_t)\sim p(\mathbf{x}_t,\mathbf{u}_t)}\left[\frac{1}{\lambda}c(\mathbf{x}_t,\mathbf{u}_t) - \log\bar{p}(\mathbf{u}_t\vert\mathbf{x}_t) - \mathcal{H}\left(p(\mathbf{u}_t\vert\mathbf{x}_t)\right)\right] \]
Therefore, for step 1 we can simply use the iLQR algorithm with the cost function:
\[ \bar{c}(\mathbf{x}_t,\mathbf{u}_t) = \frac{1}{\lambda}c(\mathbf{x}_t,\mathbf{u}_t) - \log\bar{p}(\mathbf{u}_t\vert\mathbf{x}_t) \]
Note that since \(\bar{p}(\mathbf{u}_t\vert\mathbf{x}_t)\) is linear-Gaussian, \(-\log\bar{p}(\mathbf{u}_t\vert\mathbf{x}_t)\) is quadratic in \(\mathbf{x}_t\) and \(\mathbf{u}_t\), so \(\bar{c}\) is still a cost that iLQR can handle.
So the final form of the algorithm is given as:
Repeat until convergence:
- Set \(\bar{c}(\mathbf{x}_t,\mathbf{u}_t) = \frac{1}{\lambda}c(\mathbf{x}_t,\mathbf{u}_t) - \log\bar{p}(\mathbf{u}_{t}\vert\mathbf{x}_t)\)
- Use iLQR to find \(p^*\) using \(\bar{c}(\mathbf{x}_t,\mathbf{u}_t)\) as the cost function
- Update \(\lambda \leftarrow \lambda + \alpha\left(D_{KL}(p^*(\tau)\vert\vert\bar{p}(\tau)) - \epsilon \right)\)
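Putting the pieces together, a high-level sketch of this loop might look as follows; `run_max_ent_ilqr`, `kl_divergence`, and `log_p_bar` are assumed placeholders for the maximum-entropy iLQR solver, the trajectory KL computation, and the previous controller's log-density.

```python
def constrained_ilqr_dgd(run_max_ent_ilqr, kl_divergence, cost, log_p_bar,
                         epsilon, lam=1.0, alpha=0.05, num_iters=20):
    p = None
    for _ in range(num_iters):
        # Step 1: maximum-entropy iLQR under the surrogate cost
        #         c_bar(x, u) = (1/lam) c(x, u) - log p_bar(u | x).
        c_bar = lambda x, u, lam=lam: cost(x, u) / lam - log_p_bar(x, u)
        p = run_max_ent_ilqr(c_bar)
        # Steps 2-3: dual update lam <- lam + alpha * (D_KL(p || p_bar) - eps).
        # The KL bound is an inequality constraint, so lam is kept positive
        # (a detail the pseudocode above leaves implicit).
        lam = max(1e-6, lam + alpha * (kl_divergence(p) - epsilon))
    return p
```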