Lecture 13
In this set of lecture notes, we will focus on learning a policy using model-based RL. A policy has several advantages. With it:
- we do not need to re-plan at every stage. This is computationally faster.
- we essentially have closed-loop control, i.e. the policy can be conditioned on the current and previous states (recall that with stochastic dynamics we can never be sure which state we will land in upon taking a particular action).
We have already seen one method of learning a policy in these notes, in which we directly backpropagated into the policy through the dynamics \(f(\mathbf{s},\mathbf{a})\). However, as we discussed, there are several problems associated with this approach. One way to mitigate those problems is to make use of collocation methods.
1. Collocation Methods
Collocation methods solve the following optimization problem:
\[
\min_{\mathbf{u}_1,\ldots,\mathbf{u}_T,\mathbf{x}_1,\ldots,\mathbf{x}_T} \sum_{t=1}^{T} c(\mathbf{x}_t,\mathbf{u}_t) \quad \text{s.t.} \quad \mathbf{x}_t = f(\mathbf{x}_{t-1},\mathbf{u}_{t-1})
\]
In case we have to learn a policy (with parameters \(\theta\)), we can rewrite this optimization problem as follows:
\[
\min_{\mathbf{u}_1,\ldots,\mathbf{u}_T,\mathbf{x}_1,\ldots,\mathbf{x}_T,\theta} \sum_{t=1}^{T} c(\mathbf{x}_t,\mathbf{u}_t) \quad \text{s.t.} \quad \mathbf{x}_t = f(\mathbf{x}_{t-1},\mathbf{u}_{t-1}), \; \mathbf{u}_t = \pi_\theta(\mathbf{x}_t)
\]
Note that:
\[
\min_{\mathbf{u}_1,\ldots,\mathbf{u}_T,\mathbf{x}_1,\ldots,\mathbf{x}_T} \sum_{t=1}^{T} c(\mathbf{x}_t,\mathbf{u}_t) \quad \text{s.t.} \quad \mathbf{x}_t = f(\mathbf{x}_{t-1},\mathbf{u}_{t-1})
\]
is a generic trajectory optimization problem, which we can solve in any way we like (such as with iLQR). We, however, need to impose the following additional constraint on it:
\[
\mathbf{u}_t = \pi_\theta(\mathbf{x}_t)
\]
Let us rewrite this problem as follows:
\[
\min_{\tau,\theta} \; c(\tau) \quad \text{s.t.} \quad \mathbf{u}_t = \pi_\theta(\mathbf{x}_t)
\]
where \(c(\tau)\) is the cost of the generic trajectory optimization problem. We can use dual gradient descent to solve this optimization problem. The Lagrangian is given by:
\[
\mathcal{L}(\tau,\theta,\lambda) = c(\tau) + \sum_{t=1}^{T} \lambda_t \left(\pi_\theta(\mathbf{x}_t) - \mathbf{u}_t\right)
\]
For dual gradient descent, we repeat until convergence:
- Find \(\tau^* \leftarrow \text{argmin}_\tau \mathcal{L}(\tau,\theta,\lambda)\) (e.g. with iLQR)
- Find \(\theta^* \leftarrow \text{argmin}_\theta \mathcal{L}(\tau^*,\theta,\lambda)\) (e.g. with SGD)
- Update \(\lambda \leftarrow \lambda + \alpha \frac{d\mathcal{L}(\tau^*,\theta^*,\lambda)}{d\lambda}\)
Sometimes instead of the normal Lagrangian, an augmented version is used:
\[
\bar{\mathcal{L}}(\tau,\theta,\lambda) = c(\tau) + \sum_{t=1}^{T} \lambda_t \left(\pi_\theta(\mathbf{x}_t) - \mathbf{u}_t\right) + \sum_{t=1}^{T} \rho \left\lVert \pi_\theta(\mathbf{x}_t) - \mathbf{u}_t \right\rVert^2
\]
where \(\rho\) is chosen heuristically. This usually works better than the simpler version.
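To make the loop above concrete, here is a minimal runnable sketch of dual gradient descent with the augmented Lagrangian on a toy problem. All of the modelling choices in it (the 1-D linear dynamics, quadratic cost, linear policy class, hyperparameters, and a generic numerical optimizer standing in for iLQR) are illustrative assumptions, not part of the notes.

```python
# Minimal sketch of the dual gradient descent loop on a toy problem:
# dynamics x_{t+1} = x_t + u_t, cost c = x_t^2 + 0.01 u_t^2, policy pi_theta(x) = theta * x.
import numpy as np
from scipy.optimize import minimize

T, x0, rho, alpha = 20, 5.0, 1.0, 0.05

def rollout(u):
    x = np.empty(T + 1)
    x[0] = x0
    for t in range(T):
        x[t + 1] = x[t] + u[t]                       # known deterministic dynamics f(x, u)
    return x

def lagrangian(u, theta, lam):
    x = rollout(u)
    c = np.sum(x[:T] ** 2 + 0.01 * u ** 2)           # c(tau)
    resid = theta * x[:T] - u                        # constraint residual pi_theta(x_t) - u_t
    return c + np.sum(lam * resid) + rho * np.sum(resid ** 2)   # augmented Lagrangian

theta, lam, u = 0.0, np.zeros(T), np.zeros(T)
for it in range(30):
    # Step 1: tau* <- argmin_tau L(tau, theta, lambda)  (a generic optimizer stands in for iLQR).
    u = minimize(lagrangian, u, args=(theta, lam)).x
    x = rollout(u)
    # Step 2: theta* <- argmin_theta L(tau*, theta, lambda).  For a linear policy and a
    # quadratic penalty this is a quadratic in theta, so it has a closed form here.
    theta = np.sum(x[:T] * (rho * u - 0.5 * lam)) / (rho * np.sum(x[:T] ** 2) + 1e-8)
    # Step 3: dual ascent; dL/dlambda_t is the constraint violation pi_theta(x_t) - u_t.
    lam = lam + alpha * (theta * x[:T] - u)

print("learned linear gain theta:", theta)
```

With a neural network policy, step 2 would instead be a few epochs of SGD on the same objective.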
The algorithm presented above (also referred to as Guided Policy Search) can also be interpreted as the policy imitating an optimal control expert (since step 2 is just supervised learning that minimizes the difference between the action the policy takes and the one obtained via trajectory optimization in step 1). One interesting thing to note here is that because the constraint \(\pi_\theta(\mathbf{x}_t)=\mathbf{u}_t\) needs to be satisfied, if the policy \(\pi_\theta\) is unable to mimic an expert action \(\mathbf{u}_t\), the expert has to stop taking that action. In other words, the expert adapts to \(\pi_\theta\) (sometimes referred to as the learner) by avoiding actions that \(\pi_\theta\) cannot take.
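One way to make the supervised-learning view explicit: in step 2 only the terms of the Lagrangian that depend on \(\theta\) matter, so, assuming the augmented version above, the \(\theta\)-update is essentially a regression of the policy onto the expert's actions:
\[
\theta^* \leftarrow \text{argmin}_\theta \sum_{t=1}^{T} \lambda_t \, \pi_\theta(\mathbf{x}_t) + \rho \left\lVert \pi_\theta(\mathbf{x}_t) - \mathbf{u}_t \right\rVert^2
\]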
2. General Guided Policy Search Scheme
The algorithm presented in the previous section can be generalized in the following way:
Repeat until convergence:
- Optimize \(p(\tau)\) (or \(\tau\) in the deterministic case) with respect to some surrogate \(\bar{c}(\mathbf{x}_t,\mathbf{u}_t)\)
- Optimize \(\theta\) with respect to some supervised objective
- Update the dual variables \(\lambda\)
Here, we need to choose some form of \(p(\tau)\), an optimization method for \(p(\tau)\), the surrogate \(\bar{c}(\mathbf{x}_t,\mathbf{u}_t)\) and a supervised objective for \(\pi_\theta(\mathbf{u}_t \vert \mathbf{x}_t)\).
Note that in the deterministic case presented in the previous section, \(\bar{c}(\tau)\) was just the Lagrangian. In the stochastic case, we could make use of the local models we discussed in these notes. In that case, step 1 solves the following optimization problem:
\[
\min_{p(\tau)} \sum_{t=1}^{T} \mathbb{E}_{p(\mathbf{x}_t,\mathbf{u}_t)}\left[\bar{c}(\mathbf{x}_t,\mathbf{u}_t)\right] \quad \text{s.t.} \quad D_{KL}\left(p(\tau) \,\Vert\, \bar{p}(\tau)\right) \le \epsilon
\]
where \(\bar{p}(\tau)\) is the trajectory distribution from the previous iteration.
Also note that:
where:
One trick sometimes used is the Input Remapping Trick. Instead of training the policy on states \(\mathbf{x}\), the input remapping trick trains it on observations \(\mathbb{o}\), i.e. we learn \(\pi_\theta(\mathbf{u}_t \vert \mathbb{o}_t)\) rather than \(\pi_\theta(\mathbf{u}_t \vert \mathbf{x}_t)\).
Note that during training time we still require access to the states \(\mathbf{x}\) for learning the dynamics of the system. The policy, however, only has access to the observations (and not the states).
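As a minimal sketch of the input remapping trick, the snippet below fits the dynamics model on (state, action, next state) tuples while regressing the policy onto (observation, action) pairs. The array shapes and the simple least-squares linear models are purely illustrative stand-ins for real dynamics models and policies.

```python
import numpy as np

# Hypothetical logged data: full states are available at training time, but the
# deployed policy will only ever see observations (e.g. camera images).
rng = np.random.default_rng(0)
states       = rng.standard_normal((500, 4))   # x_t
actions      = rng.standard_normal((500, 2))   # u_t (e.g. labels from trajectory optimization)
next_states  = rng.standard_normal((500, 4))   # x_{t+1}
observations = rng.standard_normal((500, 8))   # o_t

# Dynamics model p(x_{t+1} | x_t, u_t): fit on *states*, used only during training.
xu = np.hstack([states, actions])
W_dyn, *_ = np.linalg.lstsq(xu, next_states, rcond=None)

# Policy pi_theta(u_t | o_t): fit on *observations*, the only input available at test time.
W_pi, *_ = np.linalg.lstsq(observations, actions, rcond=None)

def policy(obs):
    # The deployed policy never touches the state.
    return obs @ W_pi
```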
3. PLATO Algorithm
The PLATO algorithm combines DAgger with the Guided Policy Search algorithm:
Repeat until convergence:
- Train \(\pi_\theta(\mathbf{u}_t \vert \mathbb{o}_t)\) on human data \(\mathcal{D} = \{\mathbb{o}_1,\mathbf{u}_1,\ldots,\mathbb{o}_N,\mathbf{u}_N\}\).
- Run \(\hat{\pi}(\mathbf{u}_t \vert \mathbb{o}_t)\) to get \(\mathcal{D}_{\hat\pi} = \{\mathbb{o}_1,\ldots,\mathbb{o}_M\}\).
- Ask the optimal control expert (planner) to label each \(\mathbb{o}_t\) in \(\mathcal{D}_{\hat\pi}\) with actions \(\mathbf{u}_t\).
- Aggregate \(\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{D}_{\hat\pi}\).
where \(\hat\pi\) is a mixture of \(\pi_\theta\) and the optimal control expert. If \(\pi_\theta\) is not a good policy (such as in the beginning), then by mixing it with the optimal control expert, we can prevent actions that could result in a catastrophe. One example of \(\hat\pi\) is:
\[
\hat\pi(\mathbf{u}_t \vert \mathbf{x}_t) = \text{argmin}_{\pi} \; \mathbb{E}_{\pi}\left[\sum_{t'=t}^{T} c(\mathbf{x}_{t'},\mathbf{u}_{t'})\right] + \lambda \, D_{KL}\left(\pi(\mathbf{u}_t \vert \mathbf{x}_t) \,\Vert\, \pi_\theta(\mathbf{u}_t \vert \mathbb{o}_t)\right)
\]
This objective function ensures that \(\hat\pi\) minimizes the cost while simultaneously remaining as close to \(\pi_\theta\) as possible (which is required for convergence as we discussed in DAgger). Note that step 3 eliminates the requirement of having a human label \(\mathcal{D}_{\hat\pi}\) as in the DAgger algorithm.
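Below is a schematic, runnable sketch of the PLATO loop. The observation model, the toy dynamics, the least-squares policy fit, the stand-in planner, and the simple convex blend used for \(\hat\pi\) are all illustrative placeholders for the real components (MPC expert, simulator, supervised learner).

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, act_dim = 4, 2

def get_obs(state):
    return np.tanh(state)                      # assumed observation model o_t = g(x_t)

def expert_label(state):
    # Stand-in for the optimal control expert (planner); it gets to see the full state.
    return -state[:act_dim]

def fit_policy(data):
    # Supervised learning of pi_theta(u_t | o_t); least-squares regression as a stand-in.
    O = np.array([o for o, _ in data])
    U = np.array([u for _, u in data])
    W, *_ = np.linalg.lstsq(O, U, rcond=None)
    return lambda o: o @ W

# Step 1: train the policy on an initial labeled dataset D = {(o, u)}.
data = [(get_obs(rng.standard_normal(state_dim)), rng.standard_normal(act_dim))
        for _ in range(20)]
policy = fit_policy(data)

for iteration in range(5):
    state = rng.standard_normal(state_dim)
    new_pairs = []
    for t in range(10):
        obs = get_obs(state)
        u_learner = policy(obs)
        u_expert = expert_label(state)
        u_hat = 0.5 * u_learner + 0.5 * u_expert     # step 2: act with pi_hat, a mixture of
                                                     #         the learner and the expert
        new_pairs.append((obs, u_expert))            # step 3: planner labels the observation
        state = 0.9 * state + 0.1 * np.concatenate([u_hat, np.zeros(state_dim - act_dim)])
    data += new_pairs                                # step 4: aggregate D <- D U D_pi_hat
    policy = fit_policy(data)                        # retrain on the aggregated dataset
```

In the actual algorithm \(\hat\pi\) is the KL-regularized planner from the objective above rather than a fixed 50/50 blend.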
4. Model-Free Optimization with a Model
Sometimes even when we have a model, it is better to use a model-free algorithm (such as policy gradients) and use the model as a simulator only. Dyna is one such algorithm. It makes use of the Q-learning algorithm:
Repeat until convergence:
- Given state \(\mathbf{s}\), pick action \(\mathbf{a}\) using an exploration policy.
- Observe \(\mathbf{s}'\) and \(r\) to get transition \((\mathbf{s},\mathbf{a},\mathbf{s}',r)\).
- Update model \(\hat p(\mathbf{s}'\vert \mathbf{s},\mathbf{a})\) and \(\hat r(\mathbf{s},\mathbf{a})\) using \((\mathbf{s},\mathbf{a},\mathbf{s}',r)\).
- Q-update: \(Q(\mathbf{s},\mathbf{a}) \leftarrow Q(\mathbf{s},\mathbf{a}) + \alpha \mathbb{E}_{\mathbf{s}',r}\left[r + \max_{\mathbf{a}'}Q(\mathbf{s}',\mathbf{a}')-Q(\mathbf{s},\mathbf{a})\right]\).
- Repeat K times:
  - Sample \((\mathbf{s},\mathbf{a}) \sim \mathcal{B}\) from a buffer of past states and actions.
  - Q-update: \(Q(\mathbf{s},\mathbf{a}) \leftarrow Q(\mathbf{s},\mathbf{a}) + \alpha \mathbb{E}_{\mathbf{s}',r}\left[r + \max_{\mathbf{a}'}Q(\mathbf{s}',\mathbf{a}')-Q(\mathbf{s},\mathbf{a})\right]\).
In step 3, we simply update our model of the system dynamics. In addition, if we do not know the reward for each state-action tuple (though often the reward function is known), we also need to fit a function approximator \(\hat r(\mathbf{s},\mathbf{a})\). In step 4 we perform the Q-learning update on the transition observed in the current iteration, and in step 5 we update our Q-values using previous transitions (stored in the buffer \(\mathcal B\)).
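For concreteness, here is a compact tabular Dyna-Q sketch. The tiny randomly generated MDP, the last-visit model, and the hyperparameters are illustrative choices for the example, not something specified in the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha, eps, K = 10, 4, 0.95, 0.1, 0.1, 20

# Hypothetical tabular MDP, used only to generate real experience.
P_true = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R_true = rng.standard_normal((n_states, n_actions))

Q = np.zeros((n_states, n_actions))
model_next = {}          # learned model: (s, a) -> last observed next state
model_reward = {}        # learned reward: (s, a) -> last observed reward
buffer = []              # past (s, a) pairs

s = 0
for step in range(2000):
    # 1. pick an action with an epsilon-greedy exploration policy
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
    # 2. observe s' and r from the real environment
    s_next = rng.choice(n_states, p=P_true[s, a])
    r = R_true[s, a]
    # 3. update the (last-visit) model and reward estimate
    model_next[(s, a)], model_reward[(s, a)] = s_next, r
    buffer.append((s, a))
    # 4. Q-learning update on the real transition
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    # 5. K extra Q-updates on transitions simulated from the learned model
    for _ in range(K):
        ps, pa = buffer[rng.integers(len(buffer))]
        ps_next, pr = model_next[(ps, pa)], model_reward[(ps, pa)]
        Q[ps, pa] += alpha * (pr + gamma * Q[ps_next].max() - Q[ps, pa])
    s = s_next
```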
The Dyna approach can be generalized in the following fashion:
- Collect some data, consisting of transitions \((\mathbf{s},\mathbf{a},\mathbf{s}',r)\).
- Learn model \(\hat p (\mathbf{s}'\vert \mathbf{s},\mathbf{a})\) (and optionally, \(\hat r(\mathbf{s},\mathbf{a})\)).
- Repeat \(K\) times:
  - Sample \(\mathbf{s} \sim \mathcal B\) from the buffer.
  - Choose action \(\mathbf{a}\) (using \(\pi\) or randomly).
  - Simulate \(\mathbf{s}' \sim \hat{p}(\mathbf{s}'\vert \mathbf{s},\mathbf{a})\) (and \(r = \hat{r}(\mathbf{s},\mathbf{a})\)).
  - Train on \((\mathbf{s},\mathbf{a},\mathbf{s}',r)\) with model-free RL.
  - (Optional) Take \(N\) more model-based steps: choose an action \(\mathbf{a}'\) for the state \(\mathbf{s}'\) (from the simulation step above), simulate this using the learned model \(\hat{p}(\mathbf{s}''\vert \mathbf{s}',\mathbf{a}')\), and train the model-free RL on this new transition as well. We can then repeat this for \(\mathbf{s}''\) to get \(\mathbf{s}'''\), and so on.
The advantage of Dyna is that it requires only very short (as few as one step) rollouts from the model. Note that the initial states are actual real states, as they have been sampled from real-world transitions (stored in \(\mathcal B\)). Therefore, even if \(\hat p\) has some errors, rolling it out for only a few time steps will not affect the model-free RL training step much. In contrast, classic model-based RL would run the simulator (i.e. \(\hat p\)) right from the start, so if the trajectories are very long, the errors of \(\hat p\) at each time step add up, resulting in a huge error towards the end.
Note that in the Dyna version the model-free RL will still be able to see the entire state space as long as the states in \(\mathcal B\) are sufficiently spread throughout the entire space.
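The sketch below illustrates this generalized recipe with one-step model rollouts branching off real states stored in \(\mathcal B\). The continuous toy dynamics, the linear model fit, and the assumption of a known reward are illustrative choices; the generated transitions would be fed to whatever model-free learner is being used.

```python
import numpy as np

rng = np.random.default_rng(1)
state_dim, act_dim = 3, 1

def real_env_step(s, a):
    # Hypothetical true dynamics and reward, used only to collect real data.
    a_full = np.concatenate([a, np.zeros(state_dim - act_dim)])
    s_next = 0.9 * s + 0.1 * a_full + 0.01 * rng.standard_normal(state_dim)
    return s_next, -float(s @ s)

# 1. Collect some real transitions into a buffer B.
buffer = []
s = rng.standard_normal(state_dim)
for _ in range(200):
    a = rng.standard_normal(act_dim)
    s_next, r = real_env_step(s, a)
    buffer.append((s, a, s_next, r))
    s = s_next

# 2. Learn a (linear) model  s' ~ [s, a, 1] @ W  from the buffer.
X = np.array([np.concatenate([s, a, [1.0]]) for s, a, _, _ in buffer])
Y = np.array([s_next for _, _, s_next, _ in buffer])
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def model_step(s, a):
    return np.concatenate([s, a, [1.0]]) @ W, -float(s @ s)   # reward assumed known here

# 3. Repeat K times: branch short (one-step) model rollouts off *real* buffer states.
synthetic = []
for _ in range(1000):
    s, _, _, _ = buffer[rng.integers(len(buffer))]   # real state from B
    a = rng.standard_normal(act_dim)                  # action from pi or random
    s_next, r = model_step(s, a)
    synthetic.append((s, a, s_next, r))               # handed to the model-free learner

print(f"{len(synthetic)} one-step model-generated transitions ready for model-free RL")
```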
5. Limitations of Model-Based RL
- Need some kind of model which either may not be available or may be harder to learn than a policy.
- Learning a model takes time and data. Expressive model classes (such as neural networks) may not be fast to learn, while fast model classes (such as linear models) may not be expressive enough.
- Need some kind of additional assumptions such as linearizability, continuity and smoothness.