Lecture 10

1. Model-Based Reinforcement Learning

Recall that the probability of choosing a particular trajectory under some policy with parameters $\theta$ is given by:

$$ p_\theta(\tau) = p(\mathbf{s}_1)\prod_{t=1}^T p_\theta(\mathbf{a}_t\vert\mathbf{s}_t) p(\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t) $$

and the RL objective function by:

$$ \theta^* = \text{argmax}_{\pi_\theta}\mathbb{E}_{\tau\sim\pi_\theta(\tau)} \sum_{t=1}^T r(\mathbf{s}_t,\mathbf{a}_t) $$

In the approaches that we have studied so far, we assumed that the transition probabilities $p(\mathbf{s}_{t+1}\vert\mathbf{s}_t,\mathbf{a}_t)$ are unknown to us. However, in many cases we do actually know the transition probabilities such as in simulated environments and systems that can be modeled easily. In many other cases, we can learn these transition probabilities by, for example, observing the transition data. Model-based RL leverages these transition probabilities to choose the best actions for a given system.

In this set of lecture notes, we will assume that we know the transition probabilities. Our goal, thus, is to find a sequence of actions $\mathbf{a}_1,\mathbf{a}_2,\ldots,\mathbf{a}_T$ that optimize the expected reward.

2. The Deterministic Case

In this case we can need to only solve for the following optimization problem:

$$ \mathbf{a}_1,...,\mathbf{a}_T = \text{argmax}_{\mathbf{a}_1,...,\mathbf{a}_T} \sum_{t=1}^T r(\mathbf{s}_t,\mathbf{a}_t)\; s.t. \mathbf{s}_{t+1} = f(\mathbf{s}_t, \mathbf{a}_t) $$

3. Open Loop V. Closed Loop Systems

In reinforcement learning, given some environment an agent is required to take a sequence of actions that maximize its (expected) reward. An open loop system is where the agent is asked to plan ahead its sequence of actions given only the state that it will start in (the initial state). In contrast, in a closed-loop system, the agent does not plan ahead. Instead, given some state, it takes a single action, sees which state it transitions to and then decides its next action. The agent, thus, continuously interacts with the environment in closed-loop case (whereas in the open-loop case it only interacts with it once).

4. The Stochastic Open-Loop Case

In this case, we need to find the solution to the following problem:

$$ \mathbf{a}_1,\ldots,\mathbf{a}_T = \text{argmax}_{\mathbf{a}_1,\ldots,\mathbf{a}_T} \mathbb{E}_{\mathbf{s}_1,\ldots,\mathbf{s}_T\sim p(\mathbf{s}_1,\ldots,\mathbf{s}_T \vert \mathbf{a}_1,\ldots,\mathbf{a}_T)} \left[\sum_{t=1}^T r(\mathbf{s}_t, \mathbf{a}_t) \right] $$

where:

$$ p(\mathbf{s}_1,\ldots,\mathbf{s}_T\vert \mathbf{a}_1,\ldots,\mathbf{a}_T) = p(\mathbf{s}_1)\prod_{t=1}^T p(\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t) $$

Note that while the solution to this problem, will give a sequence of actions that will yield the maximum reward in expectation, for individual rollouts the solution may not be optimal. The agent may end up in a particular state (because of the stochasticity in the environment) that it may not have anticipated during planning. As such, the action for that time step may not be the optimal action to take in that state.

5. The Stochastic Closed-Loop Case

For this case, we simply need to find a policy $\pi$ that maximizes the expected reward:

$$ \pi = \text{argmax}_\pi \mathbb{E}_{\tau\sim p(\tau)}\left[\sum_t r(\mathbf{s}_t, \mathbf{a}_t)\right] $$

Note that because $\pi$ will take in account the state at each time step, it will correspond to the policy that always maximizes the reward (for all rollouts).

For now, however, we will be only concerned with stochastic open-loop case.

6. Stochastic Optimization

Let us rewrite the stochastic open loop case as:

$$ \mathbf{a}_1,\ldots,\mathbf{a}_T = \text{argmax}_{\mathbf{a}_1,\ldots,\mathbf{a}_T} J(\mathbf{a}_1,\ldots,\mathbf{a}_T) $$

where, for now, we do not really care for $J$ is. For brevity, we shall denote the sequence $\mathbf{a}_1,\ldots,\mathbf{a}_T$ as $\mathbf A$. We shall now discuss ways in which we can find $\mathbf A$.

A. Random Shooting Method (“Guess and Check”)

This method simply:

Picks several sequences of actions $\mathbf{A}_1,\ldots,\mathbf{A}_2$ from some distribution (e.g. uniform)
Chooses $\mathbf{A}_i$ such that $i = \text{argmax}_i J(\mathbf{A}_i)$

B. Cross-Entropy Method (CEM)

For continuous-valued inputs, the cross-entropy method repeats the following until convergence:

Sample $\mathbf{A}_1,\ldots,\mathbf{A}_N$ from some distribution $p$
Evaluate $J(\mathbf{A}_1),\ldots,J(\mathbf{A}_N)$
Pick the elites $\mathbf{A}_1,\ldots,\mathbf{A}_M$ where $% <![CDATA[ M<N %]]>$ with the highest value
Refit $p$ to the elites.

If $p$ is a Gaussian distribution, then step 4 just amounts to setting the mean and variance of $p$ to the mean and variance of the elites $\mathbf{A}_1,\ldots,\mathbf{A}_M$.

The advantages of CEM and the Random Shooting method are that they are very fast if parallelized and are very simple to implement. However, the main disadvantage of these methods is that they are only efficient for a small time horizon (i.e a small $T$). Also, they assume an open-loop setting which, as we discussed earlier, is itself suboptimal.

C. Monte-Carlo Tree Search (MCTS)

The Monte-Carlo Tree Search algorithm is a tree-based method. The nodes at the $t^{th}$ level in the tree represent the state at time $t$. The branches of the tree are the different possible actions that can be taken. The MCTS repeatedly performs the following steps for some fixed number of iterations:

Find a leaf $s_l$ using $\text{TreePolicy}(s_l)$
Evaluate the leaf using $\text{DefaultPolicy}(s_l)$
Update all values in tree between $s_1$ (the root node) and $s_l$

The $\text{DefaultPolicy}$ is just some (random) policy used to guess the value of (or the reward of being in) state $s_l$. One example of the $\text{TreePolicy}$ is the UCT policy:

Initialize $s_t = s_1$.
Repeat:
1. If all actions from state $s_t$ have not been tried, choose a new action and break.
2. Else set $s_t = \text{argmax}_{s_{t+1}} \text{Score}(s_{t+1})$.

One common example of the $\text{Score}$ function is:

$$ \text{Score}(s_t) = \frac{Q(s_t)}{N(s_t)} + 2C\sqrt{\frac{2\ln N(s_{t-1})}{N(s_t)}} $$

where $Q(s_t)$ is the total future reward obtained from being in state $s_t$ and $N(s_t)$ is the number of times the node $s_t$ has been visited in the tree. $C$ is a tunable parameter. Note that this function essentially tries to choose between nodes with higher Q values (i.e. future rewards) and rarely visited nodes.

D. Using Derivatives

We may simply take the derivatives of $J(\mathbf{a}_1,\ldots,\mathbf{a}_T)$ with respect to $\mathbf{a}_1,\ldots,\mathbf{a}_T$ and use a first-order optimization method such as gradient descent. However, in practice it really helps to use a second-order method such as the Linear Quadratic regulator (LQR) which we discuss shortly.

7. Shooting Methods V. Collocation

From the remainder of these lecture notes, we will use $\mathbf{x}_t$ and $\mathbf{u}_t$ to represent the state and action at time $t$ respectively. We will also replace the reward function $r$ with the cost function $c$. The cost function is just the negative of the reward function. As such, our goal will be to minimize the total (expected) cost. We will shortly state this formally. This terminology is more in line with control theory, which is more concerned with optimal planning and trajectories.

Consider the deterministic case only. Shooting methods solve for the following optimization problem:

$$ \min_{\mathbf{u}_1,\ldots,\mathbf{u}_T} \sum_t c(\mathbf{x}_t,\mathbf{u}_t) $$

i.e. they optimize over the actions. In contrast, colocation methods also optimize over the states:

$$ \min_{\mathbf{u}_1,\ldots,\mathbf{u}_T,\mathbf{x}_1,\ldots,\mathbf{x}_T} \sum_{t=1}^T c(\mathbf{x}_t,\mathbf{u}_t) \;\; \text{s.t.}\;\; \mathbf{x}_t = f(\mathbf{x}_{t-1},\mathbf{u}_{t-1}) $$

8. Linear Quadratic Regulator (LQR)

For now, we will restrict ourselves to shooting methods only. Note that the optimization problem can be rewritten as (recall that for now we assume a deterministic setting):

$$ \min_{\mathbf{u}_1,\ldots,\mathbf{u}_T} c(\mathbf{x}_1,\mathbf{u}_1) + c(f(\mathbf{x}_1,\mathbf{u}_1),\mathbf{u}_2) + \ldots + c(f(f(\ldots)\ldots), \mathbf{u}_T) $$

The LQR method assumes that $f$ is linear:

$$ f(\mathbf{x}_t,\mathbf{u}_t) = \mathbf{F}_t \begin{bmatrix}\mathbf{x}_t\\\mathbf{u}_t \end{bmatrix} + \mathbf{f}_t $$

and that $c$ is quadratic:

$$ c(\mathbf{x}_t,\mathbf{u}_t) = \frac{1}{2}\begin{bmatrix}\mathbf{x}_t\\\mathbf{u}_t \end{bmatrix}^T\mathbf{C}_t\begin{bmatrix}\mathbf{x}_t\\\mathbf{u}_t\end{bmatrix} + \begin{bmatrix}\mathbf{x}_t\\\mathbf{u}_t\end{bmatrix}^T \mathbf{c}_t $$

where $\mathbf{F}_t$ and $\mathbf{C}_t$ are some known symmetric matrices and $\mathbf{f}_t$ and $\mathbf{c}_t$ are some known vectors. Note that $f$ and $c$ produce a vector and a scalar respectively.

To solve the optimization problem posed above, we will simply take its derivative with respect to each of the $T$ variables ($\mathbf{u}_1,\ldots,\mathbf{u}_T$) and set the result to zero. To do so, we will frequently make use of the matrix derivatives given in this post.

For convenience, let us rewrite the matrix $\mathbf{C}_t$ and the vector $\mathbf{c}_t$ as:

$$ \begin{align} \mathbf{C}_t &= \begin{bmatrix} \mathbf{C}_{\mathbf{x}_t,\mathbf{x}_t} & \mathbf{C}_{\mathbf{x}_t,\mathbf{u}_t} \\ \mathbf{C}_{\mathbf{u}_t,\mathbf{x}_t} & \mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}\end{bmatrix}\\ \mathbf{c}_t &= \begin{bmatrix} \mathbf{c}_{\mathbf{x}_T} \\ \mathbf{c}_{\mathbf{u}_T} \end{bmatrix} \end{align} $$

Concretely, if $\mathbf{x}_t \in \mathbb{R}^{d_\mathbf{x}}$ and $\mathbf{u}_t \in \mathbb{R}^{d_u}$, then $\mathbf{C}_{\mathbf{x}_t,\mathbf{x}_t} \in \mathbb{R}^{d_\mathbf{x}\times d_\mathbf{x}}$, $\mathbf{C}_{\mathbf{x}_t,\mathbf{u}_t} \in \mathbb{R}^{d_\mathbf{x}\times d_\mathbf{u}}$, $\mathbf{C}_{\mathbf{u}_t,\mathbf{x}_t} \in \mathbb{R}^{d_\mathbf{u}\times d_\mathbf{x}}$, $\mathbf{C}_{\mathbf{u}_t,\mathbf{u}_t} \in \mathbb{R}^{d_\mathbf{u}\times d_\mathbf{u}}$, $\mathbf{c}_{\mathbf{x}_t} \in \mathbb{R}^{d_\mathbf{x}}$ and $\mathbf{c}_{\mathbf{u}_t} \in \mathbb{R}^{d_\mathbf{u}}$. Note that as $\mathbf{C}_t$ is symmetric, $\mathbf{C}_{\mathbf{x}_t,\mathbf{u}_t}$ and $\mathbf{C}_{\mathbf{u}_t,\mathbf{x}_t}$ are transposes of each other and both $\mathbf{C}_{\mathbf{x}_t,\mathbf{x}_t}$ and $\mathbf{C}_{\mathbf{u}_t,\mathbf{u}_t}$ are symmetric

We may now rewrite $c(\mathbf{x}_t,\mathbf{u}_t)$ as:

$$ c(\mathbf{x}_t,\mathbf{u}_t) = \frac{1}{2}\left[\mathbf{x}_t^T \mathbf{C}_{\mathbf{x}_t,\mathbf{x}_t}\mathbf{x}_t + \mathbf{x}_t^T \mathbf{C}_{\mathbf{x}_t,\mathbf{u}_t}\mathbf{u}_t + \mathbf{u}_t^T \mathbf{C}_{\mathbf{u}_t,\mathbf{x}_t}\mathbf{x}_t + \mathbf{u}_t^T \mathbf{C}_{\mathbf{u}_t,\mathbf{u}_t}\mathbf{u}_t\right] + \mathbf{x}_t^T\mathbf{c}_{\mathbf{x}_t} + \mathbf{u}_t^T\mathbf{c_{\mathbf{u}_t}} $$

Let us begin by taking the derivative with respect to $\mathbf{u}_T$. Note that only the last term in the optimization problem above depends on $\mathbf{u}_T$:

$$ \begin{align} \nabla_{\mathbf{u}_T} \sum_{t=1}^T c(\mathbf{x}_t, \mathbf{u}_t) &= \nabla_{\mathbf{u}_T} c(\mathbf{x}_T,\mathbf{u}_T)\\ &= \frac{1}{2}\left[\mathbf{C}_{\mathbf{x}_T,\mathbf{u}_T}^T\mathbf{x}_T + \mathbf{C}_{\mathbf{u}_T,\mathbf{x}_T}\mathbf{x}_T + \mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}\mathbf{u}_T + \mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}^T \mathbf{u}_T\right] + \mathbf{c}_{\mathbf{u}_T}^T\\ &= \mathbf{C}_{\mathbf{u}_T,\mathbf{x}_T}\mathbf{x}_T + \mathbf{C}_{\mathbf{u}_T, \mathbf{u}_T}\mathbf{u}_T + \mathbf{c}_{\mathbf{u}_T}^T \end{align} $$

where we made use of the symmetricity of $\mathbf{C}_T$ in the last step. Setting this to zero gives:

$$ \mathbf{u}_T = -\mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}^{-1}(\mathbf{C}_{\mathbf{u}_T,\mathbf{x}_T}\mathbf{x}_T+\mathbf{c}_{\mathbf{u}_T}^T) $$

This can be re-expressed as:

$$ \mathbf{u}_T = \mathbf{K}_T\mathbf{x}_T+\mathbf{k}_T $$

where:

$$ \begin{align} \mathbf{K}_T &= -\mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}^{-1} \mathbf{C}_{\mathbf{u}_T,\mathbf{x}_T}\\ \mathbf{k}_T &= -\mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}^{-1} \mathbf{c}_{\mathbf{u}_T}^T \end{align} $$

Now that we have written $\mathbf{u}_T$ in terms of $\mathbf{x}_T$, we can eliminate it from our objective function:

$$ \begin{align} c(\mathbf{x}_T,\mathbf{u}_T) &= \frac{1}{2}[\mathbf{x}_T^T \mathbf{C}_{\mathbf{x}_T,\mathbf{x}_T}\mathbf{x}_T + \mathbf{x}_T^T \mathbf{C}_{\mathbf{x}_T,\mathbf{u}_T}\mathbf{u}_T + \mathbf{u}_T^T \mathbf{C}_{\mathbf{u}_T,\mathbf{x}_T}\mathbf{x}_T + \mathbf{u}_T^T \mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}\mathbf{u}_T] + \mathbf{x}_T^T\mathbf{c}_{\mathbf{x}_T} + \mathbf{u}_T^T\mathbf{c_{\mathbf{u}_T}}\\ &=\frac{1}{2}[\mathbf{x}_T^T \mathbf{C}_{\mathbf{x}_T,\mathbf{x}_T}\mathbf{x}_T + \mathbf{x}_T^T \mathbf{C}_{\mathbf{x}_T,\mathbf{u}_T}(\mathbf{K}_T\mathbf{x}_T+\mathbf{k}_T) + (\mathbf{K}_T\mathbf{x}_T+\mathbf{k}_T)^T \mathbf{C}_{\mathbf{u}_T,\mathbf{x}_T}\mathbf{x}_T + \\&\qquad (\mathbf{K}_T\mathbf{x}_T+\mathbf{k}_T)^T \mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}(\mathbf{K}_T\mathbf{x}_T+\mathbf{k}_T)] + \mathbf{x}_T^T\mathbf{c}_{\mathbf{x}_T} + (\mathbf{K}_T\mathbf{x}_T+\mathbf{k}_T)^T\mathbf{c_{\mathbf{u}_T}}\\ &=\frac{1}{2}[\mathbf{x}_T^T \mathbf{C}_{\mathbf{x}_T,\mathbf{x}_T}\mathbf{x}_T + \mathbf{x}_T^T \mathbf{C}_{\mathbf{x}_T,\mathbf{u}_T}\mathbf{K}_T\mathbf{x}_T+\mathbf{x}_T^T \mathbf{C}_{\mathbf{x}_T,\mathbf{u}_T}\mathbf{k}_T + \mathbf{x}_T^T\mathbf{K}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{x}_T}\mathbf{x}_T +\mathbf{k}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{x}_T}\mathbf{x}_T \\&\qquad +\mathbf{x}_T^T\mathbf{K}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}\mathbf{K}_T\mathbf{x}_T +\mathbf{k}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}\mathbf{K}_T\mathbf{x}_T + \mathbf{x}_T^T\mathbf{K}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}\mathbf{k}_T +\mathbf{k}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}\mathbf{k}_T ] \\&\qquad + \mathbf{x}_T^T\mathbf{c}_{\mathbf{x}_T} + \mathbf{x}_T^T\mathbf{K}_T^T \mathbf{c_{\mathbf{u}_T}} +\mathbf{k}_T^T\mathbf{c_{\mathbf{u}_T}}\\ &= \frac{1}{2}[\mathbf{x}_T^T \mathbf{C}_{\mathbf{x}_T,\mathbf{x}_T}\mathbf{x}_T + \mathbf{x}_T^T \mathbf{C}_{\mathbf{x}_T,\mathbf{u}_T}\mathbf{K}_T\mathbf{x}_T+\mathbf{x}_T^T \mathbf{C}_{\mathbf{x}_T,\mathbf{u}_T}\mathbf{k}_T + \mathbf{x}_T^T\mathbf{K}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{x}_T}\mathbf{x}_T +\mathbf{x}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{x}_T}^T\mathbf{k}_T \\&\qquad +\mathbf{x}_T^T\mathbf{K}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}\mathbf{K}_T\mathbf{x}_T + \mathbf{x}_T^T\mathbf{K}_T^T \mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}\mathbf{k}_T+ \mathbf{x}_T^T\mathbf{K}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}\mathbf{k}_T +\mathbf{k}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}\mathbf{k}_T ] \\&\qquad + \mathbf{x}_T^T\mathbf{c}_{\mathbf{x}_T} + \mathbf{x}_T^T\mathbf{K}_T^T \mathbf{c_{\mathbf{u}_T}} +\mathbf{k}_T^T\mathbf{c_{\mathbf{u}_T}}\\ &= \frac{1}{2}[\mathbf{x}_T^T \mathbf{C}_{\mathbf{x}_T,\mathbf{x}_T}\mathbf{x}_T + \mathbf{x}_T^T \mathbf{C}_{\mathbf{x}_T,\mathbf{u}_T}\mathbf{K}_T\mathbf{x}_T + \mathbf{x}_T^T\mathbf{K}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{x}_T}\mathbf{x}_T +\mathbf{x}_T^T\mathbf{K}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}\mathbf{K}_T\mathbf{x}_T ] \\&\qquad +\mathbf{x}_T^T\mathbf{K}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}\mathbf{k}_T + \mathbf{x}_T^T\mathbf{C}_{\mathbf{x}_T,\mathbf{u}_T}\mathbf{k}_T + \mathbf{x}_T^T\mathbf{K}_T^T \mathbf{c_{\mathbf{u}_T}} + \mathbf{x}_T^T\mathbf{c}_{\mathbf{x}_T} +\frac{1}{2}\mathbf{k}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}\mathbf{k}_T +\mathbf{k}_T^T\mathbf{c_{\mathbf{u}_T}}\\ \end{align} $$

where line 4 follows from the facts that a scalar is equal to its transpose and that $\mathbf{C}_T$ is symmetric. We may re-express the last line as:

$$ c(\mathbf{x}_T,\mathbf{u}_T) = \frac{1}{2}\mathbf{x}_T^T\mathbf{V}_T\mathbf{x}_T + \mathbf{x}_T^T\mathbf{v}_T + \mathit{const.} $$

where:

$$ \begin{align} \mathbf{V}_T &= \mathbf{C}_{\mathbf{x}_T,\mathbf{x}_T} + \mathbf{C}_{\mathbf{x}_T,\mathbf{u}_T}\mathbf{K}_T + \mathbf{K}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{x}_T} +\mathbf{K}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}\mathbf{K}_T\\ \mathbf{v}_T &= \mathbf{K}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}\mathbf{k}_T + \mathbf{C}_{\mathbf{x}_T,\mathbf{u}_T}\mathbf{k}_T +\mathbf{K}_T^T \mathbf{c_{\mathbf{u}_T}}+ \mathbf{c}_{\mathbf{x}_T} \end{align} $$

Now that we have solved for $\mathbf{u}_T$ and have expressed $c(\mathbf{x}_T,\mathbf{u}_T)$ in terms of $\mathbf{x}_T$ only, let us solve for $\mathbf{u}_{T-1}$ next. Note that both $c(\mathbf{x}_{T-1},\mathbf{u}_{T-1})$ and $c(\mathbf{x}_{T},\mathbf{u}_{T})$ depend on $\mathbf{u}_{T-1}$. We shall now first express $c(\mathbf{x}_{T},\mathbf{u}_{T})$ in terms of $\mathbf{u}_{T-1}$. Recall that:

$$ \mathbf{x}_{T} = f(\mathbf{x}_{T-1},\mathbf{u}_{T-1}) = \mathbf{F}_{T-1}\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix} + \mathbf{f}_{T-1} $$

Therefore:

$$ \begin{align} c(\mathbf{x}_T,\mathbf{u}_T) &= \frac{1}{2}\mathbf{x}_T^T\mathbf{V}_T\mathbf{x}_T + \mathbf{x}_T^T\mathbf{v}_T + \mathit{const.}\\ &= \frac{1}{2}\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix}^T\mathbf{F}_{T- 1}^T\mathbf{V}_T\mathbf{F}_{T-1}\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix} +\frac{1}{2}\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix}^T\mathbf{F}_{T-1}^T \mathbf{V}_T\mathbf{f}_{T-1} + \frac{1}{2}\mathbf{f}_{T-1}^T\mathbf{V}_T\mathbf{F}_{T-1}\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix} \\&\qquad + \frac{1}{2}\mathbf{f}_{T-1}^T\mathbf{V}_T\mathbf{f}_{T-1} +\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix}^T\mathbf{F}_{T-1}^T\mathbf{v}_T + \mathbf{f}_{T-1}^T\mathbf{v}_T + \mathit{const.}\\ &= \frac{1}{2}\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix}^T\mathbf{F}_{T- 1}^T\mathbf{V}_T\mathbf{F}_{T-1}\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix} +\frac{1}{2}\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix}^T\mathbf{F}_{T-1}^T \mathbf{V}_T\mathbf{f}_{T-1} + \left(\frac{1}{2}\mathbf{f}_{T-1}^T\mathbf{V}_T\mathbf{F}_{T-1}\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix}\right)^T \\&\qquad +\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix}^T\mathbf{F}_{T-1}^T\mathbf{v}_T + \mathit{const.}\\ &= \frac{1}{2}\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix}^T\mathbf{F}_{T-1}^T\mathbf{V}_T\mathbf{F}_{T-1}\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix} +\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix}^T\mathbf{F}_{T-1}^T \mathbf{V}_T\mathbf{f}_{T-1} +\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix}^T\mathbf{F}_{T-1}^T\mathbf{v}_T + \mathit{const.}\\ \end{align} $$

where the fourth line follows from the fact that $\mathbf{V}_T$ is symmetric. Note that $\mathbf{V}_T$ is symmetric because:

$$ \begin{align} \mathbf{V}_T^T &= (\mathbf{C}_{\mathbf{x}_T,\mathbf{x}_T} + \mathbf{C}_{\mathbf{x}_T,\mathbf{u}_T}\mathbf{K}_T + \mathbf{K}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{x}_T} +\mathbf{K}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}\mathbf{K}_T)^T\\ &= \mathbf{C}_{\mathbf{x}_T,\mathbf{x}_T}^T +\mathbf{K}_T^T\mathbf{C}_{\mathbf{x}_T,\mathbf{u}_T}^T + \mathbf{C}_{\mathbf{u}_T,\mathbf{x}_T}^T\mathbf{K}_T +\mathbf{K}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}^T\mathbf{K}_T\\ &= \mathbf{C}_{\mathbf{x}_T,\mathbf{x}_T} +\mathbf{K}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{x}_T} + \mathbf{C}_{\mathbf{x}_T,\mathbf{u}_T}\mathbf{K}_T +\mathbf{K}_T^T\mathbf{C}_{\mathbf{u}_T,\mathbf{u}_T}\mathbf{K}_T\\ &= V_T \end{align} $$

To take the derivative with respect to $\mathbf{u}_{T-1}$ we only need to consider the last two terms in our objective function:

$$ \begin{align} \sum_{t=1}^T c(\mathbf{x}_t,\mathbf{u}_t) &= c(\mathbf{x}_{T-1},\mathbf{u}_{T-1}) + \mathbf{c}(\mathbf{x}_{T},\mathbf{u}_{T}) + \mathit{const}\\ &= \frac{1}{2}\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1} \end{bmatrix}^T\mathbf{C}_{T-1}\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix} + \begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix}^T \mathbf{c}_{T-1} + \frac{1}{2}\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix}^T\mathbf{F}_{T-1}^T\mathbf{V}_T\mathbf{F}_{T-1}\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix} \\&\qquad +\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix}^T\mathbf{F}_{T-1}^T \mathbf{V}_T\mathbf{f}_{T-1} +\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix}^T\mathbf{F}_{T-1}^T\mathbf{v}_T + \mathit{const.}\\ &= \frac{1}{2}\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1} \end{bmatrix}^T\mathbf{Q}_{T-1}\begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix} + \begin{bmatrix}\mathbf{x}_{T-1}\\\mathbf{u}_{T-1}\end{bmatrix}^T \mathbf{q}_{T-1} + \mathit{const.} \end{align} $$

where:

$$ \begin{align} \mathbf{Q}_{T-1} &= \mathbf{C}_{T-1} + \mathbf{F}_{T-1}^T\mathbf{V}_T\mathbf{F}_{T-1}\\ \mathbf{q}_{T-1} &= \mathbf{c}_{T-1} + \mathbf{F}_{T-1}^T \mathbf{V}_T\mathbf{f}_{T-1}+\mathbf{F}_{T-1}^T\mathbf{v}_T \end{align} $$

Note that $\mathbf{Q}_{T-1}$ is symmetric:

$$ \begin{align} \mathbf{Q}_{T-1}^T &= (\mathbf{C}_{T-1} + \mathbf{F}_{T-1}^T\mathbf{V}_T\mathbf{F}_{T-1})^T\\ &= \mathbf{C}_{T-1}^T + \mathbf{F}_{T-1}^T\mathbf{V}_T^T\mathbf{F}_{T-1}\\ &= \mathbf{C}_{T-1} + \mathbf{F}_{T-1}^T\mathbf{V}_T\mathbf{F}_{T-1}\\ &= \mathbf{Q}_{T-1} \end{align} $$

This equation has the same form as the one that we had when we were solving for $\mathbf{u}_T$ (except that we now have $\mathbf{Q}_{T-1}$ instead of $\mathbf{C}_T$). Therefore:

$$ \mathbf{u}_{T-1} = \mathbf{K}_{T-1} \mathbf{x}_{T-1} + \mathbf{k}_{T-1} $$

where:

$$ \begin{align} \mathbf{K}_{T-1} &= -\mathbf{Q}_{\mathbf{u}_{T-1},\mathbf{u}_{T-1}}^{-1} \mathbf{Q}_{\mathbf{u}_{T-1},\mathbf{x}_{T-1}}\\ \mathbf{k}_{T-1} &= -\mathbf{Q}_{\mathbf{u}_{T-1},\mathbf{u}_{T-1}}^{-1} \mathbf{q}_{\mathbf{u}_{T-1}}^T \end{align} $$

We can now repeat this procedure for $\mathbf{u}_{t-2}$ and then for $\mathbf{u}_{t-3}$. Note that we are essentially expressing each $\mathbf{u}_t$ in terms of its corresponding $\mathbf{x}_t$. Once we have expressed $\mathbf{u}_1$ in this way, we can plug the value of $\mathbf{x}_1$ (which is just our initial state which we already know), calculate $\mathbf{u}_1$ and then use the values of $\mathbf{x}_1$ and $\mathbf{u}_1$ to calculate $\mathbf{u}_2$. In this way we can repeat this process all the way up to $\mathbf{u}_T$.

To summarize the LQR algorithm, we perform two passes: the backward and the forward passes. The backward pass is given as:

$\;$for $t=T$ to $1$:

$\mathbf{Q}_t$$= \mathbf{C}_t + \mathbf{F}_t^T\mathbf{V}_{t+1}\mathbf{F}_{t}$
$\mathbf{q}_{t}$$= \mathbf{c}_{t}+\mathbf{F}_{t}^T \mathbf{V}_{t+1}\mathbf{f}_{t} + \mathbf{F}_{t}^T\mathbf{v}_{t+1}$
$\mathbf{K}_{t}=$$-\mathbf{Q}_{\mathbf{u}_t,\mathbf{u}_t}^{-1}\mathbf{Q}_{\mathbf{u}_t, \mathbb{x_t}}$
$\mathbf{k}_{t}$$= -\mathbf{Q}_{\mathbf{u}_t,\mathbf{u}_t}^{-1}\mathbf{q}_{\mathbf{u}_t}$
$\mathbf{u}_{t}$$= \mathbf{K}_{t}\mathbf{x}_{t}+\mathbf{k}_{t}$
$\mathbf{V}_{t}$$= \mathbf{Q}_{\mathbf{x}_t,\mathbf{x}_t} + \mathbf{Q}_{\mathbf{x}_t,\mathbf{u}_t}\mathbf{K}_t+ \mathbf{K}_t^T\mathbf{Q}_{\mathbf{u}_t,\mathbf{x}_t} +\mathbf{K}_t^T\mathbf{Q}_{\mathbf{u}_t,\mathbf{u}_t}\mathbf{K}_t$
$\mathbf{v}_{t}$$= \mathbf{K}_t^T\mathbf{Q}_{\mathbf{u}_t,\mathbf{u}_t}\mathbf{k}_t+ \mathbf{Q}_{\mathbf{x}_t,\mathbf{u}_t}\mathbf{k}_t +\mathbf{K}_t^T \mathbf{q}_{\mathbf{u}_t}+ \mathbf{q}_{\mathbf{x}_t}$

The forward pass is given as:

$\;$for $t=1$ to $T$:

$\mathbf{u}_{t}$$= \mathbf{K}_t\mathbf{x}_t+\mathbf{k}_t$
$\mathbf{x}_{t+1}$$=f(\mathbf{x}_t,\mathbf{u}_t)$

Up till this point we have assumed that our system is deterministic. Let us now relax this condition and assume a stochastic system i.e.:

$$ \mathbf{x}_{t+1} \sim p(\mathbf{x}_{t+1}\vert \mathbf{x}_t,\mathbf{u}_t) $$

Let us assume that $p$ is a Gaussian distribution of the following form:

$$ p(\mathbf{x}_{t+1}\vert \mathbf{x}_t,\mathbf{u}_t) = \mathcal{N}\left(\mathbf{F}_t \begin{bmatrix}\mathbf{x}_t\\\mathbf{u}_t \end{bmatrix} + \mathbf{f}_t,\Sigma_t\right) $$

It turns out that the optimal solution in this case is the same as that in the deterministic case, i.e. $\mathbf{u}_t = \mathbf{K}_t\mathbf{x}_t+\mathbf{k}_t$.

9. Iterative LQR (iLQR)

The LQR algorithm assumes a system with linear dynamics and a quadratic cost function. Suppose that we have a non-linear system and cost function instead. We may approximate each of these using Taylor’s expansion:

$$ \begin{align} f(\mathbf{x}_t,\mathbf{u}_t) &\approx f(\hat{\mathbf{x}}_t,\hat{\mathbf{u}}_t) + \nabla_{\mathbf{x}_t,\mathbf{u}_t} f(\hat{\mathbf{x}}_t,\hat{\mathbf{u}}_t) \begin{bmatrix}\mathbf{x}_t-\hat{\mathbf{x}}_t\\\mathbf{u}_t-\hat{\mathbf{u}}_t\end{bmatrix}\\ c(\mathbf{x}_t,\mathbf{u}_t) &\approx c(\hat{\mathbf{x}}_t,\hat{\mathbf{u}}_t) + \nabla_{\mathbf{x}_t, \mathbf{u}_t} c(\hat{\mathbf{x}}_t,\hat{\mathbf{u}}_t) \begin{bmatrix}\mathbf{x}_t-\hat{\mathbf{x}}_t\\\mathbf{u}_t-\hat{\mathbf{u}}_t \end{bmatrix} + \frac{1}{2}\begin{bmatrix}\mathbf{x}_t-\hat{\mathbf{x}}_t\\\mathbf{u}_t-\hat{\mathbf{u}}_t \end{bmatrix}^T \nabla_{\mathbf{x}_t, \mathbf{u}_t}^2 c(\hat{\mathbf{x}}_t,\hat{\mathbf{u}}_t)\begin{bmatrix}\mathbf{x}_t-\hat{\mathbf{x}}_t\\\mathbf{u}_t-\hat{\mathbf{u}}_t \end{bmatrix} \end{align} $$

Let us rearrange these equations as follows:

$$ \begin{align} f(\mathbf{x}_t,\mathbf{u}_t) - f(\hat{\mathbf{x}}_t,\hat{\mathbf{u}}_t) &\approx \nabla_{\mathbf{x}_t,\mathbf{u}_t} f(\hat{\mathbf{x}}_t,\hat{\mathbf{u}}_t) \begin{bmatrix}\mathbf{x}_t-\hat{\mathbf{x}}_t\\\mathbf{u}_t-\hat{\mathbf{u}}_t\end{bmatrix}\\ c(\mathbf{x}_t,\mathbf{u}_t) - c(\hat{\mathbf{x}}_t,\hat{\mathbf{u}}_t) &\approx \nabla_{\mathbf{x}_t, \mathbf{u}_t} c(\hat{\mathbf{x}}_t,\hat{\mathbf{u}}_t) \begin{bmatrix}\mathbf{x}_t-\hat{\mathbf{x}}_t\\\mathbf{u}_t-\hat{\mathbf{u}}_t \end{bmatrix} + \frac{1}{2}\begin{bmatrix}\mathbf{x}_t-\hat{\mathbf{x}}_t\\\mathbf{u}_t-\hat{\mathbf{u}}_t \end{bmatrix}^T \nabla_{\mathbf{x}_t, \mathbf{u}_t}^2 c(\hat{\mathbf{x}}_t,\hat{\mathbf{u}}_t)\begin{bmatrix}\mathbf{x}_t-\hat{\mathbf{x}}_t\\\mathbf{u}_t-\hat{\mathbf{u}}_t \end{bmatrix} \end{align} $$

Now let $\delta \mathbf{x}_t=\mathbf{x}-\hat{\mathbf{x}}_t$, $\delta \mathbf{u}_t=\mathbf{u}-\hat{\mathbf{u}}_t$, $\mathbf{F}_t = \nabla_{\mathbf{x}_t,\mathbf{u}_t} f(\hat{\mathbf{x}}_t,\hat{\mathbf{u}}_t)$, $\mathbf{C}_t = \nabla_{\mathbf{x}_t, \mathbf{u}_t}^2 c(\hat{\mathbf{x}}_t,\hat{\mathbf{u}}_t)$ and $\mathbf{c}_t = \nabla_{\mathbf{x}_t, \mathbf{u}_t} c(\hat{\mathbf{x}}_t,\hat{\mathbf{u}}_t)$. Therefore:

$$ \begin{align} \bar{f}(\delta\mathbf{x}_t,\delta\mathbf{u}_t) &\approx \mathbf{F}_t \begin{bmatrix}\delta\mathbf{x}_t\\\delta\mathbf{u}_t\end{bmatrix}\\ \bar{c}(\delta\mathbf{x}_t,\delta\mathbf{u}_t) &\approx \mathbf{c}_t \begin{bmatrix}\delta\mathbf{x}_t\\\delta\mathbf{u}_t\end{bmatrix} + \frac{1}{2}\begin{bmatrix}\delta\mathbf{x}_t\\\delta\mathbf{u}_t\end{bmatrix}^T \mathbf{C}_t\begin{bmatrix}\delta\mathbf{x}_t\\\delta\mathbf{u}_t\end{bmatrix} \end{align} $$

This is the same form that we had in the LQR case. The iterative LQR repeatedly performs the following until convergence:

Backward pass: For $t=T$ to $1$ evaluate $\mathbf{F}_t$, $\mathbf{C}_t$ and $\mathbf{c}_t$ and express $\delta\mathbf{u}_t$ in terms of $\delta\mathbf{x}_t$ (note that our variables are $\delta\mathbf{x}_t$ and $\delta\mathbf{u}_t$ now).
Forward pass: At time step $1$ evaluate:
$$ \mathbf{u}_1 = \delta \mathbf{u}_1 + \hat{\mathbf{u}}_1 = \mathbf{K}_t\delta\mathbf{x}_1 + \mathbf{k}_t + \hat{\mathbf{u}}_1 $$
where $\delta\mathbf{x}_1 = \mathbf{x}_1-\hat{\mathbf{x}}_1$. Use the actual non-linear dynamics to obtain $\mathbf{x}_2$ from $\mathbf{x}_1$ and $\mathbf{u}_1$. Repeat this process for $t=2$ to $T$.
Set $\hat{\mathbf{x}}_t=\mathbf{x}_t$ and $\hat{\mathbf{u}}_t=\mathbf{u}_t$.

For the first step, we may initialize all $\hat{\mathbf{u}}_t$ randomly and evaluate the corresponding $\hat{\mathbf{x}}_t$. $\hat{\mathbf{x}}_t$ and $\hat{\mathbf{u}}_t$ are often called as the nominal trajectory.

10. Differential Dynamic Programming (DDP)

Instead of using a linear approximation for the system dynamics, the DDP algorithm uses a quadratic approximation:

$$ f(\mathbf{x}_t,\mathbf{u}_t) \approx f(\hat{\mathbf{x}}_t,\hat{\mathbf{u}}_t) + \nabla_{\mathbf{x}_t,\mathbf{u}_t} f(\hat{\mathbf{x}}_t,\hat{\mathbf{u}}_t) \begin{bmatrix}\delta\mathbf{x}_t\\\delta\mathbf{u}_t\end{bmatrix} + \frac{1}{2} \left(\nabla_{\mathbf{x}_t,\mathbf{u}_t}^2 f(\hat{\mathbf{x}}_t,\hat{\mathbf{u}}_t) \begin{bmatrix}\delta\mathbf{x}_t\\\delta\mathbf{u}_t\end{bmatrix}\right)\begin{bmatrix}\delta\mathbf{x}_t\\\delta\mathbf{u}_t\end{bmatrix}\\ $$