1. Policy Iteration

A. High Level Idea

Recall that \(A^\pi(\mathbf{s}_t,\mathbf{a}_t)\) is an estimate of how much better action \(\mathbf{a}_t\) is than the average action that the policy \(\pi\) takes in state \(\mathbf{s}_t\). Suppose that at each time step we choose an action using:

$$ \mathbf{a}_t = \underset{\mathbf{a}_t}{\text{argmax}} A^\pi(\mathbf{s}_t, \mathbf{a}_t) $$

This is at least as good as any \(\mathbf{a}_t \sim \pi(\mathbf{a}_t\vert\mathbf{s}_t)\). This means that we can simply forget about learning a policy explicitly and just focus on learning \(A^\pi(\mathbf{s}_t,\mathbf{a}_t)\). We may write down our new policy as:

$$ \pi'(\mathbf{a}_t\vert\mathbf{s}_t) = \begin{cases} 1 & \text{if}\;\mathbf{a}_t = \underset{\mathbf{a}_t}{\text{argmax}}\;A^\pi (\mathbf{s}_t, \mathbf{a}_t)\\ 0 & \text{otherwise} \end{cases} $$

Note that we now have a deterministic policy. The policy iteration algorithm repeats the following two steps until convergence (a small code sketch of the improvement step follows the list):

  1. Evaluate \(A^\pi(\mathbf{s}_t,\mathbf{a}_t)\).
  2. Set \(\pi \leftarrow \pi'\).
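To make the improvement step concrete, here is a minimal NumPy sketch (the tabular advantage array `A` is a hypothetical example, not something computed in these notes): the new policy simply puts all of its probability mass on the advantage-maximizing action in each state.

```python
import numpy as np

# Hypothetical tabular advantage estimates: A[s, a] = A^pi(s, a)
# for 2 states and 3 discrete actions.
A = np.array([[0.1, -0.2, 0.5],
              [0.0,  0.3, -0.1]])

# Policy improvement: pi'(a|s) = 1 for the argmax action, 0 otherwise.
greedy_actions = A.argmax(axis=1)                   # best action in each state
pi_new = np.zeros_like(A)
pi_new[np.arange(A.shape[0]), greedy_actions] = 1.0

print(greedy_actions)   # [2 1]
```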

Recall that:

$$ A^\pi(\mathbf{s}_t,\mathbf{a}_t) = r(\mathbf{s}_t,\mathbf{a}_t) + \gamma\mathbb{E}_{\mathbf{s}_{t+1}\sim p(\mathbf{s}_{t+1} \vert \mathbf{s}_t, \mathbf{a}_t)} \left[V^\pi(\mathbf{s}_{t+1})\right] - V^\pi(\mathbf{s}_t) $$

Therefore, we only need to evaluate \(V^\pi(\mathbf{s})\).

B. Policy Iteration with Dynamic Programming

Suppose that we know the transition probabilities \(p(\mathbf{s}'\vert\mathbf{s},\mathbf{a})\), and that both \(\mathbf{s}\) and \(\mathbf{a}\) can take on only a small number of discrete values. In such a case, we can easily store the transition probabilities in a table.

Recall that our bootstrapped update is given by:

$$ \begin{align} V^\pi(\mathbf{s}) &= \mathbb{E}_{\mathbf{a}\sim \pi_\theta(\mathbf{a}\vert\mathbf{s})} \left[Q^\pi(\mathbf{s},\mathbf{a})\right] \\ &= \mathbb{E}_{\mathbf{a}\sim \pi_\theta(\mathbf{a}\vert\mathbf{s})} \left[ r(\mathbf{s},\mathbf{a}) + \gamma\mathbb{E}_{\mathbf{s}'\sim p(\mathbf{s}'\vert \mathbf{s},\mathbf{a})} \left[V^\pi(\mathbf{s}')\right]\right]\\ &= r(\mathbf{s},\pi(\mathbf{s})) + \gamma\mathbb{E}_{\mathbf{s}'\sim p(\mathbf{s}'\vert \mathbf{s},\pi(\mathbf{s}))} \left[V^\pi(\mathbf{s}')\right]\\ \end{align} $$

where the last step follows from the fact that our policy assigns a probability of \(1\) to exactly one action (the one that maximizes the advantage) and \(0\) to all others. We denote the action for which the probability is \(1\) by \(\pi(\mathbf{s})\).

Note that we can easily compute the inner expectation because we only have a small number of possible transitions (as per our initial assumption). We can therefore repeatedly perform the bootstrapped update to improve our estimate of the value function, as in the sketch below.
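As a minimal sketch of this evaluation step (assuming a known transition table `P[s, a, s']`, a reward table `R[s, a]`, and a deterministic policy array `pi`; all names are illustrative):

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.99, n_iters=1000):
    """Repeatedly apply the bootstrapped update
    V(s) <- r(s, pi(s)) + gamma * E_{s' ~ p(.|s, pi(s))}[V(s')]."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    for _ in range(n_iters):
        r_pi = R[np.arange(n_states), pi]      # r(s, pi(s)) for every state
        P_pi = P[np.arange(n_states), pi]      # p(s'|s, pi(s)), shape (S, S)
        V = r_pi + gamma * P_pi @ V
    return V
```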

This can be simplified even further by noting that since:

$$ A^\pi(\mathbf{s},\mathbf{a}) = Q^\pi(\mathbf{s},\mathbf{a}) - V^\pi(\mathbf{s}) $$

we have:

$$ \underset{\mathbf{a}}{\text{argmax}}\; A^\pi(\mathbf{s},\mathbf{a}) = \underset{\mathbf{a}}{\text{argmax}}\; Q^\pi(\mathbf{s},\mathbf{a}) $$

The value iteration algorithm repeats the following two steps:

  1. Updates \(Q^\pi(\mathbf{s},\mathbf{a}) = r(\mathbf{s},\mathbf{a}) + \gamma\mathbb{E}_{\mathbf{s}'\sim p(\mathbf{s}'\vert \mathbf{s},\mathbf{a})}\left[V^\pi(\mathbf{s}')\right]\).
  2. Sets \(V^\pi(\mathbf{s}) = \max_\mathbf{a} Q^\pi(\mathbf{s},\mathbf{a})\).

The second step follows because our policy assigns a probability of \(1\) to a single action, so the expectation of the Q-function under the policy is just its value for that action. A minimal NumPy sketch of the resulting procedure is given below.
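Under the same tabular assumptions (a known transition table `P[s, a, s']` and reward table `R[s, a]`, both illustrative names), value iteration is only a few lines of NumPy:

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Tabular value iteration: Q(s, a) = r(s, a) + gamma * E[V(s')],
    then V(s) = max_a Q(s, a), repeated until convergence."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V                    # step 1: Q[s, a]
        V_new = Q.max(axis=1)                    # step 2: greedy backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)       # value function and greedy policy
        V = V_new
```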

It can be shown that, in this tabular setting, policy iteration converges in a finite number of steps.

C. Fitted Value Iteration

Let us relax the condition that the states need to be discrete. For now we will assume that the actions are still discrete (we will relax this condition later on). To avoid the curse of dimensionality associated with discretizing continuous spaces, we will use a function approximator \(V^\pi_\phi\) for the value function. We are interested in minimizing:

$$ L(\phi) = \frac{1}{2}\vert\vert V^\pi_\phi(\mathbf{s})-\max_\mathbf{a} Q^\pi(\mathbf{s}, \mathbf{a}) \vert\vert^2 $$

The fitted value iteration algorithm repeatedly performs the following two steps until convergence (a rough sketch in code follows the list):

  1. Sets \(y^{(i)} = \max_\mathbf{a}\left[ r(\mathbf{s}^{(i)},\mathbf{a}) + \gamma\mathbb{E}_{\mathbf{s}'\sim p(\mathbf{s}'\vert \mathbf{s}^{(i)}, \mathbf{a})}\left[V^\pi_\phi(\mathbf{s}')\right]\right]\).
  2. Sets \(\phi = \text{argmin}_\phi \frac{1}{2}\sum_i \vert\vert V^\pi_\phi(\mathbf{s}^{(i)} )-y^{(i)} \vert\vert^2\).
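A rough sketch of one round of fitted value iteration, under simplifying assumptions that are not part of the algorithm statement above: a linear value function \(V_\phi(\mathbf{s}) = \phi^\top \text{feat}(\mathbf{s})\), a known deterministic model `model(s, a) -> (r, s_next)` standing in for the expectation over \(\mathbf{s}'\), and a small discrete action set (all names are illustrative):

```python
import numpy as np

def fitted_value_iteration_step(phi, feat, states, actions, model, gamma=0.99):
    """One round of fitted value iteration with a linear approximator
    V_phi(s) = phi @ feat(s) and a known (here deterministic) model."""
    # Step 1: bootstrapped targets y_i = max_a [ r(s_i, a) + gamma * V_phi(s') ].
    targets = []
    for s in states:
        values = []
        for a in actions:
            r, s_next = model(s, a)                      # assumed-known model
            values.append(r + gamma * phi @ feat(s_next))
        targets.append(max(values))
    targets = np.array(targets)

    # Step 2: regress V_phi onto the targets; for a linear model the argmin
    # is just a least-squares problem.
    X = np.stack([feat(s) for s in states])
    phi_new, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return phi_new
```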

2. Fitted Q-Iteration

A. The Algorithm

Consider step 1 of the fitted value iteration algorithm in the previous section. In order to take the maximum of the expression in the square brackets over different actions, we would need to return to the same state multiple times and try each action. This may not be possible, especially when we have a large continuous state space (where we might never visit the same state twice). It turns out that there is an easy way to deal with this problem.

Recall that:

$$ \begin{align} Q^\pi(\mathbf{s},\mathbf{a}) &= r(\mathbf{s},\mathbf{a}) + \gamma\mathbb{E}_{\mathbf{s}'\sim p(\mathbf{s}'\vert\mathbf{s},\mathbf{a})}\left[V^\pi(\mathbf{s}') \right]\\ &= r(\mathbf{s},\mathbf{a}) + \gamma\mathbb{E}_{\mathbf{s}'\sim p(\mathbf{s}'\vert\mathbf{s}, \mathbf{a})}\left[Q^\pi(\mathbf{s}',\pi(\mathbf{s}'))\right] \end{align} $$

In our tabular case (dynamic programming) we may now repeatedly update \(Q^\pi\) using the equation above (as we previously did with the value function in the value iteration algorithm). Note that in order to take the maximum of the Q-function over different actions, we no longer need to return to the same state multiple times and try each action. Instead, we can simply use our current estimate of \(Q^\pi\) to compute the values of the Q-function for different actions at a particular state and then take their maximum.

The fitted Q-iteration algorithm handles the continuous-state case (like the fitted value iteration algorithm):

  1. Set \(y^{(i)} = r(\mathbf{s}^{(i)},\mathbf{a}^{(i)}) + \gamma\mathbb{E}_{\mathbf{s}'\sim p(\mathbf{s}'\vert\mathbf{s}^{(i)} , \mathbf{a}^{(i)} )}\left[\max_{\mathbf{a}'} \left[ Q^\pi_\phi(\mathbf{s}',\mathbf{a}')\right]\right]\).
  2. Set \(\phi = \text{argmin}_\phi \frac{1}{2}\sum_i \vert\vert Q^\pi_\phi(\mathbf{s}^{(i)},\mathbf{a}^{(i)})-y^{(i)} \vert\vert^2\).

If we don’t know the transition dynamics, we can simply approximate the expectation with the single next state \(\mathbf{s}'^{(i)}\) that we actually observed:

$$ \mathbb{E}_{\mathbf{s}'\sim p(\mathbf{s}'\vert\mathbf{s}^{(i)} , \mathbf{a}^{(i)} )}\left[\max_{\mathbf{a}'} \left[ Q^\pi_\phi(\mathbf{s}',\mathbf{a}')\right]\right] \approx \max_{\mathbf{a}'} \left[ Q^\pi_\phi(\mathbf{s}'^{(i)},\mathbf{a}')\right] $$

The full fitted Q-iteration algorithm repeats the following until convergence:

  1. Collect a dataset \(\{(\mathbf{s}^{(i)},\mathbf{a}^{(i)},\mathbf{s}'^{(i)},r^{(i)})\}\) using some policy.
  2. Set \(y^{(i)} = r(\mathbf{s}^{(i)},\mathbf{a}^{(i)}) + \gamma\max_{\mathbf{a}'} \left[ Q^\pi_\phi(\mathbf{s}'^{(i)},\mathbf{a}')\right]\).
  3. Set \(\phi = \text{argmin}_\phi \frac{1}{2}\sum_i \vert\vert Q^\pi_\phi(\mathbf{s}^{(i)},\mathbf{a}^{(i)})-y^{(i)} \vert\vert^2\).

This algorithm is nested as follows: after each execution of step 1 (data collection), steps 2 and 3 are repeated \(K\) times; each execution of step 2 is typically followed by \(S\) gradient steps that approximately solve the \(\text{argmin}\) in step 3. A hedged code sketch of this loop is given below.
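Here is a hedged PyTorch sketch of this loop for a discrete action space, assuming the transitions have already been collected into tensors and that `q_net` maps a batch of states to one Q-value per action (all names are illustrative, and `a` is a tensor of integer action indices):

```python
import torch

def fitted_q_iteration(q_net, optimizer, batch, K=10, S=50, gamma=0.99):
    """Steps 2 and 3 of fitted Q-iteration: recompute targets K times and
    take S gradient steps on the regression objective after each recomputation."""
    s, a, r, s_next = batch           # shapes: (N, ds), (N,), (N,), (N, ds)
    for _ in range(K):
        # Step 2: y_i = r_i + gamma * max_a' Q_phi(s'_i, a').
        # The targets are treated as constants: no gradient flows through them.
        with torch.no_grad():
            y = r + gamma * q_net(s_next).max(dim=1).values
        # Step 3: S gradient steps on 1/2 * sum_i ||Q_phi(s_i, a_i) - y_i||^2.
        for _ in range(S):
            q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = 0.5 * ((q_sa - y) ** 2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return q_net
```

A typical (but again illustrative) instantiation would be `q_net = torch.nn.Sequential(torch.nn.Linear(obs_dim, 64), torch.nn.ReLU(), torch.nn.Linear(64, n_actions))` together with `optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)`.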

One thing to note here is that fitted Q-iteration does not assume that the samples came from running any particular policy: \(Q^\pi_\phi\) needs to approximate the Q-function for all states and actions, including those that came from running some other policy. In the expression \(r(\mathbf{s}^{(i)},\mathbf{a}^{(i)}) + \gamma\max_{\mathbf{a}'} \left[ Q^\pi_\phi(\mathbf{s}'^{(i)},\mathbf{a}')\right]\), the first term is independent of any policy. The transition to \(\mathbf{s}'^{(i)}\) is also independent of any policy given \(\mathbf{s}^{(i)}\) and \(\mathbf{a}^{(i)}\). So neither of the terms in this expression assumes that the samples came from any particular policy.

Therefore, fitted Q-iteration works with off-policy samples. It also does not involve a high-variance policy gradient estimator. However, fitted Q-iteration is not guaranteed to converge when we use non-linear function approximators (as we’ll see towards the end of these lecture notes).

B. What Is Fitted Q-Iteration Optimizing?

Step 3 in the full fitted Q-iteration algorithm is essentially minimizing:

$$ \mathcal{E} = \frac{1}{2}\mathbb{E}_{(\mathbf{s},\mathbf{a})\sim \beta}\left[\left(Q^\pi_\phi(\mathbf{s},\mathbf{a}) - \left(r(\mathbf{s}, \mathbf{a}) + \gamma\max_{\mathbf{a}'}Q^\pi_\phi(\mathbf{s}',\mathbf{a}')\right) \right)^2\right] $$

where \(\beta\) is the distribution of state-action pairs in our buffer (collected by some policy) and \(\mathcal{E}\) is known as the Bellman error. It can be shown that when:

$$ Q^\pi_\phi(\mathbf{s},\mathbf{a}) = r(\mathbf{s}, \mathbf{a}) + \gamma\max_{\mathbf{a}'}Q^\pi_\phi(\mathbf{s}',\mathbf{a}') $$

i.e. when \(\mathcal{E} = 0\), \(Q^\pi_\phi\) corresponds to the optimal Q-function \(Q^*\) which results in the optimal policy \(\pi^*\).

C. Online Q-Iteration Algorithm

The online version of the Q-iteration algorithm repeatedly does the following:

  1. Take some action \(\mathbf{a}\) and observe the transition \((\mathbf{s},\mathbf{a},\mathbf{s}',r)\).
  2. Set \(y = r(\mathbf{s},\mathbf{a}) + \gamma\max_{\mathbf{a}'} \left[ Q^\pi_\phi(\mathbf{s}',\mathbf{a}')\right]\).
  3. Set \(\phi = \text{argmin}_\phi \frac{1}{2}\vert\vert Q^\pi_\phi(\mathbf{s},\mathbf{a})-y \vert\vert^2\), typically by taking a single gradient step (see the sketch below).
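A minimal PyTorch sketch of one online update (steps 2 and 3) for a single transition and discrete actions, with illustrative names; the target is computed without gradient tracking and the \(\text{argmin}\) is approximated by a single gradient step:

```python
import torch

def online_q_update(q_net, optimizer, s, a, r, s_next, gamma=0.99):
    """One online Q-iteration update for a single transition (s, a, s', r)."""
    # Step 2: y = r + gamma * max_a' Q_phi(s', a'), treated as a constant.
    with torch.no_grad():
        y = r + gamma * q_net(s_next.unsqueeze(0)).max(dim=1).values.squeeze(0)

    # Step 3: a single gradient step on 1/2 * (Q_phi(s, a) - y)^2.
    q_sa = q_net(s.unsqueeze(0))[0, a]
    loss = 0.5 * (q_sa - y) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```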

D. Exploration with Q-Learning

For Q-iteration our policy is:

$$ \pi(\mathbf{a}_t\vert\mathbf{s}_t) = \begin{cases} 1 & \text{if}\;\mathbf{a}_t = \underset{\mathbf{a}_t}{\text{argmax}}\;Q^\pi_\phi (\mathbf{s}_t, \mathbf{a}_t)\\ 0 & \text{otherwise} \end{cases} $$

However, using this policy for step 1 of the Q-iteration algorithm (data collection) is not a particularly good idea, because it is deterministic and always picks the action that currently looks best. There might be parts of the state space that have a very high reward but that can only be reached after incurring a (small) negative reward; with this policy we might never explore those parts of the state space.

To remedy this problem, we can choose from a variety of different collection (exploration) policies; two common choices are listed below, with a small code sketch after the list.

  1. Epsilon-greedy:

    $$ \pi(\mathbf{a}_t\vert\mathbf{s}_t) = \begin{cases} 1 - \epsilon& \text{if}\;\mathbf{a}_t = \underset{\mathbf{a}_t}{\text{argmax}}\; Q^\pi_\phi (\mathbf{s}_t, \mathbf{a}_t)\\ \epsilon / (\vert\mathcal{A}\vert - 1) & \text{otherwise} \end{cases} $$

    where \(\epsilon\) is a small exploration hyperparameter and \(\vert\mathcal{A}\vert\) is the number of actions.

  2. Boltzmann exploration:

    $$ \pi(\mathbf{a}_t\vert\mathbf{s}_t) \propto \exp (Q^\pi_\phi(\mathbf{s}_t, \mathbf{a}_t)) $$
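Both exploration rules are straightforward to implement. A minimal NumPy sketch (with `q_values` a hypothetical vector of Q-estimates for the current state; the temperature in the Boltzmann rule is a common additional hyperparameter, not part of the formula above):

```python
import numpy as np

_rng = np.random.default_rng()

def epsilon_greedy(q_values, epsilon=0.1):
    """Probability 1 - eps for the greedy action, eps / (|A| - 1) for each other action."""
    greedy = int(np.argmax(q_values))
    if _rng.random() < epsilon:
        others = [a for a in range(len(q_values)) if a != greedy]
        return int(_rng.choice(others))
    return greedy

def boltzmann(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    logits = np.asarray(q_values, dtype=float) / temperature
    probs = np.exp(logits - logits.max())      # subtract the max for numerical stability
    probs /= probs.sum()
    return int(_rng.choice(len(q_values), p=probs))
```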

3. Value Function Learning Theory

A. Tabular Case

Consider the tabular case of value iteration. Define the Bellman backup operator \(\mathcal B\) on the value function \(V\) as:

$$ \mathcal BV = \max_\mathbf a \left[r_\mathbf a + \gamma\mathcal T_\mathbf a V\right] $$

where \(r_\mathbf a\) is a vector whose \(i^{\text{th}}\) entry is equal to \(r(\mathbf s_i,\mathbf a)\) and \(\mathcal T_\mathbf a\) is a matrix of transition probabilities such that \((\mathcal T_\mathbf a)_{i,j} = p(\mathbf s'=j \vert \mathbf s=i,\mathbf a)\). Note that this just vectorizes the value iteration algorithm:

$$ V(\mathbf s) \leftarrow \max_\mathbf a \left[r(\mathbf s, \mathbf a) + \gamma\mathbb E_{\mathbf s'\sim p(\mathbf s'\vert\mathbf s,\mathbf a)}\left[V(\mathbf s')\right] \right] $$
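In code, the backup operator is a couple of NumPy operations. A minimal sketch, assuming `R` has shape `(A, S)` so that `R[a]` is \(r_\mathbf a\) and `P` has shape `(A, S, S)` so that `P[a]` is \(\mathcal T_\mathbf a\) (illustrative names):

```python
import numpy as np

def bellman_backup(V, R, P, gamma=0.99):
    """(B V)(s) = max_a [ r(s, a) + gamma * sum_s' p(s'|s, a) * V(s') ]."""
    # R: (A, S) rewards, P: (A, S, S) transition matrices, V: (S,) values.
    return np.max(R + gamma * P @ V, axis=0)
```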

Clearly, \(V^*\) (the optimal value function) is a fixed point of \(\mathcal B\), i.e.

$$ V^* = \mathcal B V^* $$

It turns out that such a fixed point always exists, is unique, and corresponds to the optimal value function. But does the value iteration algorithm always converge to it? It turns out that it does, because \(\mathcal B\) is a contraction in the infinity norm, i.e.:

$$ \vert\vert \mathcal BV - \mathcal B \bar V\vert\vert_\infty \leq \gamma\vert\vert V - \bar V \vert\vert_\infty $$

where the infinity norm is defined as \(\vert\vert V \vert\vert_\infty = \max_\mathbf s \vert V(\mathbf s)\vert\). So if we substitute \(V^*\) for \(\bar V\), then because \(\mathcal BV^* = V^*\), we have:

$$ \vert\vert \mathcal BV - V^*\vert\vert_\infty \leq \gamma\vert\vert V - V^* \vert\vert_\infty $$

i.e. each iteration of the value iteration algorithm brings us closer to the optimal value function by at least a factor of \(\gamma\). A quick numerical sanity check of the contraction property follows.
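The contraction property is easy to check numerically on a random MDP. A small NumPy sanity check (not a proof), using the same array conventions as the sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

R = rng.normal(size=(A, S))                  # rewards r_a
P = rng.random(size=(A, S, S))               # unnormalized transitions
P /= P.sum(axis=2, keepdims=True)            # make each row a distribution

backup = lambda V: np.max(R + gamma * P @ V, axis=0)     # (B V)(s)

V1, V2 = rng.normal(size=S), rng.normal(size=S)
lhs = np.max(np.abs(backup(V1) - backup(V2)))            # ||B V1 - B V2||_inf
rhs = gamma * np.max(np.abs(V1 - V2))                    # gamma * ||V1 - V2||_inf
assert lhs <= rhs + 1e-12
```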

B. Non-Tabular Case

Recall that the fitted value iteration algorithm is given by:

  1. Sets \(y^{(i)} = \max_\mathbf{a}\left[ r(\mathbf{s}^{(i)},\mathbf{a}) + \gamma\mathbb{E}_{\mathbf{s}'\sim p(\mathbf{s}'\vert \mathbf{s}^{(i)}, \mathbf{a})}\left[V_\phi(\mathbf{s}')\right]\right]\).
  2. Sets \(\phi = \text{argmin}_\phi \frac{1}{2}\sum_i \vert\vert V_\phi(\mathbf{s}^{(i)} )-y^{(i)} \vert\vert^2\).

Note that the first step simply applies the Bellman backup operator to \(V_\phi\); equivalently, \(y^{(i)}=(\mathcal B V_\phi)(\mathbf s^{(i)})\). We may rewrite the second step as:

$$ V'_\phi = \underset{V'_\phi\in \Omega}{\text{argmin}}\;\frac{1}{2}\sum_i \vert\vert V_\phi'(\mathbf s^{(i)})-(\mathcal B V_\phi)(\mathbf s^{(i)}) \vert\vert^2 $$

where \(\Omega\) is the hypothesis space of all our function approximators. Note that here \(V_\phi\) is our value function from the previous iteration. Let us define an operator:

$$ \Pi: \Pi V = \underset{V'_\phi\in\Omega}{\text{argmin}}\; \frac{1}{2}\sum_i \vert\vert V_\phi'(\mathbf s^{(i)})-V(\mathbf s^{(i)}) \vert\vert^2 $$

\(\Pi\) is simply a projection onto \(\Omega\) in terms of the \(\ell_2\)-norm, i.e. it returns the hypothesis \(V'_\phi\) from \(\Omega\) that has the shortest Euclidean distance from \(V\) on the sampled states. Note that we make no assumption that \(V\) lies in \(\Omega\). As it turns out, \(\Pi\) is also a contraction, but in the \(\ell_2\)-norm. We may now write each iteration of the fitted value iteration algorithm as:

$$ V \leftarrow \Pi\mathcal B V $$

While both \(\mathcal B\) and \(\Pi\) are contractions (in different norms), their composition \(\Pi\mathcal B\) is, in general, not a contraction in any norm. So, in general and often in practice, fitted value iteration does not converge.

C. What About Fitted Q-Iteration?

We may define the Bellman backup operator on the Q-function as:

$$ \mathcal B:\mathcal B Q = r + \gamma\mathcal T\max_\mathbf a Q $$

and \(\Pi\) as:

$$ \Pi: \Pi Q = \underset{Q'_\phi\in\Omega}{\text{argmin}}\; \frac{1}{2}\sum_i \vert\vert Q_\phi'(\mathbf s^{(i)},\mathbf a^{(i)})-Q(\mathbf s^{(i)},\mathbf a^{(i)}) \vert\vert^2 $$

The fitted Q-iteration algorithm, therefore, repeatedly performs the update:

$$ Q \leftarrow \Pi\mathcal BQ $$

Again, while both \(\mathcal B\) and \(\Pi\) are contractions, \(\Pi\mathcal B\) is not. So fitted Q-iteration is also not guaranteed to converge. This is also true for the online version.

One might argue that fitted Q-iteration should converge because its regression step essentially just performs gradient descent, and gradient descent converges. This argument is not correct, because Q-iteration is not gradient descent: while the target values also depend on \(Q_\phi\), no gradient flows through them. So we are not actually taking the gradient of \(\frac{1}{2}\sum_i \vert\vert Q^\pi_\phi(\mathbf{s}^{(i)},\mathbf{a}^{(i)})-y^{(i)} \vert\vert^2\) with respect to the parameters \(\phi\) in the regression step.
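The point is easy to see in an autodiff framework: the target is explicitly cut off from the computation graph, so the update is a "semi-gradient" rather than the true gradient of the squared Bellman error. A hedged PyTorch sketch (illustrative names):

```python
import torch

def semi_gradient_loss(q_net, s, a, r, s_next, gamma=0.99):
    """Regression loss used by fitted Q-iteration.

    The target y depends on phi through q_net, but detach() stops any gradient
    from flowing through it, so minimizing this is NOT true gradient descent on
    1/2 * ||Q_phi(s, a) - (r + gamma * max_a' Q_phi(s', a'))||^2.
    """
    y = r + gamma * q_net(s_next).max(dim=1).values.detach()   # no gradient here
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return 0.5 * ((q_sa - y) ** 2).mean()
```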

D. A Sad Corollary

The arguments presented above against the convergence of value function fitting also apply to fitting the critic in actor-critic algorithms.