1. Optimality Variables

Up until now, we have been concerned with modeling \(p(\tau)\). Our goal has been to choose a sequence of actions \(\mathbf{a}_1,\ldots,\mathbf{a}_T\) such that:

$$ \mathbf{a}_1,\ldots,\mathbf{a}_T = \text{argmax}_{\mathbf{a}_1,\ldots,\mathbf{a}_T}\sum_{t=1}^T r(\mathbf{s}_t,\mathbf{a}_t) $$

The solution to this optimization problem gives us the sequence of actions that maximizes the total reward. However, this is not a good model for human behavior. Humans generally exhibit near-optimal behavior. For example, if a car is drifting to the right, its human driver might allow it to continue to do so for a while (maybe out of laziness) before bringing it back to the middle of the road. The driver ultimately does reach the destination (the real goal), but does so in a near-optimal fashion. To model this type of behavior, we need to forgo our previous formulation.

Let \(p(\tau)\) denote the probability of a sequence of states \(\mathbf{s}_{1:T}\) that an agent visits and the corresponding actions \(\mathbf{a}_{1:T}\) that it takes, i.e. \(p(\tau) = p(\mathbf{s}_{1:T}, \mathbf{a}_{1:T})\). Note that \(p(\tau)\) is the probability of a random sequence of state-action pairs. However, we know that the agent (as in the driver’s example above) is not taking random actions. Instead, it has a defined objective that it is trying to achieve via near-optimal behavior. To account for this, let us introduce the binary-valued optimality variables \(\mathcal{O}\). The value of \(\mathcal{O}\) at time step \(t\) (denoted \(\mathcal{O}_t\)) is \(1\) if the agent takes an approximately optimal action at that step and \(0\) otherwise. One way to model the probability of \(\mathcal{O}_t\) is:

$$ p(\mathcal{O}_t=1\vert\mathbf{s}_t,\mathbf{a}_t) \propto \exp(r(\mathbf{s}_t,\mathbf{a}_t)) $$

The intuition behind this choice is that actions with high rewards are all approximately optimal (so, for example, allowing the car to drift a little extra to the right has a reward close to not allowing it to drift at all, and much higher than allowing it to run off the road).

We will assume the Markov property for the optimality variables, i.e. \(\mathcal{O}_t\) is independent of \(\mathbf{s}_{1:t-1},\mathbf{a}_{1:t-1},\mathcal{O}_{1:t-1}\) given \(\mathbf{s}_t\) and \(\mathbf{a}_t\). The following figure shows a graphical visualization of an MDP with optimality variables:

We will primarily concern ourselves with finding \(p(\tau \vert \mathcal{O}_{1:T})\) which is the probability of a trajectory \(\tau\) given that the agent is taking approximately optimal actions at all time steps. Note that:

$$ \begin{align} p(\tau \vert \mathcal{O}_{1:T}) &= \frac{p(\tau, \mathcal{O}_{1:T})}{p(\mathcal{O}_{1:T})}\\ &= \frac{p(\mathbf{s}_1,\mathbf{a}_1,\mathcal{O}_1,\ldots,\mathbf{s}_T,\mathbf{a}_T,\mathcal{O}_T)}{p(\mathcal{O}_1,\ldots,\mathcal{O}_T)}\\ &= \frac{p(\mathbf{s}_1)p(\mathbf{a}_1\vert\mathbf{s}_1)p(\mathcal{O}_1\vert\mathbf{s}_1,\mathbf{a}_1)\cdots p(\mathcal{O}_T\vert\mathbf{s}_1,\mathbf{a}_1,\mathcal{O}_1,\ldots,\mathbf{s}_T,\mathbf{a}_T)}{p(\mathcal{O}_1,\ldots,\mathcal{O}_T)}\\ &= \frac{p(\mathbf{s}_1)\prod_{t=1}^T p(\mathbf{a}_t\vert\mathbf{s}_t) p(\mathcal{O}_t\vert\mathbf{s}_t,\mathbf{a}_t) p(\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t)}{p(\mathcal{O}_1,\ldots,\mathcal{O}_T)}\\ &= \frac{\left(p(\mathbf{s}_1)\prod_{t=1}^T p(\mathbf{a}_t\vert\mathbf{s}_t) p(\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t)\right)\left(\prod_{t=1}^T p(\mathcal{O}_t\vert\mathbf{s}_t,\mathbf{a}_t)\right)}{p(\mathcal{O}_1,\ldots,\mathcal{O}_T)}\\ &= \frac{p(\tau)\prod_{t=1}^T p(\mathcal{O}_t\vert\mathbf{s}_t,\mathbf{a}_t)}{p(\mathcal{O}_1,\ldots,\mathcal{O}_T)}\\ &\propto p(\tau)\prod_{t=1}^T \exp(r(\mathbf{s}_t,\mathbf{a}_t))\\ &\propto p(\tau)\exp\left(\sum_{t=1}^T r(\mathbf{s}_t,\mathbf{a}_t)\right) \end{align} $$

where the fourth step made use of the Markov property. Note that \(p(\tau \vert \mathcal{O}_{1:T})\) is a shorthand for \(p(\tau \vert \mathcal{O}_{1:T}=1)\).
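To make the final result concrete, here is a toy sketch (made-up numbers, illustrative names) of how trajectories sampled from \(p(\tau)\) would be reweighted by \(\exp\left(\sum_t r(\mathbf{s}_t,\mathbf{a}_t)\right)\) to obtain their relative posterior probabilities under \(p(\tau\vert\mathcal{O}_{1:T})\):

```python
# A toy sketch of the result above: trajectories sampled from p(tau) are reweighted
# by exp(sum of rewards) to get their relative posterior probabilities under p(tau | O_{1:T}).
import numpy as np

rng = np.random.default_rng(0)
rewards_per_traj = rng.normal(size=(3, 5))   # rewards r(s_t, a_t) for 3 trajectories, T = 5
log_w = rewards_per_traj.sum(axis=1)         # sum_t r(s_t, a_t) for each trajectory
w = np.exp(log_w - log_w.max())              # unnormalized weights, shifted for stability
relative_posterior = w / w.sum()             # relative p(tau | O_{1:T}) among the samples
```

Trajectories with larger total reward receive exponentially larger weight, which is exactly the final proportionality above.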

Using the above formulation, we can calculate three quantities of interest: backward messages, a policy and forward messages. We discuss each of these in detail in the following sections.

2. Backward Messages

We denote the backward message at time step \(t\) by \(\beta_t\) and define it to be:

$$ \beta_t(\mathbf{s}_t,\mathbf{a}_t) = p(\mathcal{O}_{t:T}\vert \mathbf{s}_t,\mathbf{a}_t) $$

if both the state and action are given, or:

$$ \beta_t(\mathbf{s}_t) = p(\mathcal{O}_{t:T}\vert \mathbf{s}_t) $$

if only the state is given, i.e. \(\beta_t\) is the probability that all time steps from \(t\) to \(T\) are optimal given the state (and, in the former case, the action) at time \(t\). Note that:

$$ \begin{align} \beta_t(\mathbf{s}_t,\mathbf{a}_t) &= p(\mathcal{O}_{t:T}\vert \mathbf{s}_t,\mathbf{a}_t)\\ &= \int p(\mathcal{O}_{t:T}, \mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t) d\mathbf{s}_{t+1}\\ &= \int p(\mathcal{O}_t,\mathcal{O}_{t+1:T}, \mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t) d\mathbf{s}_{t+1}\\ &= \int p(\mathcal{O}_{t+1:T}\vert \mathcal{O}_t, \mathbf{s}_{t+1}, \mathbf{s}_t,\mathbf{a}_t) p(\mathcal{O}_t,\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t)d\mathbf{s}_{t+1}\\ &= \int p(\mathcal{O}_{t+1:T}\vert \mathcal{O}_t, \mathbf{s}_{t+1}, \mathbf{s}_t,\mathbf{a}_t) p(\mathbf{s}_{t+1}\vert \mathcal{O}_t, \mathbf{s}_t,\mathbf{a}_t)p(\mathcal{O}_t\vert \mathbf{s}_t, \mathbf{a}_t)d\mathbf{s}_{t+1}\\ &= \int p(\mathcal{O}_{t+1:T}\vert \mathbf{s}_{t+1}) p(\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t)p(\mathcal{O}_t\vert \mathbf{s}_t, \mathbf{a}_t)d\mathbf{s}_{t+1}\\ &= p(\mathcal{O}_t\vert \mathbf{s}_t, \mathbf{a}_t) \int p(\mathcal{O}_{t+1:T}\vert \mathbf{s}_{t+1}) p(\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t)d\mathbf{s}_{t+1}\\ &= p(\mathcal{O}_t\vert \mathbf{s}_t, \mathbf{a}_t) \mathbb{E}_{\mathbf{s}_{t+1}\sim p(\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t)}\left[p(\mathcal{O}_{t+1:T}\vert \mathbf{s}_{t+1})\right]\\ &= p(\mathcal{O}_t\vert \mathbf{s}_t, \mathbf{a}_t) \mathbb{E}_{\mathbf{s}_{t+1}\sim p(\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t)}\left[\beta_{t+1}(\mathbf{s}_{t+1})\right] \end{align} $$

where in the fourth-last step we have made use of the Markov property for \(\mathcal{O}_{1:T}\) and noted that \(\mathbf{s}_{t+1}\) only depends on the state and action at the previous time step (the system dynamics are the same irrespective of whether the agent has performed optimally or not in the past).¹

Also note that:

$$ \begin{align} \beta_t(\mathbf{s}_t) &= p(\mathcal{O}_{t:T}\vert\mathbf{s}_t)\\ &= \int p(\mathcal{O}_{t:T},\mathbf{a}_t\vert\mathbf{s}_t)d\mathbf{a}_t\\ &= \int p(\mathcal{O}_{t:T}\vert\mathbf{s}_t,\mathbf{a}_t)p(\mathbf{a}_t \vert \mathbf{s}_t) d\mathbf{a}_t\\ &= \mathbb{E}_{\mathbf{a}_t \sim p(\mathbf{a}_t \vert \mathbf{s}_t)} \left[\beta_t(\mathbf{s}_t,\mathbf{a}_t)\right] \end{align} $$

For now, we can assume that \(p(\mathbf{a}_t \vert \mathbf{s}_t)\) is a uniform distribution, i.e. without optimality constraints all actions are equally likely. Note that \(p(\mathbf{a}_t\vert\mathbf{s}_t)\) can be interpreted as the prior distribution over \(\mathbf{a}_t\) given \(\mathbf{s}_t\), before any assumptions about optimality are made.

We can recursively calculate the backward messages starting from time step \(T-1\) to \(1\):

for \(t=T-1\) to \(1\):

  1. \(\beta_t(\mathbf{s}_t,\mathbf{a}_t)\) = \(p(\mathcal{O}_t\vert \mathbf{s}_t, \mathbf{a}_t) \mathbb{E}_{\mathbf{s}_{t+1}\sim p(\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t)}\left[\beta_{t+1}(\mathbf{s}_{t+1})\right]\)
  2. \(\beta_t(\mathbf{s}_t)\) \(=\mathbb{E}_{\mathbf{a}_t \sim p(\mathbf{a}_t \vert \mathbf{s}_t)} \left[\beta_t(\mathbf{s}_t,\mathbf{a}_t)\right]\)

Note that \(\beta_T(\mathbf{s}_T,\mathbf{a}_T)\) is just \(p(\mathcal{O}_T\vert \mathbf{s}_T, \mathbf{a}_T)\).
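As a concrete illustration, here is a minimal tabular sketch of this recursion in Python. The MDP arrays \(r\), \(P\), and the horizon below are made-up toy quantities (not anything from the notes), and we assume a uniform action prior:

```python
# A minimal tabular sketch of the backward-message recursion, assuming a toy MDP with
# reward array r[s, a], dynamics P[s, a, s'], and a uniform action prior p(a | s).
import numpy as np

rng = np.random.default_rng(0)
S, A, T = 4, 2, 5
r = rng.normal(size=(S, A))                  # reward r(s, a)
P = rng.dirichlet(np.ones(S), size=(S, A))   # dynamics p(s' | s, a)

beta_sa = np.zeros((T + 1, S, A))            # beta_t(s, a), indexed by t = 1..T
beta_s = np.zeros((T + 1, S))                # beta_t(s)

beta_sa[T] = np.exp(r)                       # base case: beta_T(s, a) = p(O_T | s, a) ∝ exp(r)
beta_s[T] = beta_sa[T].mean(axis=1)          # uniform prior: E_{a ~ p(a|s)}[beta_T(s, a)]

for t in range(T - 1, 0, -1):
    # beta_t(s, a) = p(O_t | s, a) * E_{s' ~ p(s'|s,a)}[beta_{t+1}(s')]
    beta_sa[t] = np.exp(r) * (P @ beta_s[t + 1])
    # beta_t(s) = E_{a ~ p(a|s)}[beta_t(s, a)]
    beta_s[t] = beta_sa[t].mean(axis=1)
```

Working directly with these probabilities underflows quickly as the horizon grows, which is one reason the next section rewrites the recursion in log space.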

3. Backward Messages In Terms of Q and Value Functions

Let:

$$ \begin{align} V_t(\mathbf{s}_t) &= \log \beta_t(\mathbf{s}_t)\\ Q_t(\mathbf{s}_t,\mathbf{a}_t) &= \log\beta_t(\mathbf{s}_t,\mathbf{a}_t) \end{align} $$

Then we have:

$$ V_t(\mathbf{s}_t) = \log \int \exp(Q_t(\mathbf{s}_t,\mathbf{a}_t))d\mathbf{a}_t $$

Note that we have ignored \(p(\mathbf{a}_t \vert \mathbf{s}_t)\) in the integral as it is a constant (we earlier assumed it to be a uniform distribution). Also note that as the values of \(Q_t(\mathbf{s}_t,\mathbf{a}_t)\) get larger, \(V_t(\mathbf{s}_t)\) approaches \(\max_{\mathbf{a}_t}Q_t(\mathbf{s}_t,\mathbf{a}_t)\). For this reason the equation above is also known as a soft max of \(Q_t(\mathbf{s}_t,\mathbf{a}_t)\).

We also have:

$$ \begin{align} Q_t(\mathbf{s}_t,\mathbf{a}_t) &= \log \left(p(\mathcal{O}_t\vert \mathbf{s}_t, \mathbf{a}_t) \mathbb{E}_{\mathbf{s}_{t+1}\sim p(\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t)}\left[\beta_{t+1}(\mathbf{s}_{t+1})\right]\right)\\ &= \log p(\mathcal{O}_t\vert \mathbf{s}_t, \mathbf{a}_t) +\log \mathbb{E}_{\mathbf{s}_{t+1}\sim p(\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t)}\left[\exp\left(V_{t+1}(\mathbf{s}_{t+1})\right)\right]\\ &= r(\mathbf{s}_t, \mathbf{a}_t) +\log \mathbb{E}_{\mathbf{s}_{t+1}\sim p(\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t)}\left[\exp\left(V_{t+1}(\mathbf{s}_{t+1})\right)\right] \end{align} $$

When the system dynamics are deterministic, this simply reduces to:

$$ \begin{align} Q_t(\mathbf{s}_t,\mathbf{a}_t) &= r(\mathbf{s}_t, \mathbf{a}_t) +\log \exp\left(V_{t+1}(\mathbf{s}_{t+1})\right)\\ &= r(\mathbf{s}_t, \mathbf{a}_t) + V_{t+1}(\mathbf{s}_{t+1}) \end{align} $$

We will discuss the stochastic case later in these lecture notes. We can rewrite the backward message algorithm in terms of \(Q_t\) and \(V_t\) as follows, with the convention \(V_{T+1}(\mathbf{s}_{T+1}) = 0\) so that \(Q_T(\mathbf{s}_T,\mathbf{a}_T) = r(\mathbf{s}_T,\mathbf{a}_T)\) (a small numerical sketch follows the algorithm):

for \(t=T\) to \(1\):

  1. \(Q_t(\mathbf{s}_t,\mathbf{a}_t) =\) \(r(\mathbf{s}_t, \mathbf{a}_t) +\log \mathbb{E}_{\mathbf{s}_{t+1}\sim p(\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t)}\left[\exp\left(V_{t+1}(\mathbf{s}_{t+1})\right)\right]\)
  2. \(V_t(\mathbf{s}_t) =\) \(\log \int \exp(Q_t(\mathbf{s}_t,\mathbf{a}_t))d\mathbf{a}_t\)
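Here is a minimal log-space sketch of this backward pass on a toy, illustrative MDP (an assumed setup, not code from the notes). Working with \(Q_t\) and \(V_t\) instead of \(\beta_t\) avoids the numerical underflow of the probability-space recursion:

```python
# Log-space backward pass: Q_t(s,a) = r(s,a) + log E_{s'~p(s'|s,a)}[exp(V_{t+1}(s'))],
# V_t(s) = log sum_a exp(Q_t(s,a)).  All arrays below are toy/illustrative.
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
S, A, T = 4, 2, 5
r = rng.normal(size=(S, A))                  # reward r(s, a)
P = rng.dirichlet(np.ones(S), size=(S, A))   # dynamics p(s' | s, a)

V = np.zeros((T + 2, S))                     # convention: V_{T+1}(s) = 0
Q = np.zeros((T + 1, S, A))
for t in range(T, 0, -1):
    vmax = V[t + 1].max()
    # log E_{s'}[exp(V_{t+1}(s'))], computed stably with the max-shift trick
    soft_expectation = np.log(P @ np.exp(V[t + 1] - vmax)) + vmax
    Q[t] = r + soft_expectation
    V[t] = logsumexp(Q[t], axis=-1)          # soft max over actions
```

The per-action `logsumexp` is exactly the soft max discussed above, and the max-shifted next-state term computes \(\log\mathbb{E}[\exp(V_{t+1})]\) without overflow.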

Finally, note that if we had not assumed a uniform distribution for \(p(\mathbf{a}_t\vert\mathbf{s}_t)\) we would have:

$$ \begin{align} V_t(\mathbf{s}_t) &= \log \int p(\mathbf{a}_t\vert\mathbf{s}_t) \exp(Q_t(\mathbf{s}_t,\mathbf{a}_t))d\mathbf{a}_t\\ &= \log \int \exp\left(\log\left(p(\mathbf{a}_t\vert\mathbf{s}_t)\right)\right) \exp(Q_t(\mathbf{s}_t,\mathbf{a}_t))d\mathbf{a}_t\\ &= \log \int \exp\left(Q_t(\mathbf{s}_t,\mathbf{a}_t)+\log\left(p(\mathbf{a}_t\vert\mathbf{s}_t)\right)\right)d\mathbf{a}_t\\ &= \log \int \exp\left(r(\mathbf{s}_t, \mathbf{a}_t) +\log \mathbb{E}_{\mathbf{s}_{t+1}\sim p(\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t)}\left[\exp\left(V_{t+1}(\mathbf{s}_{t+1})\right)\right]+\log\left(p(\mathbf{a}_t\vert\mathbf{s}_t)\right)\right)d\mathbf{a}_t\\ \end{align} $$

Therefore we can simply assume a new reward function:

$$ \tilde{r}(\mathbf{s}_t, \mathbf{a}_t) = r(\mathbf{s}_t, \mathbf{a}_t) + \log\left(p(\mathbf{a}_t\vert\mathbf{s}_t)\right) $$

We can assume that our reward function is actually \(\tilde{r}(\mathbf{s}_t, \mathbf{a}_t)\) and therefore use a uniform distribution for \(p(\mathbf{a}_t\vert\mathbf{s}_t)\) without loss of generality.

4. Policy Computation

We define our policy to be \(p(\mathbf{a}_t\vert\mathbf{s}_t,\mathcal{O}_{1:T})\) i.e. we want to be optimal at all time steps. Note that:

$$ \begin{align} \pi(\mathbf{a}_t\vert\mathbf{s}_t) &= p(\mathbf{a}_t \vert \mathbf{s}_t, \mathcal{O}_{1:T})\\ &= p(\mathbf{a}_t \vert \mathbf{s}_t, \mathcal{O}_{t:T})\\ &= \frac{p(\mathbf{a}_t, \mathbf{s}_t \vert \mathcal{O}_{t:T})}{p(\mathbf{s}_t\vert \mathcal{O}_{t:T})}\\ &= \frac{p(\mathcal{O}_{t:T}\vert\mathbf{s}_t, \mathbf{a}_t)p(\mathbf{a}_t, \mathbf{s}_t)/p(\mathcal{O}_{t:T})}{p(\mathcal{O}_{t:T}\vert\mathbf{s}_t)p(\mathbf{s}_t)/p(\mathcal{O}_{t:T})}\\ &= \frac{p(\mathcal{O}_{t:T}\vert\mathbf{s}_t, \mathbf{a}_t)}{p(\mathcal{O}_{t:T}\vert\mathbf{s}_t)}\frac{p(\mathbf{a}_t, \mathbf{s}_t)}{p(\mathbf{s}_t)}\\ &= \frac{\beta_t(\mathbf{s}_t,\mathbf{a}_t)}{\beta_t(\mathbf{s}_t)} p(\mathbf{a}_t\vert\mathbf{s}_t) \end{align} $$

As \(p(\mathbf{a}_t\vert\mathbf{s}_t)\) is a uniform distribution, it can be absorbed into the normalization constant and ignored. Our policy is thus given as:

$$ \pi(\mathbf{a}_t\vert\mathbf{s}_t) = \frac{\beta_t(\mathbf{s}_t,\mathbf{a}_t)}{\beta_t(\mathbf{s}_t)} $$

In terms of the Q and value functions we have:

$$ \begin{align} \pi(\mathbf{a}_t\vert\mathbf{s}_t) &= \frac{\exp\left(Q_t(\mathbf{s}_t,\mathbf{a}_t)\right)}{\exp\left(V_t(\mathbf{s}_t) \right)}\\ &= \exp\left(Q_t(\mathbf{s}_t,\mathbf{a}_t)-V_t(\mathbf{s}_t)\right)\\ &= \exp\left(A_t(\mathbf{s}_t,\mathbf{a}_t)\right) \end{align} $$

where \(A_t\) is the advantage.
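As a small sketch of policy extraction (assuming tabular \(Q_t\) and \(V_t\) arrays like those computed in the backward pass; the function name is ours, not from the notes):

```python
# Policy extraction: pi(a | s) = exp(Q_t(s, a) - V_t(s)) = exp(A_t(s, a)),
# i.e. a softmax of the Q-values over actions.
import numpy as np

def soft_policy(Q_t: np.ndarray, V_t: np.ndarray) -> np.ndarray:
    """Q_t has shape [S, A], V_t has shape [S]; returns pi with shape [S, A]."""
    pi = np.exp(Q_t - V_t[:, None])
    return pi / pi.sum(axis=1, keepdims=True)   # renormalize to guard against rounding
```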

5. Forward Messages

The forward message at time step \(t\) is defined as:

$$ \alpha_t(\mathbf{s}_t) = p(\mathbf{s}_t\vert\mathcal{O}_{1:t-1}) $$

i.e. it is the probability of the state at time \(t\) given that all previous time steps were optimal. Note that:

$$ \begin{align} \alpha_t(\mathbf{s}_t) &= p(\mathbf{s}_t\vert\mathcal{O}_{1:t-1})\\ &=\int p(\mathbf{s}_t,\mathbf{s}_{t-1},\mathbf{a}_{t-1}\vert\mathcal{O}_{1:t-1}) d\mathbf{s}_{t-1}d\mathbf{a}_{t-1}\\ &= \int p(\mathbf{s}_t\vert\mathbf{s}_{t-1},\mathbf{a}_{t-1},\mathcal{O}_{1:t-1})p(\mathbf{a}_{t-1}\vert\mathbf{s}_{t-1},\mathcal{O}_{1:t-1})p(\mathbf{s}_{t-1}\vert\mathcal{O}_{1:t-1})d\mathbf{s}_{t-1}d\mathbf{a}_{t-1}\\ &= \int p(\mathbf{s}_t\vert\mathbf{s}_{t-1},\mathbf{a}_{t-1})p(\mathbf{a}_{t-1}\vert\mathbf{s}_{t-1},\mathcal{O}_{1:t-1})p(\mathbf{s}_{t-1}\vert\mathcal{O}_{1:t-1})d\mathbf{s}_{t-1}d\mathbf{a}_{t-1}\\ \end{align} $$

where the last step made use of the fact that the system dynamics are the same irrespective of the previous optimality variables. We have:²

$$ \begin{align} p(\mathbf{a}_{t-1}\vert\mathbf{s}_{t-1},\mathcal{O}_{1:t-1}) &= p(\mathbf{a}_{t-1}\vert\mathbf{s}_{t-1},\mathcal{O}_{1:t-2},\mathcal{O}_{t-1})\\ &= \frac{p(\mathcal{O}_{t-1}\vert\mathbf{a}_{t-1},\mathbf{s}_{t-1},\mathcal{O}_{1:t-2})p(\mathbf{a}_{t-1}\vert\mathbf{s}_{t-1},\mathcal{O}_{1:t-2})}{p(\mathcal{O}_{t-1}\vert\mathbf{s}_{t-1},\mathcal{O}_{1:t-2})}\\ &= \frac{p(\mathcal{O}_{t-1}\vert\mathbf{a}_{t-1},\mathbf{s}_{t-1})p(\mathbf{a}_{t-1}\vert\mathbf{s}_{t-1})}{p(\mathcal{O}_{t-1}\vert\mathbf{s}_{t-1})} \end{align} $$

where the last step used the Markov assumption, and:

$$ \begin{align} p(\mathbf{s}_{t-1}\vert\mathcal{O}_{1:t-1}) &= p(\mathbf{s}_{t-1}\vert\mathcal{O}_{1:t-2},\mathcal{O}_{t-1})\\ &= \frac{p(\mathcal{O}_{t-1}\vert\mathbf{s}_{t-1},\mathcal{O}_{1:t-2})p(\mathbf{s}_{t-1}\vert\mathcal{O}_{1:t-2})}{p(\mathcal{O}_{t-1}\vert\mathcal{O}_{1:t-2})}\\ &= \frac{p(\mathcal{O}_{t-1}\vert\mathbf{s}_{t-1})p(\mathbf{s}_{t-1}\vert\mathcal{O}_{1:t-2})}{p(\mathcal{O}_{t-1}\vert\mathcal{O}_{1:t-2})} \end{align} $$

Therefore:

$$ \begin{align} \alpha_t(\mathbf{s}_t) &= \int p(\mathbf{s}_t\vert\mathbf{s}_{t-1},\mathbf{a}_{t-1})\left(\frac{p(\mathcal{O}_{t-1}\vert\mathbf{a}_{t-1},\mathbf{s}_{t-1})p(\mathbf{a}_{t-1}\vert\mathbf{s}_{t-1})}{p(\mathcal{O}_{t-1}\vert\mathbf{s}_{t-1})}\right)\left(\frac{p(\mathcal{O}_{t-1}\vert\mathbf{s}_{t-1})p(\mathbf{s}_{t-1}\vert\mathcal{O}_{1:t-2})}{p(\mathcal{O}_{t-1}\vert\mathcal{O}_{1:t-2})}\right)d\mathbf{s}_{t-1}d\mathbf{a}_{t-1}\\ &= \int\frac{p(\mathbf{s}_t\vert\mathbf{s}_{t-1},\mathbf{a}_{t-1})p(\mathcal{O}_{t-1}\vert\mathbf{a}_{t-1},\mathbf{s}_{t-1})p(\mathbf{a}_{t-1}\vert\mathbf{s}_{t-1})p(\mathbf{s}_{t-1}\vert\mathcal{O}_{1:t-2})}{p(\mathcal{O}_{t-1}\vert\mathcal{O}_{1:t-2})}d\mathbf{s}_{t-1}d\mathbf{a}_{t-1}\\ &= \int\frac{p(\mathbf{s}_t\vert\mathbf{s}_{t-1},\mathbf{a}_{t-1})p(\mathcal{O}_{t-1}\vert\mathbf{a}_{t-1},\mathbf{s}_{t-1})p(\mathbf{a}_{t-1}\vert\mathbf{s}_{t-1})\alpha_{t-1}(\mathbf{s}_{t-1})}{p(\mathcal{O}_{t-1}\vert\mathcal{O}_{1:t-2})}d\mathbf{s}_{t-1}d\mathbf{a}_{t-1}\\ \end{align} $$

Note that \(\alpha_1(\mathbf{s}_1)=p(\mathbf{s}_1)\), which is usually known. We can thus compute the remaining \(\alpha_t\) iteratively, starting from \(t=2\).
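A minimal tabular sketch of this forward recursion (made-up MDP arrays): normalizing \(\alpha_t\) at every step absorbs the \(p(\mathcal{O}_{t-1}\vert\mathcal{O}_{1:t-2})\) denominator, and \(p(\mathcal{O}\vert\mathbf{s},\mathbf{a})\) is taken proportional to \(\exp(r(\mathbf{s},\mathbf{a}))\) with a uniform action prior:

```python
# Forward messages: alpha_t(s') ∝ sum_{s,a} p(s'|s,a) p(O_{t-1}|s,a) p(a|s) alpha_{t-1}(s).
import numpy as np

rng = np.random.default_rng(0)
S, A, T = 4, 2, 5
r = rng.normal(size=(S, A))                   # reward r(s, a)
P = rng.dirichlet(np.ones(S), size=(S, A))    # dynamics p(s' | s, a)
p_s1 = np.full(S, 1.0 / S)                    # initial state distribution p(s_1)

alpha = np.zeros((T + 1, S))
alpha[1] = p_s1                               # alpha_1(s_1) = p(s_1)
for t in range(2, T + 1):
    # integrate over s_{t-1}, a_{t-1}
    w = np.exp(r) / A * alpha[t - 1][:, None]       # shape [S, A]
    alpha[t] = np.einsum('sa,sap->p', w, P)
    alpha[t] /= alpha[t].sum()                      # normalize each step
```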

Finally note that:

$$ \begin{align} p(\mathbf{s}_t\vert\mathcal{O}_{1:T}) &= \frac{p(\mathbf{s}_t,\mathcal{O}_{1:T})}{p(\mathcal{O}_{1:T})}\\ &= \frac{p(\mathbf{s}_t,\mathcal{O}_{1:t-1},\mathcal{O}_{t:T})}{p(\mathcal{O}_{1:T})}\\ &= \frac{p(\mathcal{O}_{t:T}\vert\mathbf{s}_t,\mathcal{O}_{1:t-1})p(\mathbf{s}_t,\mathcal{O}_{1:t-1})}{p(\mathcal{O}_{1:T})}\\ &= \frac{p(\mathcal{O}_{t:T}\vert\mathbf{s}_t)p(\mathbf{s}_t\vert\mathcal{O}_{1:t-1})p(\mathcal{O}_{1:t-1})}{p(\mathcal{O}_{1:T})}\\ &= \beta_t(\mathbf{s}_t)\alpha_t(\mathbf{s}_t)\frac{p(\mathcal{O}_{1:t-1})}{p(\mathcal{O}_{1:T})}\\ &\propto \beta_t(\mathbf{s}_t)\alpha_t(\mathbf{s}_t) \end{align} $$

This is visually depicted in the following figure:

6. The Optimism Problem

Assume stochastic system dynamics. Consider the following again:

$$ Q(\mathbf{s}_t,\mathbf{a}_t) = r(\mathbf{s}_t,\mathbf{a}_t) + \log\mathbb{E}_{\mathbf{s}_{t+1}\sim p(\mathbf{s}_{t+1}\vert\mathbf{s}_t,\mathbf{a}_t)}\left[\exp\left(V_{t+1}(\mathbf{s}_{t+1})\right)\right] $$

Note that the second term is a soft max over the value functions of the possible states at \(t+1\). This is different from what we did in Q-learning, where we took the expected value of the value function at the next time step:

$$ Q(\mathbf{s}_t,\mathbf{a}_t) = r(\mathbf{s}_t,\mathbf{a}_t) + \mathbb{E}_{\mathbf{s}_{t+1}\sim p(\mathbf{s}_{t+1}\vert\mathbf{s}_t,\mathbf{a}_t)}\left[V_{t+1}(\mathbf{s}_{t+1})\right] $$

Taking the soft max intuitively means this: if, among the possible states at \(t+1\), there is a state with a very high value function and a non-zero probability of occurring, it will dominate the soft max even if other, far more likely states have low value functions. This means that the Q-values will be optimistic (i.e. much higher than they really should be). It also creates risk-seeking behavior: an agent acting according to this Q-function might take actions that are high-risk, in the sense that they can land the agent in a state with a high value function only with a very low, albeit non-zero, probability.
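A tiny numeric illustration of this optimism (made-up numbers): a rare but very high-value next state dominates the soft backup \(\log\mathbb{E}[\exp(V_{t+1})]\), while the ordinary expected-value backup \(\mathbb{E}[V_{t+1}]\) barely notices it.

```python
# Toy comparison of the soft backup vs. the ordinary expectation backup.
import numpy as np

p_next = np.array([0.99, 0.01])    # p(s' | s, a): a likely state and a rare state
V_next = np.array([0.0, 20.0])     # value of each possible next state

soft_backup = np.log(np.sum(p_next * np.exp(V_next)))   # log E[exp(V)] ≈ 15.4
mean_backup = np.sum(p_next * V_next)                   # E[V] = 0.2
print(soft_backup, mean_backup)
```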

To see why this occurs, consider the inference problem that we were trying to solve:

$$ p(\tau\vert\mathcal{O}_{1:T}) = p(\mathbf{s}_1\vert\mathcal{O}_{1:T}) \prod_{t=1}^T p(\mathbf{a}_t\vert\mathbf{s}_t,\mathcal{O}_{1:T})p(\mathbf{s}_{t+1}\vert\mathbf{s}_t, \mathbf{a}_t,\mathcal{O}_{1:T}) $$

The derivation for this is similar to the one for \(p(\tau\vert\mathcal{O}_{1:T})\) earlier in these notes. Consider the last term above. We know that \(\mathbf{s}_{t+1}\) is independent of \(\mathcal{O}_{1:t}\) given \(\mathbf{s}_t\) and \(\mathbf{a}_t\) because of the Markov assumption. However, it is not independent of \(\mathcal{O}_{t+1:T}\). Using Bayes rule we have:

$$ p(\mathbf{s}_{t+1}\vert\mathbf{s}_t, \mathbf{a}_t,\mathcal{O}_{t+1:T}) = \frac{p(\mathcal{O}_{t+1:T}\vert\mathbf{s}_{t+1},\mathbf{s}_t, \mathbf{a}_t)p(\mathbf{s}_{t+1}\vert\mathbf{s}_t, \mathbf{a}_t)}{p(\mathcal{O}_{t+1:T}\vert\mathbf{s}_t, \mathbf{a}_t)} $$

We know that \(\mathcal{O}_{t+1:T}\) is not independent of \(\mathbf{s}_{t+1}\), and so:

$$ p(\mathbf{s}_{t+1}\vert\mathbf{s}_t, \mathbf{a}_t,\mathcal{O}_{t+1:T}) \neq p(\mathbf{s}_{t+1}\vert\mathbf{s}_t, \mathbf{a}_t) $$

\(p(\mathbf{s}_{t+1}\vert\mathbf{s}_t,\mathbf{a}_t,\mathcal{O}_{t+1:T})\) is the system dynamics given that the agent performs optimally. Intuitively, this means that the distribution \(\mathbf{s}_{t+1}\vert\mathbf{s}_t,\mathbf{a}_t,\mathcal{O}_{t+1:T}\) assigns higher probabilities to state transitions that yield states with higher rewards, even though these transitions may have lower probability under the true dynamics \(\mathbf{s}_{t+1}\vert\mathbf{s}_t,\mathbf{a}_t\).

Now consider what happens when we are inferring/planning, i.e. when we are calculating \(p(\tau\vert\mathcal{O}_{1:T})\) for different \(\tau\). We plan as if the system dynamics were given by the distribution \(\mathbf{s}_{t+1}\vert\mathbf{s}_t,\mathbf{a}_t,\mathcal{O}_{t+1:T}\). However, when we roll out our plan, the system dynamics are actually governed by \(\mathbf{s}_{t+1}\vert\mathbf{s}_t,\mathbf{a}_t\). This means that the state transitions are not what we assumed during planning, and we may end up in low-reward states. Note that using \(p(\mathbf{s}_{t+1}\vert\mathbf{s}_t,\mathbf{a}_t,\mathcal{O}_{t+1:T})\) also encourages the risk-seeking behavior we talked about above: it encourages the agent to take risky actions by assigning high probabilities to transitions that yield high-value states (where, in reality, those transitions may have low probabilities).³

The question that we thus need to ask is this: given that the agent is performing optimally, what is the probability of \(\tau\) if the transition probabilities remain unchanged?

To this end, we simply replace \(p(\mathbf{s}_{t+1}\vert\mathbf{s}_t,\mathbf{a}_t,\mathcal{O}_{t+1:T})\) with \(p(\mathbf{s}_{t+1}\vert\mathbf{s}_t,\mathbf{a}_t)\). Formally, we will try to find a new distribution \(q(\mathbf{s}_{1:T},\mathbf{a}_{1:T})\) that is close to \(p(\mathbf{s}_{1:T},\mathbf{a}_{1:T}\vert\mathcal{O}_{1:T})\) but uses the true system dynamics \(p(\mathbf{s}_{t+1}\vert\mathbf{s}_t,\mathbf{a}_t)\):

$$ q(\mathbf{s}_{1:T},\mathbf{a}_{1:T}) = p(\mathbf{s}_1)\prod_{t=1}^T q(\mathbf{a}_t\vert\mathbf{s}_t)p(\mathbf{s}_{t+1}\vert\mathbf{s}_t, \mathbf{a}_t) $$

In the previous lecture notes on variational inference, we showed that:

$$ \log p(x) \geq \mathbb{E}_{z \sim q(z)}\left[\log p(x,z)-\log q(z)\right] $$

Using this, we have:

$$ \begin{align} \log p(\mathcal{O}_{1:T}) &\geq \mathbb{E}_{(\mathbf{s}_{1:T},\mathbf{a}_{1:T})\sim q(\mathbf{s}_{1:T},\mathbf{a}_{1:T})}\left[\log p(\mathcal{O}_{1:T},\mathbf{s}_{1:T},\mathbf{a}_{1:T})-\log q(\mathbf{s}_{1:T},\mathbf{a}_{1:T})\right]\\ &= \mathbb{E}_{(\mathbf{s}_{1:T},\mathbf{a}_{1:T})\sim q(\mathbf{s}_{1:T},\mathbf{a}_{1:T})}\left[\log \left(p(\mathbf{s}_1)\prod_{t=1}^T p(\mathbf{a}_t\vert\mathbf{s}_t)p(\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t)p(\mathcal{O}_t\vert\mathbf{s}_t,\mathbf{a}_t)\right)- \\ \log \left(p(\mathbf{s}_1)\prod_{t=1}^T q(\mathbf{a}_t\vert\mathbf{s}_t)p(\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t)\right)\right]\\ &= \mathbb{E}_{(\mathbf{s}_{1:T},\mathbf{a}_{1:T})\sim q(\mathbf{s}_{1:T},\mathbf{a}_{1:T})}\left[\log p(\mathbf{s}_1)+ \sum_{t=1}^T\log p(\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t) +\sum_{t=1}^T\log p(\mathcal{O}_t\vert\mathbf{s}_t,\mathbf{a}_t)\\- \log p(\mathbf{s}_1)-\sum_{t=1}^T\log q(\mathbf{a}_t\vert\mathbf{s}_t)-\sum_{t=1}^T\log p(\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t)\right]\\ &=\mathbb{E}_{(\mathbf{s}_{1:T},\mathbf{a}_{1:T})\sim q(\mathbf{s}_{1:T},\mathbf{a}_{1:T})}\left[\sum_{t=1}^T\log p(\mathcal{O}_t\vert\mathbf{s}_t,\mathbf{a}_t)-\sum_{t=1}^T\log q(\mathbf{a}_t\vert\mathbf{s}_t)\right]\\ &=\mathbb{E}_{(\mathbf{s}_{1:T},\mathbf{a}_{1:T})\sim q(\mathbf{s}_{1:T},\mathbf{a}_{1:T})}\left[\sum_{t=1}^T r(\mathbf{s}_t,\mathbf{a}_t)-\log q(\mathbf{a}_t\vert\mathbf{s}_t)\right]\\ &=\sum_{t=1}^T\mathbb{E}_{(\mathbf{s}_t,\mathbf{a}_t)\sim q(\mathbf{s}_t,\mathbf{a}_t)}\left[r(\mathbf{s}_t,\mathbf{a}_t)-\log q(\mathbf{a}_t\vert\mathbf{s}_t)\right]\\ &= \sum_{t=1}^T\mathbb{E}_{\mathbf{s}_t\sim q(\mathbf{s}_t)}\left[\mathbb{E}_{\mathbf{a}_t\sim q(\mathbf{a}_t\vert\mathbf{s}_t)}\left[r(\mathbf{s}_t,\mathbf{a}_t)-\log q(\mathbf{a}_t \vert \mathbf{s}_t)\right]\right]\\ &= \sum_{t=1}^T\mathbb{E}_{\mathbf{s}_t\sim q(\mathbf{s}_t)}\left[\mathbb{E}_{\mathbf{a}_t\sim q(\mathbf{a}_t\vert\mathbf{s}_t)}\left[r(\mathbf{s}_t,\mathbf{a}_t)\right]+\mathcal{H}(q(\mathbf{a}_t \vert \mathbf{s}_t))\right] \end{align} $$

where we have ignored \(p(\mathbf{a}_t\vert\mathbf{s}_t)\) in the third step since it is assumed to be uniform (a constant that does not affect the optimization). Note that to maximize this lower bound on \(\log p(\mathcal{O}_{1:T})\), we need to maximize both the expected total reward and the entropy of the action distribution.
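As a small sketch of what this objective looks like for a fixed tabular policy \(q(\mathbf{a}_t\vert\mathbf{s}_t)\) on a toy MDP (all arrays and names below are made up for illustration), the bound is a sum over time of expected reward plus policy entropy under the state distribution induced by \(q\):

```python
# Evaluate sum_t E_q[r(s_t, a_t)] + E_q[H(q(. | s_t))] for a fixed tabular policy q.
import numpy as np

rng = np.random.default_rng(0)
S, A, T = 4, 2, 5
r = rng.normal(size=(S, A))                   # reward r(s, a)
P = rng.dirichlet(np.ones(S), size=(S, A))    # dynamics p(s' | s, a)
q = rng.dirichlet(np.ones(A), size=S)         # some fixed policy q(a | s), shape [S, A]

state_dist = np.full(S, 1.0 / S)              # q(s_1) = p(s_1), taken uniform here
bound = 0.0
for t in range(T):
    sa_dist = state_dist[:, None] * q                     # joint q(s_t, a_t)
    entropy = -(q * np.log(q)).sum(axis=1)                # H(q(. | s)) for each state
    bound += (sa_dist * r).sum() + (state_dist * entropy).sum()
    state_dist = np.einsum('sa,sap->p', sa_dist, P)       # q(s_{t+1}) under the true dynamics
print(bound)
```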

Let us now find the values of \(q\) at each time step that will maximize this bound. Let us begin with \(t=T\) (we only consider the terms that depend on \(T\)):

$$ q_T(\mathbf{a}_T\vert\mathbf{s}_T) = \text{argmax}_q\;\mathbb{E}_{(\mathbf{s}_T,\mathbf{a}_T)\sim q(\mathbf{s}_T,\mathbf{a}_T)}\left[r(\mathbf{s}_T,\mathbf{a}_T)-\log q(\mathbf{a}_T\vert\mathbf{s}_T)\right] $$

Taking the derivative of the above expression with respect to \(q\) and setting it to zero yields (we state this without proof; equivalently, the objective is, up to a constant, \(-D_{KL}\left(q(\mathbf{a}_T\vert\mathbf{s}_T)\,\Vert\, \exp(r(\mathbf{s}_T,\mathbf{a}_T))/Z\right)\), which is maximized when the two distributions are equal):

$$ q_T(\mathbf{a}_T\vert\mathbf{s}_T) = \frac{\exp(r(\mathbf{s}_T,\mathbf{a}_T))}{\int\exp(r(\mathbf{s}_T,\mathbf{a}))d\mathbf{a}} $$

This can be rewritten as:

$$ \begin{align} q_T(\mathbf{a}_T\vert\mathbf{s}_T) &= \frac{\exp(r(\mathbf{s}_T,\mathbf{a}_T))}{\exp\log\int\exp(r(\mathbf{s}_T,\mathbf{a}))d\mathbf{a}}\\ &= \frac{\exp(Q_T(\mathbf{s}_T,\mathbf{a}_T))}{\exp \log\int \exp(Q_T(\mathbf{s}_T,\mathbf{a})) d\mathbf{a}}\\ &= \frac{\exp(Q_T(\mathbf{s}_T,\mathbf{a}_T))}{\exp(V_T(\mathbf{s}_T))}\\ &= \exp(Q_T(\mathbf{s}_T,\mathbf{a}_T)-V_T(\mathbf{s}_T)) \end{align} $$

where we use the fact that \(Q_T(\mathbf{s}_T,\mathbf{a}_T) = r(\mathbf{s}_T,\mathbf{a}_T)\) (since there is no state after \(T\)).

Plugging this back into the bound we had gives:

$$ \begin{align} \mathbb{E}_{(\mathbf{s}_T,\mathbf{a}_T)\sim q(\mathbf{s}_T,\mathbf{a}_T)}&\left[r(\mathbf{s}_T,\mathbf{a}_T)-\log(q(\mathbf{a}_T\vert\mathbf{s}_T))\right]\\ &= \mathbb{E}_{(\mathbf{s}_T,\mathbf{a}_T)\sim q(\mathbf{s}_T,\mathbf{a}_T)}\left[r(\mathbf{s}_T,\mathbf{a}_T)-\log(\exp(Q_T(\mathbf{s}_T,\mathbf{a}_T)-V_T(\mathbf{s}_T)))\right]\\ &= \mathbb{E}_{(\mathbf{s}_T,\mathbf{a}_T)\sim q(\mathbf{s}_T,\mathbf{a}_T)}\left[r(\mathbf{s}_T,\mathbf{a}_T)-r(\mathbf{s}_T,\mathbf{a}_T)+V_T(\mathbf{s}_T)\right]\\ &= \mathbb{E}_{(\mathbf{s}_T,\mathbf{a}_T)\sim q(\mathbf{s}_T,\mathbf{a}_T)}\left[V_T(\mathbf{s}_T)\right]\\ &= \int\int q(\mathbf{s}_T,\mathbf{a}_T)V_T(\mathbf{s}_T)d\mathbf{s}_Td\mathbf{a}_T\\ &= \int V_T(\mathbf{s}_T)\int q(\mathbf{s}_T,\mathbf{a}_T)d\mathbf{a}_Td\mathbf{s}_T\\ &= \int V_T(\mathbf{s}_T) q(\mathbf{s}_T)d\mathbf{s}_T\\ &= \mathbb{E}_{\mathbf{s}_T\sim q(\mathbf{s}_T)}\left[V_T(\mathbf{s}_T)\right] \end{align} $$

We can now solve for \(T-1\):

$$ \begin{align} q_{T-1}(\mathbf{a}_{T-1}\vert\mathbf{s}_{T-1}) &= \text{argmax}_q\;\mathbb{E}_{(\mathbf{s}_{T-1},\mathbf{a}_{T-1})\sim q(\mathbf{s}_{T-1},\mathbf{a}_{T-1})}\left[r(\mathbf{s}_{T-1},\mathbf{a}_{T-1})+\mathbb{E}_{\mathbf{s}_T\sim p(\mathbf{s}_T\vert\mathbf{s}_{T-1},\mathbf{a}_{T-1})}\left[V_T(\mathbf{s}_T)\right]-\log(q(\mathbf{a}_{T-1}\vert\mathbf{s}_{T-1}))\right]\\ &= \text{argmax}_q\;\mathbb{E}_{(\mathbf{s}_{T-1},\mathbf{a}_{T-1})\sim q(\mathbf{s}_{T-1},\mathbf{a}_{T-1})}\left[Q_{T-1}(\mathbf{s}_{T-1},\mathbf{a}_{T-1})-\log(q(\mathbf{a}_{T-1}\vert\mathbf{s}_{T-1}))\right]\\ \end{align} $$

This is maximized when:

$$ \begin{align} q_{T-1}(\mathbf{a}_{T-1}\vert\mathbf{s}_{T-1}) &= \frac{\exp(Q_{T-1}(\mathbf{s}_{T-1},\mathbf{a}_{T-1}))}{\int \exp(Q_{T-1}(\mathbf{s}_{T-1},\mathbf{a}))d\mathbf{a}}\\ &= \exp(Q_{T-1}(\mathbf{s}_{T-1},\mathbf{a}_{T-1})-V_{T-1}(\mathbf{s}_{T-1})) \end{align} $$

Substituting this back into the bound gives:

$$ \mathbb{E}_{(\mathbf{s}_{T-1},\mathbf{a}_{T-1})\sim q(\mathbf{s}_{T-1},\mathbf{a}_{T-1})}\left[Q_{T-1}(\mathbf{s}_{T-1},\mathbf{a}_{T-1})-\log(q(\mathbf{a}_{T-1}\vert\mathbf{s}_{T-1}))\right] = \mathbb{E}_{\mathbf{s}_{T-1}\sim q(\mathbf{s}_{T-1})}\left[V_{T-1}(\mathbf{s}_{T-1})\right] $$

We can therefore recursively solve for \(t=T-1\) to \(1\) via:

  1. \(Q_t(\mathbf{s}_t,\mathbf{a}_t) =\) \(r(\mathbf{s}_t,\mathbf{a}_t)+\mathbb{E}_{\mathbf{s}_{t+1}\sim p(\mathbf{s}_{t+1}\vert\mathbf{s}_t,\mathbf{a}_t)}\left[V_{t+1}(\mathbf{s}_{t+1})\right]\)
  2. \(V_t(\mathbf{s}_t)=\) \(\log\int\exp(Q_t(\mathbf{s}_t,\mathbf{a}))d\mathbf{a}\)

This is almost identical to the value iteration algorithm we saw earlier. The only difference is that step 2 now takes a soft max instead of a hard max over actions. For this reason, the algorithm above is called soft value iteration.
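A minimal tabular sketch of soft value iteration with this corrected backup, together with the resulting policy \(q(\mathbf{a}_t\vert\mathbf{s}_t) = \exp(Q_t(\mathbf{s}_t,\mathbf{a}_t)-V_t(\mathbf{s}_t))\); the MDP arrays are made up for illustration:

```python
# Soft value iteration: expectation backup over next states, soft max over actions.
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
S, A, T = 4, 2, 5
r = rng.normal(size=(S, A))                   # reward r(s, a)
P = rng.dirichlet(np.ones(S), size=(S, A))    # dynamics p(s' | s, a)

V = np.zeros((T + 2, S))                      # convention: V_{T+1}(s) = 0
Q = np.zeros((T + 1, S, A))
pi = np.zeros((T + 1, S, A))
for t in range(T, 0, -1):
    Q[t] = r + P @ V[t + 1]                   # E_{s' ~ p(s'|s,a)}[V_{t+1}(s')], not a soft max
    V[t] = logsumexp(Q[t], axis=-1)           # soft max over actions
    pi[t] = np.exp(Q[t] - V[t][:, None])      # q(a_t | s_t) = exp(Q_t - V_t)
```

Compared with the backward pass in section 3, the only change is that the next-state term is an ordinary expectation (`P @ V[t + 1]`) rather than the log of an expectation of exponentials, which removes the optimism discussed above.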


  1. The system dynamics are not independent of future optimality variables. We will elaborate on this later. 

  2. Note that \(p(A\vert B,C) = \frac{p(C \vert A,B)p(A\vert B)}{p(C \vert B)}\). 

  3. This problem obviously does not arise in deterministic systems because they always transition to fixed states.