1. Terminology and Notations
Let \(\mathbf{s}_t\) and \(\mathbf{o}_t\) be the state and observation of a system at time \(t\). Let \(\mathbf{a}_t\) be the action taken at time \(t\). Actions are taken according to some policy \(\pi_\theta\), where \(\theta\) denotes the parameters of the policy. A fully-observed policy models the distribution \(\pi_\theta(\mathbf{a}_t \vert \mathbf{s}_t)\) whereas a partially-observed policy models \(\pi_\theta(\mathbf{a}_t\vert \mathbf{o}_t)\).
We define \(r(\mathbf{s}_t,\mathbf{a}_t)\) to be the reward function. Intuitively, the reward function tells us which actions and states are better. Finally, let \(p(\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t)\) denote the probability of transitioning to state \(\mathbf{s}_{t+1}\) if action \(\mathbf{a}_t\) is taken in state \(\mathbf{s}_t\).
2. Markov Chains
A Markov chain is defined as:
$$ \mathcal{M} = \{\mathcal{S},\mathcal{T}\} $$
where \(\mathcal{S}\) is known as the state space i.e. all \(\mathbf{s} \in \mathcal{S}\) and \(\mathcal{T}\) is the transition operator, i.e. it contains \(p(\mathbf{s}_{t+1}\vert \mathbf{s}_t)\). The reason why \(\mathcal{T}\) is called an operator is that if we define:
$$ \begin{align} \mu_{t,i} &= p(\mathbf{s}_t=i)\\ \mathcal{T}_{i,j} &= p(\mathbf{s}_{t+1}=i\vert \mathbf{s}_t=j) \end{align} $$
then from:
$$ p(\mathbf{s}_{t+1}) = \sum_{\mathbf{s}_t} p(\mathbf{s}_{t+1}\vert \mathbf{s}_t)p({\mathbf{s}_t}) $$
it follows that:
$$ \vec{\mu}_{t+1} = \mathcal{T}\vec{\mu}_t $$
In the formulation above, we have assumed that the Markov property holds.
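To make the operator view concrete, here is a minimal sketch in Python (using a hypothetical 3-state chain with made-up transition probabilities) of \(\mathcal{T}\) acting on a state distribution as a matrix-vector product:

```python
import numpy as np

# Hypothetical 3-state Markov chain. Columns index the current state and rows the
# next state, matching T[i, j] = p(s_{t+1} = i | s_t = j); each column sums to 1.
T = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.5],
    [0.0, 0.1, 0.5],
])

mu = np.array([1.0, 0.0, 0.0])  # mu_0: start in state 0 with probability 1

for _ in range(5):
    mu = T @ mu                  # mu_{t+1} = T mu_t
print(mu, mu.sum())              # distribution over states after 5 steps; sums to 1
```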
3. Markov Decision Processes
A Markov Decision Process (MDP) is an extension of a Markov chain and is defined as:
$$ \mathcal{M} = \{\mathcal{S},\mathcal{A},\mathcal{T},r\} $$
where \(\mathcal{S}\) is the state space i.e. all \(\mathbf{s} \in \mathcal{S}\), \(\mathcal{A}\) is the action space i.e. all \(\mathbf{a} \in \mathcal{A}\), \(\mathcal{T}\) is the transition operator, i.e. it contains \(p(\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t)\), and \(r\) is the reward function such that \(r:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}\). As before, note that if we define:
$$ \begin{align} \mu_{t,j} &= p(\mathbf{s}_t=j) \\ \zeta_{t,k} &= p(\mathbf{a}_t=k)\\ \mathcal{T}_{i,j,k} &= p(\mathbf{s}_{t+1}=i \vert \mathbf{s}_{t}=j,\mathbf{a}_t=k) \end{align} $$
then:
$$ \mu_{t+1,i} = \sum_{j,k}\mathcal{T}_{i,j,k}\mu_{t,j}\zeta_{t,k} $$
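As a quick illustration, the following sketch (with a randomly generated, hypothetical transition tensor and a made-up action distribution) evaluates this update with a single `einsum` call:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2

# Hypothetical transition tensor with T[i, j, k] = p(s_{t+1} = i | s_t = j, a_t = k);
# normalizing over the first axis makes each (j, k) slice a distribution over i.
T = rng.random((n_states, n_states, n_actions))
T /= T.sum(axis=0, keepdims=True)

mu = np.full(n_states, 1.0 / n_states)   # mu_t, here uniform over states
zeta = np.array([0.3, 0.7])              # zeta_t, some action distribution

# mu_{t+1, i} = sum_{j, k} T[i, j, k] * mu_t[j] * zeta_t[k]
mu_next = np.einsum("ijk,j,k->i", T, mu, zeta)
print(mu_next, mu_next.sum())            # a valid distribution over next states
```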
Note that actions and states can be both discrete and continuous.
In contrast to this fully-observed version of an MDP, we define a partially-observed MDP (POMDP) as:
$$ \mathcal{M} = \{\mathcal{S},\mathcal{A},\mathcal{O},\mathcal{T},\mathcal{E}, r\} $$
where \(\mathcal{O}\) is the observation space i.e. all \(\mathbf{o}\in \mathcal{O}\) and \(\mathcal{E}\) contains the emission probabilities \(p(\mathbf{o}_t\vert\mathbf{s}_t)\). Note that the observations too can be discrete or continuous.
One important difference between observations and states is that the latter are Markovian while the former are not.
4. The Goal of Reinforcement Learning
For now, we will restrict ourselves to fully-observed Markov Decision Processes. Suppose that we have a system that starts off in state \(\mathbf{s}_1\) and takes action \(\mathbf{a}_1\) according to some policy \(\pi_\theta\), which causes the system to end up in state \(\mathbf{s}_2\). We refer to a series of states and actions taken \(\mathbf{s}_1,\mathbf{a}_1,\ldots,\mathbf{s}_T,\mathbf{a}_T\) as a trajectory \(\tau\). Note that:
$$ \begin{align} p(\tau) &= p(\mathbf{s}_1,\mathbf{a}_1,\ldots,\mathbf{s}_T,\mathbf{a}_T)\\ &= p(\mathbf{s}_1)p(\mathbf{a_1}\vert \mathbf{s}_1)p(\mathbf{s}_2\vert \mathbf{s}_1,\mathbf{a}_1)\ldots p(\mathbf{s}_T \vert \mathbf{s}_1,\mathbf{a_1},\ldots ,\mathbf{s}_{T-1},\mathbf{a}_{T-1})p(\mathbf{a}_T \vert \mathbf{s}_1,\mathbf{a_1},\ldots ,\mathbf{s}_{T})\\ &= p(\mathbf{s}_1)\pi_\theta(\mathbf{a_1}\vert \mathbf{s}_1)p(\mathbf{s}_2\vert \mathbf{s}_1,\mathbf{a}_1)\ldots p(\mathbf{s}_T \vert \mathbf{s}_{T-1},\mathbf{a}_{T-1})\pi_\theta(\mathbf{a}_T \vert \mathbf{s}_{T})\\ &=p(\mathbf{s}_1)\prod_{t=1}^T \pi_\theta(\mathbf{a}_t \vert \mathbf{s}_t)p(\mathbf{s}_{t+1}\vert \mathbf{s}_t,\mathbf{a}_t) \end{align} $$
where the second line uses the chain rule of probability, and the third line uses the Markov assumption on \(p(\mathbf{s}_k\vert \ldots)\) together with the fact that \(\pi_\theta(\mathbf{a}_t\vert \mathbf{s}_t)\) gives the probability of taking action \(\mathbf{a}_t\) in state \(\mathbf{s}_t\) (the policy does not depend on states and actions before time \(t\)). In order to make the dependence of \(p(\tau)\) on \(\theta\) explicit, we shall henceforth write it as \(p_\theta(\tau)\).
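The factorization above can be turned into a sampler directly. Below is a minimal sketch (tabular, with randomly generated, hypothetical dynamics, policy, and horizon) that rolls out a trajectory and accumulates \(\log p_\theta(\tau)\) term by term, including the final transition, exactly as in the product above:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, horizon = 3, 2, 4

p_s1 = np.full(n_states, 1.0 / n_states)                  # initial state distribution
pi = rng.random((n_actions, n_states)); pi /= pi.sum(0)   # pi[a, s] = pi_theta(a | s)
P = rng.random((n_states, n_states, n_actions))           # P[s', s, a] = p(s' | s, a)
P /= P.sum(axis=0, keepdims=True)

def sample_trajectory():
    """Roll out the policy once; return states, actions and log p_theta(tau)."""
    s = rng.choice(n_states, p=p_s1)
    log_p = np.log(p_s1[s])
    states, actions = [s], []
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi[:, s])
        log_p += np.log(pi[a, s])                  # log pi_theta(a_t | s_t)
        s_next = rng.choice(n_states, p=P[:, s, a])
        log_p += np.log(P[s_next, s, a])           # log p(s_{t+1} | s_t, a_t)
        actions.append(a)
        states.append(s_next)
        s = s_next
    return states, actions, log_p

print(sample_trajectory())
```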
The goal of reinforcement learning is to find the parameters \(\theta\) that maximize the expected total reward along trajectories sampled from \(p_\theta(\tau)\).
A. Finite Horizon Case
In the case when \(T\) is finite, we can simply find:
$$ \begin{align} \theta^* &= \underset{\theta}{\text{argmax}}\;\mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\sum_{t=1}^T r(\mathbf{s}_t,\mathbf{a}_t)\right]\\ & = \underset{\theta}{\text{argmax}}\int p_\theta(\tau)\sum_{t=1}^T r(\mathbf{s}_t,\mathbf{a}_t)d\tau\\ &= \underset{\theta}{\text{argmax}}\sum_{t=1}^T\int p_\theta (\mathbf{s}_1, \mathbf{a}_1,\ldots, \mathbf{s}_T,\mathbf{a}_T) r(\mathbf{s}_t,\mathbf{a}_t)d\tau\\ &= \underset{\theta}{\text{argmax}}\sum_{t=1}^T\int\int p_\theta(\mathbf{s}_t, \mathbf{a}_t\vert \tau/ \mathbf{s}_t,\mathbf{a}_t) p_\theta(\tau/\mathbf{s}_t, \mathbf{a}_t) r(\mathbf{s}_t,\mathbf{a}_t) d(\tau/\mathbf{s}_t,\mathbf{a}_t) d(\mathbf{s}_t,\mathbf{a}_t)\\ &= \underset{\theta}{\text{argmax}}\sum_{t=1}^T\int\int p_\theta (\mathbf{s}_t, \mathbf{a}_t\vert \tau/\mathbf{s}_t, \mathbf{a}_t) p_\theta(\tau/ \mathbf{s}_t, \mathbf{a}_t)d(\tau/\mathbf{s}_t,\mathbf{a}_t) r(\mathbf{s}_t,\mathbf{a}_t) d(\mathbf{s}_t,\mathbf{a}_t)\\ &= \underset{\theta}{\text{argmax}}\sum_{t=1}^T\int p_\theta(\mathbf{s}_t,\mathbf{a}_t) r(\mathbf{s}_t,\mathbf{a}_t) d(\mathbf{s}_t,\mathbf{a}_t)\\ &= \underset{\theta}{\text{argmax}}\;\sum_{t=1}^T \mathbb{E}_{(\mathbf{s}_t,\mathbf{a}_t)\sim p_\theta(\mathbf{s}_t,\mathbf{a}_t)} \left[ r(\mathbf{s}_t,\mathbf{a}_t)\right] \end{align} $$
where \(p_\theta(\mathbf{s}_t,\mathbf{a}_t\vert\tau/\mathbf{s}_t,\mathbf{a}_t) = p_\theta(\mathbf{s}_t,\mathbf{a}_t\vert \mathbf{s}_1,\mathbf{a}_1,\ldots,\mathbf{s}_{t-1},\mathbf{a}_{t-1},\mathbf{s}_{t+1},\mathbf{a}_{t+1},\ldots,\mathbf{s}_{T},\mathbf{a}_{T})\), i.e. \(\tau/\mathbf{s}_t,\mathbf{a}_t\) denotes all states and actions in the trajectory other than \(\mathbf{s}_t,\mathbf{a}_t\). The probability \(p_\theta(\mathbf{s}_t,\mathbf{a}_t)\) is often referred to as a state-action marginal.
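As a sanity check on this manipulation, the following sketch (a tiny tabular MDP with made-up sizes and numbers) compares the expectation of \(\sum_t r(\mathbf{s}_t,\mathbf{a}_t)\) under \(p_\theta(\tau)\), computed exactly by enumerating all trajectories, with \(\sum_t \mathbb{E}_{p_\theta(\mathbf{s}_t,\mathbf{a}_t)}[r(\mathbf{s}_t,\mathbf{a}_t)]\), computed by propagating the state marginal forward:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
nS, nA, T = 2, 2, 3

p_s1 = np.array([0.6, 0.4])                           # initial state distribution
pi = rng.random((nA, nS)); pi /= pi.sum(0)            # pi[a, s] = pi_theta(a | s)
P = rng.random((nS, nS, nA)); P /= P.sum(0)           # P[s', s, a] = p(s' | s, a)
r = rng.random((nS, nA))                              # r[s, a]

# Exact expectation over all trajectories (s_1, a_1, ..., s_T, a_T).
lhs = 0.0
for traj in itertools.product(range(nS), range(nA), repeat=T):
    states, actions = traj[0::2], traj[1::2]
    p = p_s1[states[0]]
    for t in range(T):
        p *= pi[actions[t], states[t]]
        if t + 1 < T:
            p *= P[states[t + 1], states[t], actions[t]]
    lhs += p * sum(r[states[t], actions[t]] for t in range(T))

# Sum over t of E_{p(s_t, a_t)}[r], with the marginals propagated forward.
mu = p_s1.copy()
rhs = 0.0
for t in range(T):
    joint = mu[None, :] * pi                          # joint[a, s] = p(s_t = s, a_t = a)
    rhs += np.sum(joint * r.T)
    mu = np.einsum("ijk,kj->i", P, joint)             # next state marginal
print(lhs, rhs)                                       # the two numbers agree
```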
B. Infinite Horizon Case
Let us define a Markov chain on the state-action pairs \((\mathbf{s},\mathbf{a})\), i.e. a chain whose state is the concatenation of \(\mathbf{s}\) and \(\mathbf{a}\) (so its state space is \(\mathcal{S}\times\mathcal{A}\)). If we let \(\mu_t\) correspond to the distribution \(p_\theta(\mathbf{s}_t,\mathbf{a}_t)\) in the same fashion as before, then:
$$ \mu_{t+1} = \mathcal{T}\mu_t $$
Here \(\mathcal{T}\) is the state-action transition operator, i.e. it contains the distribution \(p_\theta(\mathbf{s}_{t+1},\mathbf{a}_{t+1} \vert \mathbf{s}_t, \mathbf{a}_t)\). Note that:
$$ \mu_{t+k} = \mathcal{T}^k\mu_t $$
Let us ask the following question: does \(\mu_t=p_\theta(\mathbf{s}_t,\mathbf{a}_t)\) converge to some stationary distribution \(\mu=p_\theta(\mathbf{s},\mathbf{a})\), i.e. does the distribution \(p_\theta(\mathbf{s}_t,\mathbf{a}_t)\) become constant after a certain time? For this we require that:
$$ \mu = \mathcal{T}\mu $$
which just means that \(\mu\) is an eigenvector of \(\mathcal{T}\) with eigenvalue \(1\). It turns out that such a vector always exists under some regularity conditions (e.g. if the chain is ergodic and aperiodic).
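A minimal numerical sketch of this claim (using a hypothetical 3-state transition matrix with made-up probabilities): the stationary distribution can be read off from the eigenvector of \(\mathcal{T}\) with eigenvalue \(1\), and repeatedly applying \(\mathcal{T}\) converges to the same distribution:

```python
import numpy as np

# Hypothetical transition matrix; columns sum to 1: T[i, j] = p(next = i | current = j).
T = np.array([
    [0.90, 0.20, 0.10],
    [0.05, 0.70, 0.30],
    [0.05, 0.10, 0.60],
])

eigvals, eigvecs = np.linalg.eig(T)
idx = np.argmin(np.abs(eigvals - 1.0))    # pick the eigenvalue closest to 1
mu = np.real(eigvecs[:, idx])
mu /= mu.sum()                            # normalize the eigenvector into a distribution

mu_power = np.full(3, 1.0 / 3.0)
for _ in range(1000):
    mu_power = T @ mu_power               # mu_{t+1} = T mu_t, applied many times

print(mu)        # stationary distribution from the eigendecomposition
print(mu_power)  # matches the eigenvector result up to numerical error
```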
Therefore, in the limit \(T \rightarrow \infty\):
$$ \theta^* = \underset{\theta}{\text{argmax}}\;\frac{1}{T}\sum_{t=1}^T \mathbb{E}_{(\mathbf{s}_t,\mathbf{a}_t)\sim p_\theta(\mathbf{s}_t,\mathbf{a}_t)} \left[ r(\mathbf{s}_t,\mathbf{a}_t)\right] $$
approaches:
$$ \begin{align} \theta^* &= \underset{\theta}{\text{argmax}}\;\frac{1}{T}\sum_{t=1}^T \mathbb{E}_{(\mathbf{s},\mathbf{a})\sim p_\theta(\mathbf{s},\mathbf{a})} \left[ r(\mathbf{s},\mathbf{a})\right]\\ &= \underset{\theta}{\text{argmax}}\;\mathbb{E}_{(\mathbf{s},\mathbf{a})\sim p_\theta(\mathbf{s},\mathbf{a})} \left[ r(\mathbf{s},\mathbf{a})\right] \end{align} $$
Note that we have added a \(1/T\) factor to keep the objective (the average reward per time step) finite; without it the summation would grow without bound as \(T\rightarrow\infty\).
In both the finite and infinite horizon cases we are interested in maximizing the expected reward. In reinforcement learning, we almost always care about expectations. One reason for this is that even if a reward function is not smooth (which means that we cannot directly take its gradient and optimize it), its expectation under a parameterized distribution often is. Let us demonstrate this with a toy example. Suppose that you are driving a car. Assume a reward function:
$$ r = \begin{cases} +1 & \text{remain on road}\\ -1 & \text{fall off a cliff} \end{cases} $$
Suppose that we are trying to model:
$$ p_\theta(\text{fall off a cliff}) = \theta $$
i.e. we are more likely to remain on the road if \(\theta < 0.5\) and more likely to fall off the cliff otherwise. Now clearly the reward itself is not smooth in \(\theta\). However, the expected reward is given by:
$$ \begin{align} \mathbb{E}[r] &= p_\theta(\text{remain on road})r(\text{remain on road}) + p_\theta(\text{fall off a cliff})r(\text{fall off a cliff})\\ &= (1-\theta)(1) + \theta(-1) \\ &= -2\theta + 1 \end{align} $$
which is smooth in \(\theta\).
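A tiny numerical illustration of this (using the toy numbers above): the expected reward \(1 - 2\theta\) is linear, hence smooth, in \(\theta\), even though the reward itself only takes the values \(+1\) and \(-1\):

```python
import numpy as np

def expected_reward(theta):
    # E[r] = (1 - theta) * (+1) + theta * (-1) = 1 - 2 * theta
    return (1.0 - theta) * 1.0 + theta * (-1.0)

for theta in np.linspace(0.0, 1.0, 5):
    print(theta, expected_reward(theta))   # 1.0, 0.5, 0.0, -0.5, -1.0
# dE[r]/dtheta = -2 everywhere, so the expectation can be optimized with
# gradient-based methods even though r itself is a step function of the outcome.
```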
5. The Anatomy of a Reinforcement Learning Algorithm
A reinforcement learning algorithm generally involves three steps:
- Generate samples i.e. run the policy and collect the observations made and actions taken.
- Estimate the returns (read ‘rewards’ for now) or fit a model to predict the next state given the current state and action.
- Improve the policy.
We will discuss what exactly is meant by ‘fitting a model’ in step 2 when we cover model-based RL. There are several different ways in which the returns can be used to improve the policy; we will discuss these in detail later on.
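To make the three steps concrete, here is a minimal runnable sketch on a made-up 2-state, 2-action tabular MDP. The ‘improve the policy’ step here is a crude random-search update, used only so the loop is self-contained; it is a placeholder, not one of the algorithms discussed later:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon = 2, 2, 10
P = rng.random((n_states, n_states, n_actions)); P /= P.sum(0)  # p(s' | s, a)
r = rng.random((n_states, n_actions))                           # r(s, a)

def rollout(pi):
    """Step 1: generate a sample trajectory and return its total reward."""
    s, total = rng.integers(n_states), 0.0
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi[:, s])
        total += r[s, a]
        s = rng.choice(n_states, p=P[:, s, a])
    return total

pi = np.full((n_actions, n_states), 1.0 / n_actions)   # pi[a, s], uniform to start
best = np.mean([rollout(pi) for _ in range(50)])       # Step 2: estimate the return
for _ in range(20):
    candidate = np.clip(pi + 0.1 * rng.standard_normal(pi.shape), 1e-3, None)
    candidate /= candidate.sum(0)
    score = np.mean([rollout(candidate) for _ in range(50)])
    if score > best:                                    # Step 3: improve the policy
        pi, best = candidate, score
print(best)
print(pi)
```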
6. Working With Stochastic Systems
Stochastic systems, unlike deterministic ones, do not necessarily produce the same output for the same set of inputs each time. One way of dealing with such systems is by using conditional expectations.
A. Conditional Expectations
Recall that the reinforcement learning objective is given as follows:
$$ \sum_{t=1}^T\mathbb{E}_{(\mathbf{s}_t,\mathbf{a}_t)\sim p_\theta(\mathbf{s}_t,\mathbf{a}_t)}\left[r(\mathbf{s}_t,\mathbf{a}_t)\right] = \mathbb{E}_{(\mathbf{s}_1,\mathbf{a}_1)\sim p_\theta(\mathbf{s}_1,\mathbf{a}_1)}\left[r(\mathbf{s}_1,\mathbf{a}_1)\right] +\mathbb{E}_{(\mathbf{s}_2,\mathbf{a}_2)\sim p_\theta(\mathbf{s}_2,\mathbf{a}_2)}\left[r(\mathbf{s}_2,\mathbf{a}_2)\right] + \ldots $$
Note that:
$$ \begin{align} \mathbb{E}_{(\mathbf{s}_1,\mathbf{a}_1)\sim p_\theta(\mathbf{s}_1,\mathbf{a}_1)}\left[r(\mathbf{s}_1,\mathbf{a}_1)\right] &= \int_{\mathbf{s}_1}\int_{\mathbf{a}_1}p_\theta(\mathbf{s}_1,\mathbf{a}_1)r(\mathbf{s}_1,\mathbf{a}_1)d\mathbf{s}_1d\mathbf{a}_1\\ &= \int_{\mathbf{s}_1}\int_{\mathbf{a}_1}p(\mathbf{s}_1)p_\theta(\mathbf{a}_1\vert\mathbf{s}_1)r(\mathbf{s}_1,\mathbf{a}_1)d\mathbf{s}_1d\mathbf{a}_1\\ &=\int_{\mathbf{s}_1}p(\mathbf{s}_1)\int_{\mathbf{a}_1}p_\theta(\mathbf{a}_1\vert\mathbf{s}_1)r(\mathbf{s}_1,\mathbf{a}_1)d\mathbf{s}_1d\mathbf{a}_1\\ &= \mathbb{E}_{\mathbf{s_1}\sim p(\mathbf{s}_1)}\left[\mathbb{E}_{\mathbf{a}_1\sim p_\theta(\mathbf{a}_1\vert\mathbf{s}_1)}\left[r(\mathbf{s}_1,\mathbf{a}_1)\right]\right] \end{align} $$
(where we drop \(\theta\) in \(p(\mathbf{s}_1)\) because it does not depend on \(\theta\) and \(p_\theta(\mathbf{a}_1\vert\mathbf{s}_1)=\pi_\theta(\mathbf{a}_1\vert\mathbf{s}_1)\)). Also note that:
$$ \begin{align} \mathbb{E}_{(\mathbf{s}_2,\mathbf{a}_2)\sim p_\theta(\mathbf{s}_2,\mathbf{a}_2)}\left[r(\mathbf{s}_2,\mathbf{a}_2)\right] &= \int_{\mathbf{s}_2}p_\theta(\mathbf{s}_2)\int_{\mathbf{a}_2}p_\theta(\mathbf{a}_2\vert\mathbf{s}_2)r(\mathbf{s}_2,\mathbf{a}_2)d\mathbf{s}_2d\mathbf{a}_2\\ &= \int_{\mathbf{s}_2}\left(\int_{\mathbf{s}_1}\int_{\mathbf{a}_1} p_\theta(\mathbf{s}_2\vert \mathbf{s}_1,\mathbf{a}_1)p_\theta(\mathbf{s}_1,\mathbf{a}_1)d\mathbf{s}_1d\mathbf{a}_1\right)\int_{\mathbf{a}_2}p_\theta(\mathbf{a}_2\vert\mathbf{s}_2)r(\mathbf{s}_2,\mathbf{a}_2)d\mathbf{s}_2d\mathbf{a}_2\\ &= \int_{\mathbf{s}_1}\int_{\mathbf{a}_1}p_\theta(\mathbf{s}_1,\mathbf{a}_1)\int_{\mathbf{s}_2}p_\theta(\mathbf{s}_2\vert \mathbf{s}_1,\mathbf{a}_1)\int_{\mathbf{a}_2}p_\theta(\mathbf{a}_2\vert\mathbf{s}_2)r(\mathbf{s}_2,\mathbf{a}_2)d\mathbf{s}_2d\mathbf{a}_2d\mathbf{s}_1d\mathbf{a}_1\\ &= \int_{\mathbf{s}_1}p(\mathbf{s}_1)\int_{\mathbf{a}_1}p_\theta(\mathbf{a}_1\vert\mathbf{s}_1)\int_{\mathbf{s}_2}p_\theta(\mathbf{s}_2\vert \mathbf{s}_1,\mathbf{a}_1)\int_{\mathbf{a}_2}p_\theta(\mathbf{a}_2\vert\mathbf{s}_2)r(\mathbf{s}_2,\mathbf{a}_2)d\mathbf{s}_2d\mathbf{a}_2d\mathbf{s}_1d\mathbf{a}_1\\ &= \mathbb{E}_{\mathbf{s}_1\sim p(\mathbf{s}_1)}\left[\mathbb{E}_{\mathbf{a}_1\sim p_\theta(\mathbf{a}_1\vert \mathbf{s}_1) }\left[\mathbb{E}_{\mathbf{s}_2 \sim p_\theta(\mathbf{s}_2\vert\mathbf{s}_1,\mathbf{a}_1)}\left[\mathbb{E}_{\mathbf{a}_2\sim p_\theta(\mathbf{a}_2\vert\mathbf{s}_2)}\left[r(\mathbf{s}_2,\mathbf{a}_2)\right]\right]\right]\right] \end{align} $$
All expectations can be written in the form above. Adding them up yields:
$$ \begin{align} &\mathbb{E}_{\mathbf{s_1}\sim p(\mathbf{s}_1)}\left[\mathbb{E}_{\mathbf{a}_1\sim p_\theta(\mathbf{a}_1\vert\mathbf{s}_1)}\left[r(\mathbf{s}_1,\mathbf{a}_1)\right]\right] + \mathbb{E}_{\mathbf{s}_1\sim p(\mathbf{s}_1)}\left[\mathbb{E}_{\mathbf{a}_1\sim p_\theta(\mathbf{a}_1\vert \mathbf{s}_1) }\left[\mathbb{E}_{\mathbf{s}_2 \sim p_\theta(\mathbf{s}_2\vert\mathbf{s}_1,\mathbf{a}_1)}\left[\mathbb{E}_{\mathbf{a}_2\sim p_\theta(\mathbf{a}_2\vert\mathbf{s}_2)}\left[r(\mathbf{s}_2,\mathbf{a}_2)\right]\right]\right]\right] + \ldots\\ &= \mathbb{E}_{\mathbf{s_1}\sim p(\mathbf{s}_1)}\left[\mathbb{E}_{\mathbf{a}_1\sim p_\theta(\mathbf{a}_1\vert\mathbf{s}_1)}\left[r(\mathbf{s}_1,\mathbf{a}_1) + \mathbb{E}_{\mathbf{s}_2 \sim p_\theta(\mathbf{s}_2\vert\mathbf{s}_1,\mathbf{a}_1)}\left[\mathbb{E}_{\mathbf{a}_2\sim p_\theta(\mathbf{a}_2\vert\mathbf{s}_2)}\left[r(\mathbf{s}_2,\mathbf{a}_2)+\ldots \right]\right]\right]\right] \end{align} $$
This simply uses the fact that:
$$ \mathbb{E}_{c\sim p(c)}[f(a,c)]+\mathbb{E}_{c\sim p(c)}[f(b,c)] = \mathbb{E}_{c\sim p(c)}[f(a,c)+f(b,c)] $$
Suppose that we knew:
$$ Q(\mathbf{s}_1,\mathbf{a}_1) = r(\mathbf{s}_1,\mathbf{a}_1)+\mathbb{E}_{\mathbf{s}_2 \sim p_\theta(\mathbf{s}_2\vert\mathbf{s}_1,\mathbf{a}_1)}\left[\mathbb{E}_{\mathbf{a}_2\sim p_\theta(\mathbf{a}_2\vert\mathbf{s}_2)}\left[r(\mathbf{s}_2,\mathbf{a}_2)+\ldots \right]\right] $$
Then:
$$ \sum_{t=1}^T\mathbb{E}_{(\mathbf{s}_t,\mathbf{a}_t)\sim p_\theta(\mathbf{s}_t,\mathbf{a}_t)}\left[r(\mathbf{s}_t,\mathbf{a}_t)\right] = \mathbb{E}_{\mathbf{s}_1\sim p(\mathbf{s}_1)}\left[\mathbb{E}_{\mathbf{a}_1\sim p_\theta(\mathbf{a}_1\vert\mathbf{s}_1)}\left[Q(\mathbf{s}_1,\mathbf{a}_1)\right]\right] $$
This is quite a simple optimization problem. In order to maximize the inner expectation we only need to choose:
$$ \mathbf{a}_1 = \underset{\mathbf{a}}{\text{argmax}}\;Q(\mathbf{s}_1,\mathbf{a}) $$
i.e. we need to modify our policy \(\pi_\theta\) to choose \(\mathbf{a}_1\) such that \(Q(\mathbf{s}_1,\mathbf{a}_1)\) is maximized.
B. The Q-Function
Formally, we define the Q-function to be:
$$ Q^\pi(\mathbf{s}_t,\mathbf{a}_t) = \sum_{t'=t}^T \mathbb{E}_{(\mathbf{s}_{t'},\mathbf{a}_{t'})\sim \pi_\theta(\mathbf{s}_{t'},\mathbf{a}_{t'}\vert \mathbf{s}_{t},\mathbf{a}_{t})}\left[r(\mathbf{s}_{t'},\mathbf{a}_{t'})\right] $$
which is simply the total expected reward from taking action \(\mathbf{a}_t\) in state \(\mathbf{s}_t\) and then following the policy \(\pi\).
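A minimal sketch (hypothetical tabular MDP, finite horizon, made-up numbers) of estimating \(Q^\pi(\mathbf{s}_t,\mathbf{a}_t)\) by Monte Carlo: take \(\mathbf{a}_t\) in \(\mathbf{s}_t\), then follow \(\pi\) and average the summed rewards over many rollouts:

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, horizon = 3, 2, 5
P = rng.random((n_states, n_states, n_actions)); P /= P.sum(0)  # P[s', s, a]
r = rng.random((n_states, n_actions))                           # r[s, a]
pi = rng.random((n_actions, n_states)); pi /= pi.sum(0)         # pi[a, s] = pi(a | s)

def mc_q(s, a, n_rollouts=2000):
    """Average return from taking action a in state s, then following pi."""
    totals = np.zeros(n_rollouts)
    for i in range(n_rollouts):
        cur_s, cur_a, total = s, a, 0.0
        for _ in range(horizon):
            total += r[cur_s, cur_a]
            cur_s = rng.choice(n_states, p=P[:, cur_s, cur_a])  # next state
            cur_a = rng.choice(n_actions, p=pi[:, cur_s])       # next action from pi
        totals[i] = total
    return totals.mean()

print(mc_q(0, 1))  # estimated Q^pi(s = 0, a = 1)
```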
C. The Value Function
We define the value function to be:
$$ V^\pi(\mathbf{s}_t)=\sum_{t'=t}^T \mathbb{E}_{(\mathbf{s}_{t'},\mathbf{a}_{t'})\sim \pi_\theta(\mathbf{s}_{t'},\mathbf{a}_{t'}\vert \mathbf{s}_{t})}\left[r(\mathbf{s}_{t'},\mathbf{a}_{t'})\right]\\ $$
which is simply the total expected reward obtained by following policy \(\pi\) starting from state \(\mathbf{s}_t\).
Note that:
$$ V^\pi(\mathbf{s}_t) = \mathbb{E}_{\mathbf{a}_t \sim \pi_\theta(\mathbf{a}_t\vert\mathbf{s}_t)}\left[Q(\mathbf{s}_t,\mathbf{a}_t)\right] $$
as:
$$ \begin{align} \mathbb{E}_{\mathbf{a}_t \sim \pi_\theta(\mathbf{a}_t\vert\mathbf{s}_t)}\left[Q(\mathbf{s}_t,\mathbf{a}_t)\right] &= \int_{\mathbf{a}_t}p(\mathbf{a}_t\vert \mathbf{s}_t)\sum_{t'=t}^T \mathbb{E}_{(\mathbf{s}_{t'},\mathbf{a}_{t'})\sim \pi_\theta(\mathbf{s}_{t'},\mathbf{a}_{t'}\vert \mathbf{s}_{t},\mathbf{a}_{t})}\left[r(\mathbf{s}_{t'},\mathbf{a}_{t'})\right]d\mathbf{a}_t\\ &= \sum_{t'=t}^T \int_{\mathbf{a}_t}\int_{\mathbf{s}_{t'}}\int_{\mathbf{a}_{t'}}p(\mathbf{a}_t\vert \mathbf{s}_t)\pi_\theta(\mathbf{s}_{t'},\mathbf{a}_{t'}\vert \mathbf{s}_{t},\mathbf{a}_{t})r(\mathbf{s}_{t'},\mathbf{a}_{t'})d\mathbf{s}_{t'}d\mathbf{a}_{t'}d\mathbf{a}_t\\ &= \sum_{t'=t}^T \int_{\mathbf{s}_{t'}}\int_{\mathbf{a}_{t'}}\pi_\theta(\mathbf{s}_{t'},\mathbf{a}_{t'}\vert \mathbf{s}_{t})r(\mathbf{s}_{t'},\mathbf{a}_{t'})d\mathbf{s}_{t'}d\mathbf{a}_{t'}\\ &= \sum_{t'=t}^T \mathbb{E}_{(\mathbf{s}_{t'},\mathbf{a}_{t'})\sim \pi_\theta(\mathbf{s}_{t'},\mathbf{a}_{t'}\vert \mathbf{s}_{t})}\left[r(\mathbf{s}_{t'},\mathbf{a}_{t'})\right]\\ &= V^\pi(\mathbf{s}_t) \end{align} $$
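The same relationship can be seen in a small tabular sketch (made-up dynamics, rewards, and policy): computing \(Q^\pi\) and \(V^\pi\) for a finite horizon by backward recursion (an equivalent, recursive way of evaluating the sums above), each \(V^\pi(\mathbf{s})\) is exactly the policy-weighted average of \(Q^\pi(\mathbf{s},\cdot)\):

```python
import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions, horizon = 3, 2, 4
P = rng.random((n_states, n_states, n_actions)); P /= P.sum(0)  # P[s', s, a]
r = rng.random((n_states, n_actions))                           # r[s, a]
pi = rng.random((n_actions, n_states)); pi /= pi.sum(0)         # pi[a, s]

V = np.zeros(n_states)                     # value after the final step is zero
for t in range(horizon, 0, -1):
    # Q[s, a] = r(s, a) + E_{s' ~ p(. | s, a)}[V(s')]
    Q = r + np.einsum("ijk,i->jk", P, V)
    # V[s] = E_{a ~ pi(. | s)}[Q(s, a)]  -- the identity shown above
    V = np.einsum("as,sa->s", pi, Q)

print(Q)   # Q^pi(s, a) at the first time step
print(V)   # V^pi(s) at the first time step: the policy-average of each row of Q
```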
Note that:
$$ \begin{align} \mathbb{E}_{\mathbf{s}_1\sim p(\mathbf{s}_1)}\left[V^\pi(\mathbf{s}_1)\right] &= \int_{\mathbf{s}_1}p(\mathbf{s}_1)\sum_{t'=1}^T \mathbb{E}_{(\mathbf{s}_{t'},\mathbf{a}_{t'})\sim \pi_\theta(\mathbf{s}_{t'},\mathbf{a}_{t'}\vert \mathbf{s}_{1})}\left[r(\mathbf{s}_{t'},\mathbf{a}_{t'})\right]d\mathbf{s}_1\\ &= \sum_{t'=1}^T \int_{\mathbf{s}_1}\int_{\mathbf{s}_{t'}}\int_{\mathbf{a}_{t'}}p(\mathbf{s}_1)p(\mathbf{s}_{t'},\mathbf{a}_{t'}\vert \mathbf{s}_{1})r(\mathbf{s}_{t'},\mathbf{a}_{t'})d\mathbf{a}_{t'}d\mathbf{s}_{t'}d\mathbf{s}_1\\ &= \sum_{t'=1}^T \int_{\mathbf{s}_{t'}}\int_{\mathbf{a}_{t'}}p(\mathbf{s}_{t'},\mathbf{a}_{t'})r(\mathbf{s}_{t'},\mathbf{a}_{t'})d\mathbf{a}_{t'}d\mathbf{s}_{t'}\\ &= \sum_{t'=1}^T \mathbb{E}_{(\mathbf{s}_{t'},\mathbf{a}_{t'})\sim \pi_\theta(\mathbf{s}_{t'},\mathbf{a}_{t'})}\left[r(\mathbf{s}_{t'},\mathbf{a}_{t'})\right] \end{align} $$
which is the reinforcement learning objective function.
D. Using Q and Value Functions
We present two ideas surrounding Q and value functions that are often used in reinforcement learning:
Idea 1: Improve the Policy
If we know \(Q^\pi(\mathbf{s},\mathbf{a})\), we can improve \(\pi\) by setting \(\pi'(\mathbf{a}\vert \mathbf{s})=1\) if \(\mathbf{a}=\underset{\mathbf{a}}{\text{argmax}}\;Q^\pi(\mathbf{s},\mathbf{a})\) (and \(0\) otherwise). \(\pi'\) will always be at least as good as \(\pi\).
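A minimal sketch of this with a hypothetical Q-table (made-up values): the improved policy puts all of its probability on the action with the largest Q-value in each state:

```python
import numpy as np

Q = np.array([                 # Q[s, a] for 3 states and 2 actions (made-up values)
    [1.0, 2.0],
    [0.5, 0.1],
    [3.0, 3.5],
])

n_states, n_actions = Q.shape
pi_improved = np.zeros((n_states, n_actions))          # pi_improved[s, a] = pi'(a | s)
pi_improved[np.arange(n_states), Q.argmax(axis=1)] = 1.0   # greedy w.r.t. Q
print(pi_improved)  # deterministic policy: pi'(a | s) = 1 for the argmax action
```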
Idea 2: Increase the Probability of Good Actions
If \(Q^\pi(\mathbf{s},\mathbf{a})>V^\pi(\mathbf{s})\), then \(\mathbf{a}\) is better than the average action under \(\pi\) in state \(\mathbf{s}\). We can thus modify \(\pi(\mathbf{a}\vert \mathbf{s})\) so that it assigns a higher probability to \(\mathbf{a}\).
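A small sketch of this idea with made-up numbers: compute the difference \(Q^\pi(\mathbf{s},\mathbf{a}) - V^\pi(\mathbf{s})\) (usually called the advantage) and shift probability mass toward actions where it is positive. The exponential reweighting used here is just one simple illustrative choice, not a specific algorithm from the lecture:

```python
import numpy as np

Q = np.array([                  # Q[s, a], hypothetical values
    [1.0, 2.0],
    [0.5, 0.1],
])
pi = np.array([                 # pi[s, a], current action probabilities
    [0.5, 0.5],
    [0.5, 0.5],
])

V = (pi * Q).sum(axis=1, keepdims=True)    # V(s) = E_{a ~ pi(.|s)}[Q(s, a)]
A = Q - V                                  # positive entries: better than average
new_pi = pi * np.exp(A)                    # upweight better-than-average actions
new_pi /= new_pi.sum(axis=1, keepdims=True)
print(A)
print(new_pi)  # probability mass shifts toward the higher-Q action in each state
```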
7. Why Do We Have So Many RL Algorithms?
When choosing an RL algorithm, we often need to consider the following points:
- Sample Efficiency: How many samples are required to get a good policy?
- Off-Policy: Can the policy be improved without having to generate new samples each time it is modified, even slightly? The opposite of off-policy is on-policy.
- Convergence: Supervised learning is almost always gradient descent on a well-defined objective, so it usually converges. Reinforcement learning algorithms are often not pure gradient descent, so different algorithms have different convergence behaviour.
- Assumptions: Different RL algorithms make different assumptions such as:
- Full observability, i.e. access to the true state of the system (partial observability can be mitigated through recurrence)
- Stochastic or deterministic
- Continuous (smooth) or discrete
- Episodic (finite) or infinite horizon
- Different things are easier or harder in different settings: Is it easier to represent the policy, or is it easier to estimate the returns?
Note: I have omitted the brief overview of different RL algorithms presented in the lecture, as any such overview would be quite vague at this point in the notes.