I. Introduction

Suppose that a person is asked to move from point A to B in a minimum amount of time. This requires a series of actions to be taken. If there are several different routes to get to point B from A, then the goal in this case is to choose the shortest one.

In reinforcement learning (RL), we assign rewards to the possible actions in each state. The aim is then to move from an initial state to an end (desired) state in such a way that the cumulative reward is maximized. The rule that dictates which action to take in each state is known as a policy. In other words, the goal of reinforcement learning is to find a policy that maximizes the cumulative reward.

II. Notation

We denote the state of the system and the action it takes at time step \(t\) with \(s_t\) and \(a_t\). We also define \(\pi_\theta\) to be our policy (\(\theta\) are its parameters). \(\pi_\theta\) essentially models the distribution \(p(a_t \vert s_t)\).

In some cases, the policy may only have access to a distorted version of the states, which we shall refer to as the observations (and denote at each time step with \(o_t\)). In this case, the policy models the distribution \(p(a_t \vert o_t)\). Such policies are known as partially-observed policies (in contrast to the earlier fully-observed policies).
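To make the notation concrete, here is a minimal sketch of a stochastic policy \(\pi_\theta(a_t \vert o_t)\) over discrete actions, written in PyTorch. The architecture, dimensions and names (SoftmaxPolicy, obs_dim, n_actions) are illustrative assumptions rather than part of any particular algorithm.

```python
import torch
import torch.nn as nn

class SoftmaxPolicy(nn.Module):
    """A minimal pi_theta(a_t | o_t): maps an observation to a categorical
    distribution over a discrete set of actions."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        logits = self.net(obs)                       # unnormalized log-probabilities
        return torch.distributions.Categorical(logits=logits)

policy = SoftmaxPolicy(obs_dim=4, n_actions=2)
dist = policy(torch.randn(1, 4))                     # the distribution p(a_t | o_t)
action = dist.sample()                               # a_t ~ pi_theta(. | o_t)
```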

III. The Markov Property

The Markov property states that the probability of a state in the future is independent of the states in the past given the current state. In the RL framework, we may express this mathematically as:

$$ p(s_{t+1} \vert s_t,a_t,s_{t-1},a_{t-1},...,s_1,a_1) = p(s_{t+1}\vert s_t,a_t) $$
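As a small illustration (with an arbitrary toy transition model), the sampler below draws \(s_{t+1}\) using only \((s_t, a_t)\); no earlier state or action is ever consulted, which is exactly what the equation above says.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

# P[s, a] is a distribution over next states, i.e. p(s_{t+1} | s_t, a_t).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def step(s_t: int, a_t: int) -> int:
    """Sample s_{t+1}. Only (s_t, a_t) is used -- no earlier history."""
    return int(rng.choice(n_states, p=P[s_t, a_t]))

s = 0
for t in range(5):
    a = int(rng.integers(n_actions))
    s = step(s, a)   # the past influences the future only through the current s
```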

IV. Imitation Learning

One way in which the RL objective can be realized is by trying to imitate an expert. Consider the problem of driving a car. We may note down the actions that a human driver takes given the observations he or she makes of the surroundings. This would give us training data that may then be used to fit a policy with supervised learning. Such an approach is known as behavioral cloning.
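A minimal sketch of behavioral cloning as plain supervised learning, assuming a hypothetical dataset of recorded observations and expert actions (the random tensors below are stand-ins for real driving data):

```python
import torch
import torch.nn as nn

# Stand-ins for a recorded expert dataset of (o_t, a_t) pairs.
observations = torch.randn(1000, 4)              # hypothetical observation features
expert_actions = torch.randint(0, 2, (1000,))    # hypothetical discrete expert actions

policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()                  # ordinary supervised classification loss

for epoch in range(20):
    logits = policy(observations)
    loss = loss_fn(logits, expert_actions)       # penalize disagreement with the expert
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```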

However, the problem with this approach is that unless we imitate the expert perfectly, the policy \(\pi_\theta\) will make errors when it is run. An error may put the system in a state that it had not seen during training (for example, the car might drift off the road a bit). Because the training set did not contain such examples, the policy will not know what to do in this case. If it then makes a wrong decision, the system will find itself in a state even further from those in the training set. Errors thus accumulate over time. This is shown in Figure 1.

In other words, the distribution of observations in the training set and the distribution of observations encountered when the policy is run are not necessarily the same. Denote these distributions with \(p_{train}(o_t)\) and \(p_{\pi_\theta}(o_t)\), respectively. One way to make these two distributions equal is through Dataset Aggregation (DAgger), a minimal sketch of which follows the steps below:

  1. Train \(\pi_\theta\) from expert data \(\mathcal{D}_{train}=\{(o_1,a_1),...,(o_N,a_N)\}\)
  2. Run \(\pi_\theta\) to get dataset \(\mathcal{D}_\pi = \{o_1,...,o_M\}\)
  3. Ask expert to label \(\mathcal{D}_\pi\) with actions \(\{a_1,...,a_M\}\)
  4. Aggregate \(\mathcal{D_{train}} \leftarrow \mathcal{D_{train}} \cup \mathcal{D}_\pi\)
  5. Repeat
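Below is a minimal, self-contained sketch of the five steps above on a toy 1-D problem; the expert, the "policy" and the data are all illustrative stand-ins rather than anything from a real task.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a 1-D observation, an "expert" that thresholds it, and a
# nearest-neighbour "policy" fitted to (o, a) pairs by memorization.
def expert_label(o):
    return int(o > 0.0)

def train_supervised(dataset):
    obs = np.array([o for o, _ in dataset])
    acts = np.array([a for _, a in dataset])
    return lambda o: acts[np.argmin(np.abs(obs - o))]   # 1-nearest neighbour

def run_policy(policy, n=50):
    return rng.normal(size=n)    # stand-in for the observations seen at run time

# The DAgger loop itself:
D_train = [(o, expert_label(o)) for o in rng.normal(size=50)]    # 1. expert data
policy = train_supervised(D_train)
for _ in range(5):                                               # 5. repeat
    observations = run_policy(policy)                            # 2. run pi_theta, keep o_t
    D_train += [(o, expert_label(o)) for o in observations]      # 3./4. label and aggregate
    policy = train_supervised(D_train)
```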

While DAgger does address the problem of distributional drift, it introduces the added complication of having an expert label the data obtained from running the policy (step 3). This may be too time-consuming in some cases. Can we instead mimic the expert so accurately (without overfitting) the first time around that the problem of distributional drift never arises at all? There are several reasons why this may not be easy:

  1. Non-Markovian Behavior: Humans generally do not conform to the Markovian assumption. Each decision they make usually depends on past observations as well, which is something the policy does not have access to. This may be addressed using recurrent neural networks.
  2. Multimodal Behavior: Humans may choose to do different things in exactly the same situation for reasons (such as their mood) that may not depend upon the observations. When the actions are discrete-valued, every valid action can simply be assigned a high probability mass and all others a low one. In the continuous case, a neural network outputs the parameters of a distribution (for example, the mean and covariance of a Gaussian) and an action is then sampled from that distribution. To tackle multimodality, we may use one of the following:
    • Output a mixture of Gaussians: Define the policy to be \(\pi(a\vert o) = \sum_i w_i\mathcal{N}(\mu_i,\Sigma_i)\), where the weights \(w_i\), means \(\mu_i\) and covariances \(\Sigma_i\) are all outputs of a neural network (a code sketch follows this list).
    • Latent variable models: These models inject random noise into the input of the neural network. The network still outputs only a single distribution, but it can use the noise to choose among the valid modes.
    • Autoregressive discretization: This discretizes the continuous actions. However, to avoid the curse of dimensionality, it discretizes only one dimension at a time. It begins by discretizing the first dimension of the action and having a neural network output a probability distribution (via a softmax) over its discrete bins. It then samples a value from this distribution and feeds it to a second neural network, which outputs a probability distribution over the discrete bins of the second dimension. A value is sampled from this distribution and fed to a third network, and so on for all dimensions of the action.
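Here is a minimal sketch of the mixture-of-Gaussians output head described above, using torch.distributions; the number of components, layer sizes and the choice of diagonal covariances are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal

class MixturePolicy(nn.Module):
    """pi(a | o) = sum_i w_i N(mu_i, Sigma_i) with diagonal covariances.
    The network outputs the mixture weights w_i, the means mu_i and the (log) stds."""

    def __init__(self, obs_dim: int, act_dim: int, n_components: int = 5):
        super().__init__()
        self.k, self.d = n_components, act_dim
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.w_head = nn.Linear(64, self.k)                  # mixture logits
        self.mu_head = nn.Linear(64, self.k * self.d)        # component means
        self.log_std_head = nn.Linear(64, self.k * self.d)   # component log stds

    def forward(self, obs: torch.Tensor) -> MixtureSameFamily:
        h = self.trunk(obs)
        mix = Categorical(logits=self.w_head(h))
        mu = self.mu_head(h).view(-1, self.k, self.d)
        std = self.log_std_head(h).view(-1, self.k, self.d).exp()
        comp = Independent(Normal(mu, std), 1)    # diagonal Gaussian per component
        return MixtureSameFamily(mix, comp)

policy = MixturePolicy(obs_dim=10, act_dim=3)
dist = policy(torch.randn(4, 10))
actions = dist.sample()                  # shape (4, 3): one action per observation
log_probs = dist.log_prob(actions)       # usable as a behavioral-cloning loss
```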

While imitation learning is able to solve a wide variety of tasks, there are several problems with it:

  1. Humans need to provide data, which is typically finite.
  2. Humans are not good at providing certain actions (such as controlling a robot with 20 legs and 8 arms).
  3. Humans can learn autonomously and continuously self-improve, so why cannot machines do the same?

V. Reward and Cost Functions

In machine learning, we are usually interested in defining a cost function, \(c\), that can be minimized. In the case of reinforcement learning, the optimization problem that we are interested in is:

$$ \underset{a_1,...,a_T}{\min}\sum_{t=1}^T c(s_t,a_t) \textit{ s.t. } s_t=f(s_{t-1},a_{t-1}) $$

Alternatively, we may maximize a reward function:

$$ \underset{a_1,...,a_T}{\max}\sum_{t=1}^T r(s_t,a_t) \textit{ s.t. } s_t=f(s_{t-1},a_{t-1}) $$

Note that \(c(s_t,a_t)=-r(s_t,a_t)\).
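To make the finite-horizon objective concrete, the brute-force sketch below enumerates every action sequence of a toy deterministic problem and keeps the one that minimizes the summed cost (equivalently, maximizes the summed reward); the states, dynamics and cost are arbitrary illustrative choices.

```python
from itertools import product

GOAL, T = 4, 4                           # toy goal state and horizon

def f(s, a):
    return max(0, min(4, s + a))         # deterministic dynamics s_t = f(s_{t-1}, a_{t-1})

def cost(s, a):
    return abs(s - GOAL)                 # c(s_t, a_t); the reward is r = -c

best_actions, best_total = None, float("inf")
for actions in product([-1, +1], repeat=T):       # enumerate a_1, ..., a_T
    s, total = 0, 0                               # fixed initial state
    for a in actions:
        total += cost(s, a)                       # accumulate c(s_t, a_t)
        s = f(s, a)                               # transition to the next state
    if total < best_total:
        best_actions, best_total = actions, total

print(best_actions, best_total)          # minimizing cost == maximizing reward
```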

VI. An Analysis of Distributional Drift

In the section on imitation learning above, we talked about the distributional drift problem. We now present a mathematical analysis of this problem.

A. A Simple Analysis

Let \(\pi^*\) denote the expert policy that we are trying to imitate. Define the cost function as follows:

$$ c(s,a) = \begin{cases} 0 & \text{if } a=\pi^*(s)\\ 1 & \text{otherwise} \end{cases} $$

Let \(\pi_\theta\) denote the policy that we obtained through imitation learning. Assume that for all \(s \in \mathcal{D}_{train}\):

$$ \pi_\theta(a \neq \pi^*(s) \vert s) \leq \epsilon $$

i.e. the probability that the learnt policy differs from the expert policy on a state that was present in the training data is at most \(\epsilon\). Furthermore, we assume that once \(\pi_\theta\) has made an error, it will never recover, i.e. it will continue to make errors.

Consider the finite-horizon case in which \(T\) actions are to be taken. Given that the initial state is drawn from \(\mathcal{D}_{train}\), it can be seen that:

$$ \mathbb{E}[\sum_{t=1}^Tc(s_t,a_t)] \leq \epsilon T + (1-\epsilon)\mathbb{E}[\sum_{t=2}^Tc(s_t,a_t)] $$

If \(\pi_\theta\) makes a mistake on the first time step, then according to the assumption above the total cost incurred will be \(T\). The probability of this happening is \(\epsilon\), which is what the \(\epsilon T\) term represents. The \(\leq\) is used because if the policy happens to recover to a state drawn from \(\mathcal{D}_{train}\) after making an error, the cost will be smaller. Recursively evaluating the inequality above gives:

$$ \begin{align} \mathbb{E}[\sum_{t=1}^Tc(s_t,a_t)] &\leq \epsilon T + (1-\epsilon)(\epsilon (T-1) +(1-\epsilon)(...))\\ &\leq \epsilon T^2 \end{align} $$
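Unrolling the recursion makes this explicit: the term picked up at each step is at most \(\epsilon T\), and there are \(T\) such terms:

$$ \mathbb{E}[\sum_{t=1}^Tc(s_t,a_t)] \leq \sum_{k=0}^{T-1}(1-\epsilon)^k\,\epsilon\,(T-k) \leq \sum_{k=0}^{T-1}\epsilon T = \epsilon T^2 $$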

B. A More General Analysis

Let us assume that for all \(s \sim p_{train}\):

$$ \pi_\theta(a \neq \pi^*(s)\vert s) \leq \epsilon $$

This is a stronger assumption than the one made in the simpler analysis, as we now assume that the above inequality holds over the entire distribution from which the training set was drawn (rather than only over the training set itself). However, this is reasonable because of the generalization bounds provided by learning theory.

Define the cost function as in the simpler analysis and denote \(c(s_t,a_t)\) with \(c_t\).

$$ c(s,a) = \begin{cases} 0 & \text{if } a=\pi^*(s)\\ 1 & \text{otherwise} \end{cases} $$

Note that:

$$ \begin{align} \mathbb{E}_{p_{train}}[c_t] &= p_{train}(a_t\neq \pi^*(s_t))c(a_t \neq \pi^*(s_t)) + p_{train}(a_t= \pi^*(s_t))c(a_t = \pi^*(s_t))\\ &\leq\epsilon \end{align} $$

With DAgger we know that \(p_{\pi_\theta} \rightarrow p_{train}\) and so:

$$ \mathbb{E}_{p_{\pi_\theta}}[\sum_{t=1}^Tc(s_t,a_t)] = \sum_{t=1}^T\mathbb{E}_{p_{\pi_\theta}}[c(s_t,a_t)] \leq \epsilon T $$

For the case when \(p_{train} \neq p_{\pi_\theta}\), assume, as before, that once \(\pi_\theta\) has made an error it will never recover. Suppose that we run our policy \(\pi_\theta\). Then the distribution of \(s_t\) (the state at time \(t\)) may be expressed as:

$$ p_{\pi_\theta}(s_t) = (1-\epsilon)^tp_{train}(s_t) + (1-(1-\epsilon)^t)p_{mistake}(s_t) $$

\((1-\epsilon)^t\) is the probability of \(\pi_\theta\) not making a single error up till time \(t\); in other words, it is the probability that \(s_t\) is still distributed according to the training distribution \(p_{train}\). Similarly, \(1-(1-\epsilon)^t\) is the probability of making at least one error in the first \(t\) time steps, in which case \(s_t\) is no longer distributed according to the training distribution. We denote this alternative (and possibly complicated) distribution with \(p_{mistake}\).

The total variation divergence between two probability distributions is given by:

$$ \vert P(\mathbf{s})-Q(\mathbf{s})\vert = \sum_s \vert P(s)-Q(s)\vert $$

i.e. it is the sum of the absolute values of the differences between \(P\) and \(Q\) over all \(s\). Note that the maximum total variation divergence is \(2\). This corresponds to the case when \(P(s_i)=Q(s_j)=1\) for some \(i \neq j\); in this case \(P(s_k) =0\; \forall k\neq i\) and \(Q(s_k)=0\; \forall k\neq j\) because probabilities must sum to \(1\).
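A few lines of code confirm this, using the same (unhalved) convention as the definition above, so that the maximum value is \(2\):

```python
import numpy as np

def total_variation(p: np.ndarray, q: np.ndarray) -> float:
    """Sum of absolute differences, matching the convention used above
    (so the maximum possible value is 2, not 1)."""
    return float(np.abs(p - q).sum())

p = np.array([1.0, 0.0, 0.0])   # all mass on s_1
q = np.array([0.0, 1.0, 0.0])   # all mass on s_2
print(total_variation(p, q))    # 2.0, the maximum possible divergence
print(total_variation(p, p))    # 0.0, identical distributions
```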

The total variation divergence between \(p_{\pi_\theta}\) and \(p_{train}\) is given by:

$$ \begin{align} \sum_{s_t}\vert p_{\pi_\theta}(s_t) - p_{train}(s_t)\vert &= (1-(1-\epsilon)^t)\sum_{s_t}\vert p_{mistake}(s_t)-p_{train}(s_t)\vert\\ &\leq 2(1-(1-\epsilon)^t)\\ &\leq 2\epsilon t \end{align} $$

where the last step used the identity:

$$ (1-\epsilon)^t \geq 1-\epsilon t \; \forall \epsilon \in [0,1] $$
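This is Bernoulli's inequality; it follows by induction on \(t\), since the base case \((1-\epsilon)^0 = 1 = 1-\epsilon\cdot 0\) holds and

$$ (1-\epsilon)^{t+1} = (1-\epsilon)^t(1-\epsilon) \geq (1-\epsilon t)(1-\epsilon) = 1-\epsilon(t+1)+\epsilon^2 t \geq 1-\epsilon(t+1) $$

where the first inequality uses the inductive hypothesis together with \(1-\epsilon \geq 0\).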

Note that:

$$ \begin{align} \mathbb{E}_{p_{\pi_\theta}}\left[\sum_{t=1}^T c_t\right] &= \sum_{t=1}^T\mathbb{E}_{p_{\pi_\theta}}\left[c_t\right]\\ &= \sum_{t=1}^T\sum_{s_t}p_{\pi_\theta}(s_t)c_t\\ &= \sum_{t=1}^T\sum_{s_t}p_{train}(s_t)c_t + (p_{\pi_\theta}(s_t)-p_{train}(s_t))c_t\\ &\leq \sum_{t=1}^T\sum_{s_t}p_{train}(s_t)c_t + \vert p_{\pi_\theta}(s_t)-p_{train}(s_t)\vert c_t\\ &\leq \sum_{t=1}^T\sum_{s_t}p_{train}(s_t)c_t + \vert p_{\pi_\theta}(s_t)-p_{train}(s_t)\vert c_{max}\\ &\leq \sum_{t=1}^T\mathbb{E}_{p_{train}}[c_t] + \sum_{s_t}\vert p_{\pi_\theta}(s_t)-p_{train}(s_t)\vert\\ &\leq \sum_{t=1}^T \epsilon + 2\epsilon t\\ &\leq \epsilon T + 2\epsilon T^2 \end{align} $$

The fourth line follows from the fact that \(p_{\pi_\theta}(s_t)-p_{train}(s_t)\) is bounded by its absolute value. The fifth line uses the fact that \(c_t \leq c_{max}\). The sixth line just rearranges the terms and notes that \(c_{max}=1\). The last line uses the fact that \(t \leq T\; \forall t\).