1. Exploration and Exploitation

Consider the following two problems:

  1. How can an agent discover high-reward strategies that require a temporally extended sequence of complex behaviors that, individually, are not rewarding?
  2. How can an agent decide whether to attempt new behaviors (to discover ones with higher reward) or continue to do the best thing it knows so far?

The first of these problems is that of exploitation. In this case, while each individual action may not be rewarding, the agent knows that collectively these actions will yield a high reward over time. The agent is thus exploiting its knowledge about the state-action space. The second problem is that of exploration. In this case, the agent tries new actions in the hope of finding ones with higher reward. The agent is thus exploring the state-action space.

2. Bandits

We will mainly study exploration problems in the context of bandits. A bandit is a simple slot machine: an agent repeatedly pulls the lever of the machine and receives a certain amount of reward. The \(K\)-armed bandit (or multi-armed bandit) problem generalizes this concept to \(K\) such machines. At each time step, the agent pulls the lever on one of the \(K\) bandits and receives a certain amount of reward. The goal of the agent is to devise a policy that obtains the maximum possible reward.

Let \(r(\mathbf{a}_i)\) denote the reward that the agent gets by pulling the lever on the \(i\)th bandit. We assume that \(r(\mathbf{a}_i)\) is sampled from some probability distribution parameterized by \(\theta_i\), i.e. \(r(\mathbf{a}_i) \sim p_{\theta_i}(r(\mathbf{a}_i))\). Initially, the \(\theta_i\) are unknown. Each time the agent takes an action and receives a reward, it can refine its guess about the values of the \(\theta_i\). At each time step, the agent thus maintains a belief \(\hat{p}(\theta_1,\ldots,\theta_K)\) over the values of the \(\theta_i\).
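
As a concrete illustration, here is a minimal sketch of a \(K\)-armed Bernoulli bandit, where each \(\theta_i\) is the (hidden) probability that arm \(i\) pays a reward of 1; the class name and interface are illustrative choices, not part of the formal setup above:

```python
import numpy as np

class BernoulliBandit:
    """K-armed bandit whose i-th arm pays reward 1 with probability theta_i."""

    def __init__(self, thetas, seed=0):
        self.thetas = np.asarray(thetas)        # true (hidden) parameters theta_1..theta_K
        self.rng = np.random.default_rng(seed)

    def pull(self, i):
        # r(a_i) ~ Bernoulli(theta_i)
        return float(self.rng.random() < self.thetas[i])
```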

We may define a POMDP for this process by letting \([\theta_1,\ldots,\theta_K]\) be the state. Note that we only have one state in this case: whatever action we take, the state remains the same. Also note that this is a partially observed MDP (and not a simple MDP) since we do not know the true values of the \(\theta_i\) (i.e. the state).

We measure how good exploration algorithms are through a quantity called regret. Regret is simply the difference between the expected reward of the best action \(\mathbf{a}^*\) and the reward of the actions actually taken over a horizon of \(T\) (note that since we only have one state, the optimal policy will always choose the action that yields the highest reward in expectation over and over again):

$$ \text{Reg}(T) = T\mathbb{E}[r(\mathbf{a}^*)] - \sum_{t=1}^T r(\mathbf{a}_t) $$
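
Continuing the Bernoulli sketch above, the regret of a run can be computed by comparing against the best arm's expected reward, which for a Bernoulli bandit is simply \(\max_i \theta_i\) (a sketch under that assumption):

```python
def regret(bandit, rewards):
    """Regret over T steps: T * E[r(a*)] minus the rewards actually collected."""
    T = len(rewards)
    return T * bandit.thetas.max() - sum(rewards)
```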

We will now discuss some strategies for exploration that work very well in practice.

3. Optimistic Exploration

The intuition behind optimistic exploration is to try each action until one is sure that it is not great. To do so, we keep track of the average reward \(\hat\mu_{\mathbf{a}}\) for each action \(\mathbf{a}\). For exploitation, we then choose:

$$ \mathbf{a} = \text{argmax}_{\mathbf{a}} \hat\mu_{\mathbf{a}} $$

and for exploration we choose:

$$ \mathbf{a} = \text{argmax}_{\mathbf{a}} \hat\mu_{\mathbf{a}} + C\sigma_{\mathbf{a}} $$

Note that the \(C\sigma_{\mathbf{a}}\) term allows the agent to explore new actions. One example of it is:

$$ C\sigma_{\mathbf{a}} = \sqrt{\frac{2\ln T}{N(\mathbf{a})}} $$

where \(N(\mathbf{a})\) is the number of times action \(\mathbf{a}\) was taken.
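
A minimal sketch of this upper-confidence-bound rule for the Bernoulli bandit above, pulling each arm once first so that \(N(\mathbf{a}) > 0\), and using the current step \(t\) in place of \(T\) inside the bonus (an illustrative implementation choice):

```python
import numpy as np

def ucb(bandit, K, T):
    counts = np.zeros(K)                  # N(a)
    means = np.zeros(K)                   # running average reward mu_hat(a)
    rewards = []
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                     # initialize: pull each arm once
        else:
            bonus = np.sqrt(2 * np.log(t) / counts)
            a = int(np.argmax(means + bonus))
        r = bandit.pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]   # incremental mean update
        rewards.append(r)
    return rewards
```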

4. Probability Matching or Posterior/Thompson Sampling

The main idea of this approach is summarized in the following algorithm:

  1. Sample \(\theta_1,\ldots,\theta_K \sim \hat{p}(\theta_1,\ldots,\theta_K)\) where \(\hat p\) is our belief over the state.
  2. Pretend \(\theta_1,\ldots,\theta_K\) are the correct values for the state.
  3. Take the optimal action.
  4. Update \(\hat{p}\).
  5. Repeat.

\(\hat p(\theta_1,\ldots,\theta_K)\) could, for example, simply be taken proportional to the total reward obtained when acting as if the state were \([\theta_1,\ldots,\theta_K]\).
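
For the Bernoulli bandit above, a standard concrete choice of belief is an independent Beta posterior per arm, which makes the sample-act-update loop very short; the uniform Beta(1, 1) prior is an illustrative assumption:

```python
import numpy as np

def thompson_sampling(bandit, K, T):
    alpha, beta = np.ones(K), np.ones(K)          # Beta(alpha_i, beta_i) belief over each theta_i
    rewards = []
    for _ in range(T):
        theta_hat = np.random.beta(alpha, beta)   # 1. sample a plausible state
        a = int(np.argmax(theta_hat))             # 2-3. act optimally as if it were correct
        r = bandit.pull(a)
        alpha[a] += r                             # 4. update the belief
        beta[a] += 1 - r
        rewards.append(r)
    return rewards
```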

5. Information Gain

The intuition behind this approach is to only choose actions that will yield the most new information about the \(\theta_i\). Consider the following experiment:

We want to determine the values of the \(\theta_i\). Let \(\mathcal{H}(\hat p(\theta_1,\ldots,\theta_K))\) be the current entropy of our belief about the \(\theta_i\). Note that the lower this entropy is, the more precisely we know the values of the \(\theta_i\). Suppose we take an action \(\mathbf{a}\) and observe the resulting reward \(r(\mathbf{a})\). Denote the entropy of our belief after making this observation by \(\mathcal{H}(\hat p(\theta_1,\ldots,\theta_K)\vert r(\mathbf{a}))\). Intuitively, we would like to choose the action \(\mathbf{a}\) that lowers the entropy of our belief about the \(\theta_i\) the most, i.e. that maximizes the difference between the two entropies. The problem is that we do not know the reward \(r(\mathbf{a})\) we will get upon taking action \(\mathbf{a}\): if we knew the rewards of all actions there would be no need to explore, and we cannot simply take every action and observe its reward before choosing which action to take. To get around this, we compute the difference in expectation over \(r(\mathbf{a})\) using our current belief about it (which can, for example, be estimated simply by keeping a record of how often action \(\mathbf{a}\) yielded each reward in previous time steps):

$$ \text{IG}([\theta_1,\ldots,\theta_K],r(\mathbf{a})) = \mathbb{E}_{r(\mathbf{a})}\left[\mathcal{H}(\hat p(\theta_1,\ldots,\theta_K)) - \mathcal{H}(\hat p(\theta_1,\ldots,\theta_K)\vert r(\mathbf{a}))\right] $$
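
As a rough illustration for a single Bernoulli arm with a Beta belief over its parameter, the expected information gain can be approximated by averaging the entropy drop over the two possible rewards, weighted by their predicted probabilities. The Beta belief and the grid discretization of the entropy are illustrative assumptions, not part of the general formula:

```python
import numpy as np
from scipy.stats import beta as beta_dist

def expected_info_gain(a, b, grid=np.linspace(1e-3, 1 - 1e-3, 999)):
    """Expected entropy reduction of a Beta(a, b) belief over one Bernoulli arm."""
    def entropy(a_, b_):
        p = beta_dist.pdf(grid, a_, b_)
        p /= p.sum()                                # discretized belief over theta
        return -(p * np.log(p + 1e-12)).sum()

    h_now = entropy(a, b)
    p_success = a / (a + b)                         # predicted probability of reward 1
    # average posterior entropy over the two possible observations r = 1 and r = 0
    h_after = p_success * entropy(a + 1, b) + (1 - p_success) * entropy(a, b + 1)
    return h_now - h_after
```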

In the previous three sections, we looked at different exploration strategies in the context of bandit problems. We will now discuss each of these strategies for complex high dimensional MDPs.

6. Optimistic Exploration In RL

Previously, we have seen that optimistic exploration chooses actions according to:

$$ \mathbf{a} = \text{argmax}_{\mathbf{a}} \hat\mu_{\mathbf{a}} + C\sigma_{\mathbf{a}} $$

One example of \(C\sigma_{\mathbf{a}}\) is:

$$ C\sigma_{\mathbf{a}} = \sqrt{\frac{2\ln T}{N(\mathbf{a})}} $$

It turns out that many functions work for \(C\sigma_\mathbf{a}\), as long as they decrease with \(N(\mathbf{a})\). We can use this idea of an ‘exploration bonus’ in MDPs by modifying our reward function as follows:

$$ r^+(\mathbf{s},\mathbf{a}) = r(\mathbf{s},\mathbf{a}) + \mathcal{B}(N(\mathbf{s})) $$

where \(\mathcal{B}\) is the exploration bonus. Note that \(\mathcal B\) can depend on both the state and action counts; however, for simplicity we assume that it depends only on the state counts. We can now simply plug the new reward function \(r^+(\mathbf{s},\mathbf{a})\), in place of \(r(\mathbf{s},\mathbf{a})\), into any model-free algorithm.
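
A minimal sketch of this reward shaping with the simple bonus \(\mathcal{B}(N(\mathbf{s})) = 1/\sqrt{N(\mathbf{s})}\), assuming discrete (hashable) states; the bonus weight is an illustrative hyperparameter:

```python
from collections import defaultdict
import numpy as np

state_counts = defaultdict(int)

def bonus_reward(s, r, weight=0.1):
    """Augment the environment reward r with an exploration bonus that decays with N(s)."""
    state_counts[s] += 1
    return r + weight / np.sqrt(state_counts[s])
```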

However, there is one problem with the above formulation. In high-dimensional state spaces it is very unlikely that we see the same state twice; however, we do often see states that are similar to one another. We must take this observation into account. One way of doing so is to fit a density model \(p_\theta(\mathbf{s})\) that captures these similarities. \(p_\theta(\mathbf{s})\) essentially models how likely we are to have seen state \(\mathbf{s}\). Thus, \(p_\theta(\mathbf{s})\) might be high even for a state that we have not visited, if that state is similar to ones we have seen before. We will talk about different choices for \(p_\theta(\mathbf{s})\) shortly. For now, assume that we have this density model. The algorithm that we use is as follows:

Repeat until convergence:

  1. Fit the model \(p_\theta(\mathbf{s})\) to all states \(\mathcal D\) seen so far.
  2. Take a step \(i\) and observe \(\mathbf{s}_i\).
  3. Fit a new model \(p_{\theta'}\) to \(\mathcal{D} \cup \{\mathbf{s}_i\}\).
  4. Use \(p_\theta(\mathbf{s})\) and \(p_{\theta'}(\mathbf{s})\) to estimate \(\hat N(\mathbf{s}_i)\).
  5. Set \(r^+(\mathbf{s}_i) = r(\mathbf{s}_i) + \mathcal{B}(\hat N(\mathbf{s}_i))\).

Step 4 is performed by noting that:

$$ \begin{align} p_\theta(\mathbf{s}_i) &= \frac{\hat{N}(\mathbf{s}_i)}{\hat{n}}\\ p_{\theta'}(\mathbf{s}_i) &= \frac{\hat{N}(\mathbf{s}_i)+1}{\hat{n}+1} \end{align} $$

where \(\hat n\) is the total number of states visited. Solving these two equations for the two unknowns \(\hat n\) and \(\hat{N}(\mathbf{s}_i)\) yields:

$$ \begin{align} \hat{N}(\mathbf{s}_i) &= \hat{n}\,p_\theta(\mathbf{s}_i)\\ \hat{n} &= \frac{1-p_{\theta'}(\mathbf{s}_i)}{p_{\theta'}(\mathbf{s}_i)-p_{\theta}(\mathbf{s}_i)} \end{align} $$
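
In code, step 4 then reduces to a couple of lines; the small epsilon is an illustrative guard against division by zero when the two density values are nearly equal:

```python
def pseudo_count(p_old, p_new, eps=1e-8):
    """Estimate N_hat(s_i) from the density of s_i before and after including it."""
    n_hat = (1.0 - p_new) / max(p_new - p_old, eps)   # pseudo-count total n_hat
    return n_hat * p_old                              # N_hat(s_i) = n_hat * p_theta(s_i)
```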

There are several different ways in which \(p_\theta(\mathbf{s})\) can be modeled. One idea is to use some hash function \(\phi(\mathbf{s})\) to compress each state \(\mathbf{s}\) that we see into a \(k\)-bit code. The number of times a state \(\mathbf{s}_i\) has been seen is then simply given by the number of times \(\phi(\mathbf{s}_i)\) has been seen. Note that the smaller \(k\) is, the more hash collisions there are. \(\phi(\mathbf{s})\) is chosen such that it assigns similar states the same hash. For example, one could use a standard autoencoder to obtain a latent representation of a state \(\mathbf{s}\), followed by some standard hash function. The latent representations of similar states obtained from the autoencoder will be close to each other and will hence likely receive the same hash.
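
A minimal sketch of such hash-based counting, using a random-projection (SimHash-style) code as the hash so that nearby states tend to share a bucket; the projection matrix and the bit length \(k\) are illustrative choices:

```python
from collections import defaultdict
import numpy as np

class HashCounter:
    def __init__(self, state_dim, k=16, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, state_dim))   # random projection defining phi(s)
        self.counts = defaultdict(int)

    def count(self, s):
        code = tuple(self.A @ np.asarray(s) > 0)       # k-bit code phi(s)
        self.counts[code] += 1
        return self.counts[code]                       # N(s) ~ number of times phi(s) was seen
```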

Another idea makes use of the following intuition: a state \(\mathbf{s}_i\) is different (novel) if it is easy to distinguish it from all previous states. Given a set of previously seen states \(\mathcal{D}\), we are interested in determining whether a new state \(\mathbf{s}_i\) belongs to \(\mathcal{D}\). To this end, we can simply train a discriminator \(\mathcal{D}_{\mathbf{s}}\) such that it outputs a \(1\) when it sees the state \(\mathbf{s}_i\) and a \(0\) when it sees states from \(\mathcal{D}\). Now, note that if the state \(\mathbf{s}_i\) has previously occurred and is a part of \(\mathcal{D}\), then the classifier will not be able to output either a \(0\) or a \(1\) for \(\mathbf{s}_i\); it will have to output some fraction in between. It can be shown that for a particular choice of the loss function, the optimal discriminator \(\mathcal{D}_{\mathbf{s}}^*\) will output:

$$ \mathcal{D}_{\mathbf{s}}^* = \frac{1}{1+p_\theta(\mathbf{s})} $$

The output of the discriminator can thus be used to calculate \(p_\theta(\mathbf{s})\). Of course, it is computationally very expensive to retrain a discriminator for each new state seen. One could, instead, train a single neural network that takes \(\mathbf{s}_i\) and \(\mathcal{D}\) as input and outputs \(\mathcal{D}_{\mathbf{s}}^*\). Finally, as each state seen in a continuous environment is, technically, distinct, some sort of regularization is usually required to ensure that the discriminator focuses on the similarities between different states. For example, the states could be encoded via an autoencoder before being fed to the discriminator.
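
Under the stated form of the optimal discriminator, the implied density follows by simply inverting \(\mathcal{D}_{\mathbf{s}}^* = 1/(1+p_\theta(\mathbf{s}))\); a tiny helper (sketch) turns discriminator outputs into densities for the exploration bonus:

```python
def density_from_discriminator(d_star, eps=1e-6):
    """Invert D*(s) = 1 / (1 + p_theta(s)) to recover p_theta(s)."""
    return (1.0 - d_star) / max(d_star, eps)   # eps guards against a (near-)zero output
```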

7. Posterior Sampling in RL

In our discussion above on posterior sampling, we maintained a belief distribution \(\hat p\) over the parameters of the reward function \([\theta_1,\ldots,\theta_K]\). We then sampled a set of parameters from this distribution, and took the optimal action according to them. In RL, we are usually more interested in the Q-values at a given state (rather than the reward function), since the Q-values extend temporally (whereas the rewards are at particular states only). We can thus modify our algorithm for posterior sampling as follows:

  1. Sample a Q-function: \(Q \sim \hat{p}(Q)\)
  2. Pretend the sampled \(Q\) is correct.
  3. Act according to \(Q\) for one episode.
  4. Update \(\hat{p}\).
  5. Repeat.

The only issue with this algorithm is the following: Q-functions are essentially neural networks, so how does one represent a distribution over neural networks? One way of solving this problem is to use the bootstrap. Given a dataset, we sample \(N\) sub-datasets from it (with replacement). We then use these sub-datasets to train \(N\) different models \(f_{\theta_i}\). To sample from \(\hat{p}\), we simply sample an index \(j \in \{1,\ldots,N\}\) and pick \(f_{\theta_j}\), as sketched below.
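
A minimal sketch of this bootstrap ensemble, assuming a user-supplied training routine (here called train_q_network, a hypothetical helper) that fits a Q-network to a list of transitions:

```python
import numpy as np

def train_bootstrapped_ensemble(dataset, n_models, train_q_network, seed=0):
    """Train N Q-networks, each on a bootstrap resample of the transition dataset."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(dataset), size=len(dataset))   # sample with replacement
        models.append(train_q_network([dataset[i] for i in idx]))
    return models

def sample_q(models, rng=np.random.default_rng()):
    # "sampling Q ~ p_hat(Q)" amounts to picking one ensemble member uniformly at random
    return models[rng.integers(len(models))]
```

At the start of each episode, one ensemble member is sampled and the agent acts according to it for the whole episode, after which the new data is used to update the ensemble.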

8. Information Gain in RL

Previously, we talked about choosing actions that provide the highest information gain about \([\theta_1,\ldots,\theta_K]\). However, in RL we might have sparse rewards, so the information gain about the reward function may not be a very useful exploration signal. Alternatively, we can compute the information gain about the system dynamics. It is easy to show that the information gain \(\text{IG}(z,y)\) can be written as \(D_{KL}(p(z \vert y) \vert\vert p(z))\). If we let \(\theta\) be the parameters of our model of the transition dynamics \(p_\theta(\mathbf{s}_{t+1}\vert\mathbf{s}_t,\mathbf{a}_t)\), then at time \(t\) we would be interested in choosing an action \(\mathbf{a}_t\) that leads to a state \(\mathbf{s}_{t+1}\) maximizing the following KL divergence:

$$ D_{KL}(p(\theta \vert h,\mathbf{s}_{t+1},\mathbf{s}_t,\mathbf{a}_t) \vert\vert p(\theta \vert h)) $$

where \(h\) is the history of all past transitions. This can be accomplished by training the parameters \(\phi\) of a variational distribution \(q(\theta\vert\phi)\) to approximate \(p(\theta\vert h)\). We can then update \(\phi\) using the new transition \((\mathbf{s}_t,\mathbf{a}_t,\mathbf{s}_{t+1})\). Denote the updated parameters by \(\phi'\). We can now simply choose the action that maximizes \(D_{KL}(q(\theta\vert\phi')\vert\vert q(\theta\vert\phi))\).
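
If \(q(\theta\vert\phi)\) is taken to be a diagonal Gaussian over the weights of the dynamics model (a common variational choice, assumed here), this KL term has a closed form; a sketch with \(\phi\) represented as mean and log-standard-deviation vectors:

```python
import numpy as np

def gaussian_kl(mu_new, logstd_new, mu_old, logstd_old):
    """KL( q(theta | phi') || q(theta | phi) ) for diagonal Gaussian posteriors."""
    mu_new, logstd_new = np.asarray(mu_new), np.asarray(logstd_new)
    mu_old, logstd_old = np.asarray(mu_old), np.asarray(logstd_old)
    var_new, var_old = np.exp(2 * logstd_new), np.exp(2 * logstd_old)
    return np.sum(logstd_old - logstd_new
                  + (var_new + (mu_new - mu_old) ** 2) / (2 * var_old)
                  - 0.5)
```

Candidate actions can then be scored by how large this divergence would be after the hypothetical posterior update, and the most informative one is chosen.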


This concludes the notes for this course. The remaining lectures (19-25) are more of a general survey of the literature on several problems (transfer learning, meta-learning etc.) and a discussion on some of the applications of RL.