In this set of lecture notes we will discuss the problem of overfitting in model-based reinforcement learning.

1. The Problem

In the previous set of lecture notes, we outlined the following algorithm for model-based RL:

  1. Run base policy \(\pi_0(\mathbf{a}_t\vert\mathbf{s}_t)\) to collect \(\mathcal{D} = \{(\mathbf{s}_t,\mathbf{a}_t,\mathbf{s}_{t+1})^{(i)}\}\).
  2. Train a neural network dynamics model \(f(\mathbf{s}_t,\mathbf{a}_t)\) to minimize \(\sum_i\vert\vert f(\mathbf{s}^{(i)}_t,\mathbf{a}^{(i)}_t)-\mathbf{s}^{(i)}_{t+1} \vert\vert^2\).
  3. Repeat:
    1. Plan through \(f(\mathbf{s}_t,\mathbf{a}_t)\) to choose actions (e.g. use LQR).
    2. Execute the first action only and observe the resulting state \(\mathbf{s}_{t+1}\).
    3. Add \((\mathbf{s}_{t},\mathbf{a}_{t},\mathbf{s}_{t+1})\) to \(\mathcal{D}\).
    4. Every \(N\) steps, retrain \(f\) on the updated \(\mathcal{D}\).
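
The loop above can be condensed into a rough Python sketch. Here `base_policy`, `plan_actions`, `train_model`, and the `env` interface are hypothetical placeholders for whatever policy, planner, model-fitting routine, and environment are used; this is a sketch of the control flow, not a fixed API.

```python
def model_based_rl(env, base_policy, plan_actions, train_model, num_steps, N,
                   num_initial_transitions=1000):
    """Sketch of the model-based RL loop above (all helpers are hypothetical)."""
    # 1. Run the base policy to collect an initial dataset D of transitions.
    D = []
    s = env.reset()
    for _ in range(num_initial_transitions):
        a = base_policy(s)
        s_next = env.step(a)
        D.append((s, a, s_next))
        s = s_next

    # 2. Fit the dynamics model f(s, a) ~ s' by minimizing squared error on D.
    f = train_model(D)

    # 3. Plan, execute one action, record the transition, retrain periodically.
    s = env.reset()
    for step in range(num_steps):
        actions = plan_actions(f, s)      # 3.1 plan through f (e.g. LQR)
        a = actions[0]                    # 3.2 execute only the first action
        s_next = env.step(a)
        D.append((s, a, s_next))          # 3.3 grow the dataset
        if (step + 1) % N == 0:           # 3.4 retrain f every N steps
            f = train_model(D)
        s = s_next
    return f, D
```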

Consider step 1 of the inner loop. The problem with using \(f(\mathbf{s}_t,\mathbf{a}_t)\) is that while it has high capacity (we need complex models to accurately capture complicated system dynamics), the data available to train it is typically sparse. This means that \(f(\mathbf{s}_t,\mathbf{a}_t)\) will most likely overfit the data it is trained on. Consequently, \(f(\mathbf{s}_t,\mathbf{a}_t)\) may be erroneously optimistic about low-reward trajectories, and pessimistic about high-reward trajectories, that it was not trained on. The planner (e.g. LQR) can therefore exploit these errors by choosing trajectories that \(f(\mathbf{s}_t,\mathbf{a}_t)\) predicts have high rewards but that in reality have low rewards.

We therefore need to modify our planner so that it takes into account the uncertainty in the model \(f(\mathbf{s}_t,\mathbf{a}_t)\) itself, i.e. it should only take actions that yield a high reward in expectation with respect to the uncertain dynamics. Intuitively, the further we are from the data that \(f(\mathbf{s}_t,\mathbf{a}_t)\) was trained on, the more uncertain we should be about the predictions it makes.
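
One way to write this requirement down (my notation, anticipating the bootstrap-ensemble objective at the end of these notes) is to ask the planner to maximize the expected return under the distribution over dynamics models:

$$ \mathbf{a}_1,\ldots,\mathbf{a}_T = \text{argmax}_{\mathbf{a}_1,\ldots,\mathbf{a}_T} \mathbb{E}_{\theta \sim p(\theta \vert \mathcal{D})}\left[\sum_{t=1}^T r(\mathbf{s}_t,\mathbf{a}_t)\right] \text{ where } \mathbf{s}_{t+1} = f_{\theta}(\mathbf{s}_t,\mathbf{a}_t) $$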

2. Uncertainty Aware Models

One way to measure uncertainty is to compute the entropy of the model's output (using the probabilities it assigns to each state). However, the probabilities the model assigns to different states only reflect its understanding of the data it was trained on. Consequently, this entropy only captures the so-called aleatoric or statistical uncertainty, i.e. the uncertainty (noise) inherent in the training data. What we are interested in instead is the epistemic or model uncertainty, i.e. the uncertainty in the model itself. In simpler words, although the model is certain about the data, we are not certain about the model itself.
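
As a concrete illustration (assuming, purely for the sake of the example, that the model outputs a diagonal Gaussian over the next state), the entropy of a prediction depends only on the variances the model outputs, i.e. on the noise it believes is in the data; it can stay small even far from the training data, where the model should in fact be uncertain.

```python
import numpy as np

def gaussian_entropy(pred_var):
    """Entropy of a diagonal-Gaussian prediction p(s_{t+1} | s_t, a_t).

    pred_var holds the per-dimension predicted variances. The result measures
    only the noise the model predicts in the data (aleatoric uncertainty), not
    how uncertain we are about the model's parameters (epistemic uncertainty)."""
    return 0.5 * np.sum(np.log(2.0 * np.pi * np.e * pred_var))

# A confidently wrong model far from its training data can still report low entropy.
print(gaussian_entropy(np.array([0.01, 0.02, 0.05])))
```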

One idea for estimating model uncertainty starts from the observation that we usually try to find the parameters \(\theta^*\) of the model such that:

$$ \theta^* = \text{argmax}_{\theta} \log p(\mathcal{D} \vert \theta) $$

which is simply equivalent to (assuming a uniform prior on \(\theta\)):

$$ \theta^* = \text{argmax}_{\theta} \log p(\theta \vert \mathcal{D}) $$

If instead we could somehow calculate \(p(\theta\vert\mathcal{D})\), then the entropy of this distribution would tell us the uncertainty in the model itself. Note that the entropy is higher if more than one value of \(\theta\) has a high probability. In other words, the more distinct models we can fit to \(\mathcal{D}\), the less certain we are of any one of them being correct.

Bootstrap ensembles train \(N\) different (independent) models, giving \(N\) parameter vectors \(\theta_1,\ldots,\theta_N\). Each \(\theta_i\) is assigned an equal probability of \(1/N\). Formally, this can be written as:

$$ p(\theta \vert \mathcal{D}) \approx \frac{1}{N} \sum_{i=1}^N \delta(\theta - \theta_i) $$

where \(\delta\) is the Dirac delta function.

We can now simply make predictions about the new state given the current state and action according to:

$$ \int p(\mathbf{s}_{t+1} \vert \mathbf{s}_{t},\mathbf{a}_{t}, \theta)p(\theta \vert \mathcal{D})d\theta \approx \frac{1}{N}\sum_{i=1}^N p_{\theta_i}(\mathbf{s}_{t+1} \vert \mathbf{s}_{t},\mathbf{a}_{t}) $$

In this way, states on which all of the models agree are assigned higher probabilities. Conversely, if the models disagree on a particular state (indicating high entropy and therefore high uncertainty), the probabilities they assign to it are averaged out, resulting in a lower probability for that state.
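
A minimal sketch of this averaging, assuming each ensemble member outputs the mean and diagonal variance of a Gaussian over the next state (the notes do not fix a particular output distribution, so this parameterization is an assumption):

```python
import numpy as np

def ensemble_log_prob(models, s_t, a_t, s_next):
    """log of (1/N) * sum_i p_{theta_i}(s_next | s_t, a_t) for Gaussian members.

    Each element of `models` is a (hypothetical) callable returning (mean, var)
    arrays for the next state. States the members agree on keep a high averaged
    probability; states they disagree on get averaged down."""
    log_probs = []
    for model in models:
        mean, var = model(s_t, a_t)
        # Diagonal-Gaussian log density of s_next under this member.
        log_probs.append(
            -0.5 * np.sum(np.log(2.0 * np.pi * var) + (s_next - mean) ** 2 / var)
        )
    # Average the member probabilities in a numerically stable way (log-sum-exp).
    log_probs = np.array(log_probs)
    m = np.max(log_probs)
    return m + np.log(np.mean(np.exp(log_probs - m)))
```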

To form a bootstrap ensemble, however, we need to train \(N\) different, independent models. While this can be achieved by sampling \(N\) sub-datasets with replacement from the training dataset \(\mathcal{D}\) and training each model on a separate sub-dataset, in practice random initialization of the parameters, coupled with stochastic gradient descent, makes the models sufficiently independent even when they are all trained on the same dataset.
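
Both options can be sketched as follows, assuming a generic (hypothetical) `train_model(dataset, seed)` helper that fits a single dynamics model:

```python
import numpy as np

def train_ensemble(D, train_model, N, resample=False, seed=0):
    """Train N dynamics models on the dataset D.

    With resample=True each model sees a dataset sampled with replacement from D
    (the classical bootstrap); otherwise all models see the same data and rely on
    random initialization plus SGD noise to decorrelate them."""
    rng = np.random.default_rng(seed)
    models = []
    for i in range(N):
        if resample:
            idx = rng.integers(0, len(D), size=len(D))      # sample with replacement
            dataset = [D[j] for j in idx]
        else:
            dataset = D
        models.append(train_model(dataset, seed=seed + i))  # different init per model
    return models
```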

The planning objective for bootstrap ensembles is then the return averaged over the \(N\) models:

$$ J(\mathbf{a}_1,\ldots,\mathbf{a}_T) = \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T r(\mathbf{s}_{t,i},\mathbf{a}_t) \text{ where } \mathbf{s}_{t+1,i} = f_{\theta_i}(\mathbf{s}_{t,i},\mathbf{a}_t) $$
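
This objective can be evaluated directly inside a simple planner, for instance random shooting over candidate action sequences. The sketch below assumes deterministic models \(f_{\theta_i}\), a known reward function `reward(s, a)`, and a hypothetical `sample_actions()` helper that proposes a candidate action sequence.

```python
import numpy as np

def evaluate_J(action_seq, models, s_0, reward):
    """J(a_1, ..., a_T): return of one action sequence, averaged over the ensemble.

    Each element of `models` is a deterministic dynamics function f(s, a) -> s'."""
    total = 0.0
    for f in models:                 # average over the N ensemble members
        s = s_0
        for a in action_seq:
            total += reward(s, a)
            s = f(s, a)              # roll the trajectory forward with f_{theta_i}
    return total / len(models)

def random_shooting(models, s_0, reward, sample_actions, num_candidates=1000):
    """Pick the candidate action sequence with the highest ensemble-averaged return."""
    candidates = [sample_actions() for _ in range(num_candidates)]
    scores = [evaluate_J(seq, models, s_0, reward) for seq in candidates]
    return candidates[int(np.argmax(scores))]
```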