1. Review

In the previous set of lecture notes we presented the Q-iteration algorithm. The online version is given as:

  1. Take some action \(\mathbf{a}\) and observe \((\mathbf{s},\mathbf{a},\mathbf{s}',r)\).
  2. Generate the label \(y = r(\mathbf{s},\mathbf{a})+\gamma\max_{\mathbf{a}'} Q_\phi(\mathbf{s}',\mathbf{a}')\).
  3. Choose \(\phi = \text{argmin}_{\phi}\frac{1}{2}\vert\vert Q_\phi(\mathbf{s},\mathbf{a})-y\vert\vert^2\).

The third step updates \(\phi\) using:

$$ \phi := \phi - \alpha\frac{dQ_\phi(\mathbf{s},\mathbf{a})}{d\phi} \left( Q_\phi(\mathbf{s}, \mathbf{a})-y\right) $$
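To make the update concrete, here is a minimal sketch of a single online update for a discrete action space. The helpers `q_values(phi, s)` (a vector of Q-values for every action) and `grad_q(phi, s, a)` (the gradient \(\frac{dQ_\phi(\mathbf{s},\mathbf{a})}{d\phi}\)) are hypothetical stand-ins for a differentiable function approximator; they are not defined in these notes.

```python
import numpy as np

def online_q_update(phi, s, a, s_next, r, q_values, grad_q, alpha=1e-3, gamma=0.99):
    """One step of online Q-iteration.

    q_values(phi, s) -> array of Q-values for every action (assumed helper).
    grad_q(phi, s, a) -> dQ_phi(s, a)/dphi, same shape as phi (assumed helper).
    """
    y = r + gamma * np.max(q_values(phi, s_next))   # step 2: generate the label
    error = q_values(phi, s)[a] - y                 # Q_phi(s, a) - y
    return phi - alpha * grad_q(phi, s, a) * error  # step 3: single gradient step
```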

We previously showed that there are no convergence guarantees for Q-iteration. In this set of lecture notes, however, we discuss some modifications that mitigate these problems and, in doing so, allow Q-iteration to work well in practice.

2. Correlated Samples in Online Q-Learning

The first problem with Q-iteration is the way we collect data. Each state is generated from the state and action at the previous time step, so sequential states are highly correlated. Such temporal correlation in the dataset makes it hard for gradient-descent-based optimizers to converge.

One intuitive reason for this is that temporal correlation causes optimizers to locally fit different regions of the state space one at a time. In other words, the optimizer begins by locally converging in some region of state space, then moves on to another region and converges there, and in doing so forgets about the previous region it was in.

One solution to this problem is to use parallel “workers”. Each worker collects data individually, which is then used to update \(\phi\). This can be done either synchronously or asynchronously. In the synchronous version, the data from each worker is collected and a (single) gradient step on \(\phi\) is taken on the collected data. In the asynchronous version, as soon as a worker collects a data point, \(\phi\) is updated without waiting for other workers to finish. We will discuss these two methods in more detail in a later set of lecture notes.

Another solution to this correlation problem is to use replay buffers. Here, the data is added to some buffer. Batches are then sampled from this buffer (denoted with \(\beta\)) and used to update \(\phi\):

  1. Collect some dataset \(\{(\mathbf{s}^{(i)},\mathbf{a}^{(i)},\mathbf{s}'^{(i)},r^{(i)})\}\) using some policy and add it to \(\beta\).
  2. Sample a batch \(\{(\mathbf{s}^{(i)},\mathbf{a}^{(i)},\mathbf{s}'^{(i)},r^{(i)})\}\) from \(\beta\).
  3. Update \(\phi := \phi - \alpha\sum_i \frac{dQ_\phi(\mathbf{s}^{(i)},\mathbf{a}^{(i)})}{d\phi}(Q_\phi(\mathbf{s}^{(i)},\mathbf{a}^{(i)})-[r(\mathbf{s}^{(i)},\mathbf{a}^{(i)})+\gamma\max_{\mathbf{a}} Q_\phi(\mathbf{s}'^{(i)},\mathbf{a})])\).

One way to run this algorithm is to repeat steps two and three \(K\) times during each iteration. The buffer usually has finite capacity, so some eviction policy is required to discard old data points and make room for new ones.
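A minimal sketch of such a buffer, assuming transitions are stored as plain tuples; the simplest eviction policy (used here) is first-in-first-out:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer of (s, a, s', r) transitions."""

    def __init__(self, capacity=100_000):
        # deque with maxlen implements the eviction policy: oldest entries are dropped.
        self.data = deque(maxlen=capacity)

    def add(self, s, a, s_next, r):
        self.data.append((s, a, s_next, r))

    def sample(self, batch_size):
        # Sampling uniformly at random breaks the temporal correlation in the data.
        return random.sample(list(self.data), batch_size)
```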

3. Q-Iteration Is Not Gradient Descent

As we discussed in the previous set of lecture notes, Q-iteration is not really gradient descent on the objective \(\frac{1}{2}\vert\vert Q_\phi(\mathbf{s},\mathbf{a})-y\vert\vert^2\). This is because even though \(y\) depends on \(\phi\), we do not take derivatives with respect to it. One way to address this problem is to use some previous estimate of \(\phi\), which we denote as \(\phi'\), to generate \(y\). This way \(y\) does not depend on the current estimate of \(\phi\), and hence we are able to perform actual gradient descent on this objective. The Q-learning algorithm with target networks is given as:

  1. Save target network parameters: \(\phi' \leftarrow \phi\).
    1. Collect dataset \(\{(\mathbf{s}^{(i)},\mathbf{a}^{(i)},\mathbf{s}'^{(i)},r^{(i)})\}\) using some policy and add it to \(\beta\).
      1. Sample a batch \(\{(\mathbf{s}^{(i)},\mathbf{a}^{(i)},\mathbf{s}'^{(i)},r^{(i)})\}\) from \(\beta\).
      2. Update \(\phi := \phi - \alpha\sum_i \frac{dQ_\phi(\mathbf{s}^{(i)},\mathbf{a}^{(i)})}{d\phi}(Q_\phi(\mathbf{s}^{(i)},\mathbf{a}^{(i)})-[r(\mathbf{s}^{(i)},\mathbf{a}^{(i)})+\gamma \max_{\mathbf{a}} Q_{\phi'}(\mathbf{s}'^{(i)},\mathbf{a})])\).

The innermost loop (sampling a batch and updating \(\phi\)) is run \(K\) times, while the outer data-collection loop is run \(N\) times; the entire algorithm is repeated until convergence. The “classic” deep Q-learning algorithm sets \(K=1\). Interchanging the top two steps gives the fitted Q-iteration algorithm.

Note that \(\phi'\) lags behind \(\phi\) by a certain number of iterations. However, this lag is different at different iterations of the inner loops. One way to have the same lag at every iteration, which is known to help in practice, is to apply some sort of averaging to \(\phi'\) each time \(\phi\) is updated. One popular choice is Polyak averaging:

$$ \phi' = \tau\phi' + (1-\tau)\phi $$

In practice setting \(\tau=0.999\) works very well.
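As a sketch, with the parameters flattened into NumPy arrays (an assumption made purely for illustration), the target-network update is a single line:

```python
import numpy as np

def polyak_update(phi_target, phi, tau=0.999):
    """phi' <- tau * phi' + (1 - tau) * phi, applied elementwise.

    With tau close to 1, phi' changes slowly and lags phi by a roughly
    constant amount, regardless of where we are in the inner loops.
    """
    return tau * phi_target + (1.0 - tau) * phi
```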

4. Overestimation in Q-Learning

During training, \(Q_\phi\) tends to overestimate the actual Q-values. This is because \(Q_\phi\) is not a perfect estimator: each of its estimates contains some error. Therefore, when we take the maximum over these noisy estimates in:

$$ y = r(\mathbf{s},\mathbf{a}) + \gamma\max_{\mathbf{a}'}Q_{\phi'}(\mathbf{s}', \mathbf{a}') $$

we systematically select the positive noise: for two noisy estimates \(X_1\) and \(X_2\), \(\mathbb{E}[\max(X_1,X_2)] \ge \max(\mathbb{E}[X_1],\mathbb{E}[X_2])\), so the target tends to be an overestimate.
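A quick numerical check of this effect (a sketch, not from the notes): two actions whose true Q-values are both zero, estimated with zero-mean noise. The average of the max of the noisy estimates is clearly positive, even though the true max is zero.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(2)                      # true Q-values of both actions are 0
noise = rng.normal(size=(100_000, 2))     # zero-mean estimation error
noisy_q = true_q + noise

print(np.max(true_q))                     # 0.0: the true max
print(np.mean(np.max(noisy_q, axis=1)))   # ~0.56: the max picks out positive noise
```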

Double Q-learning mitigates this overestimation by learning two Q-function approximators instead of one. Note that we may rewrite our targets as:

$$ y = r(\mathbf{s},\mathbf{a}) + \gamma Q_{\phi'}(\mathbf{s}', \text{argmax}_{\mathbf{a}'} Q_{\phi'}(\mathbf{s}', \mathbf{a}')) $$

We may now use two separate function approximators:

$$ \begin{align} Q_{\phi_A} &\leftarrow r(\mathbf{s},\mathbf{a}) + \gamma Q_{\phi_B}(\mathbf{s}',\text{argmax}_{\mathbf{a}'} Q_{\phi_A}(\mathbf{s}', \mathbf{a}'))\\ Q_{\phi_B} &\leftarrow r(\mathbf{s},\mathbf{a}) + \gamma Q_{\phi_A}(\mathbf{s}',\text{argmax}_{\mathbf{a}'} Q_{\phi_B}(\mathbf{s}', \mathbf{a}')) \end{align} $$

As long as the two function approximators are noisy in different ways (i.e. their errors are not correlated), the overestimation problem largely goes away, because the noise that selects the action is decorrelated from the noise that evaluates it.

However, in order to use this solution we would need to learn two Q-functions. One way to get around this is to note that with target networks (discussed in the previous section) we are already keeping track of two Q-functions: the current and target Q-functions, which we denoted by \(Q_\phi\) and \(Q_{\phi'}\). Therefore, we can simply use our current Q-function to choose the action:

$$ y = r(\mathbf{s},\mathbf{a}) +\gamma Q_{\phi'}(\mathbf{s}', \text{argmax}_{\mathbf{a}'} Q_\phi(\mathbf{s}', \mathbf{a}')) $$
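A minimal sketch of this target computation for a discrete action space. `q_current(s)` and `q_target(s)` are assumed helpers returning arrays of per-action Q-values under \(\phi\) and \(\phi'\) respectively; they are not defined in these notes.

```python
import numpy as np

def double_q_targets(r, s_next, q_current, q_target, gamma=0.99):
    """y = r + gamma * Q_phi'(s', argmax_a Q_phi(s', a)), computed for a batch.

    q_current(s_next), q_target(s_next) -> arrays of shape (batch, num_actions).
    """
    best_a = np.argmax(q_current(s_next), axis=1)    # action chosen by the current network
    batch_idx = np.arange(len(best_a))
    q_next = q_target(s_next)[batch_idx, best_a]     # value estimated by the target network
    return r + gamma * q_next
```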

5. Multi-Step Returns

The Q-learning targets are given by:

$$ y = r(\mathbf{s},\mathbf{a}) + \gamma\max_{\mathbf{a}'}Q_{\phi'}(\mathbf{s}', \mathbf{a}') $$

As \(Q_\phi\) is (usually) initialized randomly, its estimates are not very good at the beginning of training. At that point, the only term in the target that carries useful signal is the current reward \(r(\mathbf{s},\mathbf{a})\). However, once \(Q_\phi\) gets better, its estimates become more important.

The way these target values are computed is similar to the bootstrapped update for the value function, discussed in a previous set of lecture notes. The actor-critic algorithm reduced variance by using low-variance estimates of the value function instead of the high-variance sum of rewards, but it introduced bias in exchange. Just as in the actor-critic algorithm, if the Q-function approximator in Q-learning is not perfect, we incur a high bias in the targets. One way to reduce this bias is to use multi-step returns:

$$ y_t = \sum_{t'=t}^{t+N-1} \gamma^{t'-t} r(\mathbf{s}_{t'},\mathbf{a}_{t'}) + \gamma^N \max_{\mathbf{a}'} Q_{\phi'} (\mathbf{s}_{t+N},\mathbf{a}') $$

Multi-step returns result in a less biased target when \(Q_\phi\) is not very accurate, which also leads to faster learning in the beginning (when \(Q_\phi\) is indeed not very good).

However, multi-step returns require the data to be on-policy. Recall that our (greedy) policy \(\pi\) is given by:

$$ \pi(\mathbf{a}\vert\mathbf{s}) = \begin{cases} 1 & \text{if}\; \mathbf{a}=\text{argmax}_\mathbf{a}Q_\phi(\mathbf{s},\mathbf{a})\\ 0 & \text{otherwise} \end{cases} $$

Note that the first term just sums up (discounted) rewards obtained by running some policy. Since \(y_t\) estimates the return obtained by taking action \(\mathbf{a}_t\) in state \(\mathbf{s}_t\) and then following the policy \(\pi\), for it to be accurate the subsequent actions must actually come from \(\pi\). For \(N=1\) this was not a problem: the target conditions on the given state and action, and the only other term is the \(\max\) at the next state, which does not require the data to come from \(\pi\). It is, however, a problem for \(N>1\), because our data-collection policy is usually different from (though generally not independent of) \(\pi\). There are several solutions to this problem:

  1. Simply ignore the problem: This often works well in practice for small \(N\).
  2. Cut the trace: This chooses \(N\) dynamically. It sums up the rewards for the initial time steps, but stops as soon as it encounters an action different from what \(\pi\) would have taken; \(N\) is then the number of time steps over which the rewards were summed (see the sketch after this list).
  3. Importance sampling.
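Here is a sketch of the trace-cutting variant. The helpers `q_target(s)` (per-action Q-values under \(\phi'\)) and `greedy_action(s)` (\(\text{argmax}_\mathbf{a} Q_\phi(\mathbf{s},\mathbf{a})\)) are assumptions, as is the requirement that the logged trajectory contains \(\mathbf{s}_{t+N}\).

```python
import numpy as np

def n_step_target(rewards, states, actions, t, N, q_target, greedy_action, gamma=0.99):
    """Compute y_t with a dynamically chosen horizon n <= N ("cutting the trace").

    rewards/states/actions are the logged trajectory.
    q_target(s) -> per-action Q-values under phi' (assumed helper).
    greedy_action(s) -> argmax_a Q_phi(s, a) (assumed helper).
    """
    ret, n = 0.0, 0
    while n < N:
        # For n > 0, stop summing as soon as the logged action is not what pi would take.
        if n > 0 and actions[t + n] != greedy_action(states[t + n]):
            break
        ret += (gamma ** n) * rewards[t + n]
        n += 1
    # Bootstrap from the state reached after the n summed rewards.
    return ret + (gamma ** n) * np.max(q_target(states[t + n]))
```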

6. Q-Learning with Continuous Actions

Up to now we have assumed that our action space is discrete. Let us now assume that it is continuous. In Q-learning, both during training (when generating the labels) and when running our policy (in real time), we need to find the action with the largest Q-value. While this is easy for (small) discrete action spaces, where we can evaluate the Q-function for each action, it is harder in the continuous case. In this section, we discuss some ways in which we can approximate this \(\max\):

A. Optimization

We may use gradient-based optimization techniques (such as stochastic gradient descent) to find the maximizing action. However, running such an optimization in the innermost loop makes the algorithm considerably slower. Since our action space is typically low dimensional, we may instead use a stochastic optimization technique. A simple option is to use the following approximation:

$$ \max_\mathbf{a}Q_\phi(\mathbf{s},\mathbf{a}) \approx \max\left[Q_\phi(\mathbf{s}, \mathbf{a}_1), \ldots, Q_\phi(\mathbf{s},\mathbf{a}_N) \right] $$

where \((\mathbf{a}_1,\dots,\mathbf{a}_N)\) are sampled from some distribution (e.g. a uniform distribution over the action space). While this technique is very simple to implement and easily parallelizable, it is not very accurate. But we already know that \(Q_\phi\) is not very accurate, so do we really need to evaluate this \(\max\) very accurately?
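A sketch of this sampling-based maximization, assuming a hypothetical helper `q_fn(s, a)` that returns the scalar Q-value and a box-shaped action space given by `act_low`/`act_high`:

```python
import numpy as np

def approx_max_q(q_fn, s, act_low, act_high, num_samples=64, rng=None):
    """Approximate max_a Q(s, a) by evaluating Q at uniformly sampled actions.

    q_fn(s, a) -> scalar Q-value (assumed helper); act_low/act_high define the
    action box. Returns the best value and the corresponding (approximate) argmax.
    """
    rng = rng or np.random.default_rng()
    act_low, act_high = np.asarray(act_low), np.asarray(act_high)
    actions = rng.uniform(act_low, act_high, size=(num_samples, act_low.shape[0]))
    values = np.array([q_fn(s, a) for a in actions])
    best = int(np.argmax(values))
    return values[best], actions[best]
```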

More accurate solutions include the cross-entropy method and the Covariance Matrix Adaptation Evolution Strategy (CMA-ES).

B. Easily Maximizable Q-Functions

Another solution is to use Q-functions that are easy to maximize in \(\mathbf{a}\) (i.e. whose maximum over \(\mathbf{a}\) has a closed-form solution). Note that we impose no such restriction with regard to \(\mathbf{s}\). One example is a Q-function that is quadratic in the actions:

$$ Q_\phi(\mathbf{s},\mathbf{a}) = -\frac{1}{2}(\mathbf{a}-\mu_\phi(\mathbf{s}))^T P_\phi(\mathbf{s})(\mathbf{a}-\mu_\phi(\mathbf{s})) + V_\phi(\mathbf{s}) $$

where \(\mu_\phi(\mathbf{s})\) is a vector, \(V_\phi(\mathbf{s})\) is a scalar, and \(P_\phi(\mathbf{s})\) is a positive-definite matrix. All three are outputs of a neural network with parameters \(\phi\) and input \(\mathbf{s}\). Note that:

$$ \begin{align} \text{argmax}_\mathbf{a} Q_\phi(\mathbf{s},\mathbf{a}) &= \mu_\phi(\mathbf{s})\\ \max_\mathbf{a} Q_\phi(\mathbf{s},\mathbf{a}) &= V_\phi(\mathbf{s}) \end{align} $$

Such functions are called Normalized Advantage Functions (NAF). While this solution requires virtually no change to the Q-iteration algorithm and is just as efficient as the discrete case, it does lose some representational power, because not every Q-function can be expressed in a form that is easily maximizable in \(\mathbf{a}\).
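A sketch of how the quadratic form could be evaluated from three network heads. The helpers `mu_fn`, `L_fn`, and `v_fn` are assumptions standing in for the network outputs; building \(P = LL^\top\) from a lower-triangular \(L\) is one common way to keep \(P\) positive semi-definite.

```python
import numpy as np

def naf_q_value(s, a, mu_fn, L_fn, v_fn):
    """Q(s, a) = -1/2 (a - mu(s))^T P(s) (a - mu(s)) + V(s), with P(s) = L(s) L(s)^T.

    mu_fn(s) -> action-sized vector, L_fn(s) -> lower-triangular matrix,
    v_fn(s) -> scalar. All three are assumed heads of a network with input s.
    """
    mu, L, v = mu_fn(s), L_fn(s), v_fn(s)
    P = L @ L.T                      # positive semi-definite by construction
    diff = np.asarray(a) - mu
    return -0.5 * diff @ P @ diff + v

# By construction, argmax_a Q(s, a) = mu(s) and max_a Q(s, a) = V(s).
```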

C. Learning an Approximate Maximizer

We may just train a neural net \(\mu_\theta\) such that:

$$ \mu_\theta(\mathbf{s}) \approx \text{argmax}_\mathbf{a}Q_\phi(\mathbf{s},\mathbf{a}) $$

To do so, we only need to solve:

$$ \theta \leftarrow \text{argmax}_\theta Q_\phi(\mathbf{s},\mu_\theta(\mathbf{s})) $$

i.e. we need to optimize for \(\theta\). This can be done easily by noting that:

$$ \frac{dQ_\phi(\mathbf{s},\mathbf{a})}{d\theta} = \frac{dQ_\phi(\mathbf{s},\mathbf{a})}{d\mathbf{a}}\frac{d\mathbf{a}}{d\theta} $$

So our new target will be given by:

$$ y = r(\mathbf{s},\mathbf{a}) + \gamma Q_\phi(\mathbf{s}',\mu_\theta(\mathbf{s}')) $$

The so-called DDPG algorithm learns such a maximizer:

  1. Take some action \(\mathbf{a}\), observe \((\mathbf{s},\mathbf{a},\mathbf{s}',r)\), and add it to \(\beta\).
  2. Sample a mini-batch \(\{(\mathbf{s}^{(i)},\mathbf{a}^{(i)},\mathbf{s}'^{(i)},r^{(i)})\}\) from \(\beta\).
  3. Generate targets \(y^{(i)}=r^{(i)}+\gamma Q_{\phi'}(\mathbf{s}'^{(i)},\mu_{\theta'}(\mathbf{s}'^{(i)}))\).
  4. Update \(\phi \leftarrow \phi - \alpha\sum_i \frac{dQ_\phi(\mathbf{s}^{(i)},\mathbf{a}^{(i)})}{d\phi}(Q_\phi(\mathbf{s}^{(i)},\mathbf{a}^{(i)})-y^{(i)})\).
  5. Update \(\theta \leftarrow \theta + \eta \sum_i \frac{dQ_\phi(\mathbf{s}^{(i)},\mathbf{a})}{d\mathbf{a}}\Big\vert_{\mathbf{a}=\mu_\theta(\mathbf{s}^{(i)})}\frac{d\mu_\theta(\mathbf{s}^{(i)})}{d\theta}\), where \(\eta\) is the learning rate for \(\theta\).
  6. Update \(\phi'\) and \(\theta'\) (e.g. use Polyak averaging).

Note that the targets are computed using the target networks \(Q_{\phi'}\) and \(\mu_{\theta'}\), whose parameters \(\phi'\) and \(\theta'\) are updated in step six.
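A sketch of the critic and actor updates in an autodiff framework (PyTorch here, purely as an illustration; the network sizes, hyperparameters, and placeholder batch are assumptions). The actor step implements \(\frac{dQ_\phi}{d\mathbf{a}}\frac{d\mathbf{a}}{d\theta}\) by backpropagating through \(Q_\phi(\mathbf{s},\mu_\theta(\mathbf{s}))\) and stepping only \(\theta\).

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 3, 2  # illustrative sizes (assumption)

# Hypothetical critic Q_phi(s, a) and actor mu_theta(s).
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
mu_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
q_opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
mu_opt = torch.optim.Adam(mu_net.parameters(), lr=1e-4)

s = torch.randn(32, obs_dim)          # placeholder batch of states from the replay buffer
a = torch.rand(32, act_dim) * 2 - 1   # placeholder logged actions
y = torch.randn(32, 1)                # placeholder targets r + gamma * Q_phi'(s', mu_theta'(s'))

# Critic step: regress Q_phi(s, a) onto the (fixed) targets y.
critic_loss = ((q_net(torch.cat([s, a], dim=-1)) - y) ** 2).mean()
q_opt.zero_grad(); critic_loss.backward(); q_opt.step()

# Actor step: maximize Q_phi(s, mu_theta(s)) by minimizing its negative;
# only theta is stepped, so phi is unchanged even though gradients flow through q_net.
actor_loss = -q_net(torch.cat([s, mu_net(s)], dim=-1)).mean()
mu_opt.zero_grad(); actor_loss.backward(); mu_opt.step()
```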

7. Practical Tips for Q-Learning

  1. Q-learning takes some care to stabilize.
  2. Large replay buffers help improve stability.
  3. Start with high exploration (e.g. with a large value of \(\epsilon\) in epsilon-greedy) and gradually reduce.
  4. Start with a high learning rate and gradually reduce.
  5. Clip Bellman error gradients as they can get quite big.
  6. Double Q-learning helps a lot in practice.
  7. Run multiple seeds as Q-learning is very inconsistent between runs.