Lecture 16
1. Formalism
Given
- a set of states \(\mathbf{s} \in \mathcal{S}\) and actions \(\mathbf{a} \in \mathcal{A}\),
- (sometimes) the transition probabilities \(p(\mathbf{s}'\vert\mathbf{s},\mathbf{a})\),
- a set of trajectories \(\{\tau_i\}\) that were sampled from an (approximately) optimal policy \(\pi^*(\tau)\),
the goal of inverse reinforcement learning (IRL) is to learn the reward function \(r_\psi(\mathbf{s},\mathbf{a})\) which can then be used to recover the optimal policy \(\pi^*(\mathbf{a}\vert\mathbf{s})\).
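As a minimal illustration of this setup in code (a sketch with illustrative names, not part of the lecture material), the data we are given is a set of expert state-action trajectories, and the object we fit is a parameterized reward \(r_\psi\):

```python
import numpy as np
from typing import Callable, List, Tuple

# One expert demonstration: a sequence of (state, action) pairs sampled
# from the (approximately) optimal policy pi*.
Trajectory = List[Tuple[np.ndarray, np.ndarray]]

def trajectory_reward(tau: Trajectory, psi: np.ndarray,
                      r: Callable[[np.ndarray, np.ndarray, np.ndarray], float]) -> float:
    """Total reward r_psi(tau) = sum_t r_psi(s_t, a_t) under the current
    parameters psi; IRL adjusts psi so that expert trajectories score highly."""
    return sum(r(s, a, psi) for s, a in tau)
```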
2. Classical Approaches: Feature Matching and Maximum Margin
Most classical approaches assumed a linear reward function:
\[
r_\psi(\mathbf{s},\mathbf{a}) = \Psi^T\mathbb{f}(\mathbf{s},\mathbf{a})
\]
for some known features \(\mathbb{f}\). Note that \(\Psi = [\psi_1,\ldots,\psi_F]\) and \(\mathbb{f}(\mathbf{s},\mathbf{a}) = [f_1(\mathbf{s},\mathbf{a}),\ldots,f_F(\mathbf{s},\mathbf{a})]\). Let \(\pi^{r_\psi}\) denote the optimal policy for the reward function \(r_\psi\). The feature matching approach tries to pick \(\psi\) such that:
\[
\mathbb{E}_{\pi^{r_\psi}}\left[\mathbb{f}(\mathbf{s},\mathbf{a})\right] = \mathbb{E}_{\pi^*}\left[\mathbb{f}(\mathbf{s},\mathbf{a})\right],
\]
i.e. it tries to match the expected value of the features under the optimal and the recovered policies. Note that the expectation on the RHS can be approximated with the optimal trajectories \(\{\tau_i\}\).
Alternatively, the maximum margin principle tries to choose a reward function such that the reward of the optimal policy is greater than the reward of any other policy by at least a margin of \(m\):
\[
\max_{\psi,\,m} \; m \quad \text{such that} \quad \Psi^T\mathbb{E}_{\pi^*}\left[\mathbb{f}(\mathbf{s},\mathbf{a})\right] \ge \max_{\pi\in\Pi} \Psi^T\mathbb{E}_{\pi}\left[\mathbb{f}(\mathbf{s},\mathbf{a})\right] + m,
\]
where \(\Pi\) is the set of all policies. We can rewrite this problem in a more tractable form in the same way as we do for support vector machines: fix the margin to \(1\) and minimize the norm of the weights instead. Furthermore, we should also account for the fact that policies close to the optimal policy are allowed to have a reward only slightly lower than it, whereas very different policies should have a much lower reward. The final objective function after taking these into account becomes:
\[
\min_{\psi} \; \frac{1}{2}\left\lVert \Psi \right\rVert^2 \quad \text{such that} \quad \Psi^T\mathbb{E}_{\pi^*}\left[\mathbb{f}(\mathbf{s},\mathbf{a})\right] \ge \max_{\pi\in\Pi}\left( \Psi^T\mathbb{E}_{\pi}\left[\mathbb{f}(\mathbf{s},\mathbf{a})\right] + D(\pi,\pi^*)\right),
\]
where \(D(\pi,\pi^*)\) is the difference between the feature expectations under \(\pi\) and \(\pi^*\).
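As a minimal sketch of the feature-matching idea (the helper `feature_fn` and the trajectory format are assumptions for illustration), both sides of the matching condition can be estimated by Monte Carlo from sampled trajectories:

```python
import numpy as np

def feature_expectations(trajectories, feature_fn):
    """Monte Carlo estimate of E[f(s, a)] under the policy that generated
    `trajectories`; each trajectory is a list of (state, action) pairs."""
    feats = [feature_fn(s, a) for tau in trajectories for s, a in tau]
    return np.mean(feats, axis=0)

def feature_gap(expert_trajs, policy_trajs, feature_fn):
    """Difference between expert and current-policy feature expectations;
    feature matching tries to drive this gap to zero."""
    return (feature_expectations(expert_trajs, feature_fn)
            - feature_expectations(policy_trajs, feature_fn))
```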
3. The MaxEnt IRL Algorithm
Assume that our reward function is parameterized by \(\psi\). In the probabilistic graphical model discussed in the last lecture notes, we assumed that:
\[
p(\mathcal{O}_t\vert\mathbf{s}_t,\mathbf{a}_t;\psi) \propto \exp\left(r_\psi(\mathbf{s}_t,\mathbf{a}_t)\right),
\]
where we also include \(\psi\) to emphasize that \(p(\mathcal{O}_t\vert\mathbf{s}_t,\mathbf{a}_t)\) depends on it. We also showed that:
\[
p(\tau\vert\mathcal{O}_{1:T};\psi) \propto p(\tau)\exp\left(\sum_{t} r_\psi(\mathbf{s}_t,\mathbf{a}_t)\right) = p(\tau)\exp\left(r_\psi(\tau)\right),
\]
where \(r_\psi(\tau) = \sum_t r_\psi(\mathbf{s}_t,\mathbf{a}_t)\) denotes the total reward of a trajectory.
If we know a trajectory that an expert chose (which means that this trajectory should have a very high reward), then one way of choosing \(\psi\) is by maximizing the likelihood of this trajectory (if our reward is correct, then this trajectory should be very likely):
\[
\begin{aligned}
\psi &= \underset{\psi}{\text{argmax}} \; \log p(\tau\vert\mathcal{O}_{1:T};\psi)\\
&= \underset{\psi}{\text{argmax}} \; \log \frac{p(\tau)\prod_t p(\mathcal{O}_t\vert\mathbf{s}_t,\mathbf{a}_t;\psi)}{\int p(\tau)\prod_t p(\mathcal{O}_t\vert\mathbf{s}_t,\mathbf{a}_t;\psi)\,d\tau}\\
&= \underset{\psi}{\text{argmax}} \; \log \frac{p(\tau)\exp\left(r_\psi(\tau)\right)}{\int p(\tau)\exp\left(r_\psi(\tau)\right)d\tau}\\
&= \underset{\psi}{\text{argmax}} \; \log p(\tau) + r_\psi(\tau) - \log\int p(\tau)\exp\left(r_\psi(\tau)\right)d\tau\\
&= \underset{\psi}{\text{argmax}} \; r_\psi(\tau) - \log\int p(\tau)\exp\left(r_\psi(\tau)\right)d\tau.
\end{aligned}
\]
The third last step uses the fact that we assume \(p(\mathcal{O}_t\vert\mathbf{s}_t,\mathbf{a}_t)\) to be proportional to \(\exp(r_\psi(\mathbf{s}_t,\mathbf{a}_t))\) (the constants of proportionality in the numerator and denominator cancel each other out). In the last step, we drop \(\log p(\tau)\) since it does not depend on \(\psi\). We shall denote \(\int p(\tau)\exp\left( r_\psi(\tau)\right)d\tau\) by \(Z\); \(Z\) is often called the IRL partition function. We can take the \(\text{argmax}\) over multiple trajectories too. Our (log-)likelihood objective will thus be given by (note that the second term is an integral over all possible trajectories and hence does not depend on the sampled trajectories):
\[
\mathcal{L}(\psi) = \frac{1}{N}\sum_{i=1}^{N} r_\psi(\tau_i) - \log Z.
\]
We need to maximize \(\mathcal{L}\) with respect to \(\psi\):
\[
\nabla_\psi \mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \nabla_\psi r_\psi(\tau_i) - \frac{1}{Z}\int p(\tau)\exp\left(r_\psi(\tau)\right)\nabla_\psi r_\psi(\tau)\,d\tau.
\]
Let us rewrite this as:
\[
\nabla_\psi \mathcal{L} = \mathbb{E}_{\tau\sim\pi^*(\tau)}\left[\nabla_\psi r_\psi(\tau)\right] - \mathbb{E}_{\tau\sim p(\tau\vert\mathcal{O}_{1:T};\psi)}\left[\nabla_\psi r_\psi(\tau)\right],
\]
where we used \(\frac{1}{Z}p(\tau)\exp\left(r_\psi(\tau)\right) = p(\tau\vert\mathcal{O}_{1:T};\psi)\). Therefore, in essence, we are approximating the first term with samples drawn from the expert's policy \(\pi^*\). The second term is the expectation under the (soft) optimal policy for our current reward function. To estimate this second term, note that:
\[
\mathbb{E}_{\tau\sim p(\tau\vert\mathcal{O}_{1:T};\psi)}\left[\nabla_\psi r_\psi(\tau)\right] = \sum_{t=1}^{T}\mathbb{E}_{(\mathbf{s}_t,\mathbf{a}_t)\sim p(\mathbf{s}_t,\mathbf{a}_t\vert\mathcal{O}_{1:T};\psi)}\left[\nabla_\psi r_\psi(\mathbf{s}_t,\mathbf{a}_t)\right].
\]
From the previous lecture notes we have:
\[
p(\mathbf{a}_t\vert\mathbf{s}_t,\mathcal{O}_{1:T};\psi) = \frac{\beta_t(\mathbf{s}_t,\mathbf{a}_t)}{\beta_t(\mathbf{s}_t)}, \qquad p(\mathbf{s}_t\vert\mathcal{O}_{1:T};\psi) \propto \alpha_t(\mathbf{s}_t)\,\beta_t(\mathbf{s}_t).
\]
Therefore:
\[
p(\mathbf{s}_t,\mathbf{a}_t\vert\mathcal{O}_{1:T};\psi) = p(\mathbf{a}_t\vert\mathbf{s}_t,\mathcal{O}_{1:T};\psi)\,p(\mathbf{s}_t\vert\mathcal{O}_{1:T};\psi) \propto \beta_t(\mathbf{s}_t,\mathbf{a}_t)\,\alpha_t(\mathbf{s}_t).
\]
We can thus use the forward and backward messages to compute \(p(\mathbf{s}_t,\mathbf{a}_t\vert\mathcal{O}_{1:T};\psi)\). In the case of tabular environments (where there are a finite number of states and actions), the two integrals over \(\mathbf{s}_t\) and \(\mathbf{a}_t\) can be computed exactly.
The MaxEnt IRL algorithm repeats the following steps until convergence (a tabular code sketch follows the list):
- Given \(\psi\), compute backward messages \(\beta_t(\mathbf{s}_t,\mathbf{a}_t)\).
- Given \(\psi\), compute forward messages \(\alpha_t(\mathbf{s}_t)\).
- Compute \(p(\mathbf{s}_t,\mathbf{a}_t\vert\mathcal{O}_{1:T};\psi) \propto \beta_t(\mathbf{s}_t,\mathbf{a}_t)\alpha_t(\mathbf{s}_t)\).
- Evaluate \(\nabla_\psi \mathcal{L}\).
- Update \(\psi \leftarrow \psi + \eta\nabla_\psi\mathcal{L}\).
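Below is a minimal tabular sketch of one iteration of this loop, assuming a linear reward \(r_\psi = \Psi^T\mathbb{f}\), a uniform action prior, known dynamics, and illustrative array names (`feat`, `P`, `p0`), none of which are prescribed by the lecture:

```python
import numpy as np

def maxent_irl_step(psi, feat, P, p0, expert_trajs, T, eta=0.01):
    """One gradient step of tabular MaxEnt IRL.

    psi:  (F,) reward weights, r_psi(s, a) = psi @ feat[s, a]
    feat: (S, A, F) feature tensor
    P:    (S, A, S) transition probabilities p(s' | s, a)
    p0:   (S,) initial state distribution
    expert_trajs: list of expert trajectories, each a list of (s, a) index pairs
    """
    S, A, F = feat.shape
    r = feat @ psi                                   # (S, A) rewards
    # Unnormalized p(O_t | s_t, a_t); shifted by r.max() for numerical
    # stability (per-timestep constants cancel after normalization below).
    exp_r = np.exp(r - r.max())

    # Backward messages beta_t(s, a), with a uniform action prior.
    beta = np.zeros((T, S, A))
    beta[T - 1] = exp_r
    for t in range(T - 2, -1, -1):
        beta_next_s = beta[t + 1].mean(axis=1)       # beta_{t+1}(s')
        beta[t] = exp_r * (P @ beta_next_s)          # E_{s'}[beta_{t+1}(s')]

    # Forward messages alpha_t(s) (also unnormalized).
    alpha = np.zeros((T, S))
    alpha[0] = p0
    for t in range(T - 1):
        # sum_{s, a} p(s' | s, a) p(O_t | s, a) p(a | s) alpha_t(s)
        alpha[t + 1] = np.einsum('sap,sa,s->p', P, exp_r / A, alpha[t])

    # State-action marginals rho_t(s, a) ∝ beta_t(s, a) alpha_t(s).
    rho = beta * alpha[:, :, None]
    rho /= rho.sum(axis=(1, 2), keepdims=True)

    # grad L = E_expert[features] - sum_t E_{rho_t}[features]  (linear reward).
    expert_feat = np.mean(
        [sum(feat[s, a] for s, a in tau) for tau in expert_trajs], axis=0)
    model_feat = np.einsum('tsa,saf->f', rho, feat)

    return psi + eta * (expert_feat - model_feat)
```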
The reason why this is called MaxEnt is that, in the case of linear rewards \(r_\psi = \Psi^T\mathbb{f}\), it can be shown that this algorithm optimizes:
\[
\max_{\pi^{r_\psi}} \; \mathcal{H}\left(\pi^{r_\psi}\right) \quad \text{such that} \quad \mathbb{E}_{\pi^{r_\psi}}\left[\mathbb{f}\right] = \mathbb{E}_{\pi^*}\left[\mathbb{f}\right].
\]
Intuitively, this algorithm tries to match the features \(\mathbb{f}\) under the expert and the recovered policies while assuming as little as possible about the expert policy, since it also maximizes the entropy of the recovered policy.
4. Unknown Dynamics and Large State/Action Spaces
The problem with the MaxEnt algorithm is that, to estimate the second expectation in \(\nabla_\psi \mathcal{L}\), it:
- requires a tabular environment, and
- assumes that we know the state-transition dynamics to calculate the forward and backward messages.
We thus require a better way to approximate this second expectation. One way to do so is to simply learn \(p(\mathbf{a}_t\vert\mathbf{s}_t,\mathcal{O}_{1:T};\psi)\) via any maximum-entropy RL algorithm. Recall that maximum-entropy RL algorithms maximize the following objective function:
\[
J(\pi) = \sum_{t=1}^{T}\mathbb{E}_{(\mathbf{s}_t,\mathbf{a}_t)\sim\pi}\left[ r_\psi(\mathbf{s}_t,\mathbf{a}_t) + \mathcal{H}\left(\pi(\cdot\vert\mathbf{s}_t)\right)\right].
\]
We can then simply approximate \(\nabla_\psi \mathcal{L}\) by sampling trajectories \(\{\tau_j\}\) from our learned distribution \(p(\mathbf{a}_t\vert\mathbf{s}_t,\mathcal{O}_{1:T};\psi)\):
\[
\nabla_\psi \mathcal{L} \approx \frac{1}{N}\sum_{i=1}^{N}\nabla_\psi r_\psi(\tau_i) - \frac{1}{M}\sum_{j=1}^{M}\nabla_\psi r_\psi(\tau_j),
\]
where the \(\tau_i\) are the \(N\) expert trajectories and the \(\tau_j\) are the \(M\) sampled trajectories.
One issue with this is that each time \(\psi\) is updated we need to relearn our policy \(p(\mathbf{a}_t\vert\mathbf{s}_t,\mathcal{O}_{1:T};\psi)\), which can be computationally very expensive. One solution is lazy policy optimization, i.e. we only relearn our policy after, say, every \(K\) steps. However, this means that the policy we use to sample trajectories is different from the policy under which we want to take our expectation. To account for this difference we can use importance sampling. Let \(\pi\) denote the policy from which we sample our trajectories. As per our assumption, the trajectory distribution under which we want to take the expectation is proportional to \(p(\tau)\exp(r_\psi(\tau))\). The importance sampling weight \(w\) for a trajectory \(\tau\) is thus:
\[
w(\tau) = \frac{p(\tau)\exp\left(r_\psi(\tau)\right)}{\pi(\tau)} = \frac{p(\mathbf{s}_1)\prod_t p(\mathbf{s}_{t+1}\vert\mathbf{s}_t,\mathbf{a}_t)\exp\left(r_\psi(\mathbf{s}_t,\mathbf{a}_t)\right)}{p(\mathbf{s}_1)\prod_t p(\mathbf{s}_{t+1}\vert\mathbf{s}_t,\mathbf{a}_t)\,\pi(\mathbf{a}_t\vert\mathbf{s}_t)} = \frac{\exp\left(\sum_t r_\psi(\mathbf{s}_t,\mathbf{a}_t)\right)}{\prod_t \pi(\mathbf{a}_t\vert\mathbf{s}_t)}.
\]
Therefore:
\[
\nabla_\psi \mathcal{L} \approx \frac{1}{N}\sum_{i=1}^{N}\nabla_\psi r_\psi(\tau_i) - \frac{1}{\sum_{j} w^{(j)}}\sum_{j=1}^{M} w^{(j)}\,\nabla_\psi r_\psi(\tau_j),
\]
where we have also normalized the \(w^{(j)}\).
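A rough sketch of this importance-sampled gradient estimate, again assuming a linear reward \(r_\psi = \Psi^T\mathbb{f}\) (so that \(\nabla_\psi r_\psi\) is just the feature vector) and illustrative helpers `feature_fn` and `logprob_fn`:

```python
import numpy as np

def irl_gradient_is(psi, feature_fn, expert_trajs, sampled_trajs, logprob_fn):
    """Importance-sampled estimate of grad_psi L for a linear reward
    r_psi(s, a) = psi @ feature_fn(s, a).

    expert_trajs:  trajectories from pi*, each a list of (s, a) pairs
    sampled_trajs: trajectories from the current sampling policy pi
    logprob_fn(s, a): log pi(a | s) under the sampling policy
    """
    def traj_features(tau):
        return sum(feature_fn(s, a) for s, a in tau)

    # First term: expert feature expectations (grad of r_psi is the features).
    expert_term = np.mean([traj_features(tau) for tau in expert_trajs], axis=0)

    # Weights w(tau) = exp(sum_t r_psi) / prod_t pi(a_t | s_t),
    # computed in log space and self-normalized for stability.
    log_w = np.array([
        sum(psi @ feature_fn(s, a) - logprob_fn(s, a) for s, a in tau)
        for tau in sampled_trajs
    ])
    w = np.exp(log_w - log_w.max())
    w /= w.sum()

    # Second term: weighted feature expectations over the sampled trajectories.
    sample_term = sum(w_j * traj_features(tau)
                      for w_j, tau in zip(w, sampled_trajs))

    return expert_term - sample_term
```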
Interestingly enough, it turns out that if we want to estimate \(\mathbb{E}_{x\sim p(x)}[f(x)]\) through an expectation under some other distribution \(q(x)\) via importance sampling, then the optimal (variance-minimizing) distribution to use for \(q(x)\) is:
\[
q(x) \propto p(x)\,\vert f(x)\vert.
\]
In our case the optimal \(\pi\) to use is therefore (approximately):
\[
\pi(\tau) \propto p(\tau)\exp\left(r_\psi(\tau)\right).
\]
Recall that each time we update \(\pi\) with maximum-entropy RL, we bring it closer to \(p(\tau)\exp(r_\psi(\tau))\), i.e. the optimal sampling distribution.
5. Inverse RL as GANs
Reconsider the following:
\[
\nabla_\psi \mathcal{L} \approx \frac{1}{N}\sum_{i=1}^{N}\nabla_\psi r_\psi(\tau_i) - \sum_{j=1}^{M} \hat{w}^{(j)}\,\nabla_\psi r_\psi(\tau_j),
\]
where \(\hat w\) are the normalized importance sampling weights. This gradient can be interpreted as trying to make the trajectories sampled from \(\pi^*(\tau)\) more likely and the ones sampled from \(\pi_\theta\) less likely. In other words, it tries to update \(\psi\) such that trajectories sampled from \(\pi^*(\tau)\) have a higher reward and those sampled from \(\pi_\theta\) have a lower reward.
However, in the next step of the algorithm (as discussed in the previous section) we explicitly update our policy \(\pi_\theta\) using maximum entropy RL to maximize the expected reward under it. Viewed another way, we essentially update our policy so as to make the trajectories sampled from it harder to distinguish from those sampled from \(\pi^*\).
This interpretation of inverse RL is quite similar to the idea behind Generative Adversarial Networks (GANs). Our policy \(\pi_\theta\) is thus the generator. It turns out that the optimal discriminator to use is:
\[
D_\psi(\tau) = \frac{p(\tau)\frac{1}{Z}\exp\left(r_\psi(\tau)\right)}{p(\tau)\frac{1}{Z}\exp\left(r_\psi(\tau)\right) + \pi_\theta(\tau)}.
\]
It can be seen that \(D_\psi\) is close to \(1\) when a trajectory has a high probability under \(\pi^*\) and is close to \(0\) when it has a high probability under \(\pi_\theta\). This is what we desire.
Note that:
\[
\pi_\theta(\tau) = p(\mathbf{s}_1)\prod_t p(\mathbf{s}_{t+1}\vert\mathbf{s}_t,\mathbf{a}_t)\,\pi_\theta(\mathbf{a}_t\vert\mathbf{s}_t),
\]
so the dynamics terms cancel in the numerator and denominator of \(D_\psi\):
\[
D_\psi(\tau) = \frac{\frac{1}{Z}\exp\left(r_\psi(\tau)\right)}{\frac{1}{Z}\exp\left(r_\psi(\tau)\right) + \prod_t \pi_\theta(\mathbf{a}_t\vert\mathbf{s}_t)}.
\]
We can use the standard GAN objective function to update our \(\psi\):
\[
\psi \leftarrow \underset{\psi}{\text{argmax}} \; \mathbb{E}_{\tau\sim\pi^*(\tau)}\left[\log D_\psi(\tau)\right] + \mathbb{E}_{\tau\sim\pi_\theta(\tau)}\left[\log\left(1 - D_\psi(\tau)\right)\right].
\]
It also turns out that we do not need to estimate \(Z\) separately; instead, we can treat \(Z\) as an additional parameter and optimize it directly with the same objective function above.
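A minimal sketch of this discriminator-style update of \(\psi\) (using PyTorch for automatic differentiation; the reward model `r_psi`, the learnable scalar `log_Z`, and the batch format are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def discriminator_logit(r_psi, log_Z, states, actions, policy_logprob):
    """Logit of D_psi(tau) = [(1/Z) exp(r_psi(tau))] /
    [(1/Z) exp(r_psi(tau)) + prod_t pi_theta(a_t | s_t)], i.e.
    log(D / (1 - D)) = r_psi(tau) - log Z - sum_t log pi_theta(a_t | s_t)."""
    traj_reward = r_psi(states, actions).sum(dim=-1)   # r_psi(tau), shape (batch,)
    return traj_reward - log_Z - policy_logprob        # policy_logprob: (batch,)

def reward_update(r_psi, log_Z, optimizer, expert_batch, policy_batch):
    """One GAN-style discriminator update of psi and log_Z.

    Each batch is (states, actions, sum_t log pi_theta(a_t | s_t));
    `optimizer` holds the parameters of r_psi together with log_Z.
    """
    d_expert = discriminator_logit(r_psi, log_Z, *expert_batch)
    d_policy = discriminator_logit(r_psi, log_Z, *policy_batch)
    # Minimizing this loss maximizes E_expert[log D] + E_policy[log(1 - D)].
    loss = (F.binary_cross_entropy_with_logits(d_expert, torch.ones_like(d_expert))
            + F.binary_cross_entropy_with_logits(d_policy, torch.zeros_like(d_policy)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here `log_Z` is simply included in the optimizer as another learnable parameter, which is what lets us avoid estimating \(Z\) separately.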