In this post we shall talk about (direct) policy search algorithms.
Let \(\Pi\) be a set of policies. Our goal is to search for the best policy \(\pi \in \Pi\). Note that this is analogous to the supervised learning setting, where we defined a set of hypotheses \(\mathcal{H}\) and searched for the best hypothesis \(h \in \mathcal{H}\) (see this).
Let us define a stochastic policy as a function \(\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}\) such that \(\pi(s,a)\) gives the probability of taking action \(a\) in state \(s\). Consequently, note that \(\sum_{a\in\mathcal{A}} \pi(s,a)=1\) and \(\pi(s,a)\geq 0 \ \forall a \in \mathcal{A}\). As an example, suppose that we have a two-action MDP. We may then define:
$$ \begin{eqnarray} \pi_{\theta}(s,a_1) &=& \frac{1}{1+\exp(-\theta^Ts)}\\ \pi_{\theta}(s,a_2) &=& 1-\frac{1}{1+\exp(-\theta^Ts)} \end{eqnarray} $$
where \(\theta\) is a parameter vector to be learned. Or, if we have more than two possible actions, we may choose the softmax parameterization:
$$ \pi_\theta(s,a_i) = \frac{\exp(\theta_i^Ts)}{\sum_j \exp(\theta_j^Ts)} $$
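To make this concrete, here is a minimal NumPy sketch of such a softmax policy. The feature dimension, the number of actions, and the function names (`softmax_policy`, `sample_action`) are illustrative choices, not anything fixed by the discussion above:

```python
import numpy as np

def softmax_policy(theta, s):
    """Return pi_theta(s, .) for a softmax policy.

    theta: array of shape (n_actions, n_features), one row theta_i per action.
    s:     state feature vector of shape (n_features,).
    """
    logits = theta @ s                 # theta_i^T s for each action i
    logits = logits - logits.max()     # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()         # probabilities are nonnegative and sum to 1

def sample_action(theta, s, rng):
    """Sample an action a ~ pi_theta(s, .)."""
    return rng.choice(len(theta), p=softmax_policy(theta, s))

# Example: 3 actions, 2-dimensional state features
rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 2))
s = np.array([1.0, -0.5])
print(softmax_policy(theta, s), sample_action(theta, s, rng))
```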
Our goal is to maximize the expected payoff, i.e. find:
$$ \max_{\theta} \left[\mathbb{E}[R(s_0,a_0)+...+R(s_T,a_T)]\right] $$
We shall refer to the term \(R(s_0,a_0)+...+R(s_T,a_T)\) as the payoff. Note that:
$$ \begin{eqnarray} \max_\theta \left[\mathbb{E}[\text{payoff}]\right] &=& \max_\theta \left[ \sum_{(s_0a_0),...,(s_Ta_T)} P\left((s_0,a_0),...,(s_T,a_T)\right)[R(s_0,a_0)+...+R(s_T,a_T)]\right]\\ &=& \max_\theta \left[\sum_{(s_0a_0),...,(s_Ta_T)}P(s_0)\pi_\theta(s_0,a_0)P_{s_0a_0}(s_1)\pi_\theta(s_1,a_1)...P_{s_{T-1}a_{T-1}}(s_T)\pi_\theta(s_T,a_T) [R(s_0,a_0)+...+R(s_T,a_T)]\right] \end{eqnarray} $$
where the first line follows from the definition of expectation, i.e. \(\mathbb{E}[X] = \sum_{x}P(x)\,x\) (note that the probability of obtaining the payoff \(R(s_0,a_0)+...+R(s_T,a_T)\) is just the probability of the sequence \((s_0,a_0),...,(s_T,a_T)\) occurring). So the summation is over all possible sequences \((s_0,a_0),...,(s_T,a_T)\), and the second line expands the probability of a sequence as the product of the initial-state probability, the action probabilities under \(\pi_\theta\), and the transition probabilities \(P_{s_ta_t}(s_{t+1})\).
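To make the sum over trajectories concrete, the sketch below computes \(\mathbb{E}[\text{payoff}]\) exactly for a tiny, made-up two-state, two-action MDP by enumerating every possible sequence \((s_0,a_0),...,(s_T,a_T)\) and accumulating probability times payoff, exactly as in the second line above. The transition probabilities, rewards, horizon, and tabular softmax parameterization are all illustrative assumptions:

```python
import itertools
import numpy as np

n_states, n_actions, T = 2, 2, 2             # toy sizes, chosen only for illustration
P0 = np.array([1.0, 0.0])                    # initial state distribution P(s_0)
P = np.random.default_rng(1).dirichlet(      # P[s, a] = P_{sa}(.), a random but fixed
    np.ones(n_states), size=(n_states, n_actions))   # transition distribution
R = np.array([[1.0, 0.0], [0.0, 2.0]])       # R[s, a], immediate reward

def pi_theta(theta, s):
    """Softmax over actions for a tabular parameter theta[s, a]."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def expected_payoff(theta):
    """Sum over trajectories of P(s_0) pi(s_0,a_0) P_{s_0 a_0}(s_1) ... times the payoff."""
    total = 0.0
    for states in itertools.product(range(n_states), repeat=T + 1):
        for actions in itertools.product(range(n_actions), repeat=T + 1):
            prob, payoff = P0[states[0]], 0.0
            for t in range(T + 1):
                prob *= pi_theta(theta, states[t])[actions[t]]
                if t < T:
                    prob *= P[states[t], actions[t], states[t + 1]]
                payoff += R[states[t], actions[t]]
            total += prob * payoff
    return total

print(expected_payoff(np.zeros((n_states, n_actions))))
```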
One possible algorithm to find the best policy is as follows:
Loop
{
\(\ \ \ \ \\) 1. Sample a trajectory \((s_0,a_0),...,(s_T,a_T)\) by executing the current policy \(\pi_\theta\) in the MDP
\(\ \ \ \ \\) 2. Compute \(\text{payoff} = R(s_0,a_0)+...+R(s_T,a_T)\)
\(\ \ \ \ \\) 3. Update:
$$ \theta := \theta + \alpha\left[\frac{\nabla_\theta \pi_\theta(s_0,a_0)}{\pi_\theta(s_0,a_0)} + ... + \frac{\nabla_\theta \pi_\theta(s_T,a_T)}{\pi_\theta(s_T,a_T)} \right] \times \text{payoff} $$
}
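A minimal sketch of this loop for a tabular softmax policy is given below. The environment interface (`env_reset`, `env_step`), the horizon, the learning rate, and the tabular parameterization are all assumptions made for illustration; the update uses the fact that \(\nabla_\theta \pi_\theta(s_t,a_t)/\pi_\theta(s_t,a_t) = \nabla_\theta \log \pi_\theta(s_t,a_t)\), which has a simple closed form for this particular softmax:

```python
import numpy as np

def reinforce(env_reset, env_step, n_states, n_actions, T,
              alpha=0.01, n_iters=1000, seed=0):
    """Monte Carlo policy gradient with a tabular softmax policy.

    env_reset() -> s_0 and env_step(s, a) -> (s_next, reward) are assumed to be
    provided by the caller; they stand in for sampling from P(s_0) and P_{sa}.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_states, n_actions))

    def pi(s):
        z = np.exp(theta[s] - theta[s].max())
        return z / z.sum()

    for _ in range(n_iters):
        # 1. Sample a trajectory (s_0, a_0), ..., (s_T, a_T) by running pi_theta
        s = env_reset()
        traj, payoff = [], 0.0
        for _t in range(T + 1):
            a = rng.choice(n_actions, p=pi(s))
            traj.append((s, a))
            s, r = env_step(s, a)
            payoff += r                  # 2. payoff = R(s_0,a_0) + ... + R(s_T,a_T)

        # 3. theta := theta + alpha * (sum_t grad pi / pi) * payoff, using
        #    d log pi(s,a) / d theta[s, a'] = 1{a' = a} - pi(s, a') for this softmax
        grad = np.zeros_like(theta)
        for (s_t, a_t) in traj:
            grad[s_t] -= pi(s_t)
            grad[s_t, a_t] += 1.0
        theta += alpha * grad * payoff

    return theta

# Example usage with a trivial one-state environment (illustrative only):
# theta = reinforce(lambda: 0, lambda s, a: (0, float(a)), n_states=1, n_actions=2, T=4)
```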
The update step is just stochastic gradient ascent on \(\theta\) (we are maximizing the expected payoff) because:
$$ \begin{eqnarray} \nabla_\theta\mathbb{E}[\text{payoff}] &=& \sum_{(s_0a_0),...,(s_Ta_T)}P(s_0)\nabla_\theta(\pi_\theta(s_0,a_0))P_{s_0a_0}(s_1)\pi_\theta(s_1,a_1)...P_{s_{T-1}a_{T-1}}(s_T)\pi_\theta(s_T,a_T) [R(s_0,a_0)+...+R(s_T,a_T)]\\ && +P(s_0)\pi_\theta(s_0,a_0)P_{s_0a_0}(s_1)\nabla_\theta(\pi_\theta(s_1,a_1))...P_{s_{T-1}a_{T-1}}(s_T)\pi_\theta(s_T,a_T) [R(s_0,a_0)+...+R(s_T,a_T)]\\ && +\ \cdots\\ && + P(s_0)\pi_\theta(s_0,a_0)P_{s_0a_0}(s_1)\pi_\theta(s_1,a_1)...P_{s_{T-1}a_{T-1}}(s_T)\nabla_\theta(\pi_\theta(s_T,a_T)) [R(s_0,a_0)+...+R(s_T,a_T)]\\ &=& \sum_{(s_0a_0),...,(s_Ta_T)} P(s_0)\pi_\theta(s_0,a_0)...P_{s_{T-1}a_{T-1}}(s_T)\pi_\theta(s_T,a_T)\left[\frac{\nabla_\theta \pi_\theta(s_0,a_0)}{\pi_\theta(s_0,a_0)} + ... + \frac{\nabla_\theta \pi_\theta(s_T,a_T)}{\pi_\theta(s_T,a_T)}\right][R(s_0,a_0)+...+R(s_T,a_T)]\\ &=& \sum_{(s_0a_0),...,(s_Ta_T)} P\left((s_0,a_0),...,(s_T,a_T)\right)\left[\frac{\nabla_\theta \pi_\theta(s_0,a_0)}{\pi_\theta(s_0,a_0)} + ... + \frac{\nabla_\theta \pi_\theta(s_T,a_T)}{\pi_\theta(s_T,a_T)}\right][R(s_0,a_0)+...+R(s_T,a_T)]\\ &=& \mathbb{E}\left[\left(\frac{\nabla_\theta \pi_\theta(s_0,a_0)}{\pi_\theta(s_0,a_0)} + ... + \frac{\nabla_\theta \pi_\theta(s_T,a_T)}{\pi_\theta(s_T,a_T)}\right)[R(s_0,a_0)+...+R(s_T,a_T)]\right]\\ &=& \mathbb{E}\left[\left(\frac{\nabla_\theta \pi_\theta(s_0,a_0)}{\pi_\theta(s_0,a_0)} + ... + \frac{\nabla_\theta \pi_\theta(s_T,a_T)}{\pi_\theta(s_T,a_T)}\right)\times \text{payoff}\right] \end{eqnarray} $$
where the first step follows from the product rule of differentiation: since the initial-state and transition probabilities do not depend on \(\theta\), only the \(\pi_\theta\) factors are differentiated. The remaining steps are simple algebraic manipulation.
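As a sanity check on the identity \(\nabla_\theta\mathbb{E}[\text{payoff}] = \mathbb{E}\left[\left(\sum_t \frac{\nabla_\theta \pi_\theta(s_t,a_t)}{\pi_\theta(s_t,a_t)}\right)\times \text{payoff}\right]\), the sketch below verifies the single-step (\(T=0\)) case numerically: for a small made-up softmax-over-actions example, the exact expectation of the score-function expression matches a finite-difference estimate of \(\nabla_\theta\mathbb{E}[R]\). The three actions and their rewards are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=3)                   # logits for a 3-action softmax (toy example)
R = np.array([1.0, -0.5, 2.0])               # reward for each action (made up)

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def expected_reward(theta):
    return pi(theta) @ R                     # E_{a ~ pi_theta}[R(a)]

# Exact expectation of the score-function expression:
#   E[(grad pi(a) / pi(a)) R(a)] = sum_a pi(a) * grad log pi(a) * R(a),
# where grad log pi(a) = e_a - pi for a softmax over logits theta.
p = pi(theta)
score_grad = sum(p[a] * (np.eye(3)[a] - p) * R[a] for a in range(3))

# Central finite-difference gradient of E[R] for comparison
eps = 1e-6
fd_grad = np.array([
    (expected_reward(theta + eps * np.eye(3)[i])
     - expected_reward(theta - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])

print(np.allclose(score_grad, fd_grad, atol=1e-6))   # True: the two gradients agree
```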