Suppose that we have a convex function \(f(\theta)\) and we wish to find \(\theta^*\) such that \(f(\theta^*) = 0\). Note that a function \(f(\theta)\) is convex in the interval \([a,b]\) if for any two points \(x_1\), \(x_2\) in \([a,b]\) and any \(\lambda\) where \(0 \leqslant \lambda \leqslant 1\):

$$ f(\lambda x_1 + (1 - \lambda) x_2) \leqslant \lambda f(x_1) + (1 - \lambda) f(x_2) $$
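
To make the definition concrete, here is a quick numerical check of the inequality for the standard convex function \(f(\theta) = \theta^2\); the points and the grid of \(\lambda\) values are arbitrary illustrative choices:

```python
# A quick numerical check of the convexity inequality for f(theta) = theta**2.
# The points x1, x2 and the grid of lambda values are arbitrary choices.
f = lambda theta: theta ** 2

x1, x2 = -1.0, 3.0
for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    lhs = f(lam * x1 + (1 - lam) * x2)
    rhs = lam * f(x1) + (1 - lam) * f(x2)
    assert lhs <= rhs  # the defining inequality holds for every lambda
```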

Alternatively (or consequently), a function \(f(\theta)\) is convex if a tangent line \(g(\theta)\) drawn at any point satisfies the following:

$$ f(\theta) \geqslant g(\theta) \qquad \forall \ \theta $$

Suppose that we choose an arbitrary point \(\theta\) on \(f(\theta)\), such that \(\theta > \theta^*\), and draw a tangent line \(g(\theta)\) at that point. Suppose also that \(g(\theta') = 0\). As a consequence of convexity, we have \(\theta' \geqslant \theta^*\); note also that \(\theta' \leqslant \theta\). We may then repeat this process by setting \(\theta = \theta'\) and drawing a new tangent line. Eventually, \(\theta\) will converge to \(\theta^*\).

The equation of a straight line is \(g(\theta) = g'(\theta)\theta + g(0)\), where the slope \(g'(\theta)\) is constant. At the end of each iteration, we want to set \(\theta = \theta'\) such that \(g(\theta') = 0\). Therefore, we have:

$$ \begin{eqnarray} 0 &=& g'(\theta) \theta' + g(0)\\ \theta' &=& \frac{-g(0)}{g'(\theta)}\\ \end{eqnarray} $$

Using the equation of a straight line, the point at which the tangent was drawn, \((\theta, f(\theta))\), and the facts that \(g(\theta) = f(\theta)\) and \(g'(\theta) = f'(\theta)\), we get \(g(0) = f(\theta) - f'(\theta)\theta\). Substituting this into the expression for \(\theta'\) gives, at each iteration:

$$ \theta := \theta - \frac{f(\theta)}{f'(\theta)} $$
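
As a concrete illustration, here is a minimal sketch of this iteration in Python, used to find the positive root of the convex function \(f(\theta) = \theta^2 - 2\) (i.e. \(\sqrt{2}\)); the starting point, tolerance and iteration cap are illustrative assumptions:

```python
# A minimal sketch of the iteration theta := theta - f(theta) / f'(theta),
# finding the positive root of the convex f(theta) = theta**2 - 2, i.e. sqrt(2).
# The starting point, tolerance and iteration cap are illustrative choices.
def newton_root(f, f_prime, theta, tol=1e-10, max_iter=100):
    for _ in range(max_iter):
        step = f(theta) / f_prime(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

root = newton_root(lambda t: t ** 2 - 2, lambda t: 2 * t, theta=2.0)
print(root)  # ~1.414213..., converging from theta > theta*
```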

Note that the same argument may be made if we had initially chosen \(\theta\) such that \(\theta \leqslant \theta^*\).

Suppose, instead, that we wanted to minimize (or maximize) a function \(f(\theta)\). We know that the solution to the equation \(f'(\theta) = 0\) gives us the required minimum (or maximum). Consequently, applying the same iteration to \(f'(\theta)\), at each iteration we have:

$$ \theta := \theta - \frac{f'(\theta)}{f''(\theta)} $$
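
This is just the root-finding iteration applied to \(f'\). As a hedged sketch, here it is minimizing the illustrative function \(f(\theta) = \theta - \log\theta\), whose minimum lies at \(\theta = 1\):

```python
# A hedged sketch of theta := theta - f'(theta) / f''(theta) for the
# illustrative function f(theta) = theta - log(theta), minimized at theta = 1.
import math

theta = 0.5  # arbitrary starting point in (0, 2) so the iterates stay positive
for _ in range(20):
    f_prime = 1 - 1 / theta          # f'(theta)
    f_double_prime = 1 / theta ** 2  # f''(theta)
    theta -= f_prime / f_double_prime
print(theta, theta - math.log(theta))  # theta ~ 1.0, where f'(theta) = 0
```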

This update is known as Newton's method. It may be further generalized to the case where \(\theta\) is a vector:

$$ \theta := \theta - H^{-1} \nabla_\theta f(\theta) $$

where \(H\) is a matrix, called the Hessian, whose entries are given by:

$$ H_{ij} = \frac{\partial^2 f(\theta)}{\partial \theta_i \partial \theta_j} $$
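
As a sketch of the vector update, consider the illustrative quadratic \(f(\theta) = \frac{1}{2}\theta^T A \theta - b^T \theta\), whose gradient is \(A\theta - b\) and whose Hessian is the constant matrix \(A\):

```python
# A hedged sketch of the update theta := theta - H^{-1} grad f(theta), using
# the illustrative quadratic f(theta) = 0.5 * theta @ A @ theta - b @ theta,
# whose gradient is A @ theta - b and whose Hessian is the constant matrix A.
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # symmetric positive definite Hessian
b = np.array([1.0, 1.0])

theta = np.zeros(2)
for _ in range(5):
    grad = A @ theta - b                      # gradient at the current theta
    H = A                                     # Hessian (constant for a quadratic)
    theta = theta - np.linalg.solve(H, grad)  # solve H d = grad; avoid forming H^{-1}
print(theta)  # the exact minimizer A^{-1} b, reached after a single step
```

In practice one solves the linear system \(Hd = \nabla_\theta f(\theta)\) rather than forming \(H^{-1}\) explicitly, which is cheaper and numerically more stable; for a quadratic, a single Newton step lands exactly on the minimizer.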

This generalization is known as the Newton-Raphson method. Note the following points:

  1. While the Newton-Raphson method usually requires fewer iterations than the stochastic gradient descent algorithm to converge, in practice each iteration of the Newton-Raphson method is much more computationally expensive (because of the computation of the Hessian and its inverse).

  2. When the Newton-Raphson method is used to maximize the logistic regression log likelihood function, the resulting method is also called Fisher scoring (a sketch of this case is given after this list).
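
As an illustration of point 2, here is a hedged sketch of the Newton-Raphson update applied to the logistic regression log likelihood; the synthetic data, seed and iteration count are assumptions. For the logistic model the Hessian of the log likelihood equals its expectation, which is why this iteration coincides with Fisher scoring.

```python
# A hedged sketch of Newton-Raphson maximizing the logistic regression log
# likelihood. The synthetic data, seed and iteration count are assumptions;
# for this model the Hessian equals its expectation, so the iteration
# coincides with Fisher scoring.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # synthetic design matrix
true_theta = np.array([1.5, -2.0])
y = (rng.random(100) < 1 / (1 + np.exp(-X @ true_theta))).astype(float)

theta = np.zeros(2)
for _ in range(10):
    p = 1 / (1 + np.exp(-X @ theta))          # model probabilities
    grad = X.T @ (y - p)                      # gradient of the log likelihood
    H = -(X.T * (p * (1 - p))) @ X            # Hessian of the log likelihood
    theta = theta - np.linalg.solve(H, grad)  # Newton-Raphson step
print(theta)  # approaches the maximum likelihood estimate
```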