Support Vector Machines
Suppose that we have a training set \(\{(x^{(i)},y^{(i)});\ i=1,\ldots,m\}\), where \(y^{(i)} \in \{-1,+1\}\). Let us assume that the \(x^{(i)}\)'s are linearly separable, i.e. there exists some hyperplane \(w^Tx+b=0\), for appropriate choices of \(w\) and \(b\), such that \(w^Tx^{(i)}+b < 0\) whenever \(y^{(i)}=-1\) and \(w^Tx^{(i)}+b>0\) whenever \(y^{(i)}=+1\). Our goal is to find such a hyperplane.
1. Functional and Geometric Margins
We shall define the functional margin of a training example \((x^{(i)},y^{(i)})\) with respect to \((w,b)\) as follows:
$$ \hat{\gamma}^{(i)} = y^{(i)}(w^Tx^{(i)}+b) $$
Note that:
- A large functional margin indicates greater confidence in a prediction.
- A positive functional margin indicates a correct prediction on a particular example.
- The functional margin of an entire training set is given by:
$$ \hat{\gamma} = \min_i \hat{\gamma}^{(i)} $$
- The functional margin can be made arbitrarily large by rescaling \(w\) and \(b\) (e.g. replacing them with \(10w\) and \(10b\)). However, note that this does not change the predictions on the training examples (the predictions only depend on the sign of \(w^Tx^{(i)}+b\)).
To address the issue raised in the last point above, we introduce the concept of the geometric margin.
Suppose that we have a (hyper)plane in 3D space. Let \(P_0=(x_0, y_0, z_0)\) be a known point on this plane, so the vector from the origin \((0,0,0)\) to this point is just \(\vec{P_0} = \langle x_0,y_0,z_0\rangle\). Suppose that \(P=(x,y,z)\) is an arbitrary point on the plane. The vector joining \(P_0\) and \(P\) is then given by:
$$ \vec{P} - \vec{P_0} = \langle x-x_0,\ y-y_0,\ z-z_0\rangle $$
Note that this vector lies in the plane. Now let \(\hat{n}\) be the normal (orthogonal) vector to the plane. Therefore:
$$ \hat{n} \bullet (\vec{P} - \vec{P_0}) = 0 $$
Or:
$$ \hat{n} \bullet \vec{P} - \hat{n} \bullet \vec{P_0} = 0 $$
Note that \(-\hat{n} \bullet \vec{P_0}\) is just a number. Therefore, for the hyperplane \(w^Tx+b=0\), \(w\) plays the role of the normal vector \(\hat{n}\), \(b\) plays the role of \(-\hat{n}\bullet \vec{P_0}\), and the point \(x\) plays the role of \(P\).
Suppose that we draw a vector from the hyperplane to a point \(x^{(i)}\) in the training set, perpendicular to the hyperplane. The magnitude \(\gamma^{(i)}\) of this vector is known as the geometric margin. Note that this vector points in the direction of the unit vector \(y^{(i)}\frac{w}{\vert\vert w \vert\vert}\), so the point at which it meets the hyperplane is given by \(x^{(i)}-y^{(i)}\gamma^{(i)}\frac{w}{\vert\vert w \vert\vert}\). Since this point lies on the hyperplane:
$$ w^T\left(x^{(i)}-y^{(i)}\gamma^{(i)}\frac{w}{\vert\vert w \vert\vert}\right)+b = 0 $$
Solving for \(\gamma^{(i)}\) shows that the geometric margin \(\gamma^{(i)}\) is related to the functional margin \(\hat{\gamma}^{(i)}\) by the following:
$$ \gamma^{(i)} = y^{(i)}\left(\frac{w^Tx^{(i)}+b}{\vert\vert w \vert\vert}\right) = \frac{\hat{\gamma}^{(i)}}{\vert\vert w \vert\vert} $$
Note that the geometric margin is invariant to rescaling of the parameters.
We define the geometric margin of the entire training set to be:
$$ \gamma = \min_i \gamma^{(i)} $$
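As a quick illustration of these definitions, here is a minimal NumPy sketch (the data, \(w\), and \(b\) below are made up purely for illustration) that computes the functional and geometric margins of a toy training set and checks that rescaling \((w,b)\) inflates the functional margin while leaving the geometric margin unchanged:

```python
import numpy as np

# Toy linearly separable data (illustrative values, not from the text).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])

# An arbitrary separating hyperplane w^T x + b = 0.
w = np.array([1.0, 1.0])
b = -1.0

def margins(w, b, X, y):
    functional = y * (X @ w + b)                 # hat{gamma}^{(i)} for each example
    geometric = functional / np.linalg.norm(w)   # gamma^{(i)} = hat{gamma}^{(i)} / ||w||
    return functional.min(), geometric.min()

print(margins(w, b, X, y))            # (hat{gamma}, gamma)
print(margins(10 * w, 10 * b, X, y))  # functional margin scales by 10, geometric margin is unchanged
```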
2. The Optimal Margin Classifier
Our goal is to maximize the geometric margin. In that respect, consider the following optimization problem:
$$ \begin{aligned} \max_{\gamma,w,b}\quad & \gamma \\ \text{s.t.}\quad & y^{(i)}(w^Tx^{(i)}+b) \geqslant \gamma,\ \ i=1,\ldots,m \\ & \vert\vert w \vert\vert = 1 \end{aligned} $$
Note that the condition \(\vert\vert w \vert\vert = 1\) ensures that the functional margin is equal to the geometric margin. We may rewrite the above problem as follows:
$$ \begin{aligned} \max_{\hat{\gamma},w,b}\quad & \frac{\hat{\gamma}}{\vert\vert w \vert\vert} \\ \text{s.t.}\quad & y^{(i)}(w^Tx^{(i)}+b) \geqslant \hat{\gamma},\ \ i=1,\ldots,m \end{aligned} $$
Note that we can make \(\hat{\gamma}\) equal to any arbitrary value by simply rescaling the parameters (without changing anything meaningful). Let us choose \(\hat{\gamma}=1\). Since maximizing \(\frac{1}{\vert\vert w \vert\vert}\) is the same as minimizing \(\vert\vert w \vert\vert^2\), we may then pose the following optimization problem:
$$ \begin{aligned} \min_{w,b}\quad & \frac{1}{2}\vert\vert w \vert\vert^2 \\ \text{s.t.}\quad & y^{(i)}(w^Tx^{(i)}+b) \geqslant 1,\ \ i=1,\ldots,m \end{aligned} $$
Note that we have essentially transformed a non-convex problem into a convex one. We shall now introduce Lagrange duality, which will help us solve this problem.
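Before turning to duality, note that this primal problem is a small quadratic program and can be handed to a generic constrained solver. The sketch below uses `scipy.optimize.minimize` on a made-up toy dataset purely to illustrate the problem statement above; it is not how SVMs are solved in practice:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative values).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = X.shape[1]

# Decision variables packed as theta = [w_1, ..., w_n, b].
def objective(theta):
    w = theta[:n]
    return 0.5 * w @ w  # (1/2)||w||^2

# One inequality constraint per example: y_i (w^T x_i + b) - 1 >= 0.
constraints = [
    {"type": "ineq", "fun": (lambda theta, i=i: y[i] * (X[i] @ theta[:n] + theta[n]) - 1.0)}
    for i in range(len(y))
]

result = minimize(objective, x0=np.zeros(n + 1), constraints=constraints, method="SLSQP")
w_opt, b_opt = result.x[:n], result.x[n]
print("w =", w_opt, "b =", b_opt)
print("functional margins:", y * (X @ w_opt + b_opt))  # all >= 1, tight for the support vectors
```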
3. Lagrange Duality
Suppose that we have the following optimization problem:
$$ \begin{aligned} \min_{w}\quad & f(w) \\ \text{s.t.}\quad & h_i(w) = 0,\ \ i=1,\ldots,l \end{aligned} $$
We may solve this problem by defining the Lagrangian:
$$ \mathcal{L}(w,\beta) = f(w) + \sum_{i=1}^{l}\beta_i h_i(w) $$
and solving for:
$$ \frac{\partial \mathcal{L}}{\partial w_i} = 0, \qquad \frac{\partial \mathcal{L}}{\partial \beta_i} = 0 $$
For problems of the following form:
$$ \begin{aligned} \min_{w}\quad & f(w) \\ \text{s.t.}\quad & g_i(w) \leqslant 0,\ \ i=1,\ldots,k \\ & h_i(w) = 0,\ \ i=1,\ldots,l \end{aligned} $$
we may define the generalized Lagrangian:
$$ \mathcal{L}(w,\alpha,\beta) = f(w) + \sum_{i=1}^{k}\alpha_i g_i(w) + \sum_{i=1}^{l}\beta_i h_i(w) $$
Consider the following quantity:
$$ \theta_P(w) = \max_{\alpha,\beta\,:\,\alpha_i \geqslant 0} \mathcal{L}(w,\alpha,\beta) $$
Note that if \(w\) satisfies the constraints \(g_i(w) \leqslant 0\) and \(h_i(w) = 0\), then (since the maximization is over \(\alpha_i \geqslant 0\)) \(\theta_P(w)\) is just equal to \(f(w)\). Otherwise, it equals \(+\infty\).
Let us define the primal problem to be \(p^*=\min_w\theta_P(w)\). Note that this is the same problem we were trying to solve initially (i.e. \(\min_w f(w)\) subject to some constraints).
Alternatively, consider the following problem, called the dual problem:
$$ d^* = \max_{\alpha,\beta\,:\,\alpha_i \geqslant 0} \theta_D(\alpha,\beta) = \max_{\alpha,\beta\,:\,\alpha_i \geqslant 0}\ \min_{w}\ \mathcal{L}(w,\alpha,\beta) $$
It follows from the fact that the \(\min\max\) of a function is always greater than or equal to the \(\max\min\) that:
$$ d^* = \max_{\alpha,\beta\,:\,\alpha_i \geqslant 0}\ \min_{w}\ \mathcal{L}(w,\alpha,\beta) \leqslant \min_{w}\ \max_{\alpha,\beta\,:\,\alpha_i \geqslant 0}\ \mathcal{L}(w,\alpha,\beta) = p^* $$
It can be shown that if the \(h_i\)'s are affine, \(f\) and the \(g_i\)'s are convex, and the constraints are (strictly) feasible, i.e. \(g_i(w) < 0\) \(\forall\) \(i\) for some \(w\), then there must exist \(w^*\), \(\alpha^*\), \(\beta^*\) such that \(w^*\) is the solution to the primal problem, \(\alpha^*\), \(\beta^*\) are the solutions to the dual problem, and:
$$ p^* = d^* = \mathcal{L}(w^*,\alpha^*,\beta^*) $$
Moreover, the following Karush-Kuhn-Tucker (KKT) conditions are also satisfied:
$$ \begin{aligned} \frac{\partial}{\partial w_i}\mathcal{L}(w^*,\alpha^*,\beta^*) &= 0,\ \ i=1,\ldots,n \\ \frac{\partial}{\partial \beta_i}\mathcal{L}(w^*,\alpha^*,\beta^*) &= 0,\ \ i=1,\ldots,l \\ \alpha_i^*\, g_i(w^*) &= 0,\ \ i=1,\ldots,k \\ g_i(w^*) &\leqslant 0,\ \ i=1,\ldots,k \\ \alpha_i^* &\geqslant 0,\ \ i=1,\ldots,k \end{aligned} $$
The third equation above is known as the KKT dual complementarity condition; it essentially requires that, for each \(i\), at least one of \(\alpha_i^*\) and \(g_i(w^*)\) be zero.
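As a small sanity check of these conditions, consider the toy problem (this example is mine, not from the original notes) of minimizing \(f(w)=w^2\) subject to \(g(w)=1-w\leqslant 0\). The Lagrangian is \(\mathcal{L}(w,\alpha)=w^2+\alpha(1-w)\) with \(\alpha\geqslant 0\), and the KKT conditions read:
$$ 2w-\alpha = 0, \qquad \alpha(1-w)=0, \qquad 1-w\leqslant 0, \qquad \alpha\geqslant 0 $$
If \(\alpha=0\) then \(w=0\), which violates \(1-w\leqslant 0\); hence \(1-w=0\), giving \(w^*=1\) and \(\alpha^*=2\). The dual function \(\theta_D(\alpha)=\min_w\mathcal{L}(w,\alpha)=\alpha-\frac{\alpha^2}{4}\) is maximized at \(\alpha=2\), where \(d^*=1=p^*\), as promised.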
4. Optimal Margin Classifiers
Let us return to the following optimization problem:
$$ \begin{aligned} \min_{w,b}\quad & \frac{1}{2}\vert\vert w \vert\vert^2 \\ \text{s.t.}\quad & y^{(i)}(w^Tx^{(i)}+b) \geqslant 1,\ \ i=1,\ldots,m \end{aligned} $$
Writing the constraints as \(g_i(w,b) = -y^{(i)}(w^Tx^{(i)}+b) + 1 \leqslant 0\), note that \(g_i(w,b) = 0\) exactly when the functional margin of \(x^{(i)}\) is \(1\). Hence, by the KKT dual complementarity condition, \(\alpha_i > 0\) is possible only for points whose functional margin equals \(1\). These points are known as the support vectors, and they are the ones closest to the hyperplane.
We may formulate the Lagrangian for the above problem:
$$ \mathcal{L}(w,b,\alpha) = \frac{1}{2}\vert\vert w \vert\vert^2 - \sum_{i=1}^{m}\alpha_i\left[y^{(i)}(w^Tx^{(i)}+b) - 1\right] $$
Setting the gradient with respect to \(w\) to zero gives:
$$ \nabla_w\mathcal{L}(w,b,\alpha) = w - \sum_{i=1}^{m}\alpha_iy^{(i)}x^{(i)} = 0 \quad\Longrightarrow\quad w = \sum_{i=1}^{m}\alpha_iy^{(i)}x^{(i)} $$
And setting the derivative with respect to \(b\) to zero gives:
$$ \frac{\partial}{\partial b}\mathcal{L}(w,b,\alpha) = \sum_{i=1}^{m}\alpha_iy^{(i)} = 0 $$
Substituting \(w\) back into \(\mathcal{L}(w,b,\alpha)\) gives:
$$ W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}y^{(i)}y^{(j)}\alpha_i\alpha_j \langle x^{(i)},x^{(j)}\rangle $$
Note that in the last step we used the fact that \(\sum_{i=1}^{m}\alpha_iy^{(i)} = 0\). Hence, we may state the following dual optimization problem:
$$ \begin{aligned} \max_{\alpha}\quad & W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}y^{(i)}y^{(j)}\alpha_i\alpha_j \langle x^{(i)},x^{(j)}\rangle \\ \text{s.t.}\quad & \alpha_i \geqslant 0,\ \ i=1,\ldots,m \\ & \sum_{i=1}^{m}\alpha_iy^{(i)} = 0 \end{aligned} $$
Once we have solved for \(\alpha\), we can find \(w\) using the equation we derived earlier:
$$ w = \sum_{i=1}^{m}\alpha_iy^{(i)}x^{(i)} $$
To solve for \(b\), consider two support vectors \(x_1\) and \(x_2\) whose functional margins equal \(1\), with \(y_1 = 1\) and \(y_2=-1\). Therefore \(w^Tx_1 + b=1\) and \(w^Tx_2 + b=-1\). Adding the two equations gives:
$$ b = -\frac{w^Tx_1 + w^Tx_2}{2} = -\frac{\min_{i:y^{(i)}=1}w^Tx^{(i)} + \max_{i:y^{(i)}=-1}w^Tx^{(i)}}{2} $$
Note that in the last step we used the fact that all points have a functional margin of at least \(1\), i.e. \(1\) is the minimum functional margin a point can have.
Also, note that the equation that we will use to make a prediction on an input \(x\):
$$ w^Tx + b = \left(\sum_{i=1}^{m}\alpha_iy^{(i)}x^{(i)}\right)^Tx + b = \sum_{i=1}^{m}\alpha_iy^{(i)}\langle x^{(i)},x\rangle + b $$
only depends on the inner products between the training examples \(x^{(i)}\) and the input \(x\). Moreover, most of the \(\alpha_i\)'s, as discussed previously, will be \(0\).
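To make this concrete, here is a short sketch (using scikit-learn's `SVC` as the solver, which is my choice here rather than anything in the original notes) that fits a linear-kernel SVM and reconstructs the decision value \(\sum_i\alpha_iy^{(i)}\langle x^{(i)},x\rangle + b\) from the support vectors alone:

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (illustrative values).
X = np.array([[2.0, 2.0], [3.0, 3.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, -3.0], [-1.5, -0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # a large C approximates the hard-margin SVM

# dual_coef_ stores alpha_i * y_i for the support vectors only; all other alpha_i are 0.
sv = clf.support_vectors_
alpha_y = clf.dual_coef_.ravel()
b = clf.intercept_[0]

x_new = np.array([0.5, 0.5])
decision = alpha_y @ (sv @ x_new) + b  # sum_i alpha_i y_i <x_i, x_new> + b
print(decision, clf.decision_function([x_new])[0])  # the two values agree
print("number of support vectors:", len(sv), "out of", len(X))
```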
5. Kernels
Suppose that we map the input vector \(x\) to a higher (possibly infinite) dimensional vector space and feed the resulting vector \(\phi(x)\) to an SVM. Consequently, we would replace \(x\) in the SVM formulation with \(\phi(x)\). Let us define \(K(x,z)=\phi(x)^T\phi(z)\), where \(K(x,z)\) is called the kernel. The key point is that for many choices of \(\phi\), the inner product \(\phi(x)^T\phi(z)\) can be calculated using only \(x\) and \(z\), i.e. without explicitly finding \(\phi(x)\) or \(\phi(z)\).
Let us define the kernel matrix \(K\) such that \(K_{ij} = K(x^{(i)},x^{(j)})\). Then, for \(K\) to be a valid kernel, we must have \(K_{ij}=K_{ji}\) because \(\phi(x^{(i)})^T\phi(x^{(j)})\) \(=\) \(\phi(x^{(j)})^T\phi(x^{(i)})\). Also, for any vector \(z\):
$$ z^TKz = \sum_{i}\sum_{j}z_iK_{ij}z_j = \sum_{i}\sum_{j}z_i\,\phi(x^{(i)})^T\phi(x^{(j)})\,z_j = \left\vert\left\vert\sum_{i}z_i\phi(x^{(i)})\right\vert\right\vert^2 \geqslant 0 $$
which shows that \(K\) is positive semidefinite. It turns out that for \(K\) to be a valid (Mercer) kernel it is necessary and sufficient that the corresponding kernel matrix is symmetric and positive semidefinite for any finite set of points.
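The following sketch (my own illustration) builds the kernel matrix for the Gaussian (RBF) kernel \(K(x,z)=\exp(-\vert\vert x-z\vert\vert^2/2\sigma^2)\) on a few random points and numerically checks the two properties above, symmetry and positive semidefiniteness:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # 20 random 3-dimensional points
sigma = 1.0

def gaussian_kernel(x, z, sigma):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

# Kernel matrix K_ij = K(x_i, x_j)
K = np.array([[gaussian_kernel(xi, xj, sigma) for xj in X] for xi in X])

print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())  # >= 0 up to numerical error
```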
6. Regularization and the Non-Separable Case
Consider the following problem:
$$ \begin{aligned} \min_{w,b,\xi}\quad & \frac{1}{2}\vert\vert w \vert\vert^2 + C\sum_{i=1}^{m}\xi_i \\ \text{s.t.}\quad & y^{(i)}(w^Tx^{(i)}+b) \geqslant 1 - \xi_i,\ \ i=1,\ldots,m \\ & \xi_i \geqslant 0,\ \ i=1,\ldots,m \end{aligned} $$
Note that the regularization term allows points to have a functional margin of less than \(1\), but penalizes them for it. The Lagrangian is thus given by:
$$ \mathcal{L}(w,b,\xi,\alpha,r) = \frac{1}{2}w^Tw + C\sum_{i=1}^{m}\xi_i - \sum_{i=1}^{m}\alpha_i\left[y^{(i)}(w^Tx^{(i)}+b) - 1 + \xi_i\right] - \sum_{i=1}^{m}r_i\xi_i $$
Setting the gradient with respect to \(w\) to zero gives:
$$ w = \sum_{i=1}^{m}\alpha_iy^{(i)}x^{(i)} $$
And setting the derivative with respect to \(b\) to zero gives:
$$ \sum_{i=1}^{m}\alpha_iy^{(i)} = 0 $$
Also, setting the derivative with respect to \(\xi_i\) to zero gives:
$$ \frac{\partial}{\partial \xi_i}\mathcal{L}(w,b,\xi,\alpha,r) = C - \alpha_i - r_i = 0 $$
Note that the last equation implies that \(\alpha_i \leqslant C\) (because \(r_i\geqslant 0\)); combined with \(\alpha_i \geqslant 0\), this gives \(0\leqslant \alpha_i\leqslant C\).
Substituting \(w\) into \(\mathcal{L}(w, b, \xi, \alpha, r)\) and simplifying (see Section 4 above) gives the same \(W(\alpha)\) as before:
$$ W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}y^{(i)}y^{(j)}\alpha_i\alpha_j \langle x^{(i)},x^{(j)}\rangle $$
We formulate the dual problem as follows:
$$ \begin{aligned} \max_{\alpha}\quad & W(\alpha) \\ \text{s.t.}\quad & 0 \leqslant \alpha_i \leqslant C,\ \ i=1,\ldots,m \\ & \sum_{i=1}^{m}\alpha_iy^{(i)} = 0 \end{aligned} $$
Also, the KKT dual complementarity conditions are now given by:
$$ \begin{aligned} \alpha_i = 0 &\implies y^{(i)}(w^Tx^{(i)}+b) \geqslant 1 \\ \alpha_i = C &\implies y^{(i)}(w^Tx^{(i)}+b) \leqslant 1 \\ 0 < \alpha_i < C &\implies y^{(i)}(w^Tx^{(i)}+b) = 1 \end{aligned} $$
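As an illustration of the box constraint \(0 \leqslant \alpha_i \leqslant C\) (again using scikit-learn's `SVC` as a stand-in solver, which is not part of the original notes), one can check that the fitted dual coefficients \(\alpha_iy^{(i)}\) never exceed \(C\) in magnitude:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping (non-separable) Gaussian blobs.
X = np.vstack([rng.normal(loc=-0.5, size=(50, 2)), rng.normal(loc=0.5, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.1, 1.0, 10.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    alpha = np.abs(clf.dual_coef_.ravel())  # |alpha_i * y_i| = alpha_i for the support vectors
    print(f"C={C}: max alpha = {alpha.max():.3f}, support vectors = {len(alpha)}")
    assert np.all(alpha <= C + 1e-8)  # the box constraint 0 <= alpha_i <= C holds
```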
7. Coordinate Ascent
Consider the following (unconstrained) optimization problem:
$$ \max_{\alpha}\ W(\alpha_1,\alpha_2,\ldots,\alpha_m) $$
The coordinate ascent algorithm does the following:
Loop until convergence
{
\(\;\;\)For \(i =1,...,m\) do:
\(\;\;\;\;\)\(\max_{\hat{\alpha}_i} W(\alpha_1,\alpha_2,...,\hat{\alpha}_i,...,\alpha_m)\)
}
i.e. at each iteration, \(W(\alpha_1,\alpha_2,...,\alpha_m)\) is optimized with respect to only one parameter while all the others are held constant.
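Here is a minimal sketch of coordinate ascent on a simple concave quadratic, \(W(\alpha_1,\alpha_2) = \alpha_1+\alpha_2-\alpha_1^2-\alpha_2^2-\alpha_1\alpha_2\) (an objective chosen only for illustration); each inner step maximizes \(W\) exactly over a single coordinate:

```python
import numpy as np

def W(a1, a2):
    # A simple concave objective, chosen only to illustrate the algorithm.
    return a1 + a2 - a1**2 - a2**2 - a1 * a2

a = np.zeros(2)
for it in range(20):
    # Maximize over alpha_1 with alpha_2 fixed: dW/da1 = 1 - 2*a1 - a2 = 0
    a[0] = (1.0 - a[1]) / 2.0
    # Maximize over alpha_2 with alpha_1 fixed: dW/da2 = 1 - 2*a2 - a1 = 0
    a[1] = (1.0 - a[0]) / 2.0

print(a, W(*a))  # converges to alpha_1 = alpha_2 = 1/3, the global maximum
```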
8. Sequential Minimal Optimization
We have the following (dual) problem:
$$ \begin{aligned} \max_{\alpha}\quad & W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}y^{(i)}y^{(j)}\alpha_i\alpha_j \langle x^{(i)},x^{(j)}\rangle \\ \text{s.t.}\quad & 0 \leqslant \alpha_i \leqslant C,\ \ i=1,\ldots,m \\ & \sum_{i=1}^{m}\alpha_iy^{(i)} = 0 \end{aligned} $$
Note that we cannot use the coordinate ascent algorithm, as presented above, here because of the second constraint, which essentially states that any one \(\alpha_i\) is exactly determined by the other \(\alpha_i\)'s. We therefore modify the coordinate ascent algorithm as follows:
Loop until convergence
{
\(\;\;\)For some \(i,j=1,...,m\) do:
\(\;\;\;\;\)\(\max_{\hat{\alpha}_i,\hat{\alpha}_j} W(\alpha_1,...,\hat{\alpha}_i,...,\hat{\alpha}_j,...,\alpha_m)\)
}
i.e. we optimize \(W(\alpha_1,...,\alpha_m)\) with respect to two parameters at each iteration while holding the rest fixed. The constraint \(\sum_{i=1}^{m}\alpha_iy^{(i)}=0\) then implies that at each iteration we have:
$$ \alpha_iy^{(i)} + \alpha_jy^{(j)} = \zeta $$
where \(\zeta\) is a constant. Note that at each iteration all \(\alpha_k\)'s with \(k \neq i,j\) are held constant. Note also that \(\alpha_i\) and \(\alpha_j\) must lie between \(0\) and \(C\). Suppose that \(L\) and \(H\) are the resulting lower and upper bounds on \(\alpha_j\), respectively. We may rewrite \(W(\alpha_1,...,\hat{\alpha}_i,...,\hat{\alpha}_j,...,\alpha_m)\) as \(W(\alpha_1,...,y^{(i)}(\zeta-y^{(j)}\alpha_j),...,\hat{\alpha}_j,...,\alpha_m)\), which is just a quadratic function of \(\hat{\alpha}_j\). Setting its derivative to zero yields the optimal value of \(\alpha_j\), which we shall denote by \(\alpha_j^{unclipped}\). As \(\alpha_j\) is constrained to lie between \(L\) and \(H\), we clip its value as follows:
$$ \alpha_j = \begin{cases} H & \text{if } \alpha_j^{unclipped} > H \\ \alpha_j^{unclipped} & \text{if } L \leqslant \alpha_j^{unclipped} \leqslant H \\ L & \text{if } \alpha_j^{unclipped} < L \end{cases} $$
The convergence of this algorithm can be tested using the KKT dual complementarity conditions.
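As a rough illustration of a single SMO pair update, here is a sketch in NumPy. The bound formulas for \(L\) and \(H\) and the update rule follow the standard simplified-SMO treatment and are not spelled out in the text above, so treat this as an assumed sketch rather than the notes' own algorithm:

```python
import numpy as np

def clip(alpha_unclipped, L, H):
    # Clip the unconstrained optimum back into [L, H].
    return max(L, min(H, alpha_unclipped))

def pair_bounds(alpha_i, alpha_j, y_i, y_j, C):
    # Bounds on alpha_j implied by 0 <= alpha_i, alpha_j <= C and
    # y_i*alpha_i + y_j*alpha_j = zeta (standard simplified-SMO formulas, assumed here).
    if y_i != y_j:
        return max(0.0, alpha_j - alpha_i), min(C, C + alpha_j - alpha_i)
    return max(0.0, alpha_i + alpha_j - C), min(C, alpha_i + alpha_j)

def smo_pair_update(i, j, alpha, y, K, b, C):
    # Prediction errors E_k = f(x_k) - y_k, with f(x) = sum_k alpha_k y_k K(x_k, x) + b.
    f = K @ (alpha * y) + b
    E_i, E_j = f[i] - y[i], f[j] - y[j]

    L, H = pair_bounds(alpha[i], alpha[j], y[i], y[j], C)
    eta = 2 * K[i, j] - K[i, i] - K[j, j]  # second derivative of W along the constraint line
    if L == H or eta >= 0:
        return alpha  # nothing useful to do for this pair

    alpha_j_new = clip(alpha[j] - y[j] * (E_i - E_j) / eta, L, H)
    # Keep y_i*alpha_i + y_j*alpha_j = zeta by moving alpha_i the opposite way.
    alpha_i_new = alpha[i] + y[i] * y[j] * (alpha[j] - alpha_j_new)

    alpha = alpha.copy()
    alpha[i], alpha[j] = alpha_i_new, alpha_j_new
    return alpha
```

In a full implementation, the threshold \(b\) would also be updated after each pair update, and the pairs \((i,j)\) would be chosen by heuristics until the KKT conditions above hold to within some tolerance.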