Support Vector Machines
Suppose that we have a training set \(\{(x^{(i)},y^{(i)});\ i=1,\ldots,m\}\), where \(y^{(i)} \in \{-1,+1\}\). Let us assume that the \(x^{(i)}\)'s are linearly separable, i.e. there exists some hyperplane \(w^Tx+b=0\), for appropriate choices of \(w\) and \(b\), such that \(w^Tx^{(i)}+b < 0\) whenever \(y^{(i)}=-1\) and \(w^Tx^{(i)}+b>0\) whenever \(y^{(i)}=+1\). Our goal is to find such a hyperplane.
1. Functional and Geometric Margins
We shall define the functional margin of a training example \((x^{(i)},y^{(i)})\) with respect to \((w,b)\) as follows:
$$ \hat{\gamma}^{(i)} = y^{(i)}(w^Tx^{(i)}+b) $$
Note that:
- A large functional margin indicates greater confidence in a prediction.
- A positive functional margin indicates a correct prediction on a particular example.
- The functional margin of an entire training set is given by:
$$ \hat{\gamma} = \min_i \hat{\gamma}^{(i)} $$
- The functional margin can be made arbitrarily large by rescaling \(w\) and \(b\) (e.g. replacing them with \(10w\) and \(10b\)). However, note that this does not change the predictions on the training examples (the predictions only depend on the sign of \(w^Tx^{(i)}+b\)).
To address the issue raised in the last point above, we introduce the concept of the geometric margin.
Suppose that we have a (hyper)plane in 3D space. Let \(P_0=(x_0, y_0, z_0)\) be a known point on this plane, so the vector from the origin \((0,0,0)\) to this point is just \(\vec{P_0} = \langle x_0,y_0,z_0\rangle\). Suppose that \(P=(x,y,z)\) is an arbitrary point on the plane. The vector joining \(P_0\) and \(P\) is then given by:
$$ \vec{P} - \vec{P_0} = \langle x-x_0,\ y-y_0,\ z-z_0\rangle $$
Note that this vector lies in the plane. Now let \(\hat{n}\) be the normal (orthogonal) vector to the plane. Therefore:
$$ \hat{n} \bullet (\vec{P} - \vec{P_0}) = 0 $$
Or:
$$ \hat{n} \bullet \vec{P} - \hat{n} \bullet \vec{P_0} = 0 $$
Note that \(-\hat{n} \bullet \vec{P_0}\) is just a number. Therefore, for the hyperplane \(w^Tx+b=0\), \(w\) plays the role of the normal vector \(\hat{n}\), \(b\) plays the role of \(-\hat{n}\bullet \vec{P_0}\), and the point \(x\) plays the role of \(P\).
Suppose that we draw a vector from the hyperplane to a point \(x^{(i)}\) in the training set, perpendicular to the hyperplane. The magnitude \(\gamma^{(i)}\) of this vector is known as the geometric margin. Note that this vector points in the direction of the unit vector \(y^{(i)}\frac{w}{\vert\vert w \vert\vert}\), so the point at which it meets the hyperplane is given by \(x^{(i)}-y^{(i)}\gamma^{(i)}\frac{w}{\vert\vert w \vert\vert}\). Since this point lies on the hyperplane:
$$ w^T\left(x^{(i)}-y^{(i)}\gamma^{(i)}\frac{w}{\vert\vert w \vert\vert}\right)+b = 0 $$
Solving for \(\gamma^{(i)}\) shows that the geometric margin \(\gamma^{(i)}\) is related to the functional margin \(\hat{\gamma}^{(i)}\) by the following:
$$ \gamma^{(i)} = y^{(i)}\left(\frac{w^Tx^{(i)}+b}{\vert\vert w \vert\vert}\right) = \frac{\hat{\gamma}^{(i)}}{\vert\vert w \vert\vert} $$
Note that the geometric margin is invariant to rescaling of the parameters.
We define the geometric margin of the entire training set to be:
$$ \gamma = \min_i \gamma^{(i)} $$
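As a quick illustration of these definitions, here is a minimal NumPy sketch (the data, \(w\), and \(b\) below are made up purely for illustration) that computes the functional and geometric margins of a toy training set and checks that rescaling \((w,b)\) inflates the functional margin while leaving the geometric margin unchanged:

```python
import numpy as np

# Toy linearly separable data (illustrative values, not from the text).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])

# An arbitrary separating hyperplane w^T x + b = 0.
w = np.array([1.0, 1.0])
b = -1.0

def margins(w, b, X, y):
    functional = y * (X @ w + b)                 # hat{gamma}^{(i)} for each example
    geometric = functional / np.linalg.norm(w)   # gamma^{(i)} = hat{gamma}^{(i)} / ||w||
    return functional.min(), geometric.min()

print(margins(w, b, X, y))            # (hat{gamma}, gamma)
print(margins(10 * w, 10 * b, X, y))  # functional margin scales by 10, geometric margin is unchanged
```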
2. The Optimal Margin Classifier
Our goal is to maximize the geometric margin. In that respect, consider the following optimization problem:
$$ \begin{aligned} \max_{\gamma,w,b}\quad & \gamma \\ \text{s.t.}\quad & y^{(i)}(w^Tx^{(i)}+b) \geqslant \gamma,\ \ i=1,\ldots,m \\ & \vert\vert w \vert\vert = 1 \end{aligned} $$
Note that the condition \(\vert\vert w \vert\vert = 1\) ensures that the functional margin is equal to the geometric margin. We may rewrite the above problem as follows:
$$ \begin{aligned} \max_{\hat{\gamma},w,b}\quad & \frac{\hat{\gamma}}{\vert\vert w \vert\vert} \\ \text{s.t.}\quad & y^{(i)}(w^Tx^{(i)}+b) \geqslant \hat{\gamma},\ \ i=1,\ldots,m \end{aligned} $$
Note that we can make \(\hat{\gamma}\) equal to any arbitrary value by simply rescaling the parameters (without changing anything meaningful). Let us choose \(\hat{\gamma}=1\). Since maximizing \(\frac{1}{\vert\vert w \vert\vert}\) is the same as minimizing \(\vert\vert w \vert\vert^2\), we may then pose the following optimization problem:
$$ \begin{aligned} \min_{w,b}\quad & \frac{1}{2}\vert\vert w \vert\vert^2 \\ \text{s.t.}\quad & y^{(i)}(w^Tx^{(i)}+b) \geqslant 1,\ \ i=1,\ldots,m \end{aligned} $$
Note that we have essentially transformed a non-convex problem into a convex one. We shall now introduce Lagrange duality, which will help us solve this problem.
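Before turning to duality, note that this primal problem is a small quadratic program and can be handed to a generic constrained solver. The sketch below uses `scipy.optimize.minimize` on a made-up toy dataset purely to illustrate the problem statement above; it is not how SVMs are solved in practice:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative values).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = X.shape[1]

# Decision variables packed as theta = [w_1, ..., w_n, b].
def objective(theta):
    w = theta[:n]
    return 0.5 * w @ w  # (1/2)||w||^2

# One inequality constraint per example: y_i (w^T x_i + b) - 1 >= 0.
constraints = [
    {"type": "ineq", "fun": (lambda theta, i=i: y[i] * (X[i] @ theta[:n] + theta[n]) - 1.0)}
    for i in range(len(y))
]

result = minimize(objective, x0=np.zeros(n + 1), constraints=constraints, method="SLSQP")
w_opt, b_opt = result.x[:n], result.x[n]
print("w =", w_opt, "b =", b_opt)
print("functional margins:", y * (X @ w_opt + b_opt))  # all >= 1, tight for the support vectors
```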
3. Lagrange Duality
Suppose that we have the following optimization problem:
$$ \begin{aligned} \min_{w}\quad & f(w) \\ \text{s.t.}\quad & h_i(w) = 0,\ \ i=1,\ldots,l \end{aligned} $$
We may solve this problem by defining the Lagrangian:
$$ \mathcal{L}(w,\beta) = f(w) + \sum_{i=1}^{l}\beta_i h_i(w) $$
and solving for:
$$ \frac{\partial \mathcal{L}}{\partial w_i} = 0, \qquad \frac{\partial \mathcal{L}}{\partial \beta_i} = 0 $$
For problems of the following form:
$$ \begin{aligned} \min_{w}\quad & f(w) \\ \text{s.t.}\quad & g_i(w) \leqslant 0,\ \ i=1,\ldots,k \\ & h_i(w) = 0,\ \ i=1,\ldots,l \end{aligned} $$
we may define the generalized Lagrangian:
$$ \mathcal{L}(w,\alpha,\beta) = f(w) + \sum_{i=1}^{k}\alpha_i g_i(w) + \sum_{i=1}^{l}\beta_i h_i(w) $$
Consider the following quantity:
$$ \theta_P(w) = \max_{\alpha,\beta\,:\,\alpha_i \geqslant 0} \mathcal{L}(w,\alpha,\beta) $$
Note that if \(w\) satisfies the constraints \(g_i(w) \leqslant 0\) and \(h_i(w) = 0\), then (since the maximization is over \(\alpha_i \geqslant 0\)) \(\theta_P(w)\) is just equal to \(f(w)\). Otherwise, it equals \(+\infty\).
Let us define the primal problem to be \(p^*=\min_w\theta_P(w)\). Note that this is the same problem we were trying to solve initially (i.e. \(\min_w f(w)\) subject to some constraints).
Alternatively, consider the following problem, called the dual problem:
$$ d^* = \max_{\alpha,\beta\,:\,\alpha_i \geqslant 0} \theta_D(\alpha,\beta) = \max_{\alpha,\beta\,:\,\alpha_i \geqslant 0}\ \min_{w}\ \mathcal{L}(w,\alpha,\beta) $$
It follows from the fact that the \(\min\max\) of a function is always greater than or equal to the \(\max\min\) that:
$$ d^* = \max_{\alpha,\beta\,:\,\alpha_i \geqslant 0}\ \min_{w}\ \mathcal{L}(w,\alpha,\beta) \leqslant \min_{w}\ \max_{\alpha,\beta\,:\,\alpha_i \geqslant 0}\ \mathcal{L}(w,\alpha,\beta) = p^* $$
It can be shown that if the \(h_i\)'s are affine, \(f\) and the \(g_i\)'s are convex, and the constraints are (strictly) feasible, i.e. \(g_i(w) < 0\) \(\forall\) \(i\) for some \(w\), then there must exist \(w^*\), \(\alpha^*\), \(\beta^*\) such that \(w^*\) is the solution to the primal problem, \(\alpha^*\), \(\beta^*\) are the solutions to the dual problem, and:
$$ p^* = d^* = \mathcal{L}(w^*,\alpha^*,\beta^*) $$
Moreover, the following Karush-Kuhn-Tucker (KKT) conditions are also satisfied:
$$ \begin{aligned} \frac{\partial}{\partial w_i}\mathcal{L}(w^*,\alpha^*,\beta^*) &= 0,\ \ i=1,\ldots,n \\ \frac{\partial}{\partial \beta_i}\mathcal{L}(w^*,\alpha^*,\beta^*) &= 0,\ \ i=1,\ldots,l \\ \alpha_i^*\, g_i(w^*) &= 0,\ \ i=1,\ldots,k \\ g_i(w^*) &\leqslant 0,\ \ i=1,\ldots,k \\ \alpha_i^* &\geqslant 0,\ \ i=1,\ldots,k \end{aligned} $$
The third equation above is known as the KKT dual complementarity condition; it essentially requires that, for each \(i\), at least one of \(\alpha_i^*\) and \(g_i(w^*)\) be zero.
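As a small sanity check of these conditions, consider the toy problem (this example is mine, not from the original notes) of minimizing \(f(w)=w^2\) subject to \(g(w)=1-w\leqslant 0\). The Lagrangian is \(\mathcal{L}(w,\alpha)=w^2+\alpha(1-w)\) with \(\alpha\geqslant 0\), and the KKT conditions read:
$$ 2w-\alpha = 0, \qquad \alpha(1-w)=0, \qquad 1-w\leqslant 0, \qquad \alpha\geqslant 0 $$
If \(\alpha=0\) then \(w=0\), which violates \(1-w\leqslant 0\); hence \(1-w=0\), giving \(w^*=1\) and \(\alpha^*=2\). The dual function \(\theta_D(\alpha)=\min_w\mathcal{L}(w,\alpha)=\alpha-\frac{\alpha^2}{4}\) is maximized at \(\alpha=2\), where \(d^*=1=p^*\), as promised.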
4. Optimal Margin Classifiers
Let us return to the following optimization problem:
$$ \begin{aligned} \min_{w,b}\quad & \frac{1}{2}\vert\vert w \vert\vert^2 \\ \text{s.t.}\quad & y^{(i)}(w^Tx^{(i)}+b) \geqslant 1,\ \ i=1,\ldots,m \end{aligned} $$
Writing the constraints as \(g_i(w,b) = -y^{(i)}(w^Tx^{(i)}+b) + 1 \leqslant 0\), note that \(g_i(w,b) = 0\) exactly when the functional margin of \(x^{(i)}\) is \(1\). Hence, by the KKT dual complementarity condition, \(\alpha_i > 0\) is possible only for points whose functional margin equals \(1\). These points are known as the support vectors, and they are the ones closest to the hyperplane.
We may formulate the Lagrangian for the above problem:
$$ \mathcal{L}(w,b,\alpha) = \frac{1}{2}\vert\vert w \vert\vert^2 - \sum_{i=1}^{m}\alpha_i\left[y^{(i)}(w^Tx^{(i)}+b) - 1\right] $$
Setting the gradient with respect to \(w\) to zero gives:
$$ \nabla_w\mathcal{L}(w,b,\alpha) = w - \sum_{i=1}^{m}\alpha_iy^{(i)}x^{(i)} = 0 \quad\Longrightarrow\quad w = \sum_{i=1}^{m}\alpha_iy^{(i)}x^{(i)} $$
And setting the derivative with respect to \(b\) to zero gives:
$$ \frac{\partial}{\partial b}\mathcal{L}(w,b,\alpha) = \sum_{i=1}^{m}\alpha_iy^{(i)} = 0 $$
Substituting \(w\) back into \(\mathcal{L}(w,b,\alpha)\) gives:
$$ W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}y^{(i)}y^{(j)}\alpha_i\alpha_j \langle x^{(i)},x^{(j)}\rangle $$
Note that in the last step we used the fact that \(\sum_{i=1}^{m}\alpha_iy^{(i)} = 0\). Hence, we may state the following dual optimization problem:
$$ \begin{aligned} \max_{\alpha}\quad & W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}y^{(i)}y^{(j)}\alpha_i\alpha_j \langle x^{(i)},x^{(j)}\rangle \\ \text{s.t.}\quad & \alpha_i \geqslant 0,\ \ i=1,\ldots,m \\ & \sum_{i=1}^{m}\alpha_iy^{(i)} = 0 \end{aligned} $$
Once we have solved for \(\alpha\), we can find \(w\) using the equation we derived earlier:
$$ w = \sum_{i=1}^{m}\alpha_iy^{(i)}x^{(i)} $$
To solve for \(b\), consider two support vectors \(x_1\) and \(x_2\) whose functional margins equal \(1\), with \(y_1 = 1\) and \(y_2=-1\). Therefore \(w^Tx_1 + b=1\) and \(w^Tx_2 + b=-1\). Adding the two equations gives:
$$ b = -\frac{w^Tx_1 + w^Tx_2}{2} = -\frac{\min_{i:y^{(i)}=1}w^Tx^{(i)} + \max_{i:y^{(i)}=-1}w^Tx^{(i)}}{2} $$
Note that in the last step we used the fact that all points have a functional margin of at least \(1\), i.e. \(1\) is the minimum functional margin a point can have.
Also, note that the equation that we will use to make a prediction on an input \(x\):
$$ w^Tx + b = \left(\sum_{i=1}^{m}\alpha_iy^{(i)}x^{(i)}\right)^Tx + b = \sum_{i=1}^{m}\alpha_iy^{(i)}\langle x^{(i)},x\rangle + b $$
only depends on the inner products between the training examples \(x^{(i)}\) and the input \(x\). Moreover, most of the \(\alpha_i\)'s, as discussed previously, will be \(0\).
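To make this concrete, here is a short sketch (using scikit-learn's `SVC` as the solver, which is my choice here rather than anything in the original notes) that fits a linear-kernel SVM and reconstructs the decision value \(\sum_i\alpha_iy^{(i)}\langle x^{(i)},x\rangle + b\) from the support vectors alone:

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (illustrative values).
X = np.array([[2.0, 2.0], [3.0, 3.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, -3.0], [-1.5, -0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # a large C approximates the hard-margin SVM

# dual_coef_ stores alpha_i * y_i for the support vectors only; all other alpha_i are 0.
sv = clf.support_vectors_
alpha_y = clf.dual_coef_.ravel()
b = clf.intercept_[0]

x_new = np.array([0.5, 0.5])
decision = alpha_y @ (sv @ x_new) + b  # sum_i alpha_i y_i <x_i, x_new> + b
print(decision, clf.decision_function([x_new])[0])  # the two values agree
print("number of support vectors:", len(sv), "out of", len(X))
```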
5. Kernels
Suppose that we map the input vector \(x\) to a higher (possibly infinite) dimensional vector space and feed the resulting vector \(\phi(x)\) to an SVM. Consequently, we would replace \(x\) in the SVM formulation with \(\phi(x)\). Let us define \(K(x,z)=\phi(x)^T\phi(z)\), where \(K(x,z)\) is called the kernel. The key point is that for many choices of \(\phi\), the inner product \(\phi(x)^T\phi(z)\) can be calculated using only \(x\) and \(z\), i.e. without explicitly finding \(\phi(x)\) or \(\phi(z)\).
Let us define the kernel matrix \(K\) such that \(K_{ij} = K(x^{(i)},x^{(j)})\). Then, for \(K\) to be a valid kernel, we must have \(K_{ij}=K_{ji}\) because \(\phi(x^{(i)})^T\phi(x^{(j)})\) \(=\) \(\phi(x^{(j)})^T\phi(x^{(i)})\). Also, for any vector \(z\):
$$ z^TKz = \sum_{i}\sum_{j}z_iK_{ij}z_j = \sum_{i}\sum_{j}z_i\,\phi(x^{(i)})^T\phi(x^{(j)})\,z_j = \left\vert\left\vert\sum_{i}z_i\phi(x^{(i)})\right\vert\right\vert^2 \geqslant 0 $$
which shows that \(K\) is positive semidefinite. It turns out that for \(K\) to be a valid (Mercer) kernel it is necessary and sufficient that the corresponding kernel matrix is symmetric and positive semidefinite for any finite set of points.
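The following sketch (my own illustration) builds the kernel matrix for the Gaussian (RBF) kernel \(K(x,z)=\exp(-\vert\vert x-z\vert\vert^2/2\sigma^2)\) on a few random points and numerically checks the two properties above, symmetry and positive semidefiniteness:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # 20 random 3-dimensional points
sigma = 1.0

def gaussian_kernel(x, z, sigma):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

# Kernel matrix K_ij = K(x_i, x_j)
K = np.array([[gaussian_kernel(xi, xj, sigma) for xj in X] for xi in X])

print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())  # >= 0 up to numerical error
```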
6. Regularization and the Non-Separable Case
Consider the following problem:
$$ \begin{aligned} \min_{w,b,\xi}\quad & \frac{1}{2}\vert\vert w \vert\vert^2 + C\sum_{i=1}^{m}\xi_i \\ \text{s.t.}\quad & y^{(i)}(w^Tx^{(i)}+b) \geqslant 1 - \xi_i,\ \ i=1,\ldots,m \\ & \xi_i \geqslant 0,\ \ i=1,\ldots,m \end{aligned} $$
Note that the regularization term allows points to have a functional margin of less than \(1\), but penalizes them for it. The Lagrangian is thus given by:
$$ \mathcal{L}(w,b,\xi,\alpha,r) = \frac{1}{2}w^Tw + C\sum_{i=1}^{m}\xi_i - \sum_{i=1}^{m}\alpha_i\left[y^{(i)}(w^Tx^{(i)}+b) - 1 + \xi_i\right] - \sum_{i=1}^{m}r_i\xi_i $$
Setting the gradient with respect to \(w\) to zero gives:
$$ w = \sum_{i=1}^{m}\alpha_iy^{(i)}x^{(i)} $$
And setting the derivative with respect to \(b\) to zero gives:
$$ \sum_{i=1}^{m}\alpha_iy^{(i)} = 0 $$
Also, setting the derivative with respect to \(\xi_i\) to zero gives:
$$ \frac{\partial}{\partial \xi_i}\mathcal{L}(w,b,\xi,\alpha,r) = C - \alpha_i - r_i = 0 $$
Note that the last equation implies that \(\alpha_i \leqslant C\) (because \(r_i\geqslant 0\)); combined with \(\alpha_i \geqslant 0\), this gives \(0\leqslant \alpha_i\leqslant C\).
Substituting \(w\) into \(\mathcal{L}(w, b, \xi, \alpha, r)\) and simplifying (see Section 4 above) gives the same \(W(\alpha)\) as before:
$$ W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}y^{(i)}y^{(j)}\alpha_i\alpha_j \langle x^{(i)},x^{(j)}\rangle $$
We formulate the dual problem as follows:
$$ \begin{aligned} \max_{\alpha}\quad & W(\alpha) \\ \text{s.t.}\quad & 0 \leqslant \alpha_i \leqslant C,\ \ i=1,\ldots,m \\ & \sum_{i=1}^{m}\alpha_iy^{(i)} = 0 \end{aligned} $$
Also, the KKT dual complementarity conditions are now given by:
$$ \begin{aligned} \alpha_i = 0 &\implies y^{(i)}(w^Tx^{(i)}+b) \geqslant 1 \\ \alpha_i = C &\implies y^{(i)}(w^Tx^{(i)}+b) \leqslant 1 \\ 0 < \alpha_i < C &\implies y^{(i)}(w^Tx^{(i)}+b) = 1 \end{aligned} $$
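As an illustration of the box constraint \(0 \leqslant \alpha_i \leqslant C\) (again using scikit-learn's `SVC` as a stand-in solver, which is not part of the original notes), one can check that the fitted dual coefficients \(\alpha_iy^{(i)}\) never exceed \(C\) in magnitude:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping (non-separable) Gaussian blobs.
X = np.vstack([rng.normal(loc=-0.5, size=(50, 2)), rng.normal(loc=0.5, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.1, 1.0, 10.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    alpha = np.abs(clf.dual_coef_.ravel())  # |alpha_i * y_i| = alpha_i for the support vectors
    print(f"C={C}: max alpha = {alpha.max():.3f}, support vectors = {len(alpha)}")
    assert np.all(alpha <= C + 1e-8)  # the box constraint 0 <= alpha_i <= C holds
```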
7. Coordinate Ascent
Consider the following (unconstrained) optimization problem:
$$ \max_{\alpha}\ W(\alpha_1,\alpha_2,\ldots,\alpha_m) $$
The coordinate ascent algorithm does the following:
Loop until convergence
{
\(\;\;\)For \(i =1,...,m\) do:
\(\;\;\;\;\)\(\max_{\hat{\alpha}_i} W(\alpha_1,\alpha_2,...,\hat{\alpha}_i,...,\alpha_m)\)
}
i.e. at each iteration, \(W(\alpha_1,\alpha_2,...,\alpha_m)\) is optimized with respect to only one parameter while all the others are held constant.
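Here is a minimal sketch of coordinate ascent on a simple concave quadratic, \(W(\alpha_1,\alpha_2) = \alpha_1+\alpha_2-\alpha_1^2-\alpha_2^2-\alpha_1\alpha_2\) (an objective chosen only for illustration); each inner step maximizes \(W\) exactly over a single coordinate:

```python
import numpy as np

def W(a1, a2):
    # A simple concave objective, chosen only to illustrate the algorithm.
    return a1 + a2 - a1**2 - a2**2 - a1 * a2

a = np.zeros(2)
for it in range(20):
    # Maximize over alpha_1 with alpha_2 fixed: dW/da1 = 1 - 2*a1 - a2 = 0
    a[0] = (1.0 - a[1]) / 2.0
    # Maximize over alpha_2 with alpha_1 fixed: dW/da2 = 1 - 2*a2 - a1 = 0
    a[1] = (1.0 - a[0]) / 2.0

print(a, W(*a))  # converges to alpha_1 = alpha_2 = 1/3, the global maximum
```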
8. Sequential Minimal Optimization
We have the following (dual) problem:
$$ \begin{aligned} \max_{\alpha}\quad & W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}y^{(i)}y^{(j)}\alpha_i\alpha_j \langle x^{(i)},x^{(j)}\rangle \\ \text{s.t.}\quad & 0 \leqslant \alpha_i \leqslant C,\ \ i=1,\ldots,m \\ & \sum_{i=1}^{m}\alpha_iy^{(i)} = 0 \end{aligned} $$
Note that we cannot use the coordinate ascent algorithm, as presented above, here because of the second constraint, which essentially states that any one \(\alpha_i\) is exactly determined by the other \(\alpha_i\)'s. We therefore modify the coordinate ascent algorithm as follows:
Loop until convergence
{
\(\;\;\)For some \(i,j=1,...,m\) do:
\(\;\;\;\;\)\(\max_{\hat{\alpha}_i,\hat{\alpha}_j} W(\alpha_1,...,\hat{\alpha}_i,...,\hat{\alpha}_j,...,\alpha_m)\)
}
i.e. we optimize \(W(\alpha_1,...,\alpha_m)\) with respect to two parameters at each iteration while holding the rest fixed. The constraint \(\sum_{i=1}^{m}\alpha_iy^{(i)}=0\) then implies that at each iteration we have:
$$ \alpha_iy^{(i)} + \alpha_jy^{(j)} = \zeta $$
where \(\zeta\) is a constant. Note that at each iteration all \(\alpha_k\)'s with \(k \neq i,j\) are held constant. Note also that \(\alpha_i\) and \(\alpha_j\) must lie between \(0\) and \(C\). Suppose that \(L\) and \(H\) are the resulting lower and upper bounds on \(\alpha_j\), respectively. We may rewrite \(W(\alpha_1,...,\hat{\alpha}_i,...,\hat{\alpha}_j,...,\alpha_m)\) as \(W(\alpha_1,...,y^{(i)}(\zeta-y^{(j)}\alpha_j),...,\hat{\alpha}_j,...,\alpha_m)\), which is just a quadratic function of \(\hat{\alpha}_j\). Setting its derivative to zero yields the optimal value of \(\alpha_j\), which we shall denote by \(\alpha_j^{unclipped}\). As \(\alpha_j\) is constrained to lie between \(L\) and \(H\), we clip its value as follows:
$$ \alpha_j = \begin{cases} H & \text{if } \alpha_j^{unclipped} > H \\ \alpha_j^{unclipped} & \text{if } L \leqslant \alpha_j^{unclipped} \leqslant H \\ L & \text{if } \alpha_j^{unclipped} < L \end{cases} $$
The convergence of this algorithm can be tested using the KKT dual complementarity conditions.
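As a rough illustration of a single SMO pair update, here is a sketch in NumPy. The bound formulas for \(L\) and \(H\) and the update rule follow the standard simplified-SMO treatment and are not spelled out in the text above, so treat this as an assumed sketch rather than the notes' own algorithm:

```python
import numpy as np

def clip(alpha_unclipped, L, H):
    # Clip the unconstrained optimum back into [L, H].
    return max(L, min(H, alpha_unclipped))

def pair_bounds(alpha_i, alpha_j, y_i, y_j, C):
    # Bounds on alpha_j implied by 0 <= alpha_i, alpha_j <= C and
    # y_i*alpha_i + y_j*alpha_j = zeta (standard simplified-SMO formulas, assumed here).
    if y_i != y_j:
        return max(0.0, alpha_j - alpha_i), min(C, C + alpha_j - alpha_i)
    return max(0.0, alpha_i + alpha_j - C), min(C, alpha_i + alpha_j)

def smo_pair_update(i, j, alpha, y, K, b, C):
    # Prediction errors E_k = f(x_k) - y_k, with f(x) = sum_k alpha_k y_k K(x_k, x) + b.
    f = K @ (alpha * y) + b
    E_i, E_j = f[i] - y[i], f[j] - y[j]

    L, H = pair_bounds(alpha[i], alpha[j], y[i], y[j], C)
    eta = 2 * K[i, j] - K[i, i] - K[j, j]  # second derivative of W along the constraint line
    if L == H or eta >= 0:
        return alpha  # nothing useful to do for this pair

    alpha_j_new = clip(alpha[j] - y[j] * (E_i - E_j) / eta, L, H)
    # Keep y_i*alpha_i + y_j*alpha_j = zeta by moving alpha_i the opposite way.
    alpha_i_new = alpha[i] + y[i] * y[j] * (alpha[j] - alpha_j_new)

    alpha = alpha.copy()
    alpha[i], alpha[j] = alpha_i_new, alpha_j_new
    return alpha
```

In a full implementation, the threshold \(b\) would also be updated after each pair update, and the pairs \((i,j)\) would be chosen by heuristics until the KKT conditions above hold to within some tolerance.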