SUPPORT VECTOR MACHINES (part 2)
Lagrange Multipliers and Kernel Trick
In last week’s blog, I explained the intuition and math behind the concepts of support vector machines (SVMs) and how they achieve maximum margin classification.
We ended with the problem below that we need to solve: find w and b such that

$$
\frac{\|w\|^2}{2} \ \text{is minimized, subject to}\quad y_i\,(w \cdot x_i + b) - 1 \;\ge\; 0 \ \text{ for all } (x_i, y_i),
$$

where $y_i = +1$ for all $x^{+}$ and $y_i = -1$ for all $x^{-}$.
Now, how do you calculate the extremes (min or max) of a function? You take the first derivative, set it equal to 0, solve, and then run the first and second derivative tests. But here the function is bounded by the constraints mentioned above. To make it easy on ourselves, we will use Lagrange multipliers.
Using Lagrange multipliers, we come up with a new equation that can be solved without worrying about the constraints.
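Written out, the new equation (the standard form of the Lagrangian for this problem) is:

$$
L = \frac{\|w\|^2}{2} - \sum_i \alpha_i \left[\, y_i\,(w \cdot x_i + b) - 1 \,\right]
$$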

The L here is the Lagrangian function and the alphas are Lagrange multipliers. We can now minimize L without worrying about the constraints by taking the derivative and setting it to 0. Let's do that.
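Taking the derivative of L with respect to w and setting it to 0, we get the standard result:

$$
\frac{\partial L}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0
\quad\Longrightarrow\quad
w = \sum_i \alpha_i y_i x_i
$$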

Note: Remember that we are calculating the derivative of L w.r.t. w, a vector, so it is slightly different from what we would do with a scalar.
This equation means that the weight vector is a linear combination of the input vectors (though not all of them contribute). This mathematical convenience is a big part of what makes SVMs easy to apply.
Also, note that alpha is non-zero only for the support vectors (the vectors lying on the gutter); for the remaining vectors it is 0. (https://www.svms.org/kkt/)
Ok. We also need to differentiate L with respect to b, since it is the second variable we need to tune (w and b are the variables we need to find to maximize the distance between the gutters). Let's do that.
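Taking the derivative of L with respect to b and setting it to 0:

$$
\frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0
\quad\Longrightarrow\quad
\sum_i \alpha_i y_i = 0
$$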

Ok. This also looks simple. No logs, powers, etc. that would make things complex. Now, let's substitute the value of w into the Lagrangian to see if we can simplify it and find any other convenient equations. This shit is gonna get heavy. Buckle up.
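Substituting $w = \sum_j \alpha_j y_j x_j$ everywhere w appears and expanding (with the terms reordered slightly), we get:

$$
L = \sum_i \alpha_i \;-\; b\sum_i \alpha_i y_i
\;+\; \frac{1}{2}\Big(\sum_i \alpha_i y_i x_i\Big)\cdot\Big(\sum_j \alpha_j y_j x_j\Big)
\;-\; \Big(\sum_i \alpha_i y_i x_i\Big)\cdot\Big(\sum_j \alpha_j y_j x_j\Big)
$$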

I hope I got this right. Be careful with the dot products in the above equation. Notice that I am using i and j to distinguish different summations over the same variables. We have substituted w into L, but we also have another equation (the derivative of L w.r.t. b) that equals 0. Let's substitute that too.
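Since $\sum_i \alpha_i y_i = 0$, the term with b drops out:

$$
L = \sum_i \alpha_i
\;+\; \frac{1}{2}\Big(\sum_i \alpha_i y_i x_i\Big)\cdot\Big(\sum_j \alpha_j y_j x_j\Big)
\;-\; \Big(\sum_i \alpha_i y_i x_i\Big)\cdot\Big(\sum_j \alpha_j y_j x_j\Big)
$$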

Ok. We got rid of one summation. But wait, there is something interesting here: the second and third parts of the equation involve the same dot product. Combining them, we get
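$$
L = \sum_i \alpha_i \;-\; \frac{1}{2}\Big(\sum_i \alpha_i y_i x_i\Big)\cdot\Big(\sum_j \alpha_j y_j x_j\Big)
$$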

Let’s also simplify the summations and use double summation to make the equation less scary.
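Using a double summation, the whole thing collapses to:

$$
L = \sum_i \alpha_i \;-\; \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j)
$$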

Yep. This looks better. Keep an eye on the last dot product. It tells us that the entire optimization problem depends only on the dot products of pairs of input vectors. That will come in handy. Now let's set this aside and substitute w into the decision rule.
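Recall the decision rule: an unknown vector u is classified as positive if $w \cdot u + b \ge 0$. Substituting $w = \sum_i \alpha_i y_i x_i$, it becomes:

$$
\sum_i \alpha_i y_i \,(x_i \cdot u) + b \;\ge\; 0
\quad\Longrightarrow\quad
u \text{ is classified as } +
$$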


Clear? Ok, let's look closely at the above equation. Do you see something interesting? Almost forming a pattern? The decision rule also depends on the dot products of vector pairs. So both the optimization problem and the decision rule depend only on dot products of input vectors.
We have everything to train and test the linear SVM. Here are the steps to do it.
- In training, you find the αi and b that define the hyper-plane that best separates your data points. This is nothing but substituting the xi (and yi) values into the optimization equation above and solving for the αi.
- From training, you get your support vectors: the xi with non-zero αi values.
- You take the αi, yi, xi, and b values of the support vectors and substitute them into the decision rule above.
- In inference, you substitute the unknown vector u; if the decision function is greater than or equal to 0, the prediction is positive, otherwise it is negative. A minimal sketch of these steps follows below.
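Here is a minimal sketch of the training and inference steps, assuming scikit-learn is available (the toy data and the large C value, used to approximate a hard margin, are only for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative only)
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 8.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# Training: solve the dual problem for the alphas and b.
# A very large C approximates the hard-margin SVM discussed here.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

support_vectors = clf.support_vectors_   # the x_i with non-zero alpha_i
alpha_times_y = clf.dual_coef_           # the products alpha_i * y_i
b = clf.intercept_

# Inference: sign of  sum_i alpha_i y_i (x_i . u) + b
u = np.array([[4.0, 4.0]])
decision = (alpha_times_y @ (support_vectors @ u.T)) + b
print(decision)                      # manual decision value
print(clf.decision_function(u))      # should match the value above
```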
The linear SVM classifier is blessed with a convex optimization space, i.e., it can never get stuck in a local optimum (unlike neural networks). But so far it only works with linearly separable data. How do we make it work on non-linear data?
KERNEL TRICK
The intuition behind the kernel trick is that when non-linearly separable data is transformed into a higher-dimensional space, it can become linearly separable. This post explains the concept better with a simple example.
Ok. Alright. Normally, this is hard. I mean, we need to come up with a function that, when applied to the data points, transforms them into a new space with more dimensions, where the relationships between the data points stay intact but they become linearly separable. That is computationally expensive and a hard problem. But we saw something interesting while optimizing the linear SVM.
We saw that both the optimization problem and the decision rule depend only on the dot products of input vectors. That is, if we can come up with a function that, given two input vectors, returns the dot product of their transformed versions, that would be enough.
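In other words, we are looking for a function K that gives us the dot product in the transformed space directly:

$$
K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)
$$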

The K() in the above equation is called the kernel function. We don't need to know ϕ(x); we only need to know K(), and that is enough for the transformation. This makes SVMs a lot easier to solve.
A few popular kernel functions
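In their usual forms (d is the polynomial degree and γ controls the width of the RBF):

$$
\begin{aligned}
\text{Linear:} \quad & K(x_i, x_j) = x_i \cdot x_j \\
\text{Polynomial:} \quad & K(x_i, x_j) = (x_i \cdot x_j + 1)^d \\
\text{RBF (Gaussian):} \quad & K(x_i, x_j) = \exp\!\big(-\gamma\, \|x_i - x_j\|^2\big)
\end{aligned}
$$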

The RBF kernel is the most popular in traditional machine learning, but it is rarely used in text classification (the sparse data leads to overfitting and instability). It still leverages the power of the dot product, since the squared Euclidean distance can be written in terms of dot products.
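That last point is just the expansion

$$
\|x_i - x_j\|^2 = x_i \cdot x_i - 2\,(x_i \cdot x_j) + x_j \cdot x_j.
$$

And here is the kernel trick in action, a small sketch assuming scikit-learn (the concentric-circles dataset and the γ value are only for illustration):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in 2D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)  # gamma = RBF width parameter

print("linear accuracy:", linear_svm.score(X, y))  # not much better than chance
print("rbf accuracy:", rbf_svm.score(X, y))        # near-perfect on this toy data
```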
That's it. To recap, SVMs are:
- Maximum margin classifiers: they find the hyper-plane that best separates the data points.
- Computationally efficient: the decision function depends only on the support vectors (the vectors lying on the gutters).
- Convex: the optimization space has no local-optimum plague.
- Able to handle non-linear data too, thanks to the kernel trick.
So far so good. But this doesn't make SVMs the holy grail. They too have issues that are not easy to solve:
- Overfitting: SVMs can easily overfit, especially when using complex kernels (like RBF).
- SVMs are inherently binary classifiers. For multi-class classification, you have to change the training method (e.g., one-vs-rest or one-vs-one).
- Optimizing an SVM is a quadratic programming problem, which makes it inefficient to scale to large datasets.
- As mentioned above, when the data is sparse, the kernel values shrink and lead to overfitting and instability.
- Like any other ML algorithm, choosing the right kernel and tuning its parameters is hard.
That's SVMs. I hope you now understand how they work and why they are so popular. If you have any questions about SVMs or ML in general, you can book a free call here: https://topmate.io/abhiram_kandiyana/920032
Next week, we will see how SVMs can be used for text classification with Python programs. Thank you.