SUPPORT VECTOR MACHINES (Part 1)

A maximum margin classifier that finds the optimal hyperplane to separate the data

The classifiers we have seen so far, such as the Naive Bayes Classifier (NBC) and the K-Nearest Neighbors (KNN) classifier, have one assumption we haven’t talked about in detail. They assume that the data is linearly separable.

Linear separability means that the data can be separated by a straight line (or, in higher dimensions, by a flat hyperplane). Let me illustrate this with a very simple example.

The above is a toy classification problem. Your goal is to separate the bottle caps from the girl with the red hat (let’s call her Lily). Let’s assume they sit in a 2D space where the table’s height is the y-axis and its width is the x-axis.

Now, when I put a pen in between them, you see that the bottle caps are on one side and Lily is on the other. So, these are linearly separable.

Let’s change the positions of the toys.

Now, can you separate them by putting a pen in between? I don’t think so. This data is not linearly separable, and to separate them we need a wire or some other non-linear object.

The classifier that can separate this kind of data is called a non-linear classifier.

In this example, these toys are in 2D. What if we increase the dimensions? What if we add one more dimension that depends on each toy’s distance from the center, making them 3D without changing anything else?

We can do it easily by lifting up Lily slightly.

Now, can you separate the caps and Lily using a linear plane, like a sheet of paper? Yes!

And voilà, these are linearly separable now. What happened here?

Increasing the dimensions transformed data that was not linearly separable into data that is. Because there are only 3 points here, it was very easy for me to find a way to separate them using a sheet of paper. But let’s say you have 100 points (real datasets have thousands). How do you find a way to separate them? Using Support Vector Machines (SVMs) and their kernel trick.
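To make the lifting trick concrete, here is a minimal sketch, assuming NumPy and scikit-learn are available. The coordinates are made up by me: I put Lily at the center and the caps around her, so in this version the caps are the ones that get lifted rather than Lily, but the effect is the same. Adding a third feature based on the squared distance from the center lets a plain linear SVM separate the two classes with a flat plane.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (my own made-up layout): Lily at the center, caps around her.
# No straight line in 2D can put the center point on one side
# and all of the surrounding points on the other.
caps = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
lily = np.array([[0.0, 0.0]])
X = np.vstack([caps, lily])
y = np.array([-1, -1, -1, -1, +1])

# Lift to 3D: add a feature equal to each point's squared distance
# from the center. Lily stays low, the caps get lifted up.
z = (X ** 2).sum(axis=1, keepdims=True)
X_3d = np.hstack([X, z])

# A linear SVM, i.e., a flat plane in 3D, now separates them perfectly.
clf = SVC(kernel="linear", C=1e6).fit(X_3d, y)
print(clf.predict(X_3d))  # [-1 -1 -1 -1  1]
```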

**Drumrolls**

LINEAR SUPPORT VECTOR MACHINES

Let’s first understand what SVMs are and how they work. SVMs are maximum margin classifiers, i.e., they don’t just separate the data using a hyperplane (like the paper in the example above); they ensure that there is a gap on either side of it, with the hyperplane sitting exactly in the middle of that gap.

The dotted line in the middle is the decision boundary. The solid lines on either side are called gutters. The SVM’s goal is to maximize the width of this “road”.

In other words, it tries to find the optimal decision boundary. The decision function doesn’t depend on the entire dataset; it depends only on the subset of points closest to the decision boundary. These closest points, the ones that end up sitting on the gutters, are called support vectors.

I mean, look at it this way: if there are no points anywhere close to the decision boundary, then when you test the classifier with a new case, the chances of getting an uncertain classification decision are slim.

Before we go ahead, there are a few things you need to refresh your mind on. Below are the concepts that I assume you understand going forward.

  1. What are dimensions? Like 2D, 3D?

  2. What are vectors? What is the dimension of a vector?

  3. What are hyperplanes?

  4. What is the dot product of vectors? What does it calculate?

  5. What is the Lagrange multiplier used for? And how does it work?

Let’s define some variables that I will use to explain the SVM concept.

In the above plot:

  • xi: all the points, both positive and negative

  • yi: class, +1 for positive and -1 for negative points

  • the dotted line in the middle is the decision boundary

  • the two solid lines on either side of the decision boundary are called the gutters. Our goal is to find a decision boundary such that there is no point in between the gutters.

  • w is a vector of some unknown length, perpendicular to the decision boundary.

NOTE: I am still figuring out how to add arrows above the letters to represent them as vectors. Please remember that x, w, and u are all vectors.

Let’s say we have an unknown vector u that we want to classify, i.e., we want to find out whether it’s on the right side of the decision boundary or the left.

One way to find out is to project the vector u onto w, which is perpendicular to the street. This gives you how far u reaches in the direction of w, and based on that distance we can decide whether it is on the left or the right side of the street. If the distance is large enough, it’s on the right side; otherwise it’s on the left.

Now, how do you project one vector onto another? Using the dot product.
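As a quick NumPy refresher (the vectors here are arbitrary made-up numbers), the scalar projection of u onto the direction of w is just their dot product divided by the length of w:

```python
import numpy as np

w = np.array([2.0, 1.0])   # some vector perpendicular to the boundary (made up)
u = np.array([3.0, 4.0])   # the unknown point we want to classify (made up)

# Scalar projection of u onto the direction of w:
# how far u reaches along w, regardless of w's own length.
projection_length = np.dot(u, w) / np.linalg.norm(w)
print(projection_length)   # ~4.47
```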

w.u >= c then + 

if the dot product of w and u is greater than or equal to a constant c, then u belongs to the positive samples; otherwise it is negative. Fair?

w.u + b >= 0 then + (where b = -c)

Now, let’s generalize this by folding the constant into a bias term so the inequality compares against 0. The equation above is called the decision rule.
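In code, the decision rule is nothing more than a thresholded dot product. Here is a minimal sketch with hypothetical values for w and b (we haven’t learned them yet; that is the whole point of what follows):

```python
import numpy as np

def decision_rule(u, w, b):
    """Return +1 if u falls on the positive side of w.u + b >= 0, else -1."""
    return +1 if np.dot(w, u) + b >= 0 else -1

w = np.array([2.0, 1.0])   # hypothetical normal vector
b = -5.0                   # hypothetical bias
print(decision_rule(np.array([3.0, 4.0]), w, b))   # 2*3 + 1*4 - 5 = 5  -> +1
print(decision_rule(np.array([1.0, 1.0]), w, b))   # 2*1 + 1*1 - 5 = -2 -> -1
```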

But we have one equation and two unknowns, w and b. So, let’s try to constrain the situation so we can approximate these values.

Let’s define a condition that all the positive samples are at least 1 unit to the right of the decision boundary and all the negative samples are at least 1 unit to the left, i.e.,

w.x+ + b >= 1

w.x- + b <= -1

Right? So, we are saying that there are no points in between the gutters right now. And we are doing that because we want that to be true. If we want to find a way to get something, we first assume the condition is true and work backward to find the solution. You cannot always do that in real life, but in the math world, you can.

Let’s do some magic to merge these two equations into one. We introduce another variable y, with the condition that it is +1 for positive samples and -1 for negative samples, and multiply both sides of each equation above by it, which gives us

y(w.x + b) >= 1 where y = +1 for all x+ and -1 for all x-

y(w.x + b) - 1 >= 0 where y = +1 for all x+ and -1 for all x-

y(w.x + b) - 1 = 0 for samples which lie on the gutter

For all the samples, the expression above is >= 0, but I just added another constraint: for the subset of points lying on the gutters, it is exactly 0.
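Here is a tiny numerical check of that combined constraint, with a hypothetical boundary (w, b) and hand-picked labeled points: y(w.x + b) - 1 should be exactly 0 for points sitting on the gutters and positive for everything further out.

```python
import numpy as np

w, b = np.array([1.0, 0.0]), 0.0       # hypothetical boundary: the y-axis
X = np.array([[1.0, 2.0],              # positive point sitting on the + gutter
              [3.0, -1.0],             # positive point well past the gutter
              [-1.0, 0.5],             # negative point sitting on the - gutter
              [-2.5, 1.0]])            # negative point well past the gutter
y = np.array([+1, +1, -1, -1])

margins = y * (X @ w + b) - 1
print(margins)   # [0.  2.  0.  1.5] -> all >= 0, zeros are on the gutters
```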

Ok, now let’s keep these aside. Remember that our goal is to find the optimal decision boundary and gutters such that there are no points in between them. For that, we need a way to express the distance between the gutters, i.e., the width of the road.

Let’s suppose we have two vectors x+ and x-, which lie on the positive-side and negative-side gutters respectively. Their difference points roughly across the road, but its length is not the width; the issue is direction. We can fix that by projecting this difference onto a unit vector that is perpendicular to the gutters. Wait a minute: we already have w, which is perpendicular to the decision boundary and hence to the gutters. Great. But w is not unit length, so first we need to scale it to unit length and then project x+ - x- onto it. How do you do that? Using the dot product.

unit w = w/||w|| where ||w|| is its length.

width = (x+ - x-) . w/||w||

= x+ . w/||w|| - x- . w/||w||

= (x+ . w - x- . w) / ||w||

Ok, now we have an equation for the width, but how do we solve it? Hmm. Let’s look back at the previous equations and see if we can substitute any of them here to simplify it. Remember that x+ and x- lie on the gutters, and we have defined an equation for points lying on the gutters.

y(w.x + b) - 1 = 0 for samples which lie on the gutter, where y = +1 for all x+ and -1 for all x-

If you substitute the y values in here, we get

w.x+ + b = 1 and w.x- + b = -1

Substituting these into the width equation gives us

(1 - b - (-1 - b)) / ||w||

= 2 / ||w||

=> width = 2 / ||w||
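You can verify this relationship with an off-the-shelf SVM. A minimal sketch, assuming scikit-learn is available (the toy points are mine): with a large C, which approximates a hard margin, the support vectors sit at decision values of roughly ±1, and 2 / ||w|| computed from the learned coefficients is the width of the road between them.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable blobs (made-up points).
X = np.array([[2.0, 2.0], [3.0, 3.0], [3.0, 1.5],
              [-1.0, -1.0], [-2.0, 0.0], [-1.5, -2.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
w = clf.coef_[0]

print(2.0 / np.linalg.norm(w))                       # width of the "road"
print(clf.decision_function(clf.support_vectors_))   # ~ +1 / -1 on the gutters
```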

Our goal is to maximize this width so that the positive and negative samples are separated as widely as possible.

So, our goal is to find

max 2 / ||w||

= max 1 / ||w|| (removing constant)

= min ||w|| (inverse of max is min)

= min ||w||^2 / 2 (squaring and halving for convenience; it doesn’t change where the minimum is)

Ok. We have an equation to minimize and constraints we must obey. Let’s summarize it here.

find w and b such that

d = ||w||^2 / 2 is minimized

and yi(w.xi + b) - 1 >= 0 for all (xi, yi), where yi = +1 for all x+ and -1 for all x-

This is an optimization problem: the objective is quadratic and the constraints are linear. To deal with these kinds of problems we have Lagrange multipliers, which we will discuss next week.
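Next week’s Lagrangian machinery is what makes this tractable for real datasets and unlocks kernels, but to demystify the problem itself, here is a minimal sketch that hands the primal formulation straight to a general-purpose constrained solver (assuming SciPy; the toy data and starting point are made up):

```python
import numpy as np
from scipy.optimize import minimize

# Made-up, linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])

def objective(params):
    w = params[:2]
    return 0.5 * np.dot(w, w)          # ||w||^2 / 2

def constraint(params):
    w, b = params[:2], params[2]
    return y * (X @ w + b) - 1         # must be >= 0 for every sample

result = minimize(objective,
                  x0=np.array([1.0, 1.0, 0.0]),   # a feasible starting guess
                  constraints=[{"type": "ineq", "fun": constraint}],
                  method="SLSQP")

w, b = result.x[:2], result.x[2]
print(w, b, 2.0 / np.linalg.norm(w))   # learned boundary and margin width
```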

Next week we will see how to optimize this equation so that it is easy to solve and how to use the “kernel” trick to solve non-linear problems.

I learned ML online because my university didn’t have any experts to teach it. Yes, I am proud that I did it on my own, but it took me a year to learn the fundamentals. It shouldn’t have.

But you don’t have to, ‘cause you got me. Book a call here and I will answer your questions on ML concepts, math, and programming.