K-Nearest-Neighbor (KNN) Classifier
The closest we can get to an optimal linear model

Last week we saw the Naive Bayes Classifier (NBC), which uses Bayes’ theorem to filter spam emails from your inbox.
Though it is an efficient model, its assumptions rarely hold in practice, so it needs large amounts of training data to compensate.
Google has that data, but you and I don’t. So we need something that doesn’t make any assumptions.
The K-Nearest-Neighbor (KNN) Classifier is a non-parametric vector space model (or, arguably, not a model at all) used heavily for character and image recognition across domains. In NLP, KNN is still used in recommendation systems.
Before we build a KNN classifier, we need to transform our text into vectors. KNN works by mapping data points into a vector space and then operating on them.
In NLP, every sentence (or paragraph) is a one-dimensional vector where each token is an element. But machines cannot understand language the way we do; they only understand numbers. So we need to represent each token in a sentence with a number. This process is called vectorization.
You can be more granular and represent each word with its own vector, which gives you embeddings, but that is optional for now.
"Machine learning is fun"
=> <"Machine", "learning", "is", "fun"> (tokenization)
=> [24,43,64,34] (vectorization)
=> [[0.5,0.3,0.2],[0.0,0.1,0.4]....] (embeddings)
This is straightforward. Great! But how do you vectorize? There are many ways to do it, and I posted an overview of them on LinkedIn.
The easiest way is one-hot encoding: each sentence becomes a vector of vocabulary length, with a 1 at the position of every word that appears in the sentence and a 0 everywhere else.
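As a quick illustration, here is a minimal one-hot sketch in plain Python (the whitespace split and toy sentences are just for this example; the real pipeline below uses NLTK and tf-idf):

sentences = ["movie is great", "movie is bad"]
tokens = [s.split() for s in sentences]                 # naive whitespace tokenization
vocab = sorted({word for sent in tokens for word in sent})

# 1 if the vocabulary word appears in the sentence, 0 otherwise
one_hot = [[1 if word in sent else 0 for word in vocab] for sent in tokens]

print(vocab)    # ['bad', 'great', 'is', 'movie']
print(one_hot)  # [[0, 1, 1, 1], [1, 0, 1, 1]]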
Let’s go one step further and build tf-idf vectors. These vectors capture how rare a token is across the dataset along with how frequent it is within the sentence.
TF-IDF vectors
Term frequency-inverse document frequency multiplies the frequency of the token within the sentence (tf) by the inverse of the number of sentences that contain the token (idf). In practice the idf is smoothed and log-scaled, as in the code below.
Let’s build tf-idf vectors for a toy dataset of two sentences in basic Python.
The first step is to tokenize the sentences. We will use the NLTK package for that.
from nltk.tokenize import word_tokenize
# word_tokenize needs NLTK's "punkt" data; run nltk.download('punkt') once if it is missing

data = ["movie is great. I loved the performance.", "movie is bad. it was boring. one time watch."]

tokenized_sents = []
for sentence in data:
    tokenized_sents.append(word_tokenize(sentence))
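If you print the result, you should see something like this; NLTK keeps punctuation marks as separate tokens:

print(tokenized_sents[0])
# ['movie', 'is', 'great', '.', 'I', 'loved', 'the', 'performance', '.']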
The next step is calculating the term frequency (tf) of each word in each sentence. We use one dictionary per sentence: the key is the word and the value is how many times that word appears in that sentence.
tf = []
for sentence in tokenized_sents:
    sentence_tf = {}
    for word in sentence:
        if word not in sentence_tf:
            sentence_tf[word] = 1
        else:
            sentence_tf[word] += 1
    tf.append(sentence_tf)
Now, let’s calculate the idf for each word across all sentences. Note that df counts how many sentences contain the word, unlike tf, which is per sentence. We then take its inverse and apply a log to stabilize the values.
import numpy as np

N = len(data)
df = {}
for sentence in tokenized_sents:
    for word in set(sentence):        # count each word at most once per sentence
        if word in df:
            df[word] += 1
        else:
            df[word] = 1

# Smoothed, log-scaled inverse document frequency
idf = {word: np.log(N / (count + 1)) + 1 for word, count in df.items()}
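To make the numbers concrete: with N = 2, the word "movie" appears in both sentences, so its idf is log(2/3) + 1 ≈ 0.59, while "great" appears in only one, giving log(2/2) + 1 = 1.0. Rarer words end up with higher idf, which is exactly what we want.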
We have tf and idf. It’s time to multiply them to get the tf-idf value for each word in each sentence. tfidf has two elements, one per sentence; each element is a dictionary where the key is the word and the value is its tf-idf score.
In practical terms, you would have data from different classes, and the tf-idf value of the same word changes depending on which class you are looking at.
tfidf = []
for tf_sentence in tf:
    tfidf_sentence = {word: tf_word * idf[word] for word, tf_word in tf_sentence.items()}
    tfidf.append(tfidf_sentence)
The last step is to convert these dictionaries into fixed-length vectors, where each position holds the tf-idf score of the corresponding vocabulary word. We also normalize the vectors so that all of them have unit length.
vocab = list(df.keys())
vocab_index = {word: index for index, word in enumerate(vocab)}  # Map each word to its index in vocab

train_vectors = []
for sentence in tfidf:
    vector = np.zeros(len(vocab), dtype=float)   # fresh vector for every sentence
    for word, score in sentence.items():
        vector[vocab_index[word]] = score
    train_vectors.append(vector)
train_vectors = np.array(train_vectors)

# L2-normalize each row so every sentence vector has unit length
norms = np.linalg.norm(train_vectors, axis=1, keepdims=True)
norms[norms == 0] = 1
normalized_train_vectors = train_vectors / norms
You can write the same code in fewer than 5 lines; I expanded each operation so that it is easy to follow.
OK. We have our vectors, which live in a V-dimensional space where V is the vocabulary size. Now, how do you classify them using KNN?
KNN Classifier
The pseudo-code for KNN is surprisingly simple:
1. Choose a value for K.
2. Find the K nearest vectors for your test sentence using a similarity metric.
3. The majority class in the neighborhood is assigned to your vector.
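As a rough sketch of those three steps (the cosine similarity metric is introduced later in this post, and the 2-D vectors and labels here are made up purely for illustration):

import numpy as np
from collections import Counter

def cosine_similarity(a, b):
    # Cosine of the angle between the two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def knn_predict(test_vec, train_vecs, train_labels, k=3):
    # 1. Similarity between the test vector and every training vector
    sims = [cosine_similarity(test_vec, v) for v in train_vecs]
    # 2. Indices of the k most similar neighbors
    neighbor_ids = np.argsort(sims)[-k:]
    # 3. Majority vote over the neighbors' labels
    votes = Counter(train_labels[i] for i in neighbor_ids)
    return votes.most_common(1)[0][0]

train_vecs = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
train_labels = ["positive", "positive", "negative", "negative"]
print(knn_predict(np.array([0.8, 0.3]), train_vecs, train_labels, k=3))  # positive

With the vectors we built earlier, normalized_train_vectors would play the role of train_vecs.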
That’s it. Let’s visualize how and why this works.
Let’s take the problem of sentiment classification. Your goal is to classify a test sentence (blue triangle) into either a positive or negative review.
For this problem, let’s choose K=5, i.e., you will consider the 5 nearest neighbors in the classification vote.
Now, let’s map the training data into a vector space and see where the test sample lands.

KNN for Sentiment Classification
As you can see, the test sample’s neighborhood has 3 positive and 2 negative data points, so it is classified as a positive sample.
Hooray! You have learned the KNN classifier. You can now do sentiment analysis, handwriting recognition, and movie recommendations.
This is the easy part. The hard part is choosing the hyper-parameters.
KNN has two hyper-parameters: the value of K and the similarity metric.
K Value
The K value determines how many neighbors are considered when classifying a data point. For K=3, you would look at the 3 nearest neighbors, find the majority class, and assign that. It is preferred to make K odd so that (at least in binary classification) you always have a majority class.
Low K values make the decision boundary sensitive to noise, leading to overfitting.

In the above example with K=1, the test sample would be assigned the negative class. But if you look closely, the red point next to it is an outlier, so this would be a wrong prediction.
High K values make the decision boundary smooth but might result in underfitting.
Your K should strike a balance between these two extremes. Choosing the K value is an optimization problem: you try different values and pick the one with the best performance.
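For example, here is a minimal sketch of that sweep using scikit-learn (an assumption on my part; the post builds everything from scratch, but sklearn's KNeighborsClassifier supports cosine similarity and random data keeps the example short):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Made-up vectors and labels purely to illustrate the K sweep
X = np.random.rand(100, 20)
y = np.random.randint(0, 2, size=100)

for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    print(f"K={k}: mean CV accuracy {cross_val_score(knn, X, y, cv=5).mean():.3f}")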
Similarity metric
A similarity metric measures how similar two data points are. Since we map points into a vector space, a small distance between two vectors indicates high similarity between the corresponding data points, and vice versa.
Your similarity metric should be
Sensitive to substantial differences
Scalable
Interpretable
Robust
It should not be
Sensitive to outliers
Computationally expensive
Cosine similarity is the most widely used similarity metric in NLP to date (it shows up even in LLM pipelines). It measures the angle between two vectors, i.e., how closely their directions align, regardless of their magnitudes.
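Concretely, it is the dot product of the two vectors divided by the product of their lengths; for the unit-length vectors we built earlier, that reduces to a plain dot product:

import numpy as np

def cosine_similarity(a, b):
    # dot(a, b) / (||a|| * ||b||): 1 = same direction, 0 = orthogonal
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction, twice the magnitude
c = np.array([0.0, 0.0, 3.0])   # orthogonal direction
print(cosine_similarity(a, b))  # ~1.0 -> magnitude does not matter
print(cosine_similarity(a, c))  # 0.0  -> completely different direction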
In 90% of cases, KNN is more accurate than the Naive Bayes Classifier (NBC), which makes it the closest thing to an optimal linear classifier. It also scales well to a large number of classes.
But it also has its issues:
Influence of one class on the entire data
Because KNN’s decision boundary is determined directly by the training points, a slight shift in one class shifts that boundary and can change the classification of many data points. Noise can hurt the model’s performance if it is not handled carefully.
Scores are hard to convert to probabilities
KNN classifies by voting and does not directly give a probability. The vote ratio can be used as a crude probability, but it is not a sophisticated estimate because it does not account for the density of points in the feature space. One way to deal with this is weighted voting, i.e., points closer to the test sample have more influence over the vote.
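A minimal sketch of such weighted voting, using cosine similarity itself as the weight (the function name and weighting scheme are illustrative, not a fixed convention):

import numpy as np
from collections import defaultdict

def weighted_knn_predict(test_vec, train_vecs, train_labels, k=5):
    # Cosine similarity between the test vector and every training vector
    sims = train_vecs @ test_vec / (np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(test_vec))
    neighbor_ids = np.argsort(sims)[-k:]
    # Each neighbor votes with its similarity as the weight
    votes = defaultdict(float)
    for i in neighbor_ids:
        votes[train_labels[i]] += sims[i]
    return max(votes, key=votes.get)

train_vecs = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
train_labels = ["positive", "positive", "negative", "negative"]
print(weighted_knn_predict(np.array([0.8, 0.3]), train_vecs, train_labels, k=3))  # positive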
Expensive at inference.
The training part of KNN is just mapping the data into a vector space; everything else happens at inference. The time complexity of computing the distance between one vector and all the vectors in the training set is O(nd), where n is the training size and d is the vector dimension. We can make lookups more efficient with indexing structures like KD-trees (roughly O(log n) per query in low dimensions), but KNN is still computationally expensive at inference compared to models like linear regression or the Naive Bayes classifier.
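As an illustration (an assumption on my part; the post itself does not use scikit-learn), sklearn's NearestNeighbors can build such an index. Since our vectors are unit length, Euclidean neighbors and cosine neighbors are the same, and KD-trees help mostly when the dimensionality is modest:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up unit-length vectors standing in for normalized tf-idf vectors
X = np.random.rand(1000, 20)
X = X / np.linalg.norm(X, axis=1, keepdims=True)

index = NearestNeighbors(n_neighbors=5, algorithm="kd_tree").fit(X)
distances, neighbor_ids = index.kneighbors(X[:1])  # 5 nearest neighbors of the first vector
print(neighbor_ids)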