Math Behind KNN

What is KNN and how it works:

Let’s start by setting some definitions and notation. We will take x to denote a feature and y to denote the target.

KNN falls under the supervised learning algorithms. This means that we have a dataset of labeled training observations (x, y) and want to capture the relationship between x and y. Our goal is to learn a function h: X→Y so that, given an unseen observation x, h(x) can confidently predict the corresponding output y.

Working

First, we will talk about how the KNN classification algorithm works. In a classification problem, the K-nearest neighbor algorithm says that, for a given value of K, it finds the K nearest neighbors of an unseen data point and then assigns that point the class that has the highest number of data points among those K neighbors.

For the distance metric, we will use the Euclidean distance.

Finally, the input x is assigned to the class with the largest probability, where the probability of a class is estimated as the fraction of the K neighbors that belong to it.
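The classification steps described above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function names euclidean and knn_classify and the toy data are made up for this example, and it assumes K is no larger than the number of training points.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(train_X, train_y, query, k):
    """Return (predicted_class, class_probabilities) for one query point."""
    # Rank training indices by distance to the query and keep the K closest.
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], query))
    nearest = order[:k]
    votes = Counter(train_y[i] for i in nearest)
    # Probability of a class = fraction of the K neighbors that belong to it.
    probs = {label: count / k for label, count in votes.items()}
    # Ties are broken by whichever class was counted first.
    return max(probs, key=probs.get), probs

# Toy data (hypothetical): two small clusters in 2-D.
X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 5.5)]
y = ["a", "a", "a", "b", "b", "b"]

label, probs = knn_classify(X, y, (2.0, 2.0), k=3)
print(label, probs)
```

Sorting every training point for each query is the simplest approach; it is fine for small datasets but becomes slow as the dataset grows.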

For regression the technique is the same, except that instead of the neighbors' classes we take their target values and predict the target for the unseen data point by taking their mean, median or any other suitable aggregate.
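A rough sketch of the regression variant, assuming the mean is used as the aggregate. The name knn_regress and the toy data are hypothetical; math.dist requires Python 3.8 or later.

```python
import math

def knn_regress(train_X, train_y, query, k):
    """Predict a numeric target as the mean target of the K nearest points."""
    # Pair each distance with its target value, then sort by distance.
    ranked = sorted((math.dist(x, query), target)
                    for x, target in zip(train_X, train_y))
    neighbors = [target for _, target in ranked[:k]]
    return sum(neighbors) / len(neighbors)

# Toy 1-D data (hypothetical): the target simply equals the feature.
X = [(1.0,), (2.0,), (3.0,), (10.0,)]
y = [1.0, 2.0, 3.0, 10.0]

prediction = knn_regress(X, y, (2.2,), k=3)
print(prediction)
```

Swapping the mean for a median (statistics.median) only changes the last line of the function.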

Ideal Value for K

Now you are most probably wondering how to decide the value of K and how it affects your classifier. Well, as in most machine learning algorithms, the K in KNN is a hyperparameter that you, as a data scientist, must choose in order to get the most suitable fit for the data set.

When K is small, we are restraining the region of a given prediction and forcing our classifier to be “more blind” to the overall distribution. A small value of K provides the most flexible fit, which will have low bias but high variance. Graphically, our decision boundary will be more jagged. On the other hand, a higher K averages more voters in each prediction and hence is more resilient to outliers. Larger values of K will have smoother decision boundaries, which means lower variance but increased bias.
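One way to see this trade-off on your own data is to score several values of K with leave-one-out validation: each point is predicted from all the others. The sketch below is only illustrative; the helper names predict and loo_accuracy and the tiny dataset (with one deliberately mislabeled point) are assumptions made for this example.

```python
import math
from collections import Counter

def predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += predict(rest, x, k) == label
    return hits / len(data)

# Hypothetical 1-D data: two groups plus one outlier labeled "b" at x = 2.5.
data = [((1.0,), "a"), ((2.0,), "a"), ((3.0,), "a"), ((2.5,), "b"),
        ((7.0,), "b"), ((8.0,), "b"), ((9.0,), "b")]

for k in (1, 3, 5):
    print(k, round(loo_accuracy(data, k), 3))
```

On this toy data K = 1 lets the single outlier mislead its neighbors, while K = 3 votes it down, so K = 3 scores higher. Real datasets will not behave this neatly, which is why K is usually tuned empirically.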

Improvements

  • A simple way to counter skewed class distributions is to implement weighted voting.
  • Changing the distance metric (e.g. Hamming distance for text classification).
  • Dimensionality reduction techniques like PCA can be applied before KNN to help make the distance metric more meaningful.
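As an illustration of the first point, here is a minimal sketch of weighted voting in which each neighbor's vote is weighted by the inverse of its distance. Inverse-distance weighting is just one common choice and is an assumption here, as are the name weighted_knn_classify, the eps parameter and the toy data.

```python
import math
from collections import defaultdict

def weighted_knn_classify(train, query, k, eps=1e-9):
    """Classify `query` with votes weighted by 1 / distance."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    scores = defaultdict(float)
    for x, label in nearest:
        # eps avoids division by zero when the query coincides with a point.
        scores[label] += 1.0 / (math.dist(x, query) + eps)
    return max(scores, key=scores.get)

# Hypothetical skewed data: "common" outnumbers "rare" four to one.
train = [((0.0,), "rare"), ((1.0,), "common"), ((1.2,), "common"),
         ((1.4,), "common"), ((1.6,), "common")]

result = weighted_knn_classify(train, (0.3,), k=3)
print(result)
```

With K = 3 the query at 0.3 has one "rare" and two "common" neighbors, so a plain majority vote would pick "common"; the closer "rare" point wins once votes are weighted by distance.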

KNN stands for K-nearest neighbours. It is a supervised learning algorithm mostly used to classify a data point on the basis of how its neighbours are classified. KNN stores all available cases and classifies new cases based on a similarity measure. The K in KNN is a parameter that refers to the number of nearest neighbours included in the majority voting process.

How do we choose K?

A common rule of thumb is K = sqrt(n), where n is the total number of data points. If the resulting value is even, make it odd by adding or subtracting 1, which helps avoid tied votes.
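The rule of thumb above is easy to express in code. In this small sketch the function name rule_of_thumb_k is made up, and the choice to round to the nearest integer and to add 1 (rather than subtract 1) for even values is an assumption.

```python
import math

def rule_of_thumb_k(n):
    """Return an odd K close to sqrt(n) for a dataset of n points."""
    k = max(1, round(math.sqrt(n)))
    if k % 2 == 0:
        k += 1  # make K odd; subtracting 1 would be equally valid
    return k

for n in (10, 100, 1000):
    print(n, rule_of_thumb_k(n))
```

This only gives a starting point; the value is usually refined by checking prediction accuracy for a few nearby values of K.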

When to use KNN?

We can use KNN when the dataset is labelled, noise-free and small, because KNN is a “lazy learner”: it defers all computation to prediction time. Let’s understand the KNN algorithm with the help of an example.

NAME       AGE   GENDER   CLASS OF SPORTS
Ajay       32    0        Football
Mark       40    0        Neither
Sara       16    1        Cricket
Zaira      34    1        Cricket
Sachin     55    0        Neither
Rahul      40    0        Cricket
Pooja      20    1        Neither
Smith      15    0        Cricket
Laxmi      55    1        Football
Michael    15    0        Football

Here male is denoted by the numeric value 0 and female by 1. Let’s find which class Angelina, whose age is 5 and gender is 1, falls into, using K = 3. We use the formula

  d=√((x2-x1)²+(y2-y1)²)

to find the distance between any two points.

So let’s find the distance between Ajay and Angelina using the formula

d=√((age2-age1)²+(gender2-gender1)²)

d=√((5-32)²+(1-0)²)

d=√(729+1)

d≈27.02

Similarly, we compute the distances to all the other people one by one.

Distance between Angelina and    Distance
Ajay                             27.02
Mark                             35.01
Sara                             11.00
Zaira                            29.00
Sachin                           50.01
Rahul                            35.01
Pooja                            15.00
Smith                            10.05
Laxmi                            50.00
Michael                          10.05

Since K is 3 for Angelina, we pick the three smallest distances: 10.05, 10.05 and 11.00. The three people closest to Angelina are therefore Smith, Michael and Sara.

Smith      10.05    Cricket
Michael    10.05    Football
Sara       11.00    Cricket

Two of the three nearest neighbours like cricket, so according to the KNN algorithm Angelina will be in the class of people who like cricket. This is how the KNN algorithm works.
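The whole worked example can be reproduced with a short script. This is a minimal sketch that takes the ages, genders and classes from the table of people above and assumes Angelina is encoded as (age 5, gender 1); the variable names are made up for illustration.

```python
import math
from collections import Counter

# (name, age, gender, class of sports), copied from the table of people.
people = [
    ("Ajay", 32, 0, "Football"), ("Mark", 40, 0, "Neither"),
    ("Sara", 16, 1, "Cricket"), ("Zaira", 34, 1, "Cricket"),
    ("Sachin", 55, 0, "Neither"), ("Rahul", 40, 0, "Cricket"),
    ("Pooja", 20, 1, "Neither"), ("Smith", 15, 0, "Cricket"),
    ("Laxmi", 55, 1, "Football"), ("Michael", 15, 0, "Football"),
]
angelina = (5, 1)  # (age, gender)
k = 3

def distance(person):
    """Euclidean distance between a person and Angelina on (age, gender)."""
    _, age, gender, _ = person
    return math.dist((age, gender), angelina)

# Rank everyone by distance and let the K closest people vote.
neighbors = sorted(people, key=distance)[:k]
votes = Counter(sport for _, _, _, sport in neighbors)
prediction = votes.most_common(1)[0][0]

for person in neighbors:
    print(person[0], round(distance(person), 2), person[3])
print("Predicted class:", prediction)
```

Running it prints the three nearest neighbours with their distances, followed by the predicted class.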
