Basics: K-Nearest Neighbors


Objective: Given some “Classified Data”, train a model to classify data points into their correct category.

Source: Udemy | Python for Data Science and Machine Learning Bootcamp
Data used in the analysis below: link

#import necessary libraries for analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#set this to view plots in jupyter notebook
%matplotlib inline
sns.set_style('whitegrid')
knn_data = pd.read_csv('KNN_Project_Data') #read data into dataframe
knn_data.head(3) #view some data entries
XVPM GWYH TRAT TLLZ IGGA HYKR EDFS GUUB MGJM JHZC TARGET CLASS
0 1636.670614 817.988525 2565.995189 358.347163 550.417491 1618.870897 2147.641254 330.727893 1494.878631 845.136088 0
1 1013.402760 577.587332 2644.141273 280.428203 1161.873391 2084.107872 853.404981 447.157619 1193.032521 861.081809 1
2 1300.035501 820.518697 2025.854469 525.562292 922.206261 2552.355407 818.676686 845.491492 1968.367513 1647.186291 1

Since this data is artificial and the feature names are anonymized, we plot all the features against each other to get a better sense of the data.

#pair plot of the entire data
sns.pairplot(data=knn_data,hue='TARGET CLASS')

Pair plot of all features

Standardize the variables

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(knn_data.drop('TARGET CLASS',axis=1))
StandardScaler()
scaled_version = scaler.transform(knn_data.drop('TARGET CLASS',axis=1)) #scale the features
#convert the scaled array into a DataFrame
knn_data_scaled = pd.DataFrame(scaled_version,columns=knn_data.columns[:-1])
knn_data_scaled.head() #check the head to see if scaling worked
XVPM GWYH TRAT TLLZ IGGA HYKR EDFS GUUB MGJM JHZC
0 1.568522 -0.443435 1.619808 -0.958255 -1.128481 0.138336 0.980493 -0.932794 1.008313 -1.069627
1 -0.112376 -1.056574 1.741918 -1.504220 0.640009 1.081552 -1.182663 -0.461864 0.258321 -1.041546
2 0.660647 -0.436981 0.775793 0.213394 -0.053171 2.030872 -1.240707 1.149298 2.184784 0.342811
3 0.011533 0.191324 -1.433473 -0.100053 -1.507223 -1.753632 -1.183561 -0.888557 0.162310 -0.002793
4 -0.099059 0.820815 -0.904346 1.609015 -0.282065 -0.365099 -1.095644 0.391419 -1.365603 0.787762

We can see that the data is now on a standard scale: each feature has roughly zero mean and unit variance, with comparable values instead of the widely varying magnitudes in the initial dataset.
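A quick sanity check (an addition to the original steps): after scaling, every column's mean should be roughly 0 and its standard deviation roughly 1.

#means should be ~0 and standard deviations ~1 after scaling
knn_data_scaled.describe().loc[['mean','std']].round(2)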

Split the data into train and test

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(knn_data_scaled,knn_data['TARGET CLASS'],test_size=0.3,random_state=101)
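A quick shape check (not in the original) confirms the 70/30 split; with 1,000 rows this leaves 300 test samples, matching the support shown in the reports below.

print(X_train.shape, X_test.shape) #expect (700, 10) and (300, 10)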

Apply KNN to the data

from sklearn.neighbors import KNeighborsClassifier
KNN = KNeighborsClassifier(n_neighbors=1)
KNN.fit(X_train,y_train)
KNeighborsClassifier(n_neighbors=1)
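With n_neighbors=1 the classifier simply assigns each point the label of its single closest training sample. A minimal sketch of that lookup done by hand with NumPy (illustrative only, assuming X_train and y_train from the split above):

#distance from the first test point to every training point (Euclidean, KNN's default metric)
distances = np.sqrt(((X_train.values - X_test.iloc[0].values)**2).sum(axis=1))
nearest = distances.argmin() #position of the closest training sample
print(y_train.iloc[nearest]) #its label is the K=1 prediction for this point
print(KNN.predict(X_test.iloc[[0]])) #matches the classifier's prediction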

Predictions and Evaluations

predictions = KNN.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test,predictions))
[[109  43]
 [ 41 107]]
print(classification_report(y_test,predictions))
              precision    recall  f1-score   support

           0       0.73      0.72      0.72       152
           1       0.71      0.72      0.72       148

    accuracy                           0.72       300
   macro avg       0.72      0.72      0.72       300
weighted avg       0.72      0.72      0.72       300
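As a quick check (not in the original output), the 0.72 accuracy follows directly from the confusion matrix: the diagonal entries are the correct predictions.

(109 + 107) / 300 #correct predictions / total test samples = 0.72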

Choosing a K value for better accuracy

We will use the elbow method to pick a better K value: train the model for each K from 1 to 39, record the test error rate, and look for where the error flattens out.

error = []
for i in range(1,40):
    KNN = KNeighborsClassifier(n_neighbors=i)
    KNN.fit(X_train,y_train)
    pred_i = KNN.predict(X_test)
    error.append(np.mean(y_test != pred_i))
#view the error list
print(error)
[0.28, 0.29, 0.21666666666666667, 0.22, 0.20666666666666667, 0.21, 0.18333333333333332, 0.19, 0.19, 0.17666666666666667, 0.18333333333333332, 0.18333333333333332, 0.18333333333333332, 0.18, 0.18, 0.18, 0.17, 0.17333333333333334, 0.17666666666666667, 0.18333333333333332, 0.17666666666666667, 0.18333333333333332, 0.16666666666666666, 0.18, 0.16666666666666666, 0.17, 0.16666666666666666, 0.17333333333333334, 0.16666666666666666, 0.17333333333333334, 0.16, 0.16666666666666666, 0.17333333333333334, 0.17333333333333334, 0.17, 0.16666666666666666, 0.16, 0.16333333333333333, 0.16]
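Rather than eyeballing the printed list, we can also pull out the lowest-error K programmatically (a small addition to the original walkthrough):

best_k = int(np.argmin(error)) + 1 #error[i] holds the error for K = i + 1
print(best_k, error[best_k - 1]) #K = 31 is the first to hit the minimum of 0.16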
#plot the error vs K values to better view the result
plt.figure(figsize=(16,7))
plt.plot(range(1,40),error,linestyle='--',color='blue',marker='o',markerfacecolor='red',markersize=10)
plt.xlabel('Values of K')
plt.ylabel('Mean Error')
plt.title('Error rate vs K Value')

Error rate vs K value
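The loop above scores each K on the test set, which slightly leaks test information into the choice of K. A common alternative (a sketch, not part of the original walkthrough) is to pick K by cross-validation on the training set only:

from sklearn.model_selection import cross_val_score
cv_error = []
for i in range(1,40):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=i),
                             X_train, y_train, cv=5) #5-fold CV on training data only
    cv_error.append(1 - scores.mean())
print(int(np.argmin(cv_error)) + 1) #K with the lowest cross-validated error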

Retraining the model with a better K value as per the plot above

Choosing K = 31, the first value where the error rate bottoms out at 0.16

KNN = KNeighborsClassifier(n_neighbors=31)
KNN.fit(X_train,y_train)
predictions = KNN.predict(X_test)
print(confusion_matrix(y_test,predictions))
[[123  29]
 [ 19 129]]
print(classification_report(y_test,predictions))
              precision    recall  f1-score   support

           0       0.87      0.81      0.84       152
           1       0.82      0.87      0.84       148

    accuracy                           0.84       300
   macro avg       0.84      0.84      0.84       300
weighted avg       0.84      0.84      0.84       300

Result: We were able to train our model to classify the data into the correct TARGET CLASS with an accuracy of 84% (K = 31), up from 72% with K = 1.
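To use the trained model on genuinely new data, the new observation must pass through the same fitted scaler before prediction. A minimal sketch (illustrative only; new_point here just reuses the first raw row as a stand-in for unseen data):

#any new observation must be scaled with the scaler fitted above
new_point = knn_data.drop('TARGET CLASS',axis=1).iloc[[0]] #stand-in for unseen data
new_point_scaled = scaler.transform(new_point)
print(KNN.predict(new_point_scaled)) #predicted TARGET CLASS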