Classification is a fundamental task in machine learning that involves categorizing data into predefined classes or categories based on its characteristics. It is applied in various domains such as image recognition, spam detection, product recommendations based on browsing history, or as a support for medical diagnosis.
K-Nearest Neighbors (KNN) is a non-parametric supervised learning classifier that uses proximity to make classifications or predictions about the grouping of individual data points. It assumes that the observations closest to a given data point are the most similar ones in the data set, so we can classify new, unseen points based on the values of the closest existing points.
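Before reaching for a library, the idea fits in a few lines of plain Python: measure the distance from a query point to every labeled point and let the k closest vote. A minimal sketch on made-up points (not the dataset used below):
import math
from collections import Counter

def knn_predict(points, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest points."""
    # pair every training point with its Euclidean distance to the query
    distances = sorted(
        (math.dist(p, query), label)
        for p, label in zip(points, labels))
    # majority vote among the k closest labels
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (2, 1), (8, 9), (9, 8)]
labels = ["a", "a", "b", "b"]
print(knn_predict(points, labels, (2, 2)))  # 'a'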
Let’s take an example of classifying game animals based on their species and population. For this purpose I will use a dataset containing the number of occurrences of selected animal species in each region.
Region_ID;HuntingDistricts;Deer;SikaDeer;FallowDeer;RoeDeer;Mouflons;WildBoars
1;370;23874;;2755;72650;1042;5996
2;259;13132;;3141;51731;209;2687
3;318;13589;;621;63613;;2314
4;206;15730;;1849;43681;;2581
import pandas as pd
import numpy as np
import math
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
from matplotlib import style
style.use('Solarize_Light2')
# Load the dataset
animals = pd.read_csv(
    'Animal_data_2024_large.csv',
    sep=';')
print(animals[:2])
# empty fields are read in as NaN floats; replace them
# with 0 and cast those columns back to integers
for col in animals.select_dtypes(include=['float64']).columns:
    animals[col] = animals[col].fillna(0).astype('int64')
print(animals[:2])
print(type(animals))
print(len(animals))
The scikit-learn Python library offers two types of nearest neighbors classifiers: KNeighborsClassifier and RadiusNeighborsClassifier.
KNeighborsClassifier implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user, and is the most commonly used technique. RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user.
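The difference is easy to see on a toy one-dimensional example; the numbers below are made up for illustration:
from sklearn.neighbors import (KNeighborsClassifier,
                               RadiusNeighborsClassifier)

X = [[0], [1], [2], [10]]
y = [0, 0, 1, 1]

# k is fixed: the 3 nearest points (2, 1 and 0) vote -> [0]
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2.5]]))

# r is fixed: only the point at 2 lies within radius 1.0 -> [1]
rnn = RadiusNeighborsClassifier(radius=1.0).fit(X, y)
print(rnn.predict([[2.5]]))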
KNeighborsClassifier(n_neighbors=1)
# Deer occurrences, scaled down to thousands
x = [round(val / 1000)
     for val in animals["Deer"].tolist()[:10]]
print(f"x: {x}")
# Roe Deer occurrences, scaled down to thousands
y = [math.floor(val / 1000)
     for val in animals["RoeDeer"].tolist()[:10]]
print(f"y: {y}")
# class labels: Sika Deer occurrences taken from rows 26-35
cls = animals["SikaDeer"].tolist()[26:36]
print(f"z: {cls}")
print(f"len x: {len(x)}, "
      f"len y: {len(y)}, "
      f"len z: {len(cls)}")
data = list(zip(x, y))
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(data, cls)
Predictions and visualisation:
x_predict = 4
y_predict = 7
new_point = [(x_predict, y_predict)]
prediction = knn.predict(new_point)
print(prediction)
plt.scatter(x + [x_predict],
            y + [y_predict],
            c=cls + [prediction[0]])
plt.text(x=x_predict + 0.3, y=y_predict + 0.3,
         s=f"new point, cls: {prediction[0]}",
         color="hotpink")
plt.show()
outputs:
x: [24, 13, 14, 16, 8, 10, 12, 11, 16, 13]
y: [72, 51, 63, 43, 50, 41, 78, 44, 41, 29]
z: [1, 0, 0, 0, 0, 0, 0, 12, 0, 8]
len x: 10, len y: 10, len z: 10
[8]
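Since n_neighbors=1, the prediction is simply the class of the single closest training point. kneighbors() lets us check which point that was (a quick sanity check run right after the code above):
# which training point decided the class of the new point?
dist, idx = knn.kneighbors(new_point)
i = idx[0][0]
print(dist[0][0], data[i], cls[i])  # its distance, coordinates and class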
graph: [scatter plot of the ten training points colored by class, with the new point at (4, 7) annotated "new point, cls: 8"]
neighbors.KNeighborsClassifier()
from sklearn.model_selection import cross_validate, train_test_split
from sklearn import datasets, preprocessing, svm, neighbors
# features: every column except the class column SikaDeer
x = np.array(animals.drop('SikaDeer', axis=1))
y = np.array(animals['SikaDeer'])
X_train, X_test, y_train, y_test = (
    train_test_split(x, y, test_size=0.2))
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)
output (of the first run; the split is random, so the accuracy varies): 0.9
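A single random split gives a noisy estimate, which is why only the first run is quoted. The cross_validate import above can put that right by averaging accuracy over several splits; a short sketch with 5 folds (assuming every class occurs at least 5 times, otherwise stratified folding will complain):
scores = cross_validate(neighbors.KNeighborsClassifier(), x, y, cv=5)
print(scores['test_score'])         # per-fold accuracy
print(scores['test_score'].mean())  # averaged accuracy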
RadiusNeighborsClassifier
RadiusNeighborsClassifier is an interesting alternative to KNN because, instead of a fixed number of neighbors, it takes into account every point within a given radius. It can be useful when the data density is not uniform. I changed the input data a bit: I limited the length of the lists from 10 elements to 4, used reshape(-1, 1) for the 'x' values, and doubled the denominator for the 'y' values.
from sklearn.neighbors import RadiusNeighborsClassifier
x = [round((val / 1000) + 6)
     for val in animals["Deer"].tolist()[:4]]
x = np.array(x).reshape(-1, 1).tolist()
print(f"x: {x}")
# Roe Deer occurrences, with the denominator doubled
y = [math.floor(val / 2000)
     for val in animals["RoeDeer"].tolist()[:4]]
print(f"y: {y}")
neigh = RadiusNeighborsClassifier(radius=25.0)
neigh.fit(x, y)
print(neigh.predict([[1.5]]))
output: [21]
With radius=20.0, the farthest in-radius point drops out of the vote and the prediction changes:
neigh = RadiusNeighborsClassifier(radius=20.0)
output: [25]
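One caveat: if a query point has no training point within the radius at all, predict() raises an error unless the outlier_label parameter is set, which supplies a fallback class. A small sketch on the same x and y:
neigh = RadiusNeighborsClassifier(radius=2.0, outlier_label=-1)
neigh.fit(x, y)
# no training point lies within 2.0 of 1.5, so the
# fallback label -1 is returned instead of an error
print(neigh.predict([[1.5]]))  # [-1]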
To determine which data points are closest to a given query point, we need to define the distance between the query point and the other data points.
Euclidean distance
Euclidean distance is a measure of the straight-line distance between two points in a space of any number of dimensions. For two points p = (p₁, p₂) and q = (q₁, q₂) in two-dimensional space, it can be calculated using the following formula:

d(p, q) = √((q₁ − p₁)² + (q₂ − p₂)²)
from math import sqrt
x = animals["FallowDeer"].tolist()[:4]
print(f"x: {x}")
y = animals["WildBoars"].tolist()[:4]
print(f"y: {y}")
# x and y treated as two points in 4-dimensional space
euclidean_distance = round(sqrt((x[0]-y[0])**2 +
                                (x[1]-y[1])**2 +
                                (x[2]-y[2])**2 +
                                (x[3]-y[3])**2), 4)
print(f"euclidean_distance: "
f"{euclidean_distance}")
outputs:
x: [2755, 3141, 621, 1849]
y: [5996, 2687, 2314, 2581]
euclidean_distance: 3756.6301
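For reference, scikit-learn ships the same measure as euclidean_distances in sklearn.metrics.pairwise, which lets us cross-check the manual computation:
from sklearn.metrics.pairwise import euclidean_distances

# x and y as two points in 4-dimensional space
d = euclidean_distances([x], [y])
print(round(d[0][0], 4))  # 3756.6301, matching the manual result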
Manhattan distance
Manhattan distance is the sum of the absolute differences between the coordinates of two points. It is another popular distance measure, commonly visualized using a grid, illustrating how you can move along city streets from point A to point B. It can be calculated using the following formula:

d(p, q) = |p₁ − q₁| + |p₂ − q₂| + … + |pₙ − qₙ|
from sklearn.metrics.pairwise import manhattan_distances
x = [round(val / 1000)
     for val in animals["Deer"].tolist()[:4]]
x = np.array(x).reshape(-1, 1)
y = [math.floor(val / 1000)
     for val in animals["RoeDeer"].tolist()[:4]]
y = np.array(y).reshape(-1, 1)
# pairwise distances: entry [i][j] is |x[i] - y[j]|
distances = manhattan_distances(x, y)
print(distances)
output:
[[48. 27. 39. 19.]
[59. 38. 50. 30.]
[58. 37. 49. 29.]
[56. 35. 47. 27.]]
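These measures plug straight back into the classifiers: KNeighborsClassifier accepts a metric parameter (the default is 'minkowski' with p=2, i.e. Euclidean distance), so the first example could just as well vote by Manhattan distance. A sketch reusing the earlier data and cls lists:
# same ten training points, but neighbors ranked by Manhattan distance
knn_m = KNeighborsClassifier(n_neighbors=3, metric='manhattan')
knn_m.fit(data, cls)
print(knn_m.predict([(4, 7)]))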