Clustering K_Mean K_Shift

Clustering is a technique used in unsupervised machine learning in which a machine is given a dataset and tasked with finding groups, or clusters, within the data. There are two main forms of clustering: flat clustering and hierarchical clustering. In flat clustering, we specify the number of clusters that the machine should find. Hierarchical clustering allows the machine to determine the number of clusters and the subgroups within them.

For the exercises, we will use a dataset that I developed based on publicly available reports and data on forests and forest wildlife.

Animal_data_2024_large

Animal_data_2024_large Download

import pandas as pd
import numpy as np
import math

# Load the dataset
animals = pd.read_csv(
    'Animal_data_2024_large.csv',
    sep=';')

print(animals.info())


for col in animals.select_dtypes(include=['float64']).columns:
    animals[col] = animals[col].fillna(0).astype('int64')

K-means is a simple and widely used clustering algorithm whose goal is to partition a dataset into k distinct clusters.The algorithm begins by randomly initializing k cluster centroids and then iteratively assigning each data point to the nearest centroid. It subsequently recalculates the centroids based on the mean of the data points assigned to each cluster. This process continues until convergence is reached, meaning that the centroids no longer change significantly, or until the maximum number of iterations has been reached.

To apply the K-means algorithm we can use scikit-learn, that provides a straightforward implementation of the K-means algorithm, making it easy to apply in real-world examples.

Libriaries we will use:

from sklearn.model_selection import cross_validate,train_test_split
from sklearn import preprocessing, svm, neighbors
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

from sklearn.cluster import KMeans
from sklearn.cluster import MeanShift
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

Data Preparation and Standardization

X = animals.drop(['Hunting_Dist_ID','Hunting_Dist_name'], axis=1)

# Conversion
X = X.values

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)

K-Means with k number of clusters and pre-visualization


# K-MEANS

inertia = []
K_range = range(1, 10)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)

plt.plot(K_range, inertia, marker='o')
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()

K-Means with a defined number of clusters of 3

# Final K-Means

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

animals["cluster_kmeans"] = clusters
print(animals["cluster_kmeans"].value_counts())

output: cluster_kmeans
0 1956
1 629
2 13
Name: count, dtype: int64

Visualization (2D)

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

plt.scatter(X_2d[:,0], X_2d[:,1], c=clusters)
plt.title("K-Means clusters")
plt.show()

Heart4DataScience

Exploring data with passion: analysis, deep insights and beyond

Clustering K_Mean K_Shift