Support Vector Machine

A support vector machine (SVM) is a supervised machine learning algorithm that is useful for complex data sets with nonlinear decision boundaries. SVM is a binary classifier: it divides the data into two groups by finding the hyperplane that maximally separates the data points of the two classes. The hyperplane is defined by a subset of the data points, called support vectors, that lie closest to the decision boundary. Because the separating margin is maximized, SVM is relatively resistant to overfitting and is particularly effective for small data sets and high-dimensional problems.

Python libraries such as scikit-learn make implementing support vector machines straightforward: scikit-learn's svm module lets users train and evaluate models in a few lines of code.
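
To make the idea concrete, here is a minimal sketch on made-up 2-D points (the data is purely illustrative): a linear SVC is fitted and the support vectors that define the separating hyperplane are printed.

from sklearn import svm

# Made-up 2-D points forming two clearly separable groups (illustrative only)
X = [[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]]
y = [0, 0, 0, 1, 1, 1]

clf = svm.SVC(kernel='linear')
clf.fit(X, y)

# The support vectors are the training points closest to the decision boundary
print(clf.support_vectors_)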

Our CSV data file, Animal_data_2024_large, is a good data set on which to practice SVM. As before, you can download it using the link below:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm, neighbors
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the dataset
animals = pd.read_csv(
    'Animal_data_2024_large.csv',
    sep=';')

As in the KNN lesson, we will transform the data, replacing missing values in individual records and converting the float64 columns to a single type, int64.

# Replace missing values with 0 and cast float64 columns to int64
for col in animals.select_dtypes(include=['float64']).columns:
    animals[col] = animals[col].fillna(0).astype('int64')
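
As a quick, optional sanity check, we can confirm how the cleaning step affected the frame:

# Count remaining missing values (0 if all NaNs were in float64 columns)
print(animals.isnull().sum().sum())
# Inspect the resulting column dtypes
print(animals.dtypes.value_counts())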

# Division into features (X) and labels (y)
x = np.array(animals.drop('SikaDeer', axis=1))
y = np.array(animals['SikaDeer'])

print(f"x: {x[:10]}"
      f"\ny: {y[:10]}")

output:
x: [[ 1 370 23874 2755 72650 1042 5996]
[ 2 259 13132 3141 51731 209 2687]
[ 3 318 13589 621 63613 0 2314]
[ 4 206 15730 1849 43681 0 2581]
[ 5 299 8057 1235 50945 0 2280]
[ 6 242 10054 407 41010 39 3060]
[ 7 572 12152 1057 78336 0 4255]
[ 8 147 11060 2214 44853 39 2018]
[ 9 201 15874 967 41711 0 1700]
[ 10 306 12546 133 29067 0 1646]]
y: [0 0 0 0 1 0 0 0 0 0]

We will split the data into training and test sets and run the KNN classifier again for comparison.

# Training and test sets
X_train, X_test, y_train, y_test = (
    train_test_split(x, y, test_size=0.2))

# KNN 
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
knn_accuracy = clf.score(X_test, y_test)

Now for SVM itself. We will begin with C-Support Vector Classification (SVC).

# SVC with default parameters (RBF kernel)
clf = svm.SVC()
clf.fit(X_train, y_train)
svm_accuracy = clf.score(X_test, y_test)

# SVC with a linear kernel
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
svm_accuracy_lin = clf.score(X_test, y_test)

print(f"knn_accuracy: {knn_accuracy}"
      f"\nsvm_accuracy: {svm_accuracy}"
      f"\nsvm_accuracy_linear: {svm_accuracy_lin}")

In the first example we used SVC without specifying any parameters, so the default RBF kernel was applied. In the second example we set the kernel to linear.

first run:
knn_accuracy: 0.8
svm_accuracy: 0.8
svm_accuracy_linear: 0.8

second run:
knn_accuracy: 0.8
svm_accuracy: 0.8
svm_accuracy_linear: 0.7

third run:
knn_accuracy: 0.9
svm_accuracy: 0.9
svm_accuracy_linear: 0.6
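
The scores differ between runs because train_test_split shuffles the data randomly when no random_state is given, so each run trains and tests on a different split. One way to get a more stable estimate is cross-validation; a minimal sketch using scikit-learn's cross_val_score:

from sklearn.model_selection import cross_val_score

# Average the accuracy over 5 different train/test splits
scores = cross_val_score(svm.SVC(kernel='linear'), x, y, cv=5)
print(f"mean accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")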

Now we will standardize the data using StandardScaler from sklearn.preprocessing. Standardization rescales each feature to zero mean and unit variance (z = (x - mean) / std), which matters for SVM because kernels such as RBF are sensitive to feature scales. We will also pass random_state=42 to train_test_split so that the split, and therefore the result, is reproducible, and we will specify the SVC parameters explicitly: the RBF kernel, C at 1.0, and gamma set to 'scale'.

scaler = StandardScaler()
# For simplicity we scale before splitting; strictly speaking, the scaler
# should be fitted on the training set only to avoid data leakage
X_scaled = scaler.fit_transform(x)

X_train, X_test, y_train, y_test = (
    train_test_split(X_scaled,
                     y, test_size=0.2,
                     random_state=42))

clf = svm.SVC(kernel='rbf', C=1.0, gamma='scale')
clf.fit(X_train, y_train)

# Prediction
y_pred = clf.predict(X_test)

# Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Accuracy: 0.9
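
Accuracy alone can hide problems on imbalanced data (note that most labels in y above are 0). A quick way to see per-class behaviour is scikit-learn's classification_report:

from sklearn.metrics import classification_report

# Precision, recall and F1 per class for the predictions above
print(classification_report(y_test, y_pred))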

For comparison, the same pipeline with a linear kernel and a higher C scores noticeably lower on this data (gamma has no effect with a linear kernel):

clf = svm.SVC(kernel='linear', C=2.0, gamma='auto')

Accuracy: 0.7
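
Rather than trying kernel and C combinations by hand, scikit-learn's GridSearchCV can search over them automatically. A minimal sketch (the parameter grid below is an illustrative assumption, not a tuned recommendation):

from sklearn.model_selection import GridSearchCV

# Cross-validate every kernel/C combination in an illustrative grid
param_grid = {'kernel': ['linear', 'rbf'], 'C': [0.5, 1.0, 2.0]}
search = GridSearchCV(svm.SVC(gamma='scale'), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)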