R Squared & Testing assumptions

R Squared

R² (the coefficient of determination) is a statistical measure that represents the proportion of the variance in a dependent variable explained by the independent variables in a regression model, computed as R² = 1 − SS_res / SS_tot. For a least-squares model with an intercept it ranges from 0 to 1, where 1 indicates a perfect fit and 0 indicates that the model explains none of the variance; scikit-learn's r2_score can also return negative values when the predictions are worse than simply predicting the mean.
In Python, we can calculate R² with the scikit-learn library by importing the following function:

from sklearn.metrics import r2_score

Using the data from fit-slope:

import numpy as np

X = np.array([1.0, 1.6, 7.0, 2.15, 7.96,
              1.0, 0.79, 5.0, 0.0, 13.32],
             dtype=np.float64)
Y = np.array([3.0, 2.16, 9.0, 1.1, 16.36,
              1.0, 0.49, 7.5, 2.16, 4.99],
             dtype=np.float64)

# Treat the first half of X as the "true" values and the
# second half as the "predicted" values for this example
x_true = list(X[:5])
x_pred = list(X[5:])
print(x_true)  # output: [1.0, 1.6, 7.0, 2.15, 7.96]
print(x_pred)  # output: [1.0, 0.79, 5.0, 0.0, 13.32]
R_Squared = r2_score(x_true, x_pred)
print(R_Squared)  # output: 0.11293785743225693
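Under the hood, r2_score implements exactly the R² = 1 − SS_res / SS_tot formula mentioned above, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean of the true values. As a minimal sketch, here is the same calculation in plain NumPy, reusing x_true and x_pred:

# Manual R² calculation: 1 minus the ratio of the residual
# to the total sum of squares
y_true = np.array(x_true)
y_hat = np.array(x_pred)

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot
print(r2_manual)  # matches the r2_score output above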

R² can also be calculated for a polynomial fit. Using the data from linear-regression, import the following class:

from sklearn.preprocessing import PolynomialFeatures

# Other imports this snippet relies on
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Load the "taxis" dataset
taxis = sns.load_dataset('taxis')

# Create a dataframe with the columns of interest
df = taxis.iloc[:, :14]
df = df[['passengers',
         'distance', 'fare']]

# Input values: dropping 'passengers' and 'fare' leaves
# only the 'distance' column
X = np.array(df.drop(columns=['passengers', 'fare']))

y = np.array(df['fare'])

# Polynomial Features
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)

# Create an instance of the linear regression model
clf = LinearRegression()
clf.fit(X_poly, y)
y_pred = clf.predict(X_poly)

# R² calculation
R2_sklearn = r2_score(y, y_pred)
print(f"R² (Scikit-Learn Calculation): {R2_sklearn}")
#R² (Scikit-Learn Calculation): 0.8466800744761954

# Visualization (points sorted by X so the fitted curve
# is drawn as a smooth line rather than a zigzag)
order = np.argsort(X[:, 0])
plt.scatter(X, y, color='green', label='Actual')
plt.plot(X[order], y_pred[order], color='yellow',
         linewidth=2, label='Predicted')
plt.title('Real vs Predicted Polynomial Regression')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()
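For comparison, a quick sketch (my addition, reusing X, y, and R2_sklearn from above) fits the same data without the polynomial expansion, which shows how much the quadratic term improves the in-sample fit:

# Baseline: plain linear fit on the raw 'distance' feature
lin_reg = LinearRegression()
lin_reg.fit(X, y)
R2_linear = r2_score(y, lin_reg.predict(X))

print(f"R² (degree 1): {R2_linear}")
print(f"R² (degree 2): {R2_sklearn}")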

Testing assumptions

Defining the problem and collecting the necessary data is the first stage of building a machine learning model. This includes understanding the problem domain, identifying the variables of interest, and collecting or generating an appropriate data set.

The next step is to pre-process the data: handling missing values, scaling numerical features, and dividing the data set into training and test sets. Another important element of this stage is selecting a machine learning algorithm appropriate for the given task.
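As a minimal, self-contained sketch of those pre-processing steps on the taxis data used below (StandardScaler and dropna are my choices here, not prescribed by the text):

import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Handle missing values by dropping incomplete rows
df = sns.load_dataset('taxis')[['distance', 'fare']].dropna()

# Scale the numerical feature
X = StandardScaler().fit_transform(df[['distance']])
y = df['fare'].to_numpy()

# Divide the data set into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

StandardScaler is used here instead of preprocessing.scale so that the same learned transformation can later be applied to unseen data.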

Once an algorithm has been selected, we can train the model on the training data set. Training lets the model capture the underlying patterns and relationships in the data, enabling it to make accurate predictions or decisions.

Testing assumptions is a critical aspect of programming machine learning models. Assumptions are made during the modeling process to simplify a problem or to make certain predictions feasible, so it is important to verify that they actually hold in the given context; one such check, on the residuals of the model fitted below, is sketched after the script.

import seaborn as sns
import pandas as pd
import numpy as np

from sklearn import preprocessing
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

import matplotlib.pyplot as plt
from matplotlib import style

style.use('dark_background')

# Load the "taxis" dataset
taxis = sns.load_dataset('taxis')

# Create a dataframe with the columns of interest,
# limited to the first 100 rows
df = taxis.iloc[:, :14]
df = df[['passengers',
         'distance', 'fare']]
df = df[:100]
print(len(df))  # output: 100

# Input values: dropping 'passengers' and 'fare' leaves
# only the 'distance' column
X = np.array(df.drop(columns=['passengers',
                              'fare']))

# Bin fares into three quantile-based categories (0, 1, 2)
# to use as the target
df['fare_category'] = pd.qcut(df['fare'],
                              q=3, labels=[0, 1, 2])
y = np.array(df['fare_category'])

# Preprocessing scale for X
X = preprocessing.scale(X)
print(f"\npreprocessing.scale(X):\n"
      f"{X[:4]}")

# Test and training sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Polynomial Features (fitted on the training set only,
# then applied to the test set)
poly_features = PolynomialFeatures(degree=2,
                                   include_bias=False)
X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

# Create an instance of the linear regression model
# and fit it to the training data
clf = LinearRegression()
clf.fit(X_train_poly, y_train)
y_pred = clf.predict(X_test_poly)

# R² calculation on the held-out test set
R2_sklearn = r2_score(y_test, y_pred)
print(f"\n R² Scikit-Learn Calculation: "
      f"{R2_sklearn}")

# Example measures: illustrative two-feature inputs matching the
# shape the model expects (for genuine polynomial features the
# second column would be the square of the first)
example_values = np.array([
    [-0.39234567,  0.39234567],
    [ 0.22123456, -0.22123456],
    [-0.10982360,  1.00120304],
    [ 0.15744567, -0.22149996],
    [ 0.11121234,  0.50000000],
    [ 0.22123456, -0.22123456]
])
print(f"\n example values: "
      f"\n {example_values[:3]}")
# Prediction
prediction = clf.predict(example_values)
print(f"\n prediction of example values: "
      f"\n {prediction}")
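Following up on the note about testing assumptions, here is the residual check mentioned earlier: a sketch reusing y_test and y_pred from the script above. The Shapiro-Wilk normality test and the residual-versus-prediction plot are my choice of checks, not something the original model prescribes:

from scipy import stats

# Residuals of the polynomial model on the held-out test set
residuals = y_test - y_pred

# Shapiro-Wilk test: a small p-value suggests the residuals
# deviate from a normal distribution
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk statistic: {stat}, p-value: {p_value}")

# Residual plot: a pattern-free cloud around zero supports
# the linearity assumption
plt.scatter(y_pred, residuals, color='green')
plt.axhline(0, color='yellow', linewidth=2)
plt.title('Residuals vs Predicted')
plt.xlabel('Predicted')
plt.ylabel('Residual')
plt.show()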

results: