Linear Regression

Machine learning is a subfield of artificial intelligence that focuses on developing algorithms and models that enable computers to learn and make predictions or decisions based on data. Linear regression is a simple but effective technique used to predict a continuous dependent variable from one or more independent variables.

Working with well-known datasets available in seaborn or sklearn (scikit-learn), which we can explore freely, for example using the unified approach to data processing I described in my post unification-of-data-training-methods, we will select a sample dataset and perform linear regression on it. Data preprocessing is one of the basic steps in machine learning: it involves cleaning, transforming, and normalizing data, and includes handling missing values, encoding categorical variables, and scaling numeric features.
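As a minimal illustration of those three preprocessing steps, here is a sketch on a small hypothetical frame (invented values, not the taxis data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a missing value and a categorical column
raw = pd.DataFrame({'distance': [1.5, np.nan, 3.0],
                    'payment': ['cash', 'credit card', 'cash']})

# 1. Handle missing values: fill numeric gaps with the column median
raw['distance'] = raw['distance'].fillna(raw['distance'].median())

# 2. Encode categorical variables: one-hot encode 'payment'
encoded = pd.get_dummies(raw, columns=['payment'])

# 3. Scale numeric features: standardize to zero mean, unit variance
encoded['distance'] = ((encoded['distance'] - encoded['distance'].mean())
                       / encoded['distance'].std())
print(encoded)
```

Each step can of course be swapped for a different strategy (mean imputation, label encoding, min-max scaling); the point is only the order of operations.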

Let’s start by downloading the dataset, loading its basic information and selecting the columns that interest us in this exercise.

import seaborn as sns
import pandas as pd
import numpy as np
import math

# more libraries will be
# added later

# Load the "taxis" dataset
taxis = sns.load_dataset('taxis')

# Display basic information of dataset
print(taxis.info())

# Check for missing values
print(f"\nColumns:\n"
      f"{list(taxis.columns)}")
print(f"Check for missing values:\n"
      f"{list(taxis.isna().sum())}")

# Display first and last rows as a partial view
part_view = pd.concat((taxis.head(2), taxis.tail(2)))
print(f"\nDataset part view:\n{part_view}")

# Create a dataframe with only the columns we need
# (.copy() avoids SettingWithCopyWarning when we add columns later)
df = taxis[['pickup', 'passengers',
            'distance', 'fare',
            'tip', 'payment']].copy()

The columns I chose from the taxis dataset are: pickup time, number of passengers, distance covered, fare charged, tip given, and payment method.

Based on three features: distance, fare and tip, I want to create another column with an overall price measure for a taxi ride (fare plus tip per unit of distance). I must immediately take into account the fact that, although there are no missing values here, there may be zeros in the distance column, which would cause division by zero in my expression.

# Adding new column
df['price'] = np.where(df['distance'] == 0, 0, 
                       round((df['fare'] + df['tip']) / 
                             df['distance'], 2))

I drop any N/A or infinite values just in case, and check the result for the first few rows.

df = df.replace([np.inf, -np.inf], np.nan).dropna()

print(f"\nCreated dataset:\n"
      f"{df.head(6)}")

Regression models establish the relationship between independent variables (features) and dependent variables (labels) to make predictions. Features are input variables that are used to make predictions in regression models. Labels are the values we want to predict with the regression model. To create a regression model, we need a dataset that contains both features and labels. The dataset is divided into two subsets: a training set and a testing set.

forecast_col = 'price'
forecast_out = int(math.ceil(0.01 * len(df)))
print(forecast_out)

# Label: the price shifted forecast_out rows ahead
df['label'] = df[forecast_col].shift(-forecast_out)
df = df.fillna(0)
print(df[:4])
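To see what the shift is doing, here is a toy sketch (hypothetical values, not the taxis data): `shift(-n)` moves each value `n` rows up, so row `i` receives the price from row `i + n`, and the last `n` rows become NaN.

```python
import pandas as pd

toy = pd.DataFrame({'price': [10.0, 11.0, 12.0, 13.0, 14.0]})

# Row i of 'label' is the price from row i + 2;
# the last 2 rows have no future value and become NaN
toy['label'] = toy['price'].shift(-2)
print(toy)
```

This is why the code above has to deal with the trailing missing values (here with `fillna(0)`).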

Time for regression. Let’s add the missing libraries and methods.

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

So the next lines of code are as follows:

# Regression
# Input values: the four feature columns only. We must also drop
# 'label', otherwise the value we are predicting leaks into the features.
X = np.array(df.drop(columns=['pickup', 'payment',
                              'price', 'label']))
print(f"\nX for taxis columns: 'passengers',"
      f" 'distance', 'fare', 'tip', first 6 rows\n"
      f"{X[:6]}")
# label
y = np.array(df['label'])

Next we can scale our features using the preprocessing.scale function. This standardizes each feature to zero mean and unit variance; it does not force values into a fixed range, but most values end up near the interval -1 to 1. Scaling can improve accuracy and training speed for many estimators. Strictly speaking, the scaler should be fitted on the training set only and then applied to the test set to avoid information leakage; for simplicity we scale everything at once here.

X = preprocessing.scale(X)
print(f"\npreprocessing.scale(X):\n"
      f"{X[:2]}")

After scaling, each column of X has zero mean and unit variance.

Now we split our data into a training set and a testing set.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

And we reach the final element of our program: creating the estimator, in this case scikit-learn's built-in LinearRegression, and fitting it to the training data.

# Create an instance of the linear regression model
clf = LinearRegression()
# Fit the model to the training data
clf.fit(X_train, y_train)
print("\nCoefficients:\n", clf.coef_)

The printed coefficients show the weight the model assigned to each of the scaled features: 'passengers', 'distance', 'fare' and 'tip'.
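A natural next step, which the walkthrough above stops short of, is to score the fitted model on the held-out test set and predict new values. Here is a self-contained sketch on synthetic data (the variable names mirror the post, but the numbers are made up, so the actual taxis scores will differ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the taxis features: y = 3*x0 - 2*x1 + noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = LinearRegression()
clf.fit(X_train, y_train)

# R^2 on unseen data: 1.0 is a perfect fit,
# 0.0 is no better than predicting the mean
print("R^2 on test set:", clf.score(X_test, y_test))

# Predict for a new, unseen feature row
print("Prediction:", clf.predict([[1.0, 0.5]]))
```

The `score` method reports the coefficient of determination R², which is the standard quick check of how well a linear regression generalizes beyond its training data.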