A regression model is used when a relation exists between a dependent continuous variable y and one or more independent variables x. The dependent variable, y, is the target that we try to predict for unknown values of x. The independent variables, x, are the explanatory variables. Regression is the process of fitting a model to explain or predict a continuous value. When only one independent variable is used to explain the dependent variable y, the regression is simple. When more than one independent variable is used to explain the dependent variable y, the regression is multiple.
A linear regression model is used if a linear relation exists between the dependent continuous variable y and one or more independent explanatory variables x. We can define two types of linear regression: simple and multiple.
The model of a simple linear regression is an equation of the form \(y = \theta_0 + \theta_1 x_1\). As an example, the equation y = 3x + 5 is a simple linear model since the relation between x and y is linear: y = f(x) is plotted as a line.
The model of a multiple linear regression is an equation of the form \(y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n\). This means that it can be written as the dot product of two vectors: the parameter vector (\(\theta_0\), \(\theta_1\), \(\theta_2\), …, \(\theta_n\)) and the feature vector (1, \(x_1\), \(x_2\), …, \(x_n\)).
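As a small illustration, with hypothetical values for the parameters and features, the prediction is simply a dot product:
import numpy as np
theta = np.array([5.0, 3.0, 2.0])       # hypothetical (theta_0, theta_1, theta_2)
features = np.array([1.0, 4.0, -1.0])   # hypothetical (1, x_1, x_2)
y_hat = np.dot(theta, features)         # 5*1 + 3*4 + 2*(-1) = 15.0
print(y_hat)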
Let’s import the required libraries.
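A minimal set for the examples in this section (NumPy for the data and Matplotlib for the plots):
import numpy as np
import matplotlib.pyplot as plt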
Let’s generate a dataset using the following Python code:
x = np.arange(-10.0, 10.0, 0.2)               # independent variable: values from -10 to 10, step 0.2
y = 3*x + 5                                   # underlying linear relation
y_noise = 3 * np.random.normal(size=x.size)   # Gaussian noise
ydata = y + y_noise                           # noisy observations of the dependent variable
Let’s start by working with this two-dimensional dataset. Let’s print the first five data points and plot all the points (x, y):
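One way to print the first five pairs; a sketch using the noise-free y values, which is what the output below shows:
for xi, yi in zip(x[:5], y[:5]):
    print("(%.2f, %.2f)" % (xi, yi))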
(-10.00, -25.00)
(-9.80, -24.40)
(-9.60, -23.80)
(-9.40, -23.20)
(-9.20, -22.60)
plt.plot(x, ydata, 'yo')
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
Since our dataset shows a linear trend, we can fit a linear regression model to these data. The blue line is a representation of a linear model that fits well in this case. We will see later the Python code used to obtain optimized values for the model parameters (in this case, the slope and intercept of the blue line).
plt.plot(x, ydata, 'yo')
plt.plot(x, y, 'b')
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
Let’s now consider a real dataset related to the fuel consumption of cars:
https://archive.ics.uci.edu/ml/datasets/auto+mpg
Type of variable (continuous or discrete) and attribute information:
* mpg: continuous
* cylinders: multi-valued discrete
* displacement: continuous
* horsepower: continuous
* weight: continuous
* acceleration: continuous
* model year: multi-valued discrete
* origin: multi-valued discrete
* car name: string (unique for each instance)
Importing Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
Importing the dataset
cars = pd.read_csv("data/cars.data", delim_whitespace=True, names=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin', 'carName'])
print(cars.shape)
(398, 9)
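The first rows can be displayed with head; a minimal call matching the truncated output below:
print(cars.head(3))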
mpg cylinders ... origin carName
0 18.0 8 ... 1 chevrolet chevelle malibu
1 15.0 8 ... 1 buick skylark 320
2 18.0 8 ... 1 plymouth satellite
[3 rows x 9 columns]
Opening the file with a text editor shows that the null values in this dataset are actually marked as ‘?’. Let’s replace all ‘?’ with the NumPy constant np.nan. Then, using the dropna function, we can delete any row that contains a NaN value.
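A minimal sketch of this cleaning step, assuming pandas’ replace and dropna:
cars = cars.replace('?', np.nan)   # mark missing values
cars = cars.dropna()               # drop rows containing missing values
# Note: horsepower was read as text because of the '?' markers; converting it, e.g. with
# cars["horsepower"] = cars["horsepower"].astype(float), may be needed before plotting or modelling it.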
Now, let’s do a descriptive exploration of our data using the describe function.
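For instance (a minimal call matching the summary below, which is truncated by pandas):
print(cars.describe())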
mpg cylinders ... year origin
count 392.000000 392.000000 ... 392.000000 392.000000
mean 23.445918 5.471939 ... 75.979592 1.576531
std 7.805007 1.705783 ... 3.683737 0.805518
min 9.000000 3.000000 ... 70.000000 1.000000
25% 17.000000 4.000000 ... 73.000000 1.000000
50% 22.750000 4.000000 ... 76.000000 1.000000
75% 29.000000 8.000000 ... 79.000000 2.000000
max 46.600000 8.000000 ... 82.000000 3.000000
[8 rows x 7 columns]
Let’s assume that we want to explain the mpg variable (distance travelled per unit volume of fuel used) using some of the other variables, called features or explanatory variables.
We can plot the histogram of all continuous variables from our data set to get an idea of their distributions.
cars.hist(column = ["horsepower", "displacement", "weight", "acceleration", "mpg"], figsize=(10,10))
Let’s plot the quantitative features against mpg, the distance travelled per unit volume of fuel:
plt.scatter(cars.cylinders, cars.mpg, color='yellow')
plt.xlabel('cylinders')
plt.ylabel("Distance travelled per unit volume of fuel used")
plt.show()
plt.scatter(cars.displacement, cars.mpg, color='yellow')
plt.xlabel('displacement')
plt.ylabel("Distance travelled per unit volume of fuel used")
plt.show()
plt.scatter(cars.horsepower, cars.mpg, color='yellow')
plt.xlabel('horsepower')
plt.ylabel("Distance travelled per unit volume of fuel used")
plt.show()
plt.scatter(cars.weight, cars.mpg, color='yellow')
plt.xlabel('weight')
plt.ylabel("Distance travelled per unit volume of fuel used")
plt.show()
plt.scatter(cars.acceleration, cars.mpg, color='yellow')
plt.xlabel('acceleration')
plt.ylabel("Distance travelled per unit volume of fuel used")
plt.show()
plt.scatter(cars.year, cars.mpg, color='yellow')
plt.xlabel('year')
plt.ylabel("Distance travelled per unit volume of fuel used")
plt.show()
We can see that there is a linear trend on some of these plots:
* horsepower vs mpg
* weight vs mpg
Let’s choose to train a simple linear regression model based on the weight feature to explain the mpg variable (i.e., distance travelled per unit volume of fuel used).
We can select 80% of our data as the training data and the remaining 20% as the testing data as follows:
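A minimal sketch of such a split using a random Boolean mask (scikit-learn’s train_test_split could be used instead; the exact splitting method is an assumption here):
msk = np.random.rand(len(cars)) < 0.8   # roughly 80% of rows flagged for training
train = cars[msk]
test = cars[~msk]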
Let’s use a simple linear regression model on our training set. We can use the LinearRegression class from the scikit-learn library. LinearRegression estimates the model parameters, \(\theta_0\) and \(\theta_1\), by minimizing the residual sum of squares between the actual and the predicted values.
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train[["weight"]], train.mpg)
LinearRegression()
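The fitted parameters can then be printed; a minimal sketch matching the output below:
print("Theta1:", lr.coef_)
print("Theta0:", lr.intercept_)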
Theta1: [-0.00796912]
Theta0: 47.19719482575017
The model, in this case, is a simple line. The regression coefficient, \(\theta_1\), and the intercept, \(\theta_0\), in the simple linear regression are the parameters of the fitted line (its slope and intercept): \(y = \theta_0 + \theta_1 x_1\).
We can plot the fit line over the data:
x_train = train[["weight"]]
y_train = train[["mpg"]]
plt.scatter(x_train, y_train, color='blue')
y_predicted = lr.coef_[0] * x_train + lr.intercept_
# Same as
# y_predicted = lr.predict(x_train)
plt.plot(x_train, y_predicted,'-r')
plt.xlabel("car weight")
plt.ylabel("Distance travelled per unit volume of fuel used")
plt.show()
We can compare the actual and predicted values to evaluate the accuracy of a regression model using metrics such as R-squared, mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE).
from sklearn.metrics import r2_score
test_x = test.weight
test_y = test.mpg
test_y_ = lr.predict(test[["weight"]])
print("Testing set: ")
Testing set:
Mean absolute error: 2.70
print("Residual sum of squares (MSE): %.2f" % np.mean((test_y_ - test_y) ** 2))
# The higher the R-squared, the better the model fits your data. Best possible score is 1.0.
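# Hypothetical call producing the R2-score line below (r2_score was imported above):
print("R2-score: %.2f" % r2_score(test_y, test_y_))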
Residual sum of squares (MSE): 12.31
R2-score: 0.78
Using only one feature (here, the car’s weight) to predict fuel efficiency with the linear regression algorithm results in a model that underfits the data. The R2-score indicates poor accuracy of our model on unknown data (the testing set). The model performs poorly because a car’s fuel efficiency is affected by many other factors besides its weight.
We can add more features to the model and see whether it fits the data better. We could compare the in-sample error (the error on the training set) and the out-of-sample error (the error on the testing set). So far, we have used the out-of-sample error by evaluating the model on the testing set.
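As a quick sketch (reusing the simple weight-based model fitted above), the two errors can be compared like this:
train_pred = lr.predict(train[["weight"]])   # in-sample predictions
test_pred = lr.predict(test[["weight"]])     # out-of-sample predictions
print("In-sample MSE: %.2f" % mean_squared_error(train.mpg, train_pred))
print("Out-of-sample MSE: %.2f" % mean_squared_error(test.mpg, test_pred))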
Let’s create a function named train_and_cross_val that:
* takes in a list of features,
* trains a linear regression model on these features,
* uses the KFold class to perform 10 folds (random_state is set to 3 so that the result is reproducible),
* for each fold, calculates the mean squared error and the variance of the model predictions,
* returns the average mean squared error and the average variance.
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import numpy as np
def train_and_cross_val(cols):
    # Takes in a list of column names
    # Returns the average MSE and average variance
    X = cars[cols]
    y = cars["mpg"]
    var_test_values = []
    mse_test_values = []
    kf = KFold(n_splits=10, shuffle=True, random_state=3)
    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        lr = LinearRegression()
        lr.fit(X_train, y_train)
        pred_test = lr.predict(X_test)
        mse_test = mean_squared_error(y_test, pred_test)
        var_test = np.var(pred_test)
        var_test_values.append(var_test)
        mse_test_values.append(mse_test)
    avg_test_mse = np.mean(mse_test_values)
    avg_test_var = np.mean(var_test_values)
    return (avg_test_mse, avg_test_var)
Let’s use the train_and_cross_val function to train linear regression models using the following columns as the features:
import matplotlib.pyplot as plt
cols = ["cylinders", "displacement", "horsepower", "weight", "acceleration","year"]
mse_test_list = []
var_test_list = []
for i in range(len(cols)):
    mse_test, var_test = train_and_cross_val(cols[:i+1])
    mse_test_list.append(mse_test)
    var_test_list.append(var_test)
plt.scatter([1,2,3,4,5,6], mse_test_list, c='r', label='Mean Square error')
plt.scatter([1,2,3,4,5,6], var_test_list, c='y', label='Variance')
plt.xlim(1, 7)
plt.legend()
plt.show()
We can see, as expected, that the more complex the model (i.e., the more features it uses), the higher the variance and the lower the MSE of our model.
The MSE incorporates both the variance of the predictions (which measures the dispersion of the predicted values on the test set) and their bias (how far the predictions are, on average, from the actual values).
An MSE of zero indicates that our model perfectly matches our data, but such a model is very unlikely to predict new, unseen data perfectly. The best model is a balance between an overfitting model (low MSE but high variance) and an underfitting model (high MSE, indicating the model may be too simple, e.g. it does not use enough explanatory variables).
The overall variance increased by around 30% as we increased the model complexity. An increased variance means that the model’s performance on unseen data is less predictable.