We usually split our data into two subsets which are mutually exclusive: a training data set and a test data set.
The training data set is used to train and optimize the model parameters while the test data set is only used to test the resulting model.
Since the test data set is not used to fit the model, the model has no knowledge of its outcomes. The test set can therefore provide an unbiased evaluation of the model based on evaluation metrics. This helps confirm that the trained model generalizes well to new data and is neither overfitting nor underfitting.
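As a minimal sketch of such a split, written with NumPy rather than a library helper (scikit-learn provides a ready-made `train_test_split`, but a manual version shows the mechanics), we shuffle the indices and cut them into two mutually exclusive sets:

```python
import numpy as np

def train_test_split(X, y, test_size=0.25, seed=0):
    """Randomly partition (X, y) into mutually exclusive train and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))            # shuffle indices so the split is random
    n_test = int(len(X) * test_size)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Example: 10 observations with 2 features each, split 70/30
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
```

The model is then fit on `X_train`/`y_train` only, and evaluated on `X_test`/`y_test`.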
A single train-test split has two main limitations:
1. The evaluation scores are highly variable depending on which observations end up in the training set and which end up in the test set.
2. Reserving a large portion of the data for a single test set reduces the number of observations available to train the model.
To overcome these limitations, we can use the cross-validation technique.
Cross-validation is a model validation technique that involves partitioning a sample of data into subsets.
The analysis is performed on one subset (called the training set), and validated on the other subset (called the validation set or testing set).
To reduce variability, multiple rounds of cross-validation are performed using different partitions. The validation results are combined.
As an example, we can partition a data set into 8 subsets (folds) and perform a series of train-and-evaluate cycles: each time we train on 7 of the folds and test on the remaining one. We repeat this cycle 8 times, using a different fold for evaluation each time, as follows:
[Figure: 8-fold cross-validation. Source: Wikipedia]
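The 8-fold procedure above can be sketched in NumPy (scikit-learn's `KFold` and `cross_val_score` do this for you; the manual version below makes each step explicit). Here the model is a simple least-squares line fit via `np.polyfit`, chosen only for illustration:

```python
import numpy as np

def kfold_indices(n_samples, k=8, seed=0):
    """Yield (train_idx, test_idx) pairs; each of the k folds serves once as the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# 8-fold cross-validation of a line fit on noisy linear data
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 80)
y = 2 * x + rng.normal(0, 0.1, size=80)

scores = []
for train_idx, test_idx in kfold_indices(len(x), k=8):
    coefs = np.polyfit(x[train_idx], y[train_idx], deg=1)   # train on 7 folds
    pred = np.polyval(coefs, x[test_idx])                   # evaluate on the held-out fold
    scores.append(np.mean((pred - y[test_idx]) ** 2))       # per-fold mean squared error

cv_score = np.mean(scores)   # combine the 8 validation results into one estimate
```

Averaging the 8 per-fold scores reduces the variability that a single split would exhibit, and every observation is used for both training and evaluation.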
[Figure: underfitting vs. overfitting models. Source: https://www.datarobot.com/]
An underfitting model has high training and high testing error while an overfitting model has a low training error but a high testing error.
Overfitting is the term used to indicate that the trained model is too specifically adapted to the training data: it captures the random noise in the data rather than the underlying relationships between variables. This problem occurs when the model is too complex. An overfitting model has high variance and low bias, where variance refers to how much the model depends on the particular training data it was given.
An overfitting model can map the training data very accurately, but it is unable to produce generalized results on unseen data such as the test set. The test data set is important for exposing such issues.
Overfitting can happen when the training set is relatively small and the resulting model is too complex: too many features/independent variables are used to train the model compared to the number of observations.
In this case, the variance describes the error that occurs because of the variability of the model’s predicted values.
Underfitting, as opposed to overfitting, is the term used when the model is too simple and misses the trend of the training data. Neither an overfitting model nor an underfitting model can be generalized to new data.
Underfitting usually occurs when not enough features/independent variables are used to train the model, or when the regression algorithm is not complex enough to fit the data. For example, it can happen when a linear regression model is trained to fit non-linear data.
In an underfitting model, we talk about the bias to describe the error that occurs because of bad assumptions from the learning algorithm. An underfitting model has low variance and high bias.
As an example, assuming that only one feature, e.g., the car's weight, determines a car's fuel efficiency yields an underfitting regression model. The error rate will be high, since a car's fuel efficiency is affected by many other factors besides its weight.
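The linear-model-on-non-linear-data case mentioned above can be sketched in a few lines of NumPy: a straight line fit to clearly quadratic data misses the trend, while a model matching the data's shape does not:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
y = x ** 2 + rng.normal(0, 0.2, 100)     # a clearly non-linear relationship

linear = np.polyfit(x, y, deg=1)          # straight line misses the curvature (underfits)
quadratic = np.polyfit(x, y, deg=2)       # matches the shape of the data

mse_linear = np.mean((np.polyval(linear, x) - y) ** 2)
mse_quadratic = np.mean((np.polyval(quadratic, x) - y) ** 2)
```

The underfitting linear model has a large error even on the training data itself, which is the high-bias signature described above: no amount of extra data fixes it, because the model's assumptions are wrong.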