Train/Test Split and Cross Validation – A Python Tutorial


Last Updated on October 13, 2020

What is a training and testing split? It is the splitting of a dataset into multiple parts. We train our model using one part and test its effectiveness on another.

In this article, our focus is on the proper methods for modelling a relationship between 2 assets.

We will check if bonds can be used as a leading indicator for the S&P500.

Table of contents

  1. What is data splitting in modelling?
  2. What is a Training set?
  3. What is a Validation set?
  4. What is a Test Set?
  5. Why do we need to split our data?
  6. How to Train our Model?
  7. How do we use the Validation Set?
  8. Hyper-parameter Tuning with the Validation Set
  9. How do we test our model on our Testing Set?
  10. What is Cross-Validation and why do we use it?
  11. Cross-Validation for Standard Data
  12. Cross-Validation for Time Series Data

What is data splitting in modelling?

Data splitting is the process of splitting data into 3 sets:

  • Data which we use to design our models (Training set)
  • Data which we use to refine our models (Validation set)
  • Data which we use to test our models (Testing set)

If we do not split our data, we might test our model with the same data that we use to train our model.

Example

If the model is a trading strategy specifically designed for Apple stock in 2008, and we test its effectiveness on Apple stock in 2008, of course it is going to do well.

We need to test it on 2009’s data. Thus, 2008 is our training set and 2009 is our testing set.

To recap, here is what training, validation and testing sets are…

What is a Training Set?

The training set is the set of data we analyse (train on) to design the rules in the model.

A training set is also known as the in-sample data or training data.

What is a Validation Set?

The validation set is a set of data that we did not use when training our model, and that we use to assess how well these rules perform on new data.

It is also a set we use to tune parameters and input features for our model so that it gives us what we think is the best performance possible for new data.

What is a Test Set?

The test set is a set of data we did not use to train our model or use in the validation set to inform our choice of parameters/input features.

We will use it as a final test once we have decided on our final model, to get the best possible estimate of how successful our model will be when used on entirely new data.

A test set is also known as the out-of-sample data or test data.

Why do we need to split our data?

To prevent look-ahead bias, overfitting and underfitting.

  • Look-ahead bias: Building a model based on data that is not supposed to be known.
  • Overfitting: This is the process of designing a model that adapts so closely to historical data that it becomes ineffective in the future.
  • Underfitting: This is the process of designing a model that fits historical data so loosely that it fails to capture the underlying patterns, making it ineffective on both historical and future data.

Look-Ahead Bias

Let’s illustrate this with an example.

Here is Amazon’s stock performance from 2013 to 2020.

Amazon price chart, 2013 to 2020. Source: Tradingview.com

Wow, it is trending up rather smoothly. I’ll design a trading model that invests in Amazon as it trends up.

I then test my trading model on this same dataset (2013 to 2020).

To my non-surprise, the model performs brilliantly and I make a lot of hypothetical monies. You don’t say!

When the trading model is being tested from 2013, it knows what Amazon’s 2014 stock behavior will be because we took into account 2014’s data when designing the trading model.

The model is said to have “looked ahead” into the future.

Thus, there is look-ahead bias in our model. We built a model based on data we were not supposed to know.

Overfitting

In the simplest sense, when training, a model attempts to learn how to map input features (the available data) to the target (what we want to predict).

Overfitting is the term used to describe when a model has learnt this relationship “too well” for the training data.

By “too well” we mean rather that it has learnt the relationship too closely- that it sees more trends/correlations/connections than really exist.

We can think of this as a model picking up on too much of the “noise” in the training data- learning to map exact and very specific characteristics of the training data to the target when in reality these were one-off occurrences/connections and not representative of the broader patterns generally present in the data.

As such, the model performs very well for the training data, but flounders comparatively with new data. The patterns developed from the training data do not generalise well to new unseen data.

This is almost always a consequence of making a model too complex- allowing it to have too many rules and/or features relative to the “real” number of patterns that exist in the data. It is possibly also a consequence of having too many features for the number of observations (training data) we have to train with.

For example in the extreme, imagine we had 1000 pieces of training data and a model that had 1000 “rules”. It could essentially learn to construct rules that stated:

  • rule 1: map all data with features extremely close to x1,y1,z1 (which happen to be the exact features of training data 1) to the target value of training data 1.
  • rule 2: map all data with features extremely close to x2,y2,z2 to the target value of training data 2.
  • …
  • rule 1000: map all data with features extremely close to x1000,y1000,z1000 to the target value of training data 1000.

Such a model would perform excellently on the training data, but would probably be nearly useless on any new data that deviated even slightly from the examples that it trained on.

You can read more about overfitting here: What is Overfitting in Trading?

Underfitting

By contrast, underfitting is when a model is too non-specific. I.e., it hasn’t really learnt any meaningful relationships between the training data and the target variable.

Such a model would perform well neither on the training data nor any new data.

This is a rather rarer occurrence in practice than overfitting, and usually occurs because a model is too simple- for example imagine fitting a linear regression model to non-linear data, or perhaps a random forest model with a max depth of 2 to data with many features present.

In general you want to develop a model that captures as many of the patterns that exist in the training data as possible, while still generalising well (being applicable) to new unseen data.

In other words, we want a model that is neither overfitted nor underfitted, but just right.

How to Train our Model

To see how these concepts play out in reality, let’s try building an actual model.

Our Model: To check if yesterday’s 2-10 Bond Spread can predict today’s SPX prices.

We will use some data called:

  • “2-10 Bond Spread”, which is the spread between the 10-Year U.S. Treasury Constant Maturity Rate and the 2-Year U.S. Treasury Constant Maturity Rate, and
  • “SPX, SPCFD: Compare” which is a market-capitalization weighted index of the 500 largest U.S. publicly traded companies.

Visually our data looks like this:

Blue: 2-10 Bond Spread. Red: SPX prices.

Let’s go ahead and load up some example market data of 2-10y US bond spread against SPX daily close:

import numpy as np
import pandas as pd
from pathlib import Path

df = pd.read_csv(Path("QUANDL_FRED_T10Y2Y, 1D 80PERCENT.csv"))

Note that before doing any modelling we will lag the 2-10 US bond spread by 1 day as we want to regress SPX(t) on 2-10(t-1), because as we said we want to check if yesterday’s 2-10 value has any effect on today’s SPX value. We should also use the returns (proportional price change from the last day) rather than the actual price today.

Shift the bond spread so that yesterday’s bond spread is regressed against today’s price:

df['2-10 Bond Spread'] = df['2-10 Bond Spread'].shift(1) 

Regress against (percentage) change from yesterday’s price instead of today’s absolute price:

df['returns']=df['SPX, SPCFD: Compare']/df['SPX, SPCFD: Compare'].shift(1) - 1
df

And now clean up a bit by removing the first row, which now has an N/A in the “2-10 Bond Spread” and “returns” columns since we shifted “2-10 Bond Spread” forward by one:

df = df[1:] # remove first row with an N/A
df

Finally, it would be convenient to set the index of the dataframe to the values of the time column instead of arbitrary integers, since the chronology of the data is important.

Note that the type of the time column is currently string (you can check this yourself with the type(x) function), so let’s first set it to datetime64 and then set the index to the time column:

df['time'] = df['time'].astype('datetime64[ns]') # change "time" column type from str to datetime64
df.set_index('time', inplace=True) # set time column as the index

df

Now let’s use all of the data to build our model, and then test our model with the same data and see what happens.

First let’s import from sklearn an easy-to-use regression model for demonstrative purposes, and the mean_squared_error function to help us build a root mean squared error evaluation function for testing our model:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

Note that in sklearn often you will find both a “Regressor” and “Classifier” version of the same algorithm (for instance in this case RandomForest). You simply want to use the classifier version when you are predicting into a finite number of categories (for instance horse, shoe, duck) and the regressor version when you are attempting to predict a continuous numerical output (as we are here).
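For instance, purely as an illustration of the naming convention (for this article we only need the regressor):

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# categorical target (e.g. "up"/"down" labels) -> use the Classifier version
classification_model = RandomForestClassifier(n_estimators=100, random_state=1)

# continuous numerical target (e.g. tomorrow's return) -> use the Regressor version
regression_model = RandomForestRegressor(n_estimators=100, random_state=1)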

Now let’s create our training dataframe, and our target dataframe:

# set the training data columns and target variable
y_train = df["returns"]
X_train = df.drop(columns=["SPX, SPCFD: Compare", "returns"])

And initialise a RandomForestRegressor with a few hyper-parameter values set. Note that:

  • random_state just seeds the randomness in the algorithm- if we define it explicitly we will get completely repeatable results, otherwise it is chosen randomly and so the final model will vary slightly every time
  • max_depth is a variable controlling the depth of the decision trees spawned in the forest- the greater the depth, the more splits the model can make on the data (up to 2^(max_depth) leaf nodes), allowing the model to grow more complex
  • n_estimators is the number of individual decision trees in the “forest”

random_forest = RandomForestRegressor(max_depth=5, n_estimators=100, random_state=1)

Lets fit (train) our model on all of our data:

random_forest.fit(X_train, y_train)

And check the root mean squared error of our fitted model on the exact same data:

root_mean_squared_error = np.sqrt(mean_squared_error(y_train, random_forest.predict(X_train)))
root_mean_squared_error
0.009310731251200473

Okay, so we have an error of 0.00931 as a baseline performance for our model.

Note that this is pretty horrible, since the average absolute daily return (which you can calculate with np.abs(y_train).mean()) is 0.0065, so our average error is ~143% the size of the average daily return- clearly our model is not very accurate.
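For reference, a quick way to run that comparison yourself (using the y_train and root_mean_squared_error variables from above):

# compare the model's RMSE against the average absolute daily return
average_abs_return = np.abs(y_train).mean()
print(average_abs_return)                            # ~0.0065
print(root_mean_squared_error / average_abs_return)  # ~1.43, i.e. the error is ~143% of an average daily move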

This isn’t really surprising though since we are using nothing but a single numerical value (yesterday’s bond spread) and an un-calibrated (and perhaps inappropriate type of) model to estimate the daily return- if markets were really that easy to predict we would all be rich!

This doesn’t really matter though, and for the purposes of this article we will ignore our objectively horrible results- the focus is on using the dataset to provide demonstrative code for the topics we are exploring, and not on the actual skill of our models.

In any case we can still fiddle with the model hyper-parameters to try and improve our performance.

For instance we can increase the max_depth of our random forest (which, if we remember, allows the model to make more splits on the data and thus grow more complex):

random_forest = RandomForestRegressor(max_depth=10, n_estimators=100, random_state=1)
random_forest.fit(X_train, y_train)
root_mean_squared_error = np.sqrt(mean_squared_error(y_train, random_forest.predict(X_train)))
root_mean_squared_error
0.009083679274932605

Okay, something like a ~2.4% improvement.

And making it more complex still:

random_forest = RandomForestRegressor(max_depth=50, n_estimators=100, random_state=1)
random_forest.fit(X_train, y_train)
root_mean_squared_error = np.sqrt(mean_squared_error(y_train, random_forest.predict(X_train)))
root_mean_squared_error
0.008943288173833334

Almost a 4% improvement now from our start point.

So now that we’ve improved it a bit, how does the model perform when exposed to entirely new data?

Here we will pretend we deployed our model in a live situation, and have come across some new data in the wild. We will load in some more data from the same source (but from a point in time moving chronologically forwards from the training data) and go through exactly the same pre-processing steps:

unseen_data = pd.read_csv(Path("unseen_data.csv"))
unseen_data['2-10 Bond Spread'] = unseen_data['2-10 Bond Spread'].shift(1)
unseen_data['returns']= unseen_data['SPX, SPCFD: Compare']/unseen_data['SPX, SPCFD: Compare'].shift(1) - 1
unseen_data = unseen_data[1:]
unseen_data['time'] = unseen_data['time'].astype('datetime64[ns]') # change "time" column type from str to datetime64
unseen_data.set_index('time', inplace=True) # set time column as the index
unseen_data

And finally setting the input variable and target variable again and testing our performance on some new data:

y_unseen = unseen_data["returns"]
X_unseen = unseen_data.drop(columns=["SPX, SPCFD: Compare", "returns"])

root_mean_squared_error = np.sqrt(mean_squared_error(y_unseen, random_forest.predict(X_unseen)))
root_mean_squared_error
0.018704827013423832

Which gives us a root mean squared error ~209% as large as for the data we previously used to both train and test on- not nearly as good as we might have hoped/expected…

This is a clear demonstration of overfitting in action.

How do we use the Validation Set?

So clearly we cannot just use a model’s performance on its training data to gauge how well it will perform on new data.

We need to gauge its performance on some data it did not use to train with to get a better picture of how well it will perform in the wild.

Enter the validation set.

From now on we will split our training data into two sets. We will keep the majority of the data for training, but separate out a small fraction to reserve for validation.

A good rule of thumb is to use something around a 70:30 to 80:20 training:validation split.

To do this we could simply do something like round a specific fraction of the length of our data to an integer and chop our dataframe in two as follows:

y = df["returns"]
X = df.drop(columns=["SPX, SPCFD: Compare", "returns"])

train_fraction = 0.8
split_point = int(train_fraction *len(X)) # (len(X) and len(y) are the same anyway)
X_train = X[0:split_point]
X_valid = X[split_point:]

y_train= y[0:split_point]
y_valid= y[split_point:]

Or as you might see in many places, use sklearn’s useful pre-built train_test_split function as follows:

from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8,test_size=0.2, random_state=101)

Here train_size and test_size are automatically complementary if you only fill out one, and random_state is a seed for the way the data is split- if you use the same seed in the future, you will be guaranteed the exact same data will be in each of the training and validation sets as before.

print("len(df): {}, split_point: {}, len(X_train): {}, len(X_valid): {}, len(y_train): {}, len(y_valid): {}".format(len(df), split_point, len(X_train), len(X_valid), len(y_train), len(y_valid))) 
len(df): 1998, split_point: 1598, len(X_train): 1598, len(X_valid): 400, len(y_train): 1598, len(y_valid): 400
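As a quick aside (we will come back to why this matters for time series data later): train_test_split shuffles the rows by default, so if you want the split to respect chronological order, like our manual slice above did, you can pass shuffle=False:

# no shuffling- first 80% of rows become the training set, last 20% the validation set
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, shuffle=False)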

The validation set essentially allows us to check how “overfitted” or “underfitted” our model is.

It allows us both to tune the model complexity to the sweet spot and to get a much better estimate of how the model will perform with unseen data, since the model does not use the validation data to train on.

Note that it is entirely normal (even probable) that the validation accuracy will be lower than the training accuracy. In fact, if they were very similar, it’d be a great indicator that your model might not be complex enough (underfitted).

That said the training accuracy doesn’t matter.

The only thing that matters is getting the best possible validation accuracy, since this is actually somewhat reflective of how the model will perform in the wild.

In general increasing model complexity should (randomness aside) almost always lead to improved training accuracy, and for a while increasing model complexity will also lead to improved validation accuracy, as the model finds more and better patterns.

However eventually, these patterns will become too specific to the training data and will not generalise well, so the validation accuracy will start to fall.

Hyper-parameter Tuning with the Validation Set

We will use the validation set to hone the model’s complexity to the sweet spot, as depicted in the image below:

Let’s have a go at doing that with our data now.

Below we simply iterate over a list of max_depths, fitting a model to each max depth, and then evaluating the error on the train and validation sets and making a plot of these. The only variable we are changing to alter the complexity of the model is the max_depth- everything else remains the same each time- so max_depth is uniquely responsible for the model complexity.

matplotlib.pyplot is the “standard” plotting library used in Python. Here is a quick crash course in making some simple plots if you’ve never encountered it before: https://matplotlib.org/tutorials/introductory/pyplot.html

import matplotlib.pyplot as plt

train_errors = []
valid_errors = []
param_range = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,20,30,40,50,75,100]

for max_depth in param_range:
    random_forest = RandomForestRegressor(max_depth=max_depth, n_estimators=100, random_state=1)
    random_forest.fit(X_train, y_train)
    
    train_errors.append(np.sqrt(mean_squared_error(y_train, random_forest.predict(X_train))))
    valid_errors.append(np.sqrt(mean_squared_error(y_valid, random_forest.predict(X_valid))))
    

plt.xlabel('max_depth')
plt.ylabel('root mean_squared_error')
plt.plot(param_range, train_errors, label="train rmse")
plt.plot(param_range, valid_errors, label="validation rmse")
plt.legend()
plt.show()

As expected we can see that as we increase the max_depth (upping the model complexity), the training accuracy continuously improves- rapidly at first, then more slowly- throughout the whole 1-100 range.

On the other hand the validation accuracy gets worse immediately, and doesn’t stop getting worse as we increase the max_depth.

This is an indication that the model is already “too complex” (or at optimal complexity) with a maximum depth of 1.

Normally we would expect the validation accuracy to improve for at least a little while before regressing as we move from a very simple model to a very complex one. Then again, normally we would expect to use training data that contained more than just a single numeric data column to learn from, so a maximum depth of 1 returning the optimal validation performance is almost certainly just a consequence of very simple input data and/or a lack of training data.

In any case, let’s go ahead and re-fit our model with a max_depth of 1 and see exactly how it performs.

Note that we are reverting to using X and y (the full dataset) here to re-train our model now that we have a theoretical best max_depth, so that the training data exactly matches that of the first model that predicted against unseen data. Chopping off 20% of such a small dataset (because we recently made a training:validation split) would likely cause our model to perform much worse irrespective of the max depth, so this way we keep the comparison consistent:

random_forest = RandomForestRegressor(max_depth=1, n_estimators=100, random_state=1)
random_forest.fit(X, y)
root_mean_squared_error = np.sqrt(mean_squared_error(y, random_forest.predict(X)))
root_mean_squared_error
0.00942170960852716

~5.4% worse than the best we achieved on the training set when we had a max_depth of 50.

root_mean_squared_error = np.sqrt(mean_squared_error(y_unseen, random_forest.predict(X_unseen)))
root_mean_squared_error
0.017679233094329314

Yet ~5.5% better on the unseen data!

Here we have successfully used the validation set to both:

  • Give us a better advanced estimate of how we will perform on unseen data.
  • Improve our performance on out of sample (unseen) data by reducing overfitting/underfitting.

How do we test our model on our Testing Set?

So if the model never trains on the validation data, isn’t the validation data a perfect estimate of how the model will perform in the wild?

Well almost.

But not quite.

The reason is that by using the validation data to tune our model to the best generalisable performance, we have inherently shown a slight bias towards model hyperparameter values and data features that optimise performance specifically for this validation set.

In effect we have overfitted to the validation set.

Note that the degree of overfitting to this set compared to the training data is far smaller, and the performance on the validation set will often give a rough ballpark for performance in the wild (assuming you have created the validation set without data leakage- more on that later!).

It is for this reason though that we separate out from our available data a further test set that we do not touch until we have the final version of our model fully feature-engineered and tuned.

Making this additional split, our original available data should now look something like this:

And we might use something like a 70:20:10 split now. We can split the dataframes however we like, but one option is just to use train_test_split() twice.

Note that 0.875*0.8 = 0.7 so the final effect of these two splits is to have the original data split into training/validation/test sets in a 70:20:10 ratio:

# split the full data 80:20 into training:validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, random_state=101)

# split training data 87.5:12.5 into training:testing sets
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, train_size=0.875, random_state=101)

print("len(X): {} len(y): {} \nlen(X_train): {}, len(X_valid): {}, len(X_test): \
{} \nlen(y_train): {}, len(y_valid): {}, len(y_test): {}".format(len(X), len(y),\
len(X_train), len(X_valid), len(X_test), len(y_train), len(y_valid), \
len(y_test))) 
len(X): 1998 len(y): 1998 
len(X_train): 1398, len(X_valid): 400, len(X_test): 200 
len(y_train): 1398, len(y_valid): 400, len(y_test): 200

We can spend as much time and effort trying to optimise our performance on the validation set as we want.

But we have to be honest with ourselves- when we are done we are done, and must take whatever result we then get on the testing set as our likely result on new data in the wild.

We cannot then go on and attempt to optimise again to improve the performance further on the testing set. To do so would be developing overfitting bias all over again.

What is Cross-Validation and why do we use it?

Great, so we now split our data in three ways- we use the bulk of our data for training, keep a reasonable amount for validation and keep a further small hold-out set for final testing.

But you might be thinking, aren’t we now losing a lot of data we could be using for training? Won’t this potentially lower our model accuracy?

Also, is there any way we can avoid overfitting so much to the validation set?

Well the answer to all three questions is yes!

Instead of using a single validation set, we can use many validation sets.

We can make many training:validation splits and cycle which part of the data we use for validation each time, such that eventually, over all the splits combined, all of the data has been used at least once for validation and at least once for training.

The way in which cross-validation is performed differs for standard and time-series data, but in general gives us the following benefits:

  • We get to (over multiple splits) use 100% of the training + validation data for training, which smooths out issues where the initial training set was perhaps highly biased and contained many examples of an extreme data type/occurrence, or did not contain any examples of an important data type/occurrence
  • We get to (piece-wise) validate over all of the data, so we do not fall prey to high variance from a single small validation set that happened to contain an unusually high or low count of unusual occurrences.
  • We get to average our performance over all of the data, giving us far more confidence in our estimation of the model’s skill, as well as an actual picture of how volatile the model is to perturbations in the input data.
  • We are automatically forced to build a far less overfitted (and thus more generalisable model) because we are trying to maximise the average performance over many validation sets, not one specific validation set, so we cannot inadvertently tune towards hyper-parameter settings that are only good for a very specific validation set.

Cross-Validation for Standard Data

There are a few different ways we can perform cross-validation, but for non-time series data one of the most popular (and simple to understand and effective) techniques is K-fold cross-validation.

Note that it is true that we have time-series data here, so K-fold cross validation is actually an inappropriate technique to use (for reasons we shall discuss shortly) but for now we will temporarily ignore these issues for the sake of generating some example code with the same dataset.

K-fold Cross-Validation

With K-fold cross-validation we split the training data into k equally sized sets (“folds”), take a single fold as our validation set and combine the remaining folds as our training set. We then cycle which fold we use as our validation set until we have trained and validated k times- each time with a unique train:validation split.

You can pick whatever value of k you like, but from the collective experience of all data scientists ever, k=5 or k=10 (and everything in between) are common and effective choices: k=5 would represent an 80:20 training:validation split and k=10 a 90:10 split etc.

The process can be summarised as follows (a rough code sketch is given after the list):

  1. Separate out from the data a final holdout testing set (perhaps something like ~10% if we have a good amount of data).
  2. Shuffle the remaining data randomly.
  3. Split this data into k equally sized sets/folds.
  4. For each unique fold:
    1. Use this fold as the validation fold
    2. Combine the other k-1 folds as the training data
    3. Fit the model with the training data
    4. Evaluate the model with the validation fold
    5. Keep the evaluation scores, discard the model and begin again at 4.1 with a new validation fold
  5. Evaluate your model against the whole set of k validation scores, and if you are unhappy make adjustments and repeat from 1.
  6. When you are finally happy, combine all k folds into one complete training data set, train again, and perform a final test on the holdout testing set.
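Here is a rough sketch of steps 2-4 using sklearn’s KFold class, assuming X and y hold the combined training + validation data (the DataFrame and Series we built earlier) and re-using the model and error metric from before:

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=1)  # 5 folds, data shuffled once up front
fold_errors = []

for train_index, valid_index in kf.split(X):
    X_train_fold, X_valid_fold = X.iloc[train_index], X.iloc[valid_index]
    y_train_fold, y_valid_fold = y.iloc[train_index], y.iloc[valid_index]

    model = RandomForestRegressor(max_depth=1, n_estimators=100, random_state=1)
    model.fit(X_train_fold, y_train_fold)
    fold_errors.append(np.sqrt(mean_squared_error(y_valid_fold, model.predict(X_valid_fold))))

print(fold_errors)           # one RMSE per validation fold
print(np.mean(fold_errors))  # average performance across all folds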

Pictorially the process looks something like this:

Visual representation by Joseph Nelson- @josephofiowa

And so now the overall training/validation/testing split process looks like this:

Source: https://scikit-learn.org/stable/modules/cross_validation.html

Let’s have a go at incorporating K-fold cross-validation with our data set!

While we could write a lot of (simple, but lengthy) code to implement such a process from scratch, luckily the sklearn library comes to our rescue with a handy pre-built function once again! (you can read more about it here: https://scikit-learn.org/stable/modules/cross_validation.html )

from sklearn.model_selection import cross_val_score

This magic function is going to handle the whole process very easily for us but do dig into the documentation to understand how it works under the hood and the variations available!

Anyway, let’s pass the cross_val_score() function our model with our desired parameters, the X and y data (which should be all of the data minus the final holdout test set), a scoring method (we will use neg_mean_squared_error and adjust it afterwards to RMSE), and the value of k to use (which is the “cv” parameter):

cross_val_scores = cross_val_score(RandomForestRegressor(max_depth=1, n_estimators=100, random_state=1),\
                                   X, y, scoring='neg_mean_squared_error', cv=5)

Adjust the scores from negative mean squared error to root mean squared error to be consistent with our scores from before:

cross_val_scores = np.sqrt(np.abs(cross_val_scores)) 
print(cross_val_scores)
print("mean:", np.mean(cross_val_scores))
[0.01360711 0.00861119 0.00715738 0.00947426 0.0067502 ]
mean: 0.009120025719774182

As you can see there is huge fluctuation in our validation scores for different training:validation splits! The worst error is ~102% greater than the smallest!

These crazy differences are probably mainly a function of our dataset being too small, and the fact that K-fold cross-validation is not appropriate for time series data (so some of our training:validation splits might randomly be more appropriate than others). Still, it goes to highlight that performance can vary quite strongly from split to split, so it is important to take an average over all the models/splits, where eventually all the data is used to validate once!

Note as a demonstration of the fact that very small validation sets lead to high variance (because the small amount of data contained in them is subject to changing a lot with each split), we can set k=50 and run our cross-validation again:

cross_val_scores = cross_val_score(RandomForestRegressor(max_depth=1, n_estimators=100, random_state=1),\
                                   X_train, y_train, scoring='neg_mean_squared_error', cv=50)

# convert negative mean squared error scores to root mean squared error
cross_val_scores = np.sqrt(np.abs(cross_val_scores)) 
print(cross_val_scores)
print("mean:", np.mean(cross_val_scores))
[0.01029598 0.00922735 0.00553913 0.00900553 0.0110392  0.01333214
 0.0115197  0.00933864 0.00664628 0.004857   0.0135743  0.00595552
 0.00706495 0.00944506 0.01080077 0.00842491 0.01044174 0.0126128
 0.00869932 0.00846706 0.00762137 0.01478009 0.00772207 0.01305496
 0.00673948 0.00801689 0.01060272 0.01137826 0.0069177  0.01071186
 0.0083437  0.00905157 0.00803609 0.00893249 0.01002789 0.00802375
 0.00934506 0.01199787 0.00686557 0.01114371 0.00862676 0.00830973
 0.00935762 0.00815328 0.00868262 0.00938199 0.00926949 0.00627161
 0.00922161 0.00771521]
mean: 0.00921180787871304

Notice how the mean is very similar, but the variance is even greater- with the worst performance having an error ~204% greater than the best!

This is why we like to pick a value of k somewhere between 5 and 10: the validation sets are big enough not to show too much set-to-set variance, yet not so big that they take away a substantial chunk of the training data- which would leave too little data for training and risk a biased training set.

Hyper-parameter Tuning with K-fold Cross-Validation

So as you may remember, one of the points of cross-validation was to reduce bias in the training set, and variance in the validation set.

The other big one was to reduce overfitting to the validation set by forcing us to find hyper-parameter values that give the best average performance over many validation sets.

Earlier we found that for our specific training:validation split, a max_depth of 1 led to the best performance, so we concluded that this max_depth would give us the best performance on new data.

Let’s remind ourselves of how that looked:

That said, it’s entirely possible that a max_depth of 1 was only best for this specific validation set- it is perhaps not the best max_depth averaged over many validation sets.

Let’s run the hyper-parameter calibration for max_depth again, but this time calibrated with cross-validation over 5 different validation sets.

For this we will use another function from sklearn- validation_curve().

from sklearn.model_selection import validation_curve

It’s pretty similar to cross_val_score(), but lets us vary hyper-parameters at the same time as running cross-validation (i.e. it performs the full cross-validation process once with each specific value of the hyper-parameter we are varying):

train_scores, valid_scores = validation_curve(RandomForestRegressor(n_estimators=100, random_state=1), X_train, y_train,
                                              param_name="max_depth", param_range=param_range,
                                              scoring='neg_mean_squared_error', cv=5)
train_scores = np.sqrt(np.abs(train_scores))
valid_scores = np.sqrt(np.abs(valid_scores))

train_scores_mean = np.mean(train_scores, axis=1)
valid_scores_mean = np.mean(valid_scores, axis=1)

plt.title("Validation Curve with Random Forest")
plt.xlabel("max_depth")
plt.ylabel("RMSE")
plt.plot(param_range, train_scores_mean, label="train rmse")
plt.plot(param_range, valid_scores_mean, label="validation rmse")

plt.legend()
plt.show()

Okay, pretty much exactly the same result, but this time around we are much more sure we made a good choice!

In general, if our data had been a bit more complex and the overfitting with any level of max_depth wasn’t so clear cut, we might find different hyper-parameters gave the best result when varied with cross-validation rather than with validation on a single set.

Alternative Techniques For Problematic Data

K-fold cross validation will often give you a good result, but occasionally depending on the structure/distribution of the data it can give us problems.

The first of these is in the case that there are a lot of extreme examples in the data and we do not get a good distribution of them between the training, validation and testing sets.

For instance there could be few examples of some classes in a classification task. If these (by bad luck) end up mainly only in the validation or testing sets, then our model will have never/barely encountered them in training and will almost certainly perform badly in classifying them.

Similarly in a regression sense, if there are some examples of very extreme target values- either low or high- and they only turn up in some of our validation and testing sets, again our model is unlikely to do well when encountering them.

Additionally, if “problematic” data only appears in the training set, and doesn’t show up in our testing set, we are likely to get an overly optimistic estimation of our model skill- we would like the training/validation/testing distributions to be as similar as possible.

Stratified K-fold

Stratified K-fold is a good solution to this.

It’s a variation of K-fold which ensures that each of the training, validation and testing sets contains approximately the same percentage of samples of each target class as the complete data set.

You can import and set it up like so:

from sklearn.model_selection import StratifiedKFold
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

Note that it returns the indexes of the training/testing (or training/validation) splits you should use, so you will have to manually configure your splits each time as follows:

for train_index, test_index in skfold.split(X, y):
    # (if X and y are pandas DataFrames/Series, index with X.iloc[train_index] etc. instead)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # TRAIN AND VALIDATE WITH THIS SPLIT

Stratified K-fold only works for classification data straight out of the box, but a sneaky and easy way to get it to have a similar effect for regression data is just to bin your regression targets into narrow-ish bands turning the problem into pseudo-classification.

You could even temporarily add a new column to your data equal to the binned target values, assign this temporary column as the target variable, create your “stratified” folds based on this, then drop this extra data column you created and revert to using the exact numeric values as the target variable for the actual training, validation and testing.
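As a rough sketch of that idea (assuming the X and y DataFrame/Series from earlier, using pandas’ qcut to form the bands, and keeping the binned values in a separate Series rather than a temporary column):

# bin the continuous returns into 10 quantile-based bands (integer labels 0-9)
y_binned = pd.qcut(y, q=10, labels=False)

skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# stratify the folds on the binned targets, but train/validate on the real numeric targets
for train_index, valid_index in skfold.split(X, y_binned):
    X_train_fold, X_valid_fold = X.iloc[train_index], X.iloc[valid_index]
    y_train_fold, y_valid_fold = y.iloc[train_index], y.iloc[valid_index]

    # TRAIN AND VALIDATE WITH THIS SPLIT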

Group-K-fold

Another possible issue is when there is obvious group structure present in the data.

For example we could have a scenario where samples of data are collected from different subjects, but in some (or all) cases multiple samples are collected per subject.

Think perhaps of trying to estimate how long containers unloaded from cargo ships dwell in dockyards before leaving a terminal. Each ship might unload hundreds of containers, so there is an obvious grouping of container data long term by ship. If the model is flexible enough to learn highly ship-container specific features, it could fail to generalise well to containers unloaded from different ships in the future, even if it performs very well over a training/validation/testing split made from containers coming from the same ship.

GroupKFold is a variation of K-fold which ensures that the same group is not represented in the different sets- i.e. that all instances of data coming from the same group are present only in one of either the training, validation or testing sets.

You can use it in exactly the same way as StratifiedKFold:

from sklearn.model_selection import GroupKFold

X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

gkf = GroupKFold(n_splits=3)

for train, test in gkf.split(X, y, groups=groups):
    print("TRAIN INDEXES: {} TEST INDEXES: {}".format(train, test))
TRAIN INDEXES: [0 1 2 3 4 5] TEST INDEXES: [6 7 8 9]
TRAIN INDEXES: [0 1 2 6 7 8 9] TEST INDEXES: [3 4 5]
TRAIN INDEXES: [3 4 5 6 7 8 9] TEST INDEXES: [0 1 2]

As you can see data from each group appears completely in either the training set or testing set for each split.
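To tie it back to the container example, here is a minimal hypothetical sketch (container_df, “ship_id” and “dwell_time” are made-up names for illustration) of passing the groups through cross_val_score:

from sklearn.model_selection import GroupKFold, cross_val_score

groups = container_df["ship_id"]  # hypothetical column identifying which ship each container came from
X_containers = container_df.drop(columns=["dwell_time", "ship_id"])
y_containers = container_df["dwell_time"]

# every container from a given ship ends up entirely in either the training or validation side of each split
scores = cross_val_score(RandomForestRegressor(n_estimators=100, random_state=1),
                         X_containers, y_containers, groups=groups,
                         scoring='neg_mean_squared_error', cv=GroupKFold(n_splits=5))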

Cross-Validation for Time Series Data

K-fold cross-validation (and its variants) works poorly for time series data because it does not respect the chronological ordering of the data.

The data that goes into each of the training, validation and testing splits is picked randomly so we will almost invariably have some amount of the training data come before as well as after the validation and testing sets.

If the patterns in the data are highly dependent on the time they occurred, as well as on their other features, then essentially this is a form of data leakage: we are using information from the future to predict the past and present, leading to overly optimistic estimates of model skill.

This can be a disaster in some situations, as it is often much easier to fill in the blank about some event in the middle of a time range if given information from both before and after it.

For instance, think markets- it is a million dollar question (or trillion?) to predict how they will move in the future with data from the present, but rather easier to guess what happened in terms of price movements in the middle of a given time range if given price data from before and after the period in question (especially on a low time frame).

Note that the explicit presence of a time-of-observation column in your data doesn’t necessarily determine whether the data should be considered a time series or not. It is possible to have an observed time column in your data and for the data to not be strongly time dependent- you have to consider the nature of the data.

For instance, how human bone measurements correlate to height probably has varied slightly over the millennia, but over a period of years or even decades, time is not a strongly relevant component, even if you have the time of observation in your data. Market price movements on the other hand, are highly sensitive to time- even down to the minutes and seconds.

Because of this, for time series data it is essential that the testing set is composed of data strictly from chronologically after the validation and training sets, and likewise that the validation data comes chronologically after the training set.

Walk-Forward Nested Cross-Validation

To achieve this whilst still being able to use the majority of our data for validation (and testing as well, as it turns out in this case), we can use a technique known as walk-forward nested cross-validation.

The idea is pretty simple.

We begin using just a small fraction of our data in the past.

We then:

  1. Use the most recent data in that set as our test data
  2. Use the data just prior to that as the validation set
  3. Use all the data prior to that as our training set
  4. Expand our data set forward in time and repeat 1-3 until the data in our test set catches up to the present day.

The process should look something like this:

Courtney Cochrane: https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9

Whilst that might look technically tricky to implement, sklearn has (unsurprisingly) another useful function to help us out with this- TimeSeriesSplit().

from sklearn.model_selection import TimeSeriesSplit

First, do make sure your dataframe has its index set as the relevant time series (we already did this right at the start) so that the index is chronologically ordered.

Then create a TimeSeriesSplit() object with the number of walk-forward splits you want (n_splits=5 gives 5 walk-forward cycles with equal sized test sets):

tscv = TimeSeriesSplit(n_splits=5)

When used on a dataframe, tscv.split() returns the indexes of the training:test splits as a generator.

Below I’ve created some simpler data to show the output so that it’s easier to tell what’s going on:

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4],[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
TRAIN: [0 1 2 3 4] TEST: [5]
TRAIN: [0 1 2 3 4 5] TEST: [6]
TRAIN: [0 1 2 3 4 5 6] TEST: [7]
TRAIN: [0 1 2 3 4 5 6 7] TEST: [8]
TRAIN: [0 1 2 3 4 5 6 7 8] TEST: [9]

This gives us our walk-forward training:testing divisions beautifully.

We can now simply chop off whatever fraction we like by index of the final data in the train set in each walk-forward instance to form the validation data, since it is chronologically ordered by our time series index anyway:

for train_index, test_index in tscv.split(X):
    # 80:20 training:validation inner loop split
    inner_split_point = int(0.8*len(train_index))
    
    valid_index = train_index[inner_split_point:]
    train_index = train_index[:inner_split_point]

    print("TRAIN:", train_index, "VALID:", valid_index, "TEST:", test_index)
    
    X_train, X_valid, X_test = X[train_index], X[valid_index], X[test_index]
    y_train, y_valid, y_test = y[train_index], y[valid_index], y[test_index]
TRAIN: [0 1 2 3] VALID: [4] TEST: [5]
TRAIN: [0 1 2 3] VALID: [4 5] TEST: [6]
TRAIN: [0 1 2 3 4] VALID: [5 6] TEST: [7]
TRAIN: [0 1 2 3 4 5] VALID: [6 7] TEST: [8]
TRAIN: [0 1 2 3 4 5 6] VALID: [7 8] TEST: [9]

Perfect! (though do note we need at least a few data samples relative to the size of n_splits to guarantee that all the sets are non-empty- but this is hardly likely to be a problem in reality!)

Now we can simply perform training and validation in a single set (non cross-validation) manner each time in the inner loop. We won’t go through this with an example again since it can be performed exactly as in the Hyper-parameter Tuning with the Validation Set section.

Note as a final point: once you have gone through your chosen variation of training, cross-validating and testing your model, it is worth combining all three of your training, validation and testing sets and retraining one final time before heading into the wild.

This final recombination maximises the amount of data the model can learn from (now that we have the optimal features and hyper-parameter settings already figured out) and so will probably lead to a slightly more effective and robust model!
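As a minimal sketch (where X_all and y_all are stand-ins for the recombined training + validation + testing data, and we re-use the hyper-parameters settled on earlier):

# retrain on everything we have, with the finalised features and hyper-parameters, before deploying
final_model = RandomForestRegressor(max_depth=1, n_estimators=100, random_state=1)
final_model.fit(X_all, y_all)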

Hopefully this article has given you a few new ideas to play around with and reinforced your understanding of the why behind training, validation and testing sets.

You can find the code used in this article and the accompanying datasets here: https://github.com/GregBland/train_val_test_sets_article

Until the next time!


Related article: What is a Walk-Forward Optimization and How to Run It?

Greg Bland