Machine Learning for Finance Guide – A Real-Life Example

Last Updated on July 23, 2021

Alphastar – a Starcraft 2 AI that kicks butt!

Table of contents:

  1. What is Machine Learning? 
  2. How is Machine Learning used in Finance?
  3. Main problem
  4. What is a bank Default?
    1. What are the main differences between a Secured and Unsecured debt?
    2. Can a country default?
  5. How to find and load the data?
    1. How to pick your features and labels?
    2. Can we make the data more compact using Python?
  6. How to conduct data analysis with Python?
    1. Describe the data
    2. Plotting the data 
    3. How to use Jasp for statistical analysis?
  7. How to one-hot encode variables?
  8. How to create train and test groups?
  9. How to pick and implement a classifier?
    1. What classifier performs the best?
    2. How to tune the parameters?
  10. How to interpret the data?
  11. Full code
  12. Resources

What is Machine Learning?

Machine Learning (ML) is a method for building algorithms and models that improve themselves through experience.

Machine Learning can be split up into three main categories that are dependent on the way the algorithms operate. Those three categories are:

Supervised Learning – refers to the case when we provide the machine with the inputs and their corresponding desired outputs. Based on this information the machine learns to create outputs as close to the ones we are looking for.

Unsupervised Learning – we feed inputs but there aren’t any target outputs. This means that we don’t tell the algorithm what to do and that it needs to figure out some sort of dependence or underlying logic of what to do.

Reinforcement Learning – here the program navigates a certain environment with a specified goal. As the program navigates, the environment provides feedback, aka rewards, that the program tries to maximize.

How is Machine Learning used in Finance?

Machine learning allows us to go through an immense amount of data quickly in order to produce accurate predictions. Its main way of operating can be seen as pattern recognition.

The finance sector aggregates so much data (payments, bills, vendors, transactions, etc.) that even a team of the best statisticians and traders couldn’t process it in an efficient and timely manner. This is where ML comes into play.

Algorithmic traders use ML in order to make better, efficient, calculated, and more informed trading decisions. ML allows algo traders to analyze large volumes of incoming data which in turn can predict the movement of the market.

A robo-advisor is a digital platform built on the principles of ML that provides automated financial services with minimal human supervision. It offers traders dynamic goal planning, orientation, portfolio management, security, and more.

Banks use ML to parse their users’ data in order to detect possible anomalies and fraudulent behavior that might occur from a security breach. When the algorithm recognizes a fishy pattern it can flag the user and save the bank from damage.

ML allows companies to create models that perform document analysis and save time and money. For example, an ML algorithm can complete in a few seconds an analysis that would take workers 200k hours.

A powerful and scary thing that ML can do is to predict the future behavior of people. This surely has its positives and negatives. In this article, we will look at a real-life example of how a bank may implement a machine learning algorithm to predict the future behavior of its clients.

Main problem

The main problem which the article will tackle sounds like this: “How can a bank using machine learning predict if the user will default in the following months?”

In order to be able to code and follow through with the article, the first thing we need to do is to get acquainted with the theory behind the main problem. I can’t stress enough how important it is to understand the topic before trying to build an algorithm or conduct a statistical analysis.

It does not matter how good a coder or statistician you are as the possible mistakes that come from ignorance or laziness will ruin the project that you are working on.

What is a bank Default?

According to Investopedia, default is the failure to repay a debt, including interest or principal, on a loan or security. This happens when the borrower can’t make payments on time, avoids them, misses them, or just outright stops paying.

What are the main differences between a Secured and Unsecured debt?

In the case of secured debt like a mortgage loan on a house, if the client doesn’t pay in a timely manner or stops paying, he defaults and the bank can reclaim the home.

Another example would be when a certain business goes into default. The main thing that happens is that it defaults on all of its bonds and loans which results in bankruptcy.

In the case of unsecured debt like credit card debt or utility bills, if the client doesn’t pay in a timely manner or stops paying, he defaults and it almost always results in legal disputes. For example, a judgment lien is a court ruling that gives creditors the right to take possession of the debtor’s property.

Can a country default?

One may ask whether a country can default. The answer is YES, and the most famous example is Greece, which experienced a crushing debt. In 2015 Greece defaulted by missing a payment of 1.6 billion euros to the IMF (International Monetary Fund).

Another example is Puerto Rico, which in 2015 managed to pay only $628k of a $58 million bond payment. To make it even worse, the island was hit by a hurricane in 2017 that crippled its economy even more.

Now that we have the main idea of what a default is and how it occurs, we will build a machine learning algorithm in Python that will strive to solve our main problem.

How to find and load the data?

Some of you may wonder how in the world we can obtain bank data without working for a bank, hacking it, or paying a lot of money. Well, in this day and age, there are a lot of data repositories that offer datasets for learning, practice, cleaning, and more.

One of those repositories is a platform named Kaggle. The platform is primarily aimed at data science enthusiasts and represents the proving grounds for anyone looking to test their skills.

Kaggle features an immense number of datasets, ranging from hospitals to avocado prices, and you can download them with a click and start working on your algorithms.

Here I found a nice dataset that comes from the UCI Machine Learning Repository. The only reason that we’ll download the dataset from Kaggle is that the variables are already renamed and cleaned a bit.

The dataset features data about default payments of credit card clients from a Taiwanese bank from 2005. If we look at the time setting and conduct quick research, we can find that Taiwan saw many cases of credit card defaults around that time.

Moreover, the data we have goes from April to September (a 6-month span) and features 25 variables. Before moving on, visit the following link and download the data and unzip it:

https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset

The next step is to import pandas so we can gain access to the dataset and load it.

import pandas as pd

In a scenario where we don’t really know what our dataset is made of, we can either load it all into pandas or give it a quick glance in Excel to get an idea of what data types it contains, what the separator is, how missing values are handled, etc.

In order to print out the first few rows from a Jupyter notebook, run the following shell command (the ! prefix passes it to the system shell):

!head -n 6 UCI_Credit_Card.csv

The good thing is that we can see the description of our data on the Kaggle website that is built from the following variables:

ID – id of each user

LIMIT_BAL – amount of given credit in NT dollars

SEX – gender (1 = male, 2 = female)

EDUCATION – level of acquired education

MARRIAGE – marital status

AGE

PAY_0, PAY_2 to PAY_6 – repayment status from September to April (the dataset has no PAY_1 column)

BILL_AMT1 to BILL_AMT6 – the amount of bill statement from September to April

PAY_AMT1 to PAY_AMT6 – the amount of previous payment from September to April

default.payment.next.month – Default payment (1 = yes, 0 = no)

Now is the time to load the data frame and specify that the first column (ID) should be our index and that empty strings should be treated as missing values.

df = pd.read_csv('UCI_Credit_Card.csv', index_col=0, na_values='')
df.head()

The EDUCATION column contains the values 0, 5, and 6, which are undocumented or labeled Unknown, so we will add them to the Other category (4). The MARRIAGE column also has an undocumented 0, so we will move it to its Other category (3).

df['EDUCATION'] = df['EDUCATION'].replace([0,5,6],4)
print(df['EDUCATION'].value_counts())

fil = (df['MARRIAGE'] == 0)
df.loc[fil, 'MARRIAGE'] = 3
print(df['MARRIAGE'].value_counts())
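
To see what the recoding above does without loading the full file, here is a minimal sketch on a made-up frame. The `toy` DataFrame and its values are assumptions for illustration only; the code mappings follow the ones described in this section.

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "EDUCATION": [0, 1, 2, 3, 4, 5, 6, 2],
    "MARRIAGE": [0, 1, 2, 3, 1, 2, 0, 1],
})

# Fold the EDUCATION codes 0, 5 and 6 into the "others" category (4).
toy["EDUCATION"] = toy["EDUCATION"].replace([0, 5, 6], 4)
# Fold the MARRIAGE code 0 into its "others" category (3).
toy["MARRIAGE"] = toy["MARRIAGE"].replace(0, 3)

education_codes = sorted(toy["EDUCATION"].unique().tolist())
marriage_codes = sorted(toy["MARRIAGE"].unique().tolist())
print(education_codes)  # [1, 2, 3, 4]
print(marriage_codes)   # [1, 2, 3]
```

Using `replace` for both columns is just one option; the boolean filter with `df.loc` used above gives the same result.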

How to pick your features and labels?

The next reasonable thing to do is to separate the features from the label. In ML, features are the inputs we pass to the model during training, while the label is the desired output.

For our features (x) we will select all columns except the default.payment.next.month and for the label (y) we shall pick the aforementioned column.

x = df.copy()
y = x.pop('default.payment.next.month')

Can we make the data more compact using Python?

When working with big data (our dataset isn’t one) you should always try to make the data more compact so it takes up less space. When loading data files, pandas handles this in a decent way, but we can optimize it.

When it comes to memory, the data type of each column is the main lever: a column of small integers stored as int64 uses eight times the memory it would need as int8.

Okay, let’s create a function to see the top 5 columns by memory usage:

def memory_usage(df, columns=5):
    print('Memory usage ----')
    memory_per_column = df.memory_usage(deep=True) / 1024 ** 2
    print(f'Top {columns} columns by memory (MB):')
    print(memory_per_column.sort_values(ascending=False) \
    .head(columns))
    print(f'Total size: {memory_per_column.sum():.4f} MB')
    
memory_usage(df)

As you can see, our dataset is already compact and doesn’t take up a large amount of memory. That is good to know because a large memory footprint (as with big data) could slow down our computations.
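
If a dataset did need compacting, one possible approach is to downcast numeric columns to the smallest type that still holds their values. The sketch below is an assumption about how that could look, not something the original analysis does. The `toy` frame is synthetic and only mimics the kind of columns found in the credit data.

```python
import numpy as np
import pandas as pd

# Synthetic frame with columns shaped like the credit data (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "SEX": rng.integers(1, 3, size=1000),                    # codes 1 and 2
    "AGE": rng.integers(21, 80, size=1000),                  # small integers
    "LIMIT_BAL": rng.integers(1, 100, size=1000) * 10000.0,  # floats
})

before = toy.memory_usage(deep=True).sum()

compact = toy.copy()
for col in compact.columns:
    if pd.api.types.is_integer_dtype(compact[col]):
        # Pick the smallest integer type that can hold the column's values.
        compact[col] = pd.to_numeric(compact[col], downcast="integer")
    elif pd.api.types.is_float_dtype(compact[col]):
        # Downcast floats where this does not change the values.
        compact[col] = pd.to_numeric(compact[col], downcast="float")

after = compact.memory_usage(deep=True).sum()
print(f"before: {before} bytes, after: {after} bytes")
print(compact.dtypes)
```

Downcasting floats can lose precision on columns with many significant digits, so it is worth comparing the values before and after on real data.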

As our data is loaded, compact, and the features and labels selected, we can move on to conduct an exploratory statistical analysis in order to get an idea of valuable default predictors and the main distributions.

How to conduct data analysis with Python?

Doing a data analysis on your dataset is one of the most important steps when aiming to build a good algorithm as you can get insight into underlying relationships behind the main variables.

This insight can show us which variables should be plucked from our dataset and what the main drivers of the prediction would be.

In this section, I’ll go through some main approaches and leave room for your own exploration of the dataset.

The first thing that we need to do is to import the relevant libraries that will allow us to calculate the key statistics and graph the data.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Describe the data

Pandas has a describe function that calculates some of the main statistics like the min, max, mean, std and quartiles that we want to look into:

df.describe().round().T

Now we need to get a bit creative and think what would be the main factors that could tell us key things about our data. The first thing that comes to my mind is that we want to see the following:

  • Do genders vary by age?
  • Do genders vary on limit balance when checked for their education levels?
  • Does the education level have something to do with our label?

Plotting the data

Let us check if the genders vary by age in our dataset, as the age factor might be an important one. For this, we want to use a distribution plot. Keep in mind that our male clients are coded as 1 and female clients as 2.

We will also drop the missing values when charting so we don’t have any ugly gaps.

fig, ax = plt.subplots()
# sns.distplot is deprecated; kdeplot with fill=True draws the same shaded density
sns.kdeplot(x=df.loc[df.SEX==1, 'AGE'].dropna(),
    color='blue', fill=True,
    ax=ax, label='Male')
sns.kdeplot(x=df.loc[df.SEX==2, 'AGE'].dropna(),
    color='red', fill=True,
    ax=ax, label='Female')
ax.set_title('Distribution of age')
ax.legend(title='Gender:')

From the graph we can see that males and females don’t vary that much in the shape of the distribution. Moreover, the skewness of our distribution is positive, meaning that we have more younger clients than older ones.

Females also tend to be a bit younger than males on average.

How to use Jasp for statistical analysis?

For those that would like an easier and quicker click-and-drag method for analyzing and plotting data, I’d recommend downloading the free program called Jasp, which is built on R.

Allow me to show you a quick demo of Jasp with the same variables (for the rest of the article we will stick to python). Go over to the following link and download the program: https://jasp-stats.org/

After launching the program click Open and locate your csv dataset and click on it. Jasp will load it and arrange the variables according to their type.

Go to “Descriptives” in the upper left corner and simply drag the variables to the position you want them in. In this case we are moving the AGE variable into “Variables” and the SEX variable to the “Split” section.

Then we simply pick the kind of plots and the statistics we want.

As you can see, Jasp gives us access to the main statistics and graphing tools with a simple click-and-drag method. The downside of Jasp is that the graphs are not as highly customizable as they are in Python.

Jasp is also way slower when graphing and calculating statistics on big data and it tends to crash. I’d recommend using it on small to average datasets.

Moreover, Jasp is great for non-tech researchers that don’t want to pay for programs like SPSS. Feel free to follow along with my python EDA by using Jasp if you like what you see.

Let us use a count plot to see how the label is distributed for each gender.

ax = sns.countplot(x='default.payment.next.month', hue='SEX', data=df)
ax.set_title('Target variable distribution')

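Our graph shows more defaults in absolute terms among female clients. Note, however, that the dataset also contains more female clients than male ones, so raw counts alone don’t tell us which gender has the higher default rate.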
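
Because a count plot shows absolute numbers, it can help to look at the default rate per group as well. The sketch below uses a made-up `toy` frame (an assumption for illustration, not the real data) to show how a group can have more defaults in total and still have a lower default rate.

```python
import pandas as pd

# Made-up data: 4 clients coded 1 and 8 clients coded 2.
toy = pd.DataFrame({
    "SEX": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
    "default.payment.next.month": [1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0],
})

grouped = toy.groupby("SEX")["default.payment.next.month"]
default_counts = grouped.sum()   # number of defaults per group
default_rates = grouped.mean()   # share of defaults per group

print(default_counts)
print(default_rates)
```

On the real data, the same `groupby` applied to `df` would give the per-gender default rate to read next to the count plot.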

Now, let us see if genders vary on limit balance when checked for their education levels. For this, we shall use the violin plot as it allows us to see the distribution of each gender.

ax = sns.violinplot(x='EDUCATION', y='LIMIT_BAL', hue='SEX', split=True, data=df)
ax.set_title('Limit balance per education level distribution')

Education codes: 1 = graduate school, 2 = university, 3 = high school, 4 = others

The median is represented by a white dot that you can see in the middle of each category. The thick black bar in the center of the violin spans the interquartile range (from the first to the third quartile), while the thin black line extends to the rest of the distribution, excluding outliers.

This plot uncovers that the largest balance appears to be in the graduate school group, the education levels are different from each other when checked for limit balance, and there are slight differences between the genders.

But do the default percentages vary with education levels? Let’s give it a look:

ax = df.groupby("EDUCATION")['default.payment.next.month'] \
    .value_counts(normalize=True) \
    .unstack() \
    .plot(kind='barh', stacked=True)
ax.set_title('Percentage of default per education level')
ax.legend(title='Default', bbox_to_anchor=(1,1))

Education codes: 1 = graduate school, 2 = university, 3 = high school, 4 = others

As we can see, the highest default percentages occur in the high school and university groups.

The next thing we are interested in is to see the correlation matrix between the variables. As we have 24 variables checking the correlation numbers would be too confusing and time-consuming.

In order to combat confusion, we can create a heatmap that will color our correlations. The bigger the correlation, the more saturated the color is. I’ll also rename the PAY_0 variable to PAY_1.

df = df.rename({'PAY_0':'PAY_1'}, axis='columns')
sns.set(rc={'figure.figsize':(25,8)})
sns.set_context("talk", font_scale=0.7)
sns.heatmap(df.corr(), cmap='Oranges', annot=True)

With a quick glance at our label column we can see that it negatively correlates with the limit balance meaning that higher limit balance means lower default and vice versa.

The label also positively correlates with Payment variables meaning that a longer delay indicates a potential default. When it comes to age it doesn’t correlate notably with anything.

Another interesting thing we can do is to use a pair plot. It produces a matrix of plots where the diagonal represents the univariate histogram while the other plots are basic scatterplots.

You can pass as many features as you want into a pair plot, but as you add more, the confusion goes up and the readability goes down.

I’ll pass 3 features (LIMIT_BAL, EDUCATION, and default.payment.next.month) as an extension of the previous analysis.

pair_plot = sns.pairplot(df[['EDUCATION', 'LIMIT_BAL','default.payment.next.month']])
pair_plot.fig.suptitle('Pairplot', y=1.05)

You can see that we have a few outliers in the limit balance per education group that might skew the distribution. This is good to know, as it helps us pick a classifier or a statistical test that isn’t sensitive to outliers.

But what if we want a detailed statistical report with just one line of code? This is where the pandas profiling feature comes into play. It generates a detailed analysis with interactive reports that are easy to present.

Let’s check it out:

pip install pandas-profiling

import pandas_profiling
df.profile_report()

Be sure to check out the Full code section in order to see the complete pandas profiling result. I did not include the pictures as it is quite extensive and would swamp the article.

How to one-hot encode variables?

One-hot encoding is primarily used for categorical variables as ML algorithms can’t handle them well. Moreover, the encoding process makes the data more expressive.

Our data is already expressed in numbers, e.g. Education 1 means graduate school, but the numbers also go up to 4 and that can confuse the algorithm as it only sees numbers. It doesn’t know what 1-4 means and may treat the codes as quantities, which can make the model poor.

In order to combat this, one-hot encoding allows us to create dummy variables, i.e. every Education group, from 1 to 4, will become a separate variable with 0 meaning “no” and 1 meaning “yes”.

We want to do this for all categorical variables like EDUCATION, SEX, and MARRIAGE, but also for the monthly repayment status variables (PAY_1 to PAY_6). Let us create the binary variables and add them to our dataframe.

# One get_dummies call encodes every listed column and drops the originals,
# so no merging or manual dropping is needed.
cat_cols = ['EDUCATION', 'SEX', 'MARRIAGE', 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
prefixes = ['Edu', 'SEX', 'MARRIAGE', 'p1', 'p2', 'p3', 'p4', 'p5', 'p6']
df_dum9 = pd.get_dummies(df, columns=cat_cols, prefix=prefixes, dtype=int)
print(df_dum9)
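
To make the effect of one-hot encoding easier to see, here is a minimal sketch on a tiny made-up frame. The `toy` DataFrame is an assumption for illustration; only the EDUCATION column is encoded and the AGE column is left as is.

```python
import pandas as pd

# Tiny made-up frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "EDUCATION": [1, 2, 3, 4, 2],
    "AGE": [30, 41, 25, 52, 37],
})

# Each EDUCATION code becomes its own 0/1 column; the original column is dropped.
encoded = pd.get_dummies(toy, columns=["EDUCATION"], prefix=["Edu"], dtype=int)
print(encoded)
```

Every row ends up with exactly one of the `Edu_` columns set to 1, which is what lets the model treat the education levels as separate groups instead of as a quantity.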

As we are going to test several classifiers the next important step would be to normalize the data. For this, we will use the MinMax scaling method:

from sklearn import preprocessing as prep
minmax_scale = prep.MinMaxScaler().fit(df_dum9)
credit_minmax = minmax_scale.transform(df_dum9)
credit_minmax = pd.DataFrame(credit_minmax, columns = list(df_dum9))
credit_minmax
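
MinMax scaling maps each feature to the range spanned by the data the scaler was fitted on, using (x - min) / (max - min). The sketch below uses made-up numbers and a fitting order that is an assumption, not what the code above does: it fits the scaler on a training part only and then applies it to a held-out part, so that no information from the held-out rows leaks into the scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up limit balances: three "training" rows and two "held-out" rows.
train = np.array([[10000.0], [50000.0], [90000.0]])
test = np.array([[30000.0], [130000.0]])

# Learn min and max from the training rows only.
scaler = MinMaxScaler().fit(train)

train_scaled = scaler.transform(train)  # 0.0, 0.5, 1.0
test_scaled = scaler.transform(test)    # 0.25, 1.5

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Note that a held-out value larger than the training maximum ends up above 1, which is expected when the scaler has not seen that value during fitting.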

How to create train and test groups?

In order to not overfit, our next step is to split the data into two groups (train/test). Overfitting is when the model uses an overly complex way to explain idiosyncrasies in the training data and becomes unable to generalize to unseen data.

Figure: an overfit decision boundary (green line). Source: https://en.wikipedia.org/wiki/Overfitting#/media/File:Overfitting.svg

Splitting the data is one of the most important steps as it prevents data leakage, aka the model seeing the future data. Moreover, it helps us to build consistency within the model.

The Train set is used by the algorithm to learn the parameters that produce a well-fitting (predictive) model. The aim is for the algorithm to behave correctly with new (unseen) data.

The Test set is used to measure the performance of the model. If the model fit from the train set also fits the test set, which follows a similar distribution, it means that the model didn’t overfit.

There are several ways to split the data in Python. The most basic way is to allocate 20% of the data to the test set and 80% to the train set. If we’re working with time series we can turn the shuffle parameter off.

In our case, we are looking for defaults, which are the minority class. This means that a purely random split might under-represent the default users in the train set and hurt our model.

In order to combat this, we shall use the stratification split that will ensure that both sets have a similar distribution of the specified variable, in our case the label.

Now that we have the main idea in mind, let us import the relevant function, do the split, and check that both sets have the same class proportions.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=21)
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
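
The effect of the stratify argument is easiest to see on a small synthetic label vector. In the sketch below, `X_toy` and `y_toy` are made-up arrays (an assumption for illustration) with 20% positives; after the stratified split both parts keep that share.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 200 positives (defaults) and 800 negatives.
y_toy = np.array([1] * 200 + [0] * 800)
X_toy = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=21
)

train_rate = y_tr.mean()  # share of positives in the train part
test_rate = y_te.mean()   # share of positives in the test part
print(train_rate, test_rate)
```

Without `stratify`, the two shares would usually differ a little from each other, and the difference grows as the minority class gets rarer.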

Before choosing a classifier and building a model, we need to see what to do with missing data. From my experience, the larger the dataset, the more missing data creeps in. Basically, it will almost always be present.

Some people tend to overlook certain items, some tend to choose not to answer them because they don’t have enough time, patience, or see the question as too personal to share.

There are many reasons for missing data, from random ones to structural ones. Many ML algorithms can’t handle missing data well, and for those that can, the libraries (e.g. scikit-learn) don’t always implement the feature.

There are several ways to treat missing data and the most popular ones are:

  • Delete the rows that contain missing data – this method isn’t recommended as you can lose a lot of valuable data and introduce bias in your model. Moreover, you can devastate smaller datasets.
  • Replace missing values with an extreme placeholder value (e.g. -99999) – this method is useful for some algorithms that aren’t sensitive to outliers.
  • Replace the missing values with a statistic (mean, median, mode) – this method might reduce the variance of the dataset and thus lower the reliability of our model.
  • Calculate the missing data with ML – this method can take the missing data as a label and after training on complete rows, calculate the missing feature.
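
As an illustration of the third option in the list above, the sketch below fills missing values with the column median using scikit-learn's SimpleImputer. The `toy` frame and its gaps are made up for this example; our credit dataset itself turns out to have no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up frame with one missing value per column.
toy = pd.DataFrame({
    "AGE": [25.0, np.nan, 40.0, 35.0],
    "LIMIT_BAL": [20000.0, 50000.0, np.nan, 80000.0],
})

# Replace each missing value with the median of its column.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

The median is used here because it is less affected by outliers than the mean; `strategy="mean"` or `strategy="most_frequent"` would cover the other two statistics mentioned in the list.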

Let’s take a look if our data has any missing values.

import missingno

x.info()
missingno.matrix(x)

From the first output, we can clearly see that we don’t have missing data. In the case that you have many columns (50+) looking through the list could be tiring and that is why we use the second method to produce a graph.

If we had missing data we would see white strips in the gray columns.

How to pick and implement a classifier?

When it comes to ML, picking a classifier comes down to many things like experience, the number of features, linearity, size of the dataset, training speed, complexity, accuracy, use, scalability, and more.

There are many guides and tips on the internet for each case-scenario and the pros and cons of each classifier. In this article, we will explore several classifiers and compare them to each other.

This way allows us to choose the best fitting one and the classifiers that we’re going to test are:

  • Decision tree
  • Random forest
  • KNN
  • SVM
  • Logistic regression

The decision tree is a structure in which each internal node represents a test on an attribute (for example, whether a client’s limit balance is below some threshold), the branches represent the outcomes of the test, and each leaf node represents a class label (default or no default).

The decision tree is made of 3 types of nodes:

  • Decision nodes – represented as squares
  • Chance nodes – represented as circles
  • End nodes – represented as triangles

The decision tree can be boiled down as a set of rules that guide us to the best decision. The main pros of the decision tree are the following:

  • Easy to interpret
  • Doesn’t have many hyperparameters that need tuning
  • Non-parametric
  • Doesn’t need scaling or normalization of the features
  • Doesn’t care about the non-linearity of the dataset
  • Supports categorical and numerical variables
  • Training is fast

The cons:

  • It can overfit – if we don’t provide it with depth stop criteria the trees can grow absurdly
  • Not stable – a small change in the inputs can change the model easily
  • No regression – a classification tree can’t predict continuous variables (that requires a regression tree instead)
  • A great number of features can substantially lower the predictive power
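
The idea that a decision tree boils down to a set of rules can be made concrete by printing a fitted tree as text. The sketch below trains a very small tree on made-up data (the `X_toy` values and the two feature names are assumptions for illustration) and prints its rules with scikit-learn's export_text.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up clients: [months of payment delay, limit balance in thousands].
X_toy = [[0, 200], [0, 150], [-1, 300], [2, 50],
         [3, 30], [2, 80], [0, 100], [4, 20]]
# Made-up labels: 1 = default, 0 = no default.
y_toy = [0, 0, 0, 1, 1, 1, 0, 1]

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# Print the learned tests and the class assigned in each leaf.
rules = export_text(tree_clf, feature_names=["PAY_1", "LIMIT_BAL"])
print(rules)
```

Each indented line of the output is one test on a feature, and each `class:` line is a leaf, which is what makes a shallow tree easy to read and explain.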

Let us import the relevant libraries and start coding our classifier.

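from sklearn import tree, metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, make_scorer, classification_report
from sklearn.metrics import confusion_matrix as cm
import matplotlib.pyplot as plt
import scikitplot as skplt  # provided by the scikit-plot package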

Now, we shall call our model, fit it to the training data and compute the prediction:

clf_gini = DecisionTreeClassifier(criterion = 'gini', random_state = 100, max_depth = 3, min_samples_leaf = 5)
clf_gini.fit(X_train, y_train)

prediction = clf_gini.predict(X_test)
print("Decision Tree Model Report")
report = classification_report(y_test, prediction)
print(report)

Another important thing is the confusion matrix as it sums all possible combinations of the predicted values as opposed to the actual target.

confusion_matrix = cm(y_test, prediction)
print(confusion_matrix)

skplt.metrics.plot_confusion_matrix(y_test, prediction)
plt.show()
skplt.metrics.plot_confusion_matrix(y_test,prediction,normalize=True)
plt.show()

The main structure of our confusion matrix is the following:

True Negative | False Positive

False Negative | True Positive

TN – Model predicts a good client, and the client didn’t default

FN – Model predicts a good client, but the client defaults

TP – Model predicts a default, and the client defaults

FP – Model predicts a default, but the client did not default

The prediction report has the following metrics:

  • Accuracy – measures the model’s overall ability to correctly predict the target. ((TP + TN) / (TP + FP + TN + FN))
  • Precision – out of all observations predicted as positive (default), how many actually defaulted. (TP / (TP + FP))
  • Recall – out of all actual positives, how many were predicted correctly. (TP / (TP + FN))
  • F-1 Score – the harmonic mean of precision and recall. (2 × Precision × Recall / (Precision + Recall))
  • Specificity – out of all actual negatives (clients that did not default), how many were predicted correctly. (TN / (TN + FP))
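
The formulas in the list above can be checked by hand on a small example. The counts below are hypothetical and chosen only to make the arithmetic easy to follow; they are not the results of our model.

```python
# Hypothetical confusion-matrix counts (not the model's actual results).
tn, fp, fn, tp = 4000, 600, 400, 1000

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)

print(f"accuracy:    {accuracy:.3f}")
print(f"precision:   {precision:.3f}")
print(f"recall:      {recall:.3f}")
print(f"f1:          {f1:.3f}")
print(f"specificity: {specificity:.3f}")
```

With these counts the accuracy is about 0.83 while the recall is only about 0.71, which already hints at why accuracy alone can paint too rosy a picture.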

As you can see, our model didn’t perform very well, as it had 2006 False Negatives, i.e. it said that the client wouldn’t default but the client did. But why would the model be bad, isn’t the accuracy high?

Well, accuracy is highly misleading when the classes are imbalanced. For example, if our data contained 98% good clients and only 2% fraudulent ones, a model that labels every client as good would show 98% accuracy while catching no fraud at all.
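
The 98% example can be reproduced in a few lines. The labels below are synthetic (an assumption for illustration): 20 positives out of 1000, and a "model" that simply predicts the majority class for everyone.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 2% positives (fraud/default), 98% negatives.
y_true = np.array([1] * 20 + [0] * 980)
# A "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
print(f"accuracy: {acc:.2f}, recall: {rec:.2f}")
```

The accuracy looks excellent while the recall is zero, because not a single positive case is caught.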

What about the Precision and Recall values? Well, if we want to optimize our model for recall we will get more false positives and fewer false negatives.

When optimizing for Precision it is reversed, i.e. we get more false negatives and fewer false positives. Which one to choose? Well, it depends on the task of the model.
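
One common way to trade precision against recall is to move the probability threshold above which a client is flagged as a default. This is an assumption about how the tuning could be done, not something the code in this article performs, and the probabilities below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up true labels and predicted default probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.30, 0.55, 0.70, 0.90])

results = {}
for threshold in (0.25, 0.5):
    # Flag a client as a default when the probability reaches the threshold.
    y_pred = (y_prob >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
    )
    print(threshold, results[threshold])
```

Lowering the threshold from 0.5 to 0.25 catches every default in this toy example (higher recall) at the cost of more false alarms (lower precision).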

As we want to catch as many defaults as possible with our model, we should optimize the model for recall. In order to evaluate the performance of the model, we can use the Precision-Recall curve.

This curve is used because we’re dealing with imbalanced data: we have a small number of defaults when compared to non-default clients. Keep in mind that we shall take care of the imbalance in a moment.

The area under the Precision-Recall curve ranges from 0 to 1, where 1 marks a perfect model. This means that a model with an area under the Precision-Recall curve of 1 can flag all the positives (perfect recall) and not label a single negative observation as a positive one (perfect precision).

This means that we can see models that approximate the (1, 1) point as exceptional.

y_pred_prob = clf_gini.predict_proba(X_test)[:, 1]
precision, recall, thresholds = metrics.precision_recall_curve(y_test, y_pred_prob)
print(metrics.auc(recall, precision))

ax = plt.subplot()
ax.plot(recall, precision,
        label=f'PR-AUC = {metrics.auc(recall, precision):.2f}')
ax.set(title='Precision-Recall Curve',
       xlabel='Recall',
       ylabel='Precision')
ax.legend()

As we can see the curve could do better and we should fix the data imbalance. The way we shall do this is by introducing a resampling method as the bias in the training set could make the model ignore the defaults.

We will randomly select examples of the minority class, in our case the defaults, and duplicate them. This is known as random oversampling; the method assumes nothing about the data and can provide a better fit of the model.

Keep in mind that the change in the distribution is applied only to the training data and not the test data. Overly skewed data, when coupled with this method, can produce overfitting, so the main thing is to compare the two predictions before and after the resampling.

df_train = pd.concat([X_train, y_train],axis=1)
df_train

df_test = pd.concat([X_test, y_test],axis=1)
df_test

count_class_0, count_class_1 = df_train['default.payment.next.month'].value_counts()

df_majority = df_train[df_train['default.payment.next.month']==0]
df_minority = df_train[df_train['default.payment.next.month']==1]

df_minority_upsampled = df_minority.sample(count_class_0, replace=True)
df_upsampled = pd.concat([df_majority,df_minority_upsampled],axis=0)
 
print('Random Oversampling:')
print(df_upsampled['default.payment.next.month'].value_counts())
 

df_upsampled['default.payment.next.month'].value_counts().plot(kind='bar', title='Count (default.payment.next.month)');
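
The same oversampling steps can be wrapped in a small reusable function. The `random_oversample` helper and the `toy_train` frame below are hypothetical names introduced for this sketch; the logic mirrors the code above, with a fixed random_state added so the result is reproducible.

```python
import pandas as pd

def random_oversample(frame, label_col, random_state=0):
    """Duplicate random rows of smaller classes until all classes match the largest."""
    counts = frame[label_col].value_counts()
    target = counts.max()
    parts = []
    for cls, n in counts.items():
        part = frame[frame[label_col] == cls]
        if n < target:
            # Sample with replacement so a small class can reach the target size.
            part = part.sample(target, replace=True, random_state=random_state)
        parts.append(part)
    return pd.concat(parts, axis=0)

# Made-up training frame: 8 non-defaults and 2 defaults.
toy_train = pd.DataFrame({
    "LIMIT_BAL": range(10),
    "default.payment.next.month": [0] * 8 + [1] * 2,
})

balanced = random_oversample(toy_train, "default.payment.next.month")
balanced_counts = balanced["default.payment.next.month"].value_counts().to_dict()
print(balanced_counts)
```

As in the article's code, a function like this should only ever be applied to the training part, never to the test part.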

Now that we have a more balanced training set, let us split it into features and label and run the classifier again to see the change. The classifier code remains the same, with the only change being the training variables.

X_train_upsampled = df_upsampled.drop(["default.payment.next.month"],axis=1)
y_train_upsampled = df_upsampled["default.payment.next.month"]

clf_gini = DecisionTreeClassifier(criterion = 'gini', random_state = 100, max_depth = 3, min_samples_leaf = 5)
clf_gini.fit(X_train_upsampled, y_train_upsampled)

prediction = clf_gini.predict(X_test)
print("Decision Tree Model Report")
report = classification_report(y_test, prediction)
print(report)

#Plot the Matrix
confusion_matrix = cm(y_test, prediction)
print(confusion_matrix)

skplt.metrics.plot_confusion_matrix(y_test, prediction)
plt.show()
skplt.metrics.plot_confusion_matrix(y_test,prediction,normalize=True)
plt.show()

y_pred_prob = clf_gini.predict_proba(X_test)[:, 1]
precision, recall, thresholds = metrics.precision_recall_curve(y_test, y_pred_prob)
print(metrics.auc(recall, precision))

ax = plt.subplot()
ax.plot(recall, precision,
        label=f'PR-AUC = {metrics.auc(recall, precision):.2f}')
ax.set(title='Precision-Recall Curve',
       xlabel='Recall',
       ylabel='Precision')
ax.legend()

I’ll present the data with the resampled predictions being the first ones and the previous predictions being the second ones so you can compare the performance of the two easily.

[Figures: results on the resampled data, followed by results on the non-resampled data]

As we can see, the model performed considerably better after the resampling: recall for the default class improved from 0.22 to 0.83!

The improvement in recall also came with more false positives, but for this example it is better to flag more users as likely to default and take some action about it than to be surprised by many false negatives.

Thus, picking a model that has fewer false negatives would be the wiser choice for a bank. But once the bank starts spending resources to mitigate the problem, the cost of false positives (FP) becomes more prominent.
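To make the FN/FP trade-off more tangible, here is a minimal sketch that moves the decision threshold of a classifier and counts both error types. It is not part of the original notebook: the data is synthetic (`make_classification`), the model is a plain logistic regression, and `fn_fp_at` is a helper name made up for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the credit data (roughly 20% positives).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.8, 0.2],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]   # estimated probability of the positive class

def fn_fp_at(threshold):
    """Return (false negatives, false positives) at a given cut-off."""
    pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, pred, labels=[0, 1]).ravel()
    return fn, fp

for t in (0.5, 0.3, 0.1):
    fn, fp = fn_fp_at(t)
    print(f"threshold={t:.1f}  false negatives={fn}  false positives={fp}")
```

Lowering the threshold can only move predictions from "no default" to "default", so false negatives never go up and false positives never go down. Where to stop depends on what each kind of mistake costs the bank.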

Before moving on to our other 3 classifiers, keep in mind that I’ll continue using the resampled data. I advise you to play around and compare each classifier before and after resampling to see how it impacts the results.

Let’s move to our next classifier, the Random Forest. Our previous decision tree was easy to interpret, but a single tree is often not enough to produce good predictions.

Random Forest builds many decision trees, each on a random sample of the rows and a random subset of the features, and combines their predictions in order to make a good prediction.
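As a quick illustration of what "leveraging multiple trees" means in scikit-learn, the sketch below fits a small forest on synthetic data (an assumption, not the credit dataset) and checks that averaging the class probabilities of the individual trees reproduces the forest's own `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Each fitted tree is available in forest.estimators_.
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Averaging the per-tree class probabilities reproduces the forest's output.
manual_proba = per_tree.mean(axis=0)
forest_proba = forest.predict_proba(X)

print(np.allclose(manual_proba, forest_proba))
print("trees in the forest:", len(forest.estimators_))
```

Each tree on its own can be quite noisy; the averaging is what smooths the individual mistakes out.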

The pros:

  • Isn’t sensitive to outliers
  • Non-parametric
  • Can handle continuous and categorical data
  • Performs well on large data

The cons:

  • Can overfit
  • The trees sometimes aren’t optimal
  • Calculations are complex if there are many variables
  • High variance with slight changes
  • Compared to other algorithms, its predictions can have low accuracy

# n_jobs=-1 uses all available CPU cores
clf = RandomForestClassifier(n_jobs=-1,
                            random_state=9,
                            n_estimators=11,
                            verbose=False)

clf.fit(X_train_upsampled, y_train_upsampled)

prediction = clf.predict(X_test)
print("Random Forest Model Report")
report = classification_report(y_test, prediction)
print(report)

#Plot the Matrix
confusion_matrix = cm(y_test, prediction)
print(confusion_matrix)

skplt.metrics.plot_confusion_matrix(y_test, prediction)
plt.show()
skplt.metrics.plot_confusion_matrix(y_test,prediction,normalize=True)
plt.show()

y_pred_prob = clf.predict_proba(X_test)[:, 1]
precision, recall, thresholds = metrics.precision_recall_curve(y_test, y_pred_prob)
print(metrics.auc(recall, precision))

ax = plt.subplot()
ax.plot(recall, precision,
label=f'PR-AUC = {metrics.auc(recall, precision):.2f}')
ax.set(title='Precision-Recall Curve',
xlabel='Recall',
ylabel='Precision')
ax.legend()

Not too shabby, but it can do better, and we’ll see if the other classifiers can beat it. Also, as the training dataset is now balanced by our resampling, we can introduce a new curve called the ROC curve.

The Receiver Operating Characteristic (ROC) curve represents the trade-off between the false positive (FP) rate and the true positive (TP) rate across different probability thresholds. We can check the performance of a model by looking at the area under the ROC curve (AUC).

For a useful model the AUC lies between 0.5 (no skill) and 1 (ideal model), and it can be read in probabilistic terms: it is the chance that a randomly chosen positive observation receives a higher score than a randomly chosen negative one. An AUC of 0.5 therefore means there is only a 50% chance that two such observations are ordered correctly.
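The probabilistic reading of the AUC can be checked by hand on a tiny example. The labels and scores below are made up for illustration; the sketch counts how many positive/negative pairs are ordered correctly and compares the result with `roc_auc_score`.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

pos = [s for s, y in zip(scores, y_true) if y == 1]
neg = [s for s, y in zip(scores, y_true) if y == 0]

# Count positive/negative pairs that are ranked correctly; ties count as half.
correct = sum(1.0 if p > n else 0.5 if p == n else 0.0
              for p, n in product(pos, neg))
pairwise_auc = correct / (len(pos) * len(neg))

print(pairwise_auc)                      # 0.75
print(roc_auc_score(y_true, scores))     # 0.75
```

Three of the four positive/negative pairs are ordered correctly (only 0.35 vs 0.4 is not), hence 0.75.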

The best performing ROC curve would pass through the point (0, 1), i.e. it would bow toward the upper left corner of the chart. Let’s calculate it:

from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot

ns_probs = [0 for _ in range(len(y_test))]
lr_probs = clf.predict_proba(X_test)

lr_probs = lr_probs[:, 1]

ns_auc = roc_auc_score(y_test, ns_probs)
lr_auc = roc_auc_score(y_test, lr_probs)

print('No Skill: ROC AUC=%.3f' % (ns_auc))
print('Random Forest: ROC AUC=%.3f' % (lr_auc))

ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs)

pyplot.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
pyplot.plot(lr_fpr, lr_tpr, marker='.', label='Random Forest')
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
pyplot.show()

Now we move on to the k-nearest neighbors (KNN) classifier. KNN is a “lazy learning” method: it doesn’t build a model up front but classifies each new point locally, based on the labels of the closest training points. Because it relies on distances, features measured on very different scales can distort it, and normalizing the data can improve it considerably.

The best way to visualize what KNN is doing is to consider the following question: “To which group should the green circle be assigned?”
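The "green circle" question can be answered with a few lines of plain Python. The coordinates below are hypothetical and only arranged so that the answer depends on k; `knn_predict` is an illustrative helper, not the scikit-learn implementation.

```python
from collections import Counter
import math

# (x, y, label): hypothetical points arranged around the query.
points = [
    (0.5, 0.0, "triangle"), (0.0, 0.6, "triangle"),
    (-0.8, 0.0, "square"), (0.0, -1.0, "square"), (1.1, 0.0, "square"),
    (3.0, 3.0, "triangle"), (-3.0, -3.0, "square"),
]
query = (0.0, 0.0)  # the "green circle"

def knn_predict(query, points, k):
    """Label the query by majority vote among its k nearest points."""
    by_distance = sorted(points, key=lambda p: math.dist(query, p[:2]))
    votes = Counter(label for _, _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_predict(query, points, k=3))  # triangle
print(knn_predict(query, points, k=5))  # square
```

With k=3 the two closest triangles outvote one square; with k=5 three squares outvote the two triangles. That is why the number of neighbors is the hyperparameter to tune.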

The pros:

  • Doesn’t have assumptions
  • Easy to interpret and intuitive
  • It evolves over time
  • Easy to use
  • Can be used for regression and classification
  • Has one hyper parameter

The cons:

  • KNN is quite slow
  • Needs homogeneous features
  • Can’t handle imbalanced data
  • Sensitive to outliers
  • Can’t deal with missing values

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_upsampled,y_train_upsampled)
pred = knn.predict(X_test)
print("Accuracy:")
response = accuracy_score(y_test,pred)
print(response)

confusion_matrix = cm(y_test, pred)
print(confusion_matrix)

skplt.metrics.plot_confusion_matrix(y_test, pred)
plt.show()
skplt.metrics.plot_confusion_matrix(y_test,pred,normalize=True)
plt.show()

print(classification_report(y_test, pred))

y_pred_prob = knn.predict_proba(X_test)[:, 1]
precision, recall, thresholds = metrics.precision_recall_curve(y_test, y_pred_prob)
print(metrics.auc(recall, precision))

ax = plt.subplot()
ax.plot(recall, precision,
label=f'PR-AUC = {metrics.auc(recall, precision):.2f}')
ax.set(title='Precision-Recall Curve',
xlabel='Recall',
ylabel='Precision')
ax.legend()

ns_probs = [0 for _ in range(len(y_test))]
lr_probs = knn.predict_proba(X_test)

lr_probs = lr_probs[:, 1]

ns_auc = roc_auc_score(y_test, ns_probs)
lr_auc = roc_auc_score(y_test, lr_probs)

print('No Skill: ROC AUC=%.3f' % (ns_auc))
print('KNN: ROC AUC=%.3f' % (lr_auc))

ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs)

pyplot.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
pyplot.plot(lr_fpr, lr_tpr, marker='.', label='KNN')
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
pyplot.show()

The next classifier is Logistic Regression, which can be seen as a model that predicts the probability of a binary outcome such as win/lose, default/no default, healthy/sick, etc.
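Under the hood, the probability comes from a linear score passed through the sigmoid function. The sketch below, on synthetic data (an assumption for illustration), recomputes the probabilities from the fitted coefficients and compares them with `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Linear score z = w.x + b, squashed into (0, 1) by the sigmoid.
z = X @ model.coef_.ravel() + model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

print(np.allclose(manual_proba, model.predict_proba(X)[:, 1]))
```

This is also why the decision boundary is linear: the prediction flips exactly where the linear score z crosses zero.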

The pros:

  • Easy to train, interpret and implement
  • It can cover multiple classes (multinomial regression)
  • It’s fast
  • Good accuracy
  • Less inclined to overfitting

The cons:

  • It makes the boundaries linear
  • It assumes the linearity between the dependent variable and the independent variables
  • It’s hard to capture complex relations between the variables

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train_upsampled, y_train_upsampled)
prediction = logreg.predict(X_test)
print("Accuracy:")
response = accuracy_score(y_test,prediction)
print(response)

prediction = dict()
prediction['Logistic'] = logreg.predict(X_test)
print('f1 Score:' ,metrics.f1_score(y_test, prediction['Logistic']))

confusion_matrix = cm(y_test, prediction['Logistic'])
print(confusion_matrix)


skplt.metrics.plot_confusion_matrix(y_test, prediction['Logistic'])
plt.show()
skplt.metrics.plot_confusion_matrix(y_test,prediction['Logistic'],normalize=True)
plt.show()

print(classification_report(y_test, prediction['Logistic']))

y_pred_prob = logreg.predict_proba(X_test)[:, 1]
precision, recall, thresholds = metrics.precision_recall_curve(y_test, y_pred_prob)
print(metrics.auc(recall, precision))

ax = plt.subplot()
ax.plot(recall, precision,
label=f'PR-AUC = {metrics.auc(recall, precision):.2f}')
ax.set(title='Precision-Recall Curve',
xlabel='Recall',
ylabel='Precision')
ax.legend()

ns_probs = [0 for _ in range(len(y_test))]
lr_probs = logreg.predict_proba(X_test)

lr_probs = lr_probs[:, 1]

ns_auc = roc_auc_score(y_test, ns_probs)
lr_auc = roc_auc_score(y_test, lr_probs)

print('No Skill: ROC AUC=%.3f' % (ns_auc))
print('Logistic: ROC AUC=%.3f' % (lr_auc))

ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs)

pyplot.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
pyplot.plot(lr_fpr, lr_tpr, marker='.', label='Logistic')
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
pyplot.show()

What classifier performs the best?

Before looking into which of the 4 classifiers had the best performance overall, we need to discuss the process of tuning hyperparameters. But first, what is the difference between a parameter and a hyperparameter?

Parameters:

  • Internal attributes of the model
  • They are learned during the training phase
  • They are estimated based on data

Hyperparameters:

  • External attributes of the model
  • Model’s settings
  • They aren’t estimated on data
  • They are set in place before the training phase
  • They require tuning for a better performance

Hyperparameters are mainly tuned with two processes used together: cross-validation and grid search. In ML we need to tune the models so that their predictions are better on unseen data.

This can be seen as a constant battle between underfitting and overfitting, i.e. the bias-variance trade-off. In order to tune the hyperparameters we can set aside a validation set.

This set is used to tune them before we measure the model’s performance on the test set. This can be a bad idea if you have a small dataset, as part of the data that would be used for training is reallocated to the validation set.
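A validation set is usually carved out with two consecutive splits. The sketch below uses synthetic data and a 60/20/20 split; both are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# First carve off the test set, then split the remainder into train/validation.
X_rest, X_te, y_rest, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(len(X_tr), len(X_val), len(X_te))  # 600 200 200
```

Only 600 of the 1000 rows are left for training, which is exactly the cost mentioned above.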

This is where cross-validation comes in, as it allows us to get reliable estimates of the model’s generalization error without giving up a fixed validation set. k-fold cross-validation works in three main steps:

  1. Splits the training data randomly into k folds.
  2. Trains the model on k-1 folds and evaluates its performance on the remaining fold.
  3. Repeats the process k times, so that each fold is used for evaluation once, and averages the results.

Image source: https://en.wikipedia.org/wiki/Cross-validation_(statistics)#/media/File:K-fold_cross_validation_EN.svg
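The three steps above can be written out as an explicit loop. This is a sketch on synthetic data with a logistic regression (both assumptions); it also checks that the manual loop matches scikit-learn's `cross_val_score` when the same folds are used.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, val_idx in folds.split(X, y):            # step 1: k folds
    fold_model = clone(model)                           # fresh, unfitted copy
    fold_model.fit(X[train_idx], y[train_idx])          # step 2: train on k-1 folds
    scores.append(fold_model.score(X[val_idx], y[val_idx]))  # evaluate on the held-out fold

manual_mean = np.mean(scores)                           # step 3: average
library_mean = cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
print(manual_mean, library_mean)
```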

When it comes to grid search, the main idea is to build a grid of all combinations of the candidate hyperparameter values and train the model with each one of them. This allows us to find the best combination within that grid.

As the grid grows exponentially with each added hyperparameter, so does the required computation. That is the reason the randomized search method was introduced.

In this randomized case, we select a random set of hyperparameters, train the model, obtain the scores and repeat the process until it hits a predefined number of iterations.

When working with large grids, random search is the better option as it finishes faster. For those interested in how many iterations it should go through, I didn’t find a satisfying answer, as it depends on the available time and computational power.
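To get a feeling for the difference in cost, scikit-learn's `ParameterGrid` and `ParameterSampler` can enumerate the candidates without training anything. The grid values below are illustrative assumptions for a Random Forest.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

grid = {
    "n_estimators": [5, 10, 50, 100, 250],
    "max_depth": [2, 4, 8, 16, 32, None],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
}

# Exhaustive grid search would have to try every combination.
print(len(ParameterGrid(grid)))   # 540

# Randomized search only draws a fixed number of them.
sampled = list(ParameterSampler(grid, n_iter=20, random_state=0))
print(len(sampled))               # 20
print(sampled[0])
```

Every extra hyperparameter multiplies the size of the full grid, while the cost of the randomized search stays fixed at n_iter candidates.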

Let’s see which classifier performs the best during cross validation and then we shall pick its hyperparameters and tune them.

from sklearn import model_selection

outcome = []
model_names = []
models = [  
          ('DecTree', DecisionTreeClassifier()),
          ('RandomForest', RandomForestClassifier()),
          ('KNN', KNeighborsClassifier()),
          ('LogReg', LogisticRegression()),]

# One stratified 5-fold splitter, reused for every model
k_fold_validation = model_selection.StratifiedKFold(5, shuffle=True, random_state=42)
for model_name, model in models:
    results = model_selection.cross_val_score(model, X_train_upsampled, y_train_upsampled, cv=k_fold_validation, scoring='accuracy')
    outcome.append(results)
    model_names.append(model_name)
    output_message = "%s| Mean=%f STD=%f" % (model_name, results.mean(), results.std())
    print(output_message)

We got the following:

DecTree| Mean=0.961845 STD=0.001051
RandomForest| Mean=0.979542 STD=0.000930
KNN| Mean=0.911619 STD=0.000970
LogReg| Mean=0.794756 STD=0.001764

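To be honest, it wasn’t a surprise that the Random Forest classifier performed the best. Keep in mind, though, that these folds were drawn from the upsampled training set, so copies of the same minority rows can land in both the training and the validation folds, which makes the scores of flexible models such as trees and KNN look better than they would on unseen data. Now is the time to pick the hyperparameters and tune them. But how do we know which ones are available?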

How to tune the parameters?

Do the following in order to obtain the list of parameters for each of our models:

from pprint import pprint
for model_name, model in models:
    print('\n',model,'Parameters currently in use:\n')
    pprint(model.get_params())

Output:

DecisionTreeClassifier() Parameters currently in use:

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'presort': 'deprecated',
 'random_state': None,
 'splitter': 'best'}

 RandomForestClassifier() Parameters currently in use:

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

 KNeighborsClassifier() Parameters currently in use:

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}

 LogisticRegression() Parameters currently in use:

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

Wow, that is quite a list. Which ones to pick? Well, the best way is to research online what each hyperparameter represents and read papers about which ones are the most worthwhile to tune.

random_grid = {"n_estimators":[5,10,50,100,250],
               "max_depth":[2,4,8,16,32,None],
              # 'auto' was an alias of 'sqrt' for classifiers and was removed in newer scikit-learn releases
              'max_features': ['sqrt', 'log2'],
              'min_samples_leaf': [1, 2, 4],
              'min_samples_split': [2, 5, 10],}
pprint(random_grid)

Upon each iteration, the algorithm will choose a different combination of the hyperparameter values. The grid above contains 5 × 6 × 2 × 3 × 3 = 540 combinations, and thus we’ll rely on randomized search.

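from sklearn.model_selection import RandomizedSearchCV

rfc = RandomForestClassifier()
rfc_random = RandomizedSearchCV(estimator = rfc, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit on the resampled training data used throughout this section
rfc_random.fit(X_train_upsampled, y_train_upsampled)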

n_iter controls the number of different combinations to try, while cv is the number of folds to use for cross-validation. More folds give a more reliable performance estimate, while more iterations cover a wider part of the grid. With n_iter = 100 and cv = 3 the search fits 300 models.

The more we ask for, the longer it computes. Let’s run the previous code and see the best params:

rfc_random.best_params_

Now let’s see if the random grid search provided us with a better model. We can do this by comparing the base model with the random one.

clft = RandomForestClassifier(n_estimators = 100,
                             min_samples_split = 2,
                             min_samples_leaf = 1,
                             max_features = 'sqrt',
                             max_depth = None)

clft.fit(X_train_upsampled, y_train_upsampled)

prediction = clft.predict(X_test)
print("Tuned Random Forest Model Report")
report = classification_report(y_test, prediction)
print(report)

#Plot the Matrix
confusion_matrix = cm(y_test, prediction)
print(confusion_matrix)

skplt.metrics.plot_confusion_matrix(y_test, prediction)
plt.show()
skplt.metrics.plot_confusion_matrix(y_test,prediction,normalize=True)
plt.show()

y_pred_prob = clft.predict_proba(X_test)[:, 1]
precision, recall, thresholds = metrics.precision_recall_curve(y_test, y_pred_prob)
print(metrics.auc(recall, precision))

ax = plt.subplot()
ax.plot(recall, precision,
label=f'PR-AUC = {metrics.auc(recall, precision):.2f}')
ax.set(title='Precision-Recall Curve',
xlabel='Recall',
ylabel='Precision')
ax.legend()

ns_probs = [0 for _ in range(len(y_test))]
lr_probs = clft.predict_proba(X_test)

lr_probs = lr_probs[:, 1]

ns_auc = roc_auc_score(y_test, ns_probs)
lr_auc = roc_auc_score(y_test, lr_probs)

print('No Skill: ROC AUC=%.3f' % (ns_auc))
print('Tuned Random Forest: ROC AUC=%.3f' % (lr_auc))

ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs)

pyplot.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
pyplot.plot(lr_fpr, lr_tpr, marker='.', label='Tuned Random Forest')
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
pyplot.show()

Our model improved a bit, but not substantially. Depending on the problem you’re dealing with, you may want to tune the parameters further. In ML this is mostly a process of trial and error.

One thing worth exploring is a sklearn pipeline, which automates the above process with less code. More advanced options include Bayesian optimization for tuning and stacking for combining models.
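As a starting point, here is a minimal pipeline sketch that chains scaling and a KNN classifier and tunes it with a grid search. The data is synthetic, and the step names ("scale", "knn") and the candidate values are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Step parameters are addressed as <step name>__<parameter name>.
search = GridSearchCV(pipe, {"knn__n_neighbors": [3, 5, 11]}, cv=5, scoring="recall")
search.fit(X_tr, y_tr)

print(search.best_params_)
print("test recall:", search.score(X_te, y_te))
```

Because the scaler lives inside the pipeline, it is refit on the training part of every fold, so no information from the validation fold leaks into the preprocessing.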

How to interpret the data?

Banks need to know what their ML models do and which features matter most in them. When turning clients away or nudging them to be more conscientious, it is valuable to be able to explain the main variables behind the algorithm’s decision.

Moreover, if we understand the logic behind the model we can focus on specific features and evaluate their correctness on a theoretical basis. It is even sensible to lower the accuracy to gain better interpretations.

But how do we get an idea of the main features that hold most of the predictive value? Well, scikit-learn exposes this through the feature_importances_ attribute of a fitted model.

Our Random Forest uses an impurity metric (Gini by default) to choose the best splits while growing the trees. While training the classifier we can measure how much each feature contributes to the decrease in impurity.
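The impurity decrease of a single split is easy to compute by hand. The helper names `gini` and `impurity_decrease` and the toy labels below are made up for this illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

parent = [0, 0, 0, 1, 1, 1]
print(gini(parent))                                          # 0.5
print(impurity_decrease(parent, [0, 0, 0], [1, 1, 1]))       # 0.5
print(impurity_decrease(parent, [0, 0, 1], [0, 1, 1]))       # about 0.056
```

A split that separates the classes perfectly removes all of the impurity, while a poor split removes very little. A feature's importance accumulates these decreases over every split in which it is used.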

To get the importance for the entire forest, the algorithm averages the decrease in impurity over all trees. The downside of this approach is that the importance is computed on the training set only, so it says nothing about how useful a feature is on unseen data.

It also favors continuous numerical features (and other features with many distinct values), which can lead to some questionable rankings.
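One way around both limitations is permutation importance, which scikit-learn offers as `permutation_importance` and which can be computed on held-out data. Note that this is a different technique from the impurity-based importance used in this article; the sketch below runs on synthetic data where only the first three columns are informative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 3 informative features in columns 0-2.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the held-out data and record the score drop.
result = permutation_importance(forest, X_te, y_te, n_repeats=10,
                                random_state=0, scoring="recall")

for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```

A feature whose shuffling barely changes the test score contributes little to the predictions on unseen data, no matter how often the trees split on it.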

Another thing that we can do is use the SHAP library to compute the importance of features. Shapley values remove the order effect by considering every possible order in which the features can be added to the model.

This matters because the first feature can be very predictive, and when a second, overlapping feature comes into the model it appears to bring no new knowledge, simply because it was added later.

Thus the Shapley values are used to calculate the marginal effect of a feature in all possible orders and average them. One of the pros of SHAP is that it has a solid theory behind it (game theory).

SHAP also provides explanations from the micro level (a single prediction) to the macro level (the whole model). The downside of SHAP is that it can be very slow when dealing with big datasets.
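The order-averaging idea can be shown with a brute-force toy example. This is not how the SHAP library computes its values (it uses much faster algorithms); the three features and the `value` function below are hypothetical and chosen so that A and B overlap completely.

```python
from itertools import permutations

features = ["A", "B", "C"]

def value(coalition):
    """Hypothetical model score when only these features are available.
    A and B carry the same information (worth 10); C adds 5 on its own."""
    score = 10 if ("A" in coalition or "B" in coalition) else 0
    score += 5 if "C" in coalition else 0
    return score

orderings = list(permutations(features))
shapley = {f: 0.0 for f in features}

for order in orderings:
    seen = set()
    for f in order:
        # Marginal contribution of f given the features added before it.
        shapley[f] += value(seen | {f}) - value(seen)
        seen.add(f)

shapley = {f: total / len(orderings) for f, total in shapley.items()}
print(shapley)  # {'A': 5.0, 'B': 5.0, 'C': 5.0}
```

In any single ordering, whichever of A or B comes first gets the full 10 and the other gets 0. Averaging over all six orderings splits the credit evenly, and the three values add up to the score of the full model (15).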

from sklearn.inspection import permutation_importance
import numpy as np
import shap

feature_names = X_train_upsampled.columns

# Mean decrease in impurity (MDI) per feature, sorted from most to least important
rf_feat_imp = pd.DataFrame(clft.feature_importances_,
                            index=feature_names,
                            columns=['mdi'])

rf_feat_imp = rf_feat_imp.sort_values('mdi', ascending=False)
rf_feat_imp['cumul_importance_mdi'] = np.cumsum(rf_feat_imp.mdi)

clft.feature_importances_
from matplotlib import style
# Horizontal bar chart of the importances
plt.style.use('ggplot')
sorted_idx = clft.feature_importances_.argsort()
plt.figure(figsize=(10,15))
plt.barh(feature_names[sorted_idx], clft.feature_importances_[sorted_idx])
plt.xlabel("Random Forest Feature Importance")

plt.show()

As we see, the 3 biggest predictors are age, limit balance and whether the client is married. What does SHAP have to say?

explainer = shap.TreeExplainer(clft)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")

SHAP agrees on the top predictors, but the ordering of the less important features differs. I advise the reader to try tuning the other models too and even try dropping some non-important features to see how the models perform.

Throughout this article we explored some of the main concepts and ways to handle data. There are many better but more complex approaches that we didn’t have the time to explore.

Full code

GitHub Link

Resources

Database:

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Igor Radovanovic