Cluster Analysis – Machine Learning for Pairs Trading

Last Updated on July 16, 2022

Table of contents:

  1. What is Cluster Analysis?
  2. Is Cluster Analysis an Unsupervised Machine Learning task?
  3. How can Cluster Analysis be used for Finance?
  4. What is the Pairs Trading strategy?
  5. Does Pair Trading succeed?
  6. What are the main steps of a Machine Learning project?
  7. Where to find stock data and how to load it?
  8. How to explore stock data?
  9. How to prepare stock data for clustering?
  10. How to pick a good machine learning model?
  11. What is k-Means Clustering?
  12. How to use k-Means Clustering for Pairs Trading?
  13. What is Hierarchical Clustering?
  14. How to use Hierarchical Clustering for Pairs Trading?
  15. What is Affinity Propagation Clustering?
  16. How to use Affinity Propagation Clustering for Pairs Trading?
  17. How to evaluate and compare clustering models?
  18. How to extract the trading pairs?
  19. How to efficiently present your findings?
  20. What are the 3 Common Machine Learning Analysis/Testing Mistakes?
  21. Full code

What is Cluster Analysis?

Cluster Analysis is a family of methods used to sort observations into groups, known as clusters, so that members of the same cluster are more similar to each other than to members of other clusters.

Cluster Analysis doesn’t need any prior information about which group each observation belongs to.

The result of a cluster analysis shown as the coloring of the squares into three clusters.

Is Cluster Analysis an Unsupervised Machine Learning task?

Yes. In Unsupervised ML we feed inputs but there aren’t any target outputs. This means that we don’t tell the algorithm what to do and that it needs to figure out some sort of dependence or underlying logic of what to do.

For example, imagine a website where people posted pictures of dogs and cats. We want the said website to classify those pictures into two categories “Cats” and “Dogs” without us needing to label each picture.

Rather than teaching the clustering model what a cat or dog is, we simply tell the algorithm to group the pictures into two groups based on visual similarity. The result is two unlabeled groups.

How can Cluster Analysis be used for Finance?

When it comes to Finance, Cluster Analysis can help uncover the underlying structure of a dataset without us needing to bang our heads trying to figure it out for ourselves.

For example, imagine having the financial data for 200 stocks, and that you want the algorithm to divide them into groups. The algorithm gives us 4 groups (clusters) that we need to examine.

After examination, we conclude that the groups are: “Bestselling”, “Selling but mediocre”, “Worst selling”, and “Stagnating”.

In this article, we’ll explore how Cluster Analysis can help us in creating a Pairs Trading strategy. Let us first remind ourselves what a Pairs Trading strategy is.

What is the Pairs Trading strategy?

Pairs trading is a strategy in which a trader buys one asset while shorting another. The main premise of the trade is that when the prices of the two assets diverge, they will likely converge again, resulting in a profit for the trader.

A visual representation of this strategy might help you in understanding it better:

Figure: Pairs Trading, a chart of two mean-reverting prices.
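To make the diverge-and-converge idea concrete, here is a minimal sketch on synthetic prices. Everything in it is an assumption made for illustration: the two price series are simulated, the hedge ratio comes from a plain least-squares fit, and the entry threshold of 2 standard deviations is an arbitrary choice rather than a recommendation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic prices (assumption): both assets share one random-walk driver.
common = np.cumsum(rng.normal(0, 1, 500)) + 100
asset_a = common + rng.normal(0, 0.5, 500)
asset_b = common + rng.normal(0, 0.5, 500)

# Hedge ratio and intercept from a least-squares fit of A on B.
hedge_ratio, intercept = np.polyfit(asset_b, asset_a, 1)

# The spread is what is left of A after removing its B-driven part.
spread = asset_a - (hedge_ratio * asset_b + intercept)
zscore = (spread - spread.mean()) / spread.std()

# Illustrative rule: act only when the spread is unusually wide.
entry = 2.0
short_a_long_b = zscore > entry    # A looks expensive relative to B
long_a_short_b = zscore < -entry   # A looks cheap relative to B

print("hedge ratio:", round(hedge_ratio, 3))
print("days with a signal:", int(short_a_long_b.sum() + long_a_short_b.sum()))
```

The trade is closed when the z-score moves back toward zero, which is the convergence the strategy relies on.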

Does Pair Trading succeed?

Pairs Trading can work if you choose the right assets to form a pair. Moreover, you can strengthen your pairs trading strategy by adding more parts to it, such as trading several pairs or coupling it with sentiment analysis.

As the pairs we use are the fundamental building block of this strategy, we don’t want to pick unreasonable ones. Finding them by hand might be too time-consuming, and you might miss out on the underdogs.

This is where Machine Learning (ML) comes into play. In this article, we will go step-by-step through the process of solving our problem with ML. We will define the problem as follows:

What trading assets work the best together to form a trading pair for the Pairs Trading strategy?

Let’s begin!

Note that this article explores statistical machine learning methods for finding assets that moved similarly historically. Pairs trading has been around for a long time, and the strategy is commonplace among hedge funds and traders.

To succeed with pairs trading, you need market knowledge in addition to the statistical tools that you learn here.

For more information about implementing pairs trading in real-life, check out the following article:

What are the main steps of a Machine Learning project?

Before tackling the main problem that we defined, we need to remind ourselves of what the main steps of an ML project are. These steps can often be overlooked by novice practitioners, so be sure to have them in mind:

  1. Define the Problem – be sure to deeply understand the problem you are trying to solve and elaborate it in a concise and understandable way.
  2. Research the Problem – thoroughly research your problem by exploring if there are any proposed solutions, read papers, communicate with experts, etc.
  3. Obtain the Data – you can’t expect your model to perform without obtaining quality data that fits the problem. Remember, garbage in = garbage out.
  4. Prepare the Data – things aren’t perfect in life, and the same goes for your data. It might have missing values, wrongly imputed values, undesirable values, and much more. Be sure to clean it!
  5. Pick the right Model – picking the right model is one of the most important steps when trying to solve the problem. Think of how each model works and if it can provide a reasonable solution.

    Is your task to predict a value, cluster, or classify? Should you use supervised, unsupervised or reinforcement learning?
  6. Evaluate the Model – evaluating your model is a no-brainer. You simply need to see how it performs on the main performance metrics like precision and recall.
  7. Tune the Model – depending on the model you choose and its performance you might want to optimize it by tweaking it with its structure and hyperparameters.
  8. Present your findings – after the model is tuned, you are ready to deploy it and present your findings. This is where your communication skills come into play so be sure to practice them.

Keep in mind that we covered only the main steps; there are more sub-steps and global steps that might, and should, arise. Be sure to think both wide and deep.

Where to find stock data and how to load it?

Stock data can be easily obtained by using financial data providers like Quandl, Yahoo Finance, dxFeed, Bloomberg, or by utilizing online brokers like Interactive Brokers, Fidelity Investments, and more.

For this article, we will obtain 3 years’ worth of data for the S&P 500 stocks by using Yahoo Finance. For more info on Yahoo Finance check out this article:

The S&P 500 is a stock market index that measures the stock performance of 500 large US companies. Let us start up Python, check the number of tickers in the S&P 500 index, and print the first five of them.

#pip install yahoo_fin
import yahoo_fin.stock_info as si

sp500_list = si.tickers_sp500()
print("Number of Tickers in S&P 500:", len(sp500_list))
print(sp500_list[:5])

Number of Tickers in S&P 500: 505
['A', 'AAL', 'AAP', 'AAPL', 'ABBV']

Now let us iterate through the list and obtain our data for each of the tickers:

sp500_historical = {}
for ticker in sp500_list:
    sp500_historical[ticker] = si.get_data(ticker, start_date="01/01/2018", index_as_date = False, interval="1d")

As Yahoo Finance returns a pandas data frame, we have just obtained 505 data frames. Now we need to concatenate them:

import pandas as pd

data = pd.concat(sp500_historical)
data.reset_index(drop=True, inplace=True)

As you can see, the data is still unusable as all the tickers got grouped into a single column. Moreover, we only need the adjusted closing prices and the date columns. We can sort this out by pivoting the data table as follows:

data = data.pivot(index='date', columns='ticker', values = 'adjclose')

Perfect! Now we have our data sorted in the required way. Let us go ahead and save it as a CSV file for future use:
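The pandas call for this is `to_csv`. Below is a minimal sketch on a toy table; the file name `sp500_adjclose.csv` and the temporary folder are assumptions made for illustration, not names taken from the original project.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the pivoted price table (dates as index, tickers as columns).
prices = pd.DataFrame(
    {"A": [67.6, 68.1], "AAL": [51.0, 50.4]},
    index=pd.to_datetime(["2018-01-02", "2018-01-03"]),
)
prices.index.name = "date"

# Hypothetical file name; a temporary folder keeps the sketch free of side effects.
csv_path = os.path.join(tempfile.mkdtemp(), "sp500_adjclose.csv")
prices.to_csv(csv_path)

# Reload with the date column restored as a DatetimeIndex.
reloaded = pd.read_csv(csv_path, index_col="date", parse_dates=True)
print(reloaded)
```

In the article’s flow, the same call would be made on the pivoted `data` frame, which keeps a copy of the prices available after `data` is reused for other purposes later on.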


How to explore stock data?

Stock data can be explored in various ways and the most popular one is by doing an Exploratory Data Analysis which consists of several descriptive statistic methods.

Let’s just briefly look into some main statistics, as we really want to explore the data after the clustering is done. We will call the pandas describe method and set the display precision to 3 decimal places:

pd.set_option('display.precision', 3)
data.describe()

How to prepare stock data for clustering?

The next step is to see if we have any missing values:

data.isnull().values.any()

=> True

As we have missing data, I’m interested in how much is missing. Let us use the missingno library to plot the missing values.

import missingno

missingno.matrix(data)

As we have many stocks it looks a bit messy but you can still see some huge white lines that represent the missing data. This is a bad thing and we shall remove all the columns with more than 20% of missing data:

print('Data Shape before cleaning =', data.shape)

missing_percentage = data.isnull().mean().sort_values(ascending=False)
dropped_list = sorted(list(missing_percentage[missing_percentage > 0.2].index))
data.drop(labels=dropped_list, axis=1, inplace=True)

print('Data Shape after cleaning =', data.shape)
Data Shape before cleaning = (799, 505)
Data Shape after cleaning = (799, 498)

We dropped only 7 columns, which isn’t bad. What do we do with columns that have less than 20% of missing data? We can drop the columns or fill in the missing values with zeros, the mean of the column, or something else.

I’ll fill the missing values by the last available value in the column:

data = data.ffill()
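To see both cleaning steps in isolation, here is a small sketch on a toy table. The column names and values are made up for illustration; only the 20% threshold and the forward fill mirror the steps above.

```python
import numpy as np
import pandas as pd

# Toy price table: "BAD" misses 3 of 5 rows, "OK" misses only one.
toy = pd.DataFrame({
    "OK":   [10.0, np.nan, 10.4, 10.6, 10.8],
    "BAD":  [np.nan, np.nan, np.nan, 5.0, 5.1],
    "FULL": [20.0, 20.1, 20.2, 20.3, 20.4],
})

# Share of missing values per column.
missing_share = toy.isnull().mean()

# Drop columns above the 20% threshold, then carry the last known price forward.
kept = toy.drop(columns=missing_share[missing_share > 0.2].index)
cleaned = kept.ffill()

print(missing_share)
print(cleaned)
```

Note that a forward fill cannot fill values that are missing at the very start of a column, because there is no earlier value to carry forward.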

For our clustering task, we are interested in the performance and volatility of stocks, and thus we want to obtain the mean return and its standard deviation on an annual level. Keep in mind that the code below annualizes with a theoretical year of 266 trading days:

import numpy as np

#Calculate returns and create a data frame
returns = data.pct_change().mean()*266
returns = pd.DataFrame(returns)
returns.columns = ['returns']

#Calculate the volatility
returns['volatility'] = data.pct_change().std()*np.sqrt(266)

data = returns

If you pay attention to the values you can see that, for example, AAL stock has considerably larger volatility than A stock. If we pass the data like this into our models, the features with larger values would dominate the distance calculations.

This would hurt the algorithm’s performance, and to combat it we scale the variables (mean = 0, variance = 1) by using the StandardScaler from sklearn.

from sklearn.preprocessing import StandardScaler

#Fit the scaler on the data
scale = StandardScaler().fit(data)

#Transform the data with the fitted scaler
scaled_data = pd.DataFrame(scale.transform(data),columns = data.columns, index = data.index)
X = scaled_data
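Under the hood, this kind of scaling is just a subtraction and a division per column. The sketch below reproduces it with NumPy on a made-up feature table, so the numbers are illustrative only.

```python
import numpy as np

# Toy feature table: column 0 = annual return, column 1 = annual volatility.
features = np.array([
    [0.10, 0.20],
    [0.30, 0.60],
    [0.05, 0.25],
    [0.15, 0.35],
])

# Standardize each column: subtract its mean, divide by its standard deviation.
# ddof=0 (population standard deviation) matches what StandardScaler uses.
mean = features.mean(axis=0)
std = features.std(axis=0)
scaled = (features - mean) / std

print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.std(axis=0))   # approximately 1 for every column
```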

Now we are ready to decide which models to apply to our data.

How to pick a good machine learning model?

When choosing a good machine learning model you need to know your data. By knowing your data I mean the distributions, missing values, features, labels, etc.

Moreover, you need to know at least the theory behind each model and how and when it is used. All models have their pros and cons and some will perform better than the others on your dataset.

You should think about your problem in the following way: How can it be solved (prediction, clustering, classification)? Should/can I use supervised, unsupervised or reinforcement learning?

After you get the main idea of what the problem requires to be solved, you can move on to choose a few models. If you are a beginner you can simply Google the most used models in the category you have chosen.

Be sure to pick at least 3 models and compare their outputs so you can go with the best-performing one.

Now, let’s apply this to our problem. We have a clustering task that uses the unsupervised learning method and the three models we will choose are:

  • KMeans Clustering
  • Hierarchical Clustering
  • Affinity Propagation Clustering

Now we shall go over each of the selected models, apply them to the data, explore their results and compare them to each other. After comparison, we shall pick the best one and extract the clusters.

What is k-Means Clustering?

k-Means clustering is an unsupervised learning algorithm that partitions the data into k clusters, where k is specified in advance. A suitable value of k can be found by using either the silhouette or the elbow method.

The way the k-Means algorithm works can be simply explained in 4 main steps which are the following:

  1. After the user has specified the number of clusters (k), the algorithm randomly places k initial centroids among the data points.
  2. k clusters are created by associating every observation with the nearest centroid (mean), hence the k-means name.
  3. The centroid of each cluster is recomputed as the mean of the observations assigned to it, so the centroids move.
  4. The previous two steps are repeated until the assignments stop changing and the model converges on a solution.
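The four steps can be sketched in a few lines of NumPy. This is a bare-bones illustration on made-up data, not the scikit-learn implementation used later in the article; the function name `kmeans` and its parameters are chosen here for the example.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=[0, 0], scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.3, size=(20, 2))
labels, centroids = kmeans(np.vstack([blob_a, blob_b]), k=2)
print(labels)
print(centroids)
```

Because the starting centroids are random, different seeds can lead to different solutions on harder data, which is why library implementations usually run several initializations and keep the best one.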

How to use k-Means Clustering for Pairs Trading?

Now that we know the basic idea of how the model works, let’s obtain the number of k clusters we should use for our Pairs Trading problem.

We shall start with the elbow method, which can be summed up as follows: fit the model for each value of k in a specified range, calculate the distortion and inertia for each k, and look for the point (the “elbow”) after which adding more clusters brings little improvement.

Distortion is the average of the squared distances from each sample to its cluster center, while inertia is the sum of the squared distances from each sample to its closest cluster center.
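The difference between the two measures is easiest to see on a tiny hand-made example. The points and cluster assignments below are invented for illustration.

```python
import numpy as np

# Toy data: four points already assigned to two clusters.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])

# Each cluster center is the mean of its members.
centers = np.array([points[labels == j].mean(axis=0) for j in range(2)])

# Squared distance from every point to its own cluster center.
sq_dist = ((points - centers[labels]) ** 2).sum(axis=1)

inertia = sq_dist.sum()      # sum of squared distances
distortion = sq_dist.mean()  # average squared distance

print("inertia:", inertia)        # 10.0
print("distortion:", distortion)  # 2.5
```

The inertia is the quantity that scikit-learn’s KMeans exposes as the `inertia_` attribute after fitting.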

Don’t worry if this sounds confusing; there are great tutorials out there that cover the math behind the algorithm. Let’s import our libraries and launch the elbow method:

from sklearn.cluster import KMeans
from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline

K = range(1,15)
distortions = []

#Fit the method and store the inertia of each fitted model
for k in K:
    kmeans = KMeans(n_clusters = k)
    kmeans.fit(X)
    distortions.append(kmeans.inertia_)

#Plot the results
fig = plt.figure(figsize= (15,5))
plt.plot(K, distortions, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Distortion')
plt.title('Elbow Method')
plt.show()

By observing the chart we can conclude that the optimal number of clusters would be somewhere between 5 and 6. If you look at the iterations after 6, you can see that we start obtaining less informative clusters.

If you aren’t sure about the number of clusters you can use the kneed library that finds the optimal number. Let’s try it out:

#pip install kneed
from kneed import KneeLocator
kl = KneeLocator(K, distortions, curve="convex", direction="decreasing")
print('Suggested number of clusters: ', kl.elbow)

Output = 5

The silhouette method measures how similar an instance is to the cluster it was put into compared with the other clusters. Its values range from -1 to 1, where higher values indicate a better match.
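For a single instance, the score compares a, its mean distance to the other members of its own cluster, with b, its mean distance to the members of the nearest other cluster, as (b - a) / max(a, b). The sketch below computes this by hand on made-up points; the helper name `silhouette_values` is chosen for this example.

```python
import numpy as np

def silhouette_values(points, labels):
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            continue  # a cluster with one member keeps a score of 0
        # a: mean distance to the other members of the same cluster.
        a = dists[i, same].mean()
        # b: mean distance to the members of the nearest other cluster.
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Two tight, well-separated toy clusters.
points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
labels = np.array([0, 0, 1, 1])

scores = silhouette_values(points, labels)
print(scores)
print("mean silhouette:", scores.mean())
```

The mean over all instances is the single number used to compare different values of k.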

from sklearn.metrics import silhouette_score

#For the silhouette method k needs to start from 2
K = range(2,15)
silhouettes = []

#Fit the method
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10, init='random')
    kmeans.fit(X)
    silhouettes.append(silhouette_score(X, kmeans.labels_))

#Plot the results
fig = plt.figure(figsize= (15,5))
plt.plot(K, silhouettes, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Silhouette score')
plt.title('Silhouette Method')

kl = KneeLocator(K, silhouettes, curve="convex", direction="decreasing")
print('Suggested number of clusters: ', kl.elbow)

Our two methods show a different optimal number of clusters and we will go with the number 6 as the Elbow Method has shown that it should also work. Let us go ahead and build our k-Means algorithm with 6 clusters.

c = 6
#Fit the model
k_means = KMeans(n_clusters=c)
k_means.fit(X)
prediction = k_means.predict(X)

#Plot the results
centroids = k_means.cluster_centers_
fig = plt.figure(figsize = (18,10))
ax = fig.add_subplot(111)
scatter = ax.scatter(X.iloc[:,0],X.iloc[:,1], c=k_means.labels_, cmap="rainbow", label = X.index)
ax.set_title('k-Means Cluster Analysis Results')
ax.set_xlabel('Mean Return')
ax.set_ylabel('Volatility')
plt.colorbar(scatter)
#Mark the cluster centers
plt.plot(centroids[:,0], centroids[:,1], 'sg', markersize=10)
plt.show()

Quite interesting! If you look at the orange cluster we can see that it is made out of outliers (volatile stocks with a large mean return). We can either remove the outliers, add them to the blue cluster or leave them be.

In this Pairs Trading scenario, I’d prefer leaving them so we can know which stocks these are, as they would be interesting to explore further. To know how many instances each cluster has we can write the following:

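clustered_series = pd.Series(index=X.index, data=k_means.labels_.flatten())
clustered_series_all = pd.Series(index=X.index, data=k_means.labels_.flatten())
clustered_series = clustered_series[clustered_series != -1]

#Plot the number of stocks in each cluster as a horizontal bar chart
cluster_counts = clustered_series.value_counts().sort_index()
plt.barh(cluster_counts.index, cluster_counts.values)
plt.xlabel('Stocks per Cluster')
plt.ylabel('Cluster Number')
plt.show()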

What is Hierarchical Clustering?

Hierarchical Clustering is a method that groups data points into clusters based on their similarity. It can perform the grouping with an agglomerative (bottom-up) or a divisive (top-down) approach.

The main advantage that hierarchical clustering has is that it doesn’t require us to specify the number of clusters in advance.

The method performs the clustering by creating a tree of clusters by grouping and separating features on each iteration. The product of the clustering process is visualized in a figure known as “dendrogram”.

How to use Hierarchical Clustering for Pairs Trading?

When applying Hierarchical Clustering to our Pairs Trading problem, we need to know the main linkage criteria that scikit-learn offers for measuring the distance between clusters:

  • Ward linkage – merges the two clusters whose merger gives the smallest increase in within-cluster variance.
  • Average linkage – uses the average distance between all pairs of data points in two clusters.
  • Complete linkage – uses the maximum distance between data points in two clusters.
  • Single linkage – uses the minimum distance between data points in two clusters.
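The four criteria can be compared on two tiny hand-made clusters. The sketch below is plain NumPy for illustration and the points are invented; for Ward it measures the increase in the within-cluster sum of squares caused by merging the two clusters.

```python
import numpy as np

cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

# Distances between every point in A and every point in B.
pairwise = np.linalg.norm(cluster_a[:, None, :] - cluster_b[None, :, :], axis=2)

single = pairwise.min()     # closest pair of points: 3.0
complete = pairwise.max()   # farthest pair of points: 6.0
average = pairwise.mean()   # mean over all pairs: 4.5

def sum_of_squares(points):
    # Sum of squared distances from each point to the cluster mean.
    return ((points - points.mean(axis=0)) ** 2).sum()

# Ward: how much the within-cluster sum of squares grows if A and B merge.
merged = np.vstack([cluster_a, cluster_b])
ward_increase = sum_of_squares(merged) - sum_of_squares(cluster_a) - sum_of_squares(cluster_b)

print(single, complete, average, ward_increase)
```

At every step of agglomerative clustering, the pair of clusters with the smallest value of the chosen criterion is merged.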

As we want to minimize the variance within our clusters, we shall go with Ward’s linkage. Let us jump into the coding part to calculate the linkage and plot a dendrogram:

from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as shc

plt.figure(figsize=(15, 10))  
dend = shc.dendrogram(shc.linkage(X, method='ward'))

Here we can see the dendrogram where the x-axis is represented by our stocks and the y-axis represents the distance between them. The vertical line with maximum distance (blue) shows the cluster threshold.

As we can see, a cut at 13.5 will give us 4 clusters. Allow me to plot that:

plt.figure(figsize=(15, 10))  
dend = shc.dendrogram(shc.linkage(X, method='ward'))
plt.axhline(y=13.5, color='purple', linestyle='--')

Now that we know the number of clusters, we can fit the hierarchical clustering model to our data and obtain a scatter plot where the clustering output instances can be clearly seen.

#Fit the model
clusters = 4
#Ward linkage always uses Euclidean distance, so no distance argument is needed
hc = AgglomerativeClustering(n_clusters= clusters, linkage='ward')
labels = hc.fit_predict(X)

#Plot the results
fig = plt.figure(figsize=(15,10))
ax = fig.add_subplot(111)
scatter = ax.scatter(X.iloc[:,0], X.iloc[:,1], c=labels, cmap='rainbow')
ax.set_title('Hierarchical Clustering Results')
ax.set_xlabel('Mean Return')
ax.set_ylabel('Volatility')
plt.show()

Great! Now we move onto the last clustering algorithm.

What is Affinity Propagation Clustering?

Affinity Propagation Clustering is a method that creates clusters by a criterion of how well suited an instance is to be a representative of another one. Moreover, it doesn’t require a specified number of clusters in advance.

You can imagine this by instances messaging each other on how much they suit one another. After that, an instance that received messages from multiple senders will send back the revised value of attractiveness to each sender.

This messaging will proceed until an agreement is reached. When a sender gets associated with the receiver the receiver will become the exemplar. All data points with the same exemplar will then create a cluster.

How to use Affinity Propagation Clustering for Pairs Trading?

Now that we understand what the Affinity Propagation Clustering model does, we can go ahead and fit it to our data.

from sklearn.cluster import AffinityPropagation

#Fit the model
ap = AffinityPropagation()
ap.fit(X)
labels1 = ap.predict(X)

#Plot the results
fig = plt.figure(figsize=(15,10))
ax = fig.add_subplot(111)
scatter = ax.scatter(X.iloc[:,0], X.iloc[:,1], c=labels1, cmap='rainbow')
ax.set_title('Affinity Propagation Clustering Results')
ax.set_xlabel('Mean Return')
ax.set_ylabel('Volatility')
plt.show()

Wow, that’s quite a number of clusters. Let’s obtain their number and arrange them for a better look. We will do this by taking the cluster center indices and labels and plotting them. We shall also transform our data into a NumPy array:

from itertools import cycle

#Extract the cluster centers and labels
cci = ap.cluster_centers_indices_
labels2 = ap.labels_

#Print their number
clusters = len(cci)
print('Estimated number of clusters:',clusters)

#Plot the results
X_ap = np.asarray(X)
colors = cycle('cmykrgbcmykrgbcmykrgbcmykrgb')
for k, col in zip(range(clusters),colors):
    cluster_members = labels2 == k
    cluster_center = X_ap[cci[k]]
    plt.plot(X_ap[cluster_members, 0], X_ap[cluster_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col, markeredgecolor='k', markersize=12)
    for x in X_ap[cluster_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)

Estimated number of clusters: 27

For our main problem having a higher number of clusters would make more sense and it would be easier to pick the trading pairs from each of them, but let’s see what the model comparison shows us.

How to evaluate and compare clustering models?

As the clustering models are unsupervised, meaning that we don’t have the labels, we can compare the models by their silhouette score that we introduced earlier in the article.

So let’s see which method performs the best:

print("k-Means Clustering", metrics.silhouette_score(X, k_means.labels_, metric='euclidean'))
print("Hierarchical Clustering", metrics.silhouette_score(X, hc.fit_predict(X), metric='euclidean'))
print("Affinity Propagation Clustering", metrics.silhouette_score(X, ap.labels_, metric='euclidean'))
k-Means Clustering 0.3494916268886619
Hierarchical Clustering 0.3046193567096882
Affinity Propagation Clustering 0.33752158556435613

It seems that the k-Means algorithm performed the best, so let’s go with it.

How to extract the trading pairs?

In order to extract the trading pairs, we first check how many clusters and candidate pairs there are to evaluate. Each candidate pair will then go through a statistical test that looks for cointegration.

cluster_size_limit = 1000
counts = clustered_series.value_counts()
ticker_count = counts[(counts>1) & (counts<=cluster_size_limit)]
print ("Number of clusters: %d" % len(ticker_count))
print ("Number of Pairs: %d" % (ticker_count*(ticker_count-1)).sum())
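A note on the pair count: the expression `(ticker_count*(ticker_count-1)).sum()` counts every pair in both orders, so (A, B) and (B, A) are counted separately. If each pair is tested only once regardless of order, the number of tests per cluster is half of that. A small sketch with made-up tickers shows the difference:

```python
from itertools import combinations

# Toy cluster with four tickers (illustrative only).
tickers = ["A", "AAL", "AAPL", "ABBV"]
n = len(tickers)

# Every unordered pair appears exactly once.
unordered_pairs = list(combinations(tickers, 2))

print(len(unordered_pairs))  # 6, which equals n * (n - 1) / 2
print(n * (n - 1))           # 12, each pair counted in both orders
```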

Two price series are deemed cointegrated when each of them is non-stationary on its own, but some linear combination of them (the spread) is stationary, so they tend to move together (recall the Pairs Trading definition from the beginning of the article).
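The idea can be illustrated without any statistics library. The sketch below is not the `coint` test from statsmodels and it produces no p-values; it only mimics the two steps behind an Engle-Granger style check on synthetic data. The series, the helper name `reversion_speed`, and the chosen coefficients are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Synthetic series (assumption): x is a random walk, y follows 2*x plus stationary noise.
x = np.cumsum(rng.normal(0, 1, n))
y = 2.0 * x + rng.normal(0, 1, n)

def reversion_speed(series):
    # Slope of the regression of today's change on yesterday's level.
    # Near 0: no pull back to a mean (random-walk-like behaviour).
    # Clearly negative: deviations tend to be corrected (mean reversion).
    slope, _ = np.polyfit(series[:-1], np.diff(series), 1)
    return slope

# Step 1: estimate the long-run relation y ~ beta * x + alpha.
beta, alpha = np.polyfit(x, y, 1)

# Step 2: check whether what is left over (the spread) reverts to its mean.
spread = y - (beta * x + alpha)

print("x alone:", reversion_speed(x))
print("spread: ", reversion_speed(spread))
```

A proper test compares a statistic of this kind against dedicated critical values, which is what a library routine takes care of.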

Let’s set up a function that finds the cointegrated pairs within a cluster. I salvaged this code from the Quantopian platform, which has since shut down.

def find_cointegrated_pairs(data, significance=0.05):
    n = data.shape[1]    
    score_matrix = np.zeros((n, n))
    pvalue_matrix = np.ones((n, n))
    keys = data.keys()
    pairs = []
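    #Note: range(1) only tests the first ticker of the cluster against the rest;
    #use range(n) to test every unique pair (much slower)
    for i in range(1):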
        for j in range(i+1, n):
            S1 = data[keys[i]]            
            S2 = data[keys[j]]
            result = coint(S1, S2)
            score = result[0]
            pvalue = result[1]
            score_matrix[i, j] = score
            pvalue_matrix[i, j] = pvalue
            if pvalue < significance:
                pairs.append((keys[i], keys[j]))
    return score_matrix, pvalue_matrix, pairs

Now we shall look for the cointegrated pairs within clusters and return them:

from statsmodels.tsa.stattools import coint

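cluster_dict = {}

#data1 is assumed to hold the cleaned price table (dates as rows, tickers as columns),
#i.e. the prices from before `data` was overwritten with returns and volatility
for i, clust in enumerate(ticker_count.index):
    tickers = clustered_series[clustered_series == clust].index
    score_matrix, pvalue_matrix, pairs = find_cointegrated_pairs(data1[tickers])
    cluster_dict[clust] = {}
    cluster_dict[clust]['score_matrix'] = score_matrix
    cluster_dict[clust]['pvalue_matrix'] = pvalue_matrix
    cluster_dict[clust]['pairs'] = pairs

#Collect the pairs from all clusters into one list
pairs = []
for cluster in cluster_dict.keys():
    pairs.extend(cluster_dict[cluster]['pairs'])

print ("Number of pairs:", len(pairs))
print ("In those pairs, we found %d unique tickers." % len(np.unique(pairs)))
print(pairs)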
Number of pairs: 20
In those pairs, we found 25 unique tickers.
[('A', 'AVG0'), ('A', 'CMI'), ('A', 'DHI'), ('A', 'HOLX'), ('A', 'ISRG'), ('A', 'NKE'), ('A', 'ORCL'), ('A', 'TAT'), ('A', 'TMUS'), ('A', 'UNH'), ('ABBV', 'ABC'), ('ABBV', 'JBHT'), ('ABBV', 'NI'), ('AFL', 'HAS'), ('AFL', 'KIM'), ('AAPL', 'ADSK'), ('AAPL', 'CTLT'), ('AAPL', 'QRVO'), ('AAL', 'FANG'), ('AAL', 'UNM')]

Now that we see our trading pairs, let’s go ahead and visualize them by using TSNE (t-distributed stochastic neighbor embedding). TSNE is used for visualizing high-dimensional data by giving each instance a location in a 2d or 3d map.

Let’s import the remaining two libraries and set up a data frame for our trading pairs.

from sklearn.manifold import TSNE
import matplotlib.cm as cm

stocks = np.unique(pairs)
X_data = pd.DataFrame(index=X.index, data=X).T
in_pairs_series = clustered_series.loc[stocks]
stocks = list(np.unique(pairs))
X_pairs = X_data.T.loc[stocks]

Now we are ready to launch the TSNE algorithm and plot the results:

X_tsne = TSNE(learning_rate=30, perplexity=5, random_state=42, n_jobs=-1).fit_transform(X_pairs)
plt.figure(1, facecolor='white',figsize=(15,10))
#Draw a line between the two stocks of every pair
for pair in pairs:
    ticker1 = pair[0]
    loc1 = X_pairs.index.get_loc(pair[0])
    x1, y1 = X_tsne[loc1, :]
    ticker2 = pair[1]
    loc2 = X_pairs.index.get_loc(pair[1])
    x2, y2 = X_tsne[loc2, :]
    plt.plot([x1, x2], [y1, y2], '-', alpha=0.3, c='b')
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], s=215, alpha=0.8, c=in_pairs_series.values, cmap=cm.Paired)
plt.title('TSNE Visualization of Pairs')

#Label each point with its ticker
for x,y,name in zip(X_tsne[:,0],X_tsne[:,1],X_pairs.index):
    label = name
    plt.annotate(label, (x,y), textcoords="offset points", xytext=(0,10), ha='center')
plt.show()

When you’ve obtained the results from your project now is the time to present them to others.

How to efficiently present your findings?

In order to efficiently present your findings, go over the main ML project steps and say a few words on each: what your idea for the step was and what you obtained from it.

We did that along the way in this article and I hope that you’ve learned something interesting, and above all, useful. In order to hit the nail on its head, let’s go to tipranks and compare the stocks from the green cluster.

Now that you have the statistical tools to find similar assets, check out our article on how to use them in real-world trading:

What are the 3 Common Machine Learning Analysis/Testing Mistakes?

When you run your analysis, there are 3 common mistakes to take note of:

  • Overfitting
  • Look-ahead Bias
  • P-hacking

Do check out this lecture PDF to learn more: 3 Big Mistakes of Backtesting – 1) Overfitting 2) Look-Ahead Bias 3) P-Hacking

Full code

GitHub Link

Igor Radovanovic