Sklearn – An Introduction Guide to Machine Learning

Last Updated on April 3, 2023

Table of Contents

  1. What is Sklearn?
  2. What is Sklearn used for?
  3. How to download Sklearn for Python?
  4. How to pick the best scikit-learn model?
  5. Sklearn preprocessing – Prepare the data for analysis
    1. Sklearn feature encoding
    2. Sklearn data scaling
    3. Sklearn missing values
    4. Sklearn train test split
  6. Sklearn Regression – Predict the future
    1. Sklearn Linear Regression
    2. Other Sklearn regression models
  7. Sklearn Classification – Did I just see a cat?
    1. Sklearn Decision Tree Classifier
    2. Other Sklearn classification models
  8. Sklearn Clustering – Create groups of similar data
    1. Sklearn DBSCAN
    2. Other Sklearn clustering models
  9. Sklearn Dimensionality Reduction – Reducing random variables
    1. Sklearn PCA
    2. Other Sklearn Dimensionality Reduction models
  10. What are the 3 Common Machine Learning Analysis/Testing Mistakes?
  11. Full code

What is Sklearn?

Sklearn (scikit-learn) is a Python library that provides a wide range of unsupervised and supervised machine learning algorithms.

It is also one of the most used machine learning libraries and is built on top of NumPy and SciPy.

Link: https://scikit-learn.org/stable/

What is Sklearn used for?

The Sklearn Library is mainly used for modeling data and it provides efficient tools that are easy to use for any kind of predictive data analysis.

The main use cases of this library can be categorized into 6 categories which are the following:

  • Preprocessing
  • Regression
  • Classification
  • Clustering
  • Model Selection
  • Dimensionality Reduction

As this article is mainly aimed at beginners, we will stick to the core concepts of each category and explore some of its most popular features and algorithms.

Advanced readers can use this article as a recollection of some of the main use cases and intuitions behind popular sklearn features that most ML practitioners couldn’t live without.

Each category will be explained in a beginner-friendly and illustrative way followed by the most used models, the intuition behind them, and hands-on experience. But first, we need to set up our sklearn library.

How to download Sklearn for Python?

Sklearn can be installed for Python by using the pip install command as shown below:

$ pip install -U scikit-learn

Sklearn developers strongly advise using a virtual environment (venv) or a conda environment when working with the library as it helps to avoid potential conflicts with other packages.
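
Once the install finishes, a quick sanity check is to import the package and print its version. This is only a minimal sketch, and the variable name installed_version is made up for illustration:

```python
import sklearn

# The package is installed as "scikit-learn" but imported as "sklearn"
installed_version = sklearn.__version__
print(f"scikit-learn version: {installed_version}")
```

If the import fails, the most likely cause is that pip installed the package into a different environment than the one you are running.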

How to pick the best Sklearn model?

When it comes to picking the best Sklearn model, there are many factors that come into play that range from experience and data to the problem scope and math behind each algorithm.

Sometimes all chosen algorithms can have similar results and, depending on the problem setting, you will need to pick the one that is the fastest or the one that generalizes the best on big data.

It may happen that none of your promising models perform well enough and that you will simply need to combine multiple models (e.g. an ensemble), make your own custom-made model, or go for a deep learning approach.

As picking the right model is one of the foundations of your problem solving, it is wise to read up on as many models and their uses as you can.

As model selection would be an article, or even a book, for itself, I’ll only provide some rough guidelines in the form of questions that you’ll need to ask yourself when deciding which model to deploy.

How much data do you have?

Some models are better on smaller datasets while others require more data and tend to generalize better on larger datasets (e.g. SGD Regressor vs Lasso Regression).

What are the main characteristics of your data?

Is your data linear, quadratic, or all over the place? What do your distributions look like? Is your data made out of numbers or strings? Is the data labeled?

Sklearn preprocessing – Prepare the data for analysis

When you think of data you probably have in mind a ginormous excel spreadsheet full of rows and columns with numbers in them. Well, the case is that data can come in a plethora of formats like images, videos and audio.

The main job of data preprocessing is to turn this data into a readable format for our algorithm. A machine can’t just “listen in” to an audiotape to learn voice recognition, rather it needs the audio to be converted to numbers.

The main building blocks of our dataset are called features which can be categorical or numerical. Simply put, categorical data is used to group data with similar characteristics while numerical data provides information with numbers.

As the features come from two different categories, they need to be treated (preprocessed) in different ways. The best way to learn is to start coding along with me.

Sklearn feature encoding

Feature encoding is a method where we transform categorical variables into numerical ones. The most popular ways of doing so are known as One Hot Encoding and Label (ordinal) encoding.

For example, a person can have features such as [“male”, “female”], [“from US”, “from UK”], [“uses Binance”, “uses Coinbase”]. These features can be encoded as numbers, e.g. [“male”, “from US”, “uses Coinbase”] would become [1, 1, 1] when each feature’s categories are numbered in alphabetical order.

This can be done by using the scikit-learn OrdinalEncoder() function as follows:

from sklearn import preprocessing

X = [['male', 'from US', 'uses Coinbase'], ['female', 'from UK', 'uses Binance']]
encode = preprocessing.OrdinalEncoder()
encode.fit(X)

encode.transform([['male', 'from UK', 'uses Coinbase']])

Output: array([[1., 0., 1.]])

As you can see, it transformed the features into integers. But such integers imply an order between the categories (e.g. that “from US” is greater than “from UK”) which doesn’t exist, and many scikit-learn estimators would treat them as continuous values. In order to fix this, the most popular method is one hot encoding.

One hot encoding, also known as dummy encoding, can be obtained through the scikit-learn OneHotEncoder() class. It works by transforming each categorical feature with N possible values into N binary features where the category that is present is represented as 1 and the rest as 0.

The following example will hopefully make it clear:

one_hot = preprocessing.OneHotEncoder()
one_hot.fit(X)

one_hot.transform([['male', 'from UK', 'uses Coinbase'],
                   ['female', 'from US', 'uses Binance']]).toarray()

Output: array([[0., 1., 1., 0., 0., 1.],
              [1., 0., 0., 1., 1., 0.]])

To see what your encoded features are exactly you can always use the .categories_ attribute as shown below:

one_hot.categories_

Output: [array(['female', 'male'], dtype=object),
         array(['from UK', 'from US'], dtype=object),
         array(['uses Binance', 'uses Coinbase'], dtype=object)]
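
If you ever need to go from the binary columns back to the original labels, the encoder can undo its own work. Here is a small sketch that reuses the same toy X as above; the names encoded and decoded are just for illustration:

```python
from sklearn import preprocessing

X = [['male', 'from US', 'uses Coinbase'], ['female', 'from UK', 'uses Binance']]

one_hot = preprocessing.OneHotEncoder()
encoded = one_hot.fit_transform(X).toarray()

# Map the binary columns back to the original category labels
decoded = one_hot.inverse_transform(encoded)
print(decoded)
```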

Sklearn data scaling

Feature scaling is a preprocessing method that brings features onto a comparable scale, which helps improve some machine learning models. The two most common scaling techniques are known as standardization and normalization.

Standardization makes the values of each feature in the data have zero-mean and unit variance. This method is commonly used with algorithms such as SVMs and Logistic regression.

Standardization is done by subtracting the mean from each feature and dividing it by the standard deviation. It’s some basic statistics and math, but don’t worry if you don’t get it. There are many tutorials that cover it.

In scikit-learn we use the StandardScaler() function to standardize the data. Let us create a random NumPy array and standardize the data by giving it a zero mean and unit variance.

import numpy as np

scaler = preprocessing.StandardScaler()
X = np.random.rand(3,4)
X
X_scaled = scaler.fit_transform(X)
X_scaled
print(f'The scaled mean is: {X_scaled.mean(axis=0)}\nThe scaled standard deviation is: {X_scaled.std(axis=0)}')

Wait for a second! Didn’t you say that all mean values need to be 0?

Well, in practice these values are so close to 0 that they can be viewed as zero. Moreover, due to limitations with numerical representations the scaler can only get the mean really close to a zero.
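
To convince yourself that there is no magic involved, you can redo the standardization by hand and compare it with the scaler. This is a minimal sketch on a seeded random array (the seed and the name X_manual are my own choices for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable
X = rng.random((3, 4))

# Subtract each column's mean and divide by its (population) standard deviation
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
X_scaled = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_scaled))
print(np.allclose(X_scaled.mean(axis=0), 0.0))
```

Note that np.allclose() is used instead of == precisely because of the tiny floating point leftovers mentioned above.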

Let’s move onto the next scaling method called normalization. Normalization is a term with many definitions that change from one field to another and we are going to define it as follows:

Normalization is a scaling technique in which values are shifted and rescaled so that they end up being between 0 and 1. It is also known as Min-Max scaling. In scikit-learn it can be applied with the MinMaxScaler() class. Don’t confuse it with the Normalizer() class, which rescales each row (sample) to unit norm instead.

norm = preprocessing.MinMaxScaler()

X_norm = norm.fit_transform(X)
X_norm

So, which one is better? Well, it depends on your data and the problem you’re trying to solve. Standardization is often a good choice when the data roughly follows a Normal distribution, while normalization is often preferred when it doesn’t. If in doubt, try both and see which one improves the model.
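
The min-max definition of normalization given above (values shifted and rescaled to lie between 0 and 1) can be checked by hand. The sketch below assumes scikit-learn’s MinMaxScaler for the column-wise version and also shows Normalizer, which does something different: it rescales each row to unit length. The toy array is made up:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

rng = np.random.default_rng(0)
X = rng.random((3, 4)) * 10

# Min-max scaling: (x - column min) / (column max - column min)
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_minmax = MinMaxScaler().fit_transform(X)

# Normalizer works row by row: each sample is divided by its Euclidean norm
X_unit = Normalizer().fit_transform(X)
row_norms = np.linalg.norm(X_unit, axis=1)

print(np.allclose(X_manual, X_minmax))
print(row_norms)
```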

Sklearn missing values

In scikit-learn we can use the sklearn.impute module to fill in the missing values. The most used classes are SimpleImputer(), KNNImputer() and IterativeImputer().

When you encounter a real-life dataset it will almost certainly have missing values in it that can be there for various reasons ranging from rage quits to bugs and mistakes.

There are several ways to treat them. One way is to delete the whole row (candidate) from the dataset but it can be costly for small to average datasets as you can delete plenty of data.

Some better ways would be to replace the missing values with the mean or median of the feature. You could also try, if possible, to categorize your subject into their subcategory and take the mean/median of it as the new value.
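
The subcategory idea is easiest to do with pandas rather than with scikit-learn itself. Here is a rough sketch on a made-up table; the column names group and income and all the numbers are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'group' and 'income' are made-up column names
df = pd.DataFrame({
    'group': ['student', 'student', 'student', 'employed', 'employed', 'employed'],
    'income': [10.0, np.nan, 14.0, 50.0, 70.0, np.nan],
})

# Median income within each group, aligned with the original rows
group_median = df.groupby('group')['income'].transform('median')
df['income_filled'] = df['income'].fillna(group_median)
print(df)
```

The missing student income becomes 12 (the median of 10 and 14) and the missing employed income becomes 60, instead of both getting one global value.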

Let’s use the SimpleImputer() to replace the missing value with the mean:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit_transform([[10,np.nan],[2,4],[10,9]])

The strategy hyperparameter can be changed to median, most_frequent, and constant. But Igor, can we impute missing strings? Yes, you can!

import pandas as pd

df = pd.DataFrame([['i', 'g'],
                   ['o', 'r'],
                   ['i', np.nan],
                   [np.nan, 'r']], dtype='category')

imputer = SimpleImputer(strategy='most_frequent')
imputer.fit_transform(df)

If you want to keep track of the missing values and the positions they were in, you can use the MissingIndicator() function:

from sklearn.impute import MissingIndicator

# Imagine the 3's were imputed by the SimpleImputer()
Y = np.array([[3,1], 
              [5,3],
              [9,4], 
              [3,7]])

missing = MissingIndicator(missing_values=3)
missing.fit_transform(Y)
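
If you’d rather impute and keep track of the positions in one go, SimpleImputer has an add_indicator option that, as far as I know, stacks the indicator columns next to the imputed data. A minimal sketch on the same small array as before:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[10, np.nan],
              [2, 4],
              [10, 9]])

# add_indicator=True appends one 0/1 column per feature that had missing values
imputer = SimpleImputer(strategy='mean', add_indicator=True)
X_out = imputer.fit_transform(X)
print(X_out)
```

The first two columns are the imputed features and the last column marks the row where the second feature was originally missing.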

The IterativeImputer() is fancy, as it basically goes across the features and uses the feature with missing values as the label and the other features as the inputs of a regression model. Then it predicts the missing values, repeating the process for the number of iterations we specify.

If you’re not sure how regression algorithms work, don’t worry as we will soon go over them. As the IterativeImputer() is an experimental feature we will need to enable it before use:

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=15, random_state=42)
imputer.fit_transform(([1,5],[4,6],[2, np.nan], [np.nan, 8]))

Sklearn train test split

In Sklearn the data can be split into test and training groups by using the train_test_split() function which is a part of the model_selection module.

But why do we need to split the data into two groups? Well, the training data is the data on which we fit our model and it learns on it. In order to evaluate how the model performs on unseen data, we use test data.

An important thing, in most cases, is to allocate more data to the training set. When speaking of the ratio of this allocation there aren’t any hard rules. It all depends on the size of your dataset.

The most used allocation ratio is 80% for training and 20% for testing. Keep in mind that most people use a training/development set split but call the dev set the test set. This is more of a conceptual mistake.

Now let us create a random dataset and split it into training and testing sets:

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Create a random dataset
X, y = make_blobs(n_samples=1500)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

print(f'X training set {X_train.shape}\nX testing set {X_test.shape}\ny training set {y_train.shape}\ny testing set {y_test.shape}')
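
The training/development/test distinction mentioned earlier can be sketched by simply calling train_test_split() twice. The 60/20/20 proportions and the variable names below are my own assumptions, not a rule:

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=1500, random_state=42)

# First hold out 20% of all rows as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Then carve a development set out of the remainder (25% of 80% = 20% overall)
X_train, X_dev, y_train, y_dev = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(X_train.shape, X_dev.shape, X_test.shape)
```

You would tune your model against the dev set and touch the test set only once, at the very end.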

If your dataset is big enough you’ll often be fine with using this way to split the data. But some datasets come with a severe imbalance in them.

For example, if you’re building a model to detect outliers that default on their credit cards you will most often have a very small percentage of them in your data.

This means that the train_test_split() function will most likely allocate too little of the outliers to your training set and the ML algorithm won’t learn to detect them efficiently. Let’s simulate a dataset like that:

from sklearn.datasets import make_classification
from collections import Counter

# Create an imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0, random_state=42)
print(f'Number of y before splitting is {Counter(y)}')

# Split the data the usual way
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
print(f'Number of y in the training set after splitting is {Counter(y_train)}')
print(f'Number of y in the testing set after splitting is {Counter(y_test)}')

As you can see, the training set has 43 examples of the minority class (y = 1) while the testing set has only 7! In order to combat this, we can split the data into training and testing by stratification which is done according to y.

This means that each class of y will be proportionally represented in both training and testing sets (20% of each class goes to the test set). In scikit-learn this is done by adding the stratify argument as shown below:

# Split the data by stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=42)
print(f'Number of y in the training set after splitting is {Counter(y_train)}')
print(f'Number of y in the testing set after splitting is {Counter(y_test)}')
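
A quick way to confirm that stratification did its job is to compare the share of the minority class in both subsets. This sketch recreates the same simulated dataset; the names train_share and test_share are only illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# Share of the minority class (label 1) in each subset
train_share = (y_train == 1).mean()
test_share = (y_test == 1).mean()
print(f'train: {train_share:.3f}, test: {test_share:.3f}')
```

The two shares should come out practically identical, which is exactly what a plain random split cannot promise.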

For a more in-depth guide and understanding of the train test split and cross-validation, please visit the following article that is found on our blog:

https://algotrading101.com/learn/train-test-split/

For more information about scikit-learn preprocessing functions go here.

Sklearn Regression – Predict the future

The regression method is used for prediction and forecasting and in Sklearn it can be accessed through the linear_model module.

In regression tasks, we want to predict the outcome y given X. For example, imagine that we want to predict the price of a house (y) given features (X) like its age and number of rooms. The most simple regression model is linear regression.

Sklearn Linear Regression

The Sklearn Linear Regression model can be used through the LinearRegression() class. The linear regression model assumes that the dependent variable (y) is a linear combination of the features (Xi).

Allow me to illustrate how linear regression works. Imagine that you were tasked to fit a red line so it resembles the trend of the data while minimizing the distance between the line and each point as shown below:

By eye-balling it should look something like this:

Let’s import the sklearn Boston house-price dataset so we can predict the median house value (MEDV) by the house’s age (AGE) and the number of rooms (RM).

Keep in mind that this is known as a multiple linear regression as we are using two features.

from sklearn import linear_model, datasets
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd

# Load the Boston dataset
# Note: load_boston() was removed in scikit-learn 1.2, so this call needs an older version
boston = datasets.load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
# Add the target variable (label)
df['MEDV'] = boston.target
df.head()

Now we will set our features (X) and the label (y). Notice how we use the NumPy np.c_ object that concatenates the columns for us.

# Set the features and label
X = pd.DataFrame(np.c_[df['AGE'], df['RM']], columns = ['AGE','RM'])
y = df['MEDV']

Now we will split the data into training and test sets which we learned earlier how to do:

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Let’s plot each of our features and see how they look. Try to imagine where the regression line would go.

# Plot the features
plt.scatter(X['RM'], y)
plt.scatter(X['AGE'], y)

You can already see that the data is a bit messy. The RM feature appears more linear and is prone to higher correlation with the label while the age feature shows the opposite. We also have outliers.

For this article, we won’t bother to clean up the data as we’re not interested in creating a perfect model.

The next thing that we want to do is to fit our model and evaluate some of its core metrics:

regressor = linear_model.LinearRegression()
model = regressor.fit(X_train, y_train)

print('Coefficient of determination:', model.score(X, y))
print('Intercept:', model.intercept_)
print('slope:', model.coef_)

Output:
Coefficient of determination: 0.529269171356878
Intercept: -28.203538066489102
slope: [-0.06640957  8.7957305 ]

The coefficient of determination (R2) tells how much of the variance, in our case the variance of the median house value, our model explains. As we see it explains 53% of the variance which is okay.
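
If you’re curious what .score() computes, R2 is one minus the ratio of the unexplained variation to the total variation. Below is a minimal sketch on synthetic data from make_regression (a stand-in I picked for illustration, not the housing data), with made-up names such as ss_res and ss_tot:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption for illustration, not the housing data)
X, y = make_regression(n_samples=200, n_features=2, noise=20.0, random_state=42)
model = LinearRegression().fit(X, y)

y_pred = model.predict(X)
ss_res = np.sum((y - y_pred) ** 2)      # variation the model fails to explain
ss_tot = np.sum((y - y.mean()) ** 2)    # total variation in y
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X, y))
```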

For the brevity of the article, we won’t go into math now but feel free to look up the in-depth explanation behind the formula. And you don’t need to know it in order to use the regression, not saying that you shouldn’t.

The .intercept_ shows the bias b0, while the .coef_ is an array that contains our b1 and b2. In our case, the intercept is –28.20 and it represents the value of the predicted response when X1 = X2 = 0.

When we look at the slope, we can see that an increase in X1 (AGE) by 1 lowers the predicted median house value by about 0.07, while an increase in X2 (RM) by 1 raises it by about 8.80, with the other feature held fixed.
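
The roles of the intercept and the slopes become clearer when you rebuild a prediction by hand. The sketch below again uses synthetic two-feature data as a stand-in (an assumption for illustration), so the numbers will differ from the housing example:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data with two features
X, y = make_regression(n_samples=100, n_features=2, noise=10.0, random_state=0)
model = LinearRegression().fit(X, y)

b0 = model.intercept_
b1, b2 = model.coef_

# y_hat = b0 + b1 * X1 + b2 * X2, computed by hand for every row
y_manual = b0 + b1 * X[:, 0] + b2 * X[:, 1]

# Raising X1 by one unit (X2 held fixed) changes the prediction by exactly b1
x = np.array([[1.0, 2.0]])
x_plus = np.array([[2.0, 2.0]])
delta = model.predict(x_plus)[0] - model.predict(x)[0]

print(np.allclose(y_manual, model.predict(X)), delta, b1)
```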

Let’s see how good your regression line predictions were:

# Age regression line (a one-feature fit kept in local variables so the fitted model stays untouched)
plt.plot(X['AGE'], y, 'o')
slope, intercept = np.polyfit(X['AGE'], y, 1)
plt.plot(X['AGE'], slope*X['AGE']+intercept, color='red')
plt.show()
# Room number regression line
plt.plot(X['RM'], y, 'o')
slope, intercept = np.polyfit(X['RM'], y, 1)
plt.plot(X['RM'], slope*X['RM']+intercept, color='red')
plt.show()

Now, let us predict some data and use a sklearn metric that will tell us how the model is performing:

y_test_predict = regressor.predict(X_test)
print('predicted response:', y_test_predict, sep='\n')
from sklearn.metrics import mean_squared_error

rmse = (np.sqrt(mean_squared_error(y_test, y_test_predict)))
print(rmse)

Output: 6.315423538049165

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are. It tells us how concentrated the data is around the regression line.
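
RMSE is simple enough to compute by hand, which is a nice way to see what the metric is doing. A tiny sketch with made-up true values and predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true values and predictions, purely for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y_true - y_pred               # [1, 0, -1, -2]
rmse_manual = np.sqrt(np.mean(residuals ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse_manual, rmse_sklearn)
```

Strictly speaking, RMSE equals the standard deviation of the residuals only when their mean is zero; in this toy case the residuals average -0.5, so the two numbers differ slightly.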

In our case, the RMSE is too high for our liking. I’ll task you to try out other features (LSTAT and RM) and lower the RMSE. What happens when you use those two or more? Which features make the most sense to use?

Feel free to play around and check the Full code section to see some guidelines.

Other Sklearn regression models

There are various regression models that may be more useful and fit the data better than the simple linear regression, and those are the Lasso, Elastic-Net, Ridge, Polynomial, and Bayesian regression.

For more information about them go here.

Sklearn Classification

A classification problem in ML involves teaching a machine how to group data together to match the specified criteria. Some of the most popular classification models in Sklearn come from the tree module.

Every day you perform classification. For example, when you go to a grocery store you can easily group different foods by their food group (fruit, meat, grain, etc.).

When it comes to more complex decisions in the fields of medicine, trading, and politics, we’d like some good ML algorithms to aid our decision-making process.

Sklearn Decision Tree Classifier

In Sklearn, the Decision Tree classifier can be accessed by using the DecisionTreeClassifier() class which is a part of the tree module.

The main goal of a Decision Tree algorithm is to predict the value of the target variable (label) by learning simple decision rules deduced from the data features. For example, look at my simple decision tree below:

Here are some main characteristics of a Decision Tree Classifier:

  • It is made out of Nodes and Branches
  • Branches connect Nodes
  • The top Node is called the Root Node (“Go outside”)
  • A Node from which new nodes arise is called a Parent Node (e.g. the “Is it raining?” Node)
  • A node without a Child Node is called a Leaf Node (e.g. the “Classic programmer” Node)

The good thing about a Decision Tree Classifier is that it is easy to visualize and interpret. It also requires little to no data preparation. The bad thing about it is that minor changes in the data can change it considerably.

For a more in-depth understanding of its pros and cons go here.

Now, let’s create a decision tree on the popular iris dataset. The dataset is made out of 3 plant species and we’ll want our tree to aid us in deciding which species our plant belongs to according to its petal/sepal width and length.

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
import graphviz

# Obtain the data and fit the model
X, y = load_iris(return_X_y=True)
dtc = DecisionTreeClassifier()
dtc = dtc.fit(X, y)

# Graph the Tree
iris = load_iris()
dot_data = tree.export_graphviz(dtc, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph

Take note that “Gini” measures impurity. A node is “pure” when it has 0 Gini which happens when all training instances it applies to belong to the same class.
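
Gini impurity is just one minus the sum of the squared class proportions in a node, so you can compute it yourself. The helper gini_impurity below is a hypothetical function written for this illustration, not a scikit-learn API:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

# A node holding a single class is pure; an even two-class mix is not
pure_node = gini_impurity(['setosa'] * 50)
mixed_node = gini_impurity(['versicolor'] * 50 + ['virginica'] * 50)
print(pure_node, mixed_node)
```

The pure node scores 0 and the 50/50 node scores 0.5, which is the worst case for two classes.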

Keep in mind that all algorithms have their hyperparameters which can be tuned to result in a better model. For example, you can set the Decision Tree to only go to a certain depth, to have a certain allowed number of leaves, etc.

To see what the default hyperparameters of an untouched Decision Tree Classifier are and what each of them does, please visit the scikit-learn documentation.
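
As a small illustration of the depth and leaf limits, here is a sketch that constrains the tree on the iris data. The specific values (2 levels, 3 leaves) are arbitrary choices of mine, not recommended settings:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Limit the tree to 2 levels of splits and at most 3 leaves
small_tree = DecisionTreeClassifier(max_depth=2, max_leaf_nodes=3, random_state=42)
small_tree.fit(X, y)

print(small_tree.get_depth(), small_tree.get_n_leaves())
print(small_tree.get_params())  # every hyperparameter with its current value
```

Calling get_params() on any estimator is a handy way to see its hyperparameters without leaving your notebook.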

Other Sklearn classification models

Depending on the problem and your data, you might want to try out other classification algorithms that Sklearn has to offer. For example, SVC, Random Forest, AdaBoost, GaussianNB, or KNeighbors Classifier.

If you want to see how they compare to each other go here.
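
To get a first feel for how a few of these classifiers stack up on your own data, you could compare them with cross-validation. This is only a rough sketch on the iris dataset with mostly default settings, so don’t read too much into the exact scores:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

candidates = {
    'SVC': SVC(),
    'RandomForest': RandomForestClassifier(n_estimators=50, random_state=42),
    'GaussianNB': GaussianNB(),
    'KNeighbors': KNeighborsClassifier(),
}

# Mean accuracy over 5 cross-validation folds for each candidate
scores = {name: cross_val_score(clf, X, y, cv=5).mean() for name, clf in candidates.items()}
for name, score in scores.items():
    print(f'{name}: {score:.3f}')
```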

Sklearn Clustering – Create groups of similar data

Clustering is an unsupervised machine learning problem where the algorithm needs to find relevant patterns on unlabeled data. In Sklearn these methods can be accessed via the sklearn.cluster module.

Below you can see an example of the clustering method:

Sklearn DBSCAN

In Sklearn, the DBSCAN clustering model can be utilized by using the DBSCAN() class which is a part of the sklearn.cluster module.

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. As the model doesn’t assume that clusters are convex (unlike, for example, K-Means), it is mostly used when the clusters can be in any shape or size.

The DBSCAN algorithm finds clusters by looking for areas with high density that are separated by areas of low density. The algorithm has two main parameters being min_samples and eps.

High min_samples and low eps indicate a higher density needed in order to create a cluster. The min_samples parameter controls how sensitive the algorithm is towards noise (higher values mean that it is less sensitive).

On the other hand, the eps parameter controls the local neighborhood of the points. If it is too high all data will end up in one big cluster, and if it is too low most points won’t be clustered at all and will be labeled as noise (-1).
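
The two extremes of eps are easy to try out. DBSCAN marks points that belong to no cluster with the label -1, so the sketch below simply looks at which labels come back; the eps values 0.001 and 10.0 are deliberately exaggerated picks of mine:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler

X, _ = make_circles(n_samples=800, factor=0.3, noise=0.1, random_state=42)
X = StandardScaler().fit_transform(X)

# Extreme eps values chosen only to make the effect obvious
labels_tiny = DBSCAN(eps=0.001, min_samples=10).fit_predict(X)
labels_huge = DBSCAN(eps=10.0, min_samples=10).fit_predict(X)

print(set(labels_tiny.tolist()))  # only the noise label -1 is expected
print(set(labels_huge.tolist()))  # a single cluster label is expected
```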

Enough theorizing, let’s jump to the coding part! We will generate some data and fit the DBSCAN clustering algorithm on it. We will also play a bit with its parameters.

Let’s import the libraries we need, create the data, scale it and fit the model:

from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Make the data and scale it
X, y = make_circles(n_samples=800, factor=0.3, noise=0.1, random_state=42)
X = StandardScaler().fit_transform(X)

# Fit the algorithm
y_predicted = DBSCAN(eps=0.35, min_samples=10).fit_predict(X)

Now, let’s see how our model performed:

# Visualize the data
plt.scatter(X[:,0], X[:,1], c=y_predicted)

Here we can easily spot two clusters, they even resemble an eye (I’m tempted to change the colors to make it look like the eye of Sauron). All models have their performance metrics and let’s check out the main ones.

# Evaluation Metrics
print('Number of clusters: {}'.format(len(set(y_predicted[np.where(y_predicted != -1)]))))
print('Homogeneity: {}'.format(metrics.homogeneity_score(y, y_predicted)))
print('Completeness: {}'.format(metrics.completeness_score(y, y_predicted)))

Output:
Number of clusters: 2
Homogeneity: 1.0000000000000007
Completeness: 0.9691231370370732

What would happen if we changed the eps value to 0.4?

y_predicted = DBSCAN(eps=0.4, min_samples=10).fit_predict(X)
plt.scatter(X[:,0], X[:,1], c='orangered')
plt.title('I see you!')

Couldn’t resist.

For a more hands-on experience in solving problems with clustering, check out our article on finding trading pairs for the pairs trading strategy with machine learning.

Other Sklearn clustering models

Depending on the clustering problem, you might want to use other clustering algorithms and the most popular ones are K-Means, Hierarchical, Affinity Propagation, and Gaussian mixtures clustering.

If you want to learn the in-depth theory behind clustering and get introduced to various models and the math behind them, go here.

Sklearn Dimensionality Reduction – Reducing random variables

Dimensionality reduction is a method where we want to shrink the size of data while preserving the most important information in it. In Sklearn these methods can be accessed from the decomposition module.

As humans, we usually think in 4 dimensions (if you count time as one) up to a maximum of 6-7 if you are a quantum physicist. Data can easily go beyond that and we need to reduce it to lower dimensions so it can be observed.

Sklearn PCA

PCA (Principal Component Analysis) is a linear technique for dimensionality reduction. It basically does linear mapping of the data to a lower dimension while maximizing the preserved variance of data.

PCA can be used for an easier visualization of data and as a preprocessing step to speed up the performance of other machine learning algorithms. Let’s go back to our iris dataset and make a 2d visualization from its 4d structure.

Firstly, we will load the required libraries, obtain the dataset, scale the data and check how many dimensions we have:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

# Load the data and scale it
X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
print(f'The number of dimensions in X is {X.shape[1]}')

Output: The number of dimensions in X is 4

Now we will set our PCA and fit it to the data:

# Load PCA and specify the number of dimensions aka components
pca = PCA(n_components=2)
pc = pca.fit_transform(X)
print(f'The number of reduced dimensions is {pc.shape[1]}')

Output: The number of reduced dimensions is 2

Let’s store the data into a pandas data frame and recode the numerical target features to categorical:

# Put the data into a pandas data frame
df = pd.DataFrame(data = pc, columns = ['pc_1', 'pc_2'])
df['target'] = y
df.head()
# Recode the numerical data to categorical
def recoding(data):
    if data == 0:
        return 'iris-setosa'
    elif data == 1:
        return 'iris-versicolor'
    else:
        return 'iris-virginica'
    
df['target'] = df['target'].apply(recoding)
df.head()

And now, for the finale, we plot the data:

# Plot the data
fig = plt.figure(figsize = (12,10))
ax = fig.add_subplot(1,1,1) 
ax.set_xlabel('Principal Component 1', fontsize = 17)
ax.set_ylabel('Principal Component 2', fontsize = 17)
ax.set_title('2 component PCA', fontsize = 20)
targets = ['iris-setosa', 'iris-versicolor', 'iris-virginica']
colors = ['r', 'g', 'b']
for target, color in zip(targets,colors):
    indicesToKeep = df['target'] == target
    ax.scatter(df.loc[indicesToKeep, 'pc_1'],
               df.loc[indicesToKeep, 'pc_2'],
               c = color,
               s = 50)
ax.legend(targets)
ax.grid()

As you can see, we basically compressed the 4d data into a 2d observable one. In this case, we can say that the algorithm discovered the petals and sepals because we had the width and length of both.
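
How much information did we lose by dropping two dimensions? One way to check is the explained_variance_ratio_ attribute of the fitted PCA. A minimal sketch on the same scaled iris data; the names ratios and preserved are just for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X)

# Fraction of the total variance captured by each principal component
ratios = pca.explained_variance_ratio_
preserved = ratios.sum()
print(ratios, preserved)
```

If the preserved share is close to 1, the 2d picture is a fair summary of the original 4d data.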

Other Sklearn Dimensionality Reduction models

There are other Dimensionality Reduction models in Sklearn that you may prefer for certain problems, such as ICA, IPCA, NMF, LDA, Factor Analysis, and more.

For a more in-depth look go here.

What are the 3 Common Machine Learning Analysis/Testing Mistakes?

When you run your analysis, there are 3 common mistakes to take note of:

  • Overfitting
  • Look-ahead Bias
  • P-hacking

Do check out this lecture PDF to learn more: 3 Big Mistakes of Backtesting – 1) Overfitting 2) Look-Ahead Bias 3) P-Hacking

Full Code

GitHub Link

Igor Radovanovic