{"id":9302,"date":"2021-05-11T20:02:41","date_gmt":"2021-05-11T20:02:41","guid":{"rendered":"http:\/\/algotrading101.com\/learn\/?p=9302"},"modified":"2023-04-03T21:09:11","modified_gmt":"2023-04-03T21:09:11","slug":"sklearn-guide","status":"publish","type":"post","link":"https:\/\/algotrading101.com\/learn\/sklearn-guide\/","title":{"rendered":"Sklearn &#8211; An Introduction Guide to Machine Learning"},"content":{"rendered":"<div class=\"pvc_clear\"><\/div><p id=\"pvc_stats_9302\" class=\"pvc_stats total_only  \" data-element-id=\"9302\" style=\"\"><i class=\"pvc-stats-icon medium\" aria-hidden=\"true\"><svg aria-hidden=\"true\" focusable=\"false\" data-prefix=\"far\" data-icon=\"chart-bar\" role=\"img\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" viewBox=\"0 0 512 512\" class=\"svg-inline--fa fa-chart-bar fa-w-16 fa-2x\"><path fill=\"currentColor\" d=\"M396.8 352h22.4c6.4 0 12.8-6.4 12.8-12.8V108.8c0-6.4-6.4-12.8-12.8-12.8h-22.4c-6.4 0-12.8 6.4-12.8 12.8v230.4c0 6.4 6.4 12.8 12.8 12.8zm-192 0h22.4c6.4 0 12.8-6.4 12.8-12.8V140.8c0-6.4-6.4-12.8-12.8-12.8h-22.4c-6.4 0-12.8 6.4-12.8 12.8v198.4c0 6.4 6.4 12.8 12.8 12.8zm96 0h22.4c6.4 0 12.8-6.4 12.8-12.8V204.8c0-6.4-6.4-12.8-12.8-12.8h-22.4c-6.4 0-12.8 6.4-12.8 12.8v134.4c0 6.4 6.4 12.8 12.8 12.8zM496 400H48V80c0-8.84-7.16-16-16-16H16C7.16 64 0 71.16 0 80v336c0 17.67 14.33 32 32 32h464c8.84 0 16-7.16 16-16v-16c0-8.84-7.16-16-16-16zm-387.2-48h22.4c6.4 0 12.8-6.4 12.8-12.8v-70.4c0-6.4-6.4-12.8-12.8-12.8h-22.4c-6.4 0-12.8 6.4-12.8 12.8v70.4c0 6.4 6.4 12.8 12.8 12.8z\" class=\"\"><\/path><\/svg><\/i> <img decoding=\"async\" width=\"16\" height=\"16\" alt=\"Loading\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/plugins\/page-views-count\/ajax-loader-2x.gif\" border=0 \/><\/p><div class=\"pvc_clear\"><\/div>\n\n\n<figure class=\"wp-block-image size-full\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"551\" 
src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/1200px-Scikit_learn_logo_small-1.webp\" alt=\"scikit learn sklearn\" class=\"wp-image-16093\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/1200px-Scikit_learn_logo_small-1.webp 1024w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/1200px-Scikit_learn_logo_small-1-300x161.webp 300w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/1200px-Scikit_learn_logo_small-1-768x413.webp 768w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Table of Contents<\/h3>\n\n\n\n<ol><li><a href=\"#what-is-sklearn\">What is Sklearn?<\/a><\/li><li><a href=\"#sklearn-use\">What is Sklearn used for?<\/a><\/li><li><a href=\"#download-sklearn\">How to download Sklearn for Python?<\/a><\/li><li><a href=\"#pick-best-model\">How to pick the best scikit-learn model?<\/a><\/li><li><a href=\"#sklearn-preprocessing\">Sklearn preprocessing &#8211; Prepare the data for analysis<\/a><ol><li><a href=\"#feature-encoding\">Sklearn feature encoding<\/a><\/li><li><a href=\"#data-scaling\">Sklearn data scaling<\/a><\/li><li><a href=\"#missing-values\">Sklearn missing values<\/a><\/li><li><a href=\"#train-test\">Sklearn train test split<\/a><\/li><\/ol><\/li><li><a href=\"#regression\">Sklearn Regression &#8211; Predict the future<\/a><ol><li><a href=\"#linear-regression\">Sklearn Linear Regression<\/a><\/li><li><a href=\"#other-regression\">Other Sklearn regression models<\/a><\/li><\/ol><\/li><li><a href=\"#classification\">Sklearn Classification &#8211; Did I just see a cat?<\/a><ol><li><a href=\"#tree-classifier\">Sklearn\u00a0Decision Tree Classifier<\/a><\/li><li><a href=\"#other-classification\">Other Sklearn classification models<\/a><\/li><\/ol><\/li><li><a href=\"#clustering\">Sklearn Clustering &#8211; Create groups of similar data<\/a><ol><li><a href=\"#dbscan\">Sklearn DBSCAN<\/a><\/li><li><a 
href=\"#other-clustering\">Other Sklearn clustering models<\/a><\/li><\/ol><\/li><li><a href=\"#dimensionality\">Sklearn Dimensionality Reduction &#8211; Reducing random variables<\/a><ol><li><a href=\"#pca\">Sklearn PCA<\/a><\/li><li><a href=\"#other-dimensionality\">Other Sklearn Dimensionality Reduction models<\/a><\/li><\/ol><\/li><li><a href=\"#common-machine-learning-testing-mistakes\"><a href=\"https:\/\/algotrading101.com\/learn\/wp-admin\/post.php?post=9276&amp;action=edit#common-machine-learning-testing-mistakes\">What are the 3 Common Machine Learning Analysis\/Testing Mistakes?<\/a><\/a><\/li><li><a href=\"#full-code\">Full code<\/a><\/li><\/ol>\n\n\n\n<a name=\"what-is-sklearn\">\n\n\n\n<h2 class=\"wp-block-heading\">What is Sklearn?<\/h2>\n\n\n\n<p>Sklearn (scikit-learn) is a Python library that provides a wide range of unsupervised and supervised machine learning algorithms. <\/p>\n\n\n\n<p>It is also one of the most used machine learning libraries and is built on top of SciPy.<\/p>\n\n\n\n<p>Link: <a href=\"https:\/\/scikit-learn.org\/stable\/\">https:\/\/scikit-learn.org\/stable\/<\/a><\/p>\n\n\n\n<a name=\"sklearn-use\">\n\n\n\n<h2 class=\"wp-block-heading\">What is Sklearn used for?<\/h2>\n\n\n\n<p>The Sklearn Library is mainly used for modeling data and it provides efficient tools that are easy to use for any kind of predictive data analysis. <\/p>\n\n\n\n<p>The main use cases of this library can be categorized into 6 categories which are the following:<\/p>\n\n\n\n<ul><li>Preprocessing<\/li><li>Regression<\/li><li>Classification<\/li><li>Clustering<\/li><li>Model Selection<\/li><li>Dimensionality Reduction<\/li><\/ul>\n\n\n\n<p>As this article is mainly aimed at beginners, we will stick to the core concepts of each category and explore some of its most popular features and algorithms. 
<\/p>\n\n\n\n<p>Advanced readers can use this article as a recollection of some of the main use cases and intuitions behind popular sklearn features that most ML practitioners couldn&#8217;t live without.<\/p>\n\n\n\n<p>Each category will be explained in a beginner-friendly and illustrative way followed by the most used models, the intuition behind them, and hands-on experience. But first, we need to set up our sklearn library.<\/p>\n\n\n\n<a name=\"download-sklearn\">\n\n\n\n<h2 class=\"wp-block-heading\">How to download Sklearn for Python?<\/h2>\n\n\n\n<p>Sklearn can be obtained in Python by using the <code>pip install <\/code>function as shown below:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>$ pip install -U scikit-learn<\/code><\/pre><\/div>\n\n\n\n<p>Sklearn developers strongly advise using a virtual environment (venv) or a conda environment when working with the library as it helps to avoid potential conflicts with other packages.<\/p>\n\n\n\n<a name=\"pick-best-model\">\n\n\n\n<h2 class=\"wp-block-heading\">How to pick the best Sklearn model?<\/h2>\n\n\n\n<p>When it comes to picking the best Sklearn model, there are many factors that come into play that range from experience and data to the problem scope and math behind each algorithm.<\/p>\n\n\n\n<p>Sometimes all chosen algorithms can have similar results and, depending on the problem setting, you will need to pick the one that is the fastest or the one that generalizes the best on big data.<\/p>\n\n\n\n<p>It may happen that all of your promised models won&#8217;t perform well enough and that you will simply need to combine multiple models (e.g. ensemble), make your own custom-made model, or go for a deep learning approach.<\/p>\n\n\n\n<p>As picking the right model is one of the foundations of your problem solving, it is wise to read-up on as many models and their uses as you can. 
<\/p>\n\n\n\n<p>As model selection would be an article, or even a book, for itself, I&#8217;ll only provide some rough guidelines in the form of questions that you&#8217;ll need to ask yourself when deciding which model to deploy.<\/p>\n\n\n\n<div class=\"wp-block-group is-layout-flow\"><div class=\"wp-block-group__inner-container\">\n<p><strong>How much data do you have?<\/strong><\/p>\n\n\n\n<p>Some models are better on smaller datasets while others require more data and tend to generalize better on larger datasets (e.g. SGD Regressor vs Lasso Regression).<\/p>\n<\/div><\/div>\n\n\n\n<div class=\"wp-block-group is-layout-flow\"><div class=\"wp-block-group__inner-container\">\n<p><strong>What are the main characteristics of your data?<\/strong><\/p>\n\n\n\n<p>Is your data linear, quadratic, or all over the place? How do your distributions look like? Is your data made out of numbers or strings? Is the data labeled?<\/p>\n<\/div><\/div>\n\n\n\n<div class=\"wp-block-group is-layout-flow\"><div class=\"wp-block-group__inner-container\">\n<p><strong>What kind of a problem are you solving?<\/strong><br><br>Are you trying to predict: which cat will push most jars of the table, is that a dog or a cat, or of which dog breeds are a group of dogs made up? <\/p>\n\n\n\n<p>All of these questions have different approaches and solutions. 
Thus we will explore later in the article the three main problem classifications:<\/p>\n\n\n\n<ul><li>regression<\/li><li>classification<\/li><li>clustering<\/li><\/ul>\n\n\n\n<div class=\"wp-block-group is-layout-flow\"><div class=\"wp-block-group__inner-container\">\n<p><strong>How do your models perform when compared against each other?<\/strong><\/p>\n\n\n\n<p>You will see that scikit-learn comes equipped with functions that allow us to inspect each model on several characteristics and compare it to the other ones.<\/p>\n\n\n\n<p>Take note that scikit-learn has created a good <a href=\"https:\/\/scikit-learn.org\/stable\/tutorial\/machine_learning_map\/index.html\">algorithm cheat-sheet<\/a> that aids you in your model selection and I&#8217;d advise having it near you at those troubling times.<\/p>\n<\/div><\/div>\n<\/div><\/div>\n\n\n\n<a name=\"sklearn-preprocessing\">\n\n\n\n<h2 class=\"wp-block-heading\">Sklearn preprocessing &#8211; Prepare the data for analysis<\/h2>\n\n\n\n<p>When you think of data you probably have in mind a ginormous excel spreadsheet full of rows and columns with numbers in them. Well, the case is that data can come in a plethora of formats like images, videos and audio.<\/p>\n\n\n\n<p>The main job of data preprocessing is to turn this data into a readable format for our algorithm. A machine can&#8217;t just &#8220;listen in&#8221; to an audiotape to learn voice recognition, rather it needs it to be converted numbers.<\/p>\n\n\n\n<p>The main building blocks of our dataset are called features which can be categorical or numerical. Simply put, categorical data is used to group data with similar characteristics while numerical data provides information with numbers.<\/p>\n\n\n\n<p>As the features come from two different categories, they need to be treated (preprocessed) in different ways. 
The best way to learn is to start coding along with me.<\/p>\n\n\n\n<a name=\"feature-encoding\">\n\n\n\n<h3 class=\"wp-block-heading\">Sklearn feature encoding<\/h3>\n\n\n\n<p>Feature encoding is a method where we transform categorical variables into continuous ones. The most popular ways of doing so are known as One Hot Encoding and Label encoding.<\/p>\n\n\n\n<p>For example, a person can have features such as [&#8220;male&#8221;, &#8220;female], [&#8220;from US&#8221;, &#8220;from UK&#8221;], [&#8220;uses Binance&#8221;, &#8220;uses Coinbase&#8221;]. These features can be encoded as numbers e.g. [&#8220;male&#8221;, &#8220;from US&#8221;, &#8220;uses Coinbase&#8221;] would be [0, 0, 1].<\/p>\n\n\n\n<p>This can be done by using the scikit-learn <code>OrdinalEncoder<\/code>() function as follows:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>pip install scikit-learn\nfrom sklearn import preprocessing\n\nX = [[&#39;male&#39;, &#39;from US&#39;, &#39;uses Coinbase&#39;], [&#39;female&#39;, &#39;from UK&#39;, &#39;uses Binance&#39;]]\nencode = preprocessing.OrdinalEncoder()\nencode.fit(X)\n\nencode.transform([[&#39;male&#39;, &#39;from UK&#39;, &#39;uses Coinbase&#39;]])\n\nOutput: array([[1., 0., 1.]])<\/code><\/pre><\/div>\n\n\n\n<p>As you can see, it transformed the features into integers. But they are not continuous and can&#8217;t be used with scikit-learn estimators. In order to fix this, a popular and most used method is one hot encoding.<\/p>\n\n\n\n<p>One hot encoding, also known as dummy encoding, can be obtained through the scikit-learn <code>OneHotEncoder<\/code>() function. 
It works by transforming each category with N possible values into N binary features where one category is represented as 1 and the rest as 0.<\/p>\n\n\n\n<p>The following example will hopefully make it clear:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>one_hot = preprocessing.OneHotEncoder()\none_hot.fit(X)\n\none_hot.transform([[&#39;male&#39;, &#39;from UK&#39;, &#39;uses Coinbase&#39;],\n                   [&#39;female&#39;, &#39;from US&#39;, &#39;uses Binance&#39;]]).toarray()\n\nOutput: array([[0., 1., 1., 0., 0., 1.],\n              [1., 0., 0., 1., 1., 0.]])<\/code><\/pre><\/div>\n\n\n\n<p>To see what your encoded features are exactly you can always use the <code>.categories_<\/code> attribute as shown below:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>one_hot.categories_\n\nOutput: [array([&#39;female&#39;, &#39;male&#39;], dtype=object),\n         array([&#39;from UK&#39;, &#39;from US&#39;], dtype=object),\n         array([&#39;uses Binance&#39;, &#39;uses Coinbase&#39;], dtype=object)]<\/code><\/pre><\/div>\n\n\n\n<a name=\"data-scaling\">\n\n\n\n<h3 class=\"wp-block-heading\">Sklearn data scaling<\/h3>\n\n\n\n<p>Feature scaling is a preprocessing method used to normalize data as it helps by improving some machine learning models. The two most common scaling techniques are known as standardization and normalization.<\/p>\n\n\n\n<p>Standardization makes the values of each feature in the data have zero-mean and unit variance. This method is commonly used with algorithms such as SVMs and Logistic regression.<\/p>\n\n\n\n<p>Standardization is done by subtracting the mean from each feature and dividing it by the standard deviation. It&#8217;s some basic statistics and math, but don&#8217;t worry if you don&#8217;t get it. 
There are many tutorials that cover it.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"415\" height=\"219\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/03\/rrwr.png\" alt=\"\" class=\"wp-image-8610\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/03\/rrwr.png 415w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/03\/rrwr-300x158.png 300w\" sizes=\"(max-width: 415px) 100vw, 415px\" \/><\/figure>\n\n\n\n<p>In scikit-learn we use the StandardScaler() function to standardize the data. Let us create a random NumPy array and standardize the data by giving it a zero mean and unit variance.<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>import numpy as np\n\nscaler = preprocessing.StandardScaler()\nX = np.random.rand(3,4)\nX<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"683\" height=\"90\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/03\/2-3.png\" alt=\"\" class=\"wp-image-8612\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/03\/2-3.png 683w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/03\/2-3-300x40.png 300w\" sizes=\"(max-width: 683px) 100vw, 683px\" \/><\/figure>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>X_scaled = scaler.fit_transform(X)\nX_scaled<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"722\" height=\"88\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/03\/3-4.png\" alt=\"\" class=\"wp-image-8613\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/03\/3-4.png 722w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/03\/3-4-300x37.png 300w\" 
sizes=\"(max-width: 722px) 100vw, 722px\" \/><\/figure>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>print(f&#39;The scaled mean is: {X_scaled.mean(axis=0)}\\nThe scaled variance is: {X_scaled.std(axis=0)}&#39;)<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1009\" height=\"78\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/4-11.jpg\" alt=\"\" class=\"wp-image-16097\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/4-11.jpg 1009w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/4-11-300x23.jpg 300w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/4-11-768x59.jpg 768w\" sizes=\"(max-width: 1009px) 100vw, 1009px\" \/><\/figure>\n\n\n\n<p>Wait for a second! Didn&#8217;t you say that all mean values need to be 0? <\/p>\n\n\n\n<p>Well, in practice these values are so close to 0 that they can be viewed as zero. Moreover, due to limitations with numerical representations the scaler can only get the mean really close to a zero.<\/p>\n\n\n\n<p>Let&#8217;s move onto the next scaling method called normalization. Normalization is a term with many definitions that change from one field to another and we are going to define it as follows:<\/p>\n\n\n\n<p>Normalization is a scaling technique in which values are shifted and rescaled so that they end up being between 0 and 1. It is also known as Min-Max scaling. 
In scikit-learn it can be applied with the <code>Normalizer() <\/code>function.<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>norm = preprocessing.Normalizer()\n\nX_norm = norm.transform(X)\nX_norm<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"685\" height=\"79\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/5-11.jpg\" alt=\"\" class=\"wp-image-16098\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/5-11.jpg 685w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/5-11-300x35.jpg 300w\" sizes=\"(max-width: 685px) 100vw, 685px\" \/><\/figure>\n\n\n\n<p>So, which one is better? Well, it depends on your data and the problem you&#8217;re trying to solve. Standardization is often good when the data is depicting a Normal distribution and vice versa. If in doubt, try both and see which one improves the model.<\/p>\n\n\n\n<a name=\"missing-values\">\n\n\n\n<h3 class=\"wp-block-heading\">Sklearn missing values<\/h3>\n\n\n\n<p>In scikit-learn we can use the <code>.impute<\/code> class to fill in the missing values. The most used functions would be the <code>SimpleImputer()<\/code>, <code>KNNImputer()<\/code> and <code>IterativeImputer()<\/code>.<\/p>\n\n\n\n<p>When you encounter a real-life dataset it will 100% have missing values in it that can be there for various reasons ranging from rage quits to bugs and mistakes.<\/p>\n\n\n\n<p>There are several ways to treat them. One way is to delete the whole row (candidate) from the dataset but it can be costly for small to average datasets as you can delete plenty of data.<\/p>\n\n\n\n<p>Some better ways would be to change the missing values with the mean or median of the dataset. 
You could also try, if possible, to categorize your subject into their subcategory and take the mean\/median of it as the new value.<\/p>\n\n\n\n<p>Let&#8217;s use the <code>SimpleImputer()<\/code> to replace the missing value with the mean:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>from sklearn.impute import SimpleImputer\n\nimputer = SimpleImputer(missing_values=np.nan, strategy=&quot;mean&quot;)\nimputer.fit_transform([[10,np.nan],[2,4],[10,9]])<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"285\" height=\"98\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/6-11.jpg\" alt=\"\" class=\"wp-image-16099\"\/><\/figure>\n\n\n\n<p>The <code>strategy<\/code> hyperparameter can be changed to median, most_frequent, and constant. But Igor, can we impute missing strings? Yes, you can!<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>import pandas as pd\n\ndf = pd.DataFrame([[&#39;i&#39;, &#39;g&#39;],\n                   [&#39;o&#39;, &#39;r&#39;],\n                   [&#39;i&#39;, np.nan],\n                   [np.nan, &#39;r&#39;]], dtype=&#39;category&#39;)\n\nimputer = SimpleImputer(strategy=&#39;most_frequent&#39;)\nimputer.fit_transform(df)<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"420\" height=\"126\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/7-8.jpg\" alt=\"\" class=\"wp-image-16100\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/7-8.jpg 420w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/7-8-300x90.jpg 300w\" sizes=\"(max-width: 420px) 100vw, 420px\" \/><\/figure>\n\n\n\n<p>If you want to keep track of the missing values and the positions they were in, you can use the 
<code>MissingIndicator()<\/code> function:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>from sklearn.impute import MissingIndicator\n\n# Image the 3&#39;s were imputed by the SimpleImputer()\nY = np.array([[3,1], \n              [5,3],\n              [9,4], \n              [3,7]])\n\nmissing = MissingIndicator(missing_values=3)\nmissing.fit_transform(Y)<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"331\" height=\"121\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/8-10.jpg\" alt=\"\" class=\"wp-image-16101\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/8-10.jpg 331w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/8-10-300x110.jpg 300w\" sizes=\"(max-width: 331px) 100vw, 331px\" \/><\/figure>\n\n\n\n<p>The <code>IterateImputer()<\/code> is fancy, as it basically goes across the features and uses the missing feature as the label and other features as the inputs of a regression model. Then it predicts the value of the label for the number of iterations we specify.<\/p>\n\n\n\n<p>If you&#8217;re not sure how regression algorithms work, don&#8217;t worry as we will soon go over them. 
As the <code>IterativeImputer()<\/code>  is an experimental feature we will need to enable it before use:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>from sklearn.experimental import enable_iterative_imputer\nfrom sklearn.impute import IterativeImputer\n\nimputer = IterativeImputer(max_iter=15, random_state=42)\nimputer.fit_transform(([1,5],[4,6],[2, np.nan], [np.nan, 8]))<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"433\" height=\"123\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/9-9.jpg\" alt=\"\" class=\"wp-image-16102\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/9-9.jpg 433w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/9-9-300x85.jpg 300w\" sizes=\"(max-width: 433px) 100vw, 433px\" \/><\/figure>\n\n\n\n<a name=\"train-test\">\n\n\n\n<h2 class=\"wp-block-heading\">Sklearn train test split<\/h2>\n\n\n\n<p>In Sklearn the data can be split into test and training groups by using the <code>train_test_split()<\/code> function which is a part of the <code>model_selection<\/code> class.<\/p>\n\n\n\n<p>But why do we need to split the data into two groups? Well, the training data is the data on which we fit our model and it learns on it. In order to evaluate how the model performs on unseen data, we use test data.<\/p>\n\n\n\n<p>An important thing, in most cases, is to allocate more data to the training set. When speaking of the ratio of this allocation there aren&#8217;t any hard rules. It all depends on the size of your dataset. <\/p>\n\n\n\n<p>The most used allocation ratio is 80% for training and 20% for testing. Have in mind that most people use the training\/development set split but name the dev set as the test set. 
This is more of a conceptual mistake.<\/p>\n\n\n\n<p>Now let us create a random dataset and split it into training and testing sets:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>from sklearn.datasets import make_blobs\nfrom sklearn.model_selection import train_test_split\n\n# Create a random dataset\nX, y = make_blobs(n_samples=1500)\n\n# Split the data\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)\n\nprint(f&#39;X training set {X_train.shape}\\nX testing set {X_test.shape}\\ny training set {y_train.shape}\\ny testing set {y_test.shape}&#39;)<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"355\" height=\"122\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/10-11.jpg\" alt=\"\" class=\"wp-image-16103\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/10-11.jpg 355w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/10-11-300x103.jpg 300w\" sizes=\"(max-width: 355px) 100vw, 355px\" \/><\/figure>\n\n\n\n<p>If your dataset is big enough you&#8217;ll often be fine with using this way to split the data. But some datasets come with a severe imbalance in them. <\/p>\n\n\n\n<p>For example, if you&#8217;re building a model to detect outliers that default their credit cards you will most often have a very small percentage of them in your data. <\/p>\n\n\n\n<p>This means that the <code>train_test_split()<\/code> function will most likely allocate too little of the outliers to your training set and the ML algorithm won&#8217;t learn to detect them efficiently. 
Let&#8217;s simulate a dataset like that:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>from sklearn.datasets import make_classification\nfrom collections import Counter\n\n# Create an imablanced dataset\nX, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0, random_state=42)\nprint(f&#39;Number of y before splitting is {Counter(y)}&#39;)\n\n# Split the data the usual way\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)\nprint(f&#39;Number of y in the training set after splitting is {Counter(y_train)}&#39;)\nprint(f&#39;Number of y in the testing set after splitting is {Counter(y_test)}&#39;)<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"965\" height=\"94\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/11-12.jpg\" alt=\"\" class=\"wp-image-16104\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/11-12.jpg 965w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/11-12-300x29.jpg 300w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/11-12-768x75.jpg 768w\" sizes=\"(max-width: 965px) 100vw, 965px\" \/><\/figure>\n\n\n\n<p>As you can see, the training set has 43 examples of y while the testing set has only 7! In order to combat this, we can split the data into training and testing by stratification which is done according to y.<\/p>\n\n\n\n<p>This means that y examples will be adequately stratified in both training and testing sets (20% of y goes to the test set). 
In scikit-learn this is done by adding the <code>stratify<\/code> argument as shown below:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code># Split the data by stratification\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=42)\nprint(f&#39;Number of y in the training set after splitting is {Counter(y_train)}&#39;)\nprint(f&#39;Number of y in the testing set after splitting is {Counter(y_test)}&#39;)<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"69\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/12-9-1024x69.jpg\" alt=\"\" class=\"wp-image-16105\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/12-9-1024x69.jpg 1024w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/12-9-300x20.jpg 300w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/12-9-768x52.jpg 768w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/12-9.jpg 1038w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>For a more in-depth guide and understanding of the train test split and cross-validation, please visit the following article that is found on our blog:<\/p>\n\n\n\n<p><a href=\"https:\/\/algotrading101.com\/learn\/train-test-split\/\">https:\/\/algotrading101.com\/learn\/train-test-split\/<\/a><\/p>\n\n\n\n<p>For more information about scikit-learn preprocessing functions go <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/preprocessing.html#preprocessing\">here<\/a>.<\/p>\n\n\n\n<a name=\"regression\">\n\n\n\n<h2 class=\"wp-block-heading\">Sklearn Regression &#8211; Predict the future<\/h2>\n\n\n\n<p>The regression method is used for prediction and forecasting and in Sklearn it can be accessed by the <code>linear_model()<\/code> class.<\/p>\n\n\n\n<p>In 
regression tasks, we want to predict the outcome y given X. For example, imagine that we want to predict the price of a house (y) given features (X) like its age and number of rooms. The most simple regression model is linear regression.<\/p>\n\n\n\n<a name=\"linear-regression\">\n\n\n\n<h3 class=\"wp-block-heading\">Sklearn Linear Regression<\/h3>\n\n\n\n<p>Sklearn Linear Regression model can be used by accessing the <code>LinearRegression()<\/code> function. The linear regression model assumes that the dependent variable (y) is a linear combination of the parameters (X<sub>i<\/sub>).<\/p>\n\n\n\n<p>Allow me to illustrate how linear regression works. Imagine that you were tasked to fit a red line so it resembles the trend of the data while minimizing the distance between each point as shown below:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l1-1024x576.jpg\" alt=\"\" class=\"wp-image-16106\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l1-1024x576.jpg 1024w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l1-300x169.jpg 300w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l1-768x432.jpg 768w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l1-1536x864.jpg 1536w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l1.jpg 1920w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>By eye-balling it should look something like this:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l2-1-1024x576.jpg\" alt=\"\" class=\"wp-image-16107\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l2-1-1024x576.jpg 1024w, 
https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l2-1-300x169.jpg 300w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l2-1-768x432.jpg 768w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l2-1-1536x864.jpg 1536w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l2-1.jpg 1920w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Let&#8217;s import the sklearn boston house-price dataset so that we can predict the median house value (MEDV) by the house&#8217;s age (AGE) and the number of rooms (RM). (Note that <code>load_boston<\/code> was deprecated in scikit-learn 1.0 and removed in 1.2, so you will need an older scikit-learn version to run this snippet as-is.)<\/p>\n\n\n\n<p>Keep in mind that this is known as multiple linear regression, as we are using two features.<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>from sklearn import linear_model, datasets\nfrom sklearn.model_selection import train_test_split\nimport matplotlib.pyplot as plt\n%matplotlib inline\nimport numpy as np\nimport pandas as pd\n\n# Load the Boston dataset\nboston = datasets.load_boston()\ndf = pd.DataFrame(boston.data, columns=boston.feature_names)\n# Add the target variable (label)\ndf[&#39;MEDV&#39;] = boston.target\ndf.head()<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"234\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/04\/13-1024x234.png\" alt=\"\" class=\"wp-image-8648\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/04\/13-1024x234.png 1024w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/04\/13-300x69.png 300w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/04\/13-768x175.png 768w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/04\/13.png 1051w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Now we will set our features (X) and the label (y). 
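As a quick standalone aside, numpy's np.c_ helper used in the next snippet simply stacks 1-D arrays as columns. A tiny demonstration with hypothetical values shaped like the AGE and RM columns (not the actual dataset rows):

```python
import numpy as np

# np.c_ concatenates arrays along the second (column) axis
ages = np.array([65.2, 78.9, 61.1])     # hypothetical AGE-like values
rooms = np.array([6.575, 6.421, 7.185])  # hypothetical RM-like values

combined = np.c_[ages, rooms]
print(combined.shape)  # (3, 2)
print(combined[0])     # [65.2   6.575]
```

Each input array becomes one column of the resulting 2-D array, which is exactly what we need before wrapping the features in a DataFrame.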
Notice how we use the numpy <code>np.c_<\/code> function, which concatenates the feature columns for us.<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code># Set the features and label\nX = pd.DataFrame(np.c_[df[&#39;AGE&#39;], df[&#39;RM&#39;]], columns = [&#39;AGE&#39;,&#39;RM&#39;])\ny = df[&#39;MEDV&#39;]<\/code><\/pre><\/div>\n\n\n\n<p>Now we will split the data into training and test sets, as we learned to do earlier:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code># Split the data\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)\nprint(X_train.shape, X_test.shape, y_train.shape, y_test.shape)<\/code><\/pre><\/div>\n\n\n\n<div class=\"wp-block-group is-layout-flow\"><div class=\"wp-block-group__inner-container\">\n<p>Let&#8217;s plot each of our features and see how they look. 
Try to imagine where the regression line would go.<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code># Plot the features\nplt.scatter(X[&#39;RM&#39;], y)<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"369\" height=\"248\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l3-1.jpg\" alt=\"\" class=\"wp-image-16108\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l3-1.jpg 369w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l3-1-300x202.jpg 300w\" sizes=\"(max-width: 369px) 100vw, 369px\" \/><\/figure>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>plt.scatter(X[&#39;AGE&#39;], y)<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"368\" height=\"248\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l4-1.jpg\" alt=\"\" class=\"wp-image-16109\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l4-1.jpg 368w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l4-1-300x202.jpg 300w\" sizes=\"(max-width: 368px) 100vw, 368px\" \/><\/figure>\n\n\n\n<p>You can already see that the data is a bit messy. The RM feature appears more linear and more strongly correlated with the label, while the AGE feature shows the opposite. 
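That visual impression can be quantified with the Pearson correlation coefficient, which numpy's corrcoef computes directly. A small sketch on synthetic stand-ins (made-up data, not the actual RM and AGE columns):

```python
import numpy as np

rng = np.random.default_rng(0)

# A strongly linear feature (a stand-in for RM) and an unrelated noisy one (a stand-in for AGE)
x_linear = np.arange(50, dtype=float)
target = 2.0 * x_linear + rng.normal(scale=3.0, size=50)
x_noisy = rng.normal(size=50)

print(np.corrcoef(x_linear, target)[0, 1])  # close to 1
print(np.corrcoef(x_noisy, target)[0, 1])   # close to 0
```

A correlation near 1 (or -1) suggests the feature will fit a straight line well; a value near 0 suggests it carries little linear signal.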
We also have outliers.<\/p>\n\n\n\n<p>For this article, we won&#8217;t bother to clean up the data, as we&#8217;re not interested in creating a perfect model.<\/p>\n<\/div><\/div>\n\n\n\n<p>The next thing that we want to do is to fit our model and evaluate some of its core metrics:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>regressor = linear_model.LinearRegression()\nmodel = regressor.fit(X_train, y_train)\n\nprint(&#39;Coefficient of determination:&#39;, model.score(X, y))\nprint(&#39;Intercept:&#39;, model.intercept_)\nprint(&#39;slope:&#39;, model.coef_)<\/code><\/pre><\/div>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-plain\"><code>Coefficient of determination: 0.529269171356878\nIntercept: -28.203538066489102\nslope: [-0.06640957  8.7957305 ]<\/code><\/pre><\/div>\n\n\n\n<p>The coefficient of determination (R<sup>2<\/sup>) tells us how much of the variance (in our case, the variance of the median house value) our model explains. Note that we scored the model on the full dataset here. As we can see, it explains about 53% of the variance, which is okay.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"785\" height=\"73\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/04\/15.png\" alt=\"\" class=\"wp-image-8650\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/04\/15.png 785w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/04\/15-300x28.png 300w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/04\/15-768x71.png 768w\" sizes=\"(max-width: 785px) 100vw, 785px\" \/><\/figure>\n\n\n\n<p>For brevity, we won&#8217;t go into the math here, but feel free to look up the in-depth explanation behind the formula. 
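If you do want to peek at the formula, R² is just one minus the ratio of the residual sum of squares to the total sum of squares. A quick sketch with made-up numbers (not our Boston fit), checked against sklearn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and model predictions
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([11.0, 11.5, 16.0, 18.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(r2)
print(np.isclose(r2, r2_score(y_true, y_pred)))  # True
```

A perfect model would have zero residuals and an R² of 1, while a model no better than predicting the mean scores 0.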
You don&#8217;t need to know the math in order to use the regression, though it certainly doesn&#8217;t hurt.<\/p>\n\n\n\n<p>The <code>.intercept_<\/code> attribute shows the bias b<sub>0<\/sub>, while <code>.coef_<\/code> is an array that contains our b<sub>1<\/sub> and b<sub>2<\/sub>. In our case, the intercept is -28.20 and it represents the value of the predicted response when X<sub>1<\/sub> = X<sub>2<\/sub> = 0.<\/p>\n\n\n\n<p>When we look at the slope, we can see that an increase in X<sub>1<\/sub> (AGE) by 1 lowers the median house price by 0.06, while an increase in X<sub>2<\/sub> (RM) by 1 raises the dependent variable by 8.79.<\/p>\n\n\n\n<p>Let&#8217;s see how good your eyeballed regression line was. Note that we fit each single-feature line with <code>np.polyfit<\/code> and store the slope and intercept in separate variables, so we don&#8217;t overwrite the attributes of our fitted model:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code># Age regression line\nplt.plot(X[&#39;AGE&#39;], y, &#39;o&#39;)\nslope, intercept = np.polyfit(X[&#39;AGE&#39;], y, 1)\nplt.plot(X[&#39;AGE&#39;], slope*X[&#39;AGE&#39;] + intercept, color=&#39;red&#39;)<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"368\" height=\"248\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l5-1.jpg\" alt=\"\" class=\"wp-image-16110\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l5-1.jpg 368w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l5-1-300x202.jpg 300w\" sizes=\"(max-width: 368px) 100vw, 368px\" \/><\/figure>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code># Room number regression line\nplt.plot(X[&#39;RM&#39;], y, &#39;o&#39;)\nslope, intercept = np.polyfit(X[&#39;RM&#39;], y, 1)\nplt.plot(X[&#39;RM&#39;], slope*X[&#39;RM&#39;] + intercept, color=&#39;red&#39;)<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img 
loading=\"lazy\" decoding=\"async\" width=\"369\" height=\"248\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l6.jpg\" alt=\"\" class=\"wp-image-16111\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l6.jpg 369w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/l6-300x202.jpg 300w\" sizes=\"(max-width: 369px) 100vw, 369px\" \/><\/figure>\n\n\n\n<p>Now, let us predict some data and use a sklearn metric that will tell us how the model is performing:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>y_test_predict = regressor.predict(X_test)\nprint(&#39;predicted response:&#39;, y_test_predict, sep=&#39;\\n&#39;)<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"858\" height=\"455\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/14-11.jpg\" alt=\"\" class=\"wp-image-16112\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/14-11.jpg 858w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/14-11-300x159.jpg 300w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/14-11-768x407.jpg 768w\" sizes=\"(max-width: 858px) 100vw, 858px\" \/><\/figure>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>from sklearn.metrics import mean_squared_error\n\nrmse = (np.sqrt(mean_squared_error(y_test, y_test_predict)))\nprint(rmse)<\/code><\/pre><\/div>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-plain\"><code>6.315423538049165<\/code><\/pre><\/div>\n\n\n\n<p>Root Mean Square Error<strong>&nbsp;<\/strong>(RMSE) is the&nbsp;standard deviation&nbsp;of the&nbsp;residuals&nbsp;(prediction errors). Residuals are a measure of how far from the regression line data points are. 
It tells us how concentrated the data is around the regression line.<\/p>\n\n\n\n<p>In our case, the RMSE is higher than we&#8217;d like. I&#8217;ll task you to try out other features (LSTAT and RM) and lower the RMSE. What happens when you use those two or more? Which features make the most sense to use?<\/p>\n\n\n\n<p>Feel free to play around and check the Full code section to see some guidelines.<\/p>\n\n\n\n<a name=\"other-regression\">\n\n\n\n<h3 class=\"wp-block-heading\">Other Sklearn regression models<\/h3>\n\n\n\n<p>There are various regression models that may be more useful and fit the data better than simple linear regression, such as the Lasso, Elastic-Net, Ridge, Polynomial, and Bayesian regressions.<\/p>\n\n\n\n<p>For more information about them go <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/linear_model.html\">here<\/a>.<\/p>\n\n\n\n<a name=\"classification\">\n\n\n\n<h2 class=\"wp-block-heading\">Sklearn Classification &#8211; Did I just see a cat?<\/h2>\n\n\n\n<p>A classification problem in ML involves teaching a machine how to group data into classes that match specified criteria. Some of the most popular classification models in Sklearn come from the <code>sklearn.tree<\/code> module.<\/p>\n\n\n\n<p>Every day you perform classification. 
For example, when you go to a grocery store you can easily group different foods by their food group (fruit, meat, grain, etc.).<\/p>\n\n\n\n<p>When it comes to more complex decisions in the fields of medicine, trading, and politics, we&#8217;d like some good ML algorithms to aid our decision-making process.<\/p>\n\n\n\n<a name=\"tree-classifier\">\n\n\n\n<h3 class=\"wp-block-heading\">Sklearn Decision Tree Classifier<\/h3>\n\n\n\n<p>In Sklearn, the Decision Tree classifier can be accessed by using the <code>DecisionTreeClassifier()<\/code> class, which is a part of the <code>sklearn.tree<\/code> module.<\/p>\n\n\n\n<p>The main goal of a Decision Tree algorithm is to predict the value of the target variable (label) by learning simple decision rules deduced from the data features. For example, look at my simple decision tree below:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/Go-outside-1024x576.jpg\" alt=\"\" class=\"wp-image-16113\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/Go-outside-1024x576.jpg 1024w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/Go-outside-300x169.jpg 300w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/Go-outside-768x432.jpg 768w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/Go-outside-1536x864.jpg 1536w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/Go-outside.jpg 1920w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Here are some main characteristics of a Decision Tree Classifier:<\/p>\n\n\n\n<ul><li>It is made out of Nodes and Branches<\/li><li>Branches connect Nodes<\/li><li>The top Node is called the Root Node (&#8220;Go outside&#8221;)<\/li><li>A Node from which new Nodes arise is called a Parent Node (i.e. 
&#8220;Is it raining?&#8221; Node)<\/li><li>A Node without a Child Node is called a Leaf Node (i.e. &#8220;Classic programmer&#8221; Node)<\/li><\/ul>\n\n\n\n<p>The good thing about a Decision Tree Classifier is that it is easy to visualize and interpret. It also requires little to no data preparation. The bad thing about it is that minor changes in the data can change it considerably.<\/p>\n\n\n\n<p>For a more in-depth understanding of its pros and cons go <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/tree.html#classification\">here<\/a>.<\/p>\n\n\n\n<p>Now, let&#8217;s create a decision tree on the popular iris dataset. The dataset is made out of 3 plant species, and we&#8217;ll want our tree to aid us in deciding which species a plant belongs to according to its petal and sepal width and length.<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>from sklearn import tree\nfrom sklearn.tree import DecisionTreeClassifier\nfrom sklearn.datasets import load_iris\nimport graphviz\n\n# Obtain the data and fit the model\niris = load_iris()\nX, y = iris.data, iris.target\ndtc = DecisionTreeClassifier()\ndtc = dtc.fit(X, y)\n\n# Graph the Tree\ndot_data = tree.export_graphviz(dtc, out_file=None,\n                                feature_names=iris.feature_names,\n                                class_names=iris.target_names,\n                                filled=True, rounded=True,\n                                special_characters=True)\ngraph = graphviz.Source(dot_data)\ngraph<\/code><\/pre><\/div>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"776\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/iris-1024x776.jpg\" alt=\"\" class=\"wp-image-16114\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/iris-1024x776.jpg 1024w, 
https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/iris-300x227.jpg 300w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/iris-768x582.jpg 768w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/iris.jpg 1102w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure><\/div>\n\n\n<p>Take note that &#8220;Gini&#8221; measures impurity. A node is \u201cpure\u201d when its Gini is 0, which happens when all the training instances it applies to belong to the same class.<\/p>\n\n\n\n<p>Keep in mind that all algorithms have hyperparameters that can be tuned to produce a better model. For example, you can restrict a Decision Tree to a certain maximum depth or a certain allowed number of leaf nodes.<\/p>\n\n\n\n<p>To see the default hyperparameters of an untouched Decision Tree Classifier and what each of them does, please visit the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.tree.DecisionTreeClassifier.html\">scikit-learn documentation<\/a>.<\/p>\n\n\n\n<a name=\"other-classification\">\n\n\n\n<h3 class=\"wp-block-heading\">Other Sklearn classification models<\/h3>\n\n\n\n<p>Depending on the problem and your data, you might want to try out other classification algorithms that Sklearn has to offer, for example the SVC, Random Forest, AdaBoost, GaussianNB, or KNeighbors classifiers.<\/p>\n\n\n\n<p>If you want to see how they compare to each other go <a href=\"https:\/\/scikit-learn.org\/stable\/auto_examples\/classification\/plot_classifier_comparison.html\">here<\/a>.<\/p>\n\n\n\n<a name=\"clustering\">\n\n\n\n<h2 class=\"wp-block-heading\">Sklearn Clustering &#8211; Create groups of similar data<\/h2>\n\n\n\n<p>Clustering is an unsupervised machine learning problem where the algorithm needs to find relevant patterns in unlabeled data. 
In Sklearn these methods can be accessed via the <code>sklearn.cluster<\/code> module.<\/p>\n\n\n\n<p>Below you can see an example of the clustering method:<\/p>\n\n\n\n<figure class=\"wp-block-gallery alignwide has-nested-images columns-default is-cropped wp-block-gallery-6 is-layout-flex\">\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" data-id=\"16115\"  src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/1_1-1024x576.jpg\" alt=\"\" class=\"wp-image-16115\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/1_1-1024x576.jpg 1024w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/1_1-300x169.jpg 300w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/1_1-768x432.jpg 768w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/1_1-1536x864.jpg 1536w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/1_1.jpg 1920w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" data-id=\"16116\"  src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/2_1-1024x576.jpg\" alt=\"\" class=\"wp-image-16116\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/2_1-1024x576.jpg 1024w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/2_1-300x169.jpg 300w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/2_1-768x432.jpg 768w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/2_1-1536x864.jpg 1536w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/2_1.jpg 1920w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/figure>\n\n\n\n<a name=\"dbscan\">\n\n\n\n<h3 class=\"wp-block-heading\">Sklearn DBSCAN<\/h3>\n\n\n\n<p>In Sklearn, the DBSCAN clustering 
model can be used via the <code>DBSCAN()<\/code> estimator, which is a part of the <code>sklearn.cluster<\/code> module.<\/p>\n\n\n\n<p>DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. Unlike centroid-based algorithms such as K-Means, DBSCAN does not assume that clusters are convex, so it is mostly used when the clusters can be of any shape or size.<\/p>\n\n\n\n<p>The DBSCAN algorithm finds clusters by looking for areas with high density that are separated by areas of low density. The algorithm has two main parameters: <code>min_samples<\/code> and <code>eps<\/code>.<\/p>\n\n\n\n<p>A high min_samples and a low eps indicate that a higher density is needed in order to form a cluster. The <code>min_samples<\/code> parameter controls how sensitive the algorithm is towards noise (higher values mean that it is less sensitive).<\/p>\n\n\n\n<p>On the other hand, the <code>eps<\/code> parameter controls the local neighborhood of the points. If it is too high, all the data will end up in one big cluster; if it is too low, most data points will be left unclustered and labeled as noise.<\/p>\n\n\n\n<p>Enough theorizing, let&#8217;s jump to the coding part! We will generate some data and fit the DBSCAN clustering algorithm on it. 
We will also play a bit with its parameters.<\/p>\n\n\n\n<p>Let&#8217;s import the libraries we need, create the data, scale it and fit the model:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>from sklearn.datasets import make_circles\nfrom sklearn.cluster import DBSCAN\nfrom sklearn import metrics\nfrom sklearn.preprocessing import StandardScaler\nimport numpy as np\nimport matplotlib.pyplot as plt\n%matplotlib inline\n\n# Make the data and scale it\nX, y = make_circles(n_samples=800, factor=0.3, noise=0.1, random_state=42)\nX = StandardScaler().fit_transform(X)\n\n# Fit the algorithm\ny_predicted = DBSCAN(eps=0.35, min_samples=10).fit_predict(X)<\/code><\/pre><\/div>\n\n\n\n<p>Now, let&#8217;s see how our model performed:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code># Visualize the data\nplt.scatter(X[:,0], X[:,1], c=y_predicted)<\/code><\/pre><\/div>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"370\" height=\"248\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/download-2.jpg\" alt=\"\" class=\"wp-image-16117\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/download-2.jpg 370w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/download-2-300x201.jpg 300w\" sizes=\"(max-width: 370px) 100vw, 370px\" \/><\/figure><\/div>\n\n\n<p>Here we can easily spot two clusters, they even resemble an eye (I&#8217;m tempted to change the colors to make it look like the eye of Sauron). 
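One detail worth knowing before looking at the evaluation metrics: DBSCAN labels noisy samples with -1, which is why cluster counts typically filter that value out. A tiny sketch on made-up points:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight blobs plus one far-away point that should be flagged as noise
X_demo = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
                   [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
                   [100.0, 100.0]])

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X_demo)
print(labels)  # the isolated last point gets the noise label -1

n_clusters = len(set(labels) - {-1})
print(n_clusters)  # 2
```

The far-away point has no neighbors within eps, so it never becomes part of any cluster and is reported as noise rather than as a cluster of its own.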
Every model has its own performance metrics, so let&#8217;s check out the main ones for clustering.<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code># Evaluation Metrics\nprint(&#39;Number of clusters: {}&#39;.format(len(set(y_predicted[np.where(y_predicted != -1)]))))\nprint(&#39;Homogeneity: {}&#39;.format(metrics.homogeneity_score(y, y_predicted)))\nprint(&#39;Completeness: {}&#39;.format(metrics.completeness_score(y, y_predicted)))<\/code><\/pre><\/div>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-plain\"><code>Number of clusters: 2\nHomogeneity: 1.0000000000000007\nCompleteness: 0.9691231370370732<\/code><\/pre><\/div>\n\n\n\n<p>What would happen if we changed the <code>eps<\/code> value to 0.4?<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>y_predicted = DBSCAN(eps=0.4, min_samples=10).fit_predict(X)\nplt.scatter(X[:,0], X[:,1], c=&#39;orangered&#39;)\nplt.title(&#39;I see you!&#39;)<\/code><\/pre><\/div>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"370\" height=\"264\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/download-1-1.jpg\" alt=\"\" class=\"wp-image-16118\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/download-1-1.jpg 370w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/download-1-1-300x214.jpg 300w\" sizes=\"(max-width: 370px) 100vw, 370px\" \/><figcaption>Couldn&#8217;t resist<\/figcaption><\/figure><\/div>\n\n\n<p>For a more hands-on experience in solving problems with clustering, check out our <a href=\"https:\/\/algotrading101.com\/learn\/cluster-analysis-guide\/\">article<\/a> on finding trading pairs for the pairs trading strategy with machine learning.<\/p>\n\n\n\n<a name=\"other-clustering\">\n\n\n\n<h2 class=\"wp-block-heading\">Other Sklearn clustering 
models<\/h2>\n\n\n\n<p>Depending on the clustering problem, you might want to use other clustering algorithms; the most popular ones are K-Means, Hierarchical, Affinity Propagation, and Gaussian mixture clustering.<\/p>\n\n\n\n<p>If you want to learn the in-depth theory behind clustering and get introduced to various models and the math behind them, go <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/clustering.html\">here<\/a>.<\/p>\n\n\n\n<a name=\"dimensionality\">\n\n\n\n<h2 class=\"wp-block-heading\">Sklearn Dimensionality Reduction &#8211; Reducing random variables<\/h2>\n\n\n\n<p>Dimensionality reduction is a method where we want to shrink the size of data while preserving the most important information in it. In Sklearn these methods can be accessed via the <code>sklearn.decomposition<\/code> module.<\/p>\n\n\n\n<p>As humans, we usually think in 4 dimensions (if you count time as one) up to a maximum of 6-7 if you are a quantum physicist. Data can easily go beyond that, and we need to reduce it to lower dimensions so it can be observed.<\/p>\n\n\n\n<a name=\"pca\">\n\n\n\n<h3 class=\"wp-block-heading\">Sklearn PCA<\/h3>\n\n\n\n<p>PCA (Principal Component Analysis) is a linear technique for dimensionality reduction. It basically does a linear mapping of the data to a lower dimension while maximizing the preserved variance of the data.<\/p>\n\n\n\n<p>PCA can be used for an easier visualization of data and as a preprocessing step to speed up the performance of other machine learning algorithms. 
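One quantity worth inspecting after fitting a PCA is explained_variance_ratio_, which reports the share of total variance each kept component preserves. A small sketch on synthetic data that is essentially 2-dimensional plus noise (made-up data, not the iris set):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
base = rng.normal(size=(200, 2))
# Four features, but the last two are near-copies of the first two
X_demo = np.hstack([base, base + rng.normal(scale=0.05, size=(200, 2))])

pca = PCA(n_components=2).fit(X_demo)
print(pca.explained_variance_ratio_)        # two ratios, largest first
print(pca.explained_variance_ratio_.sum())  # close to 1, little information lost
```

Because the data only really varies in two directions, two components capture almost all of the variance; on genuinely 4-dimensional data the sum would be noticeably lower.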
Let&#8217;s go back to our iris dataset and make a 2d visualization from its 4d structure.<\/p>\n\n\n\n<p>Firstly, we will load the required libraries, obtain the dataset, scale the data and check how many dimensions we have:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>from sklearn.datasets import load_iris\nfrom sklearn.decomposition import PCA\nfrom sklearn.preprocessing import StandardScaler\nimport matplotlib.pyplot as plt\nimport pandas as pd\n%matplotlib inline\n\n# Load the data and scale it\nX, y = load_iris(return_X_y=True)\nX = StandardScaler().fit_transform(X)\nprint(f&#39;The number of dimensions in X is {X.shape[1]}&#39;)<\/code><\/pre><\/div>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-plain\"><code>The number of dimensions in X is 4<\/code><\/pre><\/div>\n\n\n\n<p>Now we will set our PCA and fit it to the data:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code># Load PCA and specify the number of dimensions aka components\npca = PCA(n_components=2)\npc = pca.fit_transform(X)\nprint(f&#39;The number of reduced dimensions is {pc.shape[1]}&#39;)<\/code><\/pre><\/div>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-plain\"><code>The number of reduced dimensions is 2<\/code><\/pre><\/div>\n\n\n\n<p>Let&#8217;s store the data into a pandas data frame and recode the numerical target features to categorical:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code># Put the data into a pandas data frame\ndf = pd.DataFrame(data = pc, columns = [&#39;pc_1&#39;, &#39;pc_2&#39;])\ndf[&#39;target&#39;] = y\ndf.head()<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"339\" height=\"251\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/16-7.jpg\" 
alt=\"\" class=\"wp-image-16119\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/16-7.jpg 339w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/16-7-300x222.jpg 300w\" sizes=\"(max-width: 339px) 100vw, 339px\" \/><\/figure>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code># Recode the numerical data to categorical\ndef recoding(data):\n    if data == 0:\n        return &#39;iris-setosa&#39;\n    elif data == 1:\n        return &#39;iris-versicolor&#39;\n    else:\n        return &#39;iris-virginica&#39;\n    \ndf[&#39;target&#39;] = df[&#39;target&#39;].apply(recoding)\ndf.head()<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"376\" height=\"238\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/17-8.jpg\" alt=\"\" class=\"wp-image-16120\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/17-8.jpg 376w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/17-8-300x190.jpg 300w\" sizes=\"(max-width: 376px) 100vw, 376px\" \/><\/figure>\n\n\n\n<p>And now for the finale with plot the data:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code># Plot the data\nfig = plt.figure(figsize = (12,10))\nax = fig.add_subplot(1,1,1) \nax.set_xlabel(&#39;Principal Component 1&#39;, fontsize = 17)\nax.set_ylabel(&#39;Principal Component 2&#39;, fontsize = 17)\nax.set_title(&#39;2 component PCA&#39;, fontsize = 20)\ntargets = [&#39;iris-setosa&#39;, &#39;iris-versicolor&#39;, &#39;iris-virginica&#39;]\ncolors = [&#39;r&#39;, &#39;g&#39;, &#39;b&#39;]\nfor target, color in zip(targets,colors):\n    indicesToKeep = df[&#39;target&#39;] == target\n    ax.scatter(df.loc[indicesToKeep, &#39;pc_1&#39;],\n               df.loc[indicesToKeep, &#39;pc_2&#39;],\n               c = color,\n         
      s = 50)\nax.legend(targets)\nax.grid()<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"727\" height=\"618\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/download-2-1.jpg\" alt=\"\" class=\"wp-image-16121\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/download-2-1.jpg 727w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/download-2-1-300x255.jpg 300w\" sizes=\"(max-width: 727px) 100vw, 727px\" \/><\/figure>\n\n\n\n<p>As you can see, we basically compressed the 4d data into an observable 2d form. In this case, we can say that the algorithm discovered the petals and sepals because we had the width and length of both.<\/p>\n\n\n\n<a name=\"other-dimensionality\">\n\n\n\n<h3 class=\"wp-block-heading\">Other Sklearn Dimensionality Reduction models<\/h3>\n\n\n\n<p>There are other Dimensionality Reduction models in Sklearn that you might prefer for certain problems, such as the ICA, IPCA, NMF, LDA, and Factor Analysis models.<\/p>\n\n\n\n<p>For a more in-depth look go <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/decomposition.html#decompositions\">here<\/a>.<\/p>\n\n\n\n<a name=\"common-machine-learning-testing-mistakes\">\n\n\n\n<h2 class=\"wp-block-heading\">What are the 3 Common Machine Learning Analysis\/Testing Mistakes?<\/h2>\n\n\n\n<p>When you run your analysis, there are 3 common mistakes to take note of:<\/p>\n\n\n\n<ul><li>Overfitting<\/li><li>Look-ahead Bias<\/li><li>P-hacking<\/li><\/ul>\n\n\n\n<p>Do check out this lecture PDF to learn more:&nbsp;<a href=\"https:\/\/course.algotrading101.com\/courses\/pt101-practical-python-for-finance-trading-masterclass\/lectures\/27360454\">3 Big Mistakes of Backtesting \u2013 1) Overfitting 2) Look-Ahead Bias 3) P-Hacking<\/a><\/p>\n\n\n\n<a name=\"full-code\">\n\n\n\n<h2 class=\"wp-block-heading\">Full Code<\/h2>\n\n\n\n<p><a 
href=\"https:\/\/github.com\/IgorWounds\/Sklearn-Introduction-Guide\">GitHub Link<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<div class=\"pvc_clear\"><\/div>\n<p id=\"pvc_stats_9302\" class=\"pvc_stats total_only  \" data-element-id=\"9302\" style=\"\"><i class=\"pvc-stats-icon medium\" aria-hidden=\"true\"><svg aria-hidden=\"true\" focusable=\"false\" data-prefix=\"far\" data-icon=\"chart-bar\" role=\"img\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" viewBox=\"0 0 512 512\" class=\"svg-inline--fa fa-chart-bar fa-w-16 fa-2x\"><path fill=\"currentColor\" d=\"M396.8 352h22.4c6.4 0 12.8-6.4 12.8-12.8V108.8c0-6.4-6.4-12.8-12.8-12.8h-22.4c-6.4 0-12.8 6.4-12.8 12.8v230.4c0 6.4 6.4 12.8 12.8 12.8zm-192 0h22.4c6.4 0 12.8-6.4 12.8-12.8V140.8c0-6.4-6.4-12.8-12.8-12.8h-22.4c-6.4 0-12.8 6.4-12.8 12.8v198.4c0 6.4 6.4 12.8 12.8 12.8zm96 0h22.4c6.4 0 12.8-6.4 12.8-12.8V204.8c0-6.4-6.4-12.8-12.8-12.8h-22.4c-6.4 0-12.8 6.4-12.8 12.8v134.4c0 6.4 6.4 12.8 12.8 12.8zM496 400H48V80c0-8.84-7.16-16-16-16H16C7.16 64 0 71.16 0 80v336c0 17.67 14.33 32 32 32h464c8.84 0 16-7.16 16-16v-16c0-8.84-7.16-16-16-16zm-387.2-48h22.4c6.4 0 12.8-6.4 12.8-12.8v-70.4c0-6.4-6.4-12.8-12.8-12.8h-22.4c-6.4 0-12.8 6.4-12.8 12.8v70.4c0 6.4 6.4 12.8 12.8 12.8z\" class=\"\"><\/path><\/svg><\/i> <img decoding=\"async\" width=\"16\" height=\"16\" alt=\"Loading\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/plugins\/page-views-count\/ajax-loader-2x.gif\" border=0 \/><\/p>\n<div class=\"pvc_clear\"><\/div>\n<p>Table of Contents What is Sklearn? What is Sklearn used for? How to download Sklearn for Python? How to pick the best scikit-learn model? 
Sklearn preprocessing &#8211; Prepare the data for analysis Sklearn feature encoding Sklearn data scaling Sklearn missing values Sklearn train test split Sklearn Regression &#8211; Predict the future Sklearn Linear Regression Other [&hellip;]<\/p>\n","protected":false},"author":14,"featured_media":9305,"comment_status":"closed","ping_status":"open","sticky":true,"template":"","format":"standard","meta":{"_lmt_disableupdate":"no","_lmt_disable":"no","_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0},"categories":[3],"tags":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.7 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Sklearn - An Introduction Guide to Machine Learning - AlgoTrading101 Blog<\/title>\n<meta name=\"description\" content=\"Sklearn (scikit-learn) is a Python library that provides a wide range of unsupervised and supervised machine learning algorithms.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/algotrading101.com\/learn\/sklearn-guide\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Sklearn - An Introduction Guide to Machine Learning - AlgoTrading101 Blog\" \/>\n<meta property=\"og:description\" content=\"Sklearn (scikit-learn) is a Python library that provides a wide range of unsupervised and supervised machine learning algorithms.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/algotrading101.com\/learn\/sklearn-guide\/\" \/>\n<meta property=\"og:site_name\" content=\"Quantitative Trading Ideas and Guides - AlgoTrading101 Blog\" \/>\n<meta property=\"article:published_time\" content=\"2021-05-11T20:02:41+00:00\" \/>\n<meta property=\"article:modified_time\" 
content=\"2023-04-03T21:09:11+00:00\" \/>\n<meta property=\"og:image\" content=\"http:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/05\/sklearn-guide.png\" \/>\n\t<meta property=\"og:image:width\" content=\"2152\" \/>\n\t<meta property=\"og:image:height\" content=\"864\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Igor Radovanovic\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Igor Radovanovic\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"27 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Sklearn - An Introduction Guide to Machine Learning - AlgoTrading101 Blog","description":"Sklearn (scikit-learn) is a Python library that provides a wide range of unsupervised and supervised machine learning algorithms.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/algotrading101.com\/learn\/sklearn-guide\/","og_locale":"en_US","og_type":"article","og_title":"Sklearn - An Introduction Guide to Machine Learning - AlgoTrading101 Blog","og_description":"Sklearn (scikit-learn) is a Python library that provides a wide range of unsupervised and supervised machine learning algorithms.","og_url":"https:\/\/algotrading101.com\/learn\/sklearn-guide\/","og_site_name":"Quantitative Trading Ideas and Guides - AlgoTrading101 Blog","article_published_time":"2021-05-11T20:02:41+00:00","article_modified_time":"2023-04-03T21:09:11+00:00","og_image":[{"width":2152,"height":864,"url":"http:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/05\/sklearn-guide.png","type":"image\/png"}],"author":"Igor Radovanovic","twitter_card":"summary_large_image","twitter_misc":{"Written 
by":"Igor Radovanovic","Est. reading time":"27 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/algotrading101.com\/learn\/sklearn-guide\/#article","isPartOf":{"@id":"https:\/\/algotrading101.com\/learn\/sklearn-guide\/"},"author":{"name":"Igor Radovanovic","@id":"https:\/\/algotrading101.com\/learn\/#\/schema\/person\/a7ae60c112a73b7c3fe14ac56726a0ae"},"headline":"Sklearn &#8211; An Introduction Guide to Machine Learning","datePublished":"2021-05-11T20:02:41+00:00","dateModified":"2023-04-03T21:09:11+00:00","mainEntityOfPage":{"@id":"https:\/\/algotrading101.com\/learn\/sklearn-guide\/"},"wordCount":3816,"publisher":{"@id":"https:\/\/algotrading101.com\/learn\/#organization"},"articleSection":["Programming"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/algotrading101.com\/learn\/sklearn-guide\/","url":"https:\/\/algotrading101.com\/learn\/sklearn-guide\/","name":"Sklearn - An Introduction Guide to Machine Learning - AlgoTrading101 Blog","isPartOf":{"@id":"https:\/\/algotrading101.com\/learn\/#website"},"datePublished":"2021-05-11T20:02:41+00:00","dateModified":"2023-04-03T21:09:11+00:00","description":"Sklearn (scikit-learn) is a Python library that provides a wide range of unsupervised and supervised machine learning algorithms.","breadcrumb":{"@id":"https:\/\/algotrading101.com\/learn\/sklearn-guide\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/algotrading101.com\/learn\/sklearn-guide\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/algotrading101.com\/learn\/sklearn-guide\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/algotrading101.com\/learn\/"},{"@type":"ListItem","position":2,"name":"Sklearn &#8211; An Introduction Guide to Machine Learning"}]},{"@type":"WebSite","@id":"https:\/\/algotrading101.com\/learn\/#website","url":"https:\/\/algotrading101.com\/learn\/","name":"Quantitative Trading 
Ideas and Guides - AlgoTrading101 Blog","description":"Authentic Stories about Algorithmic trading, coding and life.","publisher":{"@id":"https:\/\/algotrading101.com\/learn\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/algotrading101.com\/learn\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/algotrading101.com\/learn\/#organization","name":"AlgoTrading101","url":"https:\/\/algotrading101.com\/learn\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/algotrading101.com\/learn\/#\/schema\/logo\/image\/","url":"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2020\/11\/AlgoTrading101-Lucas-Liew.jpg","contentUrl":"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2020\/11\/AlgoTrading101-Lucas-Liew.jpg","width":1200,"height":627,"caption":"AlgoTrading101"},"image":{"@id":"https:\/\/algotrading101.com\/learn\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/algotrading101.com\/learn\/#\/schema\/person\/a7ae60c112a73b7c3fe14ac56726a0ae","name":"Igor Radovanovic","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/algotrading101.com\/learn\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/d46175c509b3ee240a1e2bbe735a4d1e?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/d46175c509b3ee240a1e2bbe735a4d1e?s=96&d=mm&r=g","caption":"Igor Radovanovic"},"sameAs":["https:\/\/igorradovanovic.com","https:\/\/www.linkedin.com\/in\/igor-radovanovic-profile"],"url":"https:\/\/algotrading101.com\/learn\/author\/igor\/"}]}},"modified_by":"Igor 
Radovanovic","_links":{"self":[{"href":"https:\/\/algotrading101.com\/learn\/wp-json\/wp\/v2\/posts\/9302"}],"collection":[{"href":"https:\/\/algotrading101.com\/learn\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/algotrading101.com\/learn\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/algotrading101.com\/learn\/wp-json\/wp\/v2\/users\/14"}],"replies":[{"embeddable":true,"href":"https:\/\/algotrading101.com\/learn\/wp-json\/wp\/v2\/comments?post=9302"}],"version-history":[{"count":9,"href":"https:\/\/algotrading101.com\/learn\/wp-json\/wp\/v2\/posts\/9302\/revisions"}],"predecessor-version":[{"id":21156,"href":"https:\/\/algotrading101.com\/learn\/wp-json\/wp\/v2\/posts\/9302\/revisions\/21156"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/algotrading101.com\/learn\/wp-json\/wp\/v2\/media\/9305"}],"wp:attachment":[{"href":"https:\/\/algotrading101.com\/learn\/wp-json\/wp\/v2\/media?parent=9302"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/algotrading101.com\/learn\/wp-json\/wp\/v2\/categories?post=9302"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/algotrading101.com\/learn\/wp-json\/wp\/v2\/tags?post=9302"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}