{"id":10535,"date":"2021-07-23T09:46:03","date_gmt":"2021-07-23T09:46:03","guid":{"rendered":"http:\/\/algotrading101.com\/learn\/?p=10535"},"modified":"2023-04-03T21:06:35","modified_gmt":"2023-04-03T21:06:35","slug":"pyspark-guide","status":"publish","type":"post","link":"https:\/\/algotrading101.com\/learn\/pyspark-guide\/","title":{"rendered":"PySpark &#8211; A Beginner&#8217;s Guide to Apache Spark and Big Data"},"content":{"rendered":"<figure class=\"wp-block-image size-full is-resized\"><img fetchpriority=\"high\" decoding=\"async\" 
src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/Apache_Spark_logo.svg_.jpg\" alt=\"PySpark Apache Spark Big Data\" class=\"wp-image-15957\" width=\"400\" height=\"208\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/Apache_Spark_logo.svg_.jpg 800w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/Apache_Spark_logo.svg_-300x156.jpg 300w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/Apache_Spark_logo.svg_-768x398.jpg 768w\" sizes=\"(max-width: 400px) 100vw, 400px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Table of Contents:<\/h3>\n\n\n\n<ol><li><a href=\"#what-is-pyspark\">What is PySpark?<\/a><\/li><li><a href=\"#what-is-apache-spark\">What is Apache Spark?<\/a><\/li><li><a href=\"#apache-spark-use\">What is Apache Spark used for?<\/a><\/li><li><a href=\"#pyspark-use\">What is PySpark used for?<\/a><\/li><li><a href=\"#free\">Is Apache Spark free?<\/a><\/li><li><a href=\"#apache-spark-pros\">Why should I use Apache Spark?<\/a><\/li><li><a href=\"#apache-spark-cons\">Why shouldn\u2019t I use Apache Spark?<\/a><\/li><li><a href=\"#pyspark-pros\">Why should I use PySpark?<\/a><\/li><li><a href=\"#pyspark-cons\">Why shouldn\u2019t I use PySpark?<\/a><\/li><li><a href=\"#apache-spark-alternatives\">What are some Apache Spark alternatives?<\/a><\/li><li><a href=\"#apache-spark-clients\">What are some Apache Spark clients?<\/a><\/li><li><a href=\"#getting-started\">How to get started with Apache Spark?<\/a><ol><li><a href=\"#prerequisites\">Prerequisites<\/a><\/li><li><a href=\"#download\">Download and set-up<\/a><\/li><li><a href=\"#launch\">Launch Spark<\/a><\/li><\/ol><\/li><li><a href=\"#apache-spark-components\">What are the main components of Apache Spark?<\/a><\/li><li><a href=\"#rdd\">What is the Apache Spark RDD?<\/a><\/li><li><a href=\"#use-pyspark-notebook\">How to use PySpark in Jupyter Notebooks?<\/a><\/li><li><a href=\"#data\">Meet the 
Data<\/a><\/li><li><a href=\"#pyspark-session\">How to start a PySpark session?<\/a><\/li><li><a href=\"#pyspark-rdd\">How to create an RDD in PySpark?<\/a><\/li><li><a href=\"#pyspark-data\">How to load data in PySpark?<\/a><\/li><li><a href=\"#pyspark-functions\">What are the most common PySpark functions?<\/a><ol><li><a href=\"#select\">select<\/a><\/li><li><a href=\"#filter\">filter<\/a><\/li><li><a href=\"#map\">map<\/a><\/li><li><a href=\"#reduce\">reduce<\/a><\/li><\/ol><\/li><li><a href=\"#convert-rdd\">How to convert an RDD to a DataFrame in PySpark?<\/a><\/li><li><a href=\"#pyspark-preprocess\">How to preprocess data with PySpark?<\/a><\/li><li><a href=\"#pyspark-ml\">How to run a Machine Learning model with PySpark?<\/a><\/li><li><a href=\"#full-code\">Full code<\/a><\/li><\/ol>\n\n\n\n<a name=\"what-is-pyspark\">\n\n\n\n<h2 class=\"wp-block-heading\">What is PySpark?<\/h2>\n\n\n\n<p>PySpark is a Python library that serves as an interface for Apache Spark.<\/p>\n\n\n\n<p>Link: <a href=\"https:\/\/spark.apache.org\">https:\/\/spark.apache.org<\/a><\/p>\n\n\n\n<a name=\"what-is-apache-spark\">\n\n\n\n<h2 class=\"wp-block-heading\">What is Apache Spark?<\/h2>\n\n\n\n<p>Apache Spark is an open-source distributed computing engine that is used for Big Data processing.<\/p>\n\n\n\n<p>It is a general-purpose engine as it supports Python, R, SQL, Scala, and Java.<\/p>\n\n\n\n<a name=\"apache-spark-use\">\n\n\n\n<h2 class=\"wp-block-heading\">What is Apache Spark used for?<\/h2>\n\n\n\n<p>Apache Spark is often used with Big Data as it allows for distributed computing and it offers built-in data streaming, machine learning, SQL, and graph processing. It is often used by data engineers and data scientists.<\/p>\n\n\n\n<a name=\"pyspark-use\">\n\n\n\n<h2 class=\"wp-block-heading\">What is PySpark used for?<\/h2>\n\n\n\n<p>PySpark is used as an API for Apache Spark. 
This allows us to leave the Apache Spark terminal and enter our preferred Python programming IDE without losing what Apache Spark has to offer.<\/p>\n\n\n\n<a name=\"free\">\n\n\n\n<h2 class=\"wp-block-heading\">Is Apache Spark free?<\/h2>\n\n\n\n<p>Apache Spark is an open-source engine and thus it is completely free to download and use.<\/p>\n\n\n\n<a name=\"apache-spark-pros\">\n\n\n\n<h2 class=\"wp-block-heading\">Why should I use Apache Spark?<\/h2>\n\n\n\n<ul><li>Apache Spark offers distributed computing<\/li><li>Apache Spark is easy to use<\/li><li>Apache Spark is free<\/li><li>Offers advanced analytics<\/li><li>Is a very powerful engine<\/li><li>Offers machine learning, streaming, SQL, and graph processing modules<\/li><li>Is applicable to various programming languages like Python, R, Java&#8230;<\/li><li>Has a good community and is advancing as a product<\/li><\/ul>\n\n\n\n<a name=\"apache-spark-cons\">\n\n\n\n<h2 class=\"wp-block-heading\">Why shouldn&#8217;t I use Apache Spark?<\/h2>\n\n\n\n<ul><li>Apache Spark can have scaling problems with compute-intensive jobs<\/li><li>It can consume a lot of memory<\/li><li>Can have issues with small files<\/li><li>Is constrained by the number of available ML algorithms<\/li><\/ul>\n\n\n\n<a name=\"pyspark-pros\">\n\n\n\n<h2 class=\"wp-block-heading\">Why should I use PySpark?<\/h2>\n\n\n\n<ul><li>PySpark is easy to use<\/li><li>PySpark can handle synchronization errors<\/li><li>The learning curve isn&#8217;t as steep as in other languages like Scala<\/li><li>Can easily handle big data<\/li><li>Has all the pros of Apache Spark added to it<\/li><\/ul>\n\n\n\n<a name=\"pyspark-cons\">\n\n\n\n<h2 class=\"wp-block-heading\">Why shouldn&#8217;t I use PySpark?<\/h2>\n\n\n\n<ul><li>PySpark can be less efficient as it uses Python<\/li><li>It is slow when compared to other languages like Scala<\/li><li>It can be replaced with other libraries like Dask that easily integrate with Pandas (depends on the problem and 
dataset)<\/li><li>Suffers from all the cons of Apache Spark<\/li><\/ul>\n\n\n\n<a name=\"apache-spark-alternatives\">\n\n\n\n<h2 class=\"wp-block-heading\">What are some Apache Spark alternatives?<\/h2>\n\n\n\n<p>Apache Spark can be replaced with some alternatives and they are the following:<\/p>\n\n\n\n<ul><li>Apache Hadoop<\/li><li>Google BigQuery<\/li><li>Amazon EMR<\/li><li>IBM Analytics Engine<\/li><li>Apache Flink<\/li><li>Lumify<\/li><li>Presto<\/li><li>Apache Pig<\/li><\/ul>\n\n\n\n<a name=\"apache-spark-clients\">\n\n\n\n<h2 class=\"wp-block-heading\">What are some Apache Spark clients?<\/h2>\n\n\n\n<p>Some of the programming clients that have Apache Spark APIs are the following:<\/p>\n\n\n\n<ul><li><a href=\"https:\/\/spark.apache.org\/docs\/latest\/api\/python\/index.html\">Python<\/a><\/li><li><a href=\"https:\/\/spark.apache.org\/docs\/latest\/api\/scala\/org\/apache\/spark\/index.html\">Scala<\/a><\/li><li><a href=\"https:\/\/spark.apache.org\/docs\/latest\/api\/java\/index.html\">Java<\/a><\/li><li><a href=\"https:\/\/spark.apache.org\/docs\/latest\/api\/R\/index.html\">R<\/a><\/li><li><a href=\"https:\/\/spark.apache.org\/docs\/latest\/api\/sql\/index.html\">SQL<\/a><\/li><\/ul>\n\n\n\n<a name=\"getting-started\">\n\n\n\n<h2 class=\"wp-block-heading\">How to get started with Apache Spark?<\/h2>\n\n\n\n<p>In order to get started with Apache Spark and the PySpark library, we will need to go through multiple steps. This can be a bit confusing if you have never done something similar, but don&#8217;t worry. We will do it together!<\/p>\n\n\n\n<a name=\"prerequisites\">\n\n\n\n<h3 class=\"wp-block-heading\">Prerequisites<\/h3>\n\n\n\n<p>The first things that we need to take care of are the prerequisites needed to make Apache Spark and PySpark work. These prerequisites are Java 8, Python 3, and something to extract .tar files.<\/p>\n\n\n\n<p>Let&#8217;s see what Java version you&#8217;re rocking on your computer. 
If you&#8217;re on Windows like me, go to Start, type <code>cmd<\/code>, and enter the Command Prompt. When there, type the following command:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>java -version<\/code><\/pre>\n\n\n\n<p>And you&#8217;ll get a message similar to this one that will specify your Java version:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>java version \"1.8.0_281\"<\/code><\/pre>\n\n\n\n<p>If you didn&#8217;t get a response, you don&#8217;t have Java installed. If your Java is outdated (&lt; 8) or non-existent, go over to the following <a href=\"https:\/\/java.com\/en\/download\/\">link<\/a> and download the latest version.<\/p>\n\n\n\n<p>If you, for some reason, don&#8217;t have Python installed, here is a <a href=\"https:\/\/www.python.org\/\">link<\/a> to download it. And lastly, for the extraction of .tar files, I use <a href=\"https:\/\/www.7-zip.org\/download.html\">7-zip<\/a>. You can use anything that does the job.<\/p>\n\n\n\n<a name=\"download\">\n\n\n\n<h3 class=\"wp-block-heading\">Download and set-up<\/h3>\n\n\n\n<p>Go over to the following <a href=\"https:\/\/spark.apache.org\/downloads.html\">link<\/a> and download the 3.0.3 Spark release that is pre-built for Apache Hadoop 2.7.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"915\" height=\"241\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/1-6.jpg\" alt=\"\" class=\"wp-image-15959\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/1-6.jpg 915w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/1-6-300x79.jpg 300w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/1-6-768x202.jpg 768w\" sizes=\"(max-width: 915px) 100vw, 915px\" \/><\/figure>\n\n\n\n<p>Now click the blue link under number 3 and select one of the mirrors that you would like to download from. 
While it is downloading create a folder named Spark in your root drive (C:).<\/p>\n\n\n\n<p>Go into that folder and extract the downloaded file into it. The next thing that you need to add is the winutils.exe file for the underlying Hadoop version that Spark will be utilizing.<\/p>\n\n\n\n<p>To do this, go over to the following <a href=\"https:\/\/github.com\/cdarlint\/winutils\">GitHub page<\/a> and select the version of Hadoop that we downloaded. After that, scroll down until you see the winutils.exe file. Click on it and download it.<\/p>\n\n\n\n<figure class=\"wp-block-gallery has-nested-images columns-default is-cropped wp-block-gallery-1 is-layout-flex\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"467\" height=\"434\" data-id=\"15960\"  src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/2-7.jpg\" alt=\"\" class=\"wp-image-15960\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/2-7.jpg 467w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/2-7-300x279.jpg 300w\" sizes=\"(max-width: 467px) 100vw, 467px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"417\" height=\"201\" data-id=\"15961\"  src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/3-6.jpg\" alt=\"\" class=\"wp-image-15961\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/3-6.jpg 417w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/3-6-300x145.jpg 300w\" sizes=\"(max-width: 417px) 100vw, 417px\" \/><\/figure>\n<\/figure>\n\n\n\n<p>Now create a new folder in your root drive and name it &#8220;Hadoop&#8221;, then create a folder inside of that folder and name it &#8220;bin&#8221;. Inside the bin folder paste the winutils.exe file that we just downloaded.<\/p>\n\n\n\n<p>Now for the final steps, we need to configure our environmental variables. 
Environmental variables allow us to add Spark and Hadoop to our system PATH. This way we can call Spark in Python as they will be on the same PATH.<\/p>\n\n\n\n<p>Click Start and type &#8220;environment&#8221;. Then select the &#8220;Edit the system environment variables&#8221; option. A new window will pop up and in the lower right corner of it select &#8220;Environment Variables&#8221;.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"404\" height=\"455\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/4-7.jpg\" alt=\"\" class=\"wp-image-15962\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/4-7.jpg 404w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/4-7-266x300.jpg 266w\" sizes=\"(max-width: 404px) 100vw, 404px\" \/><\/figure>\n\n\n\n<p>A new window will appear that will show your environmental variables. In my case, I already have Spark there:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"574\" height=\"207\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/5-6.jpg\" alt=\"\" class=\"wp-image-15963\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/5-6.jpg 574w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/5-6-300x108.jpg 300w\" sizes=\"(max-width: 574px) 100vw, 574px\" \/><\/figure>\n\n\n\n<p>To add it there, click on &#8220;New&#8221;. Then set the name to be &#8220;SPARK_HOME&#8221; and for the Variable value add the path where you downloaded your spark. It should be something like this <code>C:\\Spark\\spark...<\/code> Click OK.<\/p>\n\n\n\n<p>For the next step be sure to be careful and not change your Path. Click on the &#8220;Path&#8221; in your user variables and then select &#8220;Edit&#8221;. 
A new window will appear, click on the &#8220;New&#8221; button and then write this <code>%SPARK_HOME%\\bin<\/code><\/p>\n\n\n\n<p>You&#8217;ve successfully added Spark to your PATH! Now, repeat this process for both Hadoop and Java. The only things that will change will be their locations and the end name that you give to them.<\/p>\n\n\n\n<p>Your end product should look like this:<\/p>\n\n\n\n<figure class=\"wp-block-gallery has-nested-images columns-default is-cropped wp-block-gallery-3 is-layout-flex\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"411\" height=\"172\" data-id=\"15964\"  src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/6-6.jpg\" alt=\"\" class=\"wp-image-15964\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/6-6.jpg 411w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/6-6-300x126.jpg 300w\" sizes=\"(max-width: 411px) 100vw, 411px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"595\" height=\"262\" data-id=\"15965\"  src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/7-5.jpg\" alt=\"\" class=\"wp-image-15965\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/7-5.jpg 595w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/7-5-300x132.jpg 300w\" sizes=\"(max-width: 595px) 100vw, 595px\" \/><\/figure>\n<\/figure>\n\n\n\n<a name=\"launch\">\n\n\n\n<h3 class=\"wp-block-heading\">Launch Spark<\/h3>\n\n\n\n<p>Now let us launch our Spark and see it in its full glory. Start a new command prompt and then enter spark-shell to launch Spark. 
A new window will appear with Spark up and running.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"685\" height=\"173\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/8-6.jpg\" alt=\"\" class=\"wp-image-15966\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/8-6.jpg 685w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/8-6-300x76.jpg 300w\" sizes=\"(max-width: 685px) 100vw, 685px\" \/><\/figure>\n\n\n\n<p>Now open up your browser and go to&nbsp;<code>http:\/\/localhost:4040\/<\/code> (or whatever address the spark-shell startup log printed). This will open up the Apache Spark UI where you will be able to see all the information you might need.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"833\" height=\"394\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/9-4.jpg\" alt=\"\" class=\"wp-image-15967\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/9-4.jpg 833w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/9-4-300x142.jpg 300w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/9-4-768x363.jpg 768w\" sizes=\"(max-width: 833px) 100vw, 833px\" \/><\/figure>\n\n\n\n<a name=\"apache-spark-components\">\n\n\n\n<h2 class=\"wp-block-heading\">What are the main components of Apache Spark?<\/h2>\n\n\n\n<p>There are several components that make up Apache Spark:<\/p>\n\n\n\n<ul><li><strong>Spark Core <\/strong>&#8211; is the main part of Apache Spark that provides in-memory computing and does all the basic I\/O functions, memory management, and much more.<\/li><li><strong>Spark Streaming <\/strong>&#8211; allows for data streaming that can go up to a couple of gigabytes per second.<\/li><li><strong>Spark SQL<\/strong> &#8211; allows the use of SQL (Structured Query Language) for easier 
data manipulation and analysis.<\/li><li><strong>MLlib<\/strong> &#8211; packs many machine learning algorithms that can be used from several programming languages.<\/li><li><strong>GraphX<\/strong> &#8211; provides several methods for applying graph theory to your dataset (i.e. network analysis).<\/li><\/ul>\n\n\n\n<a name=\"rdd\">\n\n\n\n<h2 class=\"wp-block-heading\">What is the Apache Spark RDD?<\/h2>\n\n\n\n<p>Apache Spark RDD (Resilient Distributed Dataset) is a data structure that serves as the main building block. An RDD can be seen as an immutable and partitioned set of data values that can be processed on a distributed system.<\/p>\n\n\n\n<p>To conclude, they are resilient because they are immutable and can be recomputed if a partition is lost, distributed as they have partitions that can be processed in a distributed manner, and datasets as they hold our data.<\/p>\n\n\n\n<a name=\"use-pyspark-notebook\">\n\n\n\n<h2 class=\"wp-block-heading\">How to use PySpark in Jupyter Notebooks?<\/h2>\n\n\n\n<p>To use PySpark in your Jupyter notebook, all you need to do is install the PySpark pip package with the following command:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>pip install pyspark<\/code><\/pre><\/div>\n\n\n\n<p>As your Python is on your system PATH, it will work with your Apache Spark. 
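As a quick sanity check after the pip install, you can confirm that the package is importable before moving on. The `installed` helper below is a hypothetical convenience written for this guide, not part of PySpark:

```python
import importlib.util

def installed(pkg):
    # Quick post-"pip install" sanity check: can this package be imported?
    return importlib.util.find_spec(pkg) is not None

# After a successful "pip install pyspark", installed("pyspark") should
# report True, just as any standard-library module does:
print(installed("json"))  # True
```

If this prints False for "pyspark", the package was likely installed into a different Python environment than the one your notebook is using.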
If you want to use something like <a href=\"https:\/\/algotrading101.com\/learn\/google-colab-guide\/\">Google Colab<\/a> you will need to run the following block of code that will set up Apache Spark for you:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>!apt-get install openjdk-8-jdk-headless -qq &gt; \/dev\/null\n!wget -q https:\/\/www-us.apache.org\/dist\/spark\/spark-3.0.3\/spark-3.0.3-bin-hadoop2.7.tgz\n!tar xf spark-3.0.3-bin-hadoop2.7.tgz\n!pip install -q findspark\nimport os\nos.environ[&quot;JAVA_HOME&quot;] = &quot;\/usr\/lib\/jvm\/java-8-openjdk-amd64&quot;\nos.environ[&quot;SPARK_HOME&quot;] = &quot;\/content\/spark-3.0.3-bin-hadoop2.7&quot;\nimport findspark\nfindspark.init()<\/code><\/pre><\/div>\n\n\n\n<p>If you want to use Kaggle like we&#8217;re going to do, you can just go straight to the &#8220;pip install pyspark&#8221; command as Apache Spark will be ready for use.<\/p>\n\n\n\n<a name=\"data\">\n\n\n\n<h2 class=\"wp-block-heading\">Meet the Data<\/h2>\n\n\n\n<p>The dataset that we are going to use for this article will be the Stock Market Data from 1996 to 2020 which can be found on <a href=\"https:\/\/www.kaggle.com\/aceofit\/stockmarketdatafrom1996to2020\">Kaggle<\/a>. The dataset is 12.32 GB, which is well beyond what pandas can comfortably handle.<\/p>\n\n\n\n<p>For the purpose of this article, we will go over the basics of Apache Spark that will set you up for future use. In the end, we&#8217;ll fit a simple regression algorithm to the data.<\/p>\n\n\n\n<p>We&#8217;ll use Kaggle as our IDE. 
All that you need to do to follow along is to open up a new notebook on the main page of the dataset.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"530\" height=\"370\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/07\/10-1.png\" alt=\"\" class=\"wp-image-10384\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/07\/10-1.png 530w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/07\/10-1-300x209.png 300w\" sizes=\"(max-width: 530px) 100vw, 530px\" \/><\/figure>\n\n\n\n<a name=\"pyspark-session\">\n\n\n\n<h2 class=\"wp-block-heading\">How to start a PySpark session?<\/h2>\n\n\n\n<p>To start a PySpark session you will need to access the builder, specify where the program will run (the master), set the name of the application, and call the session creation method (<code>getOrCreate<\/code>). All of that is done with the following lines of code:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>!pip install pyspark\nfrom pyspark.sql import SparkSession\n\n# Create the Session\nspark = SparkSession.builder\\\n    .master(&quot;local&quot;)\\\n    .appName(&quot;PySpark Tutorial&quot;)\\\n    .getOrCreate()<\/code><\/pre><\/div>\n\n\n\n<a name=\"pyspark-rdd\">\n\n\n\n<h2 class=\"wp-block-heading\">How to create an RDD in PySpark?<\/h2>\n\n\n\n<p>In order to create an RDD in PySpark, all we need to do is initialize the <code>sparkContext <\/code>with the data we want it to have. 
For example, the following code will create an RDD of the FB stock data and show the first two rows:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>sc = spark.sparkContext\nrdd = sc.textFile(&#39;..\/input\/stockmarketdatafrom1996to2020\/Data\/Data\/FB\/FB.csv&#39;)\nrdd.take(2)<\/code><\/pre><\/div>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;'Date,Open,High,Low,Close,Adj Close,Volume',\n '2012-05-18,42.049999,45.000000,38.000000,38.230000,38.230000,573576400']<\/code><\/pre>\n\n\n\n<a name=\"pyspark-data\">\n\n\n\n<h2 class=\"wp-block-heading\">How to load data in PySpark?<\/h2>\n\n\n\n<p>To load data in PySpark you will often use the <code>.read.file_type()<\/code> function with the specified path to your desired file. To import our dataset, we use the following command:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>stock_1 = spark.read.csv(&#39;..\/input\/stockmarketdatafrom1996to2020\/Data\/Data\/AAPL\/AAPL.csv&#39;,\\\n                         inferSchema=True, header=True)\nstock_1.show(5)<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"648\" height=\"205\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/11-7.jpg\" alt=\"\" class=\"wp-image-15968\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/11-7.jpg 648w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/11-7-300x95.jpg 300w\" sizes=\"(max-width: 648px) 100vw, 648px\" \/><\/figure>\n\n\n\n<p>To find your data path you can simply navigate the Data section on the right side of your screen and copy the path to the desired file. 
In our case, I selected a random stock from the Data folder that has all stocks in it.<\/p>\n\n\n\n<p>The <code>inferSchema <\/code>parameter will automatically infer the input schema from our data and the <code>header <\/code>parameter will use the first row as the column names. After the data is loaded, we print out the first 5 rows.<\/p>\n\n\n\n<p>You could try loading all the stocks from the Data folder, but that would take too long, and the goal of this article is to show you how to find your way around Apache Spark. <\/p>\n\n\n\n<p>To list all of them and their directories you can run the following code:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>from pathlib import Path\ncontents = list(Path(&#39;..\/input\/stockmarketdatafrom1996to2020\/Data\/Data&#39;).iterdir())\nprint(contents)<\/code><\/pre><\/div>\n\n\n\n<p>Let&#8217;s get the second stock ready for when we do the regression:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>stock_2 = spark.read.csv(&#39;..\/input\/stockmarketdatafrom1996to2020\/Data\/Data\/MSFT\/MSFT.csv&#39;,\\\n                         inferSchema=True, header=True)<\/code><\/pre><\/div>\n\n\n\n<p>You can also check the schema of your data frame:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>stock_1.printSchema()<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"385\" height=\"182\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/12-4.jpg\" alt=\"\" class=\"wp-image-15969\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/12-4.jpg 385w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/12-4-300x142.jpg 300w\" sizes=\"(max-width: 385px) 100vw, 385px\" \/><\/figure>\n\n\n\n<a 
name=\"pyspark-functions\">\n\n\n\n<h2 class=\"wp-block-heading\">What are the most common PySpark functions?<\/h2>\n\n\n\n<p>Some of the most common PySpark functions that you will probably be using are the <code>select<\/code>, <code>filter<\/code>, <code>reduce<\/code>, <code>map<\/code>, and more. I&#8217;ll showcase each one of them in an easy-to-understand manner.<\/p>\n\n\n\n<a name=\"select\">\n\n\n\n<h3 class=\"wp-block-heading\">select<\/h3>\n\n\n\n<p>The <code>select <\/code>function is often used when we want to see or create a subset of our data. In order to do this, we specify the column names. For example, let&#8217;s home in on the closing prices of the AAPL stock data:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>stock_1.select(&quot;Close&quot;).show(10)<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"277\" height=\"301\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/13-5.jpg\" alt=\"\" class=\"wp-image-15970\"\/><\/figure>\n\n\n\n<a name=\"filter\">\n\n\n\n<h3 class=\"wp-block-heading\">filter<\/h3>\n\n\n\n<p>The <code>filter<\/code> function will apply a filter on the data that you have specified. 
For example, we can show only the top 10 AAPL closing prices that are above $148 with their timestamps.<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>from pyspark.sql import functions as F\n\nstock_1.filter(F.col(&quot;Close&quot;)&gt;148.00).select(&quot;Date&quot;,&quot;Close&quot;).show(10)<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"253\" height=\"304\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/14-5.jpg\" alt=\"\" class=\"wp-image-15971\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/14-5.jpg 253w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/14-5-250x300.jpg 250w\" sizes=\"(max-width: 253px) 100vw, 253px\" \/><\/figure>\n\n\n\n<a name=\"map\">\n\n\n\n<h3 class=\"wp-block-heading\">map<\/h3>\n\n\n\n<p>The <code>map <\/code>function will allow us to parse the previously created RDD. 
For example, we can parse the values in it and create a list out of each row.<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>rdd = rdd.map(lambda line: line.split(&quot;,&quot;))\nrdd.top(5)<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"565\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/15-4.jpg\" alt=\"\" class=\"wp-image-15972\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/15-4.jpg 600w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/15-4-300x283.jpg 300w\" sizes=\"(max-width: 600px) 100vw, 600px\" \/><\/figure>\n\n\n\n<a name=\"reduce\">\n\n\n\n<h3 class=\"wp-block-heading\">reduce<\/h3>\n\n\n\n<p>The <code>reduce<\/code> function will allow us to &#8220;reduce&#8221; the values by aggregating them aka by doing various calculations like counting, summing, dividing, and similar. <\/p>\n\n\n\n<p>For example, let&#8217;s create an RDD with random numbers and sum them.<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>num = sc.parallelize([23, 1, 4, 5, 6, 7])\nnum_sum = num.reduce(lambda a,b:a+b)\nprint(num_sum)<\/code><\/pre><\/div>\n\n\n\n<p>46<\/p>\n\n\n\n<a name=\"convert-rdd\">\n\n\n\n<h2 class=\"wp-block-heading\">How to convert an RDD to a DataFrame in PySpark?<\/h2>\n\n\n\n<p>To convert an RDD to a DataFrame in PySpark, you will need to utilize the <code>map<\/code>, <code>sql.Row<\/code> and <code>toDF <\/code>functions while specifying the column names and value lines. 
Let&#8217;s take our previously parsed FB stock RDD and convert it:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>from pyspark.sql import Row\n\nheader = rdd.first()\nstock_3 = rdd.filter(lambda line: line != header)\\\n             .map(lambda line: Row(date=line[0],\n                                   open=line[1],\n                                   high=line[2],\n                                   low=line[3],\n                                   close=line[4],\n                                   adj_close=line[5],\n                                   volume=line[6])).toDF()\nstock_3.show(5)<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"672\" height=\"215\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/16-3.jpg\" alt=\"\" class=\"wp-image-15973\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/16-3.jpg 672w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/16-3-300x96.jpg 300w\" sizes=\"(max-width: 672px) 100vw, 672px\" \/><\/figure>\n\n\n\n<p>Notice how I filtered out the first row from the RDD. This was done because the first row carried the column names and we didn&#8217;t want it in our values.<\/p>\n\n\n\n<a name=\"pyspark-preprocess\">\n\n\n\n<h2 class=\"wp-block-heading\">How to preprocess data with PySpark?<\/h2>\n\n\n\n<p>To preprocess data with PySpark there are several methods that depend on what you wish to do. For example, I will show you how to standardize the values for your analysis.<\/p>\n\n\n\n<p>The first thing that we will do is to convert our Adj Close values to a float type. Then we will rename the columns to make our later analysis easier and merge the two data frames.<\/p>\n\n\n\n<p>After that, we will need to convert those to a vector so that they can be passed to the standard scaler. 
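As an aside, what the standard scaler computes can be sketched in plain Python. This is an illustration, not Spark's actual implementation: with its default settings (withStd=True, withMean=False), StandardScaler divides each value by the column's sample standard deviation, without centering:

```python
import statistics

def scale_to_unit_std(values):
    # Mimic StandardScaler's defaults: divide each value by the sample
    # standard deviation of the column (no mean-centering).
    sd = statistics.stdev(values)
    return [v / sd for v in values]

print(scale_to_unit_std([2.0, 4.0, 6.0]))  # [1.0, 2.0, 3.0]
```

The scaled column therefore has unit variance, which keeps features with large price ranges from dominating the regression.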
We&#8217;ll print out the results after each step so that you can see the progression:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>from pyspark.ml.feature import StandardScaler\nfrom pyspark.ml.feature import VectorAssembler\n\ninput_1 = stock_1.select(&quot;Adj Close&quot;)\ninput_1.show(5)\ninput_2 = stock_2.select(&quot;Adj Close&quot;)\n#######################\n\ninput_1 = input_1.withColumnRenamed(&quot;Adj Close&quot;,&quot;label&quot;)\ninput_2 = input_2.withColumnRenamed(&quot;Adj Close&quot;,&quot;feature&quot;)\n\ninput_data = input_1.join(input_2)\ninput_data.show(5)<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"270\" height=\"415\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/17-3.jpg\" alt=\"\" class=\"wp-image-15974\" srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/17-3.jpg 270w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/17-3-195x300.jpg 195w\" sizes=\"(max-width: 270px) 100vw, 270px\" \/><\/figure>\n\n\n\n<p>And now for the scaling part:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>assembler = VectorAssembler(\n    inputCols=[&quot;feature&quot;],\n    outputCol=&quot;features&quot;)\n\ninput_data = assembler.transform(input_data)\n\nstandardScaler = StandardScaler(inputCol=&quot;features&quot;, outputCol=&quot;features_scaled&quot;)\nscaler = standardScaler.fit(input_data.select(&quot;features&quot;))\n\ndf = scaler.transform(input_data)\ndf.show(5)<\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"493\" height=\"198\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/18-1.jpg\" alt=\"\" class=\"wp-image-15975\" 
srcset=\"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/18-1.jpg 493w, https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2022\/07\/18-1-300x120.jpg 300w\" sizes=\"(max-width: 493px) 100vw, 493px\" \/><\/figure>\n\n\n\n<a name=\"pyspark-ml\">\n\n\n\n<h2 class=\"wp-block-heading\">How to run a Machine Learning model with PySpark?<\/h2>\n\n\n\n<p>To run a Machine Learning model in PySpark, all you need to do is import the model from the <code>pyspark.ml<\/code> library and initialize it with the parameters that you want it to have.<\/p>\n\n\n\n<p>For example, let&#8217;s create a simple linear regression model and see if the prices of stock_1 can predict the prices of stock_2. To do this, we will first split the data into train and test sets (80% and 20%, respectively).<\/p>\n\n\n\n<p>We then fit the model to the train data. This might take several minutes to complete. As Apache Spark doesn&#8217;t have all the models you might need, using <a href=\"https:\/\/algotrading101.com\/learn\/sklearn-guide\/\">Sklearn<\/a> is a good option, and it works well alongside Apache Spark.<\/p>\n\n\n\n<p>Moreover, Sklearn can sometimes speed up model fitting.<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>from pyspark.ml.regression import LinearRegression\n\ntrain_data, test_data = df.randomSplit([.8,.2], seed=42)\n\nreg = LinearRegression(labelCol=&quot;label&quot;,\\\n                       featuresCol=&quot;features_scaled&quot;, maxIter=5)\nmodel = reg.fit(train_data)<\/code><\/pre><\/div>\n\n\n\n<p>When the fitting is done, we can make predictions on the test data. Keep in mind that we won&#8217;t optimize the hyperparameters in this article. 
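<\/p>\n\n\n\n<p>One metric we will read off the fitted model is the RMSE, which is simply the square root of the mean squared difference between predictions and labels. A quick plain-Python sketch with toy numbers (not taken from this model) shows what <code>rootMeanSquaredError<\/code> reports:<\/p>

```python
# RMSE computed by hand on toy numbers, to show what
# model.summary.rootMeanSquaredError measures.
import math

preds  = [2.5, 0.0, 2.0, 8.0]   # toy predictions
labels = [3.0, -0.5, 2.0, 7.0]  # toy true values

mse = sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)
rmse = math.sqrt(mse)
print(round(rmse, 4))  # 0.6124
```

<p>A lower RMSE means the predictions sit closer to the true labels, in the same units as the label itself.<\/p>\n\n\n\n<p>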
We will zip the predictions and the true labels and print out the first five.<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code># Predict test_data\npredicted = model.transform(test_data)\n\n# Take predictions and the true label - zip them\npredictions = predicted.select(&quot;prediction&quot;).rdd.map(lambda x: x[0])\nlabels = predicted.select(&quot;label&quot;).rdd.map(lambda x: x[0])\npred_lab = predictions.zip(labels).collect()\n\n# Print out the first 5 predictions\npred_lab[:5]<\/code><\/pre><\/div>\n\n\n\n<p>Also, keep in mind that this is a very simple model that shouldn&#8217;t be used on data like this. The goal is to show you how to use the ML library. To access the model&#8217;s coefficients and useful statistics, we can do the following:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code># Model coefficients\nprint(model.coefficients)\n\n# Intercept\nprint(model.intercept)\n\n# RMSE\nprint(model.summary.rootMeanSquaredError)<\/code><\/pre><\/div>\n\n\n\n<a name=\"full-code\">\n\n\n\n<h2 class=\"wp-block-heading\">Full code<\/h2>\n\n\n\n<p><a href=\"https:\/\/www.kaggle.com\/igorradovanovic\/apache-spark-algotrading101\">Kaggle Notebook<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<div class=\"pvc_clear\"><\/div>\n<p id=\"pvc_stats_10535\" class=\"pvc_stats total_only  \" data-element-id=\"10535\" style=\"\"><i class=\"pvc-stats-icon medium\" aria-hidden=\"true\"><svg aria-hidden=\"true\" focusable=\"false\" data-prefix=\"far\" data-icon=\"chart-bar\" role=\"img\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" viewBox=\"0 0 512 512\" class=\"svg-inline--fa fa-chart-bar fa-w-16 fa-2x\"><path fill=\"currentColor\" d=\"M396.8 352h22.4c6.4 0 12.8-6.4 12.8-12.8V108.8c0-6.4-6.4-12.8-12.8-12.8h-22.4c-6.4 0-12.8 6.4-12.8 12.8v230.4c0 6.4 6.4 12.8 12.8 12.8zm-192 0h22.4c6.4 0 12.8-6.4 12.8-12.8V140.8c0-6.4-6.4-12.8-12.8-12.8h-22.4c-6.4 0-12.8 
6.4-12.8 12.8v198.4c0 6.4 6.4 12.8 12.8 12.8zm96 0h22.4c6.4 0 12.8-6.4 12.8-12.8V204.8c0-6.4-6.4-12.8-12.8-12.8h-22.4c-6.4 0-12.8 6.4-12.8 12.8v134.4c0 6.4 6.4 12.8 12.8 12.8zM496 400H48V80c0-8.84-7.16-16-16-16H16C7.16 64 0 71.16 0 80v336c0 17.67 14.33 32 32 32h464c8.84 0 16-7.16 16-16v-16c0-8.84-7.16-16-16-16zm-387.2-48h22.4c6.4 0 12.8-6.4 12.8-12.8v-70.4c0-6.4-6.4-12.8-12.8-12.8h-22.4c-6.4 0-12.8 6.4-12.8 12.8v70.4c0 6.4 6.4 12.8 12.8 12.8z\" class=\"\"><\/path><\/svg><\/i> <img decoding=\"async\" width=\"16\" height=\"16\" alt=\"Loading\" src=\"https:\/\/algotrading101.com\/learn\/wp-content\/plugins\/page-views-count\/ajax-loader-2x.gif\" border=0 \/><\/p>\n<div class=\"pvc_clear\"><\/div>\n<p>Table of Contents: What is PySpark? What is Apache Spark? What is Apache Spark used for? What is PySpark used for? Is Apache Spark free? Why should I use Apache Spark? Why shouldn\u2019t I use Apache Spark? Why should I use PySpark? Why shouldn\u2019t I use PySpark? What are some Apache Spark alternatives? What are [&hellip;]<\/p>\n","protected":false},"author":14,"featured_media":10385,"comment_status":"closed","ping_status":"open","sticky":true,"template":"","format":"standard","meta":{"_lmt_disableupdate":"no","_lmt_disable":"no","_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0},"categories":[3],"tags":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.7 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>PySpark - A Beginner&#039;s Guide to Apache Spark and Big Data - AlgoTrading101 Blog<\/title>\n<meta name=\"description\" content=\"PySpark is a Python library that serves as an interface for Apache Spark. 
Apache Spark is a computing engine that is used for big data.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/algotrading101.com\/learn\/pyspark-guide\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"PySpark - A Beginner&#039;s Guide to Apache Spark and Big Data - AlgoTrading101 Blog\" \/>\n<meta property=\"og:description\" content=\"PySpark is a Python library that serves as an interface for Apache Spark. Apache Spark is a computing engine that is used for big data.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/algotrading101.com\/learn\/pyspark-guide\/\" \/>\n<meta property=\"og:site_name\" content=\"Quantitative Trading Ideas and Guides - AlgoTrading101 Blog\" \/>\n<meta property=\"article:published_time\" content=\"2021-07-23T09:46:03+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-04-03T21:06:35+00:00\" \/>\n<meta property=\"og:image\" content=\"http:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/07\/Apache_Spark_logo.svg_.png\" \/>\n\t<meta property=\"og:image:width\" content=\"800\" \/>\n\t<meta property=\"og:image:height\" content=\"415\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Igor Radovanovic\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Igor Radovanovic\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"16 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"PySpark - A Beginner's Guide to Apache Spark and Big Data - AlgoTrading101 Blog","description":"PySpark is a Python library that serves as an interface for Apache Spark. 
Apache Spark is a computing engine that is used for big data.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/algotrading101.com\/learn\/pyspark-guide\/","og_locale":"en_US","og_type":"article","og_title":"PySpark - A Beginner's Guide to Apache Spark and Big Data - AlgoTrading101 Blog","og_description":"PySpark is a Python library that serves as an interface for Apache Spark. Apache Spark is a computing engine that is used for big data.","og_url":"https:\/\/algotrading101.com\/learn\/pyspark-guide\/","og_site_name":"Quantitative Trading Ideas and Guides - AlgoTrading101 Blog","article_published_time":"2021-07-23T09:46:03+00:00","article_modified_time":"2023-04-03T21:06:35+00:00","og_image":[{"width":800,"height":415,"url":"http:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2021\/07\/Apache_Spark_logo.svg_.png","type":"image\/png"}],"author":"Igor Radovanovic","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Igor Radovanovic","Est. 
reading time":"16 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/algotrading101.com\/learn\/pyspark-guide\/#article","isPartOf":{"@id":"https:\/\/algotrading101.com\/learn\/pyspark-guide\/"},"author":{"name":"Igor Radovanovic","@id":"https:\/\/algotrading101.com\/learn\/#\/schema\/person\/a7ae60c112a73b7c3fe14ac56726a0ae"},"headline":"PySpark &#8211; A Beginner&#8217;s Guide to Apache Spark and Big Data","datePublished":"2021-07-23T09:46:03+00:00","dateModified":"2023-04-03T21:06:35+00:00","mainEntityOfPage":{"@id":"https:\/\/algotrading101.com\/learn\/pyspark-guide\/"},"wordCount":2468,"publisher":{"@id":"https:\/\/algotrading101.com\/learn\/#organization"},"articleSection":["Programming"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/algotrading101.com\/learn\/pyspark-guide\/","url":"https:\/\/algotrading101.com\/learn\/pyspark-guide\/","name":"PySpark - A Beginner's Guide to Apache Spark and Big Data - AlgoTrading101 Blog","isPartOf":{"@id":"https:\/\/algotrading101.com\/learn\/#website"},"datePublished":"2021-07-23T09:46:03+00:00","dateModified":"2023-04-03T21:06:35+00:00","description":"PySpark is a Python library that serves as an interface for Apache Spark. 
Apache Spark is a computing engine that is used for big data.","breadcrumb":{"@id":"https:\/\/algotrading101.com\/learn\/pyspark-guide\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/algotrading101.com\/learn\/pyspark-guide\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/algotrading101.com\/learn\/pyspark-guide\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/algotrading101.com\/learn\/"},{"@type":"ListItem","position":2,"name":"PySpark &#8211; A Beginner&#8217;s Guide to Apache Spark and Big Data"}]},{"@type":"WebSite","@id":"https:\/\/algotrading101.com\/learn\/#website","url":"https:\/\/algotrading101.com\/learn\/","name":"Quantitative Trading Ideas and Guides - AlgoTrading101 Blog","description":"Authentic Stories about Algorithmic trading, coding and life.","publisher":{"@id":"https:\/\/algotrading101.com\/learn\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/algotrading101.com\/learn\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/algotrading101.com\/learn\/#organization","name":"AlgoTrading101","url":"https:\/\/algotrading101.com\/learn\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/algotrading101.com\/learn\/#\/schema\/logo\/image\/","url":"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2020\/11\/AlgoTrading101-Lucas-Liew.jpg","contentUrl":"https:\/\/algotrading101.com\/learn\/wp-content\/uploads\/2020\/11\/AlgoTrading101-Lucas-Liew.jpg","width":1200,"height":627,"caption":"AlgoTrading101"},"image":{"@id":"https:\/\/algotrading101.com\/learn\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/algotrading101.com\/learn\/#\/schema\/person\/a7ae60c112a73b7c3fe14ac56726a0ae","name":"Igor 
Radovanovic","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/algotrading101.com\/learn\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/d46175c509b3ee240a1e2bbe735a4d1e?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/d46175c509b3ee240a1e2bbe735a4d1e?s=96&d=mm&r=g","caption":"Igor Radovanovic"},"sameAs":["https:\/\/igorradovanovic.com","https:\/\/www.linkedin.com\/in\/igor-radovanovic-profile"],"url":"https:\/\/algotrading101.com\/learn\/author\/igor\/"}]}},"modified_by":"Igor Radovanovic","_links":{"self":[{"href":"https:\/\/algotrading101.com\/learn\/wp-json\/wp\/v2\/posts\/10535"}],"collection":[{"href":"https:\/\/algotrading101.com\/learn\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/algotrading101.com\/learn\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/algotrading101.com\/learn\/wp-json\/wp\/v2\/users\/14"}],"replies":[{"embeddable":true,"href":"https:\/\/algotrading101.com\/learn\/wp-json\/wp\/v2\/comments?post=10535"}],"version-history":[{"count":4,"href":"https:\/\/algotrading101.com\/learn\/wp-json\/wp\/v2\/posts\/10535\/revisions"}],"predecessor-version":[{"id":21150,"href":"https:\/\/algotrading101.com\/learn\/wp-json\/wp\/v2\/posts\/10535\/revisions\/21150"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/algotrading101.com\/learn\/wp-json\/wp\/v2\/media\/10385"}],"wp:attachment":[{"href":"https:\/\/algotrading101.com\/learn\/wp-json\/wp\/v2\/media?parent=10535"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/algotrading101.com\/learn\/wp-json\/wp\/v2\/categories?post=10535"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/algotrading101.com\/learn\/wp-json\/wp\/v2\/tags?post=10535"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}