Google Colab – A Step-by-step Guide

Last Updated on May 14, 2021

Table of contents:

  1. What is Google Colab?
  2. Is Google Colab free?
  3. Why should I use Google Colab?
  4. Why shouldn’t I use Google Colab?
  5. What are the alternatives to Google Colab?
  6. Does Google Colab support Python?
  7. How do I get started with Google Colab?
  8. How do I import libraries/install dependencies in Google Colab?
  9. How do I enable GPU/TPU usage in Google Colab?
  10. How do I import data in Google Colab?
  11. Using Machine Learning in Google Colab to predict house prices
  12. How can I load Kaggle datasets directly into Google Colab?
  13. How can I visualize data/produce charts using Google Colab?
  14. How can I deploy ML algorithms in Google Colab?
    1. Prepare the Data
    2. Pick an Algorithm and evaluate it
    3. Optimize the Algorithm
  15. How can I save my Google Colab notebook directly to GitHub?
  16. How can I mount external Python files in Google Colab?
  17. What are Google Colab Magics?
  18. What are some other interesting Google Colab features?
  19. What are the 3 Common Machine Learning Analysis/Testing Mistakes?
  20. Full code

What is Google Colab?

Google Colab is an online notebook-like coding environment that is well-suited for machine learning and data analysis.

It comes equipped with many Machine Learning libraries and offers GPU usage. It is mainly used by data scientists and ML engineers.

Is Google Colab free?

Yes, Google Colab is free to use and you can access all of its features to a certain degree. There is a subscription plan called Google Colab Pro that offers access to upgraded features.

These upgraded features allow the use of more processing power, RAM, and memory. You can access this plan for $9.99 per month if you come from one of the following countries:

  • US
  • Canada
  • UK
  • Germany
  • France
  • India
  • Japan
  • Thailand
  • Brazil

Why should I use Google Colab?

  • Google Colab is Free
  • Easy to get started
  • Allows access to GPUs/TPUs
  • Easy to share code with others
  • Easy graphical visualizations in Notebooks

Let’s go over each of the pros of Google Colab.

Firstly, it is free to use and everyone can access it. There are also some premium features if you want to utilize GPUs/TPUs with more power and fewer limitations.

Getting started with Google Colab is easy. You don’t need to install any prerequisites or have a decent PC or laptop. All you need is a browser where you’ll get a Jupyter Notebook-like environment.

Google Colab comes ready with GPUs and TPUs which can be utilized with a click of a single button. This makes Google Colab a great coding environment for machine learning practitioners.

Sharing code with Google Colab can be done through Google Drive or directly to GitHub with an in-built feature.

Being like a Jupyter Notebook, a Google Colab document allows you to run code in blocks, and intersperse these blocks with Markdown cells. It can also easily display multiple graphical outputs.

All of these features make Google Colab a phenomenal asset when it comes to collaborative data science, machine learning, and data analysis projects.

Why shouldn’t I use Google Colab?

  • GPU/TPU usage is limited
  • Not the most powerful GPU/TPU setups available
  • Not the best de-bugging environment
  • It is hard to work with big data
  • Have to re-install extra dependencies every new runtime

GPU/TPU usage is not endless with Google Colab as resources aren’t infinite. The free version lasts for 12 hours of continuous usage and is not very tolerant of inactivity, whilst the Pro version allows 24 hours of continuous usage with greater tolerance.

The free version of Google Colab allows the usage of a K80 GPU while the Pro version allows a T4 or P100 GPU. For most of you, these GPUs are way more powerful than the ones you have, but for more money we can get even better ones (e.g. AWS).

Being a Notebook environment, it will be harder to catch bugs in your code before running it.

As big datasets need to fit in Google Drive, it can be difficult to deal with them because you are limited to 15 GB of free space with a Gmail account.

Lastly, you’ll have to (re)install any additional libraries you want to use every time you (re)connect to a Google Colab notebook. A good thing is that it comes equipped with pre-installed libraries that are often used.

What are the alternatives to Google Colab?

Google Colab can be replaced with other platforms that can be more suitable for your needs. Here are some of them:

  • Jupyter Notebook
  • Kaggle
  • Azure Notebooks
  • Amazon SageMaker
  • Paperspace Gradient
  • FloydHub

Does Google Colab support Python?

Yes, Google Colab supports Python (and as of October 2019 only allows the creation of Python 3 notebooks), though in some cases with further tinkering it might be possible to get R, Swift, or Julia to work.

Keep in mind that since January 1, 2020, Python 2 is no longer supported.

How do I get started with Google Colab?

There are several ways to get started with Google Colab and we will go over each of them. All approaches are quite easy and it depends on what you want to start working on (i.e. fresh notebook or GitHub repository).

The first way is to go over to your Google Drive account. In the top left corner select “New”, then “More” in the drop-down panel, and then “Google Colaboratory”.

To open an existing Google Colab document simply right click on it –> Open With –> Google Colaboratory. You can also load other people’s Google Colab documents if you share a Google Drive with them.

To import/open files directly from GitHub you will need the Open in Colab Chrome extension. Add it to Chrome, then navigate over to the notebook you want to open on GitHub, click on your browser’s Extensions tab, then click Open in Colab.

As the extension is new and is still being worked on I’d advise waiting a bit for it to get polished.

Another way is to go to this link and click the “New Notebook” button. You can also see that you have access to Google Colab examples, Recent notebooks, Google Drive, GitHub, and you can upload your own notebook.

After opening up a new Notebook you can do a couple of things with it. The first thing is to give it a name in the upper left corner. In the upper right corner, you can click on the Settings icon.

When in settings you can change your theme to dark, set your editor key bindings and colors, change the font size, and more. Be sure to customize these features so they suit your preferences.

Now, let us get acquainted with some of the most used shortcuts so we can save ourselves some time (for Mac users CTRL == Command):

  • Command Palette – Ctrl+Shift+P
  • Add a comment – Ctrl+Alt+M
  • Convert to text cell – Ctrl+M M
  • Add a new cell below – Ctrl+M B
  • Run all cells – Ctrl+F9
  • Run the current cell – Ctrl+Enter
  • Save Notebook – Ctrl+S
  • Show keyboard shortcuts – Ctrl+M H

All shortcuts can be edited to suit your needs.

On the left taskbar, you can view your Notebook’s table of contents that shows all the Markdown headers in a structured way, useful code snippets, files, and a search and replace tool.

To start coding, in the upper right side you may see the connect button so be sure to click it. When connected you will see something like this:

Now that we know how to use some of the main features of Google Colab, we are ready to start working on a problem. I believe that this is the best way to get acquainted with new environments and learn some new things as a bonus.

How do I import libraries/install dependencies in Google Colab?

Importing libraries and installing dependencies in Google Colab is quite easy. You need to use your usual !pip install and import commands followed by the libraries/dependencies name.

A great thing about Google Colab is that it comes with many preinstalled dependencies that are often used.

Any installations only remain for the duration of your session, so if you close the session/notebook, you’ll have to run inline installations whenever you open your project again.

You can check which version of a library you’re using with !pip show. For example, to check which version of TensorFlow you are using you would use !pip show tensorflow

To upgrade an already installed library to the latest version, use !pip install --upgrade tensorflow

And finally to install a specific version, use !pip install tensorflow==1.2
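
If you would rather check a version from Python code than read the !pip show output, the standard library’s importlib.metadata module can look it up. This is only a small sketch; the helper name installed_version is made up for this example and is not part of Google Colab:

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string where TensorFlow is installed, otherwise None
print(installed_version("tensorflow"))
```

This can be handy at the top of a notebook to decide whether an inline install is needed at all.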

How do I enable GPU/TPU usage in Google Colab?

All you have to do in Google Colab to enable a GPU or TPU is head over to the “Runtime” section, select “Change runtime type” and select either GPU or TPU.

Due to TPU’s specialist nature, there are some best practices you can use to help optimize your data flow to utilize them to their fullest potential. The “TPUs in Colab” section of the Google Colab docs highlights some of these.
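
After changing the runtime type, it can be worth confirming that a GPU is actually attached. One possible check, assuming an NVIDIA GPU and the nvidia-smi tool that ships with its driver, is sketched below; the function name gpu_names is hypothetical:

```python
import shutil
import subprocess


def gpu_names():
    """Return a list of GPU names reported by nvidia-smi (empty if none)."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA driver tools found, so no GPU runtime is attached
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


print(gpu_names() or "No GPU detected - check Runtime > Change runtime type")
```

This sketch does not cover TPUs, which are not reported by nvidia-smi.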

How do I import data in Google Colab?

There are several ways to import data with Google Colab from a Google Drive, including mounting your Google Drive in the Colab notebook’s runtime’s virtual machine, using PyDrive, and using a native REST API.

We’ll go over how to mount your Google Drive quickly here, but you can learn about how to use the other methods (and other data loading/saving options) here.

To mount your drive, simply run the following code:

from google.colab import drive
drive.mount('/content/drive')

You’ll be given a link to authorize this action via a code output:

After clicking the link, it will take you to the following screen where you will click allow.

After that, an authorization code will appear. Copy it, paste it into the cell, press Enter, and that’s it.

Now for instance you could open a text document in your drive, write some text in it and save it:

with open('/content/drive/My Drive/foo.txt', 'w') as f:
  f.write('Hello Google Drive!')
!cat /content/drive/My\ Drive/foo.txt

drive.flush_and_unmount()

For machine learning work, we will usually be loading tabular data from csv or xlsx files into pandas data frames or loading image data into arrays, so let’s quickly cover how to do exactly those things with Google Colab!

Let’s assume we’ve already mounted our drive as shown just above. Upload the csv/xlsx file you want to use onto Google Drive, then browse for its location.

The default path to your drive is '/content/drive/My Drive/', and our file that we used in another article is directly in the files section- not in any further folders- so its path is '/content/drive/My Drive/Name'.

We can now simply use the pandas read_csv function to load the file directly into a pandas data frame:

import pandas as pd

path = "/content/drive/My Drive/Name"

df = pd.read_csv(path)

df #displays dataframe

You can also upload a file directly from your computer using the following code:

from google.colab import files
uploaded = files.upload()

Click on Choose Files, browse to your desired file and open it.

Finally, we can use the BytesIO function from the io module to stream the data into a pandas data frame:

import io

df2 = pd.read_csv(io.BytesIO(uploaded['reddit_wsb.csv']))
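
To make the step above less magical: files.upload() returns a dictionary whose keys are the uploaded file names and whose values are the raw bytes of each file. The sketch below imitates that dictionary by hand (the file name example.csv and its contents are invented for illustration) so the loading logic can be tried outside Colab too:

```python
import io

import pandas as pd

# Stand-in for the dictionary returned by files.upload():
# keys are file names, values are the raw bytes of each file.
uploaded = {"example.csv": b"ticker,score\nGME,10\nAMC,7\n"}

# Load every uploaded CSV into its own data frame, keyed by file name
frames = {name: pd.read_csv(io.BytesIO(data)) for name, data in uploaded.items()}

print(frames["example.csv"])
```

Looping over the dictionary this way also works when several files are selected in the upload dialog.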

How can I load Kaggle datasets directly into Google Colab?

As the dataset that we’ll use comes from Kaggle, we can load it directly into Google Colab without downloading it manually or using one of the methods mentioned above.

For this, we will need to create a Kaggle API token. Go to your Kaggle account details and scroll down to the API section. When there click the “Create New API Token” button and a Kaggle JSON file will be downloaded.

Now we go back to our notebooks and import the required libraries to load the dataset:

#pip install kaggle - Should come preinstalled
import pandas as pd
from google.colab import files

Now let’s upload our downloaded kaggle.json file and check if it is in the right place:

files.upload()
!ls -lha kaggle.json

The next thing that we need to do is to set the file configuration:

# The Kaggle API client expects this file to be in ~/.kaggle,
# and we will move it there
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

# This permissions change avoids a warning on Kaggle tool startup.
!chmod 600 ~/.kaggle/kaggle.json
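
The shell commands above can also be written in plain Python, which may be convenient if you want to reuse the setup in a script. This is a sketch under assumptions: the function name install_kaggle_token and its home argument are invented here, and only the target location ~/.kaggle/kaggle.json comes from the steps above.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path, home=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner."""
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)   # like: mkdir -p ~/.kaggle
    target = target_dir / "kaggle.json"
    shutil.copyfile(token_path, target)             # like: cp kaggle.json ~/.kaggle/
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)       # like: chmod 600
    return target


# Typical call after uploading the token to the working directory:
# install_kaggle_token("kaggle.json")
```

The optional home argument only exists so the function can be tried against a scratch folder instead of your real home directory.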

Now we are ready to download our dataset directly from Kaggle. Go to our dataset on Kaggle and Copy the API command as shown below:

Then paste it as a new command and Google Colab will download it:

!kaggle datasets download -d camnugent/california-housing-prices

Now we unzip the file and remove the zip:

# Unzip the data and delete the zip
!unzip california-housing-prices.zip  && rm california-housing-prices.zip

If you click the Files icon on the toolbar on the left, you will see our downloaded housing dataset.

Let’s see the dataset:

df = pd.read_csv('/content/housing.csv')
df.head()

How can I visualize data/produce charts using Google Colab?

Google Colab is similar to Jupyter Notebooks so you can instantly see your graphs after running the graphing command. The most used graphing libraries are matplotlib, seaborn, ggplot, plotly, and more.

Now that we have our dataset we want to explore it by conducting an Exploratory Data Analysis (EDA). Let’s take a quick glance at the values it has.

df.info()

We can already see that the variable total_bedrooms has some missing values. We also see that all variables are numerical except for the ocean_proximity one. We will take care of it later but let’s see what it has:

df['ocean_proximity'].value_counts()

For a quick glance at the numerical variables, we can use the pandas describe function as shown:

df.describe()

Now let’s graph these variables with matplotlib. We will first check out the histogram:

import matplotlib.pyplot as plt

df.hist(bins=50, figsize=(15,10))
plt.show()

The first thing that we can see from the median_income histogram is that the values were preprocessed, in this case they were scaled. After checking out the information behind the data I’ve uncovered that each number is expressed in tens of thousands of dollars (e.g. 5 ≈ $50,000).

We also see that most of our distributions are quite skewed: the bulk of the values sits on the left side, with a long tail stretching to the right. Also, our variables have different scales and we will deal with this issue later on.

Here is a tip: Before performing an EDA be sure to split the data into train and test sets to avoid data snooping bias. Data snooping is the inappropriate use of data mining to uncover misleading relationships in data.

So let’s split our data into train and test sets but with a twist. As median income is a good predictor of the house value, we want to split our data in a way that will be representative of the median income strata.

If you check out the median income histogram you will see that most of the data is between 1.5 and 6 but it goes beyond 6 too. If the sets don’t contain enough instances of each stratum the model might be biased towards one.

In order to combat this, we will use the pd.cut function to stratify the data by 1.5 increments.

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# As we want the stratums of data income to be representative we will split the data by them
# But first, we need to create these stratums
df['income_stratums'] = pd.cut(df['median_income'],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

df['income_stratums'].hist()
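
To see what pd.cut does with these bin edges, here is a tiny sketch on a handful of made-up income values (they are not taken from the housing dataset):

```python
import numpy as np
import pandas as pd

incomes = pd.Series([0.5, 1.5, 2.0, 4.6, 7.2, 15.0])

# Bins are right-inclusive by default: (0, 1.5], (1.5, 3], ..., (6, inf]
strata = pd.cut(incomes,
                bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                labels=[1, 2, 3, 4, 5])

print(strata.tolist())  # [1, 1, 2, 4, 5, 5]
```

Note that a value sitting exactly on an edge, such as 1.5, falls into the lower stratum, and that everything above 6 ends up in stratum 5.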

Now we split the data by the strata and then delete the helper column:

# Split the data by stratums
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(df, df['income_stratums']):
  train_set = df.loc[train_index]
  test_set = df.loc[test_index]

# Delete the income_stratums column
for stratum in (train_set, test_set):
  stratum.drop('income_stratums', axis=1, inplace=True)
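
A quick way to convince yourself that a stratified split keeps the strata representative is to compare their shares before and after splitting. The sketch below does this on synthetic stand-in data (the column names and group sizes are invented), not on the housing dataset itself:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in data: 1,000 rows with unevenly sized strata
demo = pd.DataFrame({
    "stratum": np.repeat([1, 2, 3, 4, 5], [100, 300, 400, 150, 50]),
    "value": np.arange(1000),
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(demo, demo["stratum"]))

# Share of each stratum in the full data, the train set, and the test set
proportions = pd.DataFrame({
    "overall": demo["stratum"].value_counts(normalize=True),
    "train": demo.loc[train_idx, "stratum"].value_counts(normalize=True),
    "test": demo.loc[test_idx, "stratum"].value_counts(normalize=True),
}).sort_index()

print(proportions)
```

The three columns should be nearly identical, which is what a purely random split cannot guarantee for small strata.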

The next thing is to plot the districts by longitude and latitude so that the graph shows us the housing prices. The radius of each circle will show the population of a district and the color will represent the price.

train_set.plot(kind='scatter', x='longitude', y='latitude', figsize=(10,10),
               s=train_set['population']/100, label='population',
               c='median_house_value', cmap='rainbow',
               colorbar=True)
plt.legend()

As we can see, the high-density areas are the Bay Area and around San Diego and Los Angeles. Also, there is density in the Central Valley around Fresno and Sacramento.

The housing prices are also correlated with the density area as one could expect. Also, houses closer to the sea tend to be more expensive. When speaking of correlations, we should check them out.

But before we create a correlation matrix, we should check whether our features make sense, i.e. whether we could make them more informative.

# Before looking at correlations we might want to create new features 
# that make more sense
# Let's look at the variables that we have
test_set

For example, three features can be combined to be more informative. Try to find them. If you’re thinking of creating bedrooms per room, population per household, and rooms per household you were correct.

# Create new features
df['bedrooms_per_room'] = df['total_bedrooms']/df['total_rooms']
df['population_per_household'] = df['population']/df['households']
df['rooms_per_household'] = df['total_rooms']/df['households']

# Check for correlations (numeric columns only, as ocean_proximity is text)
correlations = df.corr(numeric_only=True)
correlations['median_house_value'].sort_values(ascending=False)

We can see that our rooms_per_household is more correlated with the label than the total_rooms or households. Also, the bedrooms_per_room is more correlated with the label than its parent variables.

To interpret some of them, we see that the higher the median income is – the higher are the house prices. The lower the bedrooms per room ratio is – the higher the prices get.

Let’s see how the top 4 variables by correlation look when correlated to each other:

variables =['median_house_value', 'median_income',
            'bedrooms_per_room', 'rooms_per_household']
pd.plotting.scatter_matrix(df[variables], figsize=(12, 10))

Okay, here we can see how the variables relate to each other. If you look at median_income and median_house_value you might notice that our prices are capped at $500k.

You can also notice that data tends to group in a few horizontal lines around $450k, $350k, and $280k. These are the things one should take care of before passing the data to the algorithm as the algorithm might learn these occurrences.

How can I deploy ML algorithms in Google Colab?

Machine Learning algorithms can be used in Google Colab the same way you use them in any other coding environment. Google Colab also comes with preinstalled ML libraries like TensorFlow and scikit-learn.

Keep in mind that we will go quickly through the following sub-sections and that you should check out our Sklearn Introduction article if you get stuck on a certain point.

Prepare the Data

Before we pick a few ML models and deploy them, we want to prepare (preprocess) our data to be ready for the algorithms. So let’s do that.

We already know that we have one categorical feature (ocean_proximity) and the rest are numerical. As they are different from each other, they will be preprocessed in different ways.

Firstly, let’s split our train set into two parts where one will contain the label.

# Split Train Set
housing_features = train_set.drop('median_house_value', axis=1)
housing_label = train_set['median_house_value'].copy()

Now as we want to automate the process of data preparation, we will split the housing_features into numerical and categorical. After that, we will create our own functions and sklearn pipelines to process them.

# Split housing_features to categorical and numerical sets
numerical = housing_features.drop('ocean_proximity', axis=1)
categorical = housing_features['ocean_proximity'].copy()

Now we will create a sklearn numerical pipeline that:

  • Imputes missing data by the median value
  • Creates new features
  • Standardizes the numerical features

The first thing we want to do is to import our dependencies and create a custom function that will create new features (the ones we created before). The function should also allow us to choose which features to include so we can test them.

from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Create a transformer that creates new features (Inspired by Aurelien Geron)
# Column positions of the parent features in the numerical array
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6
class FeatureGenerator(BaseEstimator, TransformerMixin):
  def __init__(self, add_bedrooms_per_room=True):
    self.add_bedrooms_per_room = add_bedrooms_per_room
  def fit(self, X, y=None):
    return self  # nothing to learn from the data
  def transform(self, X, y=None):
    rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
    population_per_household = X[:, population_ix] / X[:, households_ix]
    if self.add_bedrooms_per_room:
      bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
      return np.c_[X, rooms_per_household, population_per_household, 
                   bedrooms_per_room]
    else:
      return np.c_[X, rooms_per_household, population_per_household]
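
The core trick inside transform is to compute ratios column by column and glue them onto the array with np.c_. Here is a minimal sketch of just that step, using two made-up districts and a simplified column layout (not the column positions used by the class above):

```python
import numpy as np

# Two made-up districts: columns are total_rooms, total_bedrooms, households
X = np.array([[600.0, 120.0, 100.0],
              [900.0, 300.0, 150.0]])

rooms_per_household = X[:, 0] / X[:, 2]
bedrooms_per_room = X[:, 1] / X[:, 0]

# np.c_ stacks the 1-D ratio arrays as extra columns on the right of X
X_extended = np.c_[X, rooms_per_household, bedrooms_per_room]

print(X_extended.shape)  # (2, 5)
```

Because the transformer works on plain NumPy arrays, it relies on column positions rather than column names, which is why the index constants have to match the order of the numerical columns.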

Now we can create our full numerical pipeline:

# Numerical pipeline
numerical_pipe = Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('feature_generator', FeatureGenerator()),
            ('standardizer', StandardScaler())
            ])

The next step is to create a categorical pipeline that will perform One Hot Encoding on the ocean_proximity feature. We will combine the two pipelines into a single one and run all features through it.

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

num_features = list(numerical)
cat_features = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
        ("num", numerical_pipe, num_features),
        ("cat", OneHotEncoder(), cat_features),
    ])

# Run all features through the pipeline
housing = full_pipeline.fit_transform(housing_features)
housing.shape

(16512, 16)

It comes quite in handy to preprocess your data all at once. And you can easily save your pipelines and functions to use for later. When doing ML you will see that you’ll soon start creating a list of your custom functions.
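
One common way to save a fitted pipeline for later is joblib, which is installed alongside scikit-learn. The sketch below uses a small stand-in pipeline and toy data (both invented for this example) and writes to a temporary folder; on Colab you would likely point the path at your mounted Drive so the file survives the session:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline and data (one missing value to impute)
demo_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("standardizer", StandardScaler()),
])
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
expected = demo_pipe.fit_transform(X)

# Save the fitted pipeline to disk, then load it back
path = os.path.join(tempfile.mkdtemp(), "demo_pipeline.joblib")
joblib.dump(demo_pipe, path)
restored = joblib.load(path)

print(np.allclose(restored.transform(X), expected))  # True
```

If a pipeline contains a custom class such as FeatureGenerator, that class has to be defined or importable in the notebook that loads the file.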

Pick an Algorithm and evaluate it

As this is a supervised regression task, we will pick a regression model. When doing ML it is advised to pick multiple algorithms and compare them to pick the best one.

For the brevity of the article, we will go for a single one which is the Random Forest Regressor. For practice, you can try to build a multiple algorithm pipeline that runs them and prints the comparisons.

We will also check for the RMSE which basically shows us the discrepancy between the predicted and observed values.
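
To make the metric concrete: RMSE is the square root of the average squared difference between observed and predicted values. A minimal hand-rolled sketch with two made-up houses (scikit-learn’s own functions would normally be used instead):

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors of 10,000 and 30,000 dollars on two made-up houses
print(rmse([200_000, 300_000], [210_000, 270_000]))  # about 22360.68
```

Because the errors are squared before averaging, the larger miss dominates the result, and the final value is in the same unit as the label (dollars here).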

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Fit a Random Forest on the prepared training data
clf = RandomForestRegressor(random_state=42)
clf.fit(housing, housing_label)

# RMSE on the same data the model was trained on (an optimistic estimate)
predictions = clf.predict(housing)
rmse = np.sqrt(mean_squared_error(housing_label, predictions))
rmse

18728.778

Not bad. But we are likely overfitting our data. The thing that we want to do next is to optimize the algorithm.

Optimize the Algorithm

In order to optimize the machine learning algorithm, we will perform a randomized search of specified hyperparameters. The search will look for optimal ones that we should use for our model.

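from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

hyperparameters = {
        'n_estimators': randint(low=1, high=250),
        'max_features': randint(low=1, high=10),
    }

# The search clones this estimator for every hyperparameter combination it tries
forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=hyperparameters,
                                n_iter=15, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing, housing_label)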

Let’s see our errors and pick the hyperparameters that give us the lowest value:

cv = rnd_search.cv_results_
for mean_score, params in zip(cv["mean_test_score"], cv["params"]):
    print(np.sqrt(-mean_score), params)
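
Instead of reading the lowest error off the printed list, the fitted search object also exposes the winner through its best_params_ and best_score_ attributes. The sketch below shows this on a small synthetic regression problem (a stand-in for the housing data, with deliberately small search ranges so it runs quickly):

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem as a stand-in for the housing data
X, y = make_regression(n_samples=120, n_features=3, noise=5.0, random_state=0)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions={
        "n_estimators": randint(low=5, high=20),
        "max_features": randint(low=1, high=4),
    },
    n_iter=3, cv=3, scoring="neg_mean_squared_error", random_state=42,
)
search.fit(X, y)

# best_score_ is the highest (least negative) mean test score,
# so its negated square root is the lowest cross-validated RMSE
print(search.best_params_, np.sqrt(-search.best_score_))
```

The score is negated because scikit-learn always maximizes its scoring function, so a mean squared error has to be reported as a negative number.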

It seems that the best combination is 8 max features and 189 estimators. There are other things we could do, like running a grid search to see which features work best, but that won’t be our focus for this article.

Let’s see how the model performs on the test set:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

model = RandomForestRegressor(n_estimators=189, max_features=8, random_state=42)

X_test = test_set.drop("median_house_value", axis=1)
y_test = test_set["median_house_value"].copy()

X_test = full_pipeline.transform(X_test)
model.fit(housing, housing_label)
predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
rmse

46879.44

That’s it. Feel free to play with other regression models and see how they behave while we move on to our next header.

How can I save my Google Colab notebook directly to GitHub?

To save your Google Colab notebook to GitHub you will go to the “File” section and select “Save a copy in GitHub”. After that, a pop-up screen will ask for your authorization. And you will then do the usual repository stuff.

How can I mount external Python files in Google Colab?

Suppose you have some Python code stored in your Google Drive and you want to run it in Google Colab with their GPU/TPU. To mount the external file write the following command:

from google.colab import drive
drive.mount('/content/drive')

You will be provided with a URL that will take you to a new tab to give permission to Google Drive. After you allow access to Google Drive you will be given an authorization code to enter in your code cell.

To list the contents of your drive run the following command:

!ls "/content/drive/My Drive/Colab Notebooks"

To run a specific file, for example hello.py, write the following:

!python3 "/content/drive/My Drive/Colab Notebooks/hello.py"
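
If you want to call functions from such a file inside your notebook rather than run it as a script, one option is to add its folder to sys.path and import it. The sketch below uses a temporary folder and an invented module called hello_helpers as a stand-in for a folder on your mounted Drive:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for a Drive folder such as "/content/drive/My Drive/Colab Notebooks"
module_dir = Path(tempfile.mkdtemp())
(module_dir / "hello_helpers.py").write_text(
    "def greet(name):\n"
    "    return f'Hello, {name}!'\n"
)

# Make the folder importable, then import the file like any other module
sys.path.append(str(module_dir))
hello_helpers = importlib.import_module("hello_helpers")

print(hello_helpers.greet("Colab"))  # Hello, Colab!
```

On Colab, the only lines you would typically need are the sys.path.append call with your Drive folder and a regular import statement.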

What are Google Colab Magics?

Google Colab Magics are a set of system commands that can be seen as a mini command language. There are two types of magics: line magics and cell magics.

The line magics start with %, while the cell magics start with %%. To see a full list of available magics run the following command:

%lsmagic

Now let’s run a line magic that will show you your local directory:

%ldir

And a cell magic:

%%html
<marquee style='width: 50%; color: Red;'>Welcome to Algotrading101!</marquee>

What are some other interesting Google Colab features?

Google Colab has other interesting features like markdown that shows nice mathematical equations, custom widgets, forms, and more. To take a look at these features go here.

What are the 3 Common Machine Learning Analysis/Testing Mistakes?

When you run your analysis, there are 3 common mistakes to take note of:

  • Overfitting
  • Look-ahead Bias
  • P-hacking

Do check out this lecture PDF to learn more: 3 Big Mistakes of Backtesting – 1) Overfitting 2) Look-Ahead Bias 3) P-Hacking

Full Code

GitHub Link

Igor Radovanovic