Web Scraping Tutorial – Reddit Data for Finance

Last Updated on June 24, 2020

Wallstreetbets Homepage Screenshot
r/wallstreetbets homepage. They have over 1 million official users and millions more lurking.

How to use Reddit data for Finance 

Reddit is an online discussion forum with many dedicated and smart users, which makes it a great place to generate ideas. We can use the information in thread posts to understand which stocks are being talked about most and which are potentially being bought and sold.

Every trading day the subreddit r/wallstreetbets has a thread called ‘Daily Discussion’ where users talk about which stocks they are buying and selling and what they’re hoping to do in the stock market. This presents a great opportunity to see which stocks are being talked about and what may be potential buys or sells.

The data we want to scrape 

We will be scraping data from the previous day’s Daily Discussion thread (most threads have had 30–60k comments daily!). By then, most of the people interested in posting will have done so.

What are the Challenges of scraping Reddit ? 

Reddit is a well-structured website and is relatively user friendly when it comes to web scraping. The challenges we have to tackle are the following:

  1. The need to use browser automation to grab data from the Reddit website
  2. Browsing large threads within Reddit requires multiple clicks to get to the comments
  3. The Reddit API will only allow a certain number of requests per minute
  4. Grabbing up to 60,000 comments to analyse

» Try using web-scraped data to complement your fundamental investing. Here’s a new buzzword: Quantamental (Quantitative + Fundamental). Check out this article to learn more: What is Quantamental? 3 Techniques to Investing

Overview of the tasks 

  1. Generate a list of stock tickers (hard work done for you!)
  2. Grab the previous day’s ‘Daily Discussion’ thread link
  3. Interact with the Pushshift API to generate a list of all comment ids
  4. Grab the text of each comment
  5. Compare the stock ticker list with the comment text
  6. Output information to a CSV file
  7. Output information to Google Sheets

So with that, let’s tackle these one at a time. 

Generating the Stock ticker list

The US market has over 1476 companies on some form of stock exchange. I created a simple program to scrape these tickers from an online platform (Sharepad), but you can grab the text file here so most of the work is done for you.

You may ask why we are doing this. When dealing with a large set of data, it can be useful to store it in a text file and then read that file to build the list. Having a list of 1400 stock tickers written out in the program would certainly make the code much bulkier!

Getting data from Reddit 

If we search for Daily Discussion using the Reddit search function, we come across this:

Reddit Search function 

When doing any web scraping project it is important to get used to the website you want to get data from.
The thread we are interested in is actually always the 2nd item on the list. Searching for ‘Daily Discussion Thread’ instead unfortunately gives us the results in the image you see below.

The next challenge for web scraping Reddit is that it has a lot of interactive elements on the page, which invariably means JavaScript is being used. The simplest way to get information from a website is a package called ‘Requests’, which allows us to make HTTP requests to Reddit’s servers and get back the HTML code that we want.

Let’s take a look at what happens.

import requests
url = 'https://www.reddit.com/r/wallstreetbets/search/?q=flair%3A%22Daily%20Discussion%22&restrict_sr=1&sort=new'
requests.get(url)

Output:

<Response [200]>

Awesome! We have the response we want. For those not used to HTTP status codes, the code 200 means the server handled our request successfully. You can look up the status codes here for further details.
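As a quick illustration of reading status codes, here is a minimal sketch that uses Python’s standard http.HTTPStatus to turn a numeric code into its standard reason phrase. The helper name describe_status is made up for this example and is not part of Requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    return f"{status.value} {status.phrase}"


# A few codes you are likely to meet while scraping
for code in (200, 404, 429):
    print(describe_status(code))
```

With Requests, the same number is available as the status_code attribute of the response object, so it could be passed straight into a helper like this.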

But let’s look deeper at this. We can use the text attribute of the response object to look at what we get back, and what we get back is mostly JavaScript. We can tell this by the <script> </script> tags.

html = requests.get(url)
print(html.text)

Output:

'<!DOCTYPE html><html lang="en-US"><head><script>\n          var __SUPPORTS_TIMING_API = typeof performance === \'object\' && !!performance.mark && !! performance.measure && !!performance.getEntriesByType;\n          function __perfMark(name) { __SUPPORTS_TIMING_API && performance.mark(name); };\n          var __firstLoaded = false;\n          function __markFirstPostVisible() {\n            if (__firstLoaded) { return; }\n            __firstLoaded = true;\n            __perfMark("first_post_title_image_loaded");\n          }\n        </script><script>

Scrolling down this response, we can see there is no information that is useful to us! So what is the next step to getting the data we want?
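One rough way to check whether a page is mostly script is to count how many characters sit inside <script> tags compared with everywhere else. The sketch below does this with the standard library’s HTMLParser on a tiny made-up page; the class name ScriptRatio and the sample HTML are assumptions for illustration, not something taken from Reddit.

```python
from html.parser import HTMLParser


class ScriptRatio(HTMLParser):
    """Count characters inside <script> tags versus all other text."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.script_chars = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.script_chars += len(data)
        else:
            # Ignore whitespace-only text between tags
            self.text_chars += len(data.strip())


sample = "<html><head><script>var a = 1;</script></head><body><p>Hi</p></body></html>"
counter = ScriptRatio()
counter.feed(sample)
counter.close()
print(counter.script_chars, counter.text_chars)
```

Feeding it the text of a real response instead of sample would give a rough idea of how little of the page is readable content.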

Scraping a JavaScript-driven page requires us to automate how the browser interacts with the website. To do this in Python we use the package Selenium.

Selenium is actually a browser testing tool, but it has become popular with people who need to do web scraping. The good news is that it’s fairly simple to pick up the basics!

To get started we import the webdriver module, which is how Selenium interacts with a browser. We invoke the webdriver and specify which browser we want to use. Then we can use Selenium’s methods to grab the data as the browser sees the webpage.

To use Selenium we have to grab ChromeDriver, which can be found here. This is a standalone package that can be used for testing. Watch out! Check which version of the Chrome browser you are using before downloading it; the ChromeDriver version must correspond to it. Other browsers can be used, and I refer you to the Selenium documentation here.

from selenium import webdriver
url = 'https://www.reddit.com/r/wallstreetbets/search/?q=flair%3A%22Daily%20Discussion%22&restrict_sr=1&sort=new'
driver = webdriver.Chrome(executable_path=r'c:\chromedriver.exe')
driver.get(url)

Notes
1. We import webdriver and define the URL we want to grab.
2. We define a variable driver by invoking webdriver.Chrome and passing the path where chromedriver can be found.
3. We use the webdriver get() method to open the page. This launches a Chrome window controlled by chromedriver and navigates to the website we specify.

Now we can start to use Selenium’s methods to navigate through the website. Selenium is quite versatile in that it can simulate browser actions like clicking, dragging and dropping, and scrolling. We won’t need to do anything elaborate.

So we now have access to the HTML that the scripts generate, and we can use Selenium to grab it. Selenium gives you the option to locate elements using a range of different attributes and tags. We will use XPath for this. XPath was originally designed to select content from XML documents, but it can be used for HTML documents too. There is a lot to take in with XPath, so we will go through the basics to get what we need. If you want to learn more about XPath please see here.
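To get a feel for XPath without opening a browser, here is a small sketch using Python’s built-in xml.etree.ElementTree, which supports a limited subset of XPath (Selenium hands the expression to the browser, which supports much more). The markup, class name and links below are simplified and made up for illustration; they are not Reddit’s real HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: each link is the grandparent of its title span
snippet = """
<div>
  <a href="/r/wallstreetbets/comments/abc123/daily_discussion_thread/">
    <div><span class="title">Daily Discussion Thread</span></div>
  </a>
  <a href="/r/wallstreetbets/comments/def456/other_thread/">
    <div><span class="title">Some Other Thread</span></div>
  </a>
</div>
"""
root = ET.fromstring(snippet)

# .//*[@class="title"] : any element, at any depth, whose class is "title"
titles = root.findall('.//*[@class="title"]')

# /../.. : step up twice from each match to reach the enclosing <a>
links = root.findall('.//*[@class="title"]/../..')

for span, a in zip(titles, links):
    print(span.text, '->', a.get('href'))
```

The two pieces to remember are the attribute filter in square brackets and the .. step, which moves from an element to its parent.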

Getting the correct thread link from Reddit

Now that we’re set up with Selenium, we need to look at the HTML in detail to see what will be the best way to grab the data we want from the page.

We can see from the HTML that the <span> tag holding the text of the thread link sits two levels below the actual link. In other words, the link is two parents up from the <span>.

Now looking at the page will tell you three things. 

  1. That weekends have only one thread and the date is from Friday to Sunday.
  2. The list has to be specific to the ‘Daily Discussion Thread’ and ‘Weekend Discussion Thread’
  3. We need to provide a way to grab yesterday’s link no matter the date. 

To do this, we will need to create a variable that holds yesterday’s date. To handle this we import the ‘datetime’ and ‘dateutil’ packages.

Datetime is a module that represents dates and times as objects. With these objects you can compare any two dates. We use this to get yesterday’s date.

The parse method of dateutil.parser takes any string from a website that looks vaguely like a date and converts it to a datetime object. This allows us to compare the dates in the thread titles to yesterday’s date.

We will then need to loop through the list of threads, looking for ‘Daily Discussion Thread’ and ‘Weekend’. Based on the date in the thread text we can grab the link. For the weekend, because we’re given a range of dates in the thread text, we have to make sure we grab the same link whether yesterday’s date is a Sunday or a Saturday.

from datetime import date, timedelta
from dateutil.parser import parse

yesterday = date.today() - timedelta(days=1)
# Selenium 3 API; Selenium 4 replaces this with driver.find_elements(By.XPATH, ...)
links = driver.find_elements_by_xpath('//*[@class="_eYtD2XCVieq6emjKBH3m"]')
for a in links:
    if a.text.startswith('Daily Discussion Thread'):
        # The last three words are the date, e.g. 'April 01, 2020'
        date_text = " ".join(a.text.split(' ')[-3:])
        parsed = parse(date_text)
        if parse(str(yesterday)) == parsed:
            link = a.find_element_by_xpath('../..').get_attribute('href')

    if a.text.startswith('Weekend'):
        weekend_date = a.text.split(' ')
        # Sunday is the end of the date range: month + last day + year
        parsed_date = weekend_date[-3] + ' ' + \
                      weekend_date[-2].split("-")[1] + ' ' + weekend_date[-1]
        parsed = parse(parsed_date)
        # Saturday is one day before the end of the range
        saturday = weekend_date[-3] + ' ' + \
                   str(int(weekend_date[-2].split("-")[1]
                           .replace(',', '')) - 1) + ' ' + weekend_date[-1]

        if parse(str(yesterday)) == parsed:
            link = a.find_element_by_xpath('../..').get_attribute('href')
        elif parse(str(yesterday)) == parse(str(saturday)):
            link = a.find_element_by_xpath('../..').get_attribute('href')

Notes
1. date.today() returns today’s date as a date object, and subtracting timedelta(days=1) takes one day off it. We store the result in the variable yesterday.

2. We define the variable links. The XPath expression is passed in as a quoted string. The // part searches the whole HTML document and the * matches any element. The part in square brackets keeps only the elements whose ‘class’ attribute is the class used for all thread link text.

3. We loop through all the thread links. With Selenium we have to use ‘.text’ to get an element’s underlying text, so a.text is the text of the Selenium element. We use the string method startswith on this text to check whether it’s a daily discussion thread or a weekend thread.

4. We split this string up, as we’re only interested in the last three parts of it. The split() method creates a list; we then take the last three items of the list and join them back into a string using the join string method.

5. Using the parse() method we can convert this string into a datetime object of that date.

6. We then compare yesterday’s date with the parsed date string from the thread link.

7. If those two dates are equal, we use XPath again to grab the link. If you look at the HTML code, the thread link text is two tags down from the link. So in this case we can ‘chain’ the Selenium XPath methods with a.find_element_by_xpath. The ../.. specifies we want the parent node of the parent node (the grandparent node) of the thread link text tag. This is where the link belonging to the thread link text is.

8. To handle the weekend we use similar code. This time we need to manipulate the text. weekend_date[-2] gives us ‘28-30,’ from the thread link text. So we split this part of the string up and use the ‘30’ (which corresponds to a Sunday) as the day. We build the date string with the month and year, which can then be converted into a datetime object using the parse method.

9. To handle the Saturday date, we need to make sure that the date we convert into a datetime object is the same as the yesterday variable when yesterday was a Saturday. To do this, we take one away from the ‘30’ part of the thread text. We call this variable saturday.

10. Once we’ve created the dates for Sunday and Saturday, we use if statements to compare the yesterday variable to the Sunday date on the thread link or the Saturday date we constructed. We assign the same variable link in both cases, as the thread is the same for both dates.
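The Saturday calculation in the code above subtracts one from the day number, which would not work if the Sunday fell on the 1st of a month. As a hedged alternative, here is a minimal sketch that does the same matching with timedelta, using only the standard library (datetime.strptime instead of dateutil). The function name thread_dates is made up, and the title formats are assumptions based on the examples in this article.

```python
from datetime import date, datetime, timedelta


def thread_dates(title):
    """Return the set of calendar dates that a thread title covers."""
    if not title.startswith(('Daily Discussion Thread', 'Weekend')):
        return set()
    # Assumed layout: the last three words are month, day(s) and year
    month, day_part, year = title.split(' ')[-3:]
    day_part = day_part.rstrip(',')
    if title.startswith('Weekend'):
        # '28-30' -> the range ends on the Sunday
        last_day = day_part.split('-')[1]
        sunday = datetime.strptime(f'{month} {last_day} {year}', '%B %d %Y').date()
        # Let timedelta work out the Saturday, even across a month boundary
        return {sunday - timedelta(days=1), sunday}
    day = datetime.strptime(f'{month} {day_part} {year}', '%B %d %Y').date()
    return {day}


yesterday = date.today() - timedelta(days=1)
title = 'Daily Discussion Thread - April 01, 2020'
print(yesterday in thread_dates(title))
```

Inside the loop over links, the check would then become a single test of whether yesterday is in thread_dates(a.text).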

Phew! Lots of code, but the end result is that, whatever day of the week it is, we grab yesterday’s date and compare it to the text of each thread link. If we get a hit, we grab that thread link specifically. This link can then be used to find the thread’s comments.

The output:

'https://www.reddit.com/r/wallstreetbets/comments/fpv2fn/daily_discussion_thread_april_01_2020/'

But for the purposes of the API we actually only need the thread id which is the ‘fpv2fn’ part. We can split this string up and grab the part we actually want.

stock_link = link.split('/')[-3]

Output:

'fpv2fn'

Notes
1. The link variable holds the full thread URL that we grabbed with the get_attribute method in the previous step.
2. We use the split string method to split this link up at each /. This creates a list of all the chunks of the link.
3. We only need the third-from-last item of that list, which is the thread id we use to get the comments.
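Counting from the end of the list relies on the link ending with a slash. A slightly more defensive sketch, using the standard library’s urlparse, looks for the chunk that follows ‘comments’ instead. The function name thread_id_from_url is made up for this example.

```python
from urllib.parse import urlparse


def thread_id_from_url(url):
    """Return the chunk of a Reddit thread URL that follows 'comments'."""
    parts = urlparse(url).path.strip('/').split('/')
    return parts[parts.index('comments') + 1]


link = 'https://www.reddit.com/r/wallstreetbets/comments/fpv2fn/daily_discussion_thread_april_01_2020/'
print(thread_id_from_url(link))
```

This gives the same id whether or not the URL has a trailing slash.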

Getting Reddit comments id’s from a Reddit thread

Reddit is quite user friendly when it comes to getting data from its website: it provides an API to do just that. We are actually going to use a simpler API called ‘Pushshift’, which is a big data API for Reddit. This is much more user friendly than the Reddit API for those who are not familiar with it! There’s also no need to authenticate, which is necessary for the Reddit API.

The way Reddit threads work is that every comment has an ‘id’ associated with it. This makes it easier to grab all the comments; it would be much harder if we had to click each comment individually to expand the posts and get at this information.

Pushshift provides us with a way to get a list of all the comment ids, which can then be used to grab all the comment text.

stock_link = link.split('/')[-3]
html = requests.get(f'https://api.pushshift.io/reddit/submission/comment_ids/{stock_link}')
raw_comment_list = html.json()
driver.close()

Notes
1. We use the requests package to interact with the Pushshift API, and we specify that we want the comment ids for the thread we grabbed earlier.
2. We use an f-string to insert the thread id into the Pushshift request.
3. The response object has a json() method that parses the JSON response into a Python dictionary holding all the comment ids.
4. We then close the Selenium browser, as we won’t need it after this.

When we look at the raw_comment_list it looks like this

{'data': ['fln2yod',
'fln2yuh',
'fln2z7c',
'fln2z7e',
'fln2za1', ..... ]}

So now that we have a list of all the comment ids of the post in question, let’s take a look at creating a list of stock tickers.

Creating a list of stock tickers


Make sure you have downloaded the txt file and placed it in the same folder as the Python file before writing the code! We will use the open function to read the txt file and make a list from it. This saves us from having to write large amounts of code.

with open('stockslist.txt', 'r') as w:
    stocks = w.readlines()
stocks_list = []
for a in stocks:
    a = a.replace('\n', '')
    stocks_list.append(a)

Notes
1. We use the with statement and the open function in read mode, and call the file object w.
2. We create a variable stocks and use the file method readlines() to grab a list of all the stocks.
3. We create an empty list stocks_list and loop through each stock ticker, invoking the replace() string method. We need to do this because readlines() leaves a newline \n on the end of each line that we don’t want. We then add the stock ticker to our list.
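The same steps can be written more compactly with a list comprehension. This is only a sketch: the function name load_tickers and the demo file are made up here, and the demo writes its own small temporary file so it can run without stockslist.txt.

```python
import os
import tempfile


def load_tickers(path):
    """Read one ticker per line, dropping newlines and blank lines."""
    with open(path, 'r') as f:
        return [line.strip() for line in f if line.strip()]


# Demo with a throwaway file standing in for the real ticker list
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'tickers_demo.txt')
    with open(demo_path, 'w') as f:
        f.write('AAPL\nTSLA\n\nMSFT\n')
    tickers = load_tickers(demo_path)

print(tickers)
```

Using strip() instead of replace('\n', '') also removes stray spaces, and the if clause skips any empty lines in the file.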

Grab the comment texts from Reddit 

We now can use the Pushshift API to grab the comment body for each comment we got in the list of the previous section.

import numpy as np
orig_list = np.array(raw_comment_list['data'])
comment_list = ",".join(orig_list[0:1000])

def get_comments(comment_list):
    # Ask Pushshift for the body text of up to 1000 comments by id
    html = requests.get(f'https://api.pushshift.io/reddit/comment/search?ids={comment_list}&fields=body&size=1000')
    newcomments = html.json()
    return newcomments

Notes
1. We select the list of comment ids with raw_comment_list['data'].
2. We feed this to a numpy array. The reason for doing this is that we sometimes have over 60,000 list items, and numpy is built to handle arrays of this size.
3. We define the variable comment_list by joining the comment ids together with a comma. This string can then be passed to the Pushshift API.
4. We can’t push all the comment ids into one API request; we have to do this 1000 ids at a time.
5. We make another request to the Pushshift API. The search?ids= part allows us to search the API with the comment id list. We do this by using an f-string again to insert our string of comment ids. The &fields parameter specifies we only want the body of each comment, and the &size parameter says how many results we want back.
6. This comes back as a JSON response and we use the response’s json() method to get the comment bodies.
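Building the query string by hand makes it easy to drop a character such as the = sign. As a sketch, the standard library’s urlencode can assemble it from a dictionary. The function name comment_search_url is made up; the endpoint and parameters are the ones used in this article, and whether the service still answers them is not something this example checks.

```python
from urllib.parse import urlencode

BASE = 'https://api.pushshift.io/reddit/comment/search'


def comment_search_url(comment_ids, size=1000):
    """Build the comment search URL for a batch of comment ids."""
    query = urlencode(
        {'ids': ','.join(comment_ids), 'fields': 'body', 'size': size},
        safe=',',  # keep the commas between ids readable
    )
    return f'{BASE}?{query}'


url = comment_search_url(['fln2yod', 'fln2yuh'])
print(url)
```

The resulting string could be passed to requests.get in place of the f-string.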

Analysing Reddit Comments

Now that we have the text of 1000 comments, we need to count how many times each stock ticker is mentioned. We will get to how we process all of the comments in the next section.

from collections import Counter
stock_dict = Counter()

def get_stock_list(newcomments, stocks_list):
    for a in newcomments['data']:
        for ticker in stocks_list:
            if ticker in a['body']:
                stock_dict[ticker] += 1

Notes

  1. We import the Counter class, which allows us to count how many times each stock ticker has been mentioned.
  2. We instantiate the Counter class, that is, we create an object from it. We can use this very much like a dictionary.
  3. We create a function that makes use of the comment text and the stock list we created earlier.
  4. We loop through newcomments['data'], which is a list of all the comments, calling each comment a.
  5. Now we want to check whether any of the stock tickers are in the comment text. To do this we go through each stock ticker and, for the comment body a['body'], check whether the ticker is in the text.
  6. If the ticker is mentioned, we add one to its count in the variable stock_dict, which is a dictionary-like type.
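One thing to be aware of is that the in check matches substrings, so a ticker such as ‘SLA’ would be counted whenever ‘TSLA’ appears. A possible refinement, sketched below, is to pull out whole upper-case words with a regular expression and compare those against the ticker list. The function name count_mentions and the sample comments are made up, and real words written in capitals could still collide with a ticker.

```python
import re
from collections import Counter

# Whole words made of 1 to 5 capital letters
WORD = re.compile(r'\b[A-Z]{1,5}\b')


def count_mentions(comment_bodies, tickers):
    """Count, per ticker, how many comments mention it as a whole word."""
    wanted = set(tickers)
    counts = Counter()
    for body in comment_bodies:
        # set(): count each ticker at most once per comment
        for ticker in set(WORD.findall(body)) & wanted:
            counts[ticker] += 1
    return counts


bodies = ['TSLA to the moon', 'Selling AAPL, buying TSLA', 'my tesla is great']
counts = count_mentions(bodies, ['TSLA', 'AAPL', 'SLA'])
print(counts)
```

Like the original function, this counts a ticker once per comment that mentions it, not once per occurrence.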

Grabbing all Reddit comments 

Both the Pushshift API and the Reddit API limit the number of requests you can make. So we have to package up the requests we make to the API. This is why we are using numpy: it lets us handle packing up the requests easily.

We start from the list without its first 1000 items, which we dealt with in the previous sections. We then use a while loop that takes the next 1000 items, gets their comments, updates the Counter we have and lets numpy delete those 1000 items, until nothing is left.

orig_list = np.array(raw_comment_list['data'])
remove_me = slice(0, 1000)
# The first 1000 ids were handled in the previous sections, so drop them
cleaned = np.delete(orig_list, remove_me)
while len(cleaned) > 0:
    print(len(cleaned))
    # Count tickers in the next 1000 comments, then remove them from the list
    new_comments_list = ",".join(cleaned[0:1000])
    newcomments = get_comments(new_comments_list)
    get_stock_list(newcomments, stocks_list)
    cleaned = np.delete(cleaned, remove_me)
stock = dict(stock_dict)

Notes
1. Like before, we load the list into a numpy array.
2. We define what we want to delete by using a slice of the first 1000 positions.
3. We define the cleaned-up list by invoking the numpy delete function on the original list with that slice, which removes the first 1000 ids.
4. A while loop runs whilst there are still items in the list. On each pass we create a string of the next 1000 comment ids, process it, and then delete those 1000 items so that the next pass sees the ids that follow.
5. Using the get_comments function with the new comment id list, we pass the result to the get_stock_list function, which updates the Counter. This is the wonderful thing about the Counter class: you can keep adding to it till you’re done.
6. Finally, because stock_dict is dict-like, we can convert it to a dictionary very easily. This will make it easier to output this data.
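The batching itself does not strictly need numpy. As a sketch of the same idea with plain list slicing, a small generator can hand out the ids 1000 at a time. The name batches and the fake ids below are made up for illustration.

```python
def batches(items, size=1000):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Fake comment ids, standing in for raw_comment_list['data']
ids = [f'id{n}' for n in range(2500)]
sizes = [len(batch) for batch in batches(ids)]
print(sizes)
```

Each batch could then be joined with commas and passed to get_comments, and a short pause between requests would help to stay within the request limits.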

Final Code

Here is the final code all in one place for you to look at before we output the dictionary in various ways.

import requests
import numpy as np
from collections import Counter
from datetime import date, timedelta
from dateutil.parser import parse
from selenium import webdriver


def grab_html():
    url = 'https://www.reddit.com/r/wallstreetbets/search/?q=flair%3A%22Daily%20Discussion%22&restrict_sr=1&sort=new'
    driver = webdriver.Chrome(executable_path=r'c:\chromedriver.exe')
    driver.get(url)
    return driver


def grab_link(driver):
    yesterday = date.today() - timedelta(days=1)
    links = driver.find_elements_by_xpath('//*[@class="_eYtD2XCVieq6emjKBH3m"]')
    for a in links:
        if a.text.startswith('Daily Discussion Thread'):
            date_text = " ".join(a.text.split(' ')[-3:])
            parsed = parse(date_text)
            if parse(str(yesterday)) == parsed:
                link = a.find_element_by_xpath('../..').get_attribute('href')
        if a.text.startswith('Weekend'):
            weekend_date = a.text.split(' ')
            parsed_date = weekend_date[-3] + ' ' + \
                          weekend_date[-2].split("-")[1] + ' ' + weekend_date[-1]
            parsed = parse(parsed_date)
            saturday = weekend_date[-3] + ' ' + \
                       str(int(weekend_date[-2].split("-")[1]
                               .replace(',', '')) - 1) + ' ' + weekend_date[-1]
            if parse(str(yesterday)) == parsed:
                link = a.find_element_by_xpath('../..').get_attribute('href')
            elif parse(str(yesterday)) == parse(str(saturday)):
                link = a.find_element_by_xpath('../..').get_attribute('href')
    stock_link = link.split('/')[-3]
    driver.close()
    return stock_link


def grab_commentid_list(stock_link):
    html = requests.get(f'https://api.pushshift.io/reddit/submission/comment_ids/{stock_link}')
    raw_comment_list = html.json()
    return raw_comment_list


def grab_stocklist():
    with open('stockslist.txt', 'r') as w:
        stocks = w.readlines()
        stocks_list = []
        for a in stocks:
            a = a.replace('\n', '')
            stocks_list.append(a)
    return stocks_list


def get_comments(comment_list):
    html = requests.get(f'https://api.pushshift.io/reddit/comment/search?ids={comment_list}&fields=body&size=1000')
    newcomments = html.json()
    return newcomments


def get_stock_list(newcomments, stocks_list):
    # Count the ticker mentions in one batch of comments
    stock_dict = Counter()
    for a in newcomments['data']:
        for ticker in stocks_list:
            if ticker in a['body']:
                stock_dict[ticker] += 1
    return stock_dict


def grab_stock_count(stock_dict, raw_comment_list, stocks_list):
    orig_list = np.array(raw_comment_list['data'])
    remove_me = slice(0, 1000)
    # The first 1000 ids were already counted in the main block, so drop them
    cleaned = np.delete(orig_list, remove_me)
    while len(cleaned) > 0:
        print(len(cleaned))
        new_comments_list = ",".join(cleaned[0:1000])
        newcomments = get_comments(new_comments_list)
        # Add this batch's counts to the running total
        stock_dict.update(get_stock_list(newcomments, stocks_list))
        cleaned = np.delete(cleaned, remove_me)
    stock = dict(stock_dict)
    return stock


if __name__ == "__main__":
    driver = grab_html()
    stock_link = grab_link(driver)
    raw_comment_list = grab_commentid_list(stock_link)
    stocks_list = grab_stocklist()
    # First batch of 1000 comment ids
    comment_list = ",".join(raw_comment_list['data'][0:1000])
    newcomments = get_comments(comment_list)
    stock_dict = get_stock_list(newcomments, stocks_list)
    # Remaining batches
    stock = grab_stock_count(stock_dict, raw_comment_list, stocks_list)

Demonstrate how to output the data as csv

If you’re not familiar with working with csv files, I suggest you look here. We will write a csv file from the dictionary we just created using the csv.writer method.

import csv

data = sorted(stock.items())
with open('stock.csv', 'w') as w:
    writer = csv.writer(w, lineterminator='\n')
    writer.writerow(['Stock', 'Number of Mentions'])
    for a in data:
        writer.writerow(a)

Notes

1. stock.items() gives us each stock ticker paired with its count as a tuple. We use sorted to order these tuples alphanumerically by ticker and call the resulting list data. Sorting the keys and the values separately and zipping them together would break the pairing between a ticker and its count.
2. We use a with statement and the open function to create the file stock.csv.
3. We define the variable writer with the csv writer method. We have to specify that we don’t want extra blank lines added when we write the rows, so we define lineterminator='\n'.
4. We use the writerow() method to create the column headings.
5. We then loop through the tuples we created and write each key and value as a row.
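If you would rather have the most talked-about stocks at the top of the file, here is a small sketch that orders the rows with Counter’s most_common() method instead. The function name mentions_to_csv and the sample counts are made up, and the example writes to an in-memory buffer so it can run without creating a file.

```python
import csv
import io
from collections import Counter


def mentions_to_csv(counts, out):
    """Write ticker counts to `out`, most mentioned first."""
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['Stock', 'Number of Mentions'])
    for ticker, n in Counter(counts).most_common():
        writer.writerow([ticker, n])


# Made-up counts standing in for the stock dictionary
buffer = io.StringIO()
mentions_to_csv({'AAPL': 3, 'TSLA': 7, 'MSFT': 5}, buffer)
print(buffer.getvalue())
```

To write a real file, pass an open file object in place of the buffer.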

Outputting Data to Google Sheets

We use the package ‘pygsheets’ to interact with the Google Sheets API!

For this you have to set up the Google Sheets API yourself first. I refer you to the pygsheets documentation, which takes you through this process here. Once this is done and we have the credentials in a json file, we can use pygsheets to interact with Google Sheets.

First we create a pandas dataframe from the data we created. We then use the json file to authorise use of the Google Sheets API. Create a Google Sheet; in its link you should see a string of letters and numbers, which is the key of the sheet we want to write the data to. We then use pygsheets to add a worksheet with our dataframe.

import pandas as pd
import pygsheets

df = pd.DataFrame(data, columns=['Stock', 'Number of Mentions'])
gc = pygsheets.authorize(client_secret='client_secret_1.json')
key = 'xxxxxx'
sheet = gc.open_by_key(key)
worksheet = sheet.add_worksheet("Stock list")
worksheet.set_dataframe(df, 'A1')

Notes
1. We create the pandas dataframe from our list of tickers and counts.
2. We use the authorize method of the pygsheets package and pass it the credentials file Google provides for accessing the Google Sheets API.
3. We define key as the key of the sheet we want to input data to.
4. A sheet variable is created to open up the googlesheet with pygsheets.
5. We then invoke the add_worksheet method and give it a name.
6. We then use the set_dataframe() method which uses our newly created pandas dataframe and starts inputting the data at cell A1. 

Conclusions

Now you should be able to get Reddit comments through the Pushshift API, analyse those comments for use in financial decision making, and output the results to either a CSV file or a Google Sheet.

Aaron Smith
