Last Updated on May 10, 2023
Table of contents:
- What is R?
- How can R be used for Finance?
- Why should I use R for Finance?
- Why shouldn’t I use R for Finance?
- What are some R alternatives?
- What is Econometrics?
- Stock problem definition
- How to download stock data with R?
- How to save data to CSV with R?
- What is linear regression?
- What is the null hypothesis?
- What is the p-value?
- How to do a linear regression on stock data with R?
- What is the Granger causality test?
- How to do a Granger causality test in R?
- What is Serial Correlation?
- How to do a Serial Correlation test with R?
- How to remove Serial Correlation from a model in R?
- What is Stationarity?
- How to test for Stationarity with R?
- What is Cointegration?
- How to test for Cointegration with R?
- What are the 3 Common Analysis/Testing Mistakes?
- Final thoughts
- Full code
What is R?
R is a programming language and free software environment that is primarily used for statistical computing and graphics. It is mostly used for data analytics and is favored by many scientists and statisticians.
How can R be used for Finance?
The financial sector has a plethora of data that requires careful and rigorous analysis in order to make good data-driven decisions, and R can shine here with its great statistical and graphical packages.
Why should I use R for Finance?
- R is free to use
- Has great packages
- Easy to learn
- Has advanced statistics
- Has a good community
Why shouldn’t I use R for Finance?
- Can have issues with handling Big Data
- Might be confusing for programming beginners
- R is a bit slower than some alternatives
- R is more of a specialty language and not a multi-purpose one like Python
What are some R alternatives?
Depending on the problem you’re trying to solve, R can be replaced with the following alternatives:
- Python
- Julia
- Matlab
- RapidMiner
- SPSS
- Jasp
- Jamovi
- Orange
- Mathematica
What is Econometrics?
Econometrics involves applying statistical and mathematical methods to economic data in order to uncover meaningful relationships. In other words, it is the quantitative analysis of economic phenomena.
Stock problem definition
In this article, we will address the problem of finding stocks that move in a similar manner over a certain time frame. In order to do this properly, some of the main statistical methods that we need to use are the following:
- Linear regression
- Serial correlation
- Stationarity test
- Cointegration test
Each of the above-mentioned procedures will help us uncover the real relationship between two or more stocks. To showcase these procedures, we will use the Box and Dropbox stocks.
The analysis that we’re going to perform can inform your domain knowledge when setting up a pairs trading strategy. You can read more about it here, and you can also check out our cluster analysis article.
If you’re unfamiliar with the R basics or would like a refresher, I’d advise skimming through some bits of this article. We will use RStudio without any fancy code for our analysis, and I’ll explain it bit by bit.
How to download stock data with R?
Stock data can be downloaded in many ways with R and we will use the tidyquant library that was built for handling financial data. If the library isn’t installed you can obtain it with the following command:
install.packages("tidyquant")
You’ll see your console being populated and if it gets cluttered at any moment you can press CTRL + L to clear it out. Now that the library is installed let’s load it and obtain stock data for Box and Dropbox.
library(tidyquant)

# Download daily price data for Box and Dropbox
boxx <- tq_get('BOX',
               from = '2020-01-01',
               to = '2021-06-06',
               get = 'stock.prices')
dropbox <- tq_get('DBX',
                  from = '2020-01-01',
                  to = '2021-06-06',
                  get = 'stock.prices')
Take note: as “box” is a function in R, we named our variable “boxx”. Now let’s print out the first few rows of our data so you can see what we obtained, and then graph our stocks.
library(ggplot2)

head(boxx)  # inspect the first few rows

boxx %>%
  ggplot(aes(x = date, y = adjusted)) +
  geom_line() +
  theme_classic() +
  labs(x = 'Date',
       y = "Adjusted Price",
       title = "Box price chart") +
  scale_y_continuous(breaks = seq(0, 300, 10))
dropbox %>%
  ggplot(aes(x = date, y = adjusted)) +
  geom_line() +
  theme_classic() +
  labs(x = 'Date',
       y = "Adjusted Price",
       title = "Dropbox price chart") +
  scale_y_continuous(breaks = seq(0, 300, 10))
Just by eyeballing the charts, we can say that the stocks have indeed moved in a similar fashion. But our eyes have fooled us too many times in our lives to be trusted, especially when it comes to financial matters.
On the bottom right side of your interface, you can browse through the plots. Now, let’s merge the data, rename the columns, and then take what we need from it into a new data frame for analysis.
# Merge
merged <- merge(boxx, dropbox, by = c("date"))
head(merged)

# Subset
df <- subset(merged, select = c("date", "adjusted.x", "adjusted.y"))

# Rename
names(df)[names(df) == "adjusted.x"] <- "box"
names(df)[names(df) == "adjusted.y"] <- "dropbox"
head(df)
How to save data to CSV with R?
To export a data frame to a CSV file in R, you will need to use the write.csv() command, where you specify the data frame and the file path.
To obtain your default directory path you can write the following:
dir <- getwd()
dir
And now we save our data:
write.csv(df, "box_dropbox.csv")
To load it again simply write the following:
df <- read.csv("box_dropbox.csv")
# Delete index created by loading
df <- df[,-1]
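Side note: you can avoid the extra index column altogether by passing row.names = FALSE, a standard argument of write.csv():

# Write without the row-number column, so no cleanup is needed on reload
write.csv(df, "box_dropbox.csv", row.names = FALSE)
df <- read.csv("box_dropbox.csv")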
What is linear regression?
Linear regression is a statistical method that attempts to model the relationship between two variables by fitting a straight line to the observed data. Simply said, it helps us understand if we can predict y given X.
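For reference (this is the standard textbook notation, not something specific to this article), the simple form of the model we will fit can be written as

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon$ is the error term that the line doesn’t explain.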
Before fitting a linear regression model on our data, we will need to go over some fundamental concepts that will aid our summary statistics interpretation.
What is the null hypothesis?
A null hypothesis states that there is no difference, relationship, or significance between two or more sets of data or measured phenomena.
An example of a null hypothesis would be the following sentence: There is no significant correlation between the observed stock movements of Box and Dropbox in a given time frame.
We typically want to disprove the null hypothesis, and the statistic most commonly used for that is the p-value.
What is the p-value?
The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.
The p-value helps us determine the significance of our results. By convention, if the p-value is less than 0.05 (5%), we have strong evidence to reject the null hypothesis; otherwise, we fail to reject it.
As this topic could be an article in itself, let’s put it this way: the lower our p-value is, the more surprising our evidence is, and the more untenable the null hypothesis looks.
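To make this concrete, here is a minimal simulated example (separate from our stock analysis) that produces a p-value with a t-test:

# Two simulated samples whose true means actually differ
set.seed(42)
a <- rnorm(100, mean = 0)
b <- rnorm(100, mean = 0.5)

# A small p-value is evidence against the null hypothesis of equal means
t.test(a, b)$p.value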
p-value downfalls
As you might have guessed already, the simplistic nature of the p-value, which is often misunderstood in itself, has led many astray. Some of the most prominent p-value downfalls are the following:
- Puts too much significance onto a single research paper and/or analysis
- Focuses on either/or thinking
- Doesn’t say anything about accepting the null hypothesis
- Depends on the quality of the sample (sampling should be random)
- Can easily be tinkered with to show significance and more.
p-value alternatives and supplements
We shouldn’t throw out the baby with the bathwater, and thus the best way is to supplement the p-value with other statistics and procedures that shed more light on our data. Some of these are the following:
- Replication
- Cross-validation
- Confidence Intervals
- Standardized measures of effect
- Statistical Power
- Margin of Error
- Bayesian statistics
- Meta-analysis and more.
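As a quick taste of one of these supplements, here is a minimal simulated sketch of pulling a confidence interval out of a test object in R:

# 95% confidence interval for the mean of a simulated sample
set.seed(42)
x <- rnorm(50, mean = 1)
t.test(x)$conf.int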
How to do a linear regression on stock data with R?
Linear regression on stock data can be done with R by using the in-built lm()
function and stating the appropriate variables. In our case, we will assume that the relationship between our stocks is linear.
Now, let us attach our dataset so R knows that we will pull the data from it and run the linear regression model on our stocks. We will also print out the summary statistics of our regression model.
attach(df)

# Regress Box prices on Dropbox prices and print the summary statistics
reg <- lm(box ~ dropbox)
summary(reg)
If you look at the “Estimate” column, it tells us how much our dependent variable (Box) changes for each one-unit increase in the predictor. In this case, for each $1 increase in Dropbox, the price of Box changes by $0.807.
Below that, you can see that the p-value is really low, meaning we can reject the null hypothesis; in other words, it is unlikely that the results above are due to random chance.
The F-statistic shows if there is a relationship between our predictor (Dropbox) and the response variable. The further it is from 1, the stronger the case for rejecting the null hypothesis.
The R-squared states that the movement in Dropbox explains 76% of the variance in Box. Let’s plot our linear regression diagnostics to see the fit and check for homoscedasticity.
# Show all four diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(reg)
The Scale-Location plot is used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a reliable indication of homoscedasticity. In our case, it isn’t too bad.
The Residuals vs Fitted plot helps us check the linear relationship assumption. If the red line is horizontal, or close to it, the linear relationship assumption holds.
In our case, it shows a pattern, and this isn’t surprising as we’re using a simple model on stock data. The Normal Q-Q plot shows if the residuals are normally distributed; points following the straight dashed line are a good indicator.
And finally, the Residuals vs Leverage plot shows us if we have any extreme values that might skew our linear regression results.
Now that you know some basics about the regression, reverse the variables and see what you get. What does the data tell you? For a look at how to do some of this in Python, you can check out this article.
What is the Granger causality test?
The Granger causality test is a statistical hypothesis test that shows if one time series can predict another. For example, can we predict Box prices by observing the movement of Dropbox prices with some time lag?
How to do a Granger causality test in R?
To perform a Granger causality test in R you will need to use the lmtest library and the grangertest()
function in which you specify the variables and lags.
Let’s do it for our data:
library(lmtest)

# Does the movement in Dropbox help predict Box? (using 3 lags)
grangertest(box ~ dropbox, order = 3, data = df)
As you can see, the p-value and F statistic support rejecting the null hypothesis, and thus we can say that the movement in Dropbox can help predict Box. But keep in mind the downfalls of the p-value, and note that it isn’t all that small here.
Now try to reverse the variables and see what you get.
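If you want to check your result, the reversed test is just a swap of the formula sides:

# Does the movement in Box help predict Dropbox?
grangertest(dropbox ~ box, order = 3, data = df)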
What is Serial Correlation?
Serial Correlation, also known as autocorrelation, represents a relationship between a variable and a lagged version of itself over a specified time frame. If the variable is serially correlated it implies that it is not random.
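To see what this means in practice, here is a minimal simulated sketch (separate from our stock data) that measures lag-1 serial correlation by hand:

# Simulate an AR(1) series with strong autocorrelation
set.seed(3)
x <- as.numeric(arima.sim(model = list(ar = 0.8), n = 200))

# Correlate the series with itself shifted by one step; expect a value near 0.8
cor(x[-1], x[-length(x)])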
How to do a Serial Correlation test with R?
A Serial Correlation test can be done with R by using the Durbin-Watson test, which is available as the dwtest() function from the lmtest library. As a parameter, you pass the model that you want to test for serial correlation.
We want to do this test as a supplement to our linear regression, so we will check our model for serial correlation. We do this because the linear regression might have picked up noise from correlated error terms in our time series.
dwtest(reg)
Result:
data: reg
DW = 0.16026, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
The DW statistic is far below 2 (values near 2 indicate no autocorrelation) and the p-value is practically zero, so we reject the null hypothesis in favor of the alternative. This means that we indeed have a case of serial correlation in our data.
If you look at the residuals from our regression you will see that we have an overall positive trend. The residual is the difference between the observed value and the predicted one.
residuals <- reg$residuals
plot(residuals, type = 'l')
And if we difference both series and rerun the regression (stored as lagged_reg below), we can see how the serial correlation impacted the main model.
# Pull out each price series and difference it to remove the trend
sub_box <- subset(df, select = c("date", "box"))
sub_dropbox <- subset(df, select = c("date", "dropbox"))
d_box <- diff(as.numeric(unlist(sub_box["box"])))
d_dbox <- diff(as.numeric(unlist(sub_dropbox["dropbox"])))

# Regress the differenced series and plot the new residuals
lagged_reg <- lm(d_box ~ d_dbox)
summary(lagged_reg)
lagged_reg_res <- lagged_reg$residuals
plot(lagged_reg_res, type = 'l')
How to remove Serial Correlation from a model in R?
To eliminate Serial Correlation in your model, you can use the Cochrane-Orcutt procedure. This can be done with the cochrane.orcutt() function that is part of the orcutt package.
Let us run it on our model and see the difference:
#install.packages("orcutt")
library(orcutt)
co <- cochrane.orcutt(reg)
summary(co)
dwtest(co)
As you can see, we have quite an improvement on the serial correlation front. The results show a lower coefficient estimate (0.43) compared to the prior one (0.80), and the Durbin-Watson test isn’t significant anymore, so we have eliminated the autocorrelation.
Now, let us take the residuals from the first regression and the differenced (lagged_reg) one so you can see the correlation at each lag. This will give us a sense of the stationarity that we will test later on.
# Autocorrelation of the residuals from the original and the differenced regressions
acf_residual_reg <- acf(residuals)
acf_lag_residual_reg <- acf(lagged_reg_res)
acf_residual_reg
acf_lag_residual_reg
On the left chart, you can see some substantial correlation between the residuals of the original model, while for the differenced model (right chart) the correlations have dropped, which is what we want.
If you need a practical guide to correlations you can check out this article.
But there is one thing that we need to keep in mind: the Cochrane-Orcutt procedure requires our time series to have a constant mean and variance, i.e. to be stationary.
What is Stationarity?
Stationarity in a time series means that the said time series has the same mean, variance, and autocorrelation over time. For example, if your stock data doesn’t show any trends it is stationary.
We need the data to be stationary in order to be able to make accurate predictions. If the data is full of noise and trends, we can’t use it for reasonable predictions.
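As a quick illustration of the difference, here is a simulated sketch: white noise is stationary, while a random walk (the cumulative sum of noise) is not:

# White noise has a constant mean and variance -> stationary
set.seed(1)
white_noise <- rnorm(200)

# A random walk wanders away from its mean -> non-stationary
random_walk <- cumsum(rnorm(200))

par(mfrow = c(1, 2))
plot(white_noise, type = 'l', main = "Stationary")
plot(random_walk, type = 'l', main = "Non-stationary")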
How to test for Stationarity with R?
Testing for Stationarity can be done by using several tests in R. Some of these tests are the following: ADF, KPSS, Phillips-Perron, Zivot-Andrews, and more.
Let’s check out what our data shows us and run some of the mentioned tests in order:
#install.packages("egcm")
#install.packages("tseries")
library(egcm)
library(tseries)
# ADF test: the null hypothesis is that the series is non-stationary
adf.test(as.numeric(unlist(sub_box["box"])))
adf.test(as.numeric(unlist(sub_dropbox["dropbox"])))

# The same test on the differenced series
adf.test(d_box)
adf.test(d_dbox)
Here we can observe that our data is non-stationary because we can’t reject the null hypothesis. But when we do a test on the difference of our stock data we obtain the opposite.
When we run the Phillips-Perron test, it confirms the results of the previous one.
# Phillips-Perron test: same null hypothesis as the ADF test
pp.test(as.numeric(unlist(sub_box["box"])))
pp.test(as.numeric(unlist(sub_dropbox["dropbox"])))
pp.test(d_box)
pp.test(d_dbox)
And finally, the KPSS test confirms it once again. The null hypothesis of the KPSS test states that the data IS stationary.
kpss.test(as.numeric(unlist(sub_box["box"])), null="Trend")
kpss.test(as.numeric(unlist(sub_dropbox["dropbox"])), null="Trend")
What is Cointegration?
Cointegration can be viewed as a long-run relationship between two or more variables. For example, if two stocks tend to follow the same trends over enough time, we can say that they are cointegrated.
You can imagine it like walking an overly curious dog on a leash. The dog will move around and go from side to side, but it won’t ever stray too far from you because it is on the leash. Well, that is cointegration.
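To build some intuition before we test our stocks, here is a simulated sketch of a cointegrated pair: two non-stationary series that share a common random walk, so a combination of them stays stationary:

# A shared random walk plus independent noise for each series
set.seed(7)
common <- cumsum(rnorm(300))
x <- common + rnorm(300, sd = 0.5)
y <- 2 * common + rnorm(300, sd = 0.5)

# Each series wanders, but the spread y - 2*x is stationary (the "leash")
plot(y - 2 * x, type = 'l', main = "Stationary spread")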
How to test for Cointegration with R?
Testing for Cointegration with R can be done by using the Engle-Granger test that comes as a part of the egcm package.
Let’s see how our two stocks do on the cointegration front:
egcm(as.numeric(unlist(sub_box["box"])), as.numeric(unlist(sub_dropbox["dropbox"])))
egcm(d_box, d_dbox)
plot(egcm(as.numeric(unlist(sub_box["box"])), as.numeric(unlist(sub_dropbox["dropbox"]))))
plot(egcm(d_box, d_dbox))
As we can observe, the pairs seem to be cointegrated while their differences, which I added out of curiosity, seem not to be.
Now let’s look at the plots, which will shed more light on what the cointegration looks like:
What are the 3 Common Analysis/Testing Mistakes?
When you run your analysis, there are 3 common mistakes to take note of:
- Overfitting
- Look-ahead Bias
- P-hacking
Do check out this lecture PDF to learn more: 3 Big Mistakes of Backtesting – 1) Overfitting 2) Look-Ahead Bias 3) P-Hacking
Final thoughts
When working with a time series you need to test your model for serial correlation and correct it with the Cochrane-Orcutt method. We also want to make sure that the time series is stationary so we can make predictions.
When looking at the relationship between time series, you should test for cointegration because it is more informative. Too many people just look for plain correlations, which can be meaningless.
And remember these wise words: “Correlation doesn’t mean causation!”.