Hi all, welcome back. This blog post is going to deal with creating the initial stages of our **Python backtesting mean reversion** script – we’re going to leave the “symbol pairs” function we created in the last post behind for a bit (we’ll come back to it a bit later) and use a single pair of symbols to run our first few stages of the backtest to keep it simple.

Once we have the script working for a single input of one pair of symbols, we can very easily adapt it at the end to work with the symbol pairs function created previously.

To give you all a brief outline of what we will be tackling in this post, here’s a quick list of steps:

1) Define our symbol pair, download the relevant price data from Yahoo Finance and make sure the data downloaded for each symbol covers the same date range.

2) Plot the two ETF price series against each other to get a visual representation, then run a Seaborn “jointplot” to analyse the strength of correlation between the two series.

3) Run an Ordinary Least Squares regression on the closing prices to calculate a hedge ratio. Use the hedge ratio to generate the spread between the two prices, and then plot this to see if it looks in any way mean reverting.

4) Run an Augmented Dickey Fuller test on the spread to confirm statistically whether the series is mean reverting or not. We will also calculate the Hurst exponent of the spread series.

5) Run an Ordinary Least Squares regression on the spread series and a lagged version of the spread series in order to then use the coefficient to calculate the half-life of mean reversion.

Right, now let's get to some code. Time to import the relevant modules we will need, set our ETF ticker symbols and download the price data from Yahoo Finance.

```python
#import needed modules
from datetime import datetime
from pandas_datareader import data
import pandas as pd
import numpy as np
from numpy import log, polyfit, sqrt, std, subtract
import statsmodels.tsa.stattools as ts
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import pprint

#choose ticker pair for our testing
symbList = ['EWA','EWC']
start_date = '2012/01/01'
end_date = datetime.now()

#download data from Yahoo Finance
y = data.DataReader(symbList[0], "yahoo", start=start_date, end=end_date)
x = data.DataReader(symbList[1], "yahoo", start=start_date, end=end_date)

#rename column to make it easier to work with later
y.rename(columns={'Adj Close':'price'}, inplace=True)
x.rename(columns={'Adj Close':'price'}, inplace=True)

#make sure DataFrames cover the same date range
min_date = max(df.dropna().index[0] for df in [y, x])
max_date = min(df.dropna().index[-1] for df in [y, x])
y = y[(y.index >= min_date) & (y.index <= max_date)]
x = x[(x.index >= min_date) & (x.index <= max_date)]
```

Now that we have our price data stored in DataFrames, let's just bring up a quick plot of the two series to see what information we can gather.

```python
plt.plot(y.price, label=symbList[0])
plt.plot(x.price, label=symbList[1])
plt.ylabel('Price')
plt.xlabel('Time')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
```

It looks like the prices are definitely correlated to a degree, and generally move in the same direction. But just how strong is this correlation? Well there’s an easy way to get a clearer visual representation of this…we can just use a Seaborn “jointplot” as follows:

```python
sns.jointplot(y.price, x.price, color='b')
plt.show()
```

We can see from the information provided in the jointplot that the Pearson Correlation coefficient is 0.87 – so we can see that there is definitely a pretty strong correlation here between the price series. This sets the pair up as a potentially good fit for a mean reversion strategy.
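If you want the correlation figure itself without the plot, pandas can give you the Pearson coefficient directly via `Series.corr`. Here's a minimal sketch, using two made-up random-walk series as stand-ins for the downloaded ETF prices:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two downloaded ETF price series
np.random.seed(42)
base = np.cumsum(np.random.normal(0, 1, 500)) + 100
y_price = pd.Series(base + np.random.normal(0, 0.5, 500))
x_price = pd.Series(0.9 * base + np.random.normal(0, 0.5, 500))

# Pearson correlation coefficient - the same figure the jointplot annotates
print(round(y_price.corr(x_price), 4))
```

In the real script this would simply be `y.price.corr(x.price)`.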

What we need to do now is create the spread series between the two prices by first running a linear regression analysis between the two price series.

```python
#run Ordinary Least Squares regression to find hedge ratio
#and then create spread series
df1 = pd.DataFrame({'y':y['price'],'x':x['price']})
est = sm.OLS(df1.y, df1.x)
est = est.fit()
df1['hr'] = -est.params[0]
df1['spread'] = df1.y + (df1.x * df1.hr)
```

Right so we have now managed to run the regression between the ETF price series; the beta coefficient from this regression was then used as the hedge ratio to create the spread series of the two prices.

If we plot the spread series we get the following:

```python
plt.plot(df1.spread)
plt.show()
```

So it looks relatively mean reverting. But sometimes looks can be deceiving, so really it would be great if we could run some statistical tests on the spread series to get a better idea. The test we will be using is the Augmented Dickey Fuller test. You can have a quick read up about it here if you need to refresh your memory:

```python
cadf = ts.adfuller(df1.spread)
print('Augmented Dickey Fuller test statistic =', cadf[0])
print('Augmented Dickey Fuller p-value =', cadf[1])
print('Augmented Dickey Fuller 1%, 5% and 10% test statistics =', cadf[4])
```

This gets us the following:

```
Augmented Dickey Fuller test statistic = -3.21520854685
Augmented Dickey Fuller p-value = 0.0191210549486
Augmented Dickey Fuller 1%, 5% and 10% test statistics = {'5%': -2.86419037625175, '1%': -3.436352507699052, '10%': -2.5681811468354598}
```

From this we can see that the test statistic of -3.215 is larger in absolute terms than the 10% critical value of -2.568 and the 5% critical value of -2.864, but not the 1% critical value of -3.436. This means we can reject the null hypothesis that there is a unit root in the spread series (i.e. that the series is not mean reverting) at the 10% and 5% levels of significance, but not at the 1% level.

The p-value of 0.0191 means that we can reject the null hypothesis at any significance level down to 1.91%. That's pretty good in terms of statistical significance, and from this we can be pretty confident that the spread series does in fact possess mean reverting qualities.
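To make that comparison against the critical values explicit in code, here's a small helper (the name `interpret_adf` is my own, not part of statsmodels) that checks the test statistic against each critical value returned by `adfuller`:

```python
def interpret_adf(test_stat, critical_values):
    """Return, per significance level, whether the unit-root null is rejected.

    test_stat corresponds to cadf[0] and critical_values to the cadf[4]
    dict returned by statsmodels.tsa.stattools.adfuller.
    """
    # The null is rejected when the statistic is more negative than
    # the critical value at that level
    return {level: test_stat < crit for level, crit in critical_values.items()}

# Using the figures from the output above
crits = {'1%': -3.436352507699052, '5%': -2.86419037625175, '10%': -2.5681811468354598}
print(interpret_adf(-3.21520854685, crits))
# Rejects at the 5% and 10% levels, but not at 1%
```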

The last thing we will do is run a quick function to calculate the Hurst exponent of the spread series.

For info on the Hurst Exponent please refer to: this article

To simplify things, the important info to remember here is that a time series can be characterised in the following manner with regard to the Hurst exponent (H):

- H < 0.5 – the time series is mean reverting
- H = 0.5 – the time series is a Geometric Brownian Motion
- H > 0.5 – the time series is trending

I have “borrowed” a code snippet of a Hurst Exponent function found on the www.quantstart.com blog (great site by the way – definitely worth checking out).

The post containing the “borrowed” code can be found here:

and here is the code:

```python
def hurst(ts):
    """Returns the Hurst Exponent of the time series vector ts"""
    # Create the range of lag values
    lags = range(2, 100)
    # Calculate the array of the variances of the lagged differences
    tau = [sqrt(std(subtract(ts[lag:], ts[:-lag]))) for lag in lags]
    # Use a linear fit to estimate the Hurst Exponent
    poly = polyfit(log(lags), log(tau), 1)
    # Return the Hurst exponent from the polyfit output
    return poly[0]*2.0
```

Now we run the function on the spread series:

print "Hurst Exponent =",round(hurst(df1.spread),2)

and get:

Hurst Exponent = 0.42

The Hurst Exponent is under the 0.5 value of a random walk, so we can conclude that the series is mean reverting, which backs up our conclusion from the Augmented Dickey Fuller test. Good! This means the spread series looks like a definite candidate for a mean reversion strategy.

However just because a time series displays mean reverting properties, it doesn’t necessarily mean that we can trade it profitably – there’s a difference between a series that deviates and mean reverts every week and one that takes 10 years to mean revert. I’m not sure too many traders would be willing to sit and wait around for 10 years to close out a trade profitably.

To get an idea of how long each mean reversion is going to take, we can look into the “half-life” of the time series. Please click here for more info on half-life.

We can calculate this by running a linear regression between the spread series and a lagged version of itself. The Beta coefficient produced by this regression can then be incorporated into the Ornstein-Uhlenbeck process to calculate the half-life.

Please see here for some more info.
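For reference, here is a sketch (in my own notation) of how the regression coefficient gets turned into a half-life, assuming the spread follows an Ornstein-Uhlenbeck process:

```latex
% Assumed OU dynamics for the spread X_t:
dX_t = \lambda(\mu - X_t)\,dt + \sigma\,dW_t
% Discretised, regressing spread changes on the lagged spread gives
\Delta X_t = c + \beta X_{t-1} + \varepsilon_t, \qquad \beta \approx -\lambda
% Deviations from the mean decay like e^{-\lambda t}, halving after
t_{1/2} = \frac{\ln 2}{\lambda} = -\frac{\ln 2}{\beta}
```

That final expression is where the `-log(2) / beta` calculation comes from.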

The code to calculate this is as follows:

```python
#Run OLS regression on spread series and lagged version of itself
spread_lag = df1.spread.shift(1)
spread_lag.iloc[0] = spread_lag.iloc[1]
spread_ret = df1.spread - spread_lag
spread_ret.iloc[0] = spread_ret.iloc[1]
spread_lag2 = sm.add_constant(spread_lag)

model = sm.OLS(spread_ret, spread_lag2)
res = model.fit()
halflife = round(-np.log(2) / res.params[1], 0)
print('Halflife =', halflife)
```

Which gets us:

Halflife = 40.0

So according to this result, the halflife of mean reversion is 40 days. That’s not too bad, and not so long as to automatically exclude it from consideration for a mean reversion strategy. Ideally the half-life would be as short as possible so as to provide us with more profitable trading opportunities but there you have it, 40 days is what we have.

OK so I think I’ll cut it here as this is getting a little long. Next post we will work on producing a normalised “Z Score” series that allows us to see the deviation away from the local mean in terms of standard deviations. We will then begin the actual backtest itself using Pandas and see if we can produce something that is any way profitable.

If anyone has any comments, please leave them below – always eager to hear the thoughts of others.

Until next time!

## 8 comments

Hello, you are doing a great job here quoting excellent examples, but I wanted to understand what has changed, as I am unable to find an update on the difference function:

```python
def hurst(ts):
    lags = range(2, 100)
    tau = [np.sqrt(np.std(difference(ts[lag:], ts[:-lag]))) for lag in lags]
    poly = polyfir(log(lags), log(tau), 1)
    return poly[0]*2
```

When I am running the above code, it throws the below error:

`NameError: name 'difference' is not defined`

Could you please check and help me understand the change.

thanks

Use this one:

```python
def hurst(ts):
    """Returns the Hurst Exponent of the time series vector ts"""
    # Create the range of lag values
    lags = range(2, 100)
    # Calculate the array of the variances of the lagged differences
    tau = [sqrt(std(subtract(ts[lag:], ts[:-lag]))) for lag in lags]
    # Use a linear fit to estimate the Hurst Exponent
    poly = polyfit(log(lags), log(tau), 1)
    # Return the Hurst exponent from the polyfit output
    return poly[0]*2.0
```

[…] I will also define a function for “Halflife” which just recycles some of the code from my mean reversion pairs trading blog post from a couple of years ago, which can be found here. […]

It would be a nice touch to show how, from the OLS regression of the spread and the lag, one could use the OU dynamics to simulate the data.

Great blog!! I just tried to run the Hurst test and got the following error message:

```
Hurst Exponent = nan
__main__:10: RuntimeWarning: divide by zero encountered in log
```

Using this code:

```python
# Hurst Test
def hurst(ts):
    """Returns the Hurst Exponent of the time series vector ts"""
    # Create the range of lag values
    lags = range(2, 100)
    # Calculate the array of the variances of the lagged differences
    tau = [sqrt(std(subtract(ts[lag:], ts[:-lag]))) for lag in lags]
    # Use a linear fit to estimate the Hurst Exponent
    poly = polyfit(log(lags), log(tau), 1)
    # Return the Hurst exponent from the polyfit output
    return poly[0]*2.0

print("Hurst Exponent =", round(hurst(df1.spread), 2))
```

Please help

It seems like one of the values you are passing to numpy.log is zero. Seeing as `lags` is just a range with no zero value, it must be one of your `tau` values I would imagine. That's the error message you get when this occurs.

Change this line:

```python
tau = [sqrt(std(subtract(ts[lag:], ts[:-lag]))) for lag in lags]
```

to:

```python
tau = [sqrt(std(subtract(ts.values[lag:], ts.values[:-lag]))) for lag in lags]
```

What if both of your prices are negatively correlated, and can both be negative prices at times?

I think you would change it to:

```python
df1['hr'] = est.params[0]
```

instead of:

```python
df1['hr'] = -est.params[0]
```

Would that be correct? Anything else to change?