
Build a Financial Data Database with Python

by s666

Hi all, and welcome back to the site – I appreciate it has been an unexpectedly long time since I last posted… in fact, my last post was around this time last year. Hopefully I can get back on the “treadmill” and churn out articles at a somewhat faster rate than one a year over the next couple of months! Well, that’s my aim anyway.

Ok, so this post will be based on how to build and structure what is often referred to as a “data pipeline”; essentially, it is the part of the overall project workflow concerned with the gathering and storing of data, along with any wrangling/munging/preprocessing/transforming of that data for later use.

Ideally, this endeavour should aim to produce a code module which is robust, efficient, reusable, scalable and maintainable. It should also aim to produce a well-structured, easily accessible store of “high-quality”, ready-to-use financial data sets and series.

I thought I’d start by sketching out a rough mind-map style representation of the concepts and ideas I want to cover and the basic points I want to make; this is shown below.

I’m hoping this post (call it Part 1) will allow me to cover the first 5 subsections:

1) Categories of Data
2) Sources of Data
3) Downloading/Collecting Data
4) Data Cleaning/Wrangling
5) Data Storage System

The remaining subsections will be covered in a subsequent post, or potentially “posts” depending on how detailed and lengthy it all ends up being. On to subsection 1:

Categories of Data:

Of course, data can come in all shapes, sizes and formats; however, when dealing with financial topics and the finance domain in general, we tend to concentrate heavily on quantitative data, and on certain forms of qualitative data depending on the issue at hand.

Quantitative data basically includes anything that can be measured and represented in numerical format, i.e. any quantifiable information that can be used for mathematical calculations and statistical analysis.

Qualitative data (also referred to as categorical data), on the other hand, is information that is non-numerical in nature (with a slight caveat which I will come to in a second), usually text based or word related. For example, if a company has a market capitalisation of $100m, that is clearly considered quantitative data, whereas the location of its headquarters, New York for example, would be considered qualitative data. This may seem a bit basic and obvious to many, but I mention it in order to clarify the point that I shall be concentrating heavily on gathering and storing quantitative data sets, with the exception of a small amount of “meta” style data and some company-level industry sector categorisations and such.

The caveat mentioned above regarding what is considered quantitative data was alluding to the fact that it is not simply whether a value is numeric which determines whether it is to be classed as “quantitative”, but rather whether it can be used for mathematical calculations and analysis. For example, a phone number is represented numerically, but is considered to be qualitative data, as there is no logical way to perform any calculations or carry out any statistical analysis on a list of phone numbers.

Within quantitative data we have three major categories, namely “time series data”, “cross-sectional data” and “panel data”.

1) Time series data refers to observations on a single entity over a number of time periods – the classic example being a financial asset’s price history.

2) Cross-sectional data refers to data that is collected from multiple entities at a single point in time, for example the P/E ratios of the Dow 30 constituents on a certain date.

3) Panel data is somewhere in between and refers to data that is collected from multiple entities over a number of time periods. An example may be the annual GDPs of the UK, US, Russia and China from 2010 to 2020.
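To make the three shapes concrete, here is a small illustrative sketch (all values invented) of how each category typically appears in pandas:

```python
import pandas as pd

# Time series: one entity (a single stock) over many time periods
time_series = pd.Series(
    [100.0, 101.5, 99.8],
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    name="close",
)

# Cross-sectional: many entities at a single point in time
cross_section = pd.Series({"AAPL": 28.5, "MSFT": 34.1, "IBM": 12.9}, name="pe_ratio")

# Panel: many entities over many time periods - a MultiIndex fits naturally
panel = pd.DataFrame(
    {"gdp": [2.7, 2.9, 21.4, 22.3]},
    index=pd.MultiIndex.from_product(
        [["UK", "US"], [2019, 2020]], names=["country", "year"]
    ),
)
```

Note how panel data really is “time series plus cross-section”: slicing the panel on one country gives a time series, while slicing on one year gives a cross-section.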

Sources of Data:

This section could fill a whole book to be honest; there are so many factors to consider when it comes to the question of where to source your data from: whether to use freely available or paid-for data providers, whether you need intra-day granularity or are satisfied with daily granularity and above, whether you need a data set which accounts for survivorship bias, whether you intend to use the data to develop a trading algo/strategy that you plan to deploy “live” with real capital at stake, and so on.

I don’t want to get too bogged down with those sorts of questions in this article, rather I want to concentrate on the actual process of building the tools necessary to gather and store data from whichever source one ends up deciding to use. For illustration purposes I shall focus exclusively on freely available data sources to allow everyone to follow along and build their pipeline for free.

Below are links to two curated repos of financial data sources – each one has about 25–30 sources listed (although there is most likely a decent bit of overlap between the two lists):

https://github.com/wilsonfreitas/awesome-quant#data-sources

https://awesomeopensource.com/projects/financial-data

I will choose pandas-datareader and investpy to build my pipeline – these are just used as examples of course, and if one would prefer to use different packages and sources that is absolutely fine; however, it may limit the ability to follow along with the rest of my code, as portions will use package-specific syntax.

Information on each of the two packages can be found at the following links:

https://github.com/pydata/pandas-datareader

https://github.com/alvarobartt/investpy

Downloading/Collecting Data:

To outline which specific data sets I plan to capture: I will begin with the stocks in the S&P 500 index, looking to gather the daily price history for each stock, along with a collection of “meta-data” for each company and a comprehensive collection of company fundamental data in a cross-sectional format (i.e. I will look to scrape the latest fundamental data values rather than any historic values).

I shall then gather historic price series for a range of spot currency pairs followed by some historic fund prices and related fund “meta-data” style information.

Before we begin writing the code necessary to carry out the actual retrieval and downloading of the planned data, we must first design our database structure; the final structure needs to allow quick and efficient retrieval of data in the future, and also allow future data set updates to be automated – once your data store gets large and complex enough, it would be extremely time consuming to have to update everything manually.

There are a number of options when it comes to choosing a data storage format:

1) Flat-File Storage – this is simply a collection of individual files, usually csv files or similar, with each file containing data in a two-dimensional matrix or “table” structure with rows and columns.

2) NoSQL Database – This is mainly used when storing less “structured” data that doesn’t fit nicely into the table paradigm and may not follow the “relational” logic of more common SQL databases.

3) Relational Database – This is a database that adheres to the “relational model” and most commonly uses the SQL data definition and query language. This is by far the most common choice for storing financial data as we can usually structure our data sets as various individual items which we separate into tables and define certain relationships between those tables.

It makes sense at this point to identify what we would like to define as our “individual items” as it were, i.e. the data sets that we will map across into the various tables in our database.

Let’s start simple and only consider the structure for the equity data that we are gathering for now – in our case the constituents of the S&P500 index. We can tag on the various tables and possibly whole databases to the mix as and when we need to, i.e. when we start to deal with the spot currency data and fund data that we plan to include.

I plan to segregate my equity data into four separate tables, using unique ids and foreign keys to create the relevant relationships and links between them.

The tables will be as follows:

1) Exchange – the stock exchange on which the stocks trade

2) Company – the company represented by each symbol/ticker

3) Security – the actual share/asset belonging to each company

4) Security Price – the security’s historic price and volume data

And the keys will be set up as follows:

The “company” table will include a foreign key “security_id”

The “security” table will include two foreign keys, “company_id” and “exchange_id”

The “security_price” table will include a foreign key “security_id”

Ok, let us begin coding! 😉

We start by creating our tables with the following code:

import os
import sqlite3
from sqlite3 import Error


def create_connection(db_file):
    """ create a database connection to the SQLite database
        specified by db_file
    :param db_file: database file
    :return: Connection object or None
    """
    conn = None
    try:
        conn = sqlite3.connect(db_file)
        # enforce foreign key constraints (SQLite leaves them off by default)
        conn.execute("PRAGMA foreign_keys = 1")
    except Error as e:
        print(e)

    return conn


def create_table(conn, create_table_sql):
    """ create a table from the create_table_sql statement
    :param conn: Connection object
    :param create_table_sql: a CREATE TABLE statement
    :return:
    """
    try:
        c = conn.cursor()
        c.execute(create_table_sql)
    except Error as e:
        print(e)


db_name = 'db_pff7.db'
        
def main():
    database = os.path.join('..','data',db_name)
    
    sql_create_exchange_table = """ CREATE TABLE IF NOT EXISTS exchange (
                                        id integer,
                                        name text NOT NULL,
                                        currency text,
                                        code text NOT NULL UNIQUE,
                                        PRIMARY KEY(id)
                                    ); """

    sql_create_company_table = """CREATE TABLE IF NOT EXISTS company (
                                    id integer ,
                                    name text NOT NULL,
                                    industry text,
                                    sector text,
                                    hq_location text,
                                    security_id integer,
                                    PRIMARY KEY(id),
                                    FOREIGN KEY (security_id) REFERENCES security (id)
                                );"""
    
    sql_create_security_table = """CREATE TABLE IF NOT EXISTS security (
                                id integer,
                                ticker text NOT NULL UNIQUE,
                                name text NOT NULL,
                                company_id integer,
                                exchange_id integer,
                                PRIMARY KEY(id),
                                FOREIGN KEY (company_id) REFERENCES company (id),
                                FOREIGN KEY (exchange_id) REFERENCES exchange (id)
                            );"""

    sql_create_security_price_table = """CREATE TABLE IF NOT EXISTS security_price (
                        id integer,
                        date text NOT NULL,
                        open decimal NOT NULL,
                        high decimal NOT NULL,
                        low decimal NOT NULL,
                        close decimal NOT NULL,
                        volume integer,
                        adj_close decimal NOT NULL,
                        security_id integer,
                        PRIMARY KEY(id),
                        FOREIGN KEY (security_id) REFERENCES security (id)
                    );"""
    
    # create a database connection
    conn = create_connection(database)

    # create tables
    if conn is not None:
        # create exchange table
        create_table(conn, sql_create_exchange_table)

        # create company table
        create_table(conn, sql_create_company_table)
        
        # create security table
        create_table(conn, sql_create_security_table)
        
        # create security_price table
        create_table(conn, sql_create_security_price_table)
        
    else:
        print("Error! cannot create the database connection.")



main()
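As a quick sanity check that the `PRAGMA foreign_keys = 1` line in `create_connection` is doing its job, we can try inserting a price row that references a non-existent security and confirm SQLite rejects it. The sketch below uses a throwaway in-memory database with a cut-down two-table version of our schema, so it doesn’t touch the real file:

```python
import sqlite3

# Throwaway in-memory database with a simplified two-table schema
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("PRAGMA foreign_keys = 1")
demo_conn.execute("CREATE TABLE security (id integer PRIMARY KEY, ticker text)")
demo_conn.execute("""CREATE TABLE security_price (
                         id integer PRIMARY KEY,
                         close decimal,
                         security_id integer,
                         FOREIGN KEY (security_id) REFERENCES security (id)
                     )""")

demo_conn.execute("INSERT INTO security (id, ticker) VALUES (1, 'AAPL')")
# Valid insert - security 1 exists
demo_conn.execute("INSERT INTO security_price (close, security_id) VALUES (115.0, 1)")

try:
    # security_id 99 does not exist, so the pragma makes SQLite reject this row
    demo_conn.execute("INSERT INTO security_price (close, security_id) VALUES (50.0, 99)")
except sqlite3.IntegrityError as e:
    print(e)  # FOREIGN KEY constraint failed
```

Without the pragma, SQLite would silently accept the orphaned row, which is exactly the sort of quiet data corruption we want the database to protect us from.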

Now we move on to populating the table containing data regarding the various stock exchanges that we may find of interest or may want to include in our universe at some point. Luckily, there is a readily available source file containing all the data we need and more; various download formats can be found at https://www.iso20022.org/market-identifier-codes. I have used the download in csv format, as I find that to be the best format for our needs in this case. We are able to pass the relevant url directly into the pandas read_csv function, as shown in the code below.

First we import the necessary modules, then read in the data from the relevant csv file url. Next we extract the columns of interest and subsequently rename them so they are more suitable to be used as column names in an SQL database (usually the format used is all lower case and any spaces are replaced with underscores).

Lastly, we create a new column called ‘id’ – this will be the unique exchange id that will remain static and allow us to reference the exchanges in the future when running queries and extractions from the database. I have simply used the values in the index, as they are unique and work nicely as our exchange ids at this point.

import pandas as pd
import requests
import io
import numpy as np

exchange_data = pd.read_csv("https://www.iso20022.org/sites/default/files/ISO10383_MIC/ISO10383_MIC.csv", encoding='iso-8859-1' )

exchange_data = exchange_data[['ISO COUNTRY CODE (ISO 3166)', 'MIC', 'NAME-INSTITUTION DESCRIPTION', 'ACRONYM']]
exchange_data.rename(columns={'ISO COUNTRY CODE (ISO 3166)' : 'country_code', 
                              'MIC':'code', 
                              'NAME-INSTITUTION DESCRIPTION': 'name',
                              'ACRONYM':'acronym'}, inplace=True)

exchange_data['id'] = exchange_data.index

Now we quickly create a partial mapper to fill in our currency data – I am using the country_code column and have only filled out the mappings for the US (USD), GB (GBP) and Germany (EUR); if so inclined, feel free to fill out the rest of the data and complete the mapping.

mapper = {'US': 'USD', 'GB': 'GBP', 'DE': 'EUR'}

exchange_data['currency'] = exchange_data['country_code'].map(mapper)

exchange_data[['id', 'name', 'currency', 'code']]

which gives us a DataFrame that looks as follows:

Ok, let’s now insert this data into our database table – if everything has worked, it will look as in the image below!

conn = sqlite3.connect(os.path.join('..','data',db_name))

exchange_data[['id', 'name', 'currency', 'code']].to_sql("exchange", conn, if_exists="append", index=False)
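For anyone unfamiliar with to_sql, here is a self-contained round trip on a throwaway in-memory database showing the `if_exists="append"` behaviour (the table name and values are invented for illustration):

```python
import sqlite3
import pandas as pd

scratch_conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"id": [0, 1], "code": ["XNYS", "XNAS"]})

# The first call creates the table, subsequent calls append rows to it
df.to_sql("exchange_demo", scratch_conn, if_exists="append", index=False)
df.to_sql("exchange_demo", scratch_conn, if_exists="append", index=False)

out = pd.read_sql("SELECT COUNT(*) AS n FROM exchange_demo", scratch_conn)
print(out["n"].iloc[0])  # 4 - two appends of two rows each
```

Note that append will happily insert duplicate rows, which is one reason the static ‘id’ column we created matters once we start automating updates.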

Next we move on to scraping the tickers for the constituents of the S&P 500 and the Dow Jones Industrial Average. We will use these to first download some company data for each stock, such as the name of the company, the industry and sector within which it operates and, lastly, the location of the company HQ.

We can retrieve the necessary information from Wikipedia using Pandas and the read_html method – this will convert the relevant HTML table on the Wikipedia web page into a DataFrame which we can then use as we wish. I have renamed some of the columns in the code below so as to fit the naming convention of our pre-made database tables.

# scrape wiki table with symbols and details of Dow Jones constituents
dj_constituents = pd.read_html('https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average#Components')[1]
dj_constituents.rename(columns={'Company':'name', 'Industry': 'industry', 'Symbol': 'ticker'}, inplace=True)

this gets us the following:

We have to clean up some of the tickers by splitting off the “NYSE” part and the colon. The line of code below does that; it’s a little bit of a hack, but it works fine. The “.apply” uses a lambda function to first reverse the ticker string, then apply the “.partition” method, which allows us to deal with the fact that some of the tickers don’t have the ‘:’ separator in them. This returns a tuple of three values, of which we select the first, and finally we re-reverse the string back around the right way. The second line of code just strips any whitespace from around the ticker.

dj_constituents['ticker'] = dj_constituents['ticker'].apply(lambda x: x[::-1].partition(':')[0][::-1])

dj_constituents['ticker'] = dj_constituents['ticker'].str.strip()
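To see the reverse–partition–reverse trick in isolation, here is a small sketch applying the same logic to two hypothetical ticker strings – one with the exchange prefix and one without:

```python
def clean_ticker(x):
    # Reverse the string, partition on the first ':' encountered (i.e. the
    # last one in the original), keep the part before it, reverse back, strip
    return x[::-1].partition(':')[0][::-1].strip()

print(clean_ticker('NYSE: MMM'))  # MMM
print(clean_ticker('AAPL'))       # AAPL (no separator - returned unchanged)
```

The key point is that `.partition` never raises when the separator is absent – it just returns the whole string as the first element of the tuple, which is why the reversal dance handles both cases without any branching.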
# scrape wiki table with symbols and details of s&P500 constituents
sp_constituents = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')[0]
sp_constituents.rename(columns={'Symbol': 'ticker', 'Security':'name', 'Headquarters Location': 'hq_location',
                                'GICS Sector': 'sector', 'GICS Sub-Industry':'industry'},inplace=True)

and this gets us the table below:

Next I concatenate the two tables together into one, selecting only the relevant columns from each. Make sure to drop any duplicates in the table, as some stocks appear in both of our chosen indices!

company_table = pd.concat([sp_constituents[['name', 'industry', 'sector', 'hq_location']],dj_constituents[['name', 'industry']]]).reset_index(drop=True)
company_table.drop_duplicates(inplace=True)
company_table.sort_values('name', inplace=True)
company_table.reset_index(inplace=True, drop=True)
company_table['id'] = company_table.index

Below is the code to create the security table – similar in logic to the company table creation with the two constituent tables being concatenated together to form the final table.

# security_table
sp_security_table = sp_constituents[['ticker', 'name']].copy()
# sp_security_table['currency'] = 'USD'

dj_security_table = dj_constituents[['ticker', 'name']].copy()
# dj_security_table['currency'] = 'USD'

security_table = pd.concat([sp_security_table,dj_security_table]).reset_index(drop=True)
security_table.drop_duplicates(subset='ticker',inplace=True)
security_table.sort_values('ticker', inplace=True)
security_table.reset_index(inplace=True, drop=True)

security_table['id'] = security_table.index

Now we need to create the columns holding the correct company_id in the security table, and vice versa the correct security_id in the company table. In this instance, because the two tables are populated with data for exactly the same companies, the step may seem a little redundant – had the tables been sorted in the same order, the two id columns would be exactly the same. It is a vital step however, as we need to link the two tables together with their foreign keys so we may extract the correct data corresponding to the correct company whenever needed. You may wonder why we don’t just reference them using their ticker symbols; firstly, ticker symbols are not unique across exchanges – different exchanges assign the same letters to different companies – and it is also not that uncommon to see a company change its ticker symbol, whether due to a merger or takeover etc. So we need our own immutable ids which we know will remain constant no matter what.

The way I have done it is just to create a dictionary with key, value pairs being the name of the company and the corresponding id within its own table, which will be our foreign key in the second table.

Just to illustrate how it would work in a real-world scenario, when everything isn’t all perfectly lined up, I have sorted the two tables in different orders.

I start by populating the security table with the company_id:

company_id_mapper = pd.Series(company_table.id.values, index=company_table.name).to_dict()

security_table['company_id'] = security_table['name'].map(company_id_mapper)

and then I populate the company table with the security_id:

security_id_mapper = pd.Series(security_table.id.values, index=security_table.name).to_dict()

company_table['security_id'] = company_table['name'].map(security_id_mapper)

The next step is to populate the exchange_id column. For this I shall use information that can be found here: https://old.nasdaq.com/screening/company-list.aspx. I downloaded two csv files covering all the NYSE and NASDAQ traded stocks – I shall use this info to assign the correct exchange_id for each stock in my security table.

The csv files can be downloaded using these links in case the url isn’t stable:

nyse = pd.read_csv('nyse.csv')
nasdaq = pd.read_csv('nasdaq.csv')

security_table['exchange_id'] = None
security_table['exchange_id'] = np.where(security_table['ticker'].isin(nyse['Symbol'].values), 1300, security_table['exchange_id'])

security_table['exchange_id'] = np.where(security_table['ticker'].isin(nasdaq['Symbol'].values), 367, security_table['exchange_id'])

And upload the data into the database tables which will look as shown in the two screenshots below:

security_table.to_sql("security", conn, if_exists="append", index=False)
company_table.to_sql("company", conn, if_exists="append", index=False)

Now it’s time to download and store the price histories for each of our companies and tickers in the tables we have just created. The code below iterates through each of the tickers in the security table and downloads daily data from 2010-01-01 to 2020-08-31 (or from as far back as the price data goes for that stock, if it started trading after the start date). For this we will use pandas-datareader:

from tqdm.notebook import tqdm
import pandas_datareader.data as web

stock_pricing_dfs = []
for stock_id in tqdm(security_table['id']):
    try:
        stock_pricing_df = web.DataReader(security_table.iloc[stock_id]['ticker'],
                                          start='2010-1-1',
                                          end='2020-08-31',
                                          data_source='yahoo')
        stock_pricing_df['security_id'] = stock_id
        stock_pricing_dfs.append(stock_pricing_df)
    except Exception:
        # skip any tickers that fail to download (delisted, renamed etc.)
        pass

security_price_table = pd.concat(stock_pricing_dfs)

Now we rename the columns to adhere to the database column names and create an ‘id’ column, which at this stage can just be a direct copy of the index values. Once we get to the stage of updating the data, adding more history and possibly more stocks, we will need to be more careful and make sure that the ‘id’ values are all unique and correctly ordered etc.

Then we just insert the data into our database table and there you go! We have finished the process of adding our various stock data (for now); the final table should look similar to that shown in the image below:

security_price_table.columns = ['high', 'low', 'open', 'close', 'volume', 'adj_close', 'security_id']

security_price_table.reset_index(inplace=True)

security_price_table['id'] = security_price_table.index

security_price_table.to_sql("security_price", conn, if_exists="append", index=False)
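With everything loaded, the foreign keys start to pay off: we can pull a single stock’s history with a simple join. Below is a minimal sketch run against a tiny seeded in-memory database (the tickers, dates and prices are invented), but the query itself is exactly the shape we would run against the real file:

```python
import sqlite3
import pandas as pd

# Tiny seeded in-memory stand-in for the real database
mem_conn = sqlite3.connect(":memory:")
mem_conn.execute("CREATE TABLE security (id integer PRIMARY KEY, ticker text)")
mem_conn.execute("""CREATE TABLE security_price (
                        id integer PRIMARY KEY,
                        date text, close decimal, security_id integer)""")
mem_conn.execute("INSERT INTO security VALUES (0, 'AAPL'), (1, 'MSFT')")
mem_conn.executemany(
    "INSERT INTO security_price (date, close, security_id) VALUES (?, ?, ?)",
    [("2020-08-28", 124.8, 0), ("2020-08-31", 129.0, 0), ("2020-08-31", 225.5, 1)],
)

# Join security_price to security so we can query by ticker
query = """
    SELECT sp.date, sp.close
    FROM security_price sp
    JOIN security s ON sp.security_id = s.id
    WHERE s.ticker = ?
    ORDER BY sp.date
"""
aapl = pd.read_sql(query, mem_conn, params=("AAPL",))
print(aapl)
```

Because the join goes through security_id rather than the ticker text, the query keeps working even if a company later changes its ticker symbol.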

I’ll leave it here for now, as I think this is a good place to take a natural break in the post; when we return, we will be looking to augment the data with some spot FX currency data, along with perhaps some economic data series, and we will add to the equity-based data by scraping and storing some fundamental company data; balance sheets, cash flows and the like.

I might also at some point go about creating a “survivorship bias free” (mini) data set using the S&P constituents – i.e. I shall work back historically and store an S&P 500 constituent list for each time period, adapted and correctly accounting for all the additions and removals from the list of stocks making up the 500 at any point in time. This allows one to test strategies over data that doesn’t suffer from the survivorship bias introduced by just using the current day’s constituent list across the entire historic period.
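As a taster of how that might work, one minimal way to model point-in-time index membership is a table with one row per membership “spell”; the tickers and dates below are purely illustrative:

```python
import sqlite3

mem = sqlite3.connect(":memory:")
# One row per membership spell; end_date is NULL while the stock is still in the index
mem.execute("CREATE TABLE index_membership (ticker text, start_date text, end_date text)")
mem.executemany(
    "INSERT INTO index_membership VALUES (?, ?, ?)",
    [("AAPL", "1982-11-30", None),
     ("GE",   "1957-03-04", "2018-06-26"),
     ("TSLA", "2020-12-21", None)],
)

# Constituents as of a given date: joined on or before it, not yet removed
as_of = "2019-01-01"
rows = mem.execute(
    """SELECT ticker FROM index_membership
       WHERE start_date <= ? AND (end_date IS NULL OR end_date > ?)""",
    (as_of, as_of),
).fetchall()
print(sorted(r[0] for r in rows))  # ['AAPL']
```

Because ISO-formatted date strings sort lexicographically, the plain text comparisons in the WHERE clause behave like proper date comparisons, and backtests can then query the constituent list as it stood on any historic date.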

Ok thanks all for visiting, I shall follow up soon with part 2!


4 comments

Nobody Cares 24 October 2020 - 22:13

Well, I lost interest when you stated that company hq location was categorical data not numerical data. Clearly location can be coded as numerical data (latitude, longitude) and used numerically to calculate many items of interest.

s666 25 October 2020 - 17:40

I take your point that cities exist in some location, and that a location can indeed be represented in a format which allows calculations etc, and therefore that particular representation of location would indeed be classed as numerical data, yes. But the fact remains that a city name is a label and labels fall firmly in the categorical data category – so you’re not quite correct on this one I’m afraid 😉

Moe 31 October 2020 - 18:38

Thank you so much! As a fellow CFA looking to develop my python/programming skills for my career, this is all amazing. Looking forward to part 2.

Mark Atlantic 19 November 2020 - 13:28

Wow, your educational background is impressive! I mean really impressive. The article was amazing. Is there a particular book on the subject you can recommend? Is there a particular database you favor for financial data? I am just starting out but do not want to waste my time with defunct databases.

(While I doubt anyone else is confused by “NOBODY CARES” comments, let me say the author is 100% accurate. “NOBODY CARES” is referring to GPS coordinates, NOT city names and has confused the two. The author clearly knows this but wanted to keep the focus on the article and not some other topic).

