Josh Schertz

Database Big League

Having spent the past year using SQLite for the pySecMaster's database engine, I feel that it is time to upgrade to a more powerful database engine. I want to move the pySecMaster onto a database engine that can be highly customized along with having the ability to scale with a massive amount of data.

Current Structure

Currently, I have the pySecMaster building a SQLite database for the price data. This is very convenient because it builds the database as an independent object in a specified location. I can then move these databases around simply by copy and pasting them to different locations (to USB drives, to offsite servers, etc.). This structure is great for small databases, but starts to breakdown when databases get over a reasonable size (about 10GB).

Well, the Quandl WIKI and Quandl GOOG ETF database with daily data is around 11GB. If you want to take advantage of the source cross validation code, that requires downloading data for multiple sources. So for three data sources (Quandl WIKI, Quandl GOOG and Yahoo Finance), the size of the daily prices database gets to around 22GB. This is before the cross validated prices are stored, which would add an additional 10GB to the database. We would now be dealing with a SQLite database around 30GB in size. Now, say you want to also store minute data for main US stocks and ETFs in this database. Within a year you'll be dealing with a SQLite database of around 90 GB in size! And forget it if you want to also store tick, options or fundamental data.

A single object this large becomes unfeasible to work with, especially across multiple machines. It is time to consider a more powerful database engine. For the most part, the data the pySecMaster is dealing with is structured, so I don't think it makes sense to jump to NoSQL or MongoDB just yet. Also, I want to stick with an open source database engine, so that immediately cuts out any Microsoft or Oracle variant. This is setting the stage for PostgreSQL, which is a powerful open source SQL database that can be highly customized to each environment; just what I'm looking for.

Postgres to the rescue

I will be using the psycopg2 Python library to connect the Python code to the PostgreSQL database. Installing this library on my Windows 10 Python 3.5 environment was no easy task, however. It is unfortunate that Continuum's Anaconda doesn't include it by default, and the "conda install psycopg2" command was not able to successfully install it. Furthermore, "pip install psycopg2" was also unable to install it. Fortunately, Christoph Gohlke at UC Irvine provides it in a wheel package for easy installation in a Windows Python 3.5 environment.

Converting the existing SQL execution strings from SQLite to PostgreSQL compliant was relatively painless, albeit there were a few instances where the SQLite compatible command was too loose for PostgreSQL. The largest offender involved foreign keys between tables. To be SQL compliant, all foreign keys must reference unique columns (by indicating that a column is a primary key or has a unique constraint). Amazingly, the current pySecMaster code was able to create many foreign keys to non-unique columns. SqliteBrowser indicates that the foreign keys were created, but I don't know if they were ever used. I think it is concerning that SQLite has this lack of transparency. When I tried to create these same foreign keys with PostgreSQL, the command was rejected.

The foreign key issue basically revolves around the source and source_id columns of the symbology table. The tricky part was that with the current database design, it is impossible for the source_id column to be unique. But, it is possible to make the combination of the source and source_id columns unique. I therefore created a unique index of the two columns, which was able to handle the foreign key unique constraint. Using two columns to indicate uniqueness required additional changes to how every other foreign key constraint operated, as each one had to link two respective columns to these referenced columns. Since the SQL structure does not allow for static values to be used in a foreign key constraint, I had to create a new source column for each table to had this symbology foreign key. For most of the tables, I will put a static 'tsid' value in this source column (as most tables utilize a tsid to indicate uniqueness). A good benefit of using the source/source_id structure is that it will allow for the baskets table to include symbology values from multiple sources, not just tsids. This would allow for a user to provide a list of Yahoo Finance tickers and have the system recognize the source and be able to convert them to tsid values.

At this point in time, I don't know how the performance of PostgreSQL will perform against SQLite, but I'm expecting that the write concurrency will be perfect for a majority of the pySecMaster's commands. I also want to mention that I am excited about the remote host capabilities of PostgreSQL databases, along with the backup/replication abilities. I know that these will make working with the database much easier than passing a single, massive database object around. I'll make sure to write a new post when I have more experience with these changes.