While the explosion of noSQL database offerings of late can be daunting, each of them is typically suited for a particular purpose. Most CRUD web-applications can be comfortably done with either noSQL or a RDBMS, however for true high performance applications, the choice of database is of make-or-break importance.
In this post I'll explain what we typically want from a database to store historical financial data, from both a technical and operational viewpoint.
Query Flexibility is largely unimportant. We will always be querying for a range of ticks over some time window, and we are primarily interested in two types of query. When running a backtest, we want to know all events that happen in an exchange over a timeframe, and when live-trading we want to be able to pull up a range of events for a particular symbol that we're interested in, for example to quickly calculate technical indicators.
Query Speed: Exceptionally important. The most common use for our database will be pulling up a range of data. We know the queries we want to run in advance, so we can optimise for those cases. This is a classic noSQL tradeoff, where you often need to know the format of your queries before you design your tables.
Distributed by Design: This is probably one of the most important points. The server for live trading will not be hosted locally -- it's much better to be sitting on a VPS closer to the data sources, for both network latency and uptime reliability. This presents a problem because we need the data available locally for backtesting. Our chosen database must be synchronised or replicated automatically.
Timeseries Resolution: Obviously dealing with tick data, we want high resolution -- at least milliseconds. Microseconds and lower would be important at a "real" HFT shop, but I'm only interested in minutes-to-hours timeframes.
Summary: We're looking for blazing fast reads, reasonably quick writes, and we're not too fussed about losing query flexibility to pick up speed. We definitely need a database that can be replicated or synchronised between a local and remote database easily.
Where does that leave us?
There are only a handful of contenders in this space. Relational databases aren't well suited as they were built for completely different tasks. We will almost never need the capability for "joins", and hardly relational lookups. We know our lookups, and our data can be stored contiguously on disk, so we should make the most of that benefit by using a database designed for that kind of usage.
kdb+ is the kingpin of this space, because of speed, speed and speed. There are two primary features that allow this to happen. It is a column store database and makes large use of in-memory caching. Financial timeseries data is stored very neatly on disk, which allows the user to make full use of column storage. It has effectively been designed ground up to work beautifully with the exact data we have.
Sadly, kdb+ is out of the question for, well, basically everyone non-institutional. If you need to ask how much it costs, you can't afford it. They do have a non-commercial license available for their 32bit system, but there's no point in limiting ourselves to 32 bits of addressable memory.
Architecturally, Cassandra shares many similarities that make it quite an attractive option. Working out how to structure our data and queries to leverage what Cassandra can provide does require some lower level understanding of how it is built.
In short, there are a few basics to understand in order to get the most value from Cassandra:
- Cassandra is a highly distributed mix between a key-value store and a column database.
- Rows in Cassandra are indexed by a primary key, which can be a composite key made up of values for an entry.
- Each row can (sort of*) contain two billion columns.
- The primary key will determine where on the disk the data is written.
- Cassandra Query Language (CQL) gives an impression of similarity with SQL, but don't be tricked into thinking that.
- You are strongly encouraged not to perform inefficient queries. Cassandra will literally fail and tell you to design your data model better if you try to do something bad. You can still force it, but it's best to sit back and think for a further 10 minutes.
- Tables are made to only serve a single query. To query on different criteria or using different ordering fields, extra Tables or Materialized Views must be created for those queries.
I'll walk through the basic queries that I wanted my system to support and show what a basic starting point for storing either end-of-day bar data, or intraday bars, can look like.
A particularly neat way of sharing your data between a local and remote server (for backtest & live trading), is to make use of a NetworkTopologyStrategy in your Keyspace.
Locally, I have several dedicated SSD drives set up as a virtual cluster. This is done by launching one Cassandra Docker image per drive, with that drive mounted as the data volume into the container.
This virtual cluster is treated as a Data Center, and allows for all reads to be completed locally. Data Center Replication can then be set up so the live cluster will mirror the local data. Live market events are fed into the live cluster, which are propagated through to the local virtual cluster through replication.
This example schema shows a good starting point for storing bar data. It can be easily modified to suit tick data, but would need more work to store full order-book information.
CREATE TABLE finance.bars ( exchange text, year text, bar_length text, end_timestamp timestamp, ticker text, adj_close bigint, close bigint, high bigint, low bigint, open bigint, volume int, PRIMARY KEY ((exchange, year, bar_length), end_timestamp, ticker) ) WITH CLUSTERING ORDER BY (end_timestamp ASC, ticker ASC)
To select all daily bars on the ASX for 2015-08-24, we'd do:
SELECT * FROM bars WHERE exchange='ASX' AND year='2015' AND bar_length='24h' AND end_timestamp='2015-08-24 16:00:00';
To get all daily bars on the ASX from 2010 until now, we'd do:
SELECT * FROM bars WHERE exchange='ASX' AND year IN('2010', '2011', '2012', '2013', '2014', '2015', '2016') AND bar_length='24h';
Now of course, you probably want to query this table for a particular stock, say,
CBA. Not so fast! Remember how data is stored contiguously on disk in Cassandra? Doing this would require our read query to 'skip over' every bar that is not CBA. Cassandra will fail with an error, telling us not to do inefficient queries.
So, we create a materialized view, feature released in Cassandra 3. This view will be a copy of the previous table, but with data stored on disk in a way that allows us to perform very fast reads.
CREATE MATERIALIZED VIEW finance.bars_by_ticker AS SELECT * FROM finance.bars WHERE exchange IS NOT NULL AND year IS NOT NULL AND bar_length IS NOT NULL AND ticker IS NOT NULL AND end_timestamp IS NOT NULL PRIMARY KEY ((exchange, year, bar_length), ticker, end_timestamp) WITH CLUSTERING ORDER BY (ticker ASC, end_timestamp ASC)
This materialized view allows us to query a particular ticker. You can't perform ticker queries on the original schema because the composite primary key requires all preceding values to be filled, and
ticker is the last value in the key. Now, we can do:
SELECT * FROM bars_by_ticker WHERE exchange='ASX' AND year='2011' AND bar_length='24h' AND ticker='CBA);
Should we need to pull a
range of data to calculate some kind of technical indicator, or to do a backtest on a particular month, we may do:
SELECT * FROM bars_by_ticker WHERE exchange='ASX' AND year='2011' AND bar_length='24h' AND end_timestamp>'2011-01-01' AND end_timestamp<'2011-06-01');
By now, we have a pretty clear understanding of how, and why, a column-oriented database is able to perform well in this particular application. While initially very confusing, Cassandra has quite an elegant design, with rewarding features and "a-ha!" moments as you begin using it more and more.
*: Why do we have
year as part of the key?
Remember the point about a Cassandra partition key being able to have 2 billion cells? Before designing your tables, do a quick back-of-the-envelope calculation to understand how many items you will be storing in a partition key.
PRIMARY KEY((one, two), three, four), the partition key will be
(one, two). Note our
bar examples uses
PRIMARY KEY ((exchange, year, bar_length), end_timestamp, ticker), hence a partition key of
(exchange, year, bar_length). We are allowed a maximum of two billion cells to be stored under this partition key.
If we didn't store
year, our partition key would be only
(exchange, bar_length). If we stored minute bars for 60 minutes in 6 hours in 250 trading days across 3000 instruments over 8 years, we'd hit our 2 billion cell limit. This limit would be hit relatively quickly if storing multiple ticks per second, so it's useful to partition the table by either year, (or year-month for high frequency data)