Data Storage Formatting

I have been fighting an internal war. The question: in which format should data be stored? Twenty years ago, the question was fairly simple. Much data was moving from proprietary storage systems to relational database management systems. That was a great move, or so I thought at the time. Everything was sorted. Simple queries were simple. Hard queries were hard. Queries that needed to be performant could be tuned.

But data came in different formats, and had to be parsed. What could easily be parsed was placed in columns. The big stuff was placed in blobs. Specialized parsers could then be built to look inside the blobs when needed.
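As a rough illustration of that columns-plus-blob split, here is a minimal sketch using Python's built-in sqlite3 module. The table name, the column names, and the JSON payload are all assumptions made up for the example, not a description of any real system.

```python
import json
import sqlite3


def store_record(conn, raw: bytes) -> None:
    """Put the easily parsed fields in columns; keep the full payload as a blob."""
    doc = json.loads(raw)
    conn.execute(
        "INSERT INTO records (symbol, received_at, payload) VALUES (?, ?, ?)",
        (doc.get("symbol"), doc.get("received_at"), raw),
    )


def parse_payload(conn, record_id: int) -> dict:
    """A 'specialized parser': decode the blob only when someone asks for it."""
    row = conn.execute(
        "SELECT payload FROM records WHERE id = ?", (record_id,)
    ).fetchone()
    return json.loads(row[0])


# Hypothetical schema and sample message, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "id INTEGER PRIMARY KEY, symbol TEXT, received_at TEXT, payload BLOB)"
)
store_record(
    conn,
    b'{"symbol": "XYZ", "received_at": "2024-01-02T09:30:00",'
    b' "extra": {"note": "unparsed detail"}}',
)
```

The indexed columns serve the simple queries, and the blob is only decoded when a question needs the detail that was never broken out.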

There were plenty of issues to resolve. How do you keep multiple databases in sync? How do you handle changes in the incoming metadata? What do you do with old, rarely used, but perhaps important data? The questions kept coming.

Those questions and many more have answers that are not easy. And like most things in life, the answer is often “it depends.”

The current situation is the need to pore over mountains of different data, looking for specific things that happened during a specific period.

Time-Series Market Data

Sample frequency is an issue here. The typical OHLC (open, high, low, close) data that comes from the financial markets is in a recognizable format, easily parsed, placed in rows and columns, and indexed. With management systems like kdb+, you can capture data at a fast sample rate, or even down to the tick, and still calculate summaries quickly.
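To make the tick-versus-summary trade-off concrete, here is a small sketch in plain Python (not kdb+ syntax) that rolls ticks up into one-minute OHLC bars. The function name, the tick layout of (timestamp, price) pairs, and the sample prices are assumptions for illustration.

```python
from datetime import datetime


def ticks_to_minute_bars(ticks):
    """Roll (timestamp, price) ticks, sorted by time, into one-minute OHLC bars."""
    bars = {}
    for ts, price in ticks:
        # Truncate the timestamp to the start of its minute to find the bucket.
        minute = ts.replace(second=0, microsecond=0)
        bar = bars.get(minute)
        if bar is None:
            # First tick in this minute sets all four values.
            bars[minute] = {"open": price, "high": price, "low": price, "close": price}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price  # the last tick seen so far is the close
    return bars


# Made-up ticks, for illustration only.
ticks = [
    (datetime(2024, 1, 2, 9, 30, 1), 100.0),
    (datetime(2024, 1, 2, 9, 30, 20), 101.5),
    (datetime(2024, 1, 2, 9, 30, 45), 99.5),
    (datetime(2024, 1, 2, 9, 31, 5), 100.5),
]
bars = ticks_to_minute_bars(ticks)
```

Storing the ticks keeps every option open, while storing only the bars is far smaller but loses whatever happened inside the minute. Which one to keep is, again, a matter of "it depends."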

Report Filings

Is a typical RDBMS a good fit for financial reports, such as what comes from EDGAR? Not so much. A “big data” system may be a better fit, or perhaps an API that knows how to parse directly from EDGAR itself.

Market News

Sentiment is very difficult to quantify. The “mood” of the market may be extrapolated after the fact by certain indicators, but how is the oil sector affected when good Tesla numbers come out the same day as a bad US Manufacturing report? How do you build that query?

Summary

There are many more categories and sub-categories. But the basic result is: there are many factors to consider when looking at data. Try to pick the right approach, and be prepared to change it if it doesn’t work. Don’t be afraid to use more than one solution at the same time. Avoid complexity, but remember that not all systems are small. Consider microservices. Keep an open mind, but avoid “analysis paralysis.”

Once again, “it depends.”