In a previous post we went through the rigorous modeling of foreign currency exchange returns to forecast the potential movement of the US Dollar. However, a key step we skipped in that discussion is how to actually start building a database that is fit for purpose, based on a specific objective. The previous example focused on FX; this time around, let’s shift to an arguably more profitable exchange: the US stock market. The ability to predict stock prices is a valuable skill that relies heavily on the quality and structure of the underlying data. This crash course aims to provide a roadmap for setting up a robust database to support predictive modeling, and much of it can be generalized to other use cases to a certain degree.

Before getting into the actual process, it is worth familiarizing ourselves with the different types of data modeling needed to align each stakeholder around the why, how, and what of your data model. Data models are generally divided into various categories, but for brevity we will focus on three: conceptual, logical, and physical data models. These three are usually sufficient to establish the utility that can be derived from a database built for a specific purpose, in this case a selection of stocks.

Conceptual Data Models

A conceptual data model is put together to visualize the concepts behind the process the data will be used for; as stated previously, our example is predictive modeling of stock prices. This starts by mapping out the types of information we’ll need to collect and how data from various sources may be merged and/or transformed for predictive modeling.

This type of data model helps with the initial design of the database structure, particularly aligning timestamps with stock price information (stock prices will initially be built as a time series before evolving into panel data as the database grows). This involves creating multiple tables, such as placing stock price data identified by ticker symbols into a ‘Returns Table’ and additional company details (e.g., industry type, key management) into a ‘Stocks Table’.
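To make that concrete, here is a rough sketch of the conceptual entities expressed as a plain Python structure. The table names follow the example above, while the specific fields and the one-to-many relationship are illustrative assumptions rather than a fixed design.

```python
# Illustrative conceptual model: entities, their identifiers, and how they relate.
conceptual_model = {
    "Stocks Table": {
        "identifier": "ticker",
        "attributes": ["company_name", "industry_type", "key_management"],
    },
    "Returns Table": {
        "identifier": ("ticker", "timestamp"),
        "attributes": ["open", "high", "low", "close", "volume", "daily_return"],
    },
    # One stock has many dated return observations: a time series per ticker,
    # and panel data across tickers as the database grows.
    "relationships": [("Stocks Table", "1-to-many", "Returns Table")],
}
```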

Logical Data Models

Once the general requirements are visualized in the conceptual data model, the next step is to define the elements and relationships in each data table it identified. The logical model formalizes the names and data attributes of specific entries in the database. Data architects use it to plan the implementation of the data management system, encompassing storing, defining, retrieving, and managing the database.
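A lightweight way to draft this is to write the entities down as typed records. The sketch below uses Python dataclasses; the attribute names and types are assumptions carried over from the conceptual sketch, with the ticker acting as the key that links the two tables.

```python
from dataclasses import dataclass
from datetime import date

# Illustrative logical model: entity names, attributes, and the linking key.
@dataclass
class Stock:
    ticker: str            # primary identifier
    company_name: str
    industry_type: str
    key_management: str

@dataclass
class DailyReturn:
    ticker: str            # references Stock.ticker
    trade_date: date       # together with ticker, uniquely identifies a row
    close_price: float
    daily_return: float
```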

Physical Data Models

The physical data model is the final type we’ll focus on, as it captures the technical requirements of the database and is a good starting point for creating a data dictionary. A key element of database management is knowing the types of data you have, as that will greatly shape what analysis can be gleaned from your compiled database.

This model delves into the specifics of whether a data point is a stock or flow variable, an integer or float. It details storage requirements and addresses data redundancy.
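A starter data dictionary can be as simple as a mapping from column name to physical type and description. The entries below are illustrative and assume the ‘Returns Table’ sketched earlier.

```python
# Hypothetical data dictionary for the Returns Table: physical type plus a
# one-line description per column.
returns_data_dictionary = {
    "ticker":       {"type": "TEXT",    "description": "Exchange ticker symbol (part of the key)"},
    "trade_date":   {"type": "DATE",    "description": "Trading day of the observation (part of the key)"},
    "close_price":  {"type": "REAL",    "description": "Closing price in USD (float, stock variable)"},
    "volume":       {"type": "INTEGER", "description": "Shares traded during the day (flow variable)"},
    "daily_return": {"type": "REAL",    "description": "Day-over-day percentage change in close_price"},
}
```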

Keep in mind that outlining these data models is simply meant to give a picture of the key steps required to build a database that is fit for purpose.

Another bonus of going through the data model types is that it covers the initial key steps in database design: understanding data requirements, applying database design principles, and setting up the database.

Building each of the above provides the information necessary to gauge the initial requirements of your database and can guide you in choosing the appropriate database management system (DBMS) for your desired outcomes. For at-home projects that would typically mean spreadsheet tools such as Google Sheets or Microsoft Excel, but projects handling massive amounts of data (like predictive modeling for high-frequency trading) call for tools such as SQL and Python.
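For a small self-hosted project, Python’s built-in sqlite3 module is often enough to stand the database up. The snippet below is a minimal sketch, assuming the illustrative table and column names from the earlier models.

```python
import sqlite3

# Create a local SQLite file and the two tables from the conceptual sketch.
conn = sqlite3.connect("stocks.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS stocks (
    ticker        TEXT PRIMARY KEY,
    company_name  TEXT,
    industry_type TEXT
);
CREATE TABLE IF NOT EXISTS returns (
    ticker       TEXT NOT NULL REFERENCES stocks(ticker),
    trade_date   DATE NOT NULL,
    close_price  REAL,
    volume       INTEGER,
    daily_return REAL,
    PRIMARY KEY (ticker, trade_date)
);
""")
conn.commit()
conn.close()
```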

Once the technical requirements of the database are established, the next step for stock price data is to address time series considerations. Given the temporal nature of the data, handling the time series dimension effectively is paramount.

Raw data, even straight from the source, is mostly unforgiving when it comes to mapping specific data points to timestamps. For those combining several sources, a key pain point is the treatment of missing data points in the time series. Depending on the type of data, the data architect can opt to fill in the gaps using interpolation or extrapolation methods, substitute the mean or median, or simply leave them blank. The right treatment will of course vary depending on the type of mathematical model the data will feed when making stock price predictions.
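As an illustration of those options, the pandas sketch below fills gaps in a small made-up daily price series in a few different ways; the figures and dates are invented purely for demonstration, and which option is appropriate depends on the downstream model.

```python
import pandas as pd

# A tiny daily close-price series with two missing observations.
prices = pd.Series(
    [101.2, None, 103.5, None, 104.1],
    index=pd.date_range("2024-01-01", periods=5, freq="B"),
    name="close_price",
)

filled_forward = prices.ffill()                  # carry the last known price forward
filled_linear  = prices.interpolate("linear")    # straight-line estimate between known points
filled_median  = prices.fillna(prices.median())  # substitute the series median
left_as_is     = prices                          # or simply leave the gaps as NaN
```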

Once the time series considerations are taken into account, the database is ready for use and testing (types of testing were discussed in this previous post and apply to stock returns as well). Afterwards, you may opt to explore integration with analytical tools such as R, Python, or other specialized finance modeling platforms.

Data models serve as practical blueprints, preventing costly mistakes in data infrastructure by ensuring upfront consideration of data needs, relationships, architecture, and project viability. Beyond providing technical clarity, they are essential for data governance and legal compliance, ensuring data integrity. Establishing standards from the beginning prevents conflicting datasets and the need for extensive cleanup efforts. In essence, data models ensure a straightforward transition from ideas to usable insights.
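To close, here is a minimal sketch of that integration step: pulling one ticker’s series out of the SQLite file built earlier into pandas for modeling. It assumes the illustrative schema from above, and the ticker is purely hypothetical.

```python
import sqlite3
import pandas as pd

# Hand the stored returns off to an analytical tool (pandas in this case).
conn = sqlite3.connect("stocks.db")
returns = pd.read_sql_query(
    "SELECT trade_date, close_price, daily_return FROM returns WHERE ticker = ?",
    conn,
    params=("AAPL",),               # illustrative ticker
    parse_dates=["trade_date"],
)
conn.close()
print(returns.head())
```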