# Part 2: Deep Learning and Long-Term Investing, Structuring the Data

## The Setup (Revisited)

In Part 1 of this series we discussed the background and problem setup for how one can apply deep learning to predicting whether a stock will outperform the median performance of all stocks over a one-year period. To make this prediction, we feed the model historical company fundamental and price data. By fundamental data we mean information that can be found in a company’s financial statements. Because we use a recurrent neural network (RNN), on each time step (month) the model can make a prediction using (if needed) all of the historical price and fundamental data up until that time. In Part 1 we used the following diagram to visualize this setup:

In the above, the model uses the data in the columns “Fundamental Data at Time t” (called training inputs) to predict the outcome in the column “Outcomes at Time t+12” (called training targets). Here, “+1” means the company outperformed the median performance for the period t to t+12, and “-1” means it did not. The actual output of the model is a probability that the outcome will be +1.
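The target construction described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the function name and tickers are hypothetical, and treating a return exactly equal to the median as "-1" is an assumption the text does not settle.

```python
from statistics import median

def outperformance_labels(forward_returns):
    """Label each company +1 or -1 against the cross-sectional median.

    forward_returns maps a (hypothetical) ticker to its return over
    the period t to t+12. Assumption: a return exactly equal to the
    median is labeled -1.
    """
    med = median(forward_returns.values())
    return {name: (1 if r > med else -1) for name, r in forward_returns.items()}

# Made-up forward returns for one month t.
forward_returns = {"AAA": 0.25, "BBB": -0.10, "CCC": 0.05, "DDD": 0.40}
labels = outperformance_labels(forward_returns)
```

Because the labels are defined against the median, roughly half of the companies in any month receive +1, which keeps the two classes balanced.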

## The Data Universe

The data universe we consider includes any stock that has been public for at least 36 months and traded on the NYSE, NASDAQ, or AMEX exchanges between January 1970 and December 2015. However, we exclude non-US-based companies, companies in the financial sector, and companies whose market capitalization, when adjusted by the S&P 500 index level to January 2010, is less than 100 million US dollars.
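One way to read the size filter is as a rescaling of each month's market capitalization by the ratio of the index level in January 2010 to the index level in that month. The sketch below assumes that reading. The function names and the index levels are hypothetical placeholders, not values from the dataset.

```python
MIN_ADJUSTED_CAP = 100e6  # 100 million US dollars, per the text

def adjusted_market_cap(market_cap, index_level, index_level_jan_2010):
    """Rescale a market cap to January 2010 terms using an index ratio.

    Assumption: "adjusted by the S&P 500" means multiplying by
    index_level_jan_2010 / index_level for the month in question.
    """
    return market_cap * index_level_jan_2010 / index_level

def passes_size_filter(market_cap, index_level, index_level_jan_2010):
    """Keep a company-month only if its adjusted cap reaches the threshold."""
    adjusted = adjusted_market_cap(market_cap, index_level, index_level_jan_2010)
    return adjusted >= MIN_ADJUSTED_CAP

# Placeholder index levels, chosen only to make the arithmetic easy to follow.
INDEX_THEN, INDEX_JAN_2010 = 100.0, 1000.0

keep_small = passes_size_filter(20e6, INDEX_THEN, INDEX_JAN_2010)  # 200M adjusted
keep_tiny = passes_size_filter(5e6, INDEX_THEN, INDEX_JAN_2010)    # 50M adjusted
```

Scaling by an index in this way keeps the threshold comparable across decades, so a company that was mid-sized in the 1970s is not dropped simply because nominal valuations were lower then.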

It is common practice in quantitative studies to exclude financials. Researchers generally give the rationale that balance sheet leverage at a financial company has a very different meaning than at an operating company. Whether or not that is a good reason, we do it here for comparability to existing research.

The dataset spans forty-five years (540 months) from 1970 to 2015. For each month, there are approximately 1,300 to 5,000 companies. Because many companies have come and gone over the last forty-five years, the entire dataset actually represents approximately 10,000 individual companies.

## Structuring the Data for Deep and Recurrent Neural Networks

In the above table, each row in the data represents a step in the historical sequence of a company’s evolution, with both inputs and target outputs for training the model. But what specific inputs should we use and how should they be represented?

In a typical quantitative investment project, this is the stage where factor (feature) engineering begins. We might investigate ratios that are derived from fundamental data, such as price-to-earnings, book-to-market, return-on-equity, return-on-invested-capital, and debt-to-equity. We might explore factors others have shown to work, factors we believe will work (say, for economic reasons), and perhaps what successful investors and analysts think is predictive. We might even employ “feature selection algorithms” to empirically test which factors have the most predictive power.

With this project, using deep learning, we took a different approach. One of the appealing qualities of deep learning models is their ability to discover successful features directly from raw data. That is, instead of training a model on engineered ratios like price-to-book, price-to-earnings, etc., we simply provide the model with earnings, prices, book values, and other fundamental measures, and allow the model to figure out how to combine these measures mathematically in a way that produces the best result. This approach has two advantages: (1) we don’t bias the model toward the factors we happen to choose; and (2) the deep learning model may discover features that we would never otherwise consider.

In addition, by using recurrent neural networks, we allow the learning process to discover the time horizon for which various pieces of company fundamentals are most relevant. As an example, consider the concept of return-on-equity. With a more traditional approach, we might choose a factor that is defined by trailing twelve-month operating income (as reported in the income statement) divided by total shareholders’ equity (as reported on the balance sheet). However, what if it turned out that what is really indicative of a company’s inherent value is not just last year’s return-on-equity but the average return-on-equity over several years and/or the consistency of return-on-equity over a different period of years? In the past, we would have had to construct hundreds of different factors, viewed over different time periods, to attempt to answer this type of question.  Now, with deep learning, we can approach these questions more directly.

## The Source Data, Preprocessing, and Normalization

Despite the discussion above, it should be made clear that we do not feed the deep learning models the values of income statement and balance sheet items exactly as they are found in a company’s public filings. We still do a lot of preprocessing and data normalization to make companies’ fundamentals easily digestible by the learning process and to prevent the model from memorizing the profiles of specific companies during training.

The source of our data for this project is Standard & Poor’s Compustat database. From that, we selected the following source fields.

We then pre-processed the source data fields in a series of steps to generate the model input features. There are five categories of model input features: momentum features, valuation features, normalized fundamental features, year-over-year change in fundamental features, and missing value indicator features.

## Momentum Features

Because we are attempting to predict the relative performance of a stock, it seems reasonable to provide the model with the relative past performance of the stock over varying time intervals. To do this, we calculate the percentile ranking, among all companies within the same month (time-step), of the trailing 1-, 3-, 6-, and 9-month stock price change adjusted for splits. This type of feature is typically referred to as “momentum.”
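The momentum calculation can be sketched as follows. This is an illustration under stated assumptions: the tickers and prices are made up, the helper names are hypothetical, and the percentile rank is defined here as the share of peers with a value less than or equal to the company's own, which is only one of several common conventions.

```python
HORIZONS = (1, 3, 6, 9)  # trailing windows in months, per the text

def trailing_return(prices, months):
    """Price change over the trailing `months` months, or None if history is too short.

    `prices` is a list of month-end, split-adjusted prices in time order.
    """
    if len(prices) <= months:
        return None
    return prices[-1] / prices[-1 - months] - 1.0

def percentile_rank(values):
    """Share of non-missing peers whose value is <= each entry; None stays None."""
    present = [v for v in values if v is not None]
    return [
        None if v is None else sum(p <= v for p in present) / len(present)
        for v in values
    ]

# Hypothetical tickers with ten months of made-up prices each.
prices = {
    "AAA": [10.0] * 9 + [12.0],
    "BBB": [20.0] * 9 + [19.0],
    "CCC": [5.0] * 9 + [5.5],
}

names = list(prices)
momentum = {}
for k in HORIZONS:
    returns = [trailing_return(prices[n], k) for n in names]
    momentum[k] = dict(zip(names, percentile_rank(returns)))
```

Here `momentum[k]` maps each ticker to its rank for the `k`-month window within a single month. In a full pipeline the same ranking would be repeated independently for every month in the dataset.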

## Valuation Features

Again, because we are attempting to learn to predict the relative performance of a stock, it also seems reasonable to provide the model with the relative valuation of the stock as input. For this, we use two very common valuation metrics: Book-to-Market and Earnings Yield (which is the reciprocal of Price-to-Earnings). The raw forms of these features are calculated as follows:

$$\textrm{Book to Market} = \frac{\textrm{Shareholders' Equity}}{\textrm{Market Capitalization}}$$

and

$$\textrm{Earnings Yield} = \frac{\textrm{Operating Income}}{\textrm{Enterprise Value}} .$$

From these two derived values we then compute their respective relative percentile rankings and use these values, along with the raw values, as feature inputs to the model during training.
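A minimal sketch of the two valuation features, raw and ranked, is given below. The company records and field names are hypothetical, and returning `None` when a denominator is zero is an assumption about how an uncomputable ratio might be represented before missing values are handled.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None if it cannot be computed."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator

def percentile_rank(values):
    """Share of non-missing peers whose value is <= each entry; None stays None."""
    present = [v for v in values if v is not None]
    return [
        None if v is None else sum(p <= v for p in present) / len(present)
        for v in values
    ]

# Made-up figures for three hypothetical companies in one month.
companies = {
    "AAA": {"equity": 50.0, "market_cap": 100.0, "op_income": 10.0, "ev": 125.0},
    "BBB": {"equity": 30.0, "market_cap": 120.0, "op_income": 30.0, "ev": 150.0},
    "CCC": {"equity": 80.0, "market_cap": 100.0, "op_income": 5.0, "ev": 0.0},
}

names = list(companies)
book_to_market = [safe_ratio(companies[n]["equity"], companies[n]["market_cap"]) for n in names]
earnings_yield = [safe_ratio(companies[n]["op_income"], companies[n]["ev"]) for n in names]

features = {
    n: {
        "book_to_market": bm,
        "book_to_market_rank": bm_rank,
        "earnings_yield": ey,
        "earnings_yield_rank": ey_rank,
    }
    for n, bm, bm_rank, ey, ey_rank in zip(
        names,
        book_to_market,
        percentile_rank(book_to_market),
        earnings_yield,
        percentile_rank(earnings_yield),
    )
}
```

The hypothetical company "CCC" has zero enterprise value, so its earnings yield and rank are left as `None` and it is excluded from the ranking of its peers.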

## Normalized Fundamental Features

As mentioned above, we don’t feed the fundamental fields directly, as reported in financial statements, to the model. Instead we normalize them. There are two reasons for this. The first is that gradient-based neural network training simply behaves better when inputs are on comparable, bounded scales rather than spanning many orders of magnitude. The second reason is related to overfitting and generalization.

To illustrate, take the problem of attempting to predict someone’s height from their Social Security number (SSN). Sounds impossible, right? Learning algorithms, however, are very single-minded, in that they simply want to minimize the prediction error during training. Since a learning algorithm would have access to both the SSNs and the associated heights, it could take the strategy of building a table that maps the training data SSNs to heights, effectively memorizing the data, and it would achieve 100% accuracy on the training data. Of course, the model would do terribly on the testing data, and our project (which was doomed from the start) would prove futile.

How does this example relate to the potential perils of using company fundamental data? A given balance sheet or income statement for a company gives a very specific fingerprint (like an SSN) of a company, and many of the items don’t change much from quarter to quarter or even from year to year. The learning algorithm could store these fingerprints and use them to separate good outcomes from bad ones in the training data. Again, though, a model trained on this data would not perform well on new, unseen data.

The goal, then, is to normalize the data so that the specific identities of the companies are removed but the relative proportions of the fundamental items are retained. In that way, the neural network is still able to internally compose arbitrary ratios such as debt-to-equity or equity-to-assets as it sees fit. To do this, for each fundamental data item we create a normalized version, which is the item divided by the L2 norm of all the fundamental items. Because every item is divided by the same quantity, the ratio between any two items is unchanged.
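The effect of this normalization can be seen in a small sketch. The field names and figures are hypothetical, and the handling of an all-zero record is an assumption added only to avoid division by zero.

```python
import math

def l2_normalize(items):
    """Divide every fundamental item by the L2 norm of all items.

    `items` maps a field name to its reported value for one company-month.
    Assumption: an all-zero record is returned as all zeros.
    """
    norm = math.sqrt(sum(v * v for v in items.values()))
    if norm == 0:
        return {name: 0.0 for name in items}
    return {name: v / norm for name, v in items.items()}

# Two made-up companies with the same proportions but very different sizes.
small = {"total_assets": 300.0, "total_debt": 400.0}
large = {"total_assets": 300_000.0, "total_debt": 400_000.0}

small_normalized = l2_normalize(small)
large_normalized = l2_normalize(large)
```

Both records map to the same normalized vector, so absolute size no longer distinguishes the two companies, while a ratio such as `total_debt / total_assets` is the same before and after normalization.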

## Year-over-Year Change in Fundamental Features

In addition to all the features above, we provide the models with the year-over-year “log” change in value — i.e., log[v(t)/v(t-1)] — for balance sheet and income statement items that do not take on a negative value. We use logarithms here to ensure that outlier changes (very large changes) don’t have a disproportionate impact on the factor values (similar to the way we normalize above for the purpose of making the learning algorithm behave better).
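A short sketch of the log change is given below. The function name is hypothetical, and returning `None` for zero or missing inputs is an assumption, since the logarithm is undefined there and the text only specifies that the items are non-negative.

```python
import math

def log_change(current, prior):
    """Year-over-year log change, log(current / prior).

    Assumption: the result is treated as missing (None) when either
    value is missing or not strictly positive.
    """
    if current is None or prior is None or current <= 0 or prior <= 0:
        return None
    return math.log(current / prior)

modest = log_change(110.0, 100.0)    # a 10% increase
extreme = log_change(1000.0, 100.0)  # a tenfold increase
undefined = log_change(50.0, 0.0)    # prior value of zero
```

The tenfold increase is a 900% simple change, yet its log change is only about 2.3, compared with roughly 0.095 for the 10% increase. That compression of large moves is what keeps outliers from dominating the feature.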

## Missing Value Indicator Features

Missing values in the data are an important problem that must be addressed when executing a machine learning project. For our purposes, missing values in the data might stem from, among other reasons, an unreported item, a fiscal year change that prevents the creation of a trailing twelve-month sequence, a data collection issue, or a division by zero when a ratio is computed.

We take a similar approach to the one used here, where it proved very successful in an experimental setup where RNNs are used for binary classification (as we are doing here). In particular, for every input feature described in the prior sections, there is a corresponding binary indicator feature that is equal to 1 if the feature’s value is missing for a given company-month; otherwise it is equal to 0. In addition, if the value is missing but it is not missing in the prior time-step (for the same company), then the prior time step value is pulled forward into the current time-step. However, the missing value indicator field value is still set to 1. If no value can be pulled forward, the missing value is set to zero.
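The indicator and pull-forward rules can be sketched for a single feature of a single company. This is an illustration rather than the project's code: the function name is hypothetical, and carrying the last observed value across several consecutive missing time-steps is an assumption, since the text only describes the case where the immediately prior value is present.

```python
def fill_with_indicator(series):
    """Fill missing values and build the matching indicator sequence.

    `series` holds one feature for one company in time order, with
    None marking a missing value. Returns (filled, indicator), where
    indicator is 1 wherever the original value was missing.

    Assumption: the last observed value is carried forward across
    consecutive gaps; with nothing to carry forward, 0.0 is used.
    """
    filled, indicator = [], []
    last_seen = None
    for value in series:
        if value is None:
            indicator.append(1)
            filled.append(last_seen if last_seen is not None else 0.0)
        else:
            indicator.append(0)
            filled.append(value)
            last_seen = value
    return filled, indicator

raw = [None, 1.5, None, None, 2.0]
filled, indicator = fill_with_indicator(raw)
```

For the made-up sequence above, `filled` is `[0.0, 1.5, 1.5, 1.5, 2.0]` and `indicator` is `[1, 0, 1, 1, 0]`. Keeping the indicator at 1 for a pulled-forward value lets the model distinguish a stale value from a freshly reported one.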

## Summary

To summarize the above, for each company-month, we extract approximately 20 source fundamental items from the Compustat database. These source items are used to derive four momentum features and two relative valuation features. Then, for each source fundamental item, two features are created: one is a normalized version of the source item and the other is the year-over-year change in the source item. Finally, for every feature thus created, another binary indicator feature is created that indicates whether the feature’s value is missing for the particular company-month. In total, all of the derivations result in approximately 80 features that will be available to the models for each company-month. This process is depicted in the diagram above.

In the next post (Part 3), we will discuss the specific deep learning model architectures we use in this research project.