Part 3: Deep Learning and Long-Term Investing, a Comparison to Factor Models

By: John Alberg and Michael Seckler

Recently we have explored the use of deep learning for systematic long-term investing. We first introduced this idea in our Q1 investor letter. This post is part 3 of a series that dives into the details of our approach. In Part 1 and Part 2 of this series, we discussed the problem setup, data sources, and data pre-processing steps we use to apply deep learning to predicting whether a stock will outperform the median performance of all stocks over a one-year period. In this post, we explain in more detail the relationship and differences between deep learning models and the more traditional models used by quantitative investors.

Factor Models

To understand the advantages of using deep learning for systematic long-term investing, it is worthwhile to understand how it compares to more traditional approaches to systematic or quantitative investing. There are a great variety of quantitative investing and trading methods. But, as we explained in Part 1, we are specifically interested in their application to long-term investing using publicly available company fundamental information. For this, most quantitative investors employ a “factor model.”

Let us use the well-studied “value factor” model as an example. By BE/ME, we denote a company’s book equity (as found on its balance sheet) divided by its market equity (stock price times the number of common shares outstanding). Imagine now that you sort all public companies by BE/ME and divide them into three portfolios of stocks: (1) the value portfolio, being the top 30% of companies when sorted by BE/ME; (2) the neutral portfolio, being the middle 40%; and (3) the growth portfolio, being the bottom 30%. The three portfolios are typically referred to as the H, N, and L portfolios for High, Neutral, and Low book-to-market equity, respectively. The H portfolio is considered the value portfolio because high book value relative to market value implies the stock is inexpensive, and the L portfolio is called the growth portfolio because, presumably, investors must be paying for future growth if they are paying a relatively high premium to book value.

By taking the returns of the value portfolio minus the returns of the growth portfolio, you obtain the return sequence of a hypothetical portfolio (called HML, for high minus low) that buys value stocks and shorts growth stocks. HML is a portfolio that goes up when value stocks outperform growth stocks and down when growth stocks outperform value stocks. The expected value of HML’s return sequence is called the value premium, and it is a proxy for the returns one would have obtained by holding a portfolio of stocks similar to HML. The value premium has been tested over many time periods, on many different stock universes, and in many different markets. It is generally a positive number with statistical significance (although its efficacy has waned in recent years, an observation we will discuss in the next blog post).
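The sort-and-spread construction above can be sketched in a few lines of Python. The tickers, BE/ME ratios, and returns below are made-up numbers for illustration only:

```python
# Hypothetical sample of companies: (ticker, BE/ME, next-year total return).
companies = [
    ("A", 1.8, 0.12), ("B", 1.5, 0.10), ("C", 1.2, 0.08), ("D", 0.9, 0.06),
    ("E", 0.8, 0.05), ("F", 0.7, 0.07), ("G", 0.6, 0.04), ("H", 0.4, 0.03),
    ("I", 0.3, 0.02), ("J", 0.2, 0.01),
]

# Sort by BE/ME, high to low, then split into top 30% / middle 40% / bottom 30%.
ranked = sorted(companies, key=lambda c: c[1], reverse=True)
n = len(ranked)
h_portfolio = ranked[: int(0.3 * n)]       # High BE/ME: the "value" portfolio
l_portfolio = ranked[n - int(0.3 * n):]    # Low BE/ME: the "growth" portfolio

def mean_return(portfolio):
    """Equal-weighted average return of a portfolio."""
    return sum(c[2] for c in portfolio) / len(portfolio)

# HML: return of the value portfolio minus return of the growth portfolio.
hml = mean_return(h_portfolio) - mean_return(l_portfolio)
```

With these toy numbers, the value portfolio returns 10%, the growth portfolio 2%, and the HML spread is 8% for the period; a real study would compute this spread each period to form the HML return sequence.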

HML is, of course, just one factor. In one of the seminal papers on factor models, “The Cross-Section of Expected Stock Returns,” Kenneth French and Nobel laureate Eugene Fama show that HML, along with SMB (which stands for small minus big, i.e., small firms minus big firms) and another factor called the “market factor,” all have positive expected premia with statistical significance. This is commonly referred to as the Fama and French three-factor model. More recently, Fama and French evolved their model into a five-factor model.

In the years since Fama and French’s 1992 paper, hundreds of factors have been proposed and tested. The statistical significance of a factor’s premium is typically reported as a t-statistic, where a value greater than about 2.0 is considered significant. It is not necessary to explain the details of the t-statistic here; what matters is that a level of 2.0 is deemed significant because such a value is unlikely to occur by chance. However, if 100 factors (for example) are tested, the probability that at least one of the 100 tests produces a t-statistic greater than 2.0 purely by chance is far higher than if only one test were performed. The barrage of statistical testing performed on newly proposed factors therefore raises a serious problem with using a significance test to validate the efficacy of a factor model. The authors of the 2014 paper titled “. . . and the Cross-Section of Expected Returns” do an excellent job of explaining this issue in their abstract:

Hundreds of papers and hundreds of factors attempt to explain the cross-section of expected returns. Given this extensive data mining, it does not make any economic or statistical sense to use the usual significance criteria for a newly discovered factor, e.g., a t-ratio greater than 2.0 … Echoing a recent disturbing conclusion in the medical literature, we argue that most claimed research findings in financial economics are likely false. 

C. Harvey, Y. Liu, and H. Zhu, “. . . and the Cross-Section of Expected Returns”
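The multiple-testing effect the authors describe is easy to quantify under a simplifying assumption of independent tests: if each test has a 5% chance of clearing the significance cutoff by luck alone, then after 100 tests a spurious “discovery” is all but guaranteed.

```python
# Probability that at least one of n independent tests looks "significant"
# purely by chance, when each test has a 5% false-positive rate.
alpha = 0.05

def prob_at_least_one_false_positive(n_tests):
    return 1.0 - (1.0 - alpha) ** n_tests

p_1 = prob_at_least_one_false_positive(1)      # 5% for a single test
p_100 = prob_at_least_one_false_positive(100)  # roughly 99.4% for 100 tests
```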

In addition to the paper referenced above, many authors have recently investigated this issue in detail (see here, here, and here). The general solution takes a form that is similar to the one proposed in “. . . and the Cross-Section of Expected Returns,” which calculates new levels required for the t-statistic to be significant based on the number of factors that have been tested. The following diagram from their paper gives a great visualization of this state of affairs.
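As a rough illustration of such adjusted thresholds (and only an illustration, since Harvey, Liu, and Zhu use more sophisticated multiple-testing frameworks), a simple Bonferroni correction divides the significance level by the number of tests performed. The sketch below uses a normal approximation to the t-distribution and only the Python standard library:

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def t_threshold(alpha, n_tests):
    """Bonferroni-style cutoff: two-sided per-test level of alpha / n_tests.
    Inverts the normal CDF by bisection (normal approximation to the t-stat)."""
    target = 1.0 - (alpha / n_tests) / 2.0
    lo, hi = 0.0, 10.0
    while hi - lo > 1e-9:
        mid = 0.5 * (lo + hi)
        if normal_cdf(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

single = t_threshold(0.05, 1)    # one test: the familiar ~1.96 cutoff
many = t_threshold(0.05, 100)    # after 100 tests: the bar rises to ~3.48
```

The direction of the adjustment matches the paper’s conclusion: after hundreds of tested factors, a t-statistic of 2.0 is no longer persuasive, and cutoffs of 3.0 or more are required.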

In the next section, we introduce deep learning as an alternative to factor-based models. We will also discuss out-of-sample testing, the most widely used method of validating machine learning models, as a solution to the problems that arise from statistical significance testing.

Deep Learning

As mentioned in Part 1 and Part 2, deep neural networks (DNNs) allow us to avoid the process of factor engineering and instead employ a learning algorithm which programmatically searches for methods of selecting, weighting, and transforming a company’s financial data that (in the context of its current price) best predict how the company will perform as an investment.

To understand this process, it is sufficient to think of a DNN as an adjustable black box. That is, the DNN takes the financial items described in Part 2 as inputs and produces, as output, a number between 0.0 and 1.0. This output is a function of the inputs and a large number of “trainable parameters” (the adjustable part of the black box). If we change the value of a trainable parameter, the DNN will produce a different output for a given set of inputs. For example, if the inputs represent IBM in March 1983 and the DNN produces the number 0.65, then adjusting the trainable parameters will cause the DNN to produce a new output (say, 0.73) on the same set of inputs (again, IBM in March 1983).
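This “adjustable black box” view can be made concrete with a tiny numpy sketch. The network size, inputs, and parameter values below are all made up; a real DNN would have many layers and vastly more parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def dnn_output(inputs, params):
    """Tiny two-layer network whose output is a number in (0, 1)."""
    w1, b1, w2, b2 = params
    hidden = np.tanh(inputs @ w1 + b1)                  # hidden representation
    return 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))    # sigmoid squashes to (0, 1)

n_inputs, n_hidden = 20, 8  # e.g. 20 fundamental items per company
params = [rng.normal(size=(n_inputs, n_hidden)), np.zeros(n_hidden),
          rng.normal(size=n_hidden), 0.0]

company = rng.normal(size=n_inputs)  # stand-in for one company's inputs
p1 = dnn_output(company, params)

# Adjust one trainable parameter: the same inputs now yield a different output.
params[2][0] += 0.5
p2 = dnn_output(company, params)
```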

The ability to adjust these parameters to modify the DNN’s output allows us to train the DNN to achieve a certain objective. In this project, we are teaching the DNN to produce an output number that is a good estimate of the probability that the company (represented by the input) will have a total return (price change plus dividends reinvested) greater than the median total return of all stocks. Recall the data layout diagram from Part 1:

To understand the above goal in the context of this diagram, we would like the DNN, when given a set of inputs (“Fundamental Data at Time t” in the diagram), to produce an output probability that is close to 1.0 when the target output (“Outcomes at Time t+12” in the diagram) is +1, and an output probability that is close to 0.0 when the target output is -1.

How does this training process actually work? Initially, the trainable parameters are set to random values. So at first, the actual output probabilities of the DNN are likely to be bad estimates of the target outputs. As learning proceeds, however, the algorithm repeatedly shows the DNN a set of inputs, the DNN produces new outputs, the learning algorithm evaluates how well the output probability predicts the target output, and then adjusts the trainable parameters to improve the prediction (if necessary). As a result, the DNN becomes better and better at estimating the probability of a company’s outperformance with each iteration of the training process. When training is complete, we hope that the resulting DNN has some explanatory power with respect to future stock price changes.
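The loop described above can be sketched with a deliberately simplified stand-in: a single-layer model trained by gradient descent on synthetic data. The data, learning rate, and iteration count are arbitrary choices for illustration, not our actual setup:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data: rows are companies, columns are fundamental inputs;
# the target is 1 if the company beat the median return, else 0.
X = rng.normal(size=(200, 10))
hidden_rule = rng.normal(size=10)
y = (X @ hidden_rule > 0).astype(float)

def cross_entropy(p, y):
    """How badly the output probabilities predict the targets."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

w = np.zeros(10)  # trainable parameters: uninformative at first
initial_loss = cross_entropy(1.0 / (1.0 + np.exp(-(X @ w))), y)  # coin-flip level

for step in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # model's estimated probabilities
    grad = X.T @ (p - y) / len(y)        # how the loss changes with each parameter
    w -= 0.5 * grad                      # adjust parameters to improve the prediction

final_loss = cross_entropy(1.0 / (1.0 + np.exp(-(X @ w))), y)
```

With each iteration the parameters are nudged so the output probabilities track the targets more closely, and the loss falls from its initial coin-flip level.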

This description of the learning process is accurate not just for DNNs but for any “supervised” machine learning algorithm. The use of deep learning, as previously described, gives the model the expressive power to untangle important relationships in a hierarchical way from the basic company financial data – which makes learning possible without extensive feature engineering of the inputs. Furthermore, in the case of deep recurrent neural networks, it allows the model to construct these relationships over different historical lookback periods. As a simple example, and as described more fully in Part 2, the model can evaluate whether to look at a company’s capital structure as it is today, how it evolved over the past year, or perhaps how it evolved over many years. It simply seeks the qualities and sequences of fundamental data that have shown the most predictive power in getting the correct target output. If we were to employ more traditional machine learning techniques, much more feature engineering would need to be done, fewer parameters would be evaluated, and the factors selected – as they are “selected” – would reflect some level of our biases.

The Relationship Between Factor Models and Deep Learning

How does the deep learning approach differ from the traditional factor models described in the previous section? With factor methods, as discussed above, the factors themselves (or portfolios constructed from them) are tested for their explanatory power of expected returns. In the machine learning approach, the DNN’s trainable parameters are adjusted, through a learning algorithm, to maximize the explanatory power of the DNN’s output. In this way you can think of the output of a DNN as a factor itself. As with a factor model, the output can be used to sort stocks, construct portfolios, and produce a cross-section of returns analysis. But the difference is that the DNN’s output is algorithmically engineered instead of posited as a hypothesis (i.e., factor) to be tested.
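To see the parallel, one can use the DNN’s output exactly the way BE/ME was used earlier: sort stocks by the output and form a top-minus-bottom spread. The scores and returns below are hypothetical:

```python
# Hypothetical DNN outputs (scores) and realized one-year returns for ten stocks.
scores  = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
returns = [0.15, 0.12, 0.10, 0.07, 0.05, 0.04, 0.03, 0.02, 0.01, -0.02]

# Treat the output like a factor: sort by score, long top 30%, short bottom 30%.
ranked = sorted(zip(scores, returns), reverse=True)
n = len(ranked)
top = ranked[: int(0.3 * n)]
bottom = ranked[n - int(0.3 * n):]

spread = (sum(r for _, r in top) / len(top)
          - sum(r for _, r in bottom) / len(bottom))
```

The mechanics are identical to the HML construction; the difference is that the sorting variable was learned rather than posited.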

Machine learning has been described as the automation of the scientific method, and this comparison is relevant here. In the factor method, we construct a hypothesis (e.g., the factor HML explains expected returns), gather data to test the hypothesis, test the hypothesis on the data, and reject the hypothesis if the test result is not significant. Machine learning automates this process by iteratively modifying the DNN’s trainable parameters and testing the effectiveness of those modifications during learning. This is analogous to iteratively positing and testing hypotheses until one is found that is satisfactory. 

Viewing machine learning as a process that is iteratively testing many, many hypotheses to find the best answer brings us back to the same challenge we encounter when constructing factor models. That is, if you test enough models on the same data you will eventually (by chance) find one that works only on the specific data sample, but has no real explanatory power outside of the sample. The model will look great “in-sample” but utterly fail “out of sample.” 

In the economic literature, the proposed strategy for overcoming this challenge is to demand a higher t-statistic. In machine learning, however, the most widely employed strategy is the use of out-of-sample testing. In this approach, a randomly chosen “holdout” dataset is set aside prior to learning. The data that has not been set aside (typically called the training data) is used to train the model. Because the resulting model has been optimized specifically to perform well on the training data, its performance on the training data is biased. Therefore, once the training process is complete, the model is evaluated on the unseen holdout data to get an unbiased estimate of its future performance.
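The holdout mechanics are simple to sketch. The dataset size and 20% holdout fraction below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000                           # total number of labeled examples
indices = rng.permutation(n)       # randomly shuffle the example indices

holdout_size = int(0.2 * n)        # set aside 20% before any training happens
holdout_idx = indices[:holdout_size]   # never shown to the learning algorithm
train_idx = indices[holdout_size:]     # used to fit the trainable parameters

# Train on train_idx only; evaluate the finished model on holdout_idx
# to get an unbiased estimate of its performance on unseen data.
```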

This approach works very well for many applications. However, there are unique challenges to its execution when applied to time series data (such as the kind used in stock market investing). One example is that the nature of the problem may change over time. This might be caused by how CFOs change their preferences for reporting data within the guidelines of GAAP (generally accepted accounting principles) or how assets are interpreted by the model given the transition from a manufacturing economy to a service economy. In the next blog post, we will dissect these challenges and describe an effective approach to learning and validating deep learning models in domains with time series data in general, and with stock market investing data in particular.