Systematic Investing and Deep Learning - Part 2

By: John Alberg and Lakshay Chauhan

Over the last three years, Euclidean has engaged in an intense research and development effort to assess whether the great advances in machine learning achieved in the past decade can improve our systematic, long-term investing processes. Much of this research has been published in two peer-reviewed papers and presented at both the NeurIPS and ICML machine learning conferences. The papers can be found here and here. This article is the second in a multi-part series, written for a general readership, describing what we have learned from this research.

In the first part of this series, we demonstrated that a perfect forecast of future earnings would result in spectacular returns. We call this perfect forecast the “clairvoyant model,” a hypothetical model that has access to future earnings. Obviously, we are not clairvoyant. We cannot see into the future. However, we can simulate [1] what would happen if we could see the future, and that simulation provides an upper bound on the performance we could realize when attempting to forecast the future. In simulation, a perfect forecast (the clairvoyant model) of earnings 12 months into the future achieved an annualized return of about 40%, as shown in the figure below [2]. With this, we hypothesized that, if we could forecast earnings with reasonable accuracy, we might be able to realize some of these gains.

Figure 1: Portfolio performance when future fundamentals can be accessed clairvoyantly.


To make such forecasts, we borrowed ideas from the use of machine learning in natural language processing and computer vision, two applications of machine learning that have seen remarkable results. Some of these examples can be found here, here, here, and here. Since financial data is sequential in nature, the deep learning models used in those domains (e.g., recurrent neural networks [RNNs]) are well suited to the task. We also introduced the idea of quantifying the uncertainty, or confidence, in our forecast: if a model had low confidence in a particular forecast, we would adjust that forecast's impact accordingly.

In this post, we first describe the quantitative factor models that employ the ideas mentioned above. We then discuss the portfolio simulation process and how we train the deep learning models. Finally, we show that deep learning models are more successful at forecasting than other methods: factor models that use deep learning forecasts achieved, in simulation, a 17.7% annualized return over the last 20 years, compared to 14% for a standard factor model over the same period. At the same time, we controlled the risk in our investment process by incorporating the estimated uncertainty of these deep learning forecasts into the factor models.

Quantitative Factor Models

Typical quantitative investment strategies use factors, such as EBIT/EV, to construct portfolios: the universe of stocks is sorted by the factor, and the strategy invests in the stocks that rank the highest. Here, EBIT stands for Earnings Before Interest and Taxes, and EV stands for Enterprise Value. EV approximates the sum necessary to acquire a company and combines its market capitalization with its total liabilities.

While a standard factor model uses current EBIT, we are interested in investigating strategies that use forecast EBIT, denoted EBITfcst. We use these EBIT forecasts to construct what we call lookahead factor models (LFMs): factor models that rank by EBITfcst/EV rather than EBIT/EV. We investigate the effectiveness of several kinds of LFMs, including auto-regressive, linear regression, and deep neural network (DNN) forecasters. As mentioned above, we also generated uncertainty-aware models, denoted by the prefix UQ, by scaling the forecast in inverse proportion to its variance.
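
To make these definitions concrete, here is a minimal sketch in Python. The function names are ours, and the inverse-variance shrinkage shown in uq_lookahead_factor is one plausible form of the adjustment, not necessarily the exact one used in our papers:

```python
import numpy as np

def standard_factor(ebit_ttm, ev):
    """Standard value factor: trailing-twelve-month EBIT over enterprise value."""
    return ebit_ttm / ev

def lookahead_factor(ebit_fcst, ev):
    """Lookahead factor: forecast 12-month-forward EBIT over enterprise value."""
    return ebit_fcst / ev

def uq_lookahead_factor(ebit_fcst, ebit_var, ev):
    """Uncertainty-aware lookahead factor: the forecast is shrunk in
    inverse proportion to its estimated variance (illustrative form)."""
    return (ebit_fcst / (1.0 + ebit_var)) / ev

def top_k(factor_values, k=50):
    """Indices of the k stocks ranked highest by the factor."""
    return np.argsort(factor_values)[::-1][:k]
```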

Data Normalization

Our expectation in training a deep learning model was that, by providing a large number of example time series spanning thousands of companies, the model could learn patterns that predict future earnings and other fundamentals. However, there can be wide differences in the absolute values of these fundamental features across companies and across time. For example, Walmart’s annual revenue for 2020 was $559 billion USD, while GoPro had revenue of $892 million USD for the same period. Intuitively, these statistics are more meaningful when scaled by company size.

Hence, we scaled all fundamental features in a given time series by the market capitalization at the last time step of the series. Because every time step is scaled by the same value, the deep neural network can still assess the relative change in fundamental values between time steps. While other notions of size have been used, such as EV and book value, we avoided these measures because they can, although rarely, take negative values. We then further standardized the features so that each had zero mean and unit standard deviation.
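
A minimal sketch of this two-step normalization, assuming the per-feature means and standard deviations are estimated from the training set:

```python
import numpy as np

def scale_by_size(series, market_cap_last):
    """Divide every time step of a (time_steps, n_features) array by the
    market cap at the final time step, so large and small companies land
    on a common scale while relative changes over time are preserved."""
    return series / market_cap_last

def standardize(series, train_features):
    """Z-score each feature using statistics from the training data."""
    mu = train_features.mean(axis=0)
    sigma = train_features.std(axis=0)
    return (series - mu) / sigma
```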

Portfolio Simulation

Before moving to the training process and discussing the results, it is important to understand the portfolio simulation process. The goal of the simulator is to recreate as accurately as possible the investment returns an investor would have achieved had they been using the model over a specific period of time and within a specific universe of stocks. To this end, the simulation must incorporate transaction costs, liquidity constraints, bid-ask spreads, and other types of friction that exist in the management of a real-life portfolio of stocks.

The simulation algorithm works as follows:

We construct portfolios by ranking all stocks according to the factor being simulated, investing equal amounts of capital in the top 50 stocks, and rebalancing in this way annually. The simulation also models real-world frictions (a sketch of the cost model follows this list):

- We limit the number of shares of a security bought or sold in a month to no more than 10% of that security's monthly volume.
- Simulated prices for stock purchases and sales are based on the volume-weighted daily closing price of the security during the first 10 trading days of each month.
- If a stock paid a dividend during the period it was held, the dividend is credited to the simulated fund in proportion to the shares held.
- Transaction costs are modeled as $0.01 per share, plus a slippage factor that increases as the square of the simulation's volume participation in a security. Specifically, when participating at the maximum 10% of monthly volume, the simulation buys at 1% more than the average market price and sells at 1% less.

This form of slippage is common in portfolio simulations; it models the fact that, as an investor's volume participation in a stock increases, the investor's own trading moves the price against them.
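
In sketch form, the cost model above might be implemented as follows. Note that, with a 1% impact at the 10% participation cap, quadratic slippage reduces to the square of the participation rate; the function name and signature are illustrative:

```python
def execution_price(vwap, participation, side, per_share_fee=0.01):
    """Simulated fill price: $0.01/share commission plus slippage that
    grows as the square of volume participation, calibrated so that a
    10% participation rate moves the price 1% against the trader."""
    assert 0.0 <= participation <= 0.10, "participation capped at 10% of monthly volume"
    impact = 0.01 * (participation / 0.10) ** 2   # equals participation ** 2
    if side == "buy":
        return vwap * (1.0 + impact) + per_share_fee
    return vwap * (1.0 - impact) - per_share_fee  # sell
```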

Due to how a portfolio is initially constructed and the timing of cash flows, two portfolio managers can get different investment results over the same period using the same quantitative model. To account for this variation, we run 300 portfolio simulations for each model, with each portfolio initialized from a randomly chosen starting state. The portfolio statistics presented in this article, such as compound annual return (CAR) and Sharpe ratio, are the means of the statistics generated by the 300 simulations.
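
The averaging step is simple; in the sketch below, simulate is a stand-in for the full portfolio simulator described above, not an actual API:

```python
import numpy as np

def mean_portfolio_stats(simulate, model, n_runs=300):
    """Average CAR and Sharpe ratio over simulations started from
    randomly chosen initial states (one seed per run)."""
    cars, sharpes = [], []
    for seed in range(n_runs):
        result = simulate(model, seed=seed)   # hypothetical simulator call
        cars.append(result["car"])
        sharpes.append(result["sharpe"])
    return np.mean(cars), np.mean(sharpes)
```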

Training the Deep Learning Model

In part 1 of this blog series, we demonstrated that, if we could forecast EBIT perfectly (as shown in Figure 1), portfolios built using the lookahead factor would far outperform standard factor models. Of course, perfect knowledge of future EBIT is impossible, but we speculated that, by forecasting future EBIT, we could realize some of these gains and outperform a standard factor model.

However, the question arose as to how far into the future we should forecast. Clearly, forecasting becomes more difficult the further into the future we look. To examine this question, we plotted the out-of-sample mean squared error (MSE) for different forecast periods: the further out we tried to predict, the less accurate our model became. At the same time, the clairvoyance study (Figure 1) told us that the value of a forecast increases monotonically the further into the future we can see. In our experiments, and as shown by the blue curve in Figure 3, the best trade-off is achieved with a forecasting period of 12 months. That is, simulated returns increase as the forecasting window lengthens up until 12 months, after which the returns start to fall.

Figure 3: MSE (red) of the out-of-sample period (2000–2019) increased with forecast period length. The forecasting model became less accurate the further we went out into the future.


We therefore set our goal to forecast 12 months into the future. Since no forecast is perfect, we asked how forecast accuracy relates to portfolio performance. It seemed intuitive that higher accuracy would yield better portfolio performance, and we tested this experimentally. However, before examining this relationship, it is important to understand how a deep learning model is trained.

As with many machine learning techniques, training a DNN is an iterative process. In this case, we began with some initial model parameters used to make a forecast. We evaluated the respective forecast error and used it to intelligently tune the model parameters such that the forecast error for the next iteration was lower. We continued this process and iterated the model parameters until the forecast error could not be improved any further. The final model contained the set of parameters from the last iteration and was expected to provide the highest accuracy.
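
For readers who want to see this loop concretely, here is a minimal PyTorch sketch using an LSTM as the sequence model. The architecture, hyperparameters, and file names are illustrative rather than our production configuration; the per-epoch checkpointing and out-of-sample scoring are what allow forecast accuracy to be paired with portfolio performance below:

```python
import torch
import torch.nn as nn

class EBITForecaster(nn.Module):
    """Sequence of normalized fundamentals in, 12-month-forward EBIT out."""
    def __init__(self, n_features, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                 # x: (batch, time_steps, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])   # forecast from the last time step

def train(model, train_loader, val_loader, epochs=50, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:         # y: (batch, 1) future EBIT targets
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()               # error signal tunes the parameters
            opt.step()
        # Score the out-of-sample data and checkpoint after every epoch.
        model.eval()
        with torch.no_grad():
            val_mse = sum(loss_fn(model(x), y).item()
                          for x, y in val_loader) / len(val_loader)
        torch.save(model.state_dict(), f"checkpoint_epoch_{epoch}.pt")
        print(f"epoch {epoch}: out-of-sample MSE {val_mse:.4f}")
```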

The above description simplifies many nuances of training a DNN, but it highlights the key elements for understanding how we established a correspondence between forecast accuracy and portfolio performance. After every iteration, we saved the model parameters and made predictions on an out-of-sample dataset. We evaluated the forecast error for these predictions as measured by the MSE. Additionally, we simulated the portfolio performance of factors built from these predictions.

As the training progressed with each iteration, the forecast error (MSE) decreased. As forecasting accuracy improved, we expected the portfolio performance to improve. This is what we see in Figure 4. The first iteration (bottom-rightmost) had the highest MSE and lowest return. As we trained the model, the points moved toward better returns (upper left corner). This experiment validated our hypothesis that returns strongly depend on the accuracy of the forecasting model.

Figure 4: Correspondence between DNN model accuracy and portfolio returns. The bottom-rightmost point was evaluated after the first epoch. As training progressed, points in the graph moved toward the upper left corner. Portfolio returns increased as model accuracy improved (out-of-sample MSE decreased).


Results

As a first step in evaluating the forecasts produced by the neural networks, we compared the MSE of the predicted fundamentals on out-of-sample data with that of a naïve predictor, in which the predicted fundamentals are assumed to be the same as the trailing-twelve-month values. This naïve predictor is, in essence, the standard factor model. In nearly all months, however turbulent the market, the neural networks outperformed the naïve predictor.
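
The comparison itself is straightforward; a sketch, with hypothetical array names:

```python
import numpy as np

def mse(pred, actual):
    """Mean squared error between forecasts and realized values."""
    pred, actual = np.asarray(pred), np.asarray(actual)
    return np.mean((pred - actual) ** 2)

# The naive predictor simply carries trailing-twelve-month EBIT forward,
# which is exactly what a standard factor model implicitly assumes.
# dnn_mse   = mse(dnn_forecasts, realized_ebit)
# naive_mse = mse(trailing_ebit, realized_ebit)
```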

Figure 5: MSE over out-of-sample time period for DNN (red) and standard factor model or naïve predictor (blue)


Table 1 demonstrates a clear advantage of using LFMs over standard models, improving the annualized return by 3.7 percentage points (17.7% versus 14.0%). Deep learning LFMs achieve higher accuracy than the linear or auto-regression models and, thus, yield better portfolio performance. Figure 6 shows the cumulative return of all portfolios across the out-of-sample period.

Additionally, we care not only about the return of a portfolio but also about the risk undertaken, as measured by volatility. We estimated the uncertainty in our forecasts to lower this form of risk. The risk-adjusted return, or Sharpe ratio, was meaningfully higher for the uncertainty-aware LFM UQ DNN model, which reduces risk by scaling the EBIT forecast in inverse proportion to its total variance.

Table 1: Out-of-sample performance for the 2000–2019 time period.

  Strategy                         MSE    CAR      Sharpe Ratio
  S&P 500                          n/a    6.05%    0.32
  Standard Factor Model            0.65   14.00%   0.52
  LFM Auto-Regression              0.58   14.20%   0.56
  LFM Linear                       0.52   15.50%   0.64
  LFM DNN                          0.48   16.20%   0.68
  Uncertainty-Aware LFM UQ DNN     0.47   17.70%   0.84

Figure 6: Cumulative return of different strategies for the out-of-sample period. The LFM UQ DNN consistently outperformed all other models throughout the entire period.

To establish the significance of the improved Sharpe ratio, we computed pairwise t-statistics [3], as shown in Table 2. In this table, a t-statistic value > 2 means that the Sharpe ratio of the model in the corresponding column is significantly higher than that of the model in the corresponding row. For example, the LFM UQ DNN model returned a higher Sharpe ratio than the linear model, and the t-statistic for that improvement was 3.12. The improvement was statistically significant, meaning it is unlikely to have occurred by chance. The last column of Table 2 compares the improvement in the Sharpe ratio of the UQ DNN model over all other models; the t-statistic was > 2 for every comparison. This provides strong evidence that the Sharpe ratio was significantly improved by using estimated uncertainty to reduce risk.
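
For a single pair of strategies, the bootstrap procedure described in footnote [3] might be sketched as follows (the resampling details here are illustrative):

```python
import numpy as np

def sharpe(monthly_returns):
    """Annualized Sharpe ratio of monthly returns (risk-free rate omitted)."""
    r = np.asarray(monthly_returns)
    return np.sqrt(12) * r.mean() / r.std()

def bootstrap_sharpe_tstat(returns_a, returns_b, n_boot=10_000, rng=None):
    """t-statistic for the Sharpe-ratio difference between strategy B and
    strategy A, via bootstrap resampling of paired monthly returns."""
    if rng is None:
        rng = np.random.default_rng(0)
    a, b = np.asarray(returns_a), np.asarray(returns_b)
    n = len(a)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample months, paired
        diffs[i] = sharpe(b[idx]) - sharpe(a[idx])
    return diffs.mean() / diffs.std()             # difference in bootstrap-SE units
```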

Table 2: Pairwise t-statistics for the Sharpe ratio. The models are organized in increasing order of Sharpe ratio. All t-statistics for the LFM UQ DNN model are significant at a significance level of 0.05.


             Auto-Reg   Linear   DNN    UQ DNN
  Standard   0.76       2.52     2.96   5.57
  Auto-Reg              1.89     2.36   5.10
  Linear                         0.46   3.12
  DNN                                   2.66

In addition to providing simulation results for concentrated 50-stock portfolios (Table 1), we also provide the cross-section [4] of returns generated for the LFM DNN and LFM UQ DNN models in the out-of-sample period (Table 3). The cross-section shows the efficacy of the factor when looked at across the entire investment universe, where monthly returns increased almost monotonically as we moved from the bottom decile to the top decile. The difference between the top and bottom decile (high minus low or H − L) is called the “factor premium.” The t-statistic for the factor premium is significant and greater for UQ DNN than for DNN and the standard model (Table 3).
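
A single-month sketch of this decile construction (see footnote [4]); factor and next_month_returns are hypothetical arrays aligned by stock:

```python
import numpy as np

def decile_returns(factor, next_month_returns):
    """Sort stocks by factor value, split into ten equal-weighted deciles,
    and report each decile's mean next-month return and the H-L premium."""
    factor = np.asarray(factor)
    rets = np.asarray(next_month_returns)
    order = np.argsort(factor)                    # low -> high factor values
    deciles = np.array_split(order, 10)
    means = np.array([rets[idx].mean() for idx in deciles])
    factor_premium = means[-1] - means[0]         # high minus low (H - L)
    return means[::-1], factor_premium            # high decile listed first
```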

Table 3: Cross-section of monthly returns. The universe of stocks was ranked by the given factor and divided into ten groups of equally weighted stocks. The top decile (marked High) was formed from the top 10% of stocks ranked by the factor, and the bottom decile (marked Low) was formed from the bottom 10%. H − L represents the factor premium.

  Decile        Standard   DNN    UQ DNN
  High 1        1.39       1.38   1.47
  2             1.24       1.21   1.18
  3             1.15       1.12   1.13
  4             1.16       1.08   1.04
  5             1.06       1.14   0.97
  6             1.00       1.04   0.98
  7             0.95       0.94   0.98
  8             0.85       0.75   0.90
  9             0.78       0.79   0.74
  Low 10        0.73       0.57   0.64
  H − L         0.66       0.80   0.83
  t-statistic   2.31       2.78   3.57

Conclusion

Our clairvoyant study suggested that, if we could forecast future fundamentals perfectly, we would realize annualized returns of 40%. Of course, this is not possible because we do not know the future. However, motivated by this analysis, we attempted to forecast the future as accurately as possible. We demonstrated that, by predicting fundamental data with deep learning, we could construct an LFM that significantly outperforms equity portfolios based on traditional factors. We achieved further gains by incorporating uncertainty estimates to reduce risk. In all, we demonstrated the superiority of deep learning LFMs over standard factor approaches for both absolute and risk-adjusted returns.

 

Footnotes:

[1] All results presented in the article are simulated results. Historical simulated results presented herein are for illustrative purposes only and are not based on actual performance results. Historical simulated results are not indicative of future performance.

[2] The clairvoyance period refers to how far into the future we were looking. A 12-month clairvoyance period meant we had access to fundamentals 12 months into the future.

[3] We ran 300 simulations with varying initial start states for each model. Additionally, we randomly restricted the universe of stocks to 70% of the total universe, making the significance test more robust for different portfolio requirements. We aggregated the monthly returns of these 300 simulations by taking the mean, and we performed bootstrap resampling on the monthly returns to generate the t-statistic values for the Sharpe ratio, shown in Table 2.

[4] The cross-section was constructed via sorting stocks by each factor and splitting them into ten equally sized portfolios, ranging from the top decile (highest factor values) to the bottom decile (lowest factor values). The portfolios were rebalanced quarterly according to the factor sorts. More details on this methodology are presented in the original 1992 Fama-French publication.