This is the companion notebook for the session on validation and tuning of machine learning models for factor investing.
In the financial applications below, the choice of dependent variable is not optimal. It was chosen to minimise code complexity and to ease the understanding of basic principles. The results may seem disappointing, but they are realistic! It takes experience and experimentation to make ML algorithms deliver good results. That’s up to you!
\(\rightarrow\) ABOUT THE NOTEBOOK:
In this section, we illustrate how model complexity affects the average error and the variance of predictions. To that purpose, we use simple trees.
We start by loading packages and importing the data.
library(tidyverse) # Data wrangling (and so much more) package
library(rpart) # Simple decision tree package
library(rpart.plot) # Tree plot package
load("data.RData") # Load the data
data <- data %>% # Compute the returns
select(-ESG_rank) %>% # Remove ESG: not enough points!
group_by(Tick) %>%
mutate(F_Return = lead(Close) / Close - 1, # Adding future 1M returns
FL_Return = lead(Close, 12) / Close - 1) %>% # Adding future 1Y returns
ungroup() %>%
group_by(Date) %>%
mutate(FR_Return = F_Return - mean(F_Return), # Relative return vs EW market
FLR_Return = FL_Return - mean(FL_Return)) %>% # Relative return vs EW market (long)
ungroup()
Then, we format the inputs that we will feed to the tree algorithm.
normalise <- function(v){ # This is a function that 'uniformalises' a vector
v <- v %>% as.matrix()
return(ecdf(v)(v))
}
data_f <- data %>% # Data with normalised values
select(-Close) %>% # Closing prices are not predictors
group_by(Date) %>% # Normalisation takes place for each given date
mutate_at(vars(Vol_1M:Prof_Marg), normalise) %>% # Apply normalisation on numerical columns
ungroup()
sep_date <- as.Date("2012-01-01") # Train vs test separation date
train_sample <- data_f %>% # Train sample
filter(Date <= sep_date)
test_sample <- data_f %>% # Test sample
filter(Date > sep_date) %>%
na.omit() # Remove missing data
Finally, we train and test simple trees.
fit <- rpart(FLR_Return ~ Mkt_Cap + P2B + Vol_1M + D2E + Prof_Marg, # Model
data = train_sample, # Data source
cp = 0.0001, # Tree complexity
maxdepth = 2 # Nb levels
) # The model is then trained!
rpart.plot(fit) # Plot the tree
In the above code, we impose a small complexity (cp) so that we can manually control the depth (and hence, the complexity) of the tree. Given the trained parameters (splits), we test the model on the recent data.
pred <- predict(fit, test_sample) # Predictions
mean(pred)-mean(test_sample$FLR_Return) # Bias
## [1] 0.008821996
var(pred) # Variance
## [1] 0.009755871
The first number shows the error in average prediction while the second one measures the dispersion of predictions.
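As a complement (not shown in the original output), both effects can be aggregated into the out-of-sample mean squared error, which is the metric used later in the grid search.

mean((pred - test_sample$FLR_Return)^2) # Out-of-sample MSE, which combines bias and variance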
## A deeper tree

Below, we test the same configuration, but with a maximum depth of five (beyond five, it is hard to read the values inside the tree).
fit <- rpart(FLR_Return ~ Mkt_Cap + P2B + Vol_1M + D2E + Prof_Marg, # Model
data = train_sample, # Data source
cp = 0.0001, # Tree complexity
maxdepth = 5 # Nb levels
) # The model is then trained!
rpart.plot(fit) # Plot the tree
Graphically, the model is evidently more complex, and the predictions change accordingly.
pred <- predict(fit, test_sample)
mean(pred)-mean(test_sample$FLR_Return) # Bias
## [1] -0.007871907
var(pred) # Variance
## [1] 0.01415666
The two quantities move in opposite directions: the magnitude of the bias is divided by roughly 1.12 (from 0.88% to 0.79%), while the variance is multiplied by 1.45 (from 0.98% to 1.42%). This is clear evidence of the bias-variance tradeoff.
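To visualise the tradeoff more systematically, the snippet below (a minimal sketch, not part of the original notebook) loops over several values of maxdepth and records the bias and the variance at each step.

purrr::map_dfr(1:6, function(depth){ # Loop over tree depths
    fit <- rpart(FLR_Return ~ Mkt_Cap + P2B + Vol_1M + D2E + Prof_Marg, # Same model as above
                 data = train_sample, cp = 0.0001, maxdepth = depth)
    pred <- predict(fit, test_sample) # Predictions
    tibble(maxdepth = depth,
           bias = mean(pred) - mean(test_sample$FLR_Return), # Average error
           variance = var(pred)) # Dispersion of predictions
})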
In this section, we have a look at the impact of hyperparameters on model performance. Because of computational constraints, we work with boosted trees (neural networks would take too much time!).
We start by preparing the inputs. Because we work with the simple (scaled) features, the dependent variable is the 12M (smooth) return. More precisely, we use the return relative to the average performance (the FLR_Return variable).
library(xgboost) # Load package
features <- c("Mkt_Cap", "P2B", "Vol_1M", "D2E", "Prof_Marg") # Predictors
train_features <- train_sample %>% select(features) %>% as.matrix() # Independent variables
train_label <- train_sample$FLR_Return # Dependent variable
train_matrix <- xgb.DMatrix(data = train_features, label = train_label) # XGB format
test_features <- test_sample %>% select(features) %>% as.matrix() # Independent variables
test_label <- test_sample$FLR_Return # Dependent variable
We recall some key hyperparameters of boosted trees: eta (the learning rate), nrounds (the number of boosting rounds, i.e., of trees) and lambda (the penalisation of leaf values).
Note: some other (related) parameters are also important, like the maximum depth, or gamma (complexity penalisation: the minimum loss reduction required to make a further partition on a leaf node, which effectively penalises the number of leaves, whereas lambda penalises the leaf values), but we stick to three of them for simplicity purposes.
Below, we want to test four to five values for each parameter. The set of feasible values is not straightforward at first. For instance, gamma should not be too high: if it were, the trees would only consist of the original node. Likewise, eta should not be too close to 1, because the model would then have a tendency to overfit the data, but it should not be too far from 1 either (else the model does not learn): a tough choice. Typically, eta and the number of boosting rounds are related: if eta is small, a large number of rounds is required, whereas with a large eta, additional rounds quickly become useless.
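As a side note (this is not used in what follows), one way to relate eta and the number of rounds empirically is cross-validation with early stopping via xgb.cv(). The call below is a sketch that assumes a fixed eta of 0.2 and lets the procedure suggest a number of rounds.

cv <- xgb.cv(params = list(eta = 0.2, # Learning rate (assumed value)
                           max_depth = 4, # Same depth as below
                           objective = "reg:squarederror"), # Squared loss
             data = train_matrix, # Training data (XGB format)
             nrounds = 200, # Upper bound on the number of rounds
             nfold = 5, # 5-fold cross-validation
             early_stopping_rounds = 10, # Stop when the CV error stops improving
             verbose = 0) # No comment
cv$best_iteration # Suggested number of boosting rounds for eta = 0.2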
We start by defining (agnostically at first) the sets of parameter values.
eta <- c(0.2, 0.4, 0.6, 0.8, 0.9) # eta values
nrounds <- c(10, 25, 50, 90) # nb of trees
lambda <- c(0.01, 0.1, 1, 10, 100) # lambda values
pars <- expand.grid(eta, nrounds, lambda) # Exploring all combinations!
eta <- pars[,1]
nrounds <- pars[,2]
lambda <- pars[,3]
Below, we code the function that we will use for the grid search. It takes train and test data as inputs, as well as the parameters we are interested in.
grid_par <- function(eta, nrounds, lambda, train_matrix, test_features, test_label){
fit <- train_matrix %>%
xgb.train(data = ., # Data source (pipe input)
eta = eta, # Learning rate
objective = "reg:squarederror", # Objective function
max_depth = 4, # Maximum depth of trees
lambda = lambda, # Penalisation of leaf values
gamma = 0, # Penalisation of number of leaves
nrounds = nrounds, # Number of trees used
verbose = 0 # No comment
)
pred <- predict(fit, test_features) # Predictions based on fitted model & test values
return(mean((pred-test_label)^2)) # Mean squared error
}
Finally, we test the function on one triplet of parameter values and then on all combinations, using the pmap() function. From the output, we create a graph.
grid_par(0.1, 10, 1, train_matrix, test_features, test_label) # Test of the function
## [1] 0.08081849
grd <- pmap(list(eta, nrounds, lambda), # Parameters for the grid search
grid_par, # Function on which to apply the grid search
train_matrix = train_matrix, # Input for function: training data
test_features = test_features, # Input for function: test features
test_label = test_label # Input for function: test labels (returns)
)
grd <- data.frame(eta, nrounds, lambda, error = unlist(grd)) # Dataframe with all results
grd$eta <- as.factor(eta) # Parameters as categories (for plotting)
grd %>% ggplot(aes(x = eta, y = error, fill = eta)) + # Plot!
geom_col() + theme_light() +
facet_grid(rows = vars(nrounds), cols = vars(lambda)) +
geom_hline(yintercept = var(test_label), linetype = 2)
var(test_label) # Benchmark = zero prediction
## [1] 0.04183968
The final value is a good benchmark. It’s the quadratic error we would make if we predicted a zero return at all times.
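To see why, note that var(test_label) is approximately the mean squared error of an all-zero prediction whenever the sample mean of the label is small; the one-liner below (a quick check, not in the original) makes the comparison explicit.

mean((0 - test_label)^2) # MSE of a constant zero prediction: close to var(test_label) since mean(test_label) is small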
There is one emerging pattern: nrounds (barplots from top to bottom) and lambda (from left to right) seem to be the main drivers of the error at this scale.
Overall, the performance of the models is not impressive. One possible reason is that the dataset is not wide enough: with only 30 assets, it is hard to learn patterns. A second explanation is that the chosen model is not a good option. A third one is that we are not looking at the right metric and may be facing a false negative. Finally, a fourth option is that the label we work with is not consistent with the chosen features.
Going further with caret: https://topepo.github.io/caret/random-hyperparameter-search.html
Going further with h2o: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/grid-search.html
Going further with the tidymodels: https://www.tidymodels.org
This is where we shift from the ML metric to the financial metric. Since we rebalance monthly, we need an indicator suited to this horizon. Below, we record the hit ratio for the one-month-ahead return, i.e., the percentage of times the prediction is in the right direction. Note that we perform a trick: the predictions are 12M forward relative returns, but we compare them to 1M forward relative returns! We again test the sensitivity to the three parameters.
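To fix ideas, here is a toy illustration (hypothetical numbers) of the hit ratio computation used inside the function below: a 'hit' occurs whenever the prediction and the realised relative return have the same sign.

pred_toy <- c(0.02, -0.01, 0.03, -0.02) # Hypothetical predictions
real_toy <- c(0.01, 0.02, -0.01, -0.03) # Hypothetical realised relative returns
mean(pred_toy * real_toy > 0) # Hit ratio: 2 hits out of 4 => 0.5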
grid_par_hit <- function(eta, nrounds, lambda, train_matrix, test_features, test_label){
fit <- train_matrix %>%
xgb.train(data = ., # Data source (pipe input)
eta = eta, # Learning rate
objective = "reg:squarederror", # Objective function
max_depth = 4, # Maximum depth of trees
lambda = lambda, # Penalisation of leaf values
gamma = 0, # Penalisation of number of leaves
nrounds = nrounds, # Number of trees used
verbose = 0 # No comment
)
pred <- predict(fit, test_features) # Predictions based on fitted model & test values
return(mean(pred*test_label>0)) # Hit ratio!
}
test_label_hit <- test_sample$FR_Return # Dependent variable => very different!!! Big trick!
grid_par_hit(eta = 0.5, nrounds = 12, lambda = 1, # A test!
train_matrix, test_features, test_label)
## [1] 0.4833333
grd <- pmap(list(eta, nrounds, lambda), # Parameters for the grid search
grid_par_hit, # Function of the grid search
train_matrix = train_matrix, # Input for function: training data
test_features = test_features, # Input for function: test features
test_label = test_label_hit # Input for function: test labels (returns)
)
grd <- data.frame(eta, nrounds, lambda, hit = unlist(grd)) # Dataframe with all results
grd$eta <- as.factor(eta) # Parameters as categories (for plotting)
grd %>% ggplot(aes(x = eta, y = hit, color = eta)) + # Plot!
geom_point() + theme_light() +
ylim(0.48,0.52) +
geom_hline(yintercept = mean(test_label_hit>0), color = "#646464") + # Benchmark value (horiz line)
facet_grid(rows = vars(nrounds), cols = vars(lambda))
mean(test_label_hit>0) # Benchmark: assuming constant positive return (e.g., EW portfolio!)
## [1] 0.5006803
max(grd$hit) # Best hit ratio from models
## [1] 0.5112245
The validation metric is the hit ratio computed on FR_Return, i.e., the 1M forward return in excess of the cross-sectional average return. Hence, the hit ratio measures the proportion of times the model correctly predicts whether a stock will outperform the average return of all stocks in the dataset.
Many hit ratios are above 0.50, and several even exceed the benchmark, which is a good sign. In practice though, a hit ratio below 0.52 is probably not worth much (due to transaction costs).
Below, we test the performance metric that is most natural to portfolio managers: the average return. The only change lies in the return() statement of the function grid_par_ret(): we invest in a stock whenever its prediction is strictly positive.
grid_par_ret <- function(eta, nrounds, lambda, train_matrix, test_features, test_label){
fit <- train_matrix %>%
xgb.train(data = ., # Data source (pipe input)
eta = eta, # Learning rate
objective = "reg:squarederror", # Objective function
max_depth = 4, # Maximum depth of trees
lambda = lambda, # Penalisation of leaf values
gamma = 0, # Penalisation of number of leaves
nrounds = nrounds, # Number of trees used
verbose = 0 # No comment
)
pred <- predict(fit, test_features) # Predictions based on fitted model & test values
return(mean(test_label[pred>0])) # Mean return when prediction > 0 !
}
test_label_ret <- test_sample$F_Return # Dependent variable => very different!!!
grd <- pmap(list(eta, nrounds, lambda), # Parameters for the grid search
grid_par_ret, # Function of the grid search
train_matrix = train_matrix, # Input for function: training data
test_features = test_features, # Input for function: test features
test_label = test_label_ret # Input for function: test labels (returns)
)
grd <- data.frame(eta, nrounds, lambda, ret = unlist(grd)) # Dataframe with all results
grd$eta <- as.factor(eta) # Parameters as categories (for plotting)
grd %>% ggplot(aes(x = eta, y = ret, color = eta)) + # Plot!
geom_point() + theme_light() +
ylim(0.006,0.012) +
geom_hline(yintercept = mean(test_label_ret), color = "#646464") + # Benchmark value
facet_grid(rows = vars(nrounds), cols = vars(lambda))
## Warning: Removed 1 rows containing missing values (geom_point).
mean(test_label_ret) # Benchmark: average EW return
## [1] 0.008838865
The results are disappointing overall. In addition, they do not take transaction costs into account; these should be computed to make sure that the performance does not vanish once trading fees are paid.
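As a rough illustration (an assumption, not part of the original analysis), we can deduct a fixed monthly cost from the average return of each configuration, say 20 basis points per rebalancing date, and check which configurations survive.

fee <- 0.002 # Hypothetical monthly trading cost (20 bps): an assumption
grd %>%
    mutate(net_ret = ret - fee) %>% # Net average monthly return
    arrange(desc(net_ret)) %>% # Best configurations first
    head() # Show the top ones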
In this section, we investigate the impact of time on the validity of hyperparameters. We split the sample into three parts:
Part | Training starts | Validation starts | Validation ends |
---|---|---|---|
Part 1 | 2000-01-01 | 2004-01-01 | 2006-01-01 |
Part 2 | 2005-01-01 | 2009-01-01 | 2011-01-01 |
Part 3 | 2010-01-01 | 2014-01-01 | 2016-01-01 |
The parts are lagged by five years. The training sample is four years long, while validation is performed over the following two years; hence, each part requires six years of data. We will separate the validation stage from the testing stage.
train_start = c("2000-01-01", "2005-01-01", "2010-01-01")
val_start = c("2004-01-01", "2009-01-01", "2014-01-01")
test_start = c("2006-01-01", "2011-01-01", "2016-01-01")
stop_date = c("2009-01-01", "2014-01-01", "2019-01-01")
# Below, the parameters
nrounds <- c(5, 10, 20, 30, 50) # nrounds values (nb of trees)
lambda <- c(0.01, 0.1, 1, 10, 50) # lambda values
pars <- expand.grid(nrounds, lambda) # Exploring all combinations!
nrounds <- pars[,1]
lambda <- pars[,2]
Working with time series adds one dimension, hence we restrict our study to only two parameters: nrounds and lambda. This will ease the visual representation of the results. We fix the learning rate eta to a reasonable value of 0.5.
Because we want to compare outcomes at three different points in time, we work with the hit ratio, whose average level does not depend on economic cycles (unlike returns, for instance). Hence, we recycle the grid_par_hit() function built above. For simplicity, we resort to a plain loop over the three periods.
grd <- c() # Initiate output
for(i in 1:3){ # Loop over the parts
# FIRST, WE PREPARE THE DATA
train_features <- data_f %>% # Data
filter(Date >= train_start[i], Date < val_start[i]) %>% # Date filter
select(features) %>% as.matrix() # Keep features
train_label <- data_f %>% # Data
filter(Date >= train_start[i], Date < val_start[i]) %>% # Date filter
select(FLR_Return) %>% as.matrix() # Keep dependent variable
train_matrix <- xgb.DMatrix(data = train_features, label = train_label) # XGB format
val_features <- data_f %>% # Data
filter(Date >= val_start[i], Date < test_start[i]) %>% # Date filter
select(features) %>% as.matrix() # Keep features
val_label <- data_f %>% # Data
filter(Date >= val_start[i], Date < test_start[i]) %>% # Date filter
select(FR_Return) # Keep dependent variable: BIG TRICK here!
# SECOND, WE COMBINE pmap() AND grid_par_hit() & AGGREGATE ITERATIVELY
grd_temp <- pmap(list(0.5, nrounds, lambda), # Parameters for the grid search
grid_par_hit, # Function of the grid search
train_matrix = train_matrix, # Input for function: training data
test_features = val_features, # Input for function: test features
test_label = val_label # Input for function: test labels (returns)
)
grd <- rbind(grd, # Output of previous part(s)
data.frame(part = i, nrounds, lambda, hit = unlist(grd_temp))) # Output of latest part
}
Finally, we format and plot.
grd %>%
mutate_at(vars(part:lambda), as.factor) %>% # Change numbers into factors for plotting
ggplot(aes(x = part, y = hit)) +
geom_point() + theme_light() +
ylim(0.48,0.6) +
geom_hline(yintercept = 0.5, color = "grey") +
facet_grid(nrounds ~ lambda)
grd %>%
mutate_at(vars(part:lambda), as.factor) %>% # Change numbers into factors for plotting
ggplot(aes(x = nrounds, y = hit, color = part)) +
geom_point() + theme_light() +
ylim(0.48,0.6) +
geom_hline(yintercept = 0.5, color = "grey") +
facet_grid(part ~ lambda)
We provide two alternative representations of the results. In the first one, the x-axis follows the chronology, so that we can assess the stability of the results over time. In the second one, parts are colour-coded, so that we can assess which choices would have been made in a ‘live’ situation.
From the first graph, we see that, apart from lambda = 1, the values are relatively stable. It also shows that lambda = 0.01 is not the best choice we can make.
From the second graph, we can infer which choices would have been best at each of the three periods.
For simplicity, we will work with constant parameters in what follows. A fully rigorous treatment would require the validation to be carried out at each period (month) of the backtest in order to choose the optimal hyperparameters. We posit that there is some stability in the performance of hyperparameter values and start by looking at the average values over the three periods.
In order to choose the parameters, we compute averages.
grd %>% group_by(nrounds, lambda) %>%
summarise(avg_hit = mean(hit)) %>%
pivot_wider(names_from = "lambda", values_from = "avg_hit")
## `summarise()` has grouped output by 'nrounds'. You can override using the
## `.groups` argument.
## # A tibble: 5 × 6
## # Groups: nrounds [5]
## nrounds `0.01` `0.1` `1` `10` `50`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 0.526 0.530 0.517 0.535 0.510
## 2 10 0.531 0.536 0.538 0.543 0.532
## 3 20 0.526 0.534 0.526 0.534 0.519
## 4 30 0.527 0.544 0.538 0.527 0.528
## 5 50 0.529 0.539 0.536 0.535 0.531
Three stable values for nrounds are 10, 20 and 30. The best choices for lambda are 0.1, 1 and 10.
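To back the claim of stability, a complementary table (not in the original) reports the dispersion of hit ratios across the three parts for each (nrounds, lambda) pair: the smaller the value, the more stable the configuration.

grd %>% group_by(nrounds, lambda) %>%
    summarise(sd_hit = sd(hit), .groups = "drop") %>% # Dispersion across the 3 parts
    pivot_wider(names_from = "lambda", values_from = "sd_hit")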
To avoid look-ahead bias, we will only work with data posterior to the last validation point (December 2015). But since we have been working with a 12M forward dependent variable, we must shift the starting date to January 2017. This removes any possible overlap!
We then recycle code from the session on Decision Trees below and start with data preparation and initialisation.
tick <- unique(data$Tick) # Set of assets
sep_date <- as.Date("2017-01-01") # This date separates in-sample vs out-of-sample
t_oos <- data$Date[data$Date > sep_date] %>% # Out-of-sample dates (i.e., testing set)
unique() %>%
as.Date(origin = "1970-01-01")
Tt <- length(t_oos) - 1 # Nb of dates, avoid T = TRUE
nb_port <- 2 # Nb of portfolios
portf_weights <- array(0, dim = c(Tt, nb_port, length(tick))) # Store weights
portf_returns <- matrix(0, nrow = Tt, ncol = nb_port) # Store returns
Below, we adjust the weight function, with the equally-weighted (EW) portfolio as benchmark. We will invest in the 15 stocks with the highest (predicted) relative 12M outperformance.
weights_multi <- function(train_data, test_data, j){
N <- test_data$Tick %>% n_distinct() # Number of stocks
features <- c("Mkt_Cap", "P2B", "Vol_1M", "D2E", "Prof_Marg") # Hard-coded
if(j == 1){ # j = 1 => EW
return(rep(1/N,N))
}
if(j == 2){ # j = 2 => tree-based
# Below, we quickly format the data
train_features <- train_data %>% select(features) %>% as.matrix() # Indep. variable
train_label <- train_data$FLR_Return # Dependent variable
train_matrix <- xgb.DMatrix(data = train_features, label = train_label) # XGB format
fit <- train_matrix %>%
xgb.train(data = ., # Data source (pipe input)
eta = 0.6, # Learning rate
objective = "reg:squarederror", # Number of random trees
max_depth = 4, # Maximum depth of trees
nrounds = 20, # Number of trees used
lambda = 1, # Fixed value of hyperparameter
gamma = 1 # Fixed value of hyperparameter
)
xgb_test <- test_data %>% # Test sample => XGB format
select(features) %>%
as.matrix() %>%
xgb.DMatrix()
pred <- predict(fit, xgb_test) # Single prediction
return((pred>median(pred))/N*2) # Fifteen largest values, equally-weighted
}
}
And we proceed to the backtesting loop.
train_window <- 365 * 6 # We train on rolling windows of 6-1=5 years
for(t in 1:Tt){ # Loop on dates!
train_data <- data_f %>%
filter(Date < t_oos[t] - 365, # One year offset because FLR_Return
Date >= t_oos[t]- train_window) # Rolling window !!!
test_data <- data_f %>% filter(Date == t_oos[t]) # Current values
realized_returns <- data %>% # Computing returns via:
filter(Date == t_oos[t]) %>% # Filtering
select(F_Return) # Selecting
for(j in 1:nb_port){
portf_weights[t,j,] <- weights_multi(train_data, test_data, j) # Hard-coded params, beware!
portf_returns[t,j] <- sum(portf_weights[t,j,] * realized_returns)
}
}
apply(portf_returns, 2, mean) / apply(portf_returns, 2, sd) # Sharpe ratios!
## [1] 0.2092178 0.2454613
The ML-driven strategy posts a higher raw Sharpe ratio than the EW portfolio.
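Since returns are monthly, the Sharpe ratios above are monthly figures; a quick annualisation (sketch) simply scales them by the square root of 12.

sqrt(12) * apply(portf_returns, 2, mean) / apply(portf_returns, 2, sd) # Annualised Sharpe ratios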
Finally, below, we plot the evolution of the two portfolios.
plot_data <- (portf_returns+1) %>% apply(2,cumprod) # Values
plot_data <- rbind(1, plot_data) # Initiate at one
plot_data <- data.frame(t_oos, plot_data) # Adding the date
colnames(plot_data) <- c("Date", "EW", "ML") # Changing column names
plot_data %>% # Final formatting & plot
pivot_longer(-Date, names_to = "strategy", values_to = "value") %>%
ggplot(aes(x = Date, y = value, color = strategy)) +
geom_line() + theme_light() + geom_point() +
scale_color_manual(values = c("#EE6611", "#2299EE"))
Train a neural network with roughly 30 parameters and compute the bias and variance on a testing set. Perform the same operation on a network with 300 parameters and see the difference.
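For this first exercise, a possible starting point (a sketch only; it assumes the nnet package, but any neural network library would do) is the following: with 5 features and 4 hidden units, the network has (5+1)*4 + (4+1) = 29 parameters, close to the 30 that are targeted; around 42 hidden units would yield roughly 300.

library(nnet) # Single-hidden-layer network package (assumption)
fit_nn <- nnet(FLR_Return ~ Mkt_Cap + P2B + Vol_1M + D2E + Prof_Marg, # Same predictors as above
               data = na.omit(train_sample), # Drop rows with missing values, if any
               size = 4, # 4 hidden units => ~30 parameters
               linout = TRUE, # Regression (linear output)
               maxit = 200, trace = FALSE) # Iterations; silent fit
pred_nn <- predict(fit_nn, test_sample) # Predictions on the test sample
mean(pred_nn) - mean(test_sample$FLR_Return) # Bias
var(pred_nn) # Variance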
Explore the sensitivity of the performance of random forests using a grid search over the two key parameters: ntree (the number of trees in the forest) and mtry (the number of variables sampled at each split); a possible skeleton is sketched below.
Don’t forget to change/update the feature space!
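For the random forest exercise, a possible skeleton (assuming the randomForest package and reusing the hit ratio as metric; the feature space is kept unchanged here for brevity) could look like the following.

library(randomForest) # Random forest package (assumption)
rf_train <- na.omit(train_sample) # Drop missing values, if any
rf_features <- rf_train %>% select(features) %>% as.matrix() # Predictors
rf_label <- rf_train$FLR_Return # Dependent variable
rf_hit <- function(ntree, mtry){ # Hit ratio for one parameter pair
    fit <- randomForest(x = rf_features, y = rf_label, ntree = ntree, mtry = mtry)
    pred <- predict(fit, test_features) # Predictions on the test sample
    mean(pred * test_label_hit > 0) # Hit ratio (same trick as above)
}
grid_rf <- expand.grid(ntree = c(100, 300, 500), mtry = c(2, 3, 4)) # Small grid
grid_rf$hit <- pmap_dbl(grid_rf, rf_hit) # One hit ratio per grid point
grid_rf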
Complete the grid search by incorporating nrounds and maxdepth in the grid.