This is the companion notebook for the session on validation and tuning of machine learning models for factor investing.
In the financial applications below, the choice of dependent variable is not optimal. It was chosen to minimise code complexity and to ease the understanding of basic principles. The results may seem disappointing, but they are realistic! It takes experience and experimentation to make ML algorithms deliver good results. That’s up to you!
\(\rightarrow\) ABOUT THE NOTEBOOK:
In this section, we illustrate how model complexity affects the average error and the variance of predictions. To that purpose, we use simple trees.
We start by loading packages and importing the data.
library(tidyverse) # Data wrangling (and so much more) package
library(rpart) # Simple decision tree package
library(rpart.plot) # Tree plot package
load("data.RData") # Load the data
data <- data %>% # Compute the returns
select(-ESG_rank) %>% # Remove ESG: not enough points!
group_by(Tick) %>%
mutate(F_Return = lead(Close) / Close - 1, # Adding future 1M returns
FL_Return = lead(Close, 12) / Close - 1) %>% # Adding future 1Y returns
ungroup() %>%
group_by(Date) %>%
mutate(FR_Return = F_Return - mean(F_Return), # Relative return vs EW market
FLR_Return = FL_Return - mean(FL_Return)) %>% # Relative return vs EW market (long)
ungroup()
Then, we format the inputs that we will feed to the tree algorithm.
normalise <- function(v){ # This is a function that 'uniformalises' a vector
v <- v %>% as.matrix()
return(ecdf(v)(v))
}
data_f <- data %>% # Data with normalised values
select(-Close) %>% # Closing prices are not predictors
group_by(Date) %>% # Normalisation takes place for each given date
mutate_at(vars(Vol_1M:Prof_Marg), normalise) %>% # Apply normalisation on numerical columns
ungroup()
sep_date <- as.Date("2012-01-01") # Train vs test separation date
train_sample <- data_f %>% # Train sample
filter(Date <= sep_date)
test_sample <- data_f %>% # Test sample
filter(Date > sep_date) %>%
na.omit() # Remove missing data
Finally, we train and test simple trees.
fit <- rpart(FLR_Return ~ Mkt_Cap + P2B + Vol_1M + D2E + Prof_Marg, # Model
data = train_sample, # Data source
cp = 0.0001, # Tree complexity
maxdepth = 2 # Nb levels
) # The model is then trained!
rpart.plot(fit) # Plot the tree
In the above code, we impose a small complexity (cp) so that we can manually control the depth (and hence, the complexity) of the tree. Given the trained parameters (splits), we test the model on the recent data.
pred <- predict(fit, test_sample) # Predictions
mean(pred)-mean(test_sample$FLR_Return) # Bias
## [1] 0.008821996
var(pred) # Variance
## [1] 0.009755871
The first number shows the error in average prediction while the second one measures the dispersion of predictions.
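As a complement (not shown in the original output), both effects can be aggregated into the out-of-sample mean squared error, which is the metric used later in the grid search.

mean((pred - test_sample$FLR_Return)^2) # Out-of-sample MSE, which combines bias and variance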
## A deeper tree

Below, we test the same configuration, but with a maximum depth of five (beyond five, it is hard to read the values inside the tree).
fit <- rpart(FLR_Return ~ Mkt_Cap + P2B + Vol_1M + D2E + Prof_Marg, # Model
data = train_sample, # Data source
cp = 0.0001, # Tree complexity
maxdepth = 5 # Nb levels
) # The model is then trained!
rpart.plot(fit) # Plot the tree
Graphically, the model is evidently more complex, and the predictions change accordingly.
pred <- predict(fit, test_sample)
mean(pred)-mean(test_sample$FLR_Return) # Bias
## [1] -0.007871907
var(pred) # Variance
## [1] 0.01415666
The two quantities move in opposite directions: the magnitude of the bias is divided by roughly 1.12 (from 0.88% to 0.79%), while the variance is multiplied by 1.45 (from 0.98% to 1.42%). This is clear evidence of the bias-variance tradeoff.
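To visualise the tradeoff more systematically, the snippet below (a minimal sketch, not part of the original notebook) loops over several values of maxdepth and records the bias and the variance at each step.

purrr::map_dfr(1:6, function(depth){ # Loop over tree depths
    fit <- rpart(FLR_Return ~ Mkt_Cap + P2B + Vol_1M + D2E + Prof_Marg, # Same model as above
                 data = train_sample, cp = 0.0001, maxdepth = depth)
    pred <- predict(fit, test_sample) # Predictions
    tibble(maxdepth = depth,
           bias = mean(pred) - mean(test_sample$FLR_Return), # Average error
           variance = var(pred)) # Dispersion of predictions
})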
In this section, we have a look at the impact of hyperparameters on model performance. Because of computational constraints, we work with boosted trees (neural networks would take too much time!).
We start by preparing the inputs. Because we work with the simple (scaled) features, the dependent variable is the 12M (smooth) return. More precisely, we use the return relative to the average performance (the FLR_Return variable).
library(xgboost) # Load package
features <- c("Mkt_Cap", "P2B", "Vol_1M", "D2E", "Prof_Marg") # Predictors
train_features <- train_sample %>% select(features) %>% as.matrix() # Independent variables
train_label <- train_sample$FLR_Return # Dependent variable
train_matrix <- xgb.DMatrix(data = train_features, label = train_label) # XGB format
test_features <- test_sample %>% select(features) %>% as.matrix() # Independent variables
test_label <- test_sample$FLR_Return # Dependent variable
We recall some key hyperparameters of boosted trees: eta (the learning rate), nrounds (the number of boosting rounds, i.e., of trees) and lambda (the penalisation of leaf values).
Note: some other (related) parameters are also important, like the maximum depth, or gamma (complexity penalisation: the minimum loss reduction required to make a further partition on a leaf node, which effectively penalises the number of leaves, whereas lambda penalises the leaf values), but we stick to three of them for simplicity purposes.
Below, we want to test four to five values for each parameter. The set of feasible values is not straightforward at first. For instance, gamma should not be too high: if it were, the trees would only consist of the original node. Likewise, eta should not be too close to 1, because the model would then have a tendency to overfit the data, but it should not be too far from 1 either (else the model does not learn): a tough choice. Typically, eta and the number of boosting rounds are related: if eta is small, a large number of rounds is required, whereas with a large eta, additional rounds quickly become useless.
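As a side note (this is not used in what follows), one way to relate eta and the number of rounds empirically is cross-validation with early stopping via xgb.cv(). The call below is a sketch that assumes a fixed eta of 0.2 and lets the procedure suggest a number of rounds.

cv <- xgb.cv(params = list(eta = 0.2, # Learning rate (assumed value)
                           max_depth = 4, # Same depth as below
                           objective = "reg:squarederror"), # Squared loss
             data = train_matrix, # Training data (XGB format)
             nrounds = 200, # Upper bound on the number of rounds
             nfold = 5, # 5-fold cross-validation
             early_stopping_rounds = 10, # Stop when the CV error stops improving
             verbose = 0) # No comment
cv$best_iteration # Suggested number of boosting rounds for eta = 0.2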
We start by defining (agnostically at first) the sets of parameter values.
eta <- c(0.2, 0.4, 0.6, 0.8, 0.9) # eta values
nrounds <- c(10, 25, 50, 90) # nb of trees
lambda <- c(0.01, 0.1, 1, 10, 100) # lambda values
pars <- expand.grid(eta, nrounds, lambda) # Exploring all combinations!
eta <- pars[,1]
nrounds <- pars[,2]
lambda <- pars[,3]
Below, we code the function that we will use for the grid search. It takes train and test data as inputs, as well as the parameters we are interested in.
grid_par <- function(eta, nrounds, lambda, train_matrix, test_features, test_label){
fit <- train_matrix %>%
xgb.train(data = ., # Data source (pipe input)
eta = eta, # Learning rate
objective = "reg:squarederror", # Objective function
max_depth = 4, # Maximum depth of trees
lambda = lambda, # Penalisation of leaf values
gamma = 0, # Penalisation of number of leaves
nrounds = nrounds, # Number of trees used
verbose = 0 # No comment
)
pred <- predict(fit, test_features) # Predictions based on fitted model & test values
return(mean((pred-test_label)^2)) # Mean squared error
}
Finally, we test the function on one triplet of parameter values and then on all combinations, using the pmap() function. From the output, we create a graph.
grid_par(0.1, 10, 1, train_matrix, test_features, test_label) # Test of the function
## [1] 0.08081849
grd <- pmap(list(eta, nrounds, lambda), # Parameters for the grid search
grid_par, # Function on which to apply the grid search
train_matrix = train_matrix, # Input for function: training data
test_features = test_features, # Input for function: test features
test_label = test_label # Input for function: test labels (returns)
)
grd <- data.frame(eta, nrounds, lambda, error = unlist(grd)) # Dataframe with all results
grd$eta <- as.factor(eta) # Parameters as categories (for plotting)
grd %>% ggplot(aes(x = eta, y = error, fill = eta)) + # Plot!
geom_col() + theme_light() +
facet_grid(rows = vars(nrounds), cols = vars(lambda)) +
geom_hline(yintercept = var(test_label), linetype = 2)
var(test_label) # Benchmark = zero prediction
## [1] 0.04183968
The final value is a good benchmark. It’s the quadratic error we would make if we predicted a zero return at all times.
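To see why, note that var(test_label) is approximately the mean squared error of an all-zero prediction whenever the sample mean of the label is small; the one-liner below (a quick check, not in the original) makes the comparison explicit.

mean((0 - test_label)^2) # MSE of a constant zero prediction: close to var(test_label) since mean(test_label) is small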
There is one emerging pattern: nrounds (barplots from top to bottom) and lambda (from left to right) seem to be the main drivers of the error at this scale.
Overall, the performance of the models is not impressive. One possible reason is that the dataset is not wide enough: with only 30 assets, it is hard to learn patterns. A second explanation is that the chosen model is not a good option. A third one is that we are not looking at the right metric and may be facing a false negative. Finally, a fourth option is that the label we work with is not consistent with the chosen features.
Going further with caret: https://topepo.github.io/caret/random-hyperparameter-search.html
Going further with h2o: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/grid-search.html
Going further with the tidymodels: https://www.tidymodels.org
This is where we shift from the ML metric to the financial metric. Since we rebalance monthly, we need an indicator suited to this horizon. Below, we record the hit ratio for the one-month-ahead return, i.e., the percentage of times the prediction is in the right direction. Note that we perform a trick: the predictions are 12M forward relative returns, but we compare them to 1M forward relative returns! We again test the sensitivity to the three parameters.
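To fix ideas, here is a toy illustration (hypothetical numbers) of the hit ratio computation used inside the function below: a 'hit' occurs whenever the prediction and the realised relative return have the same sign.

pred_toy <- c(0.02, -0.01, 0.03, -0.02) # Hypothetical predictions
real_toy <- c(0.01, 0.02, -0.01, -0.03) # Hypothetical realised relative returns
mean(pred_toy * real_toy > 0) # Hit ratio: 2 hits out of 4 => 0.5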
grid_par_hit <- function(eta, nrounds, lambda, train_matrix, test_features, test_label){
fit <- train_matrix %>%
xgb.train(data = ., # Data source (pipe input)
eta = eta, # Learning rate
objective = "reg:squarederror", # Objective function
max_depth = 4, # Maximum depth of trees
lambda = lambda, # Penalisation of leaf values
gamma = 0, # Penalisation of number of leaves
nrounds = nrounds, # Number of trees used
verbose = 0 # No comment
)
pred <- predict(fit, test_features) # Predictions based on fitted model & test values
return(mean(pred*test_label>0)) # Hit ratio!
}
test_label_hit <- test_sample$FR_Return # Dependent variable => very different!!! Big trick!
grid_par_hit(eta = 0.5, nrounds = 12, lambda = 1, # A test!
train_matrix, test_features, test_label)
## [1] 0.4833333
grd <- pmap(list(eta, nrounds, lambda), # Parameters for the grid search
grid_par_hit, # Function of the grid search
train_matrix = train_matrix, # Input for function: training data
test_features = test_features, # Input for function: test features
test_label = test_label_hit # Input for function: test labels (returns)
)
grd <- data.frame(eta, nrounds, lambda, hit = unlist(grd)) # Dataframe with all results
grd$eta <- as.factor(eta) # Parameters as categories (for plotting)
grd %>% ggplot(aes(x = eta, y = hit, color = eta)) + # Plot!
geom_point() + theme_light() +
ylim(0.48,0.52) +
geom_hline(yintercept = mean(test_label_hit>0), color = "#646464") + # Benchmark value (horiz line)
facet_grid(rows = vars(nrounds), cols = vars(lambda))
mean(test_label_hit>0) # Benchmark: assuming constant positive return (e.g., EW portfolio!)
## [1] 0.5006803
max(grd$hit) # Best hit ratio from models
## [1] 0.5112245
The validation metric is the hit ratio computed on FR_Return, i.e., the 1M forward return in excess of the cross-sectional average return. Hence, the hit ratio measures the proportion of times the model correctly predicts whether a stock will outperform the average return of all stocks in the dataset.
Many hit ratios are above 0.50, and several even exceed the benchmark, which is a good sign. In practice though, a hit ratio below 0.52 is probably not worth much (due to transaction costs).
Below, we test the performance metric that is most natural to portfolio managers: the average return. The only change lies in the return() statement of the function grid_par_ret(): we invest in a stock whenever its prediction is strictly positive.
grid_par_ret <- function(eta, nrounds, lambda, train_matrix, test_features, test_label){
fit <- train_matrix %>%
xgb.train(data = ., # Data source (pipe input)
eta = eta, # Learning rate
objective = "reg:squarederror", # Objective function
max_depth = 4, # Maximum depth of trees
lambda = lambda, # Penalisation of leaf values
gamma = 0, # Penalisation of number of leaves
nrounds = nrounds, # Number of trees used
verbose = 0 # No comment
)
pred <- predict(fit, test_features) # Predictions based on fitted model & test values
return(mean(test_label[pred>0])) # Mean return when prediction > 0 !
}
test_label_ret <- test_sample$F_Return # Dependent variable => very different!!!
grd <- pmap(list(eta, nrounds, lambda), # Parameters for the grid search
grid_par_ret, # Function of the grid search
train_matrix = train_matrix, # Input for function: training data
test_features = test_features, # Input for function: test features
test_label = test_label_ret # Input for function: test labels (returns)
)
grd <- data.frame(eta, nrounds, lambda, ret = unlist(grd)) # Dataframe with all results
grd$eta <- as.factor(eta) # Parameters as categories (for plotting)
grd %>% ggplot(aes(x = eta, y = ret, color = eta)) + # Plot!
geom_point() + theme_light() +
ylim(0.006,0.012) +
geom_hline(yintercept = mean(test_label_ret), color = "#646464") + # Benchmark value
facet_grid(rows = vars(nrounds), cols = vars(lambda))
## Warning: Removed 1 rows containing missing values (geom_point).
mean(test_label_ret) # Benchmark: average EW return
## [1] 0.008838865
The results are disappointing overall. In addition, they do not take transaction costs into account; these should be computed to make sure that the performance does not vanish once trading fees are paid.
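As a rough illustration (an assumption, not part of the original analysis), we can deduct a fixed monthly cost from the average return of each configuration, say 20 basis points per rebalancing date, and check which configurations survive.

fee <- 0.002 # Hypothetical monthly trading cost (20 bps): an assumption
grd %>%
    mutate(net_ret = ret - fee) %>% # Net average monthly return
    arrange(desc(net_ret)) %>% # Best configurations first
    head() # Show the top ones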
In this section, we investigate the impact of time on the validity of hyperparameters. We split the sample into three parts:
Part | Training starts | Validation starts | Validation ends |
---|---|---|---|
Part 1 | 2000-01-01 | 2004-01-01 | 2006-01-01 |
Part 2 | 2005-01-01 | 2009-01-01 | 2011-01-01 |
Part 3 | 2010-01-01 | 2014-01-01 | 2016-01-01 |
The parts are lagged by five years. The training sample is four years long, while validation is performed over the following two years; hence, each part requires six years of data. We will separate the validation stage from the testing stage.
train_start = c("2000-01-01", "2005-01-01", "2010-01-01")
val_start = c("2004-01-01", "2009-01-01", "2014-01-01")
test_start = c("2006-01-01", "2011-01-01", "2016-01-01")
stop_date = c("2009-01-01", "2014-01-01", "2019-01-01")
# Below, the parameters
nrounds <- c(5, 10, 20, 30, 50) # nrounds values (nb of trees)
lambda <- c(0.01, 0.1, 1, 10, 50) # lambda values
pars <- expand.grid(nrounds, lambda) # Exploring all combinations!
nrounds <- pars[,1]
lambda <- pars[,2]
Working with time series adds one dimension, hence we restrict our study to only two parameters: nrounds and lambda. This will ease the visual representation of the results. We fix the learning rate eta to a reasonable value of 0.5.
Because we want to compare outcomes at three different points in time, we work with the hit ratio, whose average level does not depend on economic cycles (unlike returns, for instance). Hence, we recycle the grid_par_hit() function built above. For simplicity, we resort to a plain loop over the three periods.
grd <- c() # Initiate output
for(i in 1:3){ # Loop over the parts
# FIRST, WE PREPARE THE DATA
train_features <- data_f %>% # Data
filter(Date >= train_start[i], Date < val_start[i]) %>% # Date filter
select(features) %>% as.matrix() # Keep features
train_label <- data_f %>% # Data
filter(Date >= train_start[i], Date < val_start[i]) %>% # Date filter
select(FLR_Return) %>% as.matrix() # Keep dependent variable
train_matrix <- xgb.DMatrix(data = train_features, label = train_label) # XGB format
val_features <- data_f %>% # Data
filter(Date >= val_start[i], Date < test_start[i]) %>% # Date filter
select(features) %>% as.matrix() # Keep features
val_label <- data_f %>% # Data
filter(Date >= val_start[i], Date < test_start[i]) %>% # Date filter
select(FR_Return) # Keep dependent variable: BIG TRICK here!
# SECOND, WE COMBINE pmap() AND grid_par_hit() & AGGREGATE ITERATIVELY
grd_temp <- pmap(list(0.5, nrounds, lambda), # Parameters for the grid search
grid_par_hit, # Function of the grid search
train_matrix = train_matrix, # Input for function: training data
test_features = val_features, # Input for function: test features
test_label = val_label # Input for function: test labels (returns)
)
grd <- rbind(grd, # Output of previous part(s)
data.frame(part = i, nrounds, lambda, hit = unlist(grd_temp))) # Output of latest part
}
Finally, we format and plot.
grd %>%
mutate_at(vars(part:lambda), as.factor) %>% # Change numbers into factors for plotting
ggplot(aes(x = part, y = hit)) +
geom_point() + theme_light() +
ylim(0.48,0.6) +
geom_hline(yintercept = 0.5, color = "grey") +
facet_grid(nrounds ~ lambda)
grd %>%
mutate_at(vars(part:lambda), as.factor) %>% # Change numbers into factors for plotting
ggplot(aes(x = nrounds, y = hit, color = part)) +
geom_point() + theme_light() +
ylim(0.48,0.6) +
geom_hline(yintercept = 0.5, color = "grey") +
facet_grid(part ~ lambda)
We provide two alternative representations of the results. In the first one, the x-axis follows the chronology, so that we can assess the stability of the results over time. In the second one, parts are colour-coded, so that we can assess which choices would have been made in a ‘live’ situation.
From the first graph, we see that, apart from lambda = 1, the values are relatively stable. It also shows that lambda = 0.01 is not the best choice we can make.
From the second graph, we can infer which choices would have been best at each of the three periods.
For simplicity, we will work with constant parameters in what follows. A fully rigorous treatment would require the validation to be carried out at each period (month) of the backtest in order to choose the optimal hyperparameters. We posit that there is some stability in the performance of hyperparameter values and start by looking at the average values over the three periods.
In order to choose the parameters, we compute averages.
grd %>% group_by(nrounds, lambda) %>%
summarise(avg_hit = mean(hit)) %>%
pivot_wider(names_from = "lambda", values_from = "avg_hit")
## `summarise()` has grouped output by 'nrounds'. You can override using the
## `.groups` argument.
## # A tibble: 5 × 6
## # Groups: nrounds [5]
## nrounds `0.01` `0.1` `1` `10` `50`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 0.526 0.530 0.517 0.535 0.510
## 2 10 0.531 0.536 0.538 0.543 0.532
## 3 20 0.526 0.534 0.526 0.534 0.519
## 4 30 0.527 0.544 0.538 0.527 0.528
## 5 50 0.529 0.539 0.536 0.535 0.531
Three stable values for nrounds are 10, 20 and 30. The best choices for lambda are 0.1, 1 and 10.
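To back the claim of stability, a complementary table (not in the original) reports the dispersion of hit ratios across the three parts for each (nrounds, lambda) pair: the smaller the value, the more stable the configuration.

grd %>% group_by(nrounds, lambda) %>%
    summarise(sd_hit = sd(hit), .groups = "drop") %>% # Dispersion across the 3 parts
    pivot_wider(names_from = "lambda", values_from = "sd_hit")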
To avoid look-ahead bias, we will only work with data posterior to the last validation point (December 2015). But since we have been working with a 12M forward dependent variable, we must shift the starting date to January 2017. This removes any possible overlap!
We then recycle code from the session on Decision Trees below and start with data preparation and initialisation.
tick <- unique(data$Tick) # Set of assets
sep_date <- as.Date("2017-01-01") # This date separates in-sample vs out-of-sample
t_oos <- data$Date[data$Date > sep_date] %>% # Out-of-sample dates (i.e., testing set)
unique() %>%
as.Date(origin = "1970-01-01")
Tt <- length(t_oos) - 1 # Nb of dates, avoid T = TRUE
nb_port <- 2 # Nb of portfolios
portf_weights <- array(0, dim = c(Tt, nb_port, length(tick))) # Store weights
portf_returns <- matrix(0, nrow = Tt, ncol = nb_port) # Store returns
Below, we adjust the weight function, with the equally-weighted (EW) portfolio as benchmark. We will invest in the 15 stocks with the highest (predicted) relative 12M outperformance.
weights_multi <- function(train_data, test_data, j){
N <- test_data$Tick %>% n_distinct() # Number of stocks
features <- c("Mkt_Cap", "P2B", "Vol_1M", "D2E", "Prof_Marg") # Hard-coded
if(j == 1){ # j = 1 => EW
return(rep(1/N,N))
}
if(j == 2){ # j = 2 => tree-based
# Below, we quickly format the data
train_features <- train_data %>% select(features) %>% as.matrix() # Indep. variable
train_label <- train_data$FLR_Return # Dependent variable
train_matrix <- xgb.DMatrix(data = train_features, label = train_label) # XGB format
fit <- train_matrix %>%
xgb.train(data = ., # Data source (pipe input)
eta = 0.6, # Learning rate
objective = "reg:squarederror", # Number of random trees
max_depth = 4, # Maximum depth of trees
nrounds = 20, # Number of trees used
lambda = 1, # Fixed value of hyperparameter
gamma = 1 # Fixed value of hyperparameter
)
xgb_test <- test_data %>% # Test sample => XGB format
select(features) %>%
as.matrix() %>%
xgb.DMatrix()
pred <- predict(fit, xgb_test) # Single prediction
return((pred>median(pred))/N*2) # Fifteen largest values, equally-weighted
}
}
And we proceed to the backtesting loop.
train_window <- 365 * 6 # We train on rolling windows of 6-1=5 years
for(t in 1:Tt){ # Loop on dates!
train_data <- data_f %>%
filter(Date < t_oos[t] - 365, # One year offset because FLR_Return
Date >= t_oos[t]- train_window) # Rolling window !!!
test_data <- data_f %>% filter(Date == t_oos[t]) # Current values
realized_returns <- data %>% # Computing returns via:
filter(Date == t_oos[t]) %>% # Filtering
select(F_Return) # Selecting
for(j in 1:nb_port){
portf_weights[t,j,] <- weights_multi(train_data, test_data, j) # Hard-coded params, beware!
portf_returns[t,j] <- sum(portf_weights[t,j,] * realized_returns)
}
}
apply(portf_returns, 2, mean) / apply(portf_returns, 2, sd) # Sharpe ratios!
## [1] 0.2092178 0.2454613
The ML-driven strategy posts a higher raw Sharpe ratio than the EW portfolio.
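Since returns are monthly, the Sharpe ratios above are monthly figures; a quick annualisation (sketch) simply scales them by the square root of 12.

sqrt(12) * apply(portf_returns, 2, mean) / apply(portf_returns, 2, sd) # Annualised Sharpe ratios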
Finally, below, we plot the evolution of the two portfolios.
plot_data <- (portf_returns+1) %>% apply(2,cumprod) # Values
plot_data <- rbind(1, plot_data) # Initiate at one
plot_data <- data.frame(t_oos, plot_data) # Adding the date
colnames(plot_data) <- c("Date", "EW", "ML") # Changing column names
plot_data %>% # Final formatting & plot
pivot_longer(-Date, names_to = "strategy", values_to = "value") %>%
ggplot(aes(x = Date, y = value, color = strategy)) +
geom_line() + theme_light() + geom_point() +
scale_color_manual(values = c("#EE6611", "#2299EE"))
Train a neural network with roughly 30 parameters and compute the bias and variance on a testing set. Perform the same operation on a network with 300 parameters and see the difference.
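For this first exercise, a possible starting point (a sketch only; it assumes the nnet package, but any neural network library would do) is the following: with 5 features and 4 hidden units, the network has (5+1)*4 + (4+1) = 29 parameters, close to the 30 that are targeted; around 42 hidden units would yield roughly 300.

library(nnet) # Single-hidden-layer network package (assumption)
fit_nn <- nnet(FLR_Return ~ Mkt_Cap + P2B + Vol_1M + D2E + Prof_Marg, # Same predictors as above
               data = na.omit(train_sample), # Drop rows with missing values, if any
               size = 4, # 4 hidden units => ~30 parameters
               linout = TRUE, # Regression (linear output)
               maxit = 200, trace = FALSE) # Iterations; silent fit
pred_nn <- predict(fit_nn, test_sample) # Predictions on the test sample
mean(pred_nn) - mean(test_sample$FLR_Return) # Bias
var(pred_nn) # Variance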
Explore the sensitivity of the performance of random forests using a grid search over the two key parameters: ntree (the number of trees in the forest) and mtry (the number of variables sampled at each split); a possible skeleton is sketched below.
Don’t forget to change/update the feature space!
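For the random forest exercise, a possible skeleton (assuming the randomForest package and reusing the hit ratio as metric; the feature space is kept unchanged here for brevity) could look like the following.

library(randomForest) # Random forest package (assumption)
rf_train <- na.omit(train_sample) # Drop missing values, if any
rf_features <- rf_train %>% select(features) %>% as.matrix() # Predictors
rf_label <- rf_train$FLR_Return # Dependent variable
rf_hit <- function(ntree, mtry){ # Hit ratio for one parameter pair
    fit <- randomForest(x = rf_features, y = rf_label, ntree = ntree, mtry = mtry)
    pred <- predict(fit, test_features) # Predictions on the test sample
    mean(pred * test_label_hit > 0) # Hit ratio (same trick as above)
}
grid_rf <- expand.grid(ntree = c(100, 300, 500), mtry = c(2, 3, 4)) # Small grid
grid_rf$hit <- pmap_dbl(grid_rf, rf_hit) # One hit ratio per grid point
grid_rf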
Complete the grid search by incorporating nrounds and maxdepth in the grid.