This notebook is dedicated to the application of decision trees in quantitative investment (low frequency). This notebook will require the following non-standard packages:
In the financial applications below, the choice of dependent variable is not optimal. It was chosen to minimise code complexity and to ease the understanding of basic principles. Results may seem disappointing, but they are realistic! It takes experience and experimentation to make ML algorithms deliver good results. That’s up to you!
\(\rightarrow\) ABOUT THE NOTEBOOK:
First, we provide some intuition on how decision trees work using the diamond database. We show a first simple tree and then explain how it’s built. We start by looking at the data.
if(!require(rpart.plot)){install.packages(c("rpart", "rpart.plot"))} # The packages for trees!
library(tidyverse) # Data manipulation package
library(rpart) # Tree packages
library(rpart.plot) # Tree plots package
head(diamonds) # Having a look at the diamonds database (embedded in ggplot2/tidyverse)
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
summary(diamonds) # Descriptive stats of relevant columns
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
Briefly, there are some very small diamonds (0.2 carats) and some very big ones (5 carats), and prices range from 326 to 18,823. Diamonds are mostly characterised by their cut, clarity and color, which are ordered categorical variables. We now turn to the first tree.
fit <- rpart(price ~ carat + color + clarity + cut, # First tree: simple syntax!
data = diamonds, # Data source
maxdepth = 3, # Number of levels of splits
cp = 0.001) # Complexity parameter
rpart.plot(fit) # Plot
We then provide elements of interpretation. First, the data is split in two: small diamonds (less than 1 carat, to the left of the tree) and big diamonds (more than 1 carat, to the right). The first group makes up 65% of the original sample, versus 35% for the other one. In the first group, the average price is 1633 while in the second one, it is 8142.
In the rpart representation, the splitting condition is written below the cluster that it splits. When the condition is true, instances go to the cluster on the left; when it is false, they go to the cluster on the right.
Each group is then separately split in two. The first group (small diamonds) is again split according to carat weight:
- the very small diamonds (less than 0.63 carat, with average price of 1052 and representing 46% of the whole sample)
- the small diamonds (between 0.63 and 1 carat, with average price of 3059 and accounting for 19% of the whole sample)
Large diamonds are split in two:
- very large diamonds (more than 1.5 carats, 12% of sample, with average price of 12000)
- large diamonds (between 1 and 1.5 carats, 24% of sample, with average price of 6140)
The interesting part is that the last group is then split according to another variable: clarity. The large diamonds are homogeneous in size (no ‘big’ difference between 1 and 1.5 carats), hence quality comes into play:
- large low quality diamonds: I1 (stands for included, with many big imperfections), SI2, SI1 and VS2, have an average price of 5371
- large high quality diamonds: VS1, VVS1, VVS2 and IF (internally flawless) have an average price of 8535
Now the question is: how did the rpart() function grow this tree? Let’s start by looking at how to choose the splitting value for a single variable. For simplicity, we work with the carat variable only: we will thus split the diamonds according to this variable alone, into small diamonds versus large diamonds.
The core idea of decision trees is to create homogeneous clusters. Here, we want clusters of diamonds whose prices have the smallest possible dispersion. But let’s not forget that there have to be two clusters, so we must account for the dispersion over both of them: \[V(c)= \underbrace{\sum_{i:carat> c} \left( p_i-\bar{p}_{carat > c} \right)^2}_{\text{large diamond cluster }(>c)}+ \underbrace{ \sum_{i:carat< c} \left( p_i-\bar{p}_{carat < c} \right)^2}_{\text{small diamond cluster }(<c)},\] where \(\bar{p}_{carat > c}\) is the average price of all diamonds with weight larger than \(c\) and \(\bar{p}_{carat < c}\) is the average price of the small diamond cluster. Hence, one natural way of choosing \(c\) is to pick the value that minimises \(V(c)\).
Below, we code a didactic version of the computation of V(c) - i.e., using a loop syntax.
total_var <- c() # Initialization
threshold <- seq(0.3, 4, by = 0.1) # Defining carat values
for(i in 1:length(threshold)){ # Loop over the threshold values
small <- filter(diamonds, carat <= threshold[i]) # Diamonds below the threshold (in carat)
var_small <- var(small$price) * nrow(small) # Corresponding dispersion in diamond price
large <- filter(diamonds, carat > threshold[i]) # Diamonds above the threshold (in carat)
var_large <- var(large$price) * nrow(large) # Corresponding dispersion in diamond price
total_var[i] <- var_small + var_large # Total variation
}
ggplot(data.frame(threshold, total_var)) + theme_light() +
geom_line(aes(x = threshold, y = total_var)) # Plot
Here we go! The minimum is indeed reached for carat = 1! This partly explains the first split in the first tree above. The full explanation requires an analysis of the cross-section of predictors. Indeed, why was carat chosen, and not cut, color or clarity?
The answer is that the above study was also carried out for these variables, but they were no match for carat.
We now introduce some additional mathematical notation. We are given a sample \((Y_i,X_i^{(j)})\) of size \(N\), where \(Y_i\) is the predicted variable (the diamond price) and the \(X_i^{(j)}\) are the predictors: carat, color, cut and clarity; \(i\) is the index of the diamond and \(j\) the index of the predictor.
A regression tree seeks the splitting point as \(\underset{c}{\text{argmin}} \ V_N^{(j)}(c)\) with \[\begin{equation} V_N^{(j)}(c)= \sum_{X_i^{(j)}<c}\left(Y_i-m_N^{(j)-}(c) \right)^2 + \sum_{X_i^{(j)}>c}\left(Y_i-m_N^{(j)+}(c) \right)^2, \label{eqn:node} \end{equation}\] where \(m_N^{(j)-}(c)=\frac{1}{\#\{i,X_i^{(j)}<c\}}\sum_{X_i^{(j)}<c}Y_i\) and \(m_N^{(j)+}(c)=\frac{1}{\#\{i,X_i^{(j)}>c\}}\sum_{X_i^{(j)}>c}Y_i\) are the average values of \(Y\), conditional on \(X^{(j)}\) being smaller or larger than \(c\). The minimisation can be performed in two steps: first, for each predictor \(j\), find the threshold \(c^{*(j)}\) that minimises \(V_N^{(j)}(c)\); then, retain the predictor for which \(V_N^{(j)}(c^{*(j)})\) is the smallest.
This is how carat ends up as the winning splitting variable: it is the one that creates the most homogeneous clusters.
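To make this concrete, here is a small sketch: the helper best_split (a name we introduce for illustration) computes, for each predictor, the threshold that minimises the total dispersion V. Ordered factors (cut, color, clarity) are split on their integer codes, which is a simplification of how rpart actually handles them.
best_split <- function(x, y){ # x: predictor, y: price
x <- as.numeric(x) # Ordered factors -> integer codes
cuts <- sort(unique(x)) # Candidate values
cuts <- head(cuts, -1) + diff(cuts)/2 # Mid-points between consecutive values
V <- sapply(cuts, function(thr){ # Total dispersion for each candidate threshold
sum((y[x <= thr] - mean(y[x <= thr]))^2) + sum((y[x > thr] - mean(y[x > thr]))^2)
})
c(split = cuts[which.min(V)], V = min(V)) # Best threshold and associated dispersion
}
sapply(diamonds[c("carat", "cut", "color", "clarity")], best_split, y = diamonds$price) # carat should yield the smallest V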
This process is then repeated at each node of the tree until sufficient precision is reached or until the maximum depth is attained. The complexity parameter cp can also be used to determine how far the algorithm should go (a split is only retained if it improves the fit by at least a fraction cp of the initial error).
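As a quick illustration (a sketch), lowering cp lets the tree grow far beyond the three levels displayed above:
fit_deep <- rpart(price ~ carat + color + clarity + cut, # Same model as before
data = diamonds, # Data source
cp = 0.0001) # Much lower complexity penalty
nrow(fit_deep$frame) # Total number of nodes: much larger than in the small tree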
We now turn to the construction of trees on our small financial dataset.
We start by looking at the conditional average of future returns, given the market capitalisation. Conditional averages can be useful in detecting patterns: for instance, if the variable of interest is monotonic in the predictor, this probably means that the predictor is valuable.
The highly non-linear nature of trees nonetheless makes it difficult to discard predictors: a predictor that looks uninformative overall may still be very important inside one lower-level cluster.
load("data.RData") # Load the data
data <- data %>% arrange(Date,Tick) # Just making sure all is in order
tick <- unique(data$Tick) # Set of assets
data <- data %>%
select(-ESG_rank) %>% # Remove ESG: too few points
group_by(Tick) %>%
mutate(F_Return = lead(Close) / Close - 1) %>% # Adding forward/future returns
na.omit() %>% # Take out missing data
ungroup()
data %>% ggplot(aes(x = Mkt_Cap, y = F_Return)) + #geom_point() +
geom_smooth() + theme_light() # Plot
Problem: scale effects. There are very few points with capitalisations above $500B, which results in high uncertainty (shown by the grey area around the blue curve).
Solution: we normalise the characteristics.
normalise <- function(v){ # This is a function that 'uniformises' a vector
v <- v %>% as.matrix()
return(ecdf(v)(v))
}
data_f <- data %>% # Data with normalised values
dplyr::select(-Close) %>% # Closing prices are not predictors
group_by(Date) %>% # Normalisation takes place for each given date
mutate_if(is.numeric, normalise) %>% # Apply normalisation on numerical columns
ungroup() %>% # Ungroup
arrange(Date, Tick) # Just making sure the order is still the same as before
data_f$F_Return <- data$F_Return # We retrieve the original returns: better for interpretation
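As a quick sanity check, normalise() simply returns the empirical CDF of a vector evaluated at its own points, i.e., ranks rescaled to (0,1]:
normalise(c(10, 2, 7, 5)) # Should return 1.00 0.25 0.75 0.50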
data_f %>% ggplot(aes(x = Mkt_Cap, y = F_Return)) +
geom_smooth() + theme_light()
data_f %>% ggplot(aes(x = D2E, y = F_Return)) +
geom_smooth() + theme_light()
The pattern is quite different in this case. The overall trend is clearly descending, as the Size anomaly would suggest, even though our very small sample can by no means support or contradict this stylised fact.
It’s not obvious to the eye which binary partition will be the best (i.e., where to set the threshold on Mkt_Cap to obtain two homogeneous clusters). Just like for diamonds, we must resort to brute force and span all possibilities.
Below, we create 91 partitions depending on an increasing threshold. In each partition, there will be two groups: the small caps and the large caps. The small (resp. large) caps are the firms with capitalisation score below (resp. above) the threshold. What we are interested in is the dispersion of future returns inside each group and more importantly in the sum of the dispersions of the two groups (just like we did for diamonds).
While we could write a dedicated function to perform this task, we stick to the old-fashioned loop approach.
total_var <- c()
threshold <- (5:95)/100 # Looking at all percents from 5% to 95%
for(i in 1:length(threshold)){
small <- filter(data_f, Mkt_Cap <= threshold[i]) # Firms below the threshold (in Mkt_Cap)
var_small <- var(small$F_Return) * nrow(small) # Corresponding dispersion in future return
large <- filter(data_f, Mkt_Cap > threshold[i]) # Firms above the threshold (in Mkt_Cap)
var_large <- var(large$F_Return) * nrow(large) # Corresponding dispersion in future return
total_var[i] <- var_small + var_large # Total variation
}
ggplot(data.frame(threshold, total_var)) + theme_light() +
geom_line(aes(x = threshold, y = total_var)) # Plot
What this graph says is that the highest level of homogeneity intra-clusters is obtained for a split of Mkt_Cap around 0.48.
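Instead of reading the minimum off the graph, we can extract it directly:
threshold[which.min(total_var)] # Optimal splitting point (should be close to 0.48)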
Now, we can perform this exercise on other predictors. Below, we create a function that tests all of them and plots the result.
pts <- (5:95)/100 # Splitting points
features <- c("Mkt_Cap", "P2B", "Vol_1M", "D2E", "Prof_Marg") # Predictor list
tot_var <- function(indep, dep, points, data){
total_var <- c()
for(i in 1:length(points)){
low <- filter(data, (!!sym(indep)) <= points[i]) # Firms below the threshold
var_low <- var(low[[dep]]) * nrow(low) # Dispersion of the dependent variable below
high <- filter(data, (!!sym(indep)) > points[i]) # Firms above the threshold
var_high <- var(high[[dep]]) * nrow(high) # Dispersion of the dependent variable above
total_var[i] <- var_low + var_high # Total dispersion
}
return(total_var)
}
tot_var_res <- map(features, tot_var, points = pts, dep = "F_Return", data = data_f) %>%
unlist() %>%
matrix(ncol = length(features)) %>%
data.frame(split = pts)
colnames(tot_var_res) <- c(features, "Split")
tot_var_res %>%
pivot_longer(names_to = "Feature", values_to = "Total_variation", -Split) %>%
ggplot(aes(x = Split, y = Total_variation, color = Feature)) +
theme_light() + geom_line() # Plot
Clearly, the patterns are not as marked as for Mkt_Cap. For some variables, the impact of the split is minuscule (Prof_Marg). Hence, if we had to choose only one splitting variable, Mkt_Cap would be the best pick.
As we underlined above, in order to build the best possible tree, the algorithm needs to perform such studies over all variables (predictors) and find the one that achieves the smallest total dispersion in future returns.
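As a sketch of that comparison based on the quantities we just computed, the best split and the corresponding minimum dispersion can be extracted for each feature (slice_min requires dplyr >= 1.0.0):
tot_var_res %>%
pivot_longer(names_to = "Feature", values_to = "Total_variation", -Split) %>%
group_by(Feature) %>%
slice_min(Total_variation, n = 1) %>% # Best split for each feature
ungroup()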
We pursue our use of the package rpart to build and plot decision trees. We compare two trees: one is built on the original dataset and the other on the normalised data.
fit <- data %>% # Original data
select(-Tick, -Date, -Close) %>% # Take out the non-predictive variables
rpart(F_Return ~ ., # Model: F_Return as a function of all other variables
cp = 0.001, # Complexity: lower means more nodes/levels
maxdepth = 2, # Maximum depth (nb levels) of the tree
data = .) # Data source: from the pipe
rpart.plot(fit) # Plot
Again, the most important part is the interpretation. The cluster information is synthesised in the rounded rectangles, which show two figures: the first is the average value of the dependent variable (the one we forecast) and the second is the proportion of the sample inside the cluster. The cluster at the top is the whole sample.
The bold text details the splitting variables and levels and is displayed just below the clusters it divides in two. The first split is made according to the price-to-book (P2B) variable at the level 0.23. Instances with P2B larger than or equal to 0.23 go to the cluster on the left and those with P2B strictly smaller than 0.23 go to the cluster on the right.
Sadly, this choice creates two very imbalanced clusters: one with almost all the data (left) and one with a handful of instances. Looking at what happens is instructive (see below): the split captures a micro-phenomenon related to two banks, Citigroup and Bank of America, namely their rebound after the subprime crisis. There are only 8 observations in this set, which is a perfect illustration of the bias-variance trade-off: the bias is small (we stick to the data), but the variance is huge. This is the problem with non-normalised data: it allows the formation of clusters with very high average values (a 0.3 return!) which will not be meaningful in terms of out-of-sample generalisation.
data %>% filter(P2B < 0.23)
## # A tibble: 8 × 9
## Tick Date Close Vol_1M Mkt_Cap P2B D2E Prof_Marg F_Return
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BAC 2009-01-30 5.73 178. 41837. 0.206 390. -11.4 -0.400
## 2 BAC 2009-02-27 3.44 237. 25115. 0.124 390. -11.4 0.731
## 3 C 2009-02-27 13.2 244. 8175. 0.101 657. -106. 0.687
## 4 BAC 2009-03-31 5.96 212. 43657. 0.229 387. 11.9 0.309
## 5 C 2009-03-31 22.3 269. 13855. 0.176 512. 4.95 0.206
## 6 C 2009-04-30 26.8 142. 16814. 0.212 512. 4.95 0.220
## 7 C 2009-06-30 26.1 39.7 16373. 0.184 474. 11.6 0.0673
## 8 C 2009-07-31 27.9 71.7 35954. 0.197 474. 11.6 0.577
Thus, below, we perform the same analysis on the formatted/normalised data. We also allow for a deeper tree (maxdepth = 4 instead of 2).
fit <- data_f %>% # Normalised data
select(-Tick, -Date) %>% # Take out the non-predictive variables
rpart(F_Return~., # Model: F_Return as a function of all other variables
cp = 0.001, # Complexity: lower means more nodes/levels
maxdepth = 4, # Maximum depth (nb levels) of the tree
data = .) # Data source: from the pipe
rpart.plot(fit) # Plot!
In this case, the initial split is much more reasonable, as the two resulting groups encompass 53% and 47% of the original sample. We are not surprised by the splitting value of 0.48 for Mkt_Cap: we computed it just above. Since the dependent variable is exactly the future return, the values in the rounded rectangles show the average return in each cluster.
The average values are coded with color intensity: white (very light blue) for low values to blue for high values. Note that even after the normalisation of features, some very small clusters persist: the second split leads to one such group (the one to the right).
Finally, the scheme shows the highly nonlinear nature of the tree: the first three splits are performed via three different variables.
Machine learning tools are often referred to as black boxes but simple decision trees are quite transparent. In order to characterise the way they work, researchers have introduced metrics that evaluate the preponderance of variables in the models, i.e., in the prediction process - once the models have been trained (in the case of trees, we refer to Atkinson and Therneau (2015) for a deeper treatment). Each time a variable is chosen for a split, it reduces the (training) loss generated by the model. The variable importance sums these gains for each variable.
Below, we show how to extract such information using rpart. We recycle the last model that was trained.
fit$variable.importance # Immediate values
## Prof_Marg D2E Mkt_Cap P2B Vol_1M
## 0.3450010 0.2442460 0.2408473 0.1479721 0.1320031
y <- fit$variable.importance[features] # Extracting the numbers, reordered to match 'features'
tree_importance <- data.frame(features = features, # A nice dataframe with the values
importance = y/sum(y), # Normalised importance
model = "Simple tree") # Model name
tree_importance %>% # Graph
ggplot(aes(x = features, y = importance)) +
geom_col() + theme_light()
Given our sample, Prof_Marg is the most influential variable in the simple tree, with D2E and Mkt_Cap roughly tied just behind, and P2B and Vol_1M trailing.
NOTE: often, variable importance only makes sense if the features have the same scale. This is yet another incentive for pre-processing and rescaling.
Finally, for the sake of completeness, we provide the syntax for predictions below. We again rely on the previously trained model.
test <- data_f[c(1,1000,3333),] # Test sample (in sample, though)
predict(fit, test) # Prediction
## 1 2 3
## 0.01007559 0.01007559 0.03057492
This means that, according to the model’s forecast, these test instances will have returns slightly above zero.
Trees also work well for classification tasks. The error is then not measured with the MSE, but with impurity scores: the goal is to obtain clusters that are as pure as possible. Several choices are possible and allow one to fine-tune the relative importance of errors: the misclassification error, the Gini index and the cross-entropy are common examples. Below, we show how to implement trees for categorical data; this is sufficiently well integrated in R that the syntax barely changes.
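For reference, with \(p_k\) denoting the proportion of class \(k\) inside a node, the usual impurity measures are \[\text{misclassification error}=1-\max_k p_k, \qquad \text{Gini index}=1-\sum_k p_k^2, \qquad \text{cross-entropy}=-\sum_k p_k\log p_k;\] all three are minimised (at zero) when the node contains a single class.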
data_c <- data_f %>% mutate(FC_Return = F_Return > 0) # Binary categories: + vs - returns
fit <- rpart(FC_Return ~ . - Tick - Date - F_Return, # The model: remove the irrelevant variables
method = "class", # Classification task
data = data_c, # Data source
cp = 0.001, # Complexity
maxdepth = 4 # Maximum number of levels
)
rpart.plot(fit)
In the above syntax, ‘method’ is inferred by the function if it is not specified. For regression analysis, enforce method = “anova”.
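For instance, a minimal sketch that makes the regression explicit on our data (the resulting model is not used later):
fit_reg <- rpart(F_Return ~ . - Tick - Date, # Regression tree on the numerical target
method = "anova", # Enforce the regression method
data = data_f, # Data source
maxdepth = 2) # Maximum depth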
The only difference with regression trees is that the number inside the rounded rectangles is the proportion of instances that are TRUE, i.e., with positive future return. Note that the variable is not perfectly balanced in the whole sample, as 55% of instances are TRUE. Again, the first split is performed over the Mkt_Cap variable, but the secondary splits involve other features. The intensity of the colour of each cluster codes its purity.
predict(fit, data_c[c(1, 1000, 3333),])
## FALSE TRUE
## 1 0.4122939 0.5877061
## 2 0.5735294 0.4264706
## 3 0.4122939 0.5877061
predict(fit, data_c[c(1, 1000, 3333),], type = "vector")
## 1 2 3
## 2 1 2
By default, the prediction yields the probabilities associated with each class; with type = "vector", it returns the index of the most likely class.
Below, we show an example with three classes.
data_c2 <- data_f %>%
mutate(FC_Return = if_else(F_Return > 0.02, "Buy",
if_else(F_Return < -0.02, "Sell", "Hold")))
fit <- rpart(FC_Return ~ . - Tick - Date - F_Return, # The model: remove the irrelevant variables
method = "class", # Classification task
data = data_c2, # Data source
cp = 0.001, # Complexity
maxdepth = 4 # Maximum number of levels
)
rpart.plot(fit)
predict(fit, data_c[c(1, 1000, 3333),])
## Buy Hold Sell
## 1 0.4676949 0.2263443 0.3059608
## 2 0.3785311 0.1581921 0.4632768
## 3 0.4676949 0.2263443 0.3059608
predict(fit, data_c[c(1, 1000, 3333),], type = "vector")
## 1 2 3
## 1 3 1
For decision trees, the maximum number of splitting levels (i.e., the depth) is usually around 6, which corresponds to at most 2^6 = 64 clusters. When the investment universe comprises many assets (say 1,000), all instances are disseminated into one of these 64 groups (assuming the tree does indeed build 64 leaves). Some leaves will gather a handful of instances and others a lot, but at prediction time every stock ends up in one of these ‘buckets’ and all stocks in the same bucket share the same forecast. This is not practical in terms of investment decisions: typically, if the portfolio policy sets a fixed number of assets for diversification purposes, this creates problems because the cluster sizes are dictated by the tree.
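To see these bucket sizes concretely, a quick check on the last fitted tree (rpart stores, in fit$where, the terminal node of each training instance):
table(fit$where) # Number of training instances in each terminal leaf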
In the next section, we present enhancements and generalisations of trees that help solve these issues and that are also known to perform better in prediction exercises.
Decision trees have been extended and improved in two directions, both related to ensemble learning: the idea of combining learning/predicting tools in order to diversify (and reduce) the sources of error.
Random forests, as their name suggests, are collections of decision trees grown on randomly chosen predictors. In our sample, we have 5 predictors (Mkt_Cap, P2B, Vol_1M, D2E, Prof_Marg). The idea of random forests is to grow many trees based on subsets of these 5 variables. Obviously, when the initial number of predictors is small, all combinations could be tested, but with hundreds of factors, this becomes infeasible. For each instance, the final output is a mix (average or majority vote) of the outputs provided by all trees in the forest.
We provide a simple/compact example of implementation below. We show the syntax using the randomForest package for prediction and variable importance.
if(!require(randomForest)){install.packages("randomForest")}
library(randomForest)
set.seed(42) # Fixing random seed
fit <- data_f %>% # Normalised data
select(-Tick, -Date) %>% # Take out non predictive variables
randomForest(F_Return~., # Same formula as for simple trees!
data = ., # Data source (pipe input)
ntree = 12, # Number of random trees
mtry = 4, # Number of predictive variables used for each tree
replace = FALSE, # All different instances
sampsize = 3000 # Training sample size for each tree
)
predict(fit, test) # Prediction
## 1 2 3
## -0.05031840 -0.03475235 0.22706953
fit$importance # Variable importance
## IncNodePurity
## Vol_1M 3.274516
## Mkt_Cap 3.004441
## P2B 2.818915
## D2E 2.576378
## Prof_Marg 2.896019
RF_importance <- data.frame(features = features, # We store these results for later on
importance = fit$importance[features, ]/sum(fit$importance), # Reordered to match 'features'
model = "Random Forest") # Model name
Both the predictions and the variable importances are rather different from those of the simple tree. Note that random forests can yield suboptimal results because they give the same weight to all trees, even though some are more relevant than others. When a tree is built on variables that are poor predictors, its forecasts will perform badly, yet it will weigh as much as the more consistent trees. Boosted trees, which are presented below, do not suffer from this drawback.
Let’s have a look at classification.
set.seed(42) # Fixing random seed
fit_c <- data_c %>% # Normalised data
select(-Tick, -Date, -F_Return) %>% # Take out non predictive variables
mutate(FC_Return = as.factor(FC_Return)) %>%
randomForest(FC_Return~., # Same formula as for simple trees!
data = ., # Data source (pipe input)
ntree = 12, # Number of random trees
mtry = 4, # Number of predictive variables used for each tree
replace = FALSE, # All different instances
sampsize = 3000 # Training sample size for each tree
)
predict(fit_c, test) # Prediction
## 1 2 3
## TRUE FALSE TRUE
## Levels: FALSE TRUE
predict(fit_c, test, type = "prob") # Prediction
## FALSE TRUE
## 1 0.25000000 0.7500000
## 2 0.66666667 0.3333333
## 3 0.08333333 0.9166667
## attr(,"class")
## [1] "matrix" "array" "votes"
fit_c$importance # Variable importance
## MeanDecreaseGini
## Vol_1M 366.4950
## Mkt_Cap 256.0144
## P2B 283.9017
## D2E 247.1468
## Prof_Marg 299.1531
With type = "prob", the predictions are the probabilities of each class (FALSE versus TRUE, i.e., negative versus positive future return); by default, the predicted class is returned.
While random forests aggregate trees in a naive fashion, boosted trees are constructed to minimise the training error: each new tree is added/constructed in an optimal way. In this section, we work with one such approach called eXtreme Gradient Boosting (XGBoost). We refer to the original paper (Chen and Guestrin (2016)) for details, as well as to the didactic explanations at https://xgboost.readthedocs.io/en/latest/tutorials/model.html
The full derivation for quadratic loss is given in the last section of the notebook.
Below, we provide an example of how xgboost can be used.
In contrast with the previous tools, additional formatting is required for xgboost (with functions provided by the package).
if(!require(xgboost)){install.packages("xgboost")} # Install package
library(xgboost) # Load package
train_features <- data_f %>% select(features) %>% as.matrix() # Independent variable
train_label <- data_f$F_Return # Dependent variable
train_matrix <- xgb.DMatrix(data = train_features, label = train_label) # XGB format!
And here we go!
fit <- train_matrix %>%
xgb.train(data = ., # Data source (pipe input)
eta = 0.4, # Learning rate
objective = "reg:squarederror", # Objective function
max_depth = 4, # Maximum depth of trees
lambda = 0.1, # Penalisation of leaf values
gamma = 0.1, # Penalisation of number of leaves
subsample = 0.8, # Proportion of sample used for new trees
colsample_bytree = 0.8, # Proportion of columns used
nrounds = 50, # Number of trees used
verbose = 0 # No comment
)
xgb_test <- test %>% # Test sample => XGB format
select(features) %>%
as.matrix() %>%
xgb.DMatrix()
predict(fit, xgb_test) # Single prediction
## [1] 0.017173918 -0.007077186 0.048499655
We refer to the derivation of the algorithm for details on the role of eta, lambda and gamma.
The predicted values are different from those of the first two models! Variable importance is also supported in xgboost, as we show below. Be careful: if the gamma penalisation is too harsh, no tree will be built and variable importance will fail (no variable matters in this case: the model is essentially constant).
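As a sketch of this warning (the gamma value below is arbitrary and simply meant to be prohibitively large):
fit_flat <- train_matrix %>%
xgb.train(data = ., # Same training data as above
objective = "reg:squarederror", # Objective function
max_depth = 4, # Maximum depth of trees
gamma = 1e5, # Prohibitively harsh penalisation
nrounds = 10, # Number of trees used
verbose = 0) # No comment
predict(fit_flat, xgb_test) # If no split passes the gain threshold, forecasts are flat across instances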
importance <- xgb.importance(model = fit) # Variable importance
XGB_importance <- data.frame(features = importance$Feature, # Features as ordered by xgb.importance
importance = importance$Gain, # Gain-based importance
model = "XGBoost") # Model name
Importance <- rbind(tree_importance, RF_importance, XGB_importance) # All models!
Importance %>% ggplot(aes(x = features, y = importance, fill = model)) + # Plot
geom_col(position = "dodge") + theme_light()
For the simple tree, variable importance is rather concentrated, while for random forests it is more uniformly distributed (which makes sense mechanically: each tree is built on a random subset of predictors and instances). The boosted tree lies somewhere in the middle.
For a detailed account of the richness of possibilities/parameters of xgboost, we recommend the documentation page: https://xgboost.readthedocs.io/en/latest/parameter.html
Below, we show the syntax to perform classification tasks with xgboost. If the dependent variable is an ordered factor, the simplest approach is to translate the categories into the numbers 0, 1, 2, etc. (starting at zero is important). We take the more general, unordered approach here, even though performance categories are naturally ordered and monotonic responses matter in the structure of the model.
Thus, we set a threshold and build 3 categories.
r_thresh <- 0.02 # Return threshold
data_f$FC_Return <- data_f$F_Return # Duplicate return
data_f$FC_Return[data_f$F_Return <= (-r_thresh)] <- 0 # Low return
data_f$FC_Return[(data_f$F_Return > (-r_thresh)) & (data_f$F_Return < r_thresh)] <- 1 # Soso return
data_f$FC_Return[data_f$F_Return >= r_thresh] <- 2 # High return
train_features <- data_f %>% select(features) %>% as.matrix() # Independent variable
train_matrix <- xgb.DMatrix(data = train_features, # XGB matrix format!
label = data_f$FC_Return,
silent = TRUE)
In any case, the label must be converted into a numerical format and, for categorical targets, it has to take exactly the values 0, 1, 2, etc. The difference with the regression analysis lies in the definition of the objective function. Many objectives are coded in xgboost, for instance reg:squarederror for regression, binary:logistic for binary classification, and multi:softmax or multi:softprob for multiclass problems.
fit <- train_matrix %>%
xgb.train(data = ., # Data source (pipe input)
eta = 0.3, # Learning rate
objective = "multi:softprob", # Objective function
num_class = 3, # Number of classes
max_depth = 4, # Maximum depth of trees
nrounds = 10, # Number of trees used
verbosity = 0 # No warning message
)
xgb_test <- test %>% # Test sample => XGB format
select(features) %>%
as.matrix() %>%
xgb.DMatrix
predict(fit, xgb_test) %>% #Predictions
matrix(data = ., nrow = nrow(xgb_test), byrow = TRUE)
## [,1] [,2] [,3]
## [1,] 0.3525243 0.1776711 0.4698047
## [2,] 0.4430076 0.1766661 0.3803263
## [3,] 0.2909596 0.2133605 0.4956798
Since we chose multi:softprob as objective function, the output gives the probabilities for each class.
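If class labels are preferred to probabilities, a small sketch to recover them (max.col picks, for each row, the column with the highest probability; subtracting one maps back to the 0/1/2 coding):
pred_prob <- predict(fit, xgb_test) %>%
matrix(nrow = nrow(xgb_test), byrow = TRUE) # One row per instance, one column per class
max.col(pred_prob) - 1 # Most likely class, back on the 0/1/2 scale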
We now turn to applications of tree-based methods in portfolio construction. We proceed in three steps: we first prepare the data, then define the weighting scheme (the strategies we compare), and finally run the backtesting loop and evaluate the performance.
When the number of assets is large, simple regression trees may not be a good solution because the number of leaves will be smaller than the number of assets. Hence, several (possibly many) stocks will share the same forecast. This is of course not a problem for categorical data, but it can cause numerical problems when working with continuous variables (typically, the number of assets held in the portfolio). Hence, we directly work with boosted trees, which solves this issue.
Just as in the previous sessions, we start by formatting the data and creating all necessary variables.
sep_date <- as.Date("2008-01-01") # This date separates in-sample vs out-of-sample
t_oos <- data$Date[data$Date > sep_date] %>% # Out-of-sample dates (i.e., testing set)
unique() %>%
as.Date(origin = "1970-01-01")
Tt <- length(t_oos) - 1 # Nb of dates, avoid T = TRUE
nb_port <- 2 # Nb of portfolios
portf_weights <- array(0, dim = c(Tt, nb_port, length(tick))) # Store weights
portf_returns <- matrix(0, nrow = Tt, ncol = nb_port) # Store returns
We now turn to the definition of the weighting scheme. We wish to compare a tree-based strategy to the usual equally-weighted (EW) portfolio. We use the xgboost algorithm to forecast future returns and form an EW portfolio of the ten stocks with the highest predictions (or equivalently, the top 33%, because the sample has 30 stocks).
weights_multi <- function(train_data, test_data, j){
N <- test_data$Tick %>% unique() %>% length() # Number of stocks
features <- c("Mkt_Cap", "P2B", "Vol_1M", "D2E", "Prof_Marg") # Hard-coded, BEWARE!
if(j == 1){ # j = 1 => EW
return(rep(1/N,N))
}
if(j == 2){ # j = 2 => tree-based
# Below, we quickly format the data
train_features <- train_data %>% select(features) %>% as.matrix() # Indep. variable
train_label <- train_data$F_Return # Dependent variable
train_matrix <- xgb.DMatrix(data = train_features, label = train_label) # XGB format
fit <- train_matrix %>%
xgb.train(data = ., # Data source (pipe input)
eta = 0.5, # Learning rate
objective = "reg:squarederror", # Objective function
max_depth = 4, # Maximum depth of trees
nrounds = 15 # Number of trees used
)
xgb_test <- test_data %>% # Test sample => XGB format
select(features) %>%
as.matrix() %>%
xgb.DMatrix()
pred <- predict(fit, xgb_test) # Single prediction
w <- pred > quantile(pred, 2/3)
return(w / sum(w)) # Ten largest values, equally-weighted
}
}
We are ready to launch the backtesting loop.
for(t in 1:(length(t_oos)-1)){ # Stop before the last date: no fwd return!
if(t%%12==0){print(t_oos[t])} # Just checking the date status
train_data <- data_f %>% filter(Date < t_oos[t]) # Expanding window: all past values!!!
test_data <- data_f %>% filter(Date == t_oos[t]) # Current values
realized_returns <- data %>% # Computing returns via:
filter(Date == t_oos[t]) %>% # Filtering
select(F_Return) # Selecting
for(j in 1:nb_port){
portf_weights[t,j,] <- weights_multi(train_data, test_data, j) # Hard-coded params, beware!
portf_returns[t,j] <- sum(portf_weights[t,j,] * realized_returns)
}
}
## [1] "2008-12-31"
## [1] "2009-12-31"
## [1] "2010-12-31"
## [1] "2011-12-30"
## [1] "2012-12-31"
## [1] "2013-12-31"
## [1] "2014-12-31"
## [1] "2015-12-31"
## [1] "2016-12-30"
## [1] "2017-12-29"
## [1] "2018-12-31"
## [1] "2019-12-31"
## [1] "2020-12-31"
And finally, a brief look at relative performance.
apply(portf_returns, 2, mean) # Average monthly return
## [1] 0.00952691 0.01179321
apply(portf_returns, 2, sd) # Monthly volatility
## [1] 0.04694945 0.05411423
apply(portf_returns, 2, mean) / apply(portf_returns, 2, sd) # Sharpe ratio
## [1] 0.2029185 0.2179318
t_test <- t.test(portf_returns[,2]-portf_returns[,1]) # t-test
t_test
##
## One Sample t-test
##
## data: portf_returns[, 2] - portf_returns[, 1]
## t = 1.488, df = 155, p-value = 0.1388
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.0007422896 0.0052748976
## sample estimates:
## mean of x
## 0.002266304
The two strategies are quite comparable: a t-statistic of 1.488 is not significant at conventional levels. This is not surprising, given the small differences in the classical indicators above. The boosted portfolio has a higher average return, but also a higher volatility. All in all, it does not strongly improve the allocation process.
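To complement these figures, a quick sketch of the cumulative performance of both strategies:
data.frame(Date = t_oos[1:Tt], # Out-of-sample dates
EW = cumprod(1 + portf_returns[, 1]), # Value of 1 dollar invested in the EW portfolio
XGB = cumprod(1 + portf_returns[, 2])) %>% # Value of 1 dollar invested in the boosted portfolio
pivot_longer(names_to = "Strategy", values_to = "Value", -Date) %>%
ggplot(aes(x = Date, y = Value, color = Strategy)) +
geom_line() + theme_light() # Plot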
Use the reference guide if need be: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
Reminder: one important degree of freedom is the horizon on which this return is computed.
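As a sketch, switching to a three-month horizon only requires changing the lead in the return definition (F_Return_3M is a name we introduce here; the resulting column is not used elsewhere):
data_3m <- data %>%
group_by(Tick) %>%
mutate(F_Return_3M = lead(Close, 3) / Close - 1) %>% # 3-month forward return
ungroup() %>%
na.omit() # The last three dates of each stock have no 3-month forward return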
In this section, we provide more details on how xgboost works. We assume a quadratic loss function.
What we seek to minimise is
\[O=\underbrace{\sum_{i=1}^I \text{loss}(y_i,\tilde{y}_i)}_{\text{error term}}+ \underbrace{\sum_{j=1}^J\Omega(T_j)}_{\text{regularisation term}}\] The first term (over all instances) measures the distance between the true label and the output from the model. The second term (over all trees) penalises models that are too complex.
For simplicity, we work with \(\text{loss}(y,\tilde{y})=(y-\tilde{y})^2\) and recall that boosted models are additive: the prediction after \(J\) trees is \(m_J(\mathbf{x})=\sum_{j=1}^J T_j(\mathbf{x})=m_{J-1}(\mathbf{x})+T_J(\mathbf{x})\), so that \[O=\sum_{i=1}^I \left(y_i-m_{J-1}(\mathbf{x}_i)-T_J(\mathbf{x}_i)\right)^2+ \sum_{j=1}^J\Omega(T_j)\]
We have already built trees \(T_1,\dots,T_{J-1}\) (and hence model \(m_{J-1}\)): how should we choose tree \(T_J\)? \[\begin{align*} O&=\sum_{i=1}^I \left(y_i-m_{J-1}(\mathbf{x}_i)-T_J(\mathbf{x}_i)\right)^2+ \sum_{j=1}^J\Omega(T_j) \\ &=\sum_{i=1}^I\left\{y_i^2+m_{J-1}(\mathbf{x}_i)^2+T_J(\mathbf{x}_i)^2 \right\} + \sum_{j=1}^{J-1}\Omega(T_j)+\Omega(T_J) \\ & \quad -2 \sum_{i=1}^I\left\{y_im_{J-1}(\mathbf{x}_i)+y_iT_J(\mathbf{x}_i)-m_{J-1}(\mathbf{x}_i) T_J(\mathbf{x}_i)\right\}\quad \text{cross terms} \\ &= \sum_{i=1}^I\left\{T_J(\mathbf{x}_i)^2-2\left(y_i-m_{J-1}(\mathbf{x}_i)\right) T_J(\mathbf{x}_i) \right\} +\Omega(T_J) + c \end{align*}\] All terms that are known at step \(J\) (i.e., those that involve only \(y_i\) and \(m_{J-1}\)) are gathered in the constant \(c\): they do not affect the optimisation over \(T_J\).
\(\rightarrow\) things are simple with quadratic loss. For more complicated loss functions, Taylor expansions are used (see the original paper).
We now take care of the penalisation. For a given tree \(T\), we write \(T(\mathbf{x})=w_{q(\mathbf{x})}\), where \(w\) is the output value of a leaf and \(q(\cdot)\) is the function that maps an input to its final leaf. We write \(l=1,\dots,L\) for the indices of the leaves of the tree. In XGBoost, complexity is defined as: \[\Omega(T)=\gamma L+\frac{\lambda}{2}\sum_{l=1}^Lw_l^2,\] where \(\gamma\) penalises the number of leaves and \(\lambda\) penalises the magnitude of the leaf values (see the gamma and lambda arguments in the code above).
Below, we omit the index \(J\) on the leaf values \(w\) and the mapping \(q\) of the tree of interest (i.e., the last one).
We aggregate both sections of the objective. We write \(I_l\) for the set of the indices of the instances belonging to leaf \(l\). Then,
\[\begin{align*}
O&= \sum_{i=1}^I\left\{T_J(\mathbf{x}_i)^2-2\left(y_i-m_{J-1}(\mathbf{x}_i)\right) T_J(\mathbf{x}_i) \right\} + \gamma L+\frac{\lambda}{2}\sum_{l=1}^Lw_l^2 + c \\
&=\sum_{i=1}^I\left\{w_{q(\mathbf{x}_i)}^2-2\left(y_i-m_{J-1}(\mathbf{x}_i)\right)w_{q(\mathbf{x}_i)} \right\} + \gamma L+\frac{\lambda}{2}\sum_{l=1}^Lw_l^2 + c\\
&= \sum_{l=1}^L \left(-2w_l\sum_{i\in I_l}\left(y_i -m_{J-1}(\mathbf{x}_i)\right)+ w_l^2\left(\#\{i\in I_l\}+\frac{\lambda}{2}\right)\right)+ \gamma L + c
\end{align*}\]
For each leaf, the function is of the form \(aw_l+\frac{b}{2}w_l^2\), with \(a=-2\sum_{i\in I_l}(y_i -m_{J-1}(\mathbf{x}_i))\) and \(b=2\left(\#\{i\in I_l\}+\frac{\lambda}{2}\right)\); its minimum, \(-\frac{a^2}{2b}\), is reached at \(w_l=-a/b\). Thus, writing \(\#\{\cdot\}\) for the cardinal function that counts the number of items in a set, \[\begin{align*}
\mathbf{\rightarrow} \quad w^*_l&=\frac{\sum_{i\in I_l}(y_i -m_{J-1}(\mathbf{x}_i))}{\#\{i\in I_l\}+\frac{\lambda}{2}}, \\
O_L(q)&=-\sum_{l=1}^L \frac{\left(\sum_{i\in I_l}(y_i -m_{J-1}(\mathbf{x}_i))\right)^2}{\#\{i\in I_l\}+\frac{\lambda}{2}}+\gamma L,
\end{align*}\] up to the constant \(c\), where we highlight the dependence of the objective both on \(q\) (the structure of the tree) and on \(L\) (the number of leaves). The optimal leaf value \(w_l^*\) is simply a shrunk version of the average residual inside the leaf. Indeed, the meta-shape of the tree remains to be determined.
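As a quick numerical check of the closed-form expression for \(w_l^*\) (with made-up residuals):
set.seed(1) # Reproducibility
res <- rnorm(20) # Residuals y_i - m_{J-1}(x_i) inside one leaf
lambda <- 0.1 # Penalisation of leaf values
leaf_obj <- function(w) sum((res - w)^2) + lambda / 2 * w^2 # Leaf-wise part of the objective
optimize(leaf_obj, interval = c(-3, 3))$minimum # Numerical minimiser
sum(res) / (length(res) + lambda / 2) # Closed-form w*_l: both values should coincide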
Final problem: the tree structure! This is essentially what the \(q\) function encodes: what is the best depth? Do we continue to grow the tree or not? Nodes are split greedily: for a candidate split of a node \(O\) into a left child \(L'\) and a right child \(R'\), the gain is \[\text{Gain}=\text{Gain}_{L'}+\text{Gain}_{R'}-\text{Gain}_O-\gamma, \quad \text{with} \quad \text{Gain}_X=\frac{\left(\sum_{i\in I_X}(y_i -m_{J-1}(\mathbf{x}_i))\right)^2}{\#\{i\in I_X\}+\frac{\lambda}{2}},\] and the split is retained only if the gain is positive.
Here, \(\text{Gain}_O\) is the gain of the original (unsplit) node. About the \(-\gamma\) adjustment: the split creates one net new leaf (two new leaves minus the old one), hence \(\Delta L = 1\).
NOTE: XGBoost also applies a learning rate \(\eta \in (0,1]\): the contribution of each new tree is scaled by \(\eta\), which slows down the fitting and usually improves out-of-sample performance.