This is the companion notebook dedicated to data processing. This stage can easily be overlooked but is crucial in machine learning. We briefly discuss categorical variables and logistic regression. We also take a few steps towards ML-driven factor investing in the last section.
\(\rightarrow\) ABOUT THE NOTEBOOK:
First, one of the most important issues: MISSING DATA!
Below, we import packages & the data.
if(!require(naniar)){install.packages(c("naniar", "visdat", "zoo"))}
library(tidyverse) # Loading THE package/environment
library(naniar) # Missing data exploration
library(visdat) # Missing data visualization
library(zoo) # Great package for time-series
load("data_full.RData")
Let’s determine which variables are the most affected… (this step takes a bit of time because we use the large dataset)
gg_miss_var(data_full) # Proportion of missing points per column
vis_dat(data_full, warn_large_data = F) # Visualization of data types per column
The two most affected columns are indeed P2B (price-to-book) and D2E (debt-to-equity).
Let’s deal with them using 2 different techniques: time-series imputation and cross-sectional imputation.
First, let us take a closer look.
data_full %>% filter(is.na(P2B)) # Where are the missing P2B ratios?
## # A tibble: 22,664 x 7
## Tick Date Close Vol_1M Mkt_Cap P2B D2E
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ADMT 2006-06-30 0.3 84.0 16.2 NA NA
## 2 ADMT 2006-07-31 0.256 85.7 13.8 NA NA
## 3 ADMT 2006-08-31 0.28 84.9 15.1 NA NA
## 4 ADMT 2006-09-29 0.33 126. 17.8 NA NA
## 5 ADMT 2006-10-31 0.33 148. 17.8 NA NA
## 6 ADMT 2006-11-30 0.28 106. 15.1 NA NA
## 7 ADSK 2019-08-30 143. 39.0 31366. NA NA
## 8 ADSK 2019-09-30 148. 34.3 32433. NA NA
## 9 ADSK 2019-10-31 147. 29.1 32358. NA NA
## 10 ADSK 2019-11-29 181. 21.8 39723. NA NA
## # … with 22,654 more rows
data_full %>%
filter(Tick == "ADMT") %>%
ggplot(aes(x = Date, y = P2B)) + geom_line() + theme_light()
For time-series imputation, we will use the na.locf function from the zoo package. Let’s see what it does.
seq_na <- sin(1:24) %>% round(3)
seq_na[10:12] <- NA
seq_na
## [1] 0.841 0.909 0.141 -0.757 -0.959 -0.279 0.657 0.989 0.412 NA
## [11] NA NA 0.420 0.991 0.650 -0.288 -0.961 -0.751 0.150 0.913
## [21] 0.837 -0.009 -0.846 -0.906
na.locf(seq_na)
## [1] 0.841 0.909 0.141 -0.757 -0.959 -0.279 0.657 0.989 0.412 0.412
## [11] 0.412 0.412 0.420 0.991 0.650 -0.288 -0.961 -0.751 0.150 0.913
## [21] 0.837 -0.009 -0.846 -0.906
The missing points were replaced by the last preceding value. Handy!
data_full %>%
group_by(Tick) %>% # Perform the operation for EACH stock
mutate(P2B = na.locf(P2B, na.rm = F)) %>% # Replace P2B by imputed P2B
ungroup() %>% # Ungroup
filter(Tick == "ADMT", # Let's check!
Date > "2004-01-01",
Date < "2009-01-01") %>%
ggplot(aes(x = Date, y = P2B)) + geom_line() + theme_light()
No more holes!
Another way to proceed is to assign a mean or median score, as if the missing value for the company is equal to some “average” or representative value. The median is more robust than the mean because it is less affected by outliers.
data_full %>%
group_by(Date) %>% # Perform the operation for EACH date
mutate(med_P2B = median(P2B, na.rm = T)) %>% # Compute the median P2B
ungroup() %>% # Ungroup
mutate(P2B = if_else(is.na(P2B), med_P2B, P2B)) %>% # If P2B is NA, replace it! If not, don't.
filter(Tick == "ADMT", # Let's check!
Date > "2004-01-01",
Date < "2009-01-01") %>%
ggplot(aes(x = Date, y = P2B)) + geom_line() + theme_light()
Here again, the holes have been filled, but the new values are very different.
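To visualise the contrast directly, the two imputed series can be overlaid for a single stock. This is a quick sketch: the column names P2B_locf and P2B_med are ours and not part of the original notebook.
data_full %>%
group_by(Tick) %>% # Stock-by-stock: time-series imputation
mutate(P2B_locf = na.locf(P2B, na.rm = F)) %>% # Last observation carried forward
group_by(Date) %>% # Date-by-date: cross-sectional imputation
mutate(P2B_med = if_else(is.na(P2B), median(P2B, na.rm = T), P2B)) %>% # Median replacement
ungroup() %>%
filter(Tick == "ADMT", Date > "2004-01-01", Date < "2009-01-01") %>% # Same window as above
pivot_longer(c(P2B_locf, P2B_med), names_to = "Method", values_to = "Value") %>%
ggplot(aes(x = Date, y = Value, color = Method)) + geom_line() + theme_light()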
We choose to skip the complicated topic of outlier detection to focus directly on feature engineering (we refer to Aggarwal (2016) for an in-depth treatment of outlier analysis). Below, we start with artificially generated data and introduce simple rescaling methods.
On the particular topic of feature engineering, we recommend the book http://www.feat.engineering.
Length <- 100 # Length of the sequence
x <- exp(sin(1:Length)) # Original data
data <- data.frame(index = 1:Length, x = x) # Data framed into dataframe
ggplot(data, aes(x = index, y = x)) + # Plot
geom_col() + theme_light()
Nothing special about this sample. We choose to work with positive values, but this is without much loss of generality. Below, we scale the variable using 3 methods:
standard <- (x - mean(x)) / sd(x) # Standardisation
norm_0_1 <- (x - min(x)) / (max(x)-min(x)) # [0,1] reduction
unif <- ecdf(x)(x) # Uniformisation
data_norm <- data.frame( # Formatting the data
index = 1:Length, # Index of point/instance
standard = standard,
norm_0_1 = norm_0_1,
unif = unif) %>% pivot_longer(-index, names_to = "Type", values_to = "value")
ggplot(data_norm, aes(x = index, y = value, fill = Type)) + # Plot!
geom_bar(stat = "identity", position = "dodge") + theme_light()
ggplot(data_norm, aes(x = index, y = value, fill = Type)) + # Plot!
geom_col() + theme_light() +
facet_grid(Type~.) # This option creates 3 concatenated graphs to ease comparison
By definition, standardisation produces both positive and negative values.
In terms of distribution of the data, this gives the following differences:
ggplot(data_norm, aes(x = value, fill = Type)) +
geom_histogram(position = "dodge") + theme_light() +
facet_grid(Type ~.)
With respect to shape, the green and red distributions are close to the original one. It is only the support that changes: the min/max rescaling ensures all values lie in the [0,1] interval. In both cases, the smallest values (on the left) display a spike in distribution. By construction, this spike disappears under the uniformisation: the points are evenly distributed over the unit interval.
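As a quick sanity check (assuming no ties in the sample, which is the case here), the uniformised values are nothing but normalised ranks:
all.equal(unif, rank(x) / Length) # ecdf(x)(x) coincides with rank(x)/n when there are no ties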
In this section, we investigate how these rescaling choices affect simple analyses. Specifically, we compare the impact of rescaling on basic regression results.
We start by importing the smaller dataset and computing returns (our dependent variable).
load("data.RData")
data <- data %>% arrange(Date,Tick) # Just making sure all is in order
tick <- unique(data$Tick) # Set of assets
data <- data %>%
group_by(Tick) %>% # Perform analysis stock-by-stock
mutate(F_Return = lead(Close) / Close - 1) %>% # Adding forward/future returns
select(-ESG_rank) %>% # Too many missing points for ESG
na.omit() %>% # Take out missing data
ungroup()
summary(data) # Descriptive statistics
## Tick Date Close Vol_1M
## Length:7620 Min. :1999-12-31 Min. : 0.217 Min. : 5.867
## Class :character 1st Qu.:2005-03-31 1st Qu.: 19.134 1st Qu.: 15.954
## Mode :character Median :2010-07-15 Median : 32.233 Median : 21.811
## Mean :2010-07-15 Mean : 49.032 Mean : 27.047
## 3rd Qu.:2015-10-30 3rd Qu.: 57.090 3rd Qu.: 31.861
## Max. :2021-01-29 Max. :442.987 Max. :268.563
## Mkt_Cap P2B D2E Prof_Marg
## Min. : 4467 Min. : 0.1013 Min. : 0.00 Min. :-106.260
## 1st Qu.: 67368 1st Qu.: 1.4214 1st Qu.: 31.69 1st Qu.: 5.021
## Median : 137288 Median : 2.3444 Median : 66.57 Median : 10.318
## Mean : 160012 Mean : 10.9708 Mean : 217.29 Mean : 11.736
## 3rd Qu.: 210463 3rd Qu.: 4.2090 3rd Qu.: 212.85 3rd Qu.: 19.163
## Max. :2255969 Max. :1678.9053 Max. :14585.61 Max. : 145.583
## F_Return
## Min. :-0.578853
## 1st Qu.:-0.032371
## Median : 0.008876
## Mean : 0.008480
## 3rd Qu.: 0.050793
## Max. : 1.273783
First, the descriptive statistics. Looking at the dispersion between the median and both the minimum and maximum gives a first impression of outliers. For instance, the maximum D2E is more than 200 times the median: this is a big number. The same observation holds for P2B. Sometimes, it is worthwhile to understand whether these values are errors or stem from very small values in the denominator. For instance, the outlier in P2B comes from very low book values for Boeing in 2017.
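To trace such outliers back to specific firm-dates, a quick inspection along the following lines can help (a sketch; output not shown):
data %>% arrange(desc(P2B)) %>% select(Tick, Date, P2B) %>% head() # Largest P2B observations
data %>% arrange(desc(D2E)) %>% select(Tick, Date, D2E) %>% head() # Largest D2E observations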
Then, we rescale the predictors (features) using two normalising functions. For consistency, we compare the two methods that map inputs into the [0,1] interval; a secondary reason is that some ML tools work better with bounded inputs (neural networks, for instance). Below, we create the two dataframes used in the analysis.
NOTE: it is imperative to rescale the data AT EACH DATE.
norm_unif <- function(v){ # This function uniformises a vector (maps values to their empirical CDF)
v <- v %>% as.matrix()
return(ecdf(v)(v))
}
norm_0_1 <- function(v){ # This function rescales a vector linearly to the [0,1] interval (min-max)
return((v-min(v))/(max(v)-min(v)))
}
data_0 <- data %>% # Data with normalised values
group_by(Date) %>% # Normalisation takes place for each given date
mutate_if(is.numeric, norm_0_1) %>% # Apply normalisation on numerical columns
ungroup() %>% # Ungroup
select(-Tick, -Date, -Close) # Take out superfluous columns
data_0$F_Return <- data$F_Return # We recover the original returns
data_u <- data %>% # Data with normalised values
group_by(Date) %>% # Normalisation takes place for each given date
mutate_if(is.numeric, norm_unif) %>% # Apply normalisation on numerical columns
ungroup() %>% # Ungroup
select(-Tick, -Date, -Close) # Take out superfluous columns
data_u$F_Return <- data$F_Return # We recover the original returns
Below, we check the distribution of the variables.
data_0 %>%
select(-F_Return) %>% # Remove returns
pivot_longer(cols = everything(),
names_to = "Attribute", values_to = "Value") %>% # Convert to 'compact' column mode
ggplot(aes(x = Value, fill = Attribute)) +
geom_histogram() + theme_light() + # Plot histograms
facet_grid(Attribute~., scales = "free") # Stack the histograms
data_u %>%
select(-F_Return) %>% # Remove returns
pivot_longer(cols = everything(),
names_to = "Attribute", values_to = "Value") %>% # Convert to 'compact' mode
ggplot(aes(x = Value, fill = Attribute)) +
geom_histogram() + theme_light() + # Plot histograms
facet_grid(Attribute~.) # Stack the histograms
Apart from a minor issue with D2E (due to outliers?), the scaling seems fine. But clearly, the distributions of all variables are strongly impacted by the scaling choice.
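The D2E issue can be quantified quickly: the extreme outliers presumably squeeze most min-max scaled observations towards zero. A small check (sketch):
data_0 %>% summarise(share_D2E_below_0.1 = mean(D2E < 0.1, na.rm = T)) # Share of scaled D2E values close to zero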
We are now ready to perform the regressions. We start with the scaled data.
fit <- lm(F_Return ~ ., data = data_0) # Regression on first set
summary(fit) # Regression output
##
## Call:
## lm(formula = F_Return ~ ., data = data_0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59932 -0.04049 0.00110 0.04230 1.25884
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0148156 0.0026586 5.573 2.6e-08 ***
## Vol_1M 0.0074568 0.0037455 1.991 0.04653 *
## Mkt_Cap -0.0082188 0.0040300 -2.039 0.04144 *
## P2B -0.0068701 0.0038150 -1.801 0.07177 .
## D2E -0.0002801 0.0043232 -0.065 0.94834
## Prof_Marg -0.0107288 0.0035458 -3.026 0.00249 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08141 on 7614 degrees of freedom
## Multiple R-squared: 0.003526, Adjusted R-squared: 0.002871
## F-statistic: 5.388 on 5 and 7614 DF, p-value: 5.975e-05
If we set the significance threshold at 5%, then some variables stand out: Prof_Marg (strongly), Mkt_Cap (weakly) and Vol_1M (weakly). With this database, future returns are negatively related to firm size. This is true in all analyses that we can/will do (with trees, notably).
We now turn to the ‘uniformised’ data.
fit <- lm(F_Return ~ ., data = data_u) # Regression on second set
summary(fit) # Regression output
##
## Call:
## lm(formula = F_Return ~ ., data = data_u)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59535 -0.04085 0.00132 0.04245 1.26286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.016038 0.004292 3.737 0.000188 ***
## Vol_1M 0.003753 0.003294 1.139 0.254706
## Mkt_Cap -0.017952 0.003759 -4.776 1.82e-06 ***
## P2B -0.003728 0.003286 -1.135 0.256620
## D2E -0.005834 0.003400 -1.716 0.086226 .
## Prof_Marg 0.009142 0.003645 2.508 0.012163 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0814 on 7614 degrees of freedom
## Multiple R-squared: 0.003665, Adjusted R-squared: 0.003011
## F-statistic: 5.602 on 5 and 7614 DF, p-value: 3.7e-05
In this case, Mkt_Cap is now strongly significant, but Vol_1M’s impact has faded, while D2E and Prof_Marg emerge as weaker drivers. This underlines the importance of feature engineering. For nonlinear models, we refer to Galili and Meilijson (2016, arXiv) for an analysis of the impact of feature transformations on tree splits.
To illustrate the difference between the two rescaling methods, we build a simple dataset, comprising 3 firms and 3 dates.
firm <- c(rep(1,3), rep(2,3), rep(3,3)) # Firms (3 lines for each)
date <- rep(c(1,2,3),3) # Dates
cap <- c(10, 50, 100, # Market capitalisation
15, 10, 15,
200, 120, 80)
return <- c(0.06, 0.01, -0.06, # Return values
-0.03, 0.00, 0.02,
-0.04, -0.02,0.00)
data_toy <- data.frame(firm, date, cap, return) # Aggregation of data
data_toy <- data_toy %>% # Transformation of data
group_by(date) %>%
mutate(cap_0 = norm_0_1(cap), cap_u = norm_unif(cap))
data_toy # Display the data
## # A tibble: 9 x 6
## # Groups: date [3]
## firm date cap return cap_0 cap_u
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 10 0.06 0 0.333
## 2 1 2 50 0.01 0.364 0.667
## 3 1 3 100 -0.06 1 1
## 4 2 1 15 -0.03 0.0263 0.667
## 5 2 2 10 0 0 0.333
## 6 2 3 15 0.02 0 0.333
## 7 3 1 200 -0.04 1 1
## 8 3 2 120 -0.02 1 1
## 9 3 3 80 0 0.765 0.667
Let’s briefly comment on this synthetic data. We assume that dates are ordered chronologically and far apart: each date stands for a year or the beginning of a decade, but the (forward) returns are computed on a monthly basis. The first firm is hugely successful and multiplies its capitalisation by ten over the period. The second firm remains stable cap-wise, while the third one plummets. If we look at ‘local’ future returns, they are strongly negatively related to size for the first and third firms. For the second one, there is no clear pattern.
Date-by-date, the analysis is fairly similar, though slightly nuanced.
While the relationship is not always perfectly monotonic, there seems to be a link between size and return: typically, investing in the smallest firm would have been a very good strategy within this sample.
Now let’s look at the regressions.
lm(return ~ cap_0, data = data_toy) %>% summary() # First regression (min-max rescaling)
##
## Call:
## lm(formula = return ~ cap_0, data = data_toy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.044970 -0.016278 0.003722 0.013425 0.043722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01628 0.01374 1.185 0.2746
## cap_0 -0.04970 0.02137 -2.326 0.0529 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02867 on 7 degrees of freedom
## Multiple R-squared: 0.4359, Adjusted R-squared: 0.3553
## F-statistic: 5.409 on 1 and 7 DF, p-value: 0.05294
lm(return ~ cap_u, data = data_toy) %>% summary() # Second regression (uniformised feature)
##
## Call:
## lm(formula = return ~ cap_u, data = data_toy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.02667 -0.02000 0.00000 0.01667 0.03333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.06000 0.01981 3.028 0.01916 *
## cap_u -0.10000 0.02752 -3.634 0.00835 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02247 on 7 degrees of freedom
## Multiple R-squared: 0.6536, Adjusted R-squared: 0.6041
## F-statistic: 13.21 on 1 and 7 DF, p-value: 0.008351
In terms of p-value (last column), the first estimation for the cap coefficient is above 5% while the second is below 1%. One possible explanation for this discrepancy is the standard deviation of the variables. The deviations are equal to 0.47 and 0.29 for cap_0 and cap_u, respectively. Values like market capitalisations can have very large ranges and are thus subject to substantial deviations (even after scaling). Working with uniformised variables reduces dispersion and can help solve this problem.
Note that this is a double-edged sword: while it can help avoid false negatives, it can also lead to false positives.
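The dispersion figures quoted above can be verified directly (data_toy is grouped by date, hence the ungroup()):
data_toy %>% ungroup() %>% summarise(sd_cap_0 = sd(cap_0), sd_cap_u = sd(cap_u)) # Standard deviations of the two rescaled features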
In supervised learning, regression tasks consist of explaining or predicting real numbers, while classification exercises rely on categorical data, which are often harder to handle.
Below, we talk interchangeably of categories or classes.
First, we start from the scaled sample data_u, duplicate it and create a categorical variable: FC_Return (forward categorical). We use a simple function for that purpose. It is defined as \[\tilde{y}_{t,i}=\left\{ \begin{array}{rll} -1 & \text{ if } & y_{t,i} < -r \\ 0 & \text{ if } & y_{t,i} \in [-r,r] \\ +1 & \text{ if } & y_{t,i} > r \\ \end{array} \right.\] for some threshold value \(r>0\). Intuitively, if the dependent variable y is a measure of profitability, a +1 value signals a buying intention, a -1 value a selling intention, while a zero value recommends the status quo.
thresh_coding <- function(x, r){ # The function that creates the categorical data
z <- x # Duplicate original data
z[x < (-r)] <- (-1) # Sell!
z[x > r] <- 1 # Buy!
z[x >= (-r) & x <= r] <- 0 # Hold!
return(z)
}
r_thresh <- 0.03 # Decision threshold
data_c <- round(data_u,3) # Duplicate data and round numbers (3 decimals is enough)
data_c <- data_c %>% mutate(FC_Return = thresh_coding(F_Return, r_thresh))
head(data_c) # Let's check!
## # A tibble: 6 x 7
## Vol_1M Mkt_Cap P2B D2E Prof_Marg F_Return FC_Return
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.933 0.2 0.5 0.2 0.4 0.009 0
## 2 0.667 0.233 0.3 0.533 0.2 0.074 1
## 3 0.867 0.433 0.1 0.867 0.867 -0.035 -1
## 4 0.467 0.733 0.433 0.933 0.667 0.024 0
## 5 0.5 0.933 0.967 0.067 0.6 0.022 0
## 6 0.967 0.167 0.533 0.367 0.133 -0.122 -1
We confirm the coding of the categorical variable on a small subset of the sample.
data_c$FC_Return <- data_c$FC_Return %>% as.factor() # Turn it into real category!
summary(data_c$FC_Return)
## -1 0 1
## 1970 2921 2729
The last line shows the relative importance of the categories, which greatly depends on r_thresh. Our choice of r_thresh gives a rather balanced distribution for the categories. Smaller values of r_thresh would decrease the number of zeros and increase the number of \(\pm 1\).
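To see this, one can tabulate the classes obtained with a smaller threshold (a quick sketch; counts not shown):
table(thresh_coding(data_u$F_Return, 0.01)) # A lower threshold shrinks the "hold" class and inflates the +/-1 classes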
if(!require(dummies)){ install.packages("dummies")}
library(dummies) # One package for one-hot encoding
one_hot <- dummy(data_c$FC_Return) # The encoding
colnames(one_hot) <- c("Sell", "Hold", "Buy")
data_c <- cbind(data_c, one_hot) # Combine the two
data_c %>% # Let's see!
select(-Vol_1M, -Mkt_Cap) %>% # Remove a few variables first..
head()
## P2B D2E Prof_Marg F_Return FC_Return Sell Hold Buy
## 1 0.500 0.200 0.400 0.009 0 0 1 0
## 2 0.300 0.533 0.200 0.074 1 0 0 1
## 3 0.100 0.867 0.867 -0.035 -1 1 0 0
## 4 0.433 0.933 0.667 0.024 0 0 1 0
## 5 0.967 0.067 0.600 0.022 0 0 1 0
## 6 0.533 0.367 0.133 -0.122 -1 1 0 0
The number of additional columns is logically equal to the number of categories.
One-hot encoding will mostly be used for neural networks (decision trees handle categorical variables quite well).
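Should the dummies package be unavailable (it has been archived from CRAN at times), the same encoding can be obtained in base R with model.matrix. This is a sketch and one_hot_base is our own name:
one_hot_base <- model.matrix(~ FC_Return - 1, data = data_c) # One column per class, no intercept
colnames(one_hot_base) <- c("Sell", "Hold", "Buy") # Levels are ordered -1, 0, 1
head(one_hot_base) # Quick check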
When there are only two classes (e.g., buy vs sell), it is possible to extend the concept of regression analysis to categorical data. This classical tool is referred to as logistic regression because it is linked to the logistic function \((1+e^{-x})^{-1}\). This function takes values in [0,1], which is convenient because its output can be interpreted as the probability that an observation belongs to a given class. If we use the simple model \[\frac{p}{1-p}=e^{\beta_0+\sum_{i=1}^N\beta_ix_i},\]
where \(p\) is the probability to buy, then \(p=\left(1+e^{-\beta_0-\sum_{i=1}^N\beta_ix_i}\right)^{-1}\). The \(\beta_i\) are usually estimated via maximum likelihood (assuming i.i.d. Bernoulli samples).
Below, we show how this kind of model can be very simply implemented in R.
data_c$FB_Return <- (data_c$F_Return>0) # Binary variable: 1 (true) if positive return, 0 if not
fit <- glm(FB_Return ~ Mkt_Cap + P2B + Vol_1M + D2E, # Generalised linear model
family = "binomial", # Binary variable
data = data_c) # Data source
summary(fit) # Output of regression
##
## Call:
## glm(formula = FB_Return ~ Mkt_Cap + P2B + Vol_1M + D2E, family = "binomial",
## data = data_c)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.380 -1.271 1.032 1.082 1.171
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.53069 0.10344 5.130 2.89e-07 ***
## Mkt_Cap -0.28296 0.08406 -3.366 0.000762 ***
## P2B -0.01393 0.08118 -0.172 0.863786
## Vol_1M -0.13940 0.08160 -1.708 0.087576 .
## D2E -0.13465 0.08404 -1.602 0.109116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10459 on 7619 degrees of freedom
## Residual deviance: 10446 on 7615 degrees of freedom
## AIC: 10456
##
## Number of Fisher Scoring iterations: 4
Again, the variables that stand out are Mkt_Cap (strongly) and Vol_1M (weakly), both with a negative impact. Each unit increase in Mkt_Cap multiplies the odds of a positive return by a factor \(e^{-0.283}\approx 0.753\). Finally, we show the syntax for prediction.
predict(fit, data_c[1:3,]) # Predicts the log-odds
## 1 2 3
## 0.3101408 0.2958293 0.1691674
predict(fit, data_c[1:3,], type = "response") # Gives the probability
## 1 2 3
## 0.5769196 0.5734226 0.5421913
1/(1+exp(-predict(fit, data_c[1:3,]))) # Check via logistic function
## 1 2 3
## 0.5769196 0.5734226 0.5421913
Several output formats are possible. The log-odds are \(\log\left(\frac{p}{1-p}\right)\). The probability of buying is obtained via the logistic function.
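For completeness, below is a small sketch (prob and signal are our names) that extracts the odds ratios implied by the coefficients and turns the fitted probabilities into crude buy/sell signals via a 0.5 cut-off:
exp(coef(fit)) # Odds ratios: multiplicative impact of one unit of each feature on p/(1-p)
prob <- predict(fit, data_c, type = "response") # Fitted probabilities of a positive return
signal <- if_else(prob > 0.5, 1, -1) # Naive decision rule: buy if p > 0.5, sell otherwise
table(signal) # Distribution of the resulting signals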
Refinements in labelling include:
(subtitle: ANOTHER STEP TOWARDS ML-BASED FACTOR INVESTING)
The core idea of the course is to translate data about firms into investment decisions. We seek to infer correlation (‘causal’ would be too strong) patterns between corporations’ characteristics and their subsequent returns. The base model is \[\mathbf{y}=f(\mathbf{X})+\mathbf{\epsilon},\]
where y is some proxy for future performance while X embeds the current values of firm attributes. More precisely, for one given date, t, we will try to estimate \(f_t\) \[y_{s+1,j}=f_t\left(x_{s,j}^{(1)}, \dots, x_{s,j}^{(K)}\right)+\mathbf{\epsilon}_{s,j}, \quad s= t-1, \ t-2,\dots\]
where j is the index of the firm and s the index of time. In the above equation, there are K features or firm attributes. To make things simple, we will consider that y is the return of the asset. This (future) return will thus be a function of the current values of stocks’ characteristics.
If our sample comprises only one date (s = t-1), then we will try to build a model that maps the features of t-1 to the returns computed from t-1 to t (i.e., after the features are observed). This stage, where we determine the optimal (in some sense) model \(f_t\) according to past observations, is called training: the parameters of the model are adjusted so that the predicted values are as close as possible to those that actually occurred.
Using only one date will probably be insufficient and the model will lack robustness. Hence, it will be preferable to train the model on many dates so that the model learns from different states of the market.
Before we move to a simple application of this principle using a linear \(f_t\), we illustrate the concept of learning on increasingly large samples.
Below, we use the gganimate package to demonstrate the simple idea behind training.
if(!require(gganimate)){install.packages("gganimate")}
library(gganimate)
Iteration <- 16 # Number of steps in the animation
reg_data <- data.frame( # Create dataframe with:
x = sin(1:Iteration), # x-axis (independent/explanatory variable)
iteration = Iteration, # Index of the step
status = c(rep("Past",Iteration-1),"Current") # Which point is new versus old
)
reg_data <- reg_data %>% mutate(y = x + rnorm(Iteration)/3) # Linear Data Generator
tmp <- reg_data # Copy of the data with fewer points to pass to animation
for (j in 1:(Iteration-2)) { # Index of animation step
tmp <- tmp[-nrow(tmp),] # Take out one point
tmp$iteration <- Iteration - j # Change animation step number
tmp$status[nrow(tmp)] <- "Current" # Change label of current point
reg_data <- rbind(reg_data, tmp) # Aggregate
}
p <- ggplot(reg_data) + theme_light() +
geom_point(aes(x = x, y = y, color = status), size = 4) + # Points
stat_smooth(aes(x = x, y = y), color = "black", method = "lm", se = FALSE) + # Linear model
transition_manual(iteration) # Animation variable
anim <- animate(p, duration = 20) # 20 second animation
anim_save("reg.gif", anim) # Saving for export/include
Each time a new (red) point is added to the training data, the linear model is adjusted: the black line comes slightly closer to the new point. As more data accumulates, the impact of each new point becomes less and less pronounced because the parameter values converge. If the data generating process exhibits some stationarity properties (samples follow a recurrent pattern) and if the model type is well chosen, learning will be effective.
In this section, we will consider a linear model: \[f_t\left(x_{s,j}^{(1)}, \dots, x_{s,j}^{(K)}\right)=\alpha_t+\sum_{k=1}^K\beta_{t}^{(k)}x_{s,j}^{(k)},\]
where the parameters \(\alpha_t\) and \(\beta_t^{(k)}\) are estimated via ordinary least squares (OLS). In the above expression, we do not write the error term, but do not forget that it exists!
First, we have a look at the full sample (using data_u).
lm(F_Return ~ ., data = data_u) %>% summary() # Regression output
##
## Call:
## lm(formula = F_Return ~ ., data = data_u)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59535 -0.04085 0.00132 0.04245 1.26286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.016038 0.004292 3.737 0.000188 ***
## Vol_1M 0.003753 0.003294 1.139 0.254706
## Mkt_Cap -0.017952 0.003759 -4.776 1.82e-06 ***
## P2B -0.003728 0.003286 -1.135 0.256620
## D2E -0.005834 0.003400 -1.716 0.086226 .
## Prof_Marg 0.009142 0.003645 2.508 0.012163 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0814 on 7614 degrees of freedom
## Multiple R-squared: 0.003665, Adjusted R-squared: 0.003011
## F-statistic: 5.602 on 5 and 7614 DF, p-value: 3.7e-05
As a confirmation of our previous exercise, Mkt_Cap stands out, meaning that the addition of other predictors does not alter its impact on future returns.
We now turn to portfolio choice.
First, we create the weights function. The strategy forecasts future returns and invests equally in the assets whose forecast lies above a given cross-sectional quantile (the thresh argument below; with thresh = 0.5, this would be a median split). This is in fact a particular case of the penalised predictive regression seen in the previous (LASSO) session, except that the forecasting model here is a plain, unpenalised linear regression.
weights_multi <- function(past_data, curr_data, j, thresh){
if(j == 1){ # j = 1 => regression-based
fit <- lm(F_Return ~ ., data = past_data) # Training on past data
pred <- predict(fit, curr_data) # Prediction
w <- pred > quantile(pred, thresh) # Investment decision: TRUE = 1
return(w / sum(w))
}
if(j == 2){ # j = 2 => EW
N <- nrow(curr_data)
return(rep(1/N,N))
}
}
We then initialise the backtesting variables.
t_all <- unique(data$Date) # Set of dates
sep_date <- as.Date("2010-01-01") # This date separates in-sample vs out-of-sample
t_oos <- t_all[t_all > sep_date] # Out-of-sample dates (i.e., testing set)
nb_port <- 2 # Nb of portfolios
Tt <- length(t_oos) # Nb dates
portf_weights <- array(0, dim = c(Tt, nb_port, length(tick))) # Store weights
portf_returns <- matrix(0, nrow = Tt, ncol = nb_port) # Store returns
Finally, we are ready for the backtesting loop.
thresh <- 0.8
data_u <- data.frame(data_u, Date = data$Date) # We recover the dates from original file
for(t in 1:length(t_oos)){
past_data <- data_u %>% # The source is data_u!
filter(Date < t_oos[t]) %>% # Past dates: expanding window!
select(-Date) # Take out dates
curr_data <- data_u %>% # The source is data_u!
filter(Date == t_oos[t]) %>% # Current date
select(-Date) # Take out dates
for(j in 1:nb_port){ # The loop on the strategies
portf_weights[t,j,] <- weights_multi(past_data, # Portfolio weights...
curr_data,
j, # The weights' function is indexed by j
thresh) # The threshold
realised_returns <- data_u %>% # Realised returns: take data_u
filter(Date == t_oos[t]) %>% # Filter the right date
select(F_Return) # Take the returns only
portf_returns[t,j] <- sum(portf_weights[t,j,] * realised_returns) # Portfolio returns
}
}
portf_returns %>% apply(2, mean) # Average return
## [1] 0.01502077 0.01119351
portf_returns %>% apply(2, sd) # Volatility
## [1] 0.04059465 0.03984288
portf_returns %>% apply(2, mean) / portf_returns %>% apply(2, sd)
## [1] 0.3700185 0.2809414
The regression-based allocation is more profitable, but also slightly more volatile; its Sharpe ratio is somewhat higher (0.37 versus 0.28 in monthly terms). In order to statistically assess whether a strategy is significantly preferable, it is possible to perform a t-test on the difference of returns.
t_test <- t.test(portf_returns[,1]-portf_returns[,2])
t_test
##
## One Sample t-test
##
## data: portf_returns[, 1] - portf_returns[, 2]
## t = 2.6982, df = 132, p-value = 0.007882
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.001021441 0.006633080
## sample estimates:
## mean of x
## 0.003827261
The difference is significant at the 1% level. A t-statistic of 2.698 is not negligible, but not overwhelming either (stricter hurdles, such as t-statistics above 3, are sometimes advocated in the factor investing literature).
Inputs in financial backtesting can all be considered as time series. These series can be characterised in many ways, but we underline one important property: persistence. One simple way to assess persistence is to compute first-order autocorrelations.
acf_2 <- function(v){ # We build a function that keeps the first order correlation
return(acf(v, lag.max = 1, plot = FALSE)$acf[2])
}
data %>%
group_by(Tick) %>%
summarise_if(is.numeric, acf_2) # We apply this function to features
## # A tibble: 30 x 8
## Tick Close Vol_1M Mkt_Cap P2B D2E Prof_Marg F_Return
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL 0.951 0.672 0.959 0.926 0.979 0.952 0.0481
## 2 BA 0.985 0.684 0.982 0.956 0.941 0.646 0.0407
## 3 BAC 0.978 0.874 0.969 0.921 0.957 0.815 0.0927
## 4 C 0.992 0.875 0.971 0.976 0.989 0.725 0.0676
## 5 CSCO 0.969 0.781 0.950 0.924 0.986 0.713 -0.0289
## 6 CVS 0.986 0.672 0.983 0.960 0.971 0.733 -0.0145
## 7 CVX 0.986 0.691 0.977 0.873 0.974 0.777 -0.155
## 8 D 0.988 0.642 0.982 0.984 0.949 0.656 -0.0384
## 9 DIS 0.974 0.739 0.967 0.972 0.949 0.860 0.0598
## 10 F 0.945 0.811 0.963 0.798 0.719 0.766 -0.00353
## # … with 20 more rows
Almost all attributes are highly autocorrelated at the monthly frequency. This is obviously true for Mkt_Cap and Vol_1M which are indeed expected to show some memory over time.
The main issue is that the last column, which is the predicted variable, is not persistent at all! This is a major problem because it is unlikely that much good will come out of a model, even a very sophisticated one. If the returns oscillate very quickly while the predictors have very smooth trajectories, then there is a framing issue: the model/algorithm will probably find spurious relationships between variables that will fail out-of-sample.
Two solutions (at least) exist: (i) take first differences of the features so that they oscillate at a pace comparable to that of the label, or (ii) smooth the label, for instance by computing returns over longer horizons, so that it becomes as persistent as the features.
Below, we provide examples of implementation for each alternative. We start with the second one.
data_1 <- data %>%
group_by(Tick) %>% # Grouping
mutate(F_Return = lead(Close) / Close - 1, # Adding 1M future returns
FL_Return = lead(Close, 12) / Close - 1) %>% # Adding 12M (Long-term) future returns
na.omit() %>% # Take out missing data
ungroup() # Ungrouping
data_1 %>% group_by(Tick) %>%
summarise(autocor = acf_2(FL_Return)) # Showing the results
## # A tibble: 30 x 2
## Tick autocor
## <chr> <dbl>
## 1 AAPL 0.924
## 2 BA 0.932
## 3 BAC 0.835
## 4 C 0.881
## 5 CSCO 0.907
## 6 CVS 0.909
## 7 CVX 0.894
## 8 D 0.850
## 9 DIS 0.910
## 10 F 0.911
## # … with 20 more rows
By looking 12 months ahead, we are smoothing the dependent variable, but we are also losing some data points. The result (in terms of autocorrelation patterns) is nonetheless satisfactory.
Finally, we turn to feature differences, which is the first option.
data_2 <- data %>%
group_by(Tick) %>% # Group by stock
mutate(D_Mkt_Cap = Mkt_Cap / lag(Mkt_Cap) - 1, # Relative variation: scale issue
D_P2B = P2B - lag(P2B), # Absolute variation for the rest
D_Vol_1M = Vol_1M - lag(Vol_1M),
D_D2E = D2E - lag(D2E),
D_Prof_Marg = Prof_Marg - lag(Prof_Marg)) %>%
na.omit() %>% # Take out missing data
ungroup() # Ungroup!
data_2 %>% group_by(Tick) %>%
summarise_if(is.numeric, acf_2) %>%
mutate_if(is.numeric, round, digits = 3) # Show the results
## # A tibble: 30 x 13
## Tick Close Vol_1M Mkt_Cap P2B D2E Prof_Marg F_Return D_Mkt_Cap D_P2B
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL 0.951 0.668 0.959 0.926 0.979 0.952 0.049 0.055 0.27
## 2 BA 0.985 0.683 0.982 0.956 0.941 0.646 0.046 0.069 0.049
## 3 BAC 0.978 0.873 0.969 0.921 0.957 0.815 0.092 0.148 -0.01
## 4 C 0.992 0.875 0.97 0.975 0.989 0.725 0.068 0.179 0.122
## 5 CSCO 0.968 0.781 0.947 0.929 0.985 0.714 -0.03 -0.033 -0.033
## 6 CVS 0.986 0.665 0.983 0.965 0.971 0.733 -0.015 -0.046 0.06
## 7 CVX 0.985 0.691 0.976 0.874 0.973 0.777 -0.16 -0.158 -0.165
## 8 D 0.988 0.643 0.982 0.984 0.949 0.656 -0.028 -0.013 -0.081
## 9 DIS 0.975 0.738 0.968 0.972 0.949 0.86 0.075 0.086 -0.067
## 10 F 0.947 0.812 0.965 0.797 0.719 0.766 -0.006 0.021 0.064
## # … with 20 more rows, and 3 more variables: D_Vol_1M <dbl>, D_D2E <dbl>,
## # D_Prof_Marg <dbl>
The difference operator wiped out most of the autocorrelation in the features.
Finally, do not hesitate to work with performance relative to a benchmark. For instance, removing the cross-sectional average return at each point in time singles out the stocks with above-average performance. Moreover, the cross-sectional average of this new indicator is zero at every date, so it is no longer driven by macro-economic fluctuations.
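For illustration, a minimal sketch of such cross-sectional demeaning (the column name F_Return_rel is ours):
data %>%
group_by(Date) %>% # Work date-by-date
mutate(F_Return_rel = F_Return - mean(F_Return)) %>% # Return in excess of the cross-sectional average
ungroup() %>%
select(Tick, Date, F_Return, F_Return_rel) %>%
head() # Quick check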
Feature engineering is not all: labelling is also very important. Recycle the code above to change the dependent variable (future returns) into:
Create a weighting scheme (function) that is based on the categorical output (-1, 0, 1). The weights are initialised uniformly, then (a sketch follows this list):
* if the output is 1, then the prior weight is one
* if the output is -1, then the prior weight is zero
* if the output is 0, then the prior weight is the sign of the previous weight
* finally, normalise the prior weights so that they sum to one.
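One possible reading of this scheme, as a starting point (weights_cat is a hypothetical helper, not part of the notebook; degenerate cases such as an all-sell signal are ignored):
weights_cat <- function(signal, prev_weights){ # signal: vector of -1/0/1 outputs, one per asset
w <- rep(1 / length(signal), length(signal)) # Initialise the prior weights uniformly
w[signal == 1] <- 1 # Output +1: prior weight of one
w[signal == -1] <- 0 # Output -1: prior weight of zero
w[signal == 0] <- sign(prev_weights[signal == 0]) # Output 0: sign of the previous weight
return(w / sum(w)) # Normalise so that the weights sum to one
}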
Verify whether the persistence of attributes such as market capitalisation or volatility remains after the rescaling of features.
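As a starting point for this last check, the first-order autocorrelations can be recomputed on the uniformised sample (a sketch):
data.frame(data_u, Tick = data$Tick) %>% # Add the tickers back to the uniformised data
group_by(Tick) %>%
summarise_if(is.numeric, acf_2) # First-order autocorrelation of the rescaled features
Finally, the code below wraps the backtesting loop into a function of the threshold used in weights_multi, in order to gauge the sensitivity of the regression-based strategy to this parameter.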
f_thresh <- function(thresh){
for(t in 1:length(t_oos)){
past_data <- data_u %>% # The source is data_u!
filter(Date < t_oos[t]) %>% # Past dates: expanding window!
select(-Date) # Take out dates
curr_data <- data_u %>% # The source is data_u!
filter(Date == t_oos[t]) %>% # Current date
select(-Date) # Take out dates
for(j in 1:nb_port){ # The loop on the strategies
portf_weights[t,j,] <- weights_multi(past_data, # Portfolio weights...
curr_data,
j, # The weights' function is indexed by j
thresh) # The threshold
realised_returns <- data_u %>% # Realised returns: take data_u
filter(Date == t_oos[t]) %>% # Filter the right date
select(F_Return) # Take the returns only
portf_returns[t,j] <- sum(portf_weights[t,j,] * realised_returns) # Portfolio returns
}
}
return(portf_returns %>% apply(2, mean))
}
map(seq(0.1,0.9, by = 0.1),
f_thresh)
## [[1]]
## [1] 0.01100376 0.01119351
##
## [[2]]
## [1] 0.01143626 0.01119351
##
## [[3]]
## [1] 0.01152758 0.01119351
##
## [[4]]
## [1] 0.01197404 0.01119351
##
## [[5]]
## [1] 0.01197781 0.01119351
##
## [[6]]
## [1] 0.01311213 0.01119351
##
## [[7]]
## [1] 0.01411238 0.01119351
##
## [[8]]
## [1] 0.01502077 0.01119351
##
## [[9]]
## [1] 0.01537934 0.01119351