This is the companion notebook for the course on the LASSO and its application for sparse portfolios and penalised predictive regressions.
\(\rightarrow\) ABOUT THE NOTEBOOK:
First, we load the packages and import the data. The steps are the same as before. Notably, we immediately add the future/forward returns (F_Return) in addition to the past/current returns (P_Return).
if(!require(glmnet)){install.packages("glmnet", "tictoc", "scales", "ggrepel", "ggcorrplot")}
library(glmnet) # This is THE package for penalised regressions
library(tidyverse) # ... the usual core packages
library(lubridate) # Package for date management
library(ggrepel) # Package for ggplot annotations
library(scales) # Package for scales in ggplot
library(tictoc) # Package for computation time
library(ggcorrplot) # Visualizing the correlation matrices
load("data.RData") # Loading the data
data <- data %>% arrange(Date,Tick) # Just making sure all is in order
tick <- unique(data$Tick) # Set of assets
data <- data %>%
select(-ESG_rank) %>% # Remove ESG: too many missing points
group_by(Tick) %>% # Group asset by asset
mutate(P_Return = Close / lag(Close) - 1) %>% # Adding past returns
mutate(F_Return = lead(P_Return)) %>% # Adding forward returns: could to better
na.omit() # Remove missing points
Let’s have a look at correlations.
data[,4:10] %>% cor() %>% round(3) %>% # Correlation matrix
ggcorrplot(type = "upper", lab = TRUE)
There seems to be a positive link between Mk_Cap and Profitability. Let’s have a closer look. We build a pivot-table of averages over all firms and plot the result.
data %>%
group_by(Tick) %>% # Building a pivot table
summarise(avg_cap = mean(Mkt_Cap),
avg_prof = mean(Prof_Marg)) %>%
ggplot(aes(x = avg_cap, y = avg_prof)) + geom_point() +
geom_text_repel(aes(label = Tick)) + theme_light() +
ylab("Average Profitability Margin (%)") + xlab("Average Market Capitalization (K$)") +
geom_smooth(method = "lm", se = FALSE)
The coefficients are available at:
data %>%
group_by(Tick) %>% # Building a pivot table
summarise(avg_cap = mean(Mkt_Cap),
avg_prof = mean(Prof_Marg)) %>%
lm(avg_prof ~ avg_cap, data = .) # Compute the regression
##
## Call:
## lm(formula = avg_prof ~ avg_cap, data = .)
##
## Coefficients:
## (Intercept) avg_cap
## 6.328e+00 3.378e-05
Indeed, even at the aggregate level, the positive relationship seems to hold! The blue line shows the best linear fit for the data.
Next, we look at the impact of the regularisation intensities on the outcome of penalised regressions.
We choose to work with a simple example. We investigate if the cross-section of characteristics can help understand the cross-section of future returns (directly, without resorting to the definition of asset-pricing factors). To do so, we normalise the firm characteristics using the scale() function. The scaling is performed each point in time. Finally, we estimate the following regression:
\[R_{t+1}=a + b^{cap}cap_t+\dots +b^{P2B}P2B_t+\epsilon_t.\]
Note that characteristics are lagged compared to returns: the scheme is clearly predictive.
Note: below, we scale returns, which may not be a great idea… In fact, it’s a pooled panel model.
data %>%
group_by(Date) %>% # Grouping to normalise on a date basis
mutate_if(is.numeric, scale) %>% # Scaled chars
ungroup() %>% # Ungroup: global variable
select(-Date, -Tick, -Close) %>% # Remove irrelevant columns
lm(F_Return ~ ., data = .) %>% # Perform the regression
summary() %>% # Get the summary
"$"(coef) %>% # Keeping only the coefs & stats
round(3) %>% # Round up to 3 digits
data.frame() # Convert to dataframe
## Estimate Std..Error t.value Pr...t..
## (Intercept) 0.000 0.011 0.000 1.000
## Vol_1M 0.015 0.012 1.292 0.196
## Mkt_Cap -0.044 0.012 -3.586 0.000
## P2B -0.002 0.012 -0.164 0.870
## D2E -0.023 0.012 -1.939 0.052
## Prof_Marg 0.024 0.012 1.967 0.049
## P_Return -0.012 0.012 -1.021 0.307
Given our sample, the most impactful variables are the market capitalisation, profitability margin and the debt-to-equity ratio. Those are the variables for which the last column (p-value) is smaller than 5%. If we adopt a more conservative threshold (1%), only the Mkt_Cap remains significant. Notably, the past returns have little predictive power over future returns.
We start with the general definition of penalised regression proposed by the elasticnet (Zou & Hastie (2005)). It can be characterised by its Lagrange form:
\[\underset{\mathbf{\beta}}{\text{min}} \, \left\{(\mathbf{Y}- \mathbf{X} \mathbf{\beta})'(\mathbf{Y}- \mathbf{X} \mathbf{\beta})+\lambda \alpha || \mathbf{\beta} ||_1+\lambda (1-\alpha)||\mathbf{\beta}||_2^2\right\},\] where \(\lambda>0\), \(\alpha\in[0,1]\). When \(\alpha=1\), we recover the so-called LASSO (Tibshirani (1996)) and when \(\alpha=0\), we get the classical ridge regression, which is a very smooth regularisation of the usual regression. In our case, the predicted value \(\mathbf{Y}\) will be the future return \(\mathbf{R}_{t+1}\) and \(\mathbf{X}\) the other scaled variables.
We start with the case of the LASSO. When running, glmnet() performs a series of LASSO estimations with many possible values for the penalisation constant (lambda in the function’s arguments).
data_lasso <- data %>%
group_by(Date) %>% # Grouping to normalise, date-by-date
mutate_if(is.numeric,scale) %>% # Scaled chars
ungroup() # Ungroup: global variable
y <- data$F_Return # Dependent variable: ORIGINAL returns!
x <- data_lasso %>% # Independent variables
select(-Tick, -Date, - Close, -F_Return) %>% # Removing irrelevant columns
as.matrix() # Transform in base matrix
fit <- glmnet(x,y, alpha = 1) # Performing the LASSO: 1 = Lasso, 0 = Ridge
# Below, we format the results
var_names <- colnames(data)[4:10] # Names of independent variables
res <- summary(fit$beta) # Summary of the series of LASSO regressions
lambda <- fit$lambda # Values of the penalisation constant
res$Lambda <- lambda[res$j] # Putting the labels where they belong
res$Char <- var_names[res$i] %>% as.factor() # Adding names of variables to the output
res %>% ggplot(aes(x = Lambda, y = x, color = Char)) + # Plot!
geom_line() + theme_light()
The value of coefficients (y-axis) is a function of the penalisation intensity (lambda). As the latter increases, the magnitude of coefficient decreases. Mkt_Cap dominates and the convergence to zero of the other variables is somewhat rapid. The slope for the past return (in green) is the slowest compared to other features.
Overall, the curves are angular.
We then switch to ridge regressions. The penalisation in this case is a multiple of the \(L^2\) norm of the coefficients.
fit <- glmnet(x,y, alpha = 0) # Performing ridge regression: 1 = Lasso, 0 = Ridge
res <- summary(fit$beta) # Summary of the series of ridge regressions
lambda <- fit$lambda # Values of the penalisation constant
res$Lambda <- lambda[res$j] # Putting the labels where they belong
res$Char <- var_names[res$i] %>% as.factor() # Adding the names of variables to the output
res %>% ggplot(aes(x = Lambda, y = x, color = Char)) + # Plot!
geom_line() + theme_light()
res %>% ggplot(aes(x = Lambda, y = x, color = Char)) + # Logscale plot
geom_line() + scale_x_log10() + theme_light()
On the above graph, we used a logarithmic scale to discern the relative convergence speed of each coefficient (the bulk of the first graph is unclear). In contrast to the LASSO, the convergence is very smooth.
Because the LASSO (harsh convergence) and the ridge regression (smooth convergence) have very different outcomes, there was a natural gap between the two that was filled by the elasticnet (Zou and Hastie 2005). The penalisation in this case is a linear convex combination of both types of norms: \(a||\beta||_1+(1-a)||\beta||_2^2\).
Below, we provide the corresponding plot for \(a=0.005\). The relationship is not linear: a = 0.5 is not the middle point between LASSO and ridge regression.
fit <- glmnet(x,y, alpha = 0.01) # The elasticnet: 1 = Lasso, 0 = Ridge
res <- summary(fit$beta) # Summary of elasticnet regressions
lambda <- fit$lambda # Values of the penalisation constant
res$Lambda <- lambda[res$j] # Putting the labels where they belong
res$Char <- var_names[res$i] %>% as.factor() # Adding the names of variables to the output
res %>% ggplot(aes(x = Lambda, y = x, color = Char)) + # Plot
geom_line() + theme_light()
Indeed, the convergence pattern of the coefficients is somewhere between the two extremes.
It is smoother compared to the LASSO and slower compared to the ridge regression.
We now turn to the application of the LASSO to sparse portfolios (see Goto & Xu 2015 for the full implementation and Stevens 1998 for the original idea).
We first initialise the variables. Sparse portfolios are based on returns only; we thus build a dedicated variable (in matrix/rectangular format).
sep_date <- as.Date("2010-01-01") # This date separates in-sample vs out-of-sample
t_oos <- data$Date[data$Date>sep_date] %>% # Out-of-sample dates (i.e., testing set)
unique() %>% # Remove duplicates
as.Date(origin = "1970-01-01") # Transform in date format
returns <- data %>% # Computing returns, in matrix format, in 2 steps:
select(Date, Tick, P_Return) %>% # 1. Keep returns along with dates & firm names
pivot_wider(names_from = Tick, # 2. Put in matrix shape
values_from = P_Return)
# Below, we initialise the variables used in the backtesting loop
portf_weights <- matrix(0, nrow = length(t_oos), ncol = length(tick))
portf_returns <- vector(mode = "numeric", length = length(t_oos))
returns # A look at the returns
## # A tibble: 253 x 31
## Date AAPL BA BAC C CSCO CVS CVX D
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2000-01-31 0.00885 0.0739 -0.0349 0.0236 0.0222 -0.122 -0.0346 0.0637
## 2 2000-02-29 0.105 -0.167 -0.0503 -0.0896 0.207 0.00179 -0.0993 -0.106
## 3 2000-03-31 0.184 0.0237 0.152 0.157 0.170 0.0732 0.238 0.0477
## 4 2000-04-28 -0.0862 0.0496 -0.0656 -0.0121 -0.103 0.160 -0.0791 0.171
## 5 2000-05-31 -0.323 -0.0121 0.142 0.0540 -0.179 0 0.0936 0.0313
## 6 2000-06-30 0.248 0.0704 -0.224 -0.0312 0.116 -0.0805 -0.0825 -0.0628
## 7 2000-07-31 -0.0298 0.167 0.102 0.170 0.0295 -0.0112 -0.0687 0.0598
## 8 2000-08-31 0.198 0.102 0.142 0.107 0.0487 -0.0596 0.0785 0.179
## 9 2000-09-29 -0.577 0.202 -0.0225 -0.0742 -0.195 0.247 0.00864 0.0982
## 10 2000-10-31 -0.240 0.0513 -0.0823 -0.0266 -0.0249 0.144 -0.0367 0.0255
## # … with 243 more rows, and 22 more variables: DIS <dbl>, F <dbl>, GE <dbl>,
## # HD <dbl>, IBM <dbl>, INTC <dbl>, JNJ <dbl>, JPM <dbl>, K <dbl>, MCK <dbl>,
## # MRK <dbl>, MSFT <dbl>, ORCL <dbl>, PFE <dbl>, PG <dbl>, T <dbl>, UNH <dbl>,
## # UPS <dbl>, VZ <dbl>, WFC <dbl>, WMT <dbl>, XOM <dbl>
The core of the engine is the weighting function. Inside this function, the user will have two degrees of freedom: \(alpha\), which is the intensity between LASSO and ridge in the glmnet() function and \(\lambda\) which indicates the strength of the constraint. We recall that the approach is based on \(n\) regressions of the type: \[ \underset{\beta_{i|}}{\text{argmin}}\, \left\{\sum_{t=1}^T\left( r_{i,t}-a_i-\sum_{n\neq i}^N\beta_{i|n}r_{n,t}\right)^2+\lambda \alpha || \beta||_1+\lambda (1-\alpha)||\beta||_2^2\right\},\] where each asset’s returns are regressed against those of all other assets. The purpose is to hedge the risk of the asset thanks to position in the other assets. Due to estimation errors, the unconstrained regression performs poorly and is often very leveraged. The penalisation is intended to solve these two problems.
weights_lasso <- function(returns, alpha, lambda){ # The parameters are defined here
w <- 0 # Initiate weights
for(i in 1:ncol(returns)){ # Loop on the assets
y <- returns[,i] # Dependent variable
x <- returns[,-i] # Independent variable
fit <- glmnet(x,y, family = "gaussian", alpha = alpha, lambda = lambda)
err <- y-predict(fit, x) # Prediction errors
w[i] <- (1-sum(fit$beta))/var(err) # Output: weight of asset i
}
return(w / sum(w)) # Normalisation of weights
}
Finally, we move to the main loop, which has the same structure as those seen previously.
for(t in 1:length(t_oos)){
temp_data <- returns %>% filter(Date < t_oos[t]) %>% # Extracting past data: expand. window
select(-Date) %>% # Take out the date
as.matrix() # Into matrix: glmnet requires matrices
portf_weights[t,] <- weights_lasso(temp_data, 0.01, 0.1)# Hard-coded parameters! User specified!
realised_returns <- returns %>% # Realised returns:
filter(Date == t_oos[t]) %>% # Filtered by date
select(-Date) # With date removed
portf_returns[t] <- sum(portf_weights[t,] * realised_returns) # Portfolio returns
}
We then recall the perf_met() function, to which we added the turnover.
perf_met <- function(portf_returns, weights, asset_returns){
avg_ret <- mean(portf_returns, na.rm = T) # Arithmetic mean
vol <- sd(portf_returns, na.rm = T) # Volatility
Sharpe_ratio <- avg_ret / vol # Sharpe ratio
VaR_5 <- quantile(portf_returns, 0.05) # Value-at-risk
turn <- 0 # Initialisation of turnover
for(t in 2:dim(weights)[1]){
realised_returns <- asset_returns %>% filter(Date == t_oos[t]) %>% select(-Date)
prior_weights <- weights[t-1,] * (1 + realised_returns)
turn <- turn + apply(abs(weights[t,] - prior_weights/sum(prior_weights)),1,sum)
}
turn <- turn/(length(t_oos)-1) # Average over time
met <- data.frame(avg_ret, vol, Sharpe_ratio, VaR_5, turn) # Aggregation of all of this
rownames(met) <- "metrics"
return(met)
}
asset_returns <- filter(returns, Date>sep_date) # Keep out-of-sample returns
perf_met(portf_returns, portf_weights, asset_returns) # Compute perf metrics
## avg_ret vol Sharpe_ratio VaR_5 turn
## metrics 0.00919831 0.03194172 0.2879716 -0.04688039 0.04399418
Let’s have a look at the weights.
portf_weights <- bind_cols(t_oos, as_tibble(portf_weights))
colnames(portf_weights) <- c("date", tick)
portf_weights %>%
pivot_longer(-date, names_to = "stock", values_to = "weight") %>%
ggplot(aes(x = date, y = weight, color = stock)) +
geom_line() + theme_light() +
geom_text_repel(aes(label = stock), hjust = -0.2,
data = portf_weights %>%
pivot_longer(-date, names_to = "stock", values_to = "weight") %>%
filter(date == max(date))) +
annotate("text", x = as.Date("2021-06-01"), y = 0.14, label = "Ticker") +
theme(legend.position = "none")
The purpose of the sparse portfolios is to reduce the volatility. Let’s see how it compares to the other classical schemes. First, we build the weighting function. It’s just a concatenation of functions we saw previously.
weights_multi <- function(returns, j, alpha, lambda){
N <- ncol(returns)
if(j == 1){ # j = 1 => EW
return(rep(1/N,N))
}
if(j == 2){ # j = 2 => Minimum Variance
sigma <- cov(returns)
w <- solve(sigma) %*% rep(1,N)
return(w / sum(w))
}
if(j == 3){ # j = 3 => Maximum Sharpe Ratio
m <- apply(returns, 2, mean)
sigma <- cov(returns)
w <- solve(sigma) %*% m
return(w / sum(w))
}
if(j == 4){ # j = 4 => Penalised / elasticnet
w <- weights_lasso(returns, alpha, lambda)
}
}
Note that this does take some space in the script. A helpful shortcut is to create a separate file where you store the functions only. Below, we proceed to initialisation of variables and jump directly the main loop. Given that we compute the weights and returns for 4 portfolios, it takes a bit longer.
Tt <- length(t_oos) # Nb of dates, avoid T = TRUE
nb_port <- 4 # Nb of portfolios
portf_weights <- array(0, dim = c(Tt, nb_port, length(tick))) # Store weights
portf_returns <- matrix(0, nrow = Tt, ncol = nb_port) # Store returns
for(t in 1:length(t_oos)){
temp_data <- returns %>%
filter(Date < t_oos[t]) %>%
select(-Date) %>%
as.matrix()
realised_returns <- returns %>%
filter(Date == t_oos[t]) %>%
select(-Date)
for(j in 1:nb_port){
portf_weights[t,j,] <- weights_multi(temp_data, j, 0.1, 0.01) # Hard-coded parameters!
portf_returns[t,j] <- sum(portf_weights[t,j,] * realised_returns)
}
}
Finally, we compute the corresponding performance metrics. Because of the complex nature of perf_met(), we use a small loop.
port_names <- c("EW", "MV", "MSR", "LASSO") # Portfolio names
met <- c() # Initialise metrics
for(i in 1:nb_port){
met <- rbind(met, perf_met(portf_returns[,i], portf_weights[,i,], asset_returns)) # Ugly!
}
met %>% data.frame(row.names = port_names) # Display results
## avg_ret vol Sharpe_ratio VaR_5 turn
## EW 0.010769125 0.03982011 0.2704444 -0.05507966 0.03872154
## MV 0.009737534 0.03241443 0.3004074 -0.04274005 0.10221530
## MSR 0.011078110 0.04361729 0.2539844 -0.06104447 0.29806042
## LASSO 0.009225296 0.03148715 0.2929860 -0.04077635 0.07107574
The LASSO portfolio has the lowest volatility! a close vol value is obtained by the Minimum Volatility (MV) model. The latter has a turnover much smaller compared to the MSR portfolio but is beaten by the LASSO strategy on this metric. Moreover, in terms of extreme risk (VaR), the LASSO portfolio is the best!
The parameters of the LASSO portfolio play a decisive role in the performance of the strategy: they should be chosen carefully after sensitivity tests show how they impact the results. The \(\lambda\) will dictate the amount of constraint on the coefficients. A strict constraint will build more stable and sparse portfolios, but the weights will be further from those of the MV portfolio. As usual, it’s all a matter of tradeoff.
Another use of the LASSO would simply be to use it as predictive tool. In this case, the explanatory variables would be, e.g., the firms’ characteristics. Below, we show how to implement the predictive regression approach. More precisely, it is a pooled panel estimation.
We start by the weighting function. We use the predictive regression to forecast future returns (probably not the best choice). Based on the prediction, we form equally-weighted portfolios of assets with positive forecasts (in fact, above median forecasts would be more stable and a better choice: fixed number of assets in the portfolio).
weights_pure_lasso <- function(past_data, current_data, alpha, lambda){
y <- past_data$F_Return # Dependent variable
x <- past_data %>% # Independent variables
select(-Tick, -Date, -Close, -F_Return) %>% # Remove irrelevant columns
as.matrix() # Format to matrix shape
fit <- glmnet(x,y, alpha = alpha, lambda = lambda) # Performing the glmnet regression
newx <- current_data %>% # Preparing the new data
select(-Tick, -Date, -Close, -F_Return) %>% # Remove irrelevant columns
as.matrix() # Format to matrix shape
pred <- predict(fit, newx = newx) # Launching the prediction
w <- pred > median(pred) # Invests only if positive prediction
return(w/sum(w)) # Then, equal weights
}
Below, we launch the main loop. Because of the forward return we must go one step backward in the out-of-sample dates. Indeed, before, we took data up to \(t-1\) to decide the allocation at time \(t-1\) and apply the return observed at time \(t\). Since we resort to the forward return, we must go back to time \(t-2\). The decision is made with the values at time \(t-1\):
To ‘simplify’ the architecture, we simply define a new vector of out-of-sample dates.
t_oos2 <- data$Date[data$Date>sep_date-20] %>% # New dates, we take one more (prev. month)!::
unique() %>% # Remove duplicates
as.Date(origin = "1970-01-01") # Transform in date format
portf_weights <- matrix(0, nrow = length(t_oos), ncol = length(tick)) # Initialisation
portf_returns <- c() # Initialisation
for(t in 2:length(t_oos2)){ # Current time is t-1
past_data <- data_lasso %>% filter(Date < t_oos2[t-1]) # Past data: expanding window
current_data <- data_lasso %>% filter(Date == t_oos2[t-1]) # Extracting current data
portf_weights[t-1,] <- weights_pure_lasso(past_data,current_data, 0.1, 0.01)
# Hard-coded parameters above! User specified!
realised_returns <- returns %>% # Realised returns
filter(Date == t_oos2[t]) %>% # Note: starts at t = 2, equal to t_oos[1]
select(-Date) # Take out date column
portf_returns[t-1] <- sum(portf_weights[t-1,] * realised_returns)
# Note: t-1, like for the portfolios !!!
}
asset_returns <- filter(returns, Date %in% t_oos) # And not t_oos2!
perf_met(portf_returns, portf_weights, asset_returns)
## avg_ret vol Sharpe_ratio VaR_5 turn
## metrics 0.01048623 0.04142142 0.2531597 -0.0592228 0.3806262
The performance is reasonably good compared to our previous results. The average return is quite high, but the risk has also increased (the VaR, notably). The turnover is nonetheless much too high.
Try different values for alpha and lambda in the sparse portfolios and see if it is possible to reduce the out-of-sample volatility.
Create a weighting function and the corresponding backtesting loop that tests several parameter values for:
- the LASSO-based sparse portfolios
- the predictive LASSO strategies
Is it possible to improve the volatility of the penalized minimum variance portfolio?
Given the computation time of only one backtest, this will take some time.
# Below, we create a function that depends on the two parameters
lasso_sensi <- function(alpha, lambda, t_oos, tick, returns){
portf_weights <- matrix(0, nrow = length(t_oos), ncol = length(tick))
portf_returns <- c()
for(t in 1:length(t_oos)){
temp_data <- returns %>% filter(Date < t_oos[t]) %>% # Extracting past data: expnd. wndw
select(-Date) %>% # Take out the date
as.matrix() # Into matrix
portf_weights[t,] <- weights_lasso(temp_data, alpha, lambda)
realised_returns <- returns %>% # Realised returns:
filter(Date == t_oos[t]) %>% # Filtered by date
select(-Date) # With date removed
portf_returns[t] <- sum(portf_weights[t,] * realised_returns) # Portfolio returns
}
return(sd(portf_returns))
}
alpha <- c(0.001, 0.01, 0.1) # alpha values
lambda <- c(0.001, 0.01, 0.1, 1) # lambda values
pars <- expand.grid(alpha, lambda) # Exploring all combinations!
alpha <- pars[,1]
lambda <- pars[,2]
tic() # Launch the clock
vols <- pmap(list(alpha, lambda), # Parameters for the grid search
lasso_sensi, # Function on which to apply pmap
t_oos = t_oos, # The other inputs below
tick = tick,
returns = returns)
toc() # Stop the clock!
## 116.769 sec elapsed
vols <- vols %>%
unlist() %>%
cbind(as.matrix(pars)) %>%
data.frame()
colnames(vols) <- c("vol", "alpha", "lambda")
vols %>% ggplot(aes(x = as.factor(alpha), y = vol, fill = as.factor(alpha))) +
geom_col() + facet_grid(~lambda) + theme_light() +
scale_y_continuous(limits = c(0.02, 0.04), oob = rescale_none)
The original parameters were a rather good choice.
First, let’s extract the coefficients of the regression.
data %>%
mutate(Year = year(Date)) %>% # Adding a column thanks to year() from lubridate
group_by(Tick) %>% # Building a pivot table
summarise(avg_cap = mean(Mkt_Cap),
avg_prof = mean(Prof_Marg)) %>%
lm(avg_prof ~ avg_cap, data = .)
##
## Call:
## lm(formula = avg_prof ~ avg_cap, data = .)
##
## Coefficients:
## (Intercept) avg_cap
## 6.328e+00 3.378e-05
Next, we can move forward to the plot.
data %>%
group_by(Tick) %>% # Building a pivot table
summarise(avg_cap = mean(Mkt_Cap), # The average cap
avg_prof = mean(Prof_Marg)) %>% # The average prof
mutate(pred = 6.302 + 3.368e-05 * avg_cap, # The model values (see above)
col = if_else(pred > avg_prof,
"Positive error",
"Negative error")) %>%
ggplot(aes(x = avg_cap, y = avg_prof)) + # The plot!
geom_point() +
geom_text_repel(aes(label = Tick)) + theme_light() +
ylab("Average Profitability Margin (%)") + xlab("Average Market Capitalization (K$)") +
stat_smooth(geom = 'line', method = "lm", alpha = 0.4, se = F, linetype = 2) +
geom_point(aes(x = avg_cap, y = pred), shape = 1, color = "#1166EE") +
geom_segment(aes(xend = avg_cap, yend = pred, color = col), alpha = .4) +
annotate(geom = "text", x = 320000, y = 18, label = "prediction line", color = "#1166EE", angle = 18) +
annotate(geom = "text", x = 229000, y = 9.5, label = "positive error", color = "#99BBBB", angle = 90) +
annotate(geom = "text", x = 334000, y = 12, label = "positive error", color = "#99BBBB", angle = 90) +
annotate(geom = "text", x = 420000, y = 24.5, label = "negative error", color = "#DD8888", angle = 90) +
annotate(geom = "text", x = 125000, y = 15, label = "negative error", color = "#DD8888", angle = 90) +
theme(legend.position = "none") +
annotate(geom = "text", x = 335000, y = 16.8, label = "prof = 7.15+2.16e-05 * mkt_cap", color = "#1166EE", angle = 18)
# ggsave("reg3.pdf", width = 8, height = 5)