This is the companion notebook to the course on empirical foundations of machine learning in asset management.
The topic is strongly tilted towards asset pricing anomalies and serves as the economic justification of ML-based factor portfolios. Indeed, since the 1990s, a large body of empirical work has shown that firms’ characteristics have an impact on future returns. This impact is possibly time-varying.
The first chunks of the code are inspired from the book Reproducible Finance by Jonathan Regenstein (http://www.reproduciblefinance.com).
The code is commented and the results are discussed as much as possible.
\(\rightarrow\) ABOUT THE NOTEBOOK:
First, we collect the data for our baseline assets. We work with 6 large American companies which are associated to their ticker symbol: Apple (AAPL, technologies), Boeing (BA, industry), Citigroup (C, banking), Pfizer (PFE, healthcare), Walmart (WMT, retail) and Exxon (XOM, oil/commodities). The six firms have been chosen in very diverse industry sectors.
NOTE: The analysis below should be carried out on a much larger number of assets. We restrict the width of the analysis for simplicity and for computational purposes.
if(!require(tidyverse)){install.packages("tidyverse")}
if(!require(quantmod)){install.packages("quantmod")}
if(!require(lubridate)){install.packages("lubridate")}
library(tidyverse) # Package for data manipulation
library(quantmod) # Package for data extraction
library(lubridate) # Package for date management
tickers <- c("AAPL", "BA", "C", "PFE", "WMT", "XOM")
# Tickers: Apple, Boeing, Citigroup, Pfizer, Walmart, Exxon
# Others: , "F", "DIS", "GE", "CVX", "MSFT", "GS")
min_date <- "2000-01-01" # Starting date
max_date <- "2022-04-27" # Ending date
prices <- getSymbols(tickers, src = 'yahoo', # The data comes from Yahoo Finance
from = min_date, # Start date
to = max_date, # End date
auto.assign = TRUE,
warnings = FALSE) %>%
map(~Ad(get(.))) %>% # Retrieving the data
reduce(merge) %>% # Merge in one dataframe
`colnames<-`(tickers) # Set the column names
head(prices,7) # Have a look at the result!
## AAPL BA C PFE WMT XOM
## 2000-01-03 0.855796 25.94028 236.7048 14.20126 44.74233 19.67147
## 2000-01-04 0.783644 25.89994 222.1900 13.67220 43.06817 19.29469
## 2000-01-05 0.795111 27.51364 231.1223 13.89496 42.18923 20.34655
## 2000-01-06 0.726304 27.79605 242.2875 14.39618 42.64964 21.39841
## 2000-01-07 0.760708 28.60290 241.1711 15.37078 45.87244 21.33561
## 2000-01-10 0.747329 28.19947 240.3335 15.34294 45.03532 21.03732
## 2000-01-11 0.709102 27.67502 237.2631 15.14801 44.36568 21.10013
rm(AAPL, BA, C, PFE, WMT, XOM) # Remove unnecessary variables
The data is stored in the ‘prices’ variable. Let’s have a look at the series, to make sure they make sense.
prices %>%
data.frame(Date = index(.)) %>%
remove_rownames() %>%
pivot_longer(-Date, names_to = "Asset",
values_to = "Price") %>%
ggplot(aes(x = Date, y = Price, color = Asset)) + geom_line() +
facet_wrap(vars(Asset), scales = "free") + theme_light()
This seems alright, though Citigroup suffered massively in the subprime crisis.
Before we move towards the factors, we transform (via mutate) the daily prices into monthly returns. Returns are indeed the raw material in portfolio management. In the code, we resort to a lot of piping using the pipe operator %>%.
returns <- prices %>% # Start from prices
to.monthly(indexAt = "lastof", OHLC = FALSE) %>% # Convert to monthly
data.frame(Date = index(.)) %>% # Convert the index to a date
remove_rownames() %>% # Remove index => converted into row names
pivot_longer(-Date, names_to = "Asset",
values_to = "Price") %>% # Put the data in 'tidy' format
group_by(Asset) %>% # Group the rows by asset to compute returns
mutate(returns = Price / dplyr::lag(Price) - 1) %>% # Compute the returns
select(-Price) %>% # Remove price column
pivot_wider(names_from = "Asset",
values_from = "returns") %>% # Put the data back in matrix form
select(all_of(c("Date", tickers))) %>% # Keep only the relevant columns
na.omit() # Discard rows with missing/NA data
head(returns) # Again, show the result!
## # A tibble: 6 × 7
## Date AAPL BA C PFE WMT XOM
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2000-02-29 0.105 -0.167 -0.0896 -0.110 -0.110 -0.0862
## 2 2000-03-31 0.185 0.0237 0.157 0.138 0.160 0.0357
## 3 2000-04-30 -0.0865 0.0496 -0.0121 0.152 -0.0199 -0.00401
## 4 2000-05-31 -0.323 -0.0121 0.0540 0.0584 0.0406 0.0782
## 5 2000-06-30 0.247 0.0704 -0.0312 0.0787 0.00109 -0.0578
## 6 2000-07-31 -0.0298 0.167 0.170 -0.0964 -0.0412 0.0215
We now turn to the importation of factor data. It’s extracted from the amazing database maintained by Kenneth French: http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html
temp <- tempfile()
link <- "https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_5_Factors_2x3_CSV.zip"
download.file(link, temp, quiet = TRUE) # Download!
nb_factors <- 5 # Number of factors
FF_factors <- read_csv(unz(temp,
"F-F_Research_Data_5_Factors_2x3.csv"), # The file name
skip = 3) %>% # Check the nb of lines to be skipped (could be 2!)
rename(Date = `...1`, MKT_RF = `Mkt-RF`) %>% # Change name of the first column
mutate_at(vars(-Date), as.numeric) %>% # Convert to number
mutate(Date = ymd(parse_date_time(Date, "%Y%m"))) %>% # Date in right format
mutate(Date = rollback(Date + months(1))) # End of month date
FF_factors <- FF_factors %>% mutate(MKT_RF = MKT_RF / 100, # Scaling returns
SMB = SMB / 100,
HML = HML / 100,
RMW = RMW / 100,
CMA = CMA / 100,
RF = RF / 100) %>% # That's the risk-free rate
filter(Date >= min_date) # Finally, we keep the recent points
head(FF_factors) # And we have a look!
## # A tibble: 6 × 7
## Date MKT_RF SMB HML RMW CMA RF
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2000-01-31 -0.0474 0.0445 -0.0189 -0.0629 0.0474 0.0041
## 2 2000-02-29 0.0245 0.184 -0.0981 -0.188 -0.0035 0.0043
## 3 2000-03-31 0.052 -0.154 0.0823 0.118 -0.0161 0.0047
## 4 2000-04-30 -0.064 -0.0496 0.0725 0.0767 0.0562 0.0046
## 5 2000-05-31 -0.0442 -0.0387 0.0483 0.0418 0.0132 0.005
## 6 2000-06-30 0.0464 0.0987 -0.0841 -0.0826 -0.0291 0.004
Finally, we merge the two sources of data into one single variable.
data <- left_join(returns, FF_factors, by = "Date")
# left_join() is an incredible function that merges two sets through one common column
head(data) # See the outcome!
## # A tibble: 6 × 13
## Date AAPL BA C PFE WMT XOM MKT_RF SMB
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2000-02-29 0.105 -0.167 -0.0896 -0.110 -0.110 -0.0862 0.0245 0.184
## 2 2000-03-31 0.185 0.0237 0.157 0.138 0.160 0.0357 0.052 -0.154
## 3 2000-04-30 -0.0865 0.0496 -0.0121 0.152 -0.0199 -0.00401 -0.064 -0.0496
## 4 2000-05-31 -0.323 -0.0121 0.0540 0.0584 0.0406 0.0782 -0.0442 -0.0387
## 5 2000-06-30 0.247 0.0704 -0.0312 0.0787 0.00109 -0.0578 0.0464 0.0987
## 6 2000-07-31 -0.0298 0.167 0.170 -0.0964 -0.0412 0.0215 -0.0251 -0.0102
## # … with 4 more variables: HML <dbl>, RMW <dbl>, CMA <dbl>, RF <dbl>
With our full dataset at hand, we are ready to begin the analysis.
The Fama-Macbeth regressions can be used to assess the premium associated to factors. They are performed in two steps.
A natural way to proceed would be to resort to a for() loop to perform the regressions iteratively. We take another route and instead make us of the ‘apply’ family of functions. We use a procedure that creates all models via their character syntax in R. And finally, we gather the results in the ‘betas’ variable.
models <- lapply(paste(tickers, ' ~ MKT_RF + SMB + HML + RMW + CMA'), # Specification of models
function(f){ lm(as.formula(f), data = data) %>% # Calling lm(.)
summary() %>% # Gathering the output
"$"(coef) %>% # Keeping only the coefs
data.frame() %>% # Convert to dataframe
select(Estimate)} # Keep the coefficients
)
betas <- matrix(unlist(models), ncol = nb_factors + 1, byrow = T) %>% # Extracting the betas
data.frame(row.names = tickers) # Formatting: row names
colnames(betas) <- c("Constant", "MKT_RF", "SMB", "HML", "RMW", "CMA") # Formatting: col names
betas # Showing the values
## Constant MKT_RF SMB HML RMW CMA
## AAPL 2.241459e-02 1.2603179 0.203876790 -0.56038315 0.2337760 -0.97665289
## BA 1.705785e-03 1.1354167 0.099093615 0.62774886 0.3038202 0.02417191
## C -2.514768e-03 1.5166111 -0.533662630 2.12343223 -0.9939174 -1.17013600
## PFE -9.473582e-04 0.8444254 -0.406146782 -0.15056910 0.3611860 0.59252631
## WMT 2.539574e-03 0.5297648 -0.396493766 -0.08939238 0.3333755 0.20786897
## XOM -3.720123e-05 0.7409502 -0.006314125 0.34321241 0.2033925 0.33725592
rm(models) # Remove models to save space
The market beta is the second column. When there are no other factors, a beta above 1 (resp. below 1) means that the stock is more volatile than the market (resp. less volatile). The two stocks with high market betas also load heavily (and negatively on CMA). Citigroup also has a strong positive exposure to HML.
We then turn to the second step of the procedure. This time, we regress asset returns against the betas, one date at a time. We start by formatting the data.
loadings <- betas %>% # Start from loadings
select(-Constant) %>% # Keep only the loadings
data.frame() # Convert to dataframe
ret <- returns %>% # Start from returns
select(-Date) %>% # Keep the returns only
data.frame(row.names = returns$Date) %>% # Set row names
t() # Transpose
FM_data <- cbind(loadings, ret) # Aggregate both
FM_data[,1:8] # Quick look at the first 8 columns
## MKT_RF SMB HML RMW CMA 2000-02-29
## AAPL 1.2603179 0.203876790 -0.56038315 0.2337760 -0.97665289 0.10482098
## BA 1.1354167 0.099093615 0.62774886 0.3038202 0.02417191 -0.16709977
## C 1.5166111 -0.533662630 2.12343223 -0.9939174 -1.17013600 -0.08958239
## PFE 0.8444254 -0.406146782 -0.15056910 0.3611860 0.59252631 -0.11001050
## WMT 0.5297648 -0.396493766 -0.08939238 0.3333755 0.20786897 -0.10958886
## XOM 0.7409502 -0.006314125 0.34321241 0.2033925 0.33725592 -0.08615805
## 2000-03-31 2000-04-30
## AAPL 0.18484066 -0.086516014
## BA 0.02368869 0.049586144
## C 0.15700398 -0.012084238
## PFE 0.13813239 0.152136900
## WMT 0.16043235 -0.019911897
## XOM 0.03568452 -0.004006031
This will be the inputs of the second stage of regressions: for each stock, we have both the returns and the associated loading values.
models <- lapply(paste("`", returns$Date, "`", ' ~ MKT_RF + SMB + HML + RMW + CMA', sep = ""),
function(f){ lm(as.formula(f), data = FM_data) %>% # Calling lm(.)
summary() %>% # Gathering the output
"$"(coef) %>% # Keeping only the coefs
data.frame() %>% # Convert to dataframe
select(Estimate)} # Keep the coefficients
)
gammas <- matrix(unlist(models), ncol = nb_factors + 1, byrow = T) %>% # Switch to dataframe
data.frame(row.names = returns$Date) # & set row names
colnames(gammas) <- c("Constant", "MKT_RF", "SMB", "HML", "RMW", "CMA") # Set col names
gammas[1:20,] # Show first 20 lines
## Constant MKT_RF SMB HML RMW
## 2000-02-29 0.06195001 -0.0073364873 0.1581067403 -0.197437047 -0.37044657
## 2000-03-31 0.08084707 0.0148681466 -0.2129338840 -0.078731564 -0.00459732
## 2000-04-30 -0.30384152 0.3201790485 -0.1598278960 -0.016377913 0.03869766
## 2000-05-31 0.09976535 -0.1274611386 -0.0435199251 0.104245250 -0.10700397
## 2000-06-30 -0.27314377 0.3015530083 -0.1325347738 -0.077149801 0.21248284
## 2000-07-31 -0.08904186 -0.0114286039 0.0526508050 0.248621591 0.36545601
## 2000-08-31 -0.17292260 0.3260336572 0.2376255788 -0.090300684 -0.21137606
## 2000-09-30 -0.20529126 -0.0665755163 -0.1159311047 0.469183573 0.64677663
## 2000-10-31 -0.12210987 -0.0028578157 -0.0005302572 0.178998559 0.20686626
## 2000-11-30 0.04074763 -0.2255984526 -0.2889947221 0.192706184 0.48441357
## 2000-12-31 -0.01898855 0.0003900559 -0.1098393697 -0.002510792 -0.04779740
## 2001-01-31 0.25837803 -0.0658709363 0.0462664928 -0.295905078 -0.36659592
## 2001-02-28 -0.31176463 0.2230169625 0.0181719396 0.096327439 0.19657947
## 2001-03-31 0.29737619 -0.2009474362 0.2080627271 -0.169256410 -0.28175765
## 2001-04-30 0.06112848 0.0761933308 0.1346452223 -0.036126678 -0.09761786
## 2001-05-31 -0.04475331 -0.0549063553 -0.0827848288 0.147036924 0.13958393
## 2001-06-30 0.17289993 -0.0361796958 0.1921473429 -0.203174165 -0.45890514
## 2001-07-31 -0.09506371 -0.1665547145 -0.3967360445 0.267139935 0.70087410
## 2001-08-31 0.01877263 0.0630448562 0.2253066808 -0.164458737 -0.44628784
## 2001-09-30 0.33914234 -0.2159149549 -0.0048804592 -0.316727845 -0.80898239
## CMA
## 2000-02-29 0.004255256
## 2000-03-31 -0.087668992
## 2000-04-30 0.175949065
## 2000-05-31 0.173798303
## 2000-06-30 -0.076003048
## 2000-07-31 -0.119557651
## 2000-08-31 0.090467520
## 2000-09-30 0.156536012
## 2000-10-31 0.064019811
## 2000-11-30 -0.144057664
## 2000-12-31 0.048969686
## 2001-01-31 -0.193386585
## 2001-02-28 0.123951337
## 2001-03-31 -0.096038572
## 2001-04-30 0.027716795
## 2001-05-31 0.037622030
## 2001-06-30 0.007820043
## 2001-07-31 -0.184186844
## 2001-08-31 0.148226811
## 2001-09-30 0.223495340
rm(models) # Remove models, save space
Note that the estimates are not going to be very reliable because the number of observations (number of stocks, i.e., 6) is barely larger than the number of variables (number of factors, 5). Accuracy would require a much larger number of assets (hundreds, or, preferably, thousands).
gammas %>% colMeans() %>% round(4) # Means (over all periods)
## Constant MKT_RF SMB HML RMW CMA
## 0.0069 0.0052 0.0084 -0.0061 0.0047 -0.0075
apply(gammas, 2, t.test) %>% unlist() %>% # t-stats
matrix(ncol = ncol(gammas), dimnames = list(c(), colnames(betas))) %>%
data.frame() %>% slice(1) %>% as.numeric() %>% round(4) # Take only the first line of the test
## [1] 0.6954 0.5968 0.8468 -0.8422 0.2671 -1.1443
Surprisingly, some averages are quite negative. But the t-stats are hardly significant.
Let’s nonetheless visually assess the stability of premia across time.
gammas %>% # Start from gamma (estimates)
select(-Constant) %>% # Take out the constant
rownames_to_column("Date") %>% # Convert row names to the first column => dates
mutate(Date = as.Date(Date)) %>% # Convert first date column to true date format
pivot_longer(-Date, names_to = "Factor", # Put in column/tidy format for ggplot
values_to = "Value") %>%
ggplot(aes(x = Date, y = Value, color = Factor)) + # Plot
geom_line() + facet_grid(Factor~.) + theme_light() +
scale_color_viridis_d()
CMA and HML appears more stable, while RMW and SMB are more volatile.
Finally, what are the annual average of these series? We build a simple pivot-table below.
gammas %>% # To create a pivot table, start with the gammas
select(-Constant) %>% # Take out the constant
rownames_to_column("Date") %>% # Convert row names to the first column
mutate(Date = year(as.Date(Date))) %>% # Convert this new first column to year
pivot_longer(-Date, names_to = "Factor", # Put in column/tidy format for ggplot
values_to = "Value") %>%
group_by(Date, Factor) %>% # Prepare pivot-table
summarise(avg_premium = mean(Value)) %>% # Pivot table!
ggplot(aes(x = Date, y = avg_premium, color = Factor)) + # Plot
geom_line() + geom_point() + theme_light() +
scale_color_brewer(palette = "Set2")
## `summarise()` has grouped output by 'Date'. You can override using the
## `.groups` argument.
The final graph shows the average of premias computed over each year in the sample. It unveils some cycles for most factors. The Size factor is known to be very sensitive to macro-economic conditions: indeed smaller firms thrive in good economic times, but are hit hardly during economic downturns. Both in 2008 and 2009, the SMB returns were negative.
Below, we regress each factor against all its ‘competitors’. If the alpha (coefficient of the constant/Integer vector) in the regression is not significantly different from zero, it means that the factor is somewhat redundant because it can be ‘explained’ by exposures to other factors. If not, the factor cannot be explained by the other factors and is thus adding value to the analysis.
We can thus work simply with the Fama-French factors with no external data.
factors <- c("MKT_RF", "SMB", "HML", "RMW", "CMA")
models <- lapply(paste(factors, ' ~ MKT_RF + SMB + HML + RMW + CMA-',factors),
function(f){ lm(as.formula(f), data = FF_factors) %>% # Calling lm(.)
summary() %>% # Gathering the output
"$"(coef) %>% # Keeping only the coefs
data.frame() %>% # Convert to dataframe
filter(rownames(.) == "(Intercept)") %>% # Keep only the Intercept
select(Estimate,`Pr...t..`)} # Keep the coef & p-value
)
alphas <- matrix(unlist(models), ncol = 2, byrow = T) %>% # Switch from list to dataframe
data.frame(row.names = factors) # Set row names
colnames(alphas) <- c("coef", "p-value") # Set col names
alphas # Show
## coef p-value
## MKT_RF 0.009359218 2.334365e-04
## SMB 0.003976494 1.978716e-02
## HML -0.004216841 9.134865e-03
## RMW 0.005920758 3.013396e-05
## CMA 0.003243847 1.685547e-03
rm(models) # Remove models
The p-values are rather small. The results are somewhat in line with the conclusions of Fama & French (2015) - Section 7 in which the authors conclude that HML is redundant (not always true!).
In R, panel models are handled via the great plm package, so we need it! Also: let’s load a small dataset and compute forward returns.
if(!require(plm)){install.packages("plm")} # Package for panels
library(plm)
load("data.RData") # Load data
data <- data %>%
group_by(Tick) %>%
mutate(F_return = dplyr::lead(Close)/Close - 1) %>% # Add return column
ungroup()
data
## # A tibble: 7,650 × 10
## Tick Date Close Vol_1M Mkt_Cap P2B D2E Prof_Marg ESG_rank F_return
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL 1999-12-31 0.791 53.6 16532. 3.35 9.66 8.31 NA 0.00885
## 2 AAPL 2000-01-31 0.798 80.8 16683. 3.38 6.80 7.81 NA 0.105
## 3 AAPL 2000-02-29 0.882 60.0 18480. 3.73 6.80 7.81 NA 0.184
## 4 AAPL 2000-03-31 1.04 75.7 21896. 4.68 6.80 7.81 NA -0.0862
## 5 AAPL 2000-04-28 0.954 81.6 20002. 4.28 7.12 12.0 NA -0.323
## 6 AAPL 2000-05-31 0.646 61.3 13670. 2.90 7.12 12.0 NA 0.248
## 7 AAPL 2000-06-30 0.806 66.6 17047. 3.64 7.12 12.0 NA -0.0298
## 8 AAPL 2000-07-31 0.782 76.7 16539. 3.53 7.18 11.0 NA 0.198
## 9 AAPL 2000-08-31 0.937 56.7 19803. 4.24 7.18 11.0 NA -0.577
## 10 AAPL 2000-09-29 0.396 231. 8368. 1.85 7.18 11.0 NA -0.240
## # … with 7,640 more rows
In equations below: - \(t\) is time;
- \(n\) is stock;
- \(k\) is characteristic.
The predictive relationship is
\[r_{t+1,n} = a + \sum_{k=1}^Kb^kx^k_{t,n} + u_{t+1,n}\]
If we estimate this like a stacked regression, the esimtator is called “pooled”.
if(!require(broom)){install.packages("broom")} # Neat package for regression output
## Loading required package: broom
library(broom)
est_pooled <- plm(F_return ~ Vol_1M + Mkt_Cap + P2B + D2E, # Model
data = data, # Data source
model = "pooling") # Estimation method
summary(est_pooled) # Summary of output
## Pooling Model
##
## Call:
## plm(formula = F_return ~ Vol_1M + Mkt_Cap + P2B + D2E, data = data,
## model = "pooling")
##
## Balanced Panel: n = 30, T = 254, N = 7620
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -0.59128763 -0.04086125 0.00044276 0.04234380 1.26571898
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) 9.8501e-03 2.0040e-03 4.9152 9.052e-07 ***
## Vol_1M 2.8787e-05 4.9794e-05 0.5781 0.56320
## Mkt_Cap -1.1541e-08 6.1956e-09 -1.8627 0.06254 .
## P2B 4.2764e-06 1.4138e-05 0.3025 0.76229
## D2E -1.6050e-06 2.3055e-06 -0.6962 0.48633
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 50.639
## Residual Sum of Squares: 50.611
## R-Squared: 0.000561
## Adj. R-Squared: 3.6017e-05
## F-statistic: 1.06861 on 4 and 7615 DF, p-value: 0.37022
coef_pooled <- tidy(est_pooled) # Output to dataframe
rm(est_pooled)
If we allow the error term to be decomposed so that
\[r_{t+1,n} = a + \sum_{k=1}^Kb^kx^k_{t,n} + \mu_n+ e_{t+1,n},\]
then estimators are called within (fixed-effect estimation). This means that the model allows for stock-specific idiosyncrasies.
est_within <- plm(F_return ~ Vol_1M + Mkt_Cap + P2B + D2E,
data = data,
model = "within")
summary(est_within)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = F_return ~ Vol_1M + Mkt_Cap + P2B + D2E, data = data,
## model = "within")
##
## Balanced Panel: n = 30, T = 254, N = 7620
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -0.6136509 -0.0406315 0.0013567 0.0420280 1.2655448
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## Vol_1M -1.3733e-05 5.2289e-05 -0.2626 0.792843
## Mkt_Cap -2.4726e-08 8.1732e-09 -3.0253 0.002492 **
## P2B -6.7260e-06 1.6375e-05 -0.4107 0.681274
## D2E 5.1642e-07 2.8497e-06 0.1812 0.856201
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 50.461
## Residual Sum of Squares: 50.397
## R-Squared: 0.001266
## Adj. R-Squared: -0.0030786
## F-statistic: 2.40406 on 4 and 7586 DF, p-value: 0.047508
coef_within <- tidy(est_within)
Let’s plot the differences. There are some discrepancies!
We plot t-statistics to normalize scales.
bind_rows(
bind_cols(coef_within, type = "within"), # Create new type column
bind_cols(coef_pooled, type = "pooled") # Create new type column
) %>%
ggplot(aes(x = term, y = statistic, fill = type)) + # Plot
theme_light() + # White background
geom_col(position = "dodge", alpha = 0.8) +
scale_fill_manual(values = c("#77ABD9", "#FFAF43")) # Custom colors
The differences seem to matter! Except for Mkt_Cap which seem to be a driver in both types of estimations
If you want to extract the fixed effects, you can use the fixef() function.
fixef(est_within)
## AAPL BA BAC C CSCO CVS CVX D
## 0.0375688 0.0157350 0.0144934 0.0067987 0.0095592 0.0105258 0.0127779 0.0105631
## DIS F GE HD IBM INTC JNJ JPM
## 0.0140952 0.0082480 0.0074514 0.0127411 0.0090035 0.0128892 0.0139873 0.0152703
## K MCK MRK MSFT ORCL PFE PG T
## 0.0069580 0.0127352 0.0094126 0.0216019 0.0118950 0.0100482 0.0116584 0.0085960
## UNH UPS VZ WFC WMT XOM
## 0.0213596 0.0094590 0.0095485 0.0122815 0.0116318 0.0141969