Multiple regression models the relationship between a dependent variable and several independent variables. In many real-world scenarios the predictors are correlated with one another, a condition known as multicollinearity. High multicollinearity inflates the variance of the coefficient estimates, making them unstable and difficult to interpret. A common diagnostic is the Variance Inflation Factor (VIF). For a regression model
\[
y = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + \varepsilon,
\]
the VIF for predictor \(X_i\) is computed as:
\[
\text{VIF}(X_i) = \frac{1}{1 - R_i^2},
\]
where \(R_i^2\) is the coefficient of determination obtained when \(X_i\) is regressed on the other predictors. A VIF of 1 means \(X_i\) is uncorrelated with the other predictors; values greater than 5 (or 10, depending on context) indicate potentially problematic collinearity.
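The formula can be applied by hand to see what a VIF function computes. The following is a minimal sketch in base R, assuming the built-in mtcars data and three arbitrarily chosen columns (neither appears in this section); it fits one auxiliary regression per predictor and applies \(1 / (1 - R_i^2)\).

```r
# Manual VIF: regress each predictor on the others, then apply 1 / (1 - R^2).
# mtcars and the chosen columns are illustrative assumptions, not from the text.
predictors <- mtcars[, c("disp", "hp", "wt")]

manual_vif <- sapply(names(predictors), function(x_i) {
  others  <- setdiff(names(predictors), x_i)
  # Auxiliary regression of X_i on the remaining predictors
  aux_fit <- lm(reformulate(others, response = x_i), data = predictors)
  r2_i    <- summary(aux_fit)$r.squared
  1 / (1 - r2_i)
})

manual_vif
```

For a model with only numeric predictors, these values should agree with what car::vif() reports for any lm fit that uses the same predictors, because the VIF depends only on the predictors and not on the response.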
Real-World Examples
1. Economy Example (US Macro Data)
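Dataset: us_change (available in the fpp3 package)
Model: \(\text{Consumption}_t = \beta_0 + \beta_1\,\text{Income}_t + \beta_2\,\text{Production}_t + \varepsilon_t\)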
In this example, we explore how Income and Production drive Consumption. We then use VIF to diagnose multicollinearity.
library(fpp3)  # us_change, TSLM(), and the pipe
library(car)   # vif()

# Fit the TSLM model on us_change data
model_macro <- us_change %>%
  model(tslm_macro = TSLM(Consumption ~ Income + Production))

# Display model coefficients
model_macro %>%
  tidy(tslm_macro) %>%
  knitr::kable(caption = 'US Macro Model Coefficients')
US Macro Model Coefficients

|.model     |term        |  estimate| std.error| statistic|  p.value|
|:----------|:-----------|---------:|---------:|---------:|--------:|
|tslm_macro |(Intercept) | 0.5106790| 0.0477604| 10.692528| 0.00e+00|
|tslm_macro |Income      | 0.1842811| 0.0427041|  4.315301| 2.53e-05|
|tslm_macro |Production  | 0.1925038| 0.0252739|  7.616697| 0.00e+00|
# Compute VIF values using a standard lm object
macro_lm <- lm(Consumption ~ Income + Production, data = us_change)
vif_macro <- car::vif(macro_lm)
vif_macro
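Because this model has only two predictors, the two VIF values returned by car::vif() are necessarily identical. A short worked step shows why, writing \(r\) for the sample correlation between Income and Production (a symbol introduced here for illustration). In a regression of one variable on a single other variable, \(R^2\) equals the squared correlation, so:
\[
R_{\text{Income}}^2 = R_{\text{Production}}^2 = r^2
\quad\Longrightarrow\quad
\text{VIF}(\text{Income}) = \text{VIF}(\text{Production}) = \frac{1}{1 - r^2}.
\]
As a hypothetical illustration (not a value computed from us_change), \(r = 0.5\) would give \(\text{VIF} = 1 / (1 - 0.25) \approx 1.33\).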
2. COVID-19 Example (OxCGRT Policy Data)
Here, we investigate how COVID-19 deaths and policy stringency relate to the growth of confirmed cases, log-transforming the case counts to account for exponential growth. VIF is then computed to check for multicollinearity between the predictors.
library(readr)

url <- "https://github.com/OxCGRT/covid-policy-dataset/raw/main/data/OxCGRT_compact_national_v1.csv"
oxcgrt <- read_csv(url)

# Filter for the United States and create a log-transformed 'ConfirmedCases' variable
oxcgrt_us <- oxcgrt %>%
  filter(CountryName == 'United States') %>%
  mutate(Date = ymd(Date),
         log_ConfirmedCases = log(1 + ConfirmedCases)) %>%
  as_tsibble(index = Date) %>%
  drop_na(log_ConfirmedCases, ConfirmedDeaths, StringencyIndex_Average)

# Fit the TSLM model
model_covid <- oxcgrt_us %>%
  model(tslm_covid = TSLM(log_ConfirmedCases ~ ConfirmedDeaths + StringencyIndex_Average))

# Display model coefficients
model_covid %>%
  tidy(tslm_covid) %>%
  knitr::kable(caption = 'COVID Model Coefficients')
COVID Model Coefficients

|.model     |term                    |  estimate| std.error| statistic| p.value|
|:----------|:-----------------------|---------:|---------:|---------:|-------:|
|tslm_covid |(Intercept)             | 1.7724524| 0.0956172|  18.53696|       0|
|tslm_covid |ConfirmedDeaths         | 0.0000113| 0.0000001| 156.84617|       0|
|tslm_covid |StringencyIndex_Average | 0.1578871| 0.0013481| 117.12148|       0|
# Compute VIF values using a standard lm object (car::vif works with lm)
covid_lm <- lm(log_ConfirmedCases ~ ConfirmedDeaths + StringencyIndex_Average,
               data = as_tibble(oxcgrt_us))
vif_covid <- car::vif(covid_lm)
vif_covid
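The log(1 + x) transform applied to ConfirmedCases in this example can also be written with base R's log1p(). The sketch below uses made-up counts growing by 20% per step (not OxCGRT data) to illustrate why the transform suits exponential growth: on the log scale, the step-to-step increments settle toward a constant.

```r
# Hypothetical counts growing by 20% per day (illustrative, not OxCGRT data)
days  <- 0:30
cases <- 10 * 1.2^days

# log1p(x) computes log(1 + x); it is also well defined when x is 0
log_cases <- log1p(cases)

# Day-over-day increments on the log scale approach log(1.2)
increments <- diff(log_cases)
round(range(increments), 3)
log(1.2)
```

A series that is roughly linear after this transform is a more natural response for a linear model such as TSLM than the raw counts.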
Lab Activity 1: Assessing Multicollinearity in US Macro Data
Prompt:
Using the us_change dataset, fit a TSLM model with Consumption as the response and all of the remaining variables as predictors.
Compute the VIF for each predictor using the car::vif() function.
Interpret the VIF values and discuss whether there is evidence of multicollinearity.
Solution:
# Fit the TSLM model on us_change data
model_macro_lab <- us_change %>%
  model(tslm_macro_lab = TSLM(Consumption ~ .))

# Refit as an lm object for VIF computation
# (with lm(), `.` also pulls in the Quarter index, so it appears in the output)
macro_lm_lab <- lm(Consumption ~ ., data = us_change)
vif_macro_lab <- car::vif(macro_lm_lab)
vif_macro_lab
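     Quarter       Income   Production      Savings Unemployment
    1.118319     2.713012     2.709416     2.533910     2.784650
All VIF values are well below 5, so there is no evidence of problematic multicollinearity among these predictors.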
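The interpretation step can be made mechanical by comparing each VIF against the rule-of-thumb cutoffs of 5 and 10 mentioned earlier. The following sketch assumes a small helper, flag_vif(), which is hypothetical and not part of car or fpp3; the VIF values are copied from the output above.

```r
# VIF values copied from the printed output above
vif_macro_lab <- c(Quarter = 1.118319, Income = 2.713012,
                   Production = 2.709416, Savings = 2.533910,
                   Unemployment = 2.784650)

# flag_vif() is a hypothetical helper; cutoffs 5 and 10 follow the rule of thumb
flag_vif <- function(vif, moderate = 5, severe = 10) {
  data.frame(
    predictor = names(vif),
    vif       = unname(vif),
    # (-Inf, 5] -> "ok", (5, 10] -> "check", (10, Inf) -> "problematic"
    flag      = cut(unname(vif),
                    breaks = c(-Inf, moderate, severe, Inf),
                    labels = c("ok", "check", "problematic"))
  )
}

flag_vif(vif_macro_lab)
```

The cutoffs are arguments rather than constants because, as noted earlier, the appropriate threshold depends on context.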
Lab Activity 2: Exploring COVID Policy Measures and Their Effects
Prompt:
Using the oxcgrt dataset for the United States, create a new variable log_ConfirmedDeaths as \(\log(1 + \text{ConfirmedDeaths})\).
Fit a TSLM model with ConfirmedCases, StringencyIndex_Average, and `C6M_Stay at home requirements` as predictors of log_ConfirmedDeaths.
Compute the VIF for the predictors and interpret the results in terms of potential collinearity.
Solution:
# Filter and transform the data
oxcgrt_lab <- oxcgrt %>%
  filter(CountryName == 'United States') %>%
  mutate(Date = ymd(Date),
         log_ConfirmedDeaths = log(1 + ConfirmedDeaths)) %>%
  as_tsibble(index = Date) %>%
  drop_na(ConfirmedCases, log_ConfirmedDeaths, StringencyIndex_Average)

# Fit the TSLM model
model_covid_lab <- oxcgrt_lab %>%
  model(tslm_covid_lab = TSLM(log_ConfirmedDeaths ~ ConfirmedCases +
                                StringencyIndex_Average +
                                `C6M_Stay at home requirements`))

# Refit as an lm object for VIF computation
covid_lm_lab <- lm(log_ConfirmedDeaths ~ ConfirmedCases +
                     StringencyIndex_Average +
                     `C6M_Stay at home requirements`,
                   data = as_tibble(oxcgrt_lab))
vif_covid_lab <- car::vif(covid_lm_lab)
vif_covid_lab
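                 ConfirmedCases         StringencyIndex_Average
                       1.466983                        4.000390
`C6M_Stay at home requirements`
                       3.786874
All three VIF values are below 5, so collinearity is not severe by the usual rule of thumb. StringencyIndex_Average (4.00) and `C6M_Stay at home requirements` (3.79) are the highest, which is plausible because stay-at-home requirements are one of the components of the stringency index.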
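One way to read these numbers: the square root of a VIF is the factor by which a coefficient's standard error grows, relative to a predictor with the same spread but no correlation with the others. A VIF near 4 therefore roughly doubles the standard error. The sketch below checks that relationship on simulated data; the variables x1, x2, and y and all numeric settings are made up for illustration and are unrelated to the OxCGRT data.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)  # built to correlate strongly with x1
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# VIF for x1 from the auxiliary regression of x1 on x2
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)

# Standard error of the x1 coefficient as reported by lm()
se_reported <- coef(summary(fit))["x1", "Std. Error"]

# Standard error implied by the same residual variance and the same
# spread in x1, but without the 1 / (1 - R^2) inflation term
se_baseline <- summary(fit)$sigma / sqrt(sum((x1 - mean(x1))^2))

c(vif = vif_x1,
  se_ratio = se_reported / se_baseline,
  sqrt_vif = sqrt(vif_x1))
```

The se_ratio and sqrt_vif entries should match, since the variance of the least-squares coefficient for x1 is the residual variance divided by the sum of squares of x1 times \(1 - R_1^2\).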