Activity14

Regression with Lagged Variables & Multivariate EDA

Objective

Improve seasonal forecasting models using:

  1. Autoregressive components via lagged variables (\(y_{t-1}, y_{t-2}\))
  2. Multivariate diagnostics using correlation matrices and pairwise plots
  • Lagged variables capture serial dependence: \(E[y_t | y_{t-1}]\)
  • Fourier terms model periodic patterns: \(S_t = \sum_{k=1}^K [\alpha_k\sin(2\pi kt/m) + \beta_k\cos(2\pi kt/m)]\)
  • Multicollinearity detection via pairwise correlations (|r| > 0.8 indicates potential issues)

Focus

  • Extend the NYSE example by including lagged volume values.
  • Visualize relationships between the current volume, its lagged values, and Fourier terms.

Step-by-Step Tutorial

1. Data Preparation & Lag Creation

# Get NYSE Composite index data (volume) from tidyquant and convert to tsibble
nyse <- tq_get("^NYA", from = "2021-01-01", to = Sys.Date(), get = "stock.prices") %>% 
  as_tsibble(index = date) %>% 
  fill_gaps() %>% 
  mutate(volume = na.approx(volume),
         adjusted = na.approx(adjusted),
         logVolume = log(volume)) %>% 
  select(date, volume, logVolume, adjusted)

# Plot volume series
nyse %>% 
  autoplot(volume) +
  labs(title = "NYSE Composite Index Volume", y = "Volume", x = "Date")

# Create extended lag structure (4 weeks backward)
nyse <- nyse %>% 
  mutate(volume_lag1 = lag(volume, 1),
         volume_lag2 = lag(volume, 2)) %>% 
  drop_na()

2. Regression with Lagged Variables & Fourier Terms

# Fit a model including Fourier terms and lagged volume predictors
nyse_lag_model <- nyse %>% 
  model(
    LagModel = TSLM(volume ~ fourier(K = 2) + volume_lag1 + volume_lag2)
  )

report(nyse_lag_model) %>% 
glance(nyse_lag_model) %>% knitr::kable()
Series: volume 
Model: TSLM 

Residuals:
       Min         1Q     Median         3Q        Max 
-2.017e+09 -2.322e+08 -4.401e+07  1.557e+08  4.654e+09 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         9.120e+08  7.290e+07  12.511  < 2e-16 ***
fourier(K = 2)C1_7  8.649e+07  2.048e+07   4.223 2.55e-05 ***
fourier(K = 2)S1_7  1.040e+07  2.049e+07   0.508   0.6118    
fourier(K = 2)C2_7 -3.684e+07  2.044e+07  -1.803   0.0716 .  
fourier(K = 2)S2_7  5.220e+07  2.042e+07   2.556   0.0107 *  
volume_lag1         7.802e-01  2.549e-02  30.605  < 2e-16 ***
volume_lag2         6.441e-03  2.549e-02   0.253   0.8006    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 566900000 on 1539 degrees of freedom
Multiple R-squared: 0.6194, Adjusted R-squared: 0.6179
F-statistic: 417.4 on 6 and 1539 DF, p-value: < 2.22e-16
.model r_squared adj_r_squared sigma2 statistic p_value df log_lik AIC AICc BIC CV deviance df.residual rank
LagModel 0.6193707 0.6178867 3.213785e+17 417.384 0 7 -33350.88 62330.4 62330.5 62373.15 3.234206e+17 4.946016e+20 1539 7
# Residual diagnostics
nyse_lag_model %>% 
  gg_tsresiduals() +
  labs(title = "Residual Diagnostics: NYSE Lag Model")

3. Multivariate EDA: ggpairs Plot

# Prepare a multivariate dataset for ggpairs visualization
nyse_multi <- nyse %>% 
  mutate(trend = row_number()) %>% 
  select(trend, volume, volume_lag1, volume_lag2) %>% 
  as_tibble()

library(GGally)
ggpairs(nyse_multi, 
        lower = list(continuous = wrap("smooth", alpha = 0.3)),
        title = "Multivariate Relationships with Trend Component")

Lab Activity

Task: Experiment with including additional lags (e.g., lag 3 or lag 4) and higher order Fourier Harmonics in your model. Use ggpairs to visualize how these lagged variables relate to the current volume and discuss potential multicollinearity issues.

Solution:

  1. Model Comparison:
nyse <- nyse %>% 
  mutate(volume_lag3 = lag(volume, 3),
         volume_lag4 = lag(volume, 4)) %>% 
  drop_na()

nyse %>% 
  model(
    Base = TSLM(volume ~ fourier(K = 2)),
    Extended = TSLM(volume ~ fourier(K = 3) + volume_lag1 + volume_lag2),
    Extended1 = TSLM(volume ~ fourier(K = 3) + volume_lag1 + volume_lag2 + volume_lag3 + volume_lag4)
  ) %>% 
  glance() %>%  # Compare AIC/BIC
  select(.model, adj_r_squared, AIC, AICc, BIC)
# A tibble: 3 × 5
  .model    adj_r_squared    AIC   AICc    BIC
  <chr>             <dbl>  <dbl>  <dbl>  <dbl>
1 Base            0.00569 63641. 63641. 63673.
2 Extended        0.621   62159. 62159. 62213.
3 Extended1       0.627   62136. 62136. 62200.
# Fit the better mode
nyse_final <- nyse %>% 
  model(
    BestModel = TSLM(volume ~ fourier(K = 3) + volume_lag1 + volume_lag2 + volume_lag3 + volume_lag4)
  )
  
nyse_final %>% gg_tsresiduals()

nyse_final %>% 
  augment() %>% 
  ggplot(aes(x = .fitted, y = volume)) + 
  geom_hex() +  # Density-aware plotting
  geom_abline(slope = 1) +
  labs(title = "Actual vs Fitted Values with Density Gradient")