7.3 Stepwise Regression Procedures

"Essentially, all models are wrong, but some are useful."
- George Box

7.3.1 Stepwise Regression

In those occasional cases when the pool of potential predictor variables contains 30 to 40 or even more variables, use of a "best" subsets algorithm may not be feasible.

An automatic search procedure that develops the "best" subset of $X$ variables sequentially may then be helpful.

The forward stepwise regression procedure is probably the most widely used of the automatic search methods.

It was developed to economize on computational efforts as compared with the various all-possible-regressions procedures. Essentially, this search method develops a sequence of regression models, at each step adding or deleting an $X$ variable.

The criterion for adding or deleting an $X$ variable can be stated equivalently in terms of error sum of squares reduction, $t^*$ statistic, $F^*$ statistic, AIC, or BIC.

Limitations of Stepwise Methods

An essential difference between stepwise procedures and the "best" subsets algorithm is that stepwise search procedures end with the identification of a single regression model as "best."

With the "best" subsets algorithm, on the other hand. several regression models can be identified as "good" for final consideration.

The identification of a single regression model as "best" by the stepwise procedures is a major weakness of these procedures.

Experience has shown that each of the stepwise search procedures can sometimes err by identifying a suboptimal regression model as "best."

In addition, the identification of a single regression model may hide the fact that several other regression models may also be "good."

Finally, the "goodness" of a regression model can only be established by a thorough examination using a variety of diagnostics.

What then can we do on those occasions when the pool of potential X variables is very large and an automatic search procedure must be utilized? Basically, we should use the subset identified by the automatic search procedure as a starting point for searching for other "good" subsets.

One possibility is to treat the number of X variables in the regression model identified by the automatic search procedure as being about the right subset size and then use the "best" subsets procedure for subsets of this and nearby sizes.

7.3.2 Forward Stepwise Regression

We shall describe the forward stepwise regression search algorithm in terms of the AIC statistic.

The stepwise regression routine first fits a simple linear regression model for each of the $P - 1$ potential $X$ variables.

For each simple linear regression model, the AIC statistic is obtained.

The X variable with the smallest AIC is the candidate for first addition.
Assume $X_7$ is the variable entered at step 1. The stepwise regression routine now fits all regression models with two $X$ variables, where X7 is one of the pair.

For each such regression model, the AIC corresponding to the newly added predictor $X_k$ is obtained.

The $X$ variable with the smallest AIC is the candidate for addition at the second stage.

If the AIC is smaller than AIC for model in the previous step (in this case, the model with only $X_7$), then that variable is added to the model that already has $X_7$. Otherwise, the program terminates.
Suppose $X_3$ is added at the second stage. Now the stepwise regression routine examines whether any of the other $X$ variables already in the model should be dropped.

For our illustration, there is at this stage only one other $X$ variable in the model, $X_7$. At later stages, there would be a number of variables in the model besides the one last added.

The variable with the largest AIC is the candidate for deletion. If this AIC exceeds the AIC for when the variable is not in the model, the variable is dropped from the model; otherwise, it is retained.
Suppose $X_7$ is retained so that both $X_3$ and $X_7$ are now in the model.

The stepwise regression routine now examines which $X$ variable is the next candidate for addition, then examines whether any of the variables already in the model should now be dropped, and so on until no further $X$ variables can either be added or deleted, at which point the search terminates.

Note that the stepwise regression algorithm allows an X variable, brought into the model at an earlier stage, to be dropped subsequently if it is no longer helpful in conjunction with variables added at later stages.

7.3.3 Other Stepwise Procedures

Other stepwise procedures are available to find a "best" subset of predictor variables. We mention two of these.

Forward Selection

The forward selection search procedure is a simplified version of forward stepwise regression, omitting the test whether a variable once entered into the model should be dropped.

Backward Elimination

The backward elimination search procedure is the opposite of forward selection.

It begins with the model containing all potential $X$ variables and identifies the one with the largest AIC. If the maximum AIX is greater than the full model, that $X$ variable is dropped.

The model with the remaining $P - 2$ $X$ variables is then fitted, and the next candidate for dropping is identified.

This process continues until no further $X$ variables can be dropped.

A stepwise modification can also be adapted that allows variables eliminated earlier to be added later: this modification is called the backward stepwise regression procedure.

Example 7.3.1

Let's examine the surgical unit data.

library(tidyverse)
library(car)

dat = read.table("http://www.jpstats.org/Regression/data/surgicalunit.txt", header=T)
dat$surv_days=NULL

#We will use the step function to do stepwise regression

#We start with the full model and the model with
#only the intercept
full = lm(ln_surv_days~., data=dat)
fit.none = lm(ln_surv_days~1, data=dat)

#forward stepwise
#note that trace=T allows you to see all 
#the steps
fit.fs = step(fit.none, scope =formula(full), 
              direction = "both", trace=T)

Start:  AIC=-75.7
ln_surv_days ~ 1

              Df Sum of Sq     RSS      AIC
+ enzyme_test  1    5.4762  7.3316 -103.827
+ liver_test   1    5.3990  7.4087 -103.262
+ prog_ind     1    2.8285  9.9792  -87.178
+ alc_heavy    1    1.7798 11.0279  -81.782
+ blood_clot   1    0.7763 12.0315  -77.079
+ gender       1    0.6897 12.1180  -76.692
                     12.8077  -75.703
+ age          1    0.2691 12.5386  -74.849
+ alc_mod      1    0.2052 12.6025  -74.575

Step:  AIC=-103.83
ln_surv_days ~ enzyme_test

              Df Sum of Sq     RSS      AIC
+ prog_ind     1    3.0191  4.3125 -130.483
+ liver_test   1    2.2019  5.1297 -121.113
+ blood_clot   1    1.5506  5.7810 -114.658
+ alc_heavy    1    1.1376  6.1940 -110.932
                      7.3316 -103.827
+ gender       1    0.2585  7.0730 -103.765
+ age          1    0.2388  7.0928 -103.615
+ alc_mod      1    0.0650  7.2666 -102.308
- enzyme_test  1    5.4762 12.8077  -75.703

Step:  AIC=-130.48
ln_surv_days ~ enzyme_test + prog_ind

              Df Sum of Sq    RSS      AIC
+ alc_heavy    1    1.4696 2.8429 -150.985
+ blood_clot   1    1.2040 3.1085 -146.161
+ liver_test   1    0.6984 3.6141 -138.023
+ alc_mod      1    0.2263 4.0862 -131.394
+ age          1    0.1646 4.1479 -130.585
                     4.3125 -130.483
+ gender       1    0.0824 4.2300 -129.526
- prog_ind     1    3.0191 7.3316 -103.827
- enzyme_test  1    5.6667 9.9792  -87.178

Step:  AIC=-150.98
ln_surv_days ~ enzyme_test + prog_ind + alc_heavy

              Df Sum of Sq    RSS      AIC
+ blood_clot   1    0.6641 2.1788 -163.351
+ liver_test   1    0.4663 2.3766 -158.659
+ gender       1    0.1374 2.7055 -151.660
                     2.8429 -150.985
+ age          1    0.0708 2.7721 -150.347
+ alc_mod      1    0.0246 2.8182 -149.455
- alc_heavy    1    1.4696 4.3125 -130.483
- prog_ind     1    3.3511 6.1940 -110.932
- enzyme_test  1    4.9456 7.7885  -98.562

Step:  AIC=-163.35
ln_surv_days ~ enzyme_test + prog_ind + alc_heavy + blood_clot

              Df Sum of Sq    RSS      AIC
+ gender       1    0.0968 2.0820 -163.805
                     2.1788 -163.351
+ age          1    0.0759 2.1029 -163.265
+ liver_test   1    0.0417 2.1371 -162.395
+ alc_mod      1    0.0229 2.1559 -161.923
- blood_clot   1    0.6641 2.8429 -150.985
- alc_heavy    1    0.9297 3.1085 -146.161
- prog_ind     1    2.9873 5.1661 -118.731
- enzyme_test  1    5.4513 7.6301  -97.671

Step:  AIC=-163.81
ln_surv_days ~ enzyme_test + prog_ind + alc_heavy + blood_clot + 
    gender

              Df Sum of Sq    RSS      AIC
+ age          1    0.0768 2.0052 -163.834
                     2.0820 -163.805
- gender       1    0.0968 2.1788 -163.351
+ alc_mod      1    0.0224 2.0596 -162.389
+ liver_test   1    0.0164 2.0656 -162.232
- blood_clot   1    0.6235 2.7055 -151.660
- alc_heavy    1    0.9745 3.0565 -145.072
- prog_ind     1    2.8268 4.9088 -119.490
- enzyme_test  1    5.0791 7.1611  -99.097

Step:  AIC=-163.83
ln_surv_days ~ enzyme_test + prog_ind + alc_heavy + blood_clot + 
    gender + age

              Df Sum of Sq    RSS      AIC
                     2.0052 -163.834
- age          1    0.0768 2.0820 -163.805
- gender       1    0.0977 2.1029 -163.265
+ alc_mod      1    0.0332 1.9720 -162.736
+ liver_test   1    0.0023 2.0029 -161.896
- blood_clot   1    0.6282 2.6335 -151.117
- alc_heavy    1    0.9002 2.9055 -145.809
- prog_ind     1    2.7626 4.7678 -119.064
- enzyme_test  1    5.0801 7.0853  -97.672


#backward stepwise
fit.bs = step(full, direction = "both", trace=T)

Start:  AIC=-160.77
ln_surv_days ~ blood_clot + prog_ind + enzyme_test + liver_test + 
    age + gender + alc_mod + alc_heavy

              Df Sum of Sq    RSS     AIC
- liver_test   1   0.00129 1.9720 -162.74
- alc_mod      1   0.03220 2.0029 -161.90
- age          1   0.07354 2.0443 -160.79
                     1.9707 -160.77
- gender       1   0.08415 2.0549 -160.51
- blood_clot   1   0.31809 2.2888 -154.69
- alc_heavy    1   0.84573 2.8165 -143.49
- prog_ind     1   2.09045 4.0612 -123.72
- enzyme_test  1   2.99085 4.9616 -112.91

Step:  AIC=-162.74
ln_surv_days ~ blood_clot + prog_ind + enzyme_test + age + gender + 
    alc_mod + alc_heavy

              Df Sum of Sq    RSS      AIC
- alc_mod      1    0.0332 2.0052 -163.834
                     1.9720 -162.736
- age          1    0.0876 2.0596 -162.389
- gender       1    0.0971 2.0691 -162.141
+ liver_test   1    0.0013 1.9707 -160.771
- blood_clot   1    0.6267 2.5988 -149.833
- alc_heavy    1    0.8446 2.8166 -145.486
- prog_ind     1    2.6731 4.6451 -118.471
- enzyme_test  1    5.0986 7.0706  -95.784

Step:  AIC=-163.83
ln_surv_days ~ blood_clot + prog_ind + enzyme_test + age + gender + 
    alc_heavy

              Df Sum of Sq    RSS      AIC
                     2.0052 -163.834
- age          1    0.0768 2.0820 -163.805
- gender       1    0.0977 2.1029 -163.265
+ alc_mod      1    0.0332 1.9720 -162.736
+ liver_test   1    0.0023 2.0029 -161.896
- blood_clot   1    0.6282 2.6335 -151.117
- alc_heavy    1    0.9002 2.9055 -145.809
- prog_ind     1    2.7626 4.7678 -119.064
- enzyme_test  1    5.0801 7.0853  -97.672


#forward selection
fit.fsel = step(fit.none, scope =formula(full), 
                direction = "forward", trace=T)

Start:  AIC=-75.7
ln_surv_days ~ 1

              Df Sum of Sq     RSS      AIC
+ enzyme_test  1    5.4762  7.3316 -103.827
+ liver_test   1    5.3990  7.4087 -103.262
+ prog_ind     1    2.8285  9.9792  -87.178
+ alc_heavy    1    1.7798 11.0279  -81.782
+ blood_clot   1    0.7763 12.0315  -77.079
+ gender       1    0.6897 12.1180  -76.692
                     12.8077  -75.703
+ age          1    0.2691 12.5386  -74.849
+ alc_mod      1    0.2052 12.6025  -74.575

Step:  AIC=-103.83
ln_surv_days ~ enzyme_test

             Df Sum of Sq    RSS     AIC
+ prog_ind    1   3.01908 4.3125 -130.48
+ liver_test  1   2.20187 5.1297 -121.11
+ blood_clot  1   1.55061 5.7810 -114.66
+ alc_heavy   1   1.13756 6.1940 -110.93
                    7.3316 -103.83
+ gender      1   0.25854 7.0730 -103.77
+ age         1   0.23877 7.0928 -103.61
+ alc_mod     1   0.06498 7.2666 -102.31

Step:  AIC=-130.48
ln_surv_days ~ enzyme_test + prog_ind

             Df Sum of Sq    RSS     AIC
+ alc_heavy   1   1.46961 2.8429 -150.99
+ blood_clot  1   1.20395 3.1085 -146.16
+ liver_test  1   0.69836 3.6141 -138.02
+ alc_mod     1   0.22632 4.0862 -131.39
+ age         1   0.16461 4.1479 -130.59
                    4.3125 -130.48
+ gender      1   0.08245 4.2300 -129.53

Step:  AIC=-150.98
ln_surv_days ~ enzyme_test + prog_ind + alc_heavy

             Df Sum of Sq    RSS     AIC
+ blood_clot  1   0.66408 2.1788 -163.35
+ liver_test  1   0.46630 2.3766 -158.66
+ gender      1   0.13741 2.7055 -151.66
                    2.8429 -150.99
+ age         1   0.07081 2.7721 -150.35
+ alc_mod     1   0.02464 2.8182 -149.46

Step:  AIC=-163.35
ln_surv_days ~ enzyme_test + prog_ind + alc_heavy + blood_clot

             Df Sum of Sq    RSS     AIC
+ gender      1  0.096791 2.0820 -163.81
                    2.1788 -163.35
+ age         1  0.075876 2.1029 -163.26
+ liver_test  1  0.041701 2.1371 -162.40
+ alc_mod     1  0.022944 2.1559 -161.92

Step:  AIC=-163.81
ln_surv_days ~ enzyme_test + prog_ind + alc_heavy + blood_clot + 
    gender

             Df Sum of Sq    RSS     AIC
+ age         1  0.076782 2.0052 -163.83
                    2.0820 -163.81
+ alc_mod     1  0.022387 2.0596 -162.39
+ liver_test  1  0.016399 2.0656 -162.23

Step:  AIC=-163.83
ln_surv_days ~ enzyme_test + prog_ind + alc_heavy + blood_clot + 
    gender + age

             Df Sum of Sq    RSS     AIC
                    2.0052 -163.83
+ alc_mod     1  0.033193 1.9720 -162.74
+ liver_test  1  0.002284 2.0029 -161.90


#backward elemination
fit.belim = step(full, direction = "backward",  trace=T)

Start:  AIC=-160.77
ln_surv_days ~ blood_clot + prog_ind + enzyme_test + liver_test + 
    age + gender + alc_mod + alc_heavy

              Df Sum of Sq    RSS     AIC
- liver_test   1   0.00129 1.9720 -162.74
- alc_mod      1   0.03220 2.0029 -161.90
- age          1   0.07354 2.0443 -160.79
                     1.9707 -160.77
- gender       1   0.08415 2.0549 -160.51
- blood_clot   1   0.31809 2.2888 -154.69
- alc_heavy    1   0.84573 2.8165 -143.49
- prog_ind     1   2.09045 4.0612 -123.72
- enzyme_test  1   2.99085 4.9616 -112.91

Step:  AIC=-162.74
ln_surv_days ~ blood_clot + prog_ind + enzyme_test + age + gender + 
    alc_mod + alc_heavy

              Df Sum of Sq    RSS      AIC
- alc_mod      1    0.0332 2.0052 -163.834
                     1.9720 -162.736
- age          1    0.0876 2.0596 -162.389
- gender       1    0.0971 2.0691 -162.141
- blood_clot   1    0.6267 2.5988 -149.833
- alc_heavy    1    0.8446 2.8166 -145.486
- prog_ind     1    2.6731 4.6451 -118.471
- enzyme_test  1    5.0986 7.0706  -95.784

Step:  AIC=-163.83
ln_surv_days ~ blood_clot + prog_ind + enzyme_test + age + gender + 
    alc_heavy

              Df Sum of Sq    RSS      AIC
                     2.0052 -163.834
- age          1    0.0768 2.0820 -163.805
- gender       1    0.0977 2.1029 -163.265
- blood_clot   1    0.6282 2.6335 -151.117
- alc_heavy    1    0.9002 2.9055 -145.809
- prog_ind     1    2.7626 4.7678 -119.064
- enzyme_test  1    5.0801 7.0853  -97.672



#Let's compare the four different "best" stepwise models
fit.fs %>% formula()

ln_surv_days ~ enzyme_test + prog_ind + alc_heavy + blood_clot + 
    gender + age

fit.bs %>% formula()

ln_surv_days ~ blood_clot + prog_ind + enzyme_test + age + gender + 
    alc_heavy

fit.fsel %>% formula()

ln_surv_days ~ enzyme_test + prog_ind + alc_heavy + blood_clot + 
    gender + age

fit.belim %>% formula()

ln_surv_days ~ blood_clot + prog_ind + enzyme_test + age + gender + 
    alc_heavy


#for this example, all four ended with the same model

#check for multicollinearity
fit.fs %>% vif

enzyme_test    prog_ind   alc_heavy  blood_clot      gender         age 
   1.076909    1.037099    1.114506    1.108821    1.047745    1.016471

« 7.2: Best Subsets Procedure 7.4: Lasso »