7.3 Stepwise Regression Procedures
"Essentially, all models are wrong, but some are useful."
- George Box
In those occasional cases when the pool of potential predictor variables contains 30 to 40 or even more variables, use of a "best" subsets algorithm may not be feasible. An automatic search procedure that develops the "best" subset of $X$ variables sequentially may then be helpful.
The forward stepwise regression procedure is probably the most widely used of the automatic search methods. It was developed to economize on computational effort as compared with the various all-possible-regressions procedures. Essentially, this search method develops a sequence of regression models, at each step adding or deleting an $X$ variable. The criterion for adding or deleting an $X$ variable can be stated equivalently in terms of error sum of squares reduction, the $t^*$ statistic, the $F^*$ statistic, AIC, or BIC.
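For adding or deleting a single $X$ variable, these criteria are equivalent because each is a monotone function of the same reduction in the error sum of squares. For instance, if the larger model has $p$ parameters, the partial $F^*$ statistic for $X_k$ satisfies
$$F^* = \frac{SSE(\text{reduced}) - SSE(\text{full})}{SSE(\text{full})/(n - p)} = (t^*)^2,$$
where $t^*$ is the usual $t$ statistic for testing $\beta_k = 0$ in the larger model.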
An essential difference between stepwise procedures and the "best" subsets algorithm is
that stepwise search procedures end with the identification of a single regression model as
"best."
With the "best" subsets algorithm, on the other hand. several regression models can be identified as "good" for final consideration.
The identification of a single regression model as "best" by the stepwise procedures is a major weakness of these procedures.
Experience has shown that each of the stepwise search procedures can sometimes err by identifying a suboptimal regression model as "best."
In addition, the identification of a single regression model may hide the fact that several other regression models may also be "good."
Finally, the "goodness" of a regression model can only be established by a thorough examination using a variety of diagnostics.
What then can we do on those occasions when the pool of potential $X$ variables is very large and an automatic search procedure must be utilized? Basically, we should use the subset identified by the automatic search procedure as a starting point for searching for other "good" subsets.
One possibility is to treat the number of $X$ variables in the regression model identified by the automatic search procedure as being about the right subset size and then use the "best" subsets procedure for subsets of this and nearby sizes, as sketched below.
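As a minimal sketch of this strategy, assuming the leaps package and the surgical unit data analyzed below (where the stepwise search settles on six predictors), one could examine the best subsets of sizes near six:
library(leaps)
dat = read.table("http://www.jpstats.org/Regression/data/surgicalunit.txt", header=T)
dat$surv_days = NULL
#best subset of each size from 1 to 8 predictors
best = regsubsets(ln_surv_days ~ ., data=dat, nvmax=8)
#adjusted R^2 (Cp or BIC could be used instead) for the best subset of each size
summary(best)$adjr2
#which predictors appear in the best subset of each size
summary(best)$which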
With the "best" subsets algorithm, on the other hand. several regression models can be identified as "good" for final consideration.
The identification of a single regression model as "best" by the stepwise procedures is a major weakness of these procedures.
Experience has shown that each of the stepwise search procedures can sometimes err by identifying a suboptimal regression model as "best."
In addition, the identification of a single regression model may hide the fact that several other regression models may also be "good."
Finally, the "goodness" of a regression model can only be established by a thorough examination using a variety of diagnostics.
What then can we do on those occasions when the pool of potential X variables is very large and an automatic search procedure must be utilized? Basically, we should use the subset identified by the automatic search procedure as a starting point for searching for other "good" subsets.
One possibility is to treat the number of X variables in the regression model identified by the automatic search procedure as being about the right subset size and then use the "best" subsets procedure for subsets of this and nearby sizes.
We shall describe the forward stepwise regression search algorithm in terms of the AIC statistic.
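For a model with $p$ parameters and error sum of squares $SSE_p$ fit to $n$ cases, the AIC reported by R's step() (computed by extractAIC()) is, up to an additive constant,
$$AIC_p = n \ln\left(\frac{SSE_p}{n}\right) + 2p,$$
so a smaller AIC rewards a reduction in $SSE_p$ while penalizing each additional parameter.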
Note that the stepwise regression algorithm allows an X variable, brought into the model
at an earlier stage, to be dropped subsequently if it is no longer helpful in conjunction with
variables added at later stages.
- The stepwise regression routine first fits a simple linear regression model for each of the $P - 1$ potential $X$ variables. For each simple linear regression model, the AIC statistic is obtained. The $X$ variable with the smallest AIC is the candidate for first addition.
- Assume $X_7$ is the variable entered at step 1. The stepwise regression routine now fits all regression models with two $X$ variables, where $X_7$ is one of the pair. For each such regression model, the AIC is obtained. The $X$ variable whose addition yields the smallest AIC is the candidate for addition at the second stage. If this AIC is smaller than the AIC for the model in the previous step (in this case, the model with only $X_7$), that variable is added to the model that already contains $X_7$. Otherwise, the search terminates.
- Suppose $X_3$ is added at the second stage. Now the stepwise regression routine examines whether any of the other $X$ variables already in the model should be dropped. For our illustration, there is at this stage only one other $X$ variable in the model, $X_7$; at later stages, there would be a number of variables in the model besides the one last added. For each variable in the model, the AIC of the model with that variable removed is obtained. The variable whose removal yields the smallest AIC is the candidate for deletion. If this AIC is smaller than the AIC of the current model, the variable is dropped; otherwise, it is retained.
- Suppose $X_7$ is retained, so that both $X_3$ and $X_7$ are now in the model. The stepwise regression routine now examines which $X$ variable is the next candidate for addition, then examines whether any of the variables already in the model should now be dropped, and so on until no further $X$ variables can either be added or deleted, at which point the search terminates. A sketch of one such add/drop cycle in R follows this list.
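The add/drop cycle just described can be carried out one step at a time with base R's add1() and drop1() functions, which report the AIC for each candidate addition or deletion. Here is a minimal sketch using the surgical unit data from the example below:
dat = read.table("http://www.jpstats.org/Regression/data/surgicalunit.txt", header=T)
dat$surv_days = NULL
full = lm(ln_surv_days ~ ., data=dat)
#suppose enzyme_test entered at step 1
fit = lm(ln_surv_days ~ enzyme_test, data=dat)
#AIC of each candidate addition; the smallest is the candidate to add
add1(fit, scope = formula(full), k=2)
#AIC of each candidate deletion; a variable is dropped if removing it
#yields an AIC below that of the current model
drop1(fit, k=2)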
Other stepwise procedures are available to find a "best" subset of predictor variables. We
mention two of these.
The forward selection search procedure is a simplified version of
forward stepwise regression, omitting the test whether a variable once entered into the
model should be dropped.
The backward elimination search procedure is the opposite of
forward selection.
It begins with the model containing all potential $X$ variables and, for each variable, obtains the AIC of the model with that variable removed. If the smallest of these AICs is below the AIC of the full model, that $X$ variable is dropped.
The model with the remaining $P - 2$ $X$ variables is then fitted, and the next candidate for dropping is identified.
This process continues until no further $X$ variables can be dropped.
A stepwise modification can also be adopted that allows variables eliminated earlier to be added back later; this modification is called the backward stepwise regression procedure.
Let's examine the surgical unit data.
library(tidyverse)
library(car)
dat = read.table("http://www.jpstats.org/Regression/data/surgicalunit.txt", header=T)
dat$surv_days=NULL
#We will use the step function to do stepwise regression
#We start with the full model and the model with
#only the intercept
full = lm(ln_surv_days~., data=dat)
fit.none = lm(ln_surv_days~1, data=dat)
#forward stepwise
#note that trace=T allows you to see all
#the steps
fit.fs = step(fit.none, scope = formula(full),
              direction = "both", trace=T)
Start: AIC=-75.7
ln_surv_days ~ 1
Df Sum of Sq RSS AIC
+ enzyme_test 1 5.4762 7.3316 -103.827
+ liver_test 1 5.3990 7.4087 -103.262
+ prog_ind 1 2.8285 9.9792 -87.178
+ alc_heavy 1 1.7798 11.0279 -81.782
+ blood_clot 1 0.7763 12.0315 -77.079
+ gender 1 0.6897 12.1180 -76.692
<none> 12.8077 -75.703
+ age 1 0.2691 12.5386 -74.849
+ alc_mod 1 0.2052 12.6025 -74.575
Step: AIC=-103.83
ln_surv_days ~ enzyme_test
Df Sum of Sq RSS AIC
+ prog_ind 1 3.0191 4.3125 -130.483
+ liver_test 1 2.2019 5.1297 -121.113
+ blood_clot 1 1.5506 5.7810 -114.658
+ alc_heavy 1 1.1376 6.1940 -110.932
<none> 7.3316 -103.827
+ gender 1 0.2585 7.0730 -103.765
+ age 1 0.2388 7.0928 -103.615
+ alc_mod 1 0.0650 7.2666 -102.308
- enzyme_test 1 5.4762 12.8077 -75.703
Step: AIC=-130.48
ln_surv_days ~ enzyme_test + prog_ind
Df Sum of Sq RSS AIC
+ alc_heavy 1 1.4696 2.8429 -150.985
+ blood_clot 1 1.2040 3.1085 -146.161
+ liver_test 1 0.6984 3.6141 -138.023
+ alc_mod 1 0.2263 4.0862 -131.394
+ age 1 0.1646 4.1479 -130.585
<none> 4.3125 -130.483
+ gender 1 0.0824 4.2300 -129.526
- prog_ind 1 3.0191 7.3316 -103.827
- enzyme_test 1 5.6667 9.9792 -87.178
Step: AIC=-150.98
ln_surv_days ~ enzyme_test + prog_ind + alc_heavy
Df Sum of Sq RSS AIC
+ blood_clot 1 0.6641 2.1788 -163.351
+ liver_test 1 0.4663 2.3766 -158.659
+ gender 1 0.1374 2.7055 -151.660
<none> 2.8429 -150.985
+ age 1 0.0708 2.7721 -150.347
+ alc_mod 1 0.0246 2.8182 -149.455
- alc_heavy 1 1.4696 4.3125 -130.483
- prog_ind 1 3.3511 6.1940 -110.932
- enzyme_test 1 4.9456 7.7885 -98.562
Step: AIC=-163.35
ln_surv_days ~ enzyme_test + prog_ind + alc_heavy + blood_clot
Df Sum of Sq RSS AIC
+ gender 1 0.0968 2.0820 -163.805
<none> 2.1788 -163.351
+ age 1 0.0759 2.1029 -163.265
+ liver_test 1 0.0417 2.1371 -162.395
+ alc_mod 1 0.0229 2.1559 -161.923
- blood_clot 1 0.6641 2.8429 -150.985
- alc_heavy 1 0.9297 3.1085 -146.161
- prog_ind 1 2.9873 5.1661 -118.731
- enzyme_test 1 5.4513 7.6301 -97.671
Step: AIC=-163.81
ln_surv_days ~ enzyme_test + prog_ind + alc_heavy + blood_clot +
gender
Df Sum of Sq RSS AIC
+ age 1 0.0768 2.0052 -163.834
<none> 2.0820 -163.805
- gender 1 0.0968 2.1788 -163.351
+ alc_mod 1 0.0224 2.0596 -162.389
+ liver_test 1 0.0164 2.0656 -162.232
- blood_clot 1 0.6235 2.7055 -151.660
- alc_heavy 1 0.9745 3.0565 -145.072
- prog_ind 1 2.8268 4.9088 -119.490
- enzyme_test 1 5.0791 7.1611 -99.097
Step: AIC=-163.83
ln_surv_days ~ enzyme_test + prog_ind + alc_heavy + blood_clot +
gender + age
Df Sum of Sq RSS AIC
<none> 2.0052 -163.834
- age 1 0.0768 2.0820 -163.805
- gender 1 0.0977 2.1029 -163.265
+ alc_mod 1 0.0332 1.9720 -162.736
+ liver_test 1 0.0023 2.0029 -161.896
- blood_clot 1 0.6282 2.6335 -151.117
- alc_heavy 1 0.9002 2.9055 -145.809
- prog_ind 1 2.7626 4.7678 -119.064
- enzyme_test 1 5.0801 7.0853 -97.672
#backward stepwise
fit.bs = step(full, direction = "both", trace=T)
Start: AIC=-160.77
ln_surv_days ~ blood_clot + prog_ind + enzyme_test + liver_test +
age + gender + alc_mod + alc_heavy
Df Sum of Sq RSS AIC
- liver_test 1 0.00129 1.9720 -162.74
- alc_mod 1 0.03220 2.0029 -161.90
- age 1 0.07354 2.0443 -160.79
<none> 1.9707 -160.77
- gender 1 0.08415 2.0549 -160.51
- blood_clot 1 0.31809 2.2888 -154.69
- alc_heavy 1 0.84573 2.8165 -143.49
- prog_ind 1 2.09045 4.0612 -123.72
- enzyme_test 1 2.99085 4.9616 -112.91
Step: AIC=-162.74
ln_surv_days ~ blood_clot + prog_ind + enzyme_test + age + gender +
alc_mod + alc_heavy
Df Sum of Sq RSS AIC
- alc_mod 1 0.0332 2.0052 -163.834
<none> 1.9720 -162.736
- age 1 0.0876 2.0596 -162.389
- gender 1 0.0971 2.0691 -162.141
+ liver_test 1 0.0013 1.9707 -160.771
- blood_clot 1 0.6267 2.5988 -149.833
- alc_heavy 1 0.8446 2.8166 -145.486
- prog_ind 1 2.6731 4.6451 -118.471
- enzyme_test 1 5.0986 7.0706 -95.784
Step: AIC=-163.83
ln_surv_days ~ blood_clot + prog_ind + enzyme_test + age + gender +
alc_heavy
Df Sum of Sq RSS AIC
<none> 2.0052 -163.834
- age 1 0.0768 2.0820 -163.805
- gender 1 0.0977 2.1029 -163.265
+ alc_mod 1 0.0332 1.9720 -162.736
+ liver_test 1 0.0023 2.0029 -161.896
- blood_clot 1 0.6282 2.6335 -151.117
- alc_heavy 1 0.9002 2.9055 -145.809
- prog_ind 1 2.7626 4.7678 -119.064
- enzyme_test 1 5.0801 7.0853 -97.672
#forward selection
fit.fsel = step(fit.none, scope = formula(full),
                direction = "forward", trace=T)
Start: AIC=-75.7
ln_surv_days ~ 1
Df Sum of Sq RSS AIC
+ enzyme_test 1 5.4762 7.3316 -103.827
+ liver_test 1 5.3990 7.4087 -103.262
+ prog_ind 1 2.8285 9.9792 -87.178
+ alc_heavy 1 1.7798 11.0279 -81.782
+ blood_clot 1 0.7763 12.0315 -77.079
+ gender 1 0.6897 12.1180 -76.692
<none> 12.8077 -75.703
+ age 1 0.2691 12.5386 -74.849
+ alc_mod 1 0.2052 12.6025 -74.575
Step: AIC=-103.83
ln_surv_days ~ enzyme_test
Df Sum of Sq RSS AIC
+ prog_ind 1 3.01908 4.3125 -130.48
+ liver_test 1 2.20187 5.1297 -121.11
+ blood_clot 1 1.55061 5.7810 -114.66
+ alc_heavy 1 1.13756 6.1940 -110.93
<none> 7.3316 -103.83
+ gender 1 0.25854 7.0730 -103.77
+ age 1 0.23877 7.0928 -103.61
+ alc_mod 1 0.06498 7.2666 -102.31
Step: AIC=-130.48
ln_surv_days ~ enzyme_test + prog_ind
Df Sum of Sq RSS AIC
+ alc_heavy 1 1.46961 2.8429 -150.99
+ blood_clot 1 1.20395 3.1085 -146.16
+ liver_test 1 0.69836 3.6141 -138.02
+ alc_mod 1 0.22632 4.0862 -131.39
+ age 1 0.16461 4.1479 -130.59
<none> 4.3125 -130.48
+ gender 1 0.08245 4.2300 -129.53
Step: AIC=-150.98
ln_surv_days ~ enzyme_test + prog_ind + alc_heavy
Df Sum of Sq RSS AIC
+ blood_clot 1 0.66408 2.1788 -163.35
+ liver_test 1 0.46630 2.3766 -158.66
+ gender 1 0.13741 2.7055 -151.66
<none> 2.8429 -150.99
+ age 1 0.07081 2.7721 -150.35
+ alc_mod 1 0.02464 2.8182 -149.46
Step: AIC=-163.35
ln_surv_days ~ enzyme_test + prog_ind + alc_heavy + blood_clot
Df Sum of Sq RSS AIC
+ gender 1 0.096791 2.0820 -163.81
<none> 2.1788 -163.35
+ age 1 0.075876 2.1029 -163.26
+ liver_test 1 0.041701 2.1371 -162.40
+ alc_mod 1 0.022944 2.1559 -161.92
Step: AIC=-163.81
ln_surv_days ~ enzyme_test + prog_ind + alc_heavy + blood_clot +
gender
Df Sum of Sq RSS AIC
+ age 1 0.076782 2.0052 -163.83
<none> 2.0820 -163.81
+ alc_mod 1 0.022387 2.0596 -162.39
+ liver_test 1 0.016399 2.0656 -162.23
Step: AIC=-163.83
ln_surv_days ~ enzyme_test + prog_ind + alc_heavy + blood_clot +
gender + age
Df Sum of Sq RSS AIC
<none> 2.0052 -163.83
+ alc_mod 1 0.033193 1.9720 -162.74
+ liver_test 1 0.002284 2.0029 -161.90
#backward elimination
fit.belim = step(full, direction = "backward", trace=T)
Start: AIC=-160.77
ln_surv_days ~ blood_clot + prog_ind + enzyme_test + liver_test +
age + gender + alc_mod + alc_heavy
Df Sum of Sq RSS AIC
- liver_test 1 0.00129 1.9720 -162.74
- alc_mod 1 0.03220 2.0029 -161.90
- age 1 0.07354 2.0443 -160.79
<none> 1.9707 -160.77
- gender 1 0.08415 2.0549 -160.51
- blood_clot 1 0.31809 2.2888 -154.69
- alc_heavy 1 0.84573 2.8165 -143.49
- prog_ind 1 2.09045 4.0612 -123.72
- enzyme_test 1 2.99085 4.9616 -112.91
Step: AIC=-162.74
ln_surv_days ~ blood_clot + prog_ind + enzyme_test + age + gender +
alc_mod + alc_heavy
Df Sum of Sq RSS AIC
- alc_mod 1 0.0332 2.0052 -163.834
<none> 1.9720 -162.736
- age 1 0.0876 2.0596 -162.389
- gender 1 0.0971 2.0691 -162.141
- blood_clot 1 0.6267 2.5988 -149.833
- alc_heavy 1 0.8446 2.8166 -145.486
- prog_ind 1 2.6731 4.6451 -118.471
- enzyme_test 1 5.0986 7.0706 -95.784
Step: AIC=-163.83
ln_surv_days ~ blood_clot + prog_ind + enzyme_test + age + gender +
alc_heavy
Df Sum of Sq RSS AIC
<none> 2.0052 -163.834
- age 1 0.0768 2.0820 -163.805
- gender 1 0.0977 2.1029 -163.265
- blood_clot 1 0.6282 2.6335 -151.117
- alc_heavy 1 0.9002 2.9055 -145.809
- prog_ind 1 2.7626 4.7678 -119.064
- enzyme_test 1 5.0801 7.0853 -97.672
#Let's compare the four different "best" stepwise models
fit.fs %>% formula()
ln_surv_days ~ enzyme_test + prog_ind + alc_heavy + blood_clot +
gender + age
fit.bs %>% formula()
ln_surv_days ~ blood_clot + prog_ind + enzyme_test + age + gender +
alc_heavy
fit.fsel %>% formula()
ln_surv_days ~ enzyme_test + prog_ind + alc_heavy + blood_clot +
gender + age
fit.belim %>% formula()
ln_surv_days ~ blood_clot + prog_ind + enzyme_test + age + gender +
alc_heavy
#for this example, all four ended with the same model
#check for multicollinearity
fit.fs %>% vif
enzyme_test prog_ind alc_heavy blood_clot gender age
1.076909 1.037099 1.114506 1.108821 1.047745 1.016471
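#all VIFs are near 1, well below the common rule-of-thumb cutoff of 10,
#so multicollinearity does not appear to be a concern for this model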