5.1 Multicollinearity and Its Effects
"I am fast. To give you a reference point I am somewhere between a snake and a mongoose… And a panther." - Dwight Schrute
Recall from Section 4.1.1 that correlation among the predictor variables may cause problems in multiple regression analysis.
The ability to obtain a good fit, to make inferences on the mean response, or to predict the response is not affected by multicollinearity. However, inferences for the coefficients (the $\beta$s) and for the model variance ($\sigma^2$) are affected by large correlations among the predictor variables.
Another effect of multicollinearity is the interpretation of the estimated coefficients. In multiple regression, we interpret the coefficient as the average change in $Y$ when $X$ is increased by one unit when all other predictor variables are held constant. If $X$ is highly correlated with one or more of the other predictor variables, then it may not be feasible to think of varying $X$ when the others are constant.
We can see evidence of multicollinearity by examining the scatterplot matrix, since this gives us a plot of each pair of predictor variables. If some pairs appear to be highly correlated (ggpairs will give the correlation values as well), then multicollinearity is likely present.
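As a minimal sketch of this check (assuming the GGally package, which provides ggpairs, is installed), we might plot the predictor variables from the UN98 data used later in this section:
library(car)     # provides the UN98 data used in Example 5.1.1
library(GGally)  # provides ggpairs(); assumed to be installed
preds = UN98
preds$region = NULL           # drop the variables not used in this example
preds$GDPperCapita = NULL
preds$infantMortality = NULL  # keep only the predictor variables
#scatterplot matrix of every pair of predictors; ggpairs also reports
#the pairwise correlations
ggpairs(na.omit(preds))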
We could also examine $R_{a}^{2}$ for models with and without certain pairs of variables. If $R_{a}^{2}$ decreases when a particular $X$ variable is added but it appears to have a strong linear relationship with $Y$ in the scatterplot matrix, then this is evidence of multicollinearity.
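As a hedged sketch of this comparison (using the UN98 data analyzed below, with lifeFemale chosen purely for illustration), we might compare the adjusted coefficients of determination directly:
library(car)  # provides the UN98 data
dat = UN98
dat$region = NULL
dat$GDPperCapita = NULL
dat = na.omit(dat)  # use the same complete cases in both fits
#fit the model with and without lifeFemale
fit_without = lm(infantMortality ~ . - lifeFemale, data = dat)
fit_with    = lm(infantMortality ~ ., data = dat)
#a drop (or negligible gain) in adjusted R^2 when the variable is added,
#despite a strong marginal relationship with Y, suggests multicollinearity
summary(fit_without)$adj.r.squared
summary(fit_with)$adj.r.squared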
A more convenient way to examine multicollinearity is through the use of the variance inflation factor (VIF).
Each predictor variable will have a VIF. Suppose we are interested in the VIF for $X_{1}$. We start by regressing $X_{1}$ on all the other predictor variables. Thus, we fit the model \begin{align*} X_{i1} & =\alpha_{0}+\alpha_{2}X_{i2}+\alpha_{3}X_{i3}+\cdots+\alpha_{p-1}X_{i,p-1}+\epsilon \end{align*} where the $\alpha$'s are the coefficients and $\epsilon$ is the random error term.
Now find the coefficient of multiple determination for this model which we will denote as $R_{1}^{2}$. The VIF for $X_{1}$ is then \begin{align*} VIF_{1} & =\frac{1}{1-R_{1}^{2}}. \end{align*} We can do this for any $i$th predictor variable so that the VIF for that variable is \begin{align*} VIF_{i} & =\frac{1}{1-R_{i}^{2}}\qquad(5.1) \end{align*} where $R_{i}^{2}$ is the coefficient of multiple determination for the regression fit of $X_{i}$ on all the other predictor variables.
A rule of thumb is that a VIF greater than 10 is evidence that multicollinearity is high when that variable is added to the model. Some use a cutoff of 5 instead of 10.
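As a hedged illustration of (5.1) (using the bodyfat data from Example 5.1.2 below), we can compute a VIF by hand and compare it to the value reported by vif() from the car library:
library(car)
dat = read.table("http://www.jpstats.org/Regression/data/BodyFat.txt", header=T)
#regress tri on the other predictor variables and apply (5.1)
fitX1 = lm(tri ~ thigh + midarm, data = dat)
R2_1 = summary(fitX1)$r.squared
1/(1 - R2_1)
#compare with the VIF reported by vif()
vif(lm(bfat ~ tri + thigh + midarm, data = dat))["tri"]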
One approach to seeing which variables are correlated with each other is to remove the variable with the largest VIF and see which of the remaining variables have the largest change in their VIFs.
We will illustrate this process with the UN98 dataset from the car library. We will not use the region and GDPperCapita variables for this example.
library(car)
#remove the region and GDPperCapita variables for this example
dat = UN98
dat$region = NULL
dat$GDPperCapita=NULL
full = lm(infantMortality ~. , data = dat)
summary(full)
Call:
lm(formula = infantMortality ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-22.1193 -4.6541 0.1195 4.3203 14.7447
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 111.03151 48.85991 2.272 0.03093 *
tfr 12.04099 3.86634 3.114 0.00423 **
contraception -0.07093 0.13715 -0.517 0.60910
educationMale 5.98231 3.11064 1.923 0.06468 .
educationFemale -6.40921 2.76181 -2.321 0.02781 *
lifeMale -0.40539 1.11163 -0.365 0.71809
lifeFemale -0.75718 1.31751 -0.575 0.57008
economicActivityMale -0.38268 0.26354 -1.452 0.15760
economicActivityFemale 0.17223 0.12810 1.344 0.18960
illiteracyMale -0.54440 0.39236 -1.387 0.17624
illiteracyFemale 0.32089 0.31299 1.025 0.31403
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.396 on 28 degrees of freedom
(168 observations deleted due to missingness)
Multiple R-squared: 0.9444, Adjusted R-squared: 0.9245
F-statistic: 47.52 on 10 and 28 DF, p-value: 6.751e-15
# examine vif for all variables
(vfull = vif(full))
tfr contraception educationMale
14.319844 3.331552 23.833590
educationFemale lifeMale lifeFemale
27.561149 42.279211 74.707620
economicActivityMale economicActivityFemale illiteracyMale
2.059083 2.049210 17.476821
illiteracyFemale
24.355614
#lifeFemale has the highest, remove that variable and see the changes
fit1 = lm(infantMortality ~. -lifeFemale , data = dat)
(v1 = vif(fit1))
tfr contraception educationMale
7.650182 3.151511 23.768388
educationFemale lifeMale economicActivityMale
27.547206 6.053211 2.040815
economicActivityFemale illiteracyMale illiteracyFemale
1.989025 15.271120 20.160209
# we want to see which variables had VIFs that changed by more than 10
#find which variables match
ind = which(names(vfull) %in% names(v1))
#the differences in the vifs
vfull[ind] - v1
tfr contraception educationMale
6.66966174 0.18004132 0.06520235
educationFemale lifeMale economicActivityMale
0.01394232 36.22599962 0.01826806
economicActivityFemale illiteracyMale illiteracyFemale
0.06018455 2.20570151 4.19540516
# we see that lifeMale decreased by more than 36
#this tells us that lifeMale and lifeFemale are highly correlated with
#each other and should not both be in the model
#now see which has the next highest vif above 10
v1
tfr contraception educationMale
7.650182 3.151511 23.768388
educationFemale lifeMale economicActivityMale
27.547206 6.053211 2.040815
economicActivityFemale illiteracyMale illiteracyFemale
1.989025 15.271120 20.160209
#educationFemale is the next highest so remove it and see the changes
fit2 = lm(infantMortality ~. -lifeFemale -educationFemale , data = dat)
(v2 = vif(fit2))
tfr contraception educationMale
7.545102 3.126793 3.402244
lifeMale economicActivityMale economicActivityFemale
6.051905 1.882903 1.937600
illiteracyMale illiteracyFemale
13.085258 14.052399
#find which variables match
ind = which(names(v1) %in% names(v2))
#the differences in the vifs
v1[ind] - v2
tfr contraception educationMale
0.105080392 0.024717979 20.366144133
lifeMale economicActivityMale economicActivityFemale
0.001306497 0.157912135 0.051424911
illiteracyMale illiteracyFemale
2.185861974 6.107809635
#educationMale has a change in VIF of more than 10, therefore educationFemale
#and educationMale are highly correlated and should not be in the model
#together
#check for the next highest VIF
v2
tfr contraception educationMale
7.545102 3.126793 3.402244
lifeMale economicActivityMale economicActivityFemale
6.051905 1.882903 1.937600
illiteracyMale illiteracyFemale
13.085258 14.052399
#illiteracyFemale is the next highest so remove it and see the changes
fit3 = lm(infantMortality ~. -lifeFemale -educationFemale
-illiteracyFemale, data = dat)
(v3 = vif(fit3))
tfr contraception educationMale
7.539093 3.080353 3.399481
lifeMale economicActivityMale economicActivityFemale
4.772572 1.881863 1.674741
illiteracyMale
4.240990
#find which variables match
ind = which(names(v2) %in% names(v3))
#the differences in the vifs
v2[ind] - v3
tfr contraception educationMale
0.006008728 0.046439417 0.002762578
lifeMale economicActivityMale economicActivityFemale
1.279332690 0.001040184 0.262858728
illiteracyMale
8.844267260
#although no variables changed by more than 10, illiteracyMale was close
#with a change of over 8
#Therefore illiteracyMale and illiteracyFemale should not be in the
#model together
#no other variables had a VIF above 10 so we can stop
v3
tfr contraception educationMale
7.539093 3.080353 3.399481
lifeMale economicActivityMale economicActivityFemale
4.772572 1.881863 1.674741
illiteracyMale
4.240990
Recall from (4.31),
\begin{align*}
{\bf s}^{2}\left[{\bf b}\right] & =MSE\left({\bf X}^{\prime}{\bf X}\right)^{-1}\qquad(4.31)
\end{align*}
that we can obtain the variances of the least squares estimators from the diagonal of ${\bf s}^{2}\left[{\bf b}\right]$.
It can be shown that this variance can be expressed in terms of the VIF. Specifically, the variance of $b_{j}$ can be written as \begin{align*} s^{2}\left[b_{j}\right] & =MSE\frac{VIF_{j}}{\left(n-1\right)\widehat{Var}\left[X_{j}\right]}\qquad(5.2) \end{align*} where $\widehat{Var}\left[X_{j}\right]$ is the sample variance of $X_{j}$.
So we see that if $VIF_{j}$ is large (meaning there is multicollinearity when $X_{j}$ is included in the model), then the standard error of $b_{j}$ will be inflated.
Recall that the test statistic for testing $\beta_{j}=0$ is given in (4.33): \begin{align*} t^{*} & =\frac{b_{j}}{s\left[b_{j}\right]}\qquad(4.33) \end{align*} An inflated standard error $s\left[b_{j}\right]$ leads to a smaller $\left|t^{*}\right|$ and thus a larger p-value. This can cause us to conclude there is not enough evidence for the alternative hypothesis when in fact $\beta_{j}\ne0$.
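As a hedged numeric check of (5.2) (again using the bodyfat data from Example 5.1.2 below), the squared standard error reported by lm() should agree with the right-hand side of (5.2):
library(car)
dat = read.table("http://www.jpstats.org/Regression/data/BodyFat.txt", header=T)
full = lm(bfat ~ tri + thigh + midarm, data = dat)
mse = summary(full)$sigma^2  # the MSE
n = nrow(dat)
#squared standard error of the tri coefficient reported by lm()
summary(full)$coefficients["tri", "Std. Error"]^2
#the same quantity computed from (5.2)
mse * vif(full)["tri"] / ((n - 1) * var(dat$tri))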
As stated above, multicollinearity does not affect the confidence interval for the mean response or the prediction interval.
We will illustrate this with the bodyfat data in Example 5.1.2 below.
library(tidyverse)
library(car)
dat = read.table("http://www.jpstats.org/Regression/data/BodyFat.txt", header=T)
full = lm(bfat~tri+thigh+midarm, data=dat)
#We see that there is multicollinearity between the variables
vif(full)
tri thigh midarm
708.8429 564.3434 104.6060
#remove tri since it has the highest vif
fit23 = lm(bfat~thigh+midarm, data=dat)
vif(fit23)
thigh midarm
1.00722 1.00722
#the VIFs for both thigh and midarm are reduced substantially
#it is tempting to conclude that tri is correlated with both
#thigh and midarm and then conclude that tri should not be
#in the model with either. However, the method used in
#Example 5.1.1 may lead us to conclude that some variables are
#correlated when they actually are not.
#With just a few variables, we can just try different fits
#and then find the VIF for each
fit13 = lm(bfat~tri+midarm, data=dat)
vif(fit13)
tri midarm
1.265118 1.265118
#thus, tri and midarm can be in the model together
#let's also fit tri and thigh separately and together
fit1 = lm(bfat~tri, data=dat)
fit2 = lm(bfat~thigh, data=dat)
fit12 = lm(bfat~tri+thigh, data=dat)
# Compare the std error of the coefficients for the
# different fits
summary(fit1)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.4961046 3.3192346 -0.4507378 6.575609e-01
tri 0.8571865 0.1287808 6.6561675 3.024349e-06
summary(fit2)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -23.6344891 5.6574137 -4.177614 5.656662e-04
thigh 0.8565466 0.1100156 7.785681 3.599996e-07
summary(fit12)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -19.1742456 8.3606407 -2.2933943 0.03484327
tri 0.2223526 0.3034389 0.7327755 0.47367898
thigh 0.6594218 0.2911873 2.2645969 0.03689872
summary(full)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 117.084695 99.782403 1.173400 0.2578078
tri 4.334092 3.015511 1.437266 0.1699111
thigh -2.856848 2.582015 -1.106441 0.2848944
midarm -2.186060 1.595499 -1.370142 0.1895628
#notice the MSE (the estimate of the variance) does
#not inflate when adding a correlated variable to the
#model. Instead, the MSE decreases since we are
#explaining more variability in Y, thus getting
#a better fit
summary(fit1)$sigma^2
[1] 7.951095
summary(fit12)$sigma^2
[1] 6.467694
summary(full)$sigma^2
[1] 6.150306
#we also do not see much change in CI for the mean
#response when a correlated variable is added
xnew = data.frame(tri = 25, thigh = 50)
predict(fit1, xnew, interval="confidence")
fit lwr upr
1 19.93356 18.60632 21.2608
predict(fit12, xnew, interval="confidence")
fit lwr upr
1 19.35566 18.03848 20.67283
#for PIs
predict(fit1, xnew, interval="prediction")
fit lwr upr
1 19.93356 13.86259 26.00453
predict(fit12, xnew, interval="prediction")
fit lwr upr
1 19.35566 13.83074 24.88058