
5.1 Multicollinearity and Its Effects

"I am fast. To give you a reference point I am somewhere between a snake and a mongoose… And a panther." - Dwight Schrute
Recall from Section 4.1.1 that correlation among the predictor variables may cause problems in multiple regression analysis.

The ability to obtain a good fit, to make inferences on the mean response, or to predict the response is not affected by multicollinearity. However, inferences for the coefficients (the $\beta$s) and for the model variance ($\sigma^2$) are affected by high correlation among the predictor variables.

Another effect of multicollinearity is on the interpretation of the estimated coefficients. In multiple regression, we interpret a coefficient as the average change in $Y$ when its $X$ is increased by one unit while all other predictor variables are held constant. If $X$ is highly correlated with one or more of the other predictor variables, then it may not be feasible to think of varying $X$ while the others are held constant.
We can see evidence of multicollinearity by examining the scatterplot matrix, since this gives us a plot of each pair of predictor variables. If some pairs appear to be highly correlated (ggpairs will also display the correlation values), then multicollinearity is likely present.
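
As a minimal sketch of this check (assuming the GGally package, which provides ggpairs, is installed; the built-in mtcars data are used purely for illustration):

library(GGally)

#scatterplot matrix of a few candidate predictors; the upper
#panels display the pairwise correlations
ggpairs(mtcars[, c("disp", "hp", "wt")])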

We could also examine $R_{a}^{2}$ for models with and without certain variables. If $R_{a}^{2}$ decreases when a particular $X$ variable is added even though that variable appears to have a strong linear relationship with $Y$ in the scatterplot matrix, then this is evidence of multicollinearity.
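
A minimal sketch of this comparison (again using the built-in mtcars data purely for illustration):

#compare adjusted R^2 for models with and without a candidate
#predictor; a drop after adding a predictor that is strongly
#related to Y marginally suggests multicollinearity
summary(lm(mpg ~ wt, data = mtcars))$adj.r.squared
summary(lm(mpg ~ wt + disp, data = mtcars))$adj.r.squared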

A more convenient way to examine multicollinearity is through the use of the variance inflation factors (VIF).

Each predictor variable will have a VIF. Suppose we are interested in the VIF for $X_{1}$. We start by regressing $X_{1}$ on all the other predictor variables. Thus, we fit the model \begin{align*} X_{i1} & =\alpha_{0}+\alpha_{2}X_{i2}+\alpha_{3}X_{i3}+\cdots+\alpha_{p-1}X_{i,p-1}+\epsilon_{i} \end{align*} where the $\alpha$'s are the coefficients and $\epsilon_{i}$ is the random error term.

Now find the coefficient of multiple determination for this model which we will denote as $R_{1}^{2}$. The VIF for $X_{1}$ is then \begin{align*} VIF_{1} & =\frac{1}{1-R_{1}^{2}}. \end{align*} We can do this for any $i$th predictor variable so that the VIF for that variable is \begin{align*} VIF_{i} & =\frac{1}{1-R_{i}^{2}}\qquad(5.1) \end{align*} where $R_{i}^{2}$ is the coefficient of multiple determination for the regression fit of $X_{i}$ on all the other predictor variables.
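
To make (5.1) concrete, here is a minimal sketch that computes a VIF from its definition and compares it to the value reported by car::vif() (the built-in mtcars data are used purely for illustration):

library(car)

#VIF for disp from the definition in (5.1): regress disp on the
#other predictors and use the R^2 from that auxiliary fit
aux = lm(disp ~ hp + wt, data = mtcars)
1 / (1 - summary(aux)$r.squared)

#compare to the VIFs reported by car::vif()
vif(lm(mpg ~ disp + hp + wt, data = mtcars))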

A rule of thumb is that a VIF greater than 10 is evidence that multicollinearity is high when that variable is added to the model. Some use a cutoff of 5 instead of 10.
One approach to seeing which variables are correlated with each other is to remove a variable with a large VIF and see which variables had the largest change in their VIF.

We will illustrate this process with the UN98 dataset from the car library. We will not use the region and GDPperCapita variables for this example.

library(car)

#remove the region and GDPperCapita variables for this example
dat = UN98
dat$region = NULL
dat$GDPperCapita=NULL

full = lm(infantMortality ~. , data = dat)
summary(full)

Call:
lm(formula = infantMortality ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-22.1193  -4.6541   0.1195   4.3203  14.7447 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)   
(Intercept)            111.03151   48.85991   2.272  0.03093 * 
tfr                     12.04099    3.86634   3.114  0.00423 **
contraception           -0.07093    0.13715  -0.517  0.60910   
educationMale            5.98231    3.11064   1.923  0.06468 . 
educationFemale         -6.40921    2.76181  -2.321  0.02781 * 
lifeMale                -0.40539    1.11163  -0.365  0.71809   
lifeFemale              -0.75718    1.31751  -0.575  0.57008   
economicActivityMale    -0.38268    0.26354  -1.452  0.15760   
economicActivityFemale   0.17223    0.12810   1.344  0.18960   
illiteracyMale          -0.54440    0.39236  -1.387  0.17624   
illiteracyFemale         0.32089    0.31299   1.025  0.31403   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.396 on 28 degrees of freedom
  (168 observations deleted due to missingness)
Multiple R-squared:  0.9444,	Adjusted R-squared:  0.9245 
F-statistic: 47.52 on 10 and 28 DF,  p-value: 6.751e-15


# examine the VIF for all variables
(vfull = vif(full))

                   tfr          contraception          educationMale 
             14.319844               3.331552              23.833590 
       educationFemale               lifeMale             lifeFemale 
             27.561149              42.279211              74.707620 
  economicActivityMale economicActivityFemale         illiteracyMale 
              2.059083               2.049210              17.476821 
      illiteracyFemale 
             24.355614 


#lifeFemale has the highest, remove that variable and see the changes
fit1 = lm(infantMortality ~. -lifeFemale , data = dat)
(v1 = vif(fit1))

                   tfr          contraception          educationMale 
              7.650182               3.151511              23.768388 
       educationFemale               lifeMale   economicActivityMale 
             27.547206               6.053211               2.040815 
economicActivityFemale         illiteracyMale       illiteracyFemale 
              1.989025              15.271120              20.160209


# we want to see which variables had VIFs that changed by more than 10
#find which variables match
ind = which(names(vfull) %in% names(v1))
#the differences in the vifs
vfull[ind] - v1

                   tfr          contraception          educationMale 
            6.66966174             0.18004132             0.06520235 
       educationFemale               lifeMale   economicActivityMale 
            0.01394232            36.22599962             0.01826806 
economicActivityFemale         illiteracyMale       illiteracyFemale 
            0.06018455             2.20570151             4.19540516 


# we see that lifeMale decreased by more than 36
#this tells us that lifeMale and lifeFemale are highly correlated with 
#each other and should not both be in the model
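#we could confirm this directly with the sample correlation
#(using complete cases since the data contain missing values)
cor(dat$lifeMale, dat$lifeFemale, use = "complete.obs")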

#now see which has the next highest vif above 10
v1

                   tfr          contraception          educationMale 
              7.650182               3.151511              23.768388 
       educationFemale               lifeMale   economicActivityMale 
             27.547206               6.053211               2.040815 
economicActivityFemale         illiteracyMale       illiteracyFemale 
              1.989025              15.271120              20.160209


#educationFemale is the next highest so remove it and see the changes
fit2 = lm(infantMortality ~. -lifeFemale  -educationFemale , data = dat)
(v2 = vif(fit2))

                   tfr          contraception          educationMale 
              7.545102               3.126793               3.402244 
              lifeMale   economicActivityMale economicActivityFemale 
              6.051905               1.882903               1.937600 
        illiteracyMale       illiteracyFemale 
             13.085258              14.052399 


#find which variables match
ind = which(names(v1) %in% names(v2))
#the differences in the vifs
v1[ind] - v2

                   tfr          contraception          educationMale 
           0.105080392            0.024717979           20.366144133 
              lifeMale   economicActivityMale economicActivityFemale 
           0.001306497            0.157912135            0.051424911 
        illiteracyMale       illiteracyFemale 
           2.185861974            6.107809635


#educationMale has a change in VIF of more than 10, therefore educationFemale
#and educationMale are highly correlated and should not be in the model
#together

#check for the next highest VIF
v2

                   tfr          contraception          educationMale 
              7.545102               3.126793               3.402244 
              lifeMale   economicActivityMale economicActivityFemale 
              6.051905               1.882903               1.937600 
        illiteracyMale       illiteracyFemale 
             13.085258              14.052399 


#illiteracyFemale is the next highest so remove it and see the changes
fit3 = lm(infantMortality ~. -lifeFemale  -educationFemale 
          -illiteracyFemale, data = dat)
(v3 = vif(fit3))

                   tfr          contraception          educationMale 
              7.539093               3.080353               3.399481 
              lifeMale   economicActivityMale economicActivityFemale 
              4.772572               1.881863               1.674741 
        illiteracyMale 
              4.240990


#find which variables match
ind = which(names(v2) %in% names(v3))
#the differences in the vifs
v2[ind] - v3

                   tfr          contraception          educationMale 
           0.006008728            0.046439417            0.002762578 
              lifeMale   economicActivityMale economicActivityFemale 
           1.279332690            0.001040184            0.262858728 
        illiteracyMale 
           8.844267260 


#although no variables changed by more than 10, illiteracyMale was close
#with a change of over 8
#Therefore illiteracyMale and illiteracyFemale should not be in the 
#model together

#no other variables had a VIF above 10 so we can stop
v3

                   tfr          contraception          educationMale 
              7.539093               3.080353               3.399481 
              lifeMale   economicActivityMale economicActivityFemale 
              4.772572               1.881863               1.674741 
        illiteracyMale 
              4.240990


Recall from (4.31)
\begin{align*} {\bf s}^{2}\left[{\bf b}\right] & =MSE\left({\bf X}^{\prime}{\bf X}\right)^{-1}\qquad(4.31) \end{align*}
that we can obtain the estimated variances of the least squares estimators from the diagonal elements of ${\bf s}^{2}\left[{\bf b}\right]$.

It can be shown that this variance can be written in terms of the VIF. The variance of $b_{j}$ can be expressed as \begin{align*} s^{2}\left[b_{j}\right] & =MSE\frac{VIF_{j}}{\left(n-1\right)\widehat{Var}\left[X_{j}\right]}\qquad(5.2) \end{align*} where $\widehat{Var}\left[X_{j}\right]$ is the sample variance of $X_{j}$.
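
As a quick numerical check on (5.2), here is a minimal sketch that verifies the identity for one coefficient (the built-in mtcars data are used purely for illustration):

library(car)

fit = lm(mpg ~ disp + hp + wt, data = mtcars)
n = nrow(mtcars)
MSE = summary(fit)$sigma^2

#right-hand side of (5.2) for the coefficient of wt
MSE * vif(fit)["wt"] / ((n - 1) * var(mtcars$wt))

#squared standard error of b_wt reported by summary()
summary(fit)$coefficients["wt", "Std. Error"]^2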

So we see that if $VIF_{j}$ is large (meaning there is multicollinearity when $X_{j}$ is included in the model), then the standard error of $b_{j}$ is inflated.

Recall that the test statistic for testing $\beta_{j}=0$ in (4.33) is \begin{align*} t^{*} & =\frac{b_{j}}{s\left[b_{j}\right]}.\qquad(4.33) \end{align*} An inflated standard error $s\left[b_{j}\right]$ will lead to a smaller $\left|t^{*}\right|$ and thus a larger p-value. This may cause us to conclude there is not enough evidence for the alternative hypothesis when in fact $\beta_{j}\ne0$.
As stated above, multicollinearity does not affect the confidence interval for the mean response or the prediction interval.

We will illustrate this with the bodyfat data in Example 5.1.2 below.
library(tidyverse)
library(car)

dat = read.table("http://www.jpstats.org/Regression/data/BodyFat.txt", header=T)


full = lm(bfat~tri+thigh+midarm, data=dat)

#We see that there is multicollinearity between the variables
vif(full)

     tri    thigh   midarm 
708.8429 564.3434 104.6060


#remove tri since it has the highest vif
fit23 = lm(bfat~thigh+midarm, data=dat)
vif(fit23)

  thigh  midarm 
1.00722 1.00722


#the VIFs for both thigh and midarm drop substantially
#it is tempting to conclude that tri is correlated with both
#thigh and midarm and then conclude that tri should not be
#in the model with either. However, the method used in 
#Example 5.1.1 may lead us to conclude that some variables are 
#correlated when they actually are not.
#With just a few variables, we can simply try different fits
#and then find the VIF for each

fit13 = lm(bfat~tri+midarm, data=dat)
vif(fit13)

     tri   midarm 
1.265118 1.265118

#thus, tri and midarm can be in the model together

#let's also fit tri and midarm separately and together
fit1 = lm(bfat~tri, data=dat)
fit2 = lm(bfat~thigh, data=dat)
fit12 = lm(bfat~tri+thigh, data=dat)

# Compare the std error of the coefficients for the
# different fits
summary(fit1)$coefficients

              Estimate Std. Error    t value     Pr(>|t|)
(Intercept) -1.4961046  3.3192346 -0.4507378 6.575609e-01
tri          0.8571865  0.1287808  6.6561675 3.024349e-06

summary(fit2)$coefficients

               Estimate Std. Error   t value     Pr(>|t|)
(Intercept) -23.6344891  5.6574137 -4.177614 5.656662e-04
thigh         0.8565466  0.1100156  7.785681 3.599996e-07

summary(fit12)$coefficients

               Estimate Std. Error    t value   Pr(>|t|)
(Intercept) -19.1742456  8.3606407 -2.2933943 0.03484327
tri           0.2223526  0.3034389  0.7327755 0.47367898
thigh         0.6594218  0.2911873  2.2645969 0.03689872

summary(full)$coefficients

              Estimate Std. Error   t value  Pr(>|t|)
(Intercept) 117.084695  99.782403  1.173400 0.2578078
tri           4.334092   3.015511  1.437266 0.1699111
thigh        -2.856848   2.582015 -1.106441 0.2848944
midarm       -2.186060   1.595499 -1.370142 0.1895628



#notice the MSE (the estimate of the model variance) does
#not inflate when a correlated variable is added to the
#model. Instead, the MSE decreases since we are
#explaining more variability in Y, thus getting
#a better fit
summary(fit1)$sigma^2

[1] 7.951095

summary(fit12)$sigma^2

[1] 6.467694

summary(full)$sigma^2

[1] 6.150306



#we also do not see much change in the CI for the mean
#response when a correlated variable is added
xnew = data.frame(tri = 25, thigh = 50)
predict(fit1, xnew, interval="confidence")

       fit      lwr     upr
1 19.93356 18.60632 21.2608

predict(fit12, xnew, interval="confidence")

       fit      lwr      upr
1 19.35566 18.03848 20.67283



#for PIs
predict(fit1, xnew, interval="prediction")

       fit      lwr      upr
1 19.93356 13.86259 26.00453

predict(fit12, xnew, interval="prediction")

       fit      lwr      upr
1 19.35566 13.83074 24.88058