
5.4 Residual Diagnostics

"This is one time where television really fails to capture the true excitement of a large squirrel predicting the weather."
- Phil Connors (Groundhog Day)
In Chapter 3, we discussed checking the assumptions of the simple linear regression model with normal errors (2.1):
\begin{align*} Y_i=&\beta_0+\beta_1X_i+\varepsilon_i\\ \varepsilon_i\overset{iid}{\sim}& N\left(0,\sigma^2\right)\qquad\qquad\qquad\qquad(2.1) \end{align*}

For the multiple regression model with normal errors (4.18)
\begin{align*} {\textbf{Y}}= & {\textbf{X}}{\boldsymbol{\beta}}+{\boldsymbol{\varepsilon}}\\ & \boldsymbol{\varepsilon} \sim N_{n}\left({\bf 0},\sigma^{2}{\bf I}\right) \qquad (4.18) \end{align*}
we will also need to check the assumptions.

We will use the residuals here as we did in Chapter 3 to examine most of these assumptions.
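
To make the checks below concrete, here is a minimal sketch in Python with statsmodels of fitting model (4.18) and extracting the residuals and fitted values the diagnostics use. The simulated data and variable names are illustrative only, not the chapter's example.

```python
import numpy as np
import statsmodels.api as sm

# Simulated illustration data (hypothetical; stands in for the chapter's example)
rng = np.random.default_rng(42)
n = 100
X = rng.uniform(0, 10, size=(n, 2))              # two predictor variables
y = 2 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 1, n)

X_design = sm.add_constant(X)                    # design matrix with intercept column
fit = sm.OLS(y, X_design).fit()                  # fit Y = X beta + eps by least squares

resid = fit.resid                                # residuals e_i = y_i - yhat_i
fitted = fit.fittedvalues                        # fitted values yhat_i
print(fit.summary())
```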

Let's begin by listing all of the assumptions for model (4.18):
\begin{align*} {\textbf{Y}}= & {\textbf{X}}{\boldsymbol{\beta}}+{\boldsymbol{\varepsilon}}\\ & \boldsymbol{\varepsilon} \sim N_{n}\left({\bf 0},\sigma^{2}{\bf I}\right) \qquad (4.18) \end{align*}
  1. Linearity: There is a linear relationship between $Y$ and each of the predictor variables.
  2. Normality of the residuals: In the normal error model, the error terms should be approximately normally distributed.
  3. Independence of the residuals: We assume the error terms are independent of each other.
  4. Constant variance: The variance of the error terms should be constant throughout the range of the predictor variables.
  5. Uncorrelated predictor variables: There is no multicollinearity present between the predictor variables.
For the linearity assumption, in simple linear regression, we examined a scatterplot of $X$ and $Y$. If there was an obvious nonlinear pattern, then a transformation on $X$ would be needed.

There are times when a scatterplot may not be sufficient for identifying nonlinear patterns, for instance when the slope is steep, as in Example 3.2.1.

A plot of the residuals vs the predictor variable can help determine if there is a nonlinear pattern. The same can be done in multiple regression. We can also use the added-variable plots discussed in Section 5.2.
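
Here is a sketch of both plots, again with statsmodels and simulated data standing in for a real example; `plot_partregress_grid` produces the added-variable plots.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
X = rng.uniform(0, 10, size=(n, 2))
y = 2 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 1, n)
fit = sm.OLS(y, sm.add_constant(X)).fit()

# Residuals vs each predictor: look for curvature suggesting a nonlinear pattern
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for j, ax in enumerate(axes):
    ax.scatter(X[:, j], fit.resid)
    ax.axhline(0, linestyle="--")
    ax.set_xlabel(f"X{j + 1}")
    ax.set_ylabel("residual")

# Added-variable (partial regression) plots for every predictor
sm.graphics.plot_partregress_grid(fit)
plt.show()
```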
The assumption of uncorrelated predictor variables is a difficult one to satisfy. In a designed experiment, we can usually fix the different levels of the predictor variables and obtain the values of $Y$ from the experiment. A well-designed experiment will set these levels to different combinations throughout the ranges of all the $X$'s.

In most applications, the data are not from a designed experiment but from an observational study. Thus, there is no control over the values of the predictor variables, and we expect at least some correlation between them.

We discussed the problem of high correlation between the $X$'s (multicollinearity) in Section 5.1.
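
One common way to quantify this correlation is with variance inflation factors (VIFs). The sketch below uses statsmodels' `variance_inflation_factor` on deliberately correlated simulated predictors; the data and cutoff mentioned in the comment are illustrative, not prescriptive.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
n = 100
x1 = rng.uniform(0, 10, n)
x2 = 0.9 * x1 + rng.normal(0, 1, n)        # deliberately correlated with x1
X_design = sm.add_constant(np.column_stack([x1, x2]))

# VIF for each predictor column (skip the intercept column 0);
# values much larger than 10 are a common rule of thumb for concern
for j, name in enumerate(["x1", "x2"], start=1):
    print(name, variance_inflation_factor(X_design, j))
```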

In regression analysis, multicollinearity is checked first, as part of the model selection process. We will discuss model selection in Chapter 7.

After a model is selected with some subset of predictor variables, a check for outliers should be conducted next. This was discussed in Section 5.3.

After outliers have been checked, and any that can reasonably be removed have been removed, we check the remaining assumptions.
We can check normality as we did in Section 3.6 by plotting the residuals in a QQ plot or by using the Shapiro-Wilk test.

It is good practice to do both so that you can get a formal test and a visualization.
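
Here is a sketch of both checks, using statsmodels' `qqplot` and SciPy's `shapiro` on the residuals of a simulated fit (the data are illustrative only).

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(3)
n = 100
X = rng.uniform(0, 10, size=(n, 2))
y = 2 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 1, n)
fit = sm.OLS(y, sm.add_constant(X)).fit()

# QQ plot of the residuals: points near the reference line suggest normality
sm.qqplot(fit.resid, line="q")
plt.show()

# Shapiro-Wilk test: a small p-value is evidence against normal errors
stat, pval = stats.shapiro(fit.resid)
print(f"Shapiro-Wilk W = {stat:.3f}, p-value = {pval:.3f}")
```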

If the data are not normally distributed, then a transformation on $Y$, such as a Box-Cox transformation (see Section 3.6.3), may be helpful.
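
As an illustration, SciPy's `boxcox` estimates the transformation parameter $\lambda$ for a positive response; the right-skewed simulated response below is only for demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 100
x = rng.uniform(0, 10, n)
y = np.exp(0.5 + 0.2 * x + rng.normal(0, 0.3, n))   # positive, right-skewed response

# Box-Cox chooses the power lambda that makes y most nearly normal;
# lambda near 0 corresponds to the log transformation
y_transformed, lam = stats.boxcox(y)
print(f"estimated lambda = {lam:.2f}")
```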

One downside to transformations on $Y$ is that the resulting confidence intervals can be difficult to interpret.

For example, suppose we have a confidence interval of $(.35, .78)$ for the mean response where $Y$ was log-transformed. We could back-transform by applying the inverse of the log function, which is the exponential function, giving $(\exp(.35), \exp(.78))=(1.419, 2.181)$. However, this is not a confidence interval for the mean of $Y$; it is instead a confidence interval for the median of $Y$.

Other back-transformations may not have this interpretation. If the transformation is not monotonic, then the back-transformed confidence interval may not have the desired coverage.

Thus, a transformation may not be the best option if the goal is to obtain confidence and prediction intervals for the response.
In Section 3.5 we discussed checking for independence among the residuals.

We can visualize the correlation by using a sequence plot (see Section 3.5.1).

A formal test for significant autocorrelation is the Breusch-Godfrey test, as was used in simple linear regression.

If significant autocorrelation is present, then a time series model is necessary (beyond the scope of this course).
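
Here is a sketch of both the sequence plot and the Breusch-Godfrey test, using statsmodels' `acorr_breusch_godfrey` on a simulated fit (the data and the lag choice are illustrative).

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

rng = np.random.default_rng(11)
n = 100
X = rng.uniform(0, 10, size=(n, 2))
y = 2 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 1, n)
fit = sm.OLS(y, sm.add_constant(X)).fit()

# Sequence plot: residuals in the order the observations were collected
plt.plot(fit.resid, marker="o")
plt.axhline(0, linestyle="--")
plt.xlabel("observation order")
plt.ylabel("residual")
plt.show()

# Breusch-Godfrey test: a small p-value suggests autocorrelated errors
lm_stat, lm_pval, f_stat, f_pval = acorr_breusch_godfrey(fit, nlags=1)
print(f"BG LM statistic = {lm_stat:.3f}, p-value = {lm_pval:.3f}")
```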
We can visualize the variance of the residuals as we did in Section 3.3.

We can plot the residuals, the absolute residuals, or the squared residuals versus the fitted values.

We can also use the Breusch-Pagan test as we did in simple regression to formally test for nonconstant variance.
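
Here is a sketch of these checks with statsmodels: residuals and absolute residuals plotted against the fitted values, followed by `het_breuschpagan` (the simulated data are illustrative only).

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(13)
n = 100
X = rng.uniform(0, 10, size=(n, 2))
y = 2 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 1, n)
X_design = sm.add_constant(X)
fit = sm.OLS(y, X_design).fit()

# Residuals and absolute residuals vs fitted values:
# a fan shape suggests nonconstant variance
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(fit.fittedvalues, fit.resid)
axes[0].set_xlabel("fitted value")
axes[0].set_ylabel("residual")
axes[1].scatter(fit.fittedvalues, np.abs(fit.resid))
axes[1].set_xlabel("fitted value")
axes[1].set_ylabel("|residual|")
plt.show()

# Breusch-Pagan test: a small p-value is evidence of nonconstant variance
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(fit.resid, X_design)
print(f"BP LM statistic = {lm_stat:.3f}, p-value = {lm_pval:.3f}")
```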