
3.1 Residual Diagnostics

"…the statistician knows…that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world." - George Box
In model (2.1),
\begin{align*} Y_i=&\beta_0+\beta_1X_i+\varepsilon_i\\ \varepsilon_i\overset{iid}{\sim}& N\left(0,\sigma^2\right)\qquad\qquad\qquad\qquad(2.1) \end{align*}
we make a number of assumptions:
  1. We assume a linear relationship between $X$ and $Y$.
  2. In Section 2.1.1, we assumed that the error terms $\varepsilon_i$ are independent.
  3. Also in Section 2.1.1, we assumed that the error terms $\varepsilon_i$ have constant variance.
  4. The normal errors model assumes the error terms $\varepsilon_i$ are normally distributed (Section 2.1.2).
After fitting the model, we will need to check these assumptions.
We check the assumptions of the model by examining the residuals: \begin{align*} e_{i} & =Y_{i}-\hat{Y}_{i}\qquad\qquad\qquad(1.19) \end{align*} We do this because the assumptions, with the exception of the linearity assumption, concern the error terms $\varepsilon_i$, which we cannot observe directly. We can think of $e_i$ as an observed value of $\varepsilon_i$.
We presented some properties of the residuals in Section 1.5.3, which we present again here: \begin{align*} \sum e_{i} & =0 & \qquad\qquad\qquad(1.20)\\ \sum X_{i}e_{i} & =0 & \qquad\qquad\qquad(1.21)\\ \sum\hat{Y}_{i}e_{i} & =0 & \qquad\qquad\qquad(1.22)\\ \sum Y_{i} & =\sum\hat{Y}_{i} & \qquad\qquad\qquad(1.23) \end{align*} Clearly, from (1.20), the mean of the residuals is $$ \bar{e}=0\qquad\qquad\qquad(3.1) $$ The variance of all $n$ residuals, $e_1,\ldots,e_n$, is \begin{align*} \frac{\sum\left(e_{i}-\bar{e}\right)^{2}}{n-2} & =\frac{\sum e_{i}^{2}}{n-2}\\ & =\frac{SSE}{n-2}\\ & =MSE\\ & =s^{2}\qquad\qquad\qquad(3.2) \end{align*}
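To make these properties concrete, here is a minimal Python sketch on simulated data (the sample size, coefficient values, and variable names below are our own choices, not from the text). It fits the least squares line, then verifies (1.20) through (1.23) and computes (3.2):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from model (2.1): Y = beta0 + beta1*X + eps
n = 25
X = rng.uniform(0, 10, size=n)
eps = rng.normal(0, 2, size=n)      # iid N(0, sigma^2) errors
Y = 3 + 1.5 * X + eps

# Least squares estimates of the intercept and slope
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

Y_hat = b0 + b1 * X                 # fitted values
e = Y - Y_hat                       # residuals, equation (1.19)

print(np.sum(e))                    # (1.20): sum of residuals is 0
print(np.sum(X * e))                # (1.21): sum of X_i * e_i is 0
print(np.sum(Y_hat * e))            # (1.22): sum of Yhat_i * e_i is 0
print(np.sum(Y) - np.sum(Y_hat))    # (1.23): sums of Y_i and Yhat_i agree
print(np.sum(e ** 2) / (n - 2))     # (3.2): MSE, the variance of the residuals
```

The first four printed values are zero up to floating point rounding, as (1.20) through (1.23) require.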
It is important to note that, although the random errors $\varepsilon_i$ are independent in model (2.1), the residuals $e_i$ are not independent.

This is because each $e_i=Y_i - \hat{Y}_i$ is a function of the same fitted regression line. In particular, the residuals must satisfy the constraints (1.20) and (1.21), so once any $n-2$ of them are known, the remaining two are determined.
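To see this dependence concretely, we can use the hat matrix $H=X\left(X^{\top}X\right)^{-1}X^{\top}$, a standard construction that this text has not yet introduced: a known matrix result gives $\operatorname{Var}\left(e\right)=\sigma^{2}\left(I-H\right)$, and the off-diagonal entries of $I-H$ are generally nonzero. A minimal sketch (the predictor values are our own invention):

```python
import numpy as np

# Design matrix for a tiny example with 5 observations:
# a column of ones (intercept) and one predictor column.
X = np.column_stack([np.ones(5), np.array([1.0, 2.0, 3.0, 5.0, 8.0])])

# Hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T

# Var(e) = sigma^2 (I - H).  Each off-diagonal entry of I - H is a
# scaled covariance between two residuals, and none of them is zero,
# so the residuals are correlated (hence not independent).
print(np.round(np.eye(5) - H, 3))
```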
It will be helpful to studentize each residual. As always, we do this by subtracting off the mean, $\bar{e}$, and dividing by the standard error of $e_{i}$.

We know by (3.1) that $\bar{e}=0$.

In (3.2), we said that the sample variance of the $e_{i}$'s is $MSE$. For each individual $e_{i}$, however, the standard error is not quite $\sqrt{MSE}$; the actual standard error depends on the predictor variable(s). We will discuss this more in Chapter 4.
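For readers who want the exact form now: under model (2.1), the standard result (presumably what Chapter 4 develops; we state it here only for reference) is $$ \operatorname{Var}\left(e_{i}\right)=\sigma^{2}\left(1-h_{ii}\right) $$ where $h_{ii}$ is the $i$th diagonal entry of the hat matrix $H$ from the sketch above, so the estimated standard error of $e_{i}$ is $\sqrt{MSE\left(1-h_{ii}\right)}$ rather than $\sqrt{MSE}$.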

For now, we will use the approximation $\sqrt{MSE}$ and calculate \begin{align*} e_{i}^{*} & =\frac{e_{i}-\bar{e}}{\sqrt{MSE}}\\ & =\frac{e_{i}}{\sqrt{MSE}}\qquad\qquad\qquad(3.3) \end{align*} We call $e_{i}^{*}$ the semistudentized residual since the standard error is an approximation.
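Equation (3.3) is simple enough to wrap in a small helper. This is our own sketch (the function name is ours), assuming the residuals come from a simple linear regression fit so that the error degrees of freedom are $n-2$:

```python
import numpy as np

def semistudentized(e):
    """Semistudentized residuals, equation (3.3).

    e: array of residuals from a simple linear regression fit,
       so MSE uses n - 2 degrees of freedom.
    """
    n = len(e)
    mse = np.sum(e ** 2) / (n - 2)   # (3.2)
    return e / np.sqrt(mse)          # (3.3): e_i / sqrt(MSE)
```

For example, `semistudentized(e)` applied to the residuals `e` from the first sketch scales them to have sample variance (approximately) one.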
In the rest of this chapter, we will use the residuals to check the assumptions listed above.

In Section 3.2, we will check the linearity assumption and discuss data transformations.

In Section 3.3, we will check for non-constant variance.

In Section 3.4, we will check for outliers.

In Section 3.5, we will check for correlation between residuals.

In Section 3.6, we will check for normality in the residuals.