
4.5 Least Squares and Inferences Using Matrices

"Statistics is the grammar of science." - Karl Pearson
For model (4.18)
\begin{align*} {\textbf{Y}}= & {\textbf{X}}{\boldsymbol{\beta}}+{\boldsymbol{\varepsilon}}\\ & \boldsymbol{\varepsilon} \sim N_{n}\left({\bf 0},\sigma^{2}{\bf I}\right) \qquad (4.18) \end{align*}
we want to minimize the sum of the squared distances between the observed $Y_i$ and the fitted $\hat{Y}_i$.

We will express the sum of the squared distances in (4.3)
$$ Q=\sum \left(Y_i-\left(b_0 + b_1 X_{i1} + b_2 X_{i2}+\cdots +b_{p-1}X_{i,p-1}\right)\right)^2 \qquad (4.3) $$
in matrix terms as \begin{align*} Q & =\left({\bf Y}-{\bf X}{\bf b}\right)^{\prime}\left({\bf Y}-{\bf X}{\bf b}\right)\qquad(4.19) \end{align*} where ${\bf Y}$ is the response vector, ${\bf X}$ is the design matrix, and ${\bf b}$ is the vector of estimators \begin{align*} {\bf b} & =\left[\begin{array}{c} b_{0}\\ b_{1}\\ \vdots\\ b_{p-1} \end{array}\right] \end{align*}
We will minimize $Q$ in (4.19) by taking the derivative with respect to ${\bf b}$ and setting it equal to the vector of zeros. The derivative is computed using the properties listed in Section 4.3.6 together with the chain rule: \begin{align*} \frac{\partial\left({\bf Y}-{\bf X}{\bf b}\right)^{\prime}\left({\bf Y}-{\bf X}{\bf b}\right)}{\partial{\bf b}} & =-2{\bf X}^{\prime}\left({\bf Y}-{\bf X}{\bf b}\right) \end{align*} Setting this derivative equal to the vector of zeros gives us \begin{align*} -2{\bf X}^{\prime}\left({\bf Y}-{\bf X}{\bf b}\right)={\bf 0} & \Longrightarrow{\bf X}^{\prime}\left({\bf Y}-{\bf X}{\bf b}\right)={\bf 0} \end{align*}
We can distribute ${\bf X}^{\prime}$ through the parentheses \begin{align*} {\bf X}^{\prime}\left({\bf Y}-{\bf X}{\bf b}\right)={\bf 0} & \Longrightarrow{\bf X}^{\prime}{\bf Y}-{\bf X}^{\prime}{\bf X}{\bf b}={\bf 0}\\ & \Longrightarrow{\bf X}^{\prime}{\bf X}{\bf b}={\bf X}^{\prime}{\bf Y}\qquad\qquad(4.20) \end{align*} Equation (4.20) represents the normal equations in matrix format.
Provided ${\bf X}^{\prime}{\bf X}$ is invertible (which holds when the columns of ${\bf X}$ are linearly independent), solving the normal equations for ${\bf b}$ gives us the least squares estimators: \begin{align*} {\bf X}^{\prime}{\bf X}{\bf b}={\bf X}^{\prime}{\bf Y} & \Longrightarrow\left({\bf X}^{\prime}{\bf X}\right)^{-1}{\bf X}^{\prime}{\bf X}{\bf b}=\left({\bf X}^{\prime}{\bf X}\right)^{-1}{\bf X}^{\prime}{\bf Y}\\ & \Longrightarrow{\bf b}=\left({\bf X}^{\prime}{\bf X}\right)^{-1}{\bf X}^{\prime}{\bf Y}\qquad\qquad\qquad\qquad(4.21) \end{align*}
The fitted values are \begin{align*} \hat{{\bf Y}} & ={\bf X}{\bf b}\qquad(4.22) \end{align*} Substituting (4.21) for ${\bf b}$ gives us \begin{align*} \hat{{\bf Y}}={\bf X}\left({\bf X}^{\prime}{\bf X}\right)^{-1}{\bf X}^{\prime}{\bf Y}\qquad(4.23) \end{align*} The $n\times n$ matrix ${\bf X}\left({\bf X}^{\prime}{\bf X}\right)^{-1}{\bf X}^{\prime}$ pops up a number of times in multiple regression analysis. We call this matrix the hat matrix and denote it as \begin{align*} {\bf H} & ={\bf X}\left({\bf X}^{\prime}{\bf X}\right)^{-1}{\bf X}^{\prime}\qquad(4.24) \end{align*} Thus, the fitted values can be expressed as \begin{align*} \hat{{\bf Y}} & ={\bf H}{\bf Y}\qquad(4.25) \end{align*}
The residuals can be expressed in matrix terms as \begin{align*} {\bf e} & ={\bf Y}-\hat{{\bf Y}}\\ & ={\bf Y}-{\bf X}{\bf b}\qquad(4.26) \end{align*} Using (4.25), the residuals can be expressed in terms of the hat matrix as well \begin{align*} {\bf e} & ={\bf Y}-{\bf H}{\bf Y}\\ & =\left({\bf I}-{\bf H}\right){\bf Y}\qquad(4.27) \end{align*}
We have seen in (1.17)
$$ s^2 = \frac{SSE}{n-2}\qquad\qquad\qquad(1.17) $$
that $s^{2}$ is an estimate of the variance $\sigma^{2}$. Recall that we call $s^{2}$ the MSE, which is just the SSE divided by its degrees of freedom ($n-2$ in simple linear regression).

In matrix terms, we can express SSE as \begin{align*} SSE & ={\bf e}^{\prime}{\bf e}\\ & ={\bf Y}^{\prime}\left({\bf I}-{\bf H}\right)^{\prime}\left({\bf I}-{\bf H}\right){\bf Y}\\ & ={\bf Y}^{\prime}\left({\bf I}-{\bf H}\right){\bf Y}\qquad\qquad\qquad(4.28) \end{align*} The term $\left({\bf I}-{\bf H}\right)^{\prime}\left({\bf I}-{\bf H}\right)$ simplifies to $\left({\bf I}-{\bf H}\right)$ since $\left({\bf I}-{\bf H}\right)$ is symmetric and idempotent.

The MSE is \begin{align*} MSE & =\frac{SSE}{n-p} \end{align*} Here the degrees of freedom are now $n-p$ since there are $p$ parameters $\left(\beta_{0},\beta_{1},\ldots,\beta_{p-1}\right)$ that need to be estimated first in order to get $SSE$. As before, MSE is an estimator for $\sigma^{2}$.
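To see how (4.24) through (4.28) play out numerically, here is a minimal sketch using a small simulated dataset (the names Xs, Ys, H, and fit.s are illustrative only, not part of the bodyfat example below); the matrix results should agree with what lm() reports.

#a small sketch of (4.24)-(4.28) on simulated data
set.seed(1)
n = 20; p = 3
Xs = cbind(1, matrix(rnorm(n*(p-1)), n, p-1))  #design matrix with intercept column
Ys = drop(Xs %*% c(2, 1, -1) + rnorm(n))       #response vector

#hat matrix (4.24)
H = Xs %*% solve(t(Xs) %*% Xs) %*% t(Xs)

#fitted values (4.25) and residuals (4.27)
Yhat = H %*% Ys
e = (diag(n) - H) %*% Ys

#(I - H) is idempotent: this difference should be essentially zero
max(abs((diag(n) - H) %*% (diag(n) - H) - (diag(n) - H)))

#SSE (4.28) and MSE
SSE = drop(t(Ys) %*% (diag(n) - H) %*% Ys)
MSE = SSE / (n - p)

#compare with lm
fit.s = lm(Ys ~ Xs[,-1])
max(abs(fitted(fit.s) - Yhat))  #should be essentially zero
sum(resid(fit.s)^2)             #should match SSE
sigma(fit.s)^2                  #should match MSE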
As they were in the simple regression case, the least squares estimators ${\bf b}$ are unbiased \begin{align*} {\bf E}\left[{\bf b}\right] & =\boldsymbol{\beta}\qquad(4.29) \end{align*} The $p\times p$ covariance matrix is \begin{align*} {\bf Cov}\left[{\bf b}\right] & =\sigma^{2}\left({\bf X}^{\prime}{\bf X}\right)^{-1}\qquad(4.30) \end{align*} We can estimate the covariance matrix by using MSE as an estimate for $\sigma^{2}$. We will denote this estimated covariance matrix as \begin{align*} {\bf s}^{2}\left[{\bf b}\right] & =MSE\left({\bf X}^{\prime}{\bf X}\right)^{-1}\qquad(4.31) \end{align*} From ${\bf s}^{2}\left[{ \bf b}\right]$, we can obtain $s^{2}\left[b_{0}\right]$, $s^{2}\left[b_{1}\right]$, or the variance for any other coefficient to use in the confidence intervals and hypothesis tests.
A $\left(1-\alpha\right)100\%$ confidence interval for the parameter $\beta_{k}$ is \begin{align*} b_{k} & \pm t_{\alpha/2,n-p}s\left[b_{k}\right]\qquad(4.32) \end{align*}
We can test the hypothesis \begin{align*} H_{0}: & \beta_{k}=0\\ H_{a}: & \beta_{k}\ne0 \end{align*} with the test statistic \begin{align*} t^{*} & =\frac{b_{k}}{s\left[b_{k}\right]}\qquad(4.33) \end{align*} where, under $H_{0}$, $t^{*}\sim t\left(n-p\right)$.
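Continuing the same kind of simulated example, the sketch below computes the estimated covariance matrix (4.31), the confidence intervals (4.32), and the test statistics (4.33) directly from the matrix formulas (again, Xs, Ys, s2b, and the other names are illustrative only); confint() and summary() applied to the corresponding lm fit should give matching results.

#a small sketch of (4.31)-(4.33) on simulated data
set.seed(1)
n = 20; p = 3
Xs = cbind(1, matrix(rnorm(n*(p-1)), n, p-1))
Ys = drop(Xs %*% c(2, 1, -1) + rnorm(n))

#least squares estimates (4.21), SSE, and MSE
b = solve(t(Xs) %*% Xs) %*% t(Xs) %*% Ys
SSE = drop(t(Ys - Xs %*% b) %*% (Ys - Xs %*% b))
MSE = SSE / (n - p)

#estimated covariance matrix (4.31) and standard errors s[b_k]
s2b = MSE * solve(t(Xs) %*% Xs)
sb = sqrt(diag(s2b))

#95% confidence intervals (4.32); compare with confint(lm(Ys ~ Xs[,-1]))
cbind(b - qt(0.975, n - p) * sb,
      b + qt(0.975, n - p) * sb)

#t statistics (4.33) and two-sided p-values; compare with summary(lm(Ys ~ Xs[,-1]))
tstar = b / sb
2 * pt(abs(tstar), n - p, lower.tail = FALSE)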
We will now apply these results to the bodyfat data from Kutner et al. (Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied Linear Statistical Models (5th ed.). Boston: McGraw-Hill Irwin). In this dataset, body fat, triceps skinfold thickness, thigh circumference, and midarm circumference were measured for 20 healthy females 25-34 years old. We want to model body fat based on the other three variables.

library(tidyverse)

dat = read.table("http://www.jpstats.org/Regression/data/BodyFat.txt", header=T)

#turn the data frame into a matrix
dat.mat = as.matrix(dat)
dat.mat

       tri thigh midarm bfat
 [1,] 19.5  43.1   29.1 11.9
 [2,] 24.7  49.8   28.2 22.8
 [3,] 30.7  51.9   37.0 18.7
 [4,] 29.8  54.3   31.1 20.1
 [5,] 19.1  42.2   30.9 12.9
 [6,] 25.6  53.9   23.7 21.7
 [7,] 31.4  58.5   27.6 27.1
 [8,] 27.9  52.1   30.6 25.4
 [9,] 22.1  49.9   23.2 21.3
[10,] 25.5  53.5   24.8 19.3
[11,] 31.1  56.6   30.0 25.4
[12,] 30.4  56.7   28.3 27.2
[13,] 18.7  46.5   23.0 11.7
[14,] 19.7  44.2   28.6 17.8
[15,] 14.6  42.7   21.3 12.8
[16,] 29.5  54.4   30.1 23.9
[17,] 27.7  55.3   25.7 22.6
[18,] 30.2  58.6   24.6 25.4
[19,] 22.7  48.2   27.1 14.8
[20,] 25.2  51.0   27.5 21.1

#get the fit of the data
fit = lm(bfat~tri+thigh+midarm, data=dat)
fit %>% summary

Call:
lm(formula = bfat ~ tri + thigh + midarm, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7263 -1.6111  0.3923  1.4656  4.1277 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  117.085     99.782   1.173    0.258
tri            4.334      3.016   1.437    0.170
thigh         -2.857      2.582  -1.106    0.285
midarm        -2.186      1.595  -1.370    0.190

Residual standard error: 2.48 on 16 degrees of freedom
Multiple R-squared:  0.8014,	Adjusted R-squared:  0.7641 
F-statistic: 21.52 on 3 and 16 DF,  p-value: 7.343e-06

#to get the confidence intervals for all the coefficients
confint(fit)

                 2.5 %     97.5 %
(Intercept) -94.444550 328.613940
tri          -2.058507  10.726691
thigh        -8.330476   2.616780
midarm       -5.568367   1.196247


#to get the covariance matrix of the coefficients
vcov(fit)

            (Intercept)        tri       thigh      midarm
(Intercept)   9956.5279 300.197963 -257.382315 -158.670413
tri            300.1980   9.093309   -7.779145   -4.788026
thigh         -257.3823  -7.779145    6.666803    4.094616
midarm        -158.6704  -4.788026    4.094616    2.545617



# obtain the least squares estimates using
# matrices
# get the Y and X matrices
Y = dat.mat[,4]
X = cbind(1, dat.mat[,1:3])
X

         tri thigh midarm
 [1,] 1 19.5  43.1   29.1
 [2,] 1 24.7  49.8   28.2
 [3,] 1 30.7  51.9   37.0
 [4,] 1 29.8  54.3   31.1
 [5,] 1 19.1  42.2   30.9
 [6,] 1 25.6  53.9   23.7
 [7,] 1 31.4  58.5   27.6
 [8,] 1 27.9  52.1   30.6
 [9,] 1 22.1  49.9   23.2
[10,] 1 25.5  53.5   24.8
[11,] 1 31.1  56.6   30.0
[12,] 1 30.4  56.7   28.3
[13,] 1 18.7  46.5   23.0
[14,] 1 19.7  44.2   28.6
[15,] 1 14.6  42.7   21.3
[16,] 1 29.5  54.4   30.1
[17,] 1 27.7  55.3   25.7
[18,] 1 30.2  58.6   24.6
[19,] 1 22.7  48.2   27.1
[20,] 1 25.2  51.0   27.5


#now get the least squares estimates
b = solve(t(X)%*%X)%*%t(X)%*%Y
b

             [,1]
       117.084695
tri      4.334092
thigh   -2.856848
midarm  -2.186060


#note that solve() with a single matrix argument returns its inverse
#compare these estimates to those from lm
fit

Coefficients:
(Intercept)          tri        thigh       midarm  
    117.085        4.334       -2.857       -2.186
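
As a check on the rest of the matrix formulas, we could carry the calculation through for the bodyfat data using the X, Y, and b already in memory. The following is a sketch of how that would look; each quantity should match the lm(), vcov(), and confint() results shown above.

#continue the matrix approach with the bodyfat data
n = nrow(X); p = ncol(X)

#hat matrix (4.24), fitted values (4.25), residuals (4.26)
H = X %*% solve(t(X) %*% X) %*% t(X)
Yhat = H %*% Y
e = Y - Yhat

#SSE (4.28) and MSE, the estimate of sigma^2
SSE = drop(t(Y) %*% (diag(n) - H) %*% Y)
MSE = SSE / (n - p)

#estimated covariance matrix (4.31); compare with vcov(fit)
s2b = MSE * solve(t(X) %*% X)
s2b

#standard errors and 95% confidence intervals (4.32); compare with confint(fit)
sb = sqrt(diag(s2b))
cbind(b - qt(0.975, n - p) * sb,
      b + qt(0.975, n - p) * sb)

#t statistics (4.33); compare with the t values in summary(fit)
b / sb

#the normal equations (4.20) can also be solved without forming the inverse explicitly
solve(t(X) %*% X, t(X) %*% Y)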