
1.3 Estimation of the Line

"Regression. It is a universal rule that the unknown kinsman in any degree of any specified man, is probably more mediocre than he." - Francis Galton
The first step in regression analysis is to plot the $(X_i,Y_i)$ data values in a scatterplot. From this, you can see what type of function would best describe the relationship between $X$ and $Y$.

In most situations, the data you have will be from a sample.

Suppose you plot the data in a scatterplot and you determine that a linear function would best describe the relationship.

We will find the line that "best" fits the data. The intercept and slope of this fitted line will be estimates of the intercept ($\beta_0$) and slope ($\beta_1$) parameters in model (1.1)
$$ Y_i=\beta_0+\beta_1X_i+\varepsilon_i\qquad\qquad\qquad(1.1) $$


We now need to define what is meant by the "best" line.
In model (1.1), we are interested in modeling the variable $Y$ based on $X$. Because of this, the line that "best" fits the data should be judged on how well it "models" the observed $Y_i$ values.

For example, suppose we have the data presented in Table 1.3.1 and plotted in the scatterplot in Figure 1.3.1 below.

We want the line that will be closest to the observed $Y_i$ values. Since the $Y$ variable is represented by the vertical axis on the plot, we want a line that makes all of the vertical distances between the points and the line as small as possible.

One way to measure how close the line is to the observed values is to sum the distances $$ \left(Y_i-\hat{Y}_i\right)\qquad\qquad i=1,\ldots, n $$ where $\hat{Y}_i$ denotes the value of the estimated line at $X_i$, $$ \hat{Y}_i=\hat{\beta}_0+\hat{\beta}_1X_i. $$ Here $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimates of the intercept and slope parameters in model (1.1).

Any point below the line (where $Y_i<\hat{Y}_i$) will have a negative distance, and any point above the line will have a positive distance. Thus, when we sum these differences, the negative distances can cancel out the positive distances. To keep this from happening, we square the distances so that there are no negatives: $$ \left(Y_i-\hat{Y}_i\right)^2 $$ Thus, the best line will be the one that makes these squared distances, summed over all the points, as small as possible.
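As a quick illustration of this cancellation, consider a toy R sketch with two made-up distances (not the data used later in this section) that are equal in size but opposite in sign:

# two hypothetical vertical distances, equal in size but opposite in sign
d = c(2, -2)

sum(d)    # the signed distances cancel: 0
sum(d^2)  # the squared distances do not cancel: 8

The signed sum is zero even though the line misses both points, which is exactly why we judge a line by its squared distances.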

For the data in Table 1.3.1, the scatterplot is presented in Figure 1.3.1 below.

Use your mouse and click any two spots on the scatterplot. A line will be drawn through the two spots you click. The fitted values $\hat{Y}_i$, the distances $\left(Y_i-\hat{Y}_i\right)$, and the squared distances $\left(Y_i-\hat{Y}_i\right)^2$ are shown in Table 1.3.2 for the line you constructed.

The estimated line is
$$ \hat{Y}=\hat{\beta}_0+\hat{\beta}_1X $$
Clicking on the plot a third time will clear the line. You can then try another two points. Try to make the sum of the squared distances smaller by trying different lines.

Figure 1.3.1: Scatterplot of Data


For ease of notation, going forward we will not include the starting and ending indices on summations when the starting index is $i=1$ and the ending index is $n$. When the indices are something other than $i=1$ and $n$, they will be included on the summation.

We will denote the sum of the squared distances with $Q$: $$ \begin{align*} Q&=\sum \left(Y_i-\hat{Y}_i\right)^2 \qquad\qquad\qquad (1.2) \end{align*} $$ We determine the "best" line as the one that minimizes $Q$. We will denote the intercept and slope of the line that minimize $Q$, given the data $(X_i, Y_i)$, $i=1,\ldots, n$, as $b_0$ and $b_1$, respectively. Writing a candidate line as $b_0+b_1X_i$, we therefore minimize $$ Q=\sum \left(Y_i-\left(b_0+b_1X_i\right)\right)^2 $$ over all possible values of $b_0$ and $b_1$.
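To make the criterion concrete, here is a small R sketch that computes $Q$ for any candidate intercept and slope (the helper function Q_line below is just for illustration, and x and y are assumed to hold the $X_i$ and $Y_i$ values created in the R code later in this section):

# sum of squared vertical distances, equation (1.2), for a candidate line
Q_line = function(b0, b1, x, y) {
  yhat = b0 + b1*x      # fitted values for this candidate line
  sum( (y - yhat)^2 )   # Q, the sum of the squared vertical distances
}

Evaluating Q_line at different $(b_0, b_1)$ pairs mirrors the click-a-line exercise above: the least squares line is the pair that makes this value as small as possible.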

To minimize $Q$, we differentiate it with respect to $b_{0}$ and $b_{1}$: \begin{align*} \frac{\partial Q}{\partial b_{0}} & =-2\sum \left(Y_{i}-\left(b_{0}+b_{1}X_{i}\right)\right)\\ \frac{\partial Q}{\partial b_{1}} & =-2\sum X_{i}\left(Y_{i}-\left(b_{0}+b_{1}X_{i}\right)\right) \end{align*} Setting these partial derivatives equal to 0, we have \begin{align*} -2\sum \left(Y_{i}-\left(b_{0}+b_{1}X_{i}\right)\right) & =0\\ -2\sum X_{i}\left(Y_{i}-\left(b_{0}+b_{1}X_{i}\right)\right) & =0 \end{align*} Looking at the first equation, we can simplify as \begin{align*} -2\sum \left(Y_{i}-\left(b_{0}+b_{1}X_{i}\right)\right)=0 & \Longrightarrow\sum \left(Y_{i}-\left(b_{0}+b_{1}X_{i}\right)\right)=0\\ & \Longrightarrow\sum Y_{i}-\sum b_{0}-b_{1}\sum X_{i}=0\\ & \Longrightarrow\sum Y_{i}-nb_{0}-b_{1}\sum X_{i}=0\\ & \Longrightarrow\sum Y_{i}=nb_{0}+b_{1}\sum X_{i} \end{align*} Simplifying the second equation gives us \begin{align*} -2\sum X_{i}\left(Y_{i}-\left(b_{0}+b_{1}X_{i}\right)\right)=0 & \Longrightarrow\sum X_{i}\left(Y_{i}-\left(b_{0}+b_{1}X_{i}\right)\right)=0\\ & \Longrightarrow\sum X_{i}Y_{i}-b_{0}\sum X_{i}-b_{1}\sum X_{i}^{2}=0\\ & \Longrightarrow\sum X_{i}Y_{i}=b_{0}\sum X_{i}+b_{1}\sum X_{i}^{2} \end{align*} The two equations \begin{align*} \sum Y_{i} & =nb_{0}+b_{1}\sum X_{i}\\ \sum X_{i}Y_{i} & =b_{0}\sum X_{i}+b_{1}\sum X_{i}^{2}\qquad\qquad\qquad(1.3) \end{align*} are called the normal equations.
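Since the normal equations are just two linear equations in $b_0$ and $b_1$, they can also be solved numerically. As a sketch, assuming x and y hold the Table 1.3.1 values created later in this section, we can write the system in matrix form and solve it with R's solve():

# normal equations (1.3) written as a 2x2 linear system A %*% c(b0, b1) = rhs
A   = matrix(c(length(x), sum(x),
               sum(x),    sum(x^2)), nrow = 2, byrow = TRUE)
rhs = c(sum(y), sum(x*y))

# the solution is (b0, b1); it should match the least squares estimates found below
solve(A, rhs)

The algebraic solution that follows gives the same estimates in closed form.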

We now have two equations and two unknowns ($b_{0}$ and $b_{1}$), so we can solve the equations simultaneously. Solving the first equation for $b_{0}$ gives us \begin{align*} b_{0} & =\frac{1}{n}\left(\sum Y_{i}-b_{1}\sum X_{i}\right)\\ & =\bar{Y}-b_{1}\bar{X}. \end{align*} We now substitute this into the second equation in (1.3) and solve for $b_{1}$: \begin{align*} & \sum X_{i}Y_{i}=b_{0}\sum X_{i}+b_{1}\sum X_{i}^{2}\\ & \quad\Longrightarrow\sum X_{i}Y_{i}=\left(\bar{Y}-b_{1}\bar{X}\right)\sum X_{i}+b_{1}\sum X_{i}^{2}\\ & \quad\Longrightarrow\sum X_{i}Y_{i}=\bar{Y}\sum X_{i}-b_{1}\bar{X}\sum X_{i}+b_{1}\sum X_{i}^{2}\\ & \quad\Longrightarrow\sum X_{i}Y_{i}-\bar{Y}\sum X_{i}=\left(\sum X_{i}^{2}-\bar{X}\sum X_{i}\right)b_{1}\\ & \quad\Longrightarrow\sum X_{i}Y_{i}-\bar{Y}\sum X_{i}\underbrace{-\bar{X}\sum Y_{i}+\bar{X}\sum Y_{i}}_{\text{adding zero}}=\left(\sum X_{i}^{2}-\bar{X}\sum X_{i}\underbrace{-\bar{X}\sum X_{i}+\bar{X}\sum X_{i}}_{\text{adding zero}}\right)b_{1}\\ & \quad\Longrightarrow\sum \left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)=\left(\sum \left(X_{i}-\bar{X}\right)^{2}\right)b_{1}\\ & \quad\Longrightarrow b_{1}=\frac{\sum \left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)}{\sum \left(X_{i}-\bar{X}\right)^{2}}. \end{align*} The equations \begin{align*} b_{0} & =\bar{Y}-b_{1}\bar{X}\\ b_{1} & =\frac{\sum \left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)}{\sum \left(X_{i}-\bar{X}\right)^{2}}\qquad\qquad\qquad(1.4) \end{align*} are called the least squares estimators.

To show these estimators give a minimum, we take the second partial derivatives of $Q$: \begin{align*} \frac{\partial^{2}Q}{\partial b_{0}^{2}} & =2n\\ \frac{\partial^{2}Q}{\partial b_{1}^{2}} & =2\sum X_{i}^{2} \end{align*} Both of these are positive. In addition, the mixed partial derivative is $\frac{\partial^{2}Q}{\partial b_{0}\partial b_{1}}=2\sum X_{i}$, so the determinant of the matrix of second partial derivatives is $\left(2n\right)\left(2\sum X_{i}^{2}\right)-\left(2\sum X_{i}\right)^{2}=4n\sum\left(X_{i}-\bar{X}\right)^{2}>0$. Thus the least squares estimators minimize $Q$ rather than maximize it.
We now use R to find the least squares estimates for the line that best fits the data in Table 1.3.1. We first load the data, convert it to a tibble, and then plot it in a scatterplot.

library(tidyverse)

# Table 1.3.1 data values
x = c(1, 2, 2.75, 4, 6, 7, 8, 10)
y = c(2, 1.4, 1.6, 1.25, 1, 0.5, 0.5, 0.4)

dat = tibble(x, y)
dat
                                
# A tibble: 8 x 2
      x     y
  <dbl> <dbl>
1  1     2   
2  2     1.4 
3  2.75  1.6 
4  4     1.25
5  6     1   
6  7     0.5 
7  8     0.5 
8 10     0.4                         
ggplot(dat, aes(x=x,y=y))+
   geom_point()+
   xlim(0,10)+
   ylim(0,2)
Scatterplot of Table 1.3.1 data


We will first find the estimates manually.

# slope estimate from equation (1.4)
ybar = mean(y)
xbar = mean(x)
beta1_hat = sum( (x - xbar)*(y - ybar) ) / sum( (x - xbar)^2 )
beta1_hat
                                
    [1] -0.1766045
                                    
# intercept estimate from equation (1.4)
beta0_hat = ybar - beta1_hat*xbar
beta0_hat
                                
    [1] 1.980829
                                    
ggplot(dat, aes(x=x,y=y))+
   geom_point()+
   xlim(0,10)+
   ylim(0,2)+
   geom_abline(intercept=beta0_hat, slope=beta1_hat, color="red")
                                
Scatterplot of Table 1.3.1 data with the fitted least squares line
# find the fitted values of the line
yhat = beta0_hat + beta1_hat*x

# find the vertical distances of the points to the line
dists = y - yhat

# sum the vertical distances
sum(dists)
                                
     [1] -1.110223e-16
                                    
# sum the squared vertical distances
sum(dists^2)
                                
    [1] 0.2066899
                                    


Note that the sum of the vertical distances, $\sum \left(Y_i-\hat{Y}_i\right)$, is practically zero ($-1.110223\times10^{-16}$).

The sum of the squared vertical distances, $\sum \left(Y_i-\hat{Y}_i\right)^2$, is $0.2066899$. Compare this to the smallest sum of squared distances you found in Table 1.3.2.

Table 1.3.2

$X$     $Y$     $\hat{Y}$   $(Y-\hat{Y})$   $(Y-\hat{Y})^2$
1       2
2       1.4
2.75    1.6
4       1.25
6       1
7       0.5
8       0.5
10      0.4
                                            Sum:

The value $0.2066899$ is the smallest possible value of $\sum \left(Y_i-\hat{Y}_i\right)^2$.
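As a quick numerical check of this claim, we can nudge the estimates away from the least squares solution and confirm that the sum of squared distances only gets larger (a sketch using the beta0_hat and beta1_hat values computed above; the helper function ssq is just for illustration):

# sum of squared vertical distances for a candidate line
ssq = function(b0, b1) sum( (y - (b0 + b1*x))^2 )

ssq(beta0_hat, beta1_hat)         # the least squares line: 0.2066899
ssq(beta0_hat + 0.1, beta1_hat)   # shift the intercept up slightly
ssq(beta0_hat, beta1_hat - 0.05)  # make the slope a bit more negative

Both perturbed lines give a value larger than $0.2066899$.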

The lm() function can be used to find the least squares solutions. This function uses a formula object which takes the form
[response variable] ~ [predictor variable(s)]
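A formula is an ordinary R object, so it can be created and inspected on its own before being handed to lm(); for example:

f = y ~ x   # response on the left of ~, predictor(s) on the right
class(f)    # "formula"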


For our example, we will have the formula
y ~ x
fit = lm(y~x, data=dat)
fit
                                
Call:
lm(formula = y ~ x, data = dat)

Coefficients:
(Intercept)            x  
     1.9808      -0.1766  
# to get the fitted values
fit$fitted.values
                                
        1         2         3         4         5         6         7         8 
1.8042248 1.6276203 1.4951669 1.2744112 0.9212021 0.7445976 0.5679931 0.2147840  
# to get the vertical distances
fit$residuals
                                
          1           2           3           4           5           6           7           8 
 0.19577520 -0.22762027  0.10483313 -0.02441121  0.07879786 -0.24459761 -0.06799308  0.18521598 
# sum of the squared distances
fit$residuals^2 %>% sum()
                                
[1] 0.2066899
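As a further check, the fitted lm object stores the same quantities we computed by hand above: coef() extracts the intercept and slope, and deviance() returns the residual sum of squares.

coef(fit)       # should match beta0_hat and beta1_hat
deviance(fit)   # should match sum(dists^2) = 0.2066899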