7.4 Lasso
"You want the moon? Just say the word, and I'll throw a lasso around it and pull it down."
- George Bailey (It's a Wonderful Life)
Recall from Section 5.5.4 that ridge regression can be used as a remedial measure when there is multicollinearity in the model.
The variables are first transformed using the correlation transformation in (5.23):
\begin{align*}
Y_{i}^{*} & =\frac{1}{\sqrt{n-1}}\left(\frac{Y_{i}-\overline{Y}}{s_{Y}}\right)\\
X_{ik}^{*} & =\frac{1}{\sqrt{n-1}}\left(\frac{X_{ik}-\overline{X}_{k}}{s_{k}}\right)\qquad(5.23)
\end{align*}
The estimates are then found by minimizing \begin{align*} Q & =\sum\left[Y_{i}^{*}-\left(b_{1}^{*}X_{i1}^{*}+\cdots+b_{p-1}^{*}X_{i,p-1}^{*}\right)\right]^{2}+\lambda\left[\sum_{j=1}^{p-1}\left(b_{j}^{*}\right)^{2}\right]\qquad(5.24) \end{align*} These estimates shrink toward zero for predictors that do not have a strong linear relationship with $Y$, given the other variables.
In ridge regression, however, the coefficient estimates only shrink toward zero; they never equal exactly zero.
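To make this recap concrete, here is a minimal sketch in R, using simulated data and made-up variable names (an assumption, not part of the example analyzed later), that applies the correlation transformation in (5.23) by hand and then computes the ridge estimates minimizing (5.24) for a fixed $\lambda$ via the closed-form solution $\mathbf{b}^{*}=\left(\mathbf{X}^{*\prime}\mathbf{X}^{*}+\lambda\mathbf{I}\right)^{-1}\mathbf{X}^{*\prime}\mathbf{Y}^{*}$.
#a sketch with simulated data and made-up variable names
set.seed(1)
n = 30
X = cbind(x1=rnorm(n), x2=rnorm(n), x3=rnorm(n))
Y = rnorm(n)
#the correlation transformation in (5.23)
corr.transform = function(v) (v - mean(v)) / (sd(v) * sqrt(length(v) - 1))
Ystar = corr.transform(Y)
Xstar = apply(X, 2, corr.transform)
#after the transformation, crossprod(Xstar) is the correlation matrix of the predictors
all.equal(crossprod(Xstar), cor(X), check.attributes=FALSE)
#ridge estimates minimizing (5.24) for a fixed lambda
lambda = 0.1
b.ridge = solve(crossprod(Xstar) + lambda * diag(ncol(Xstar)), crossprod(Xstar, Ystar))
b.ridge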
The lasso (least absolute shrinkage and selection operator) is a shrinkage method like ridge regression, with subtle but important differences.
The lasso estimate is defined by minimizing \begin{align*} Q_{lasso} & =\sum_{i=1}^{n}\left[Y_{i}-\left(\beta_{0}+\beta_{1}X_{i1}+\cdots+\beta_{p-1}X_{i,p-1}\right)\right]^{2}+\lambda\sum_{j=1}^{p-1}\left|\beta_{j}\right|\qquad(7.9) \end{align*} As with ridge regression, the lasso can also be written as minimizing the sum of squares subject to the sum of the absolute values of the non-intercept coefficients being at most a constant $\tau$: \begin{align*} \sum_{j=1}^{p-1}\left|\beta_{j}\right| & \le\tau\qquad(7.10) \end{align*} As $\tau$ decreases toward 0, the coefficients shrink toward zero, with the coefficients of the most weakly associated predictors reaching exactly 0 before those of the more strongly associated predictors.
As a result, the coefficients of predictors that are not strongly associated with the response are reduced to exactly zero, which is equivalent to removing those variables from the model.
In this way, the lasso can be used as a variable selection method.
To find the optimal $\lambda$, a range of $\lambda$ values is tested, and the optimal $\lambda$ is chosen using cross-validation.
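This shrinkage behavior can be visualized with the glmnet package. The sketch below uses simulated data (not the body fat data analyzed later in this section): the coefficient-path plot shows the coefficients of the weakly related predictors hitting exactly zero first as $\lambda$ grows, and cv.glmnet chooses $\lambda$ by cross-validation.
#a sketch with simulated data: lasso coefficient paths and cross-validated lambda
library(glmnet)
set.seed(42)
n = 100
X = matrix(rnorm(n*5), ncol=5)
y = 3*X[,1] + 1.5*X[,2] + 0.2*X[,3] + rnorm(n)
#alpha=1 gives the lasso penalty
fit = glmnet(X, y, alpha=1)
#coefficients reach exactly zero at different values of lambda
plot(fit, xvar="lambda", label=TRUE)
#choose lambda by cross-validation
cv.fit = cv.glmnet(X, y, alpha=1)
cv.fit$lambda.min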
Because the lasso estimates are biased, examining the standard errors of the estimators does not give the whole picture: the standard errors might be small, but the estimates themselves are biased.
Because of this, confidence intervals and prediction intervals are not easily interpreted.
If prediction intervals are desired, one could use the lasso to select the predictor variables and then fit that model using ordinary least squares. This approach is briefly mentioned in Section 11.4 of Hastie, Tibshirani, and Wainwright (2015).
Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall/CRC.
In Example 5.5.5 we used ridge regression on the body fat data due to the multicollinearity among the predictor variables.
We will now use the lasso procedure to select the model.
library(tidyverse)
library(car)
library(olsrr)
library(MASS)
library(glmnet)
dat = read.table("http://www.jpstats.org/Regression/data/BodyFat.txt", header=T)
#for cv.glmnet, you must input the variables as matrices
X = dat[,1:3] %>% as.matrix()
y = dat[,4] %>% as.matrix()
#alpha=1 gives the lasso penalty (alpha=0 would give ridge regression)
#note: cv.glmnet uses random folds, so results vary slightly between runs unless a seed is set
fit.lasso = cv.glmnet(X, y, alpha=1)
#plot the cross-validated mean squared error across the grid of lambdas
plot(fit.lasso)
#coefficient estimates at the lambda that minimizes the cross-validated error
predict(fit.lasso, s = "lambda.min", type = "coefficients")
4 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 6.7197922
tri 0.9949266
thigh .
midarm -0.4236571
It is important to note that when two $X$ variables are highly correlated, the lasso will arbitrarily shrink one of them to zero. The practitioner should understand that the variable that was not selected could in fact be more appropriate for the model than the one that was selected.
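Following the earlier suggestion, if confidence or prediction intervals are desired one could treat the lasso purely as a variable selection step (here it kept tri and midarm and dropped thigh) and refit the selected model by ordinary least squares. The sketch below continues from the objects created above; the values in new.case are hypothetical.
#refit the lasso-selected model (tri and midarm) by ordinary least squares
ols.fit = lm(dat[,4] ~ tri + midarm, data=dat)
summary(ols.fit)
#prediction interval for a hypothetical new case
new.case = data.frame(tri=25, midarm=29)
predict(ols.fit, newdata=new.case, interval="prediction")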