7.5 Model Validation
"All generalizations are false, including this one."
- Mark Twain
The final step in the model-building process is the validation of the selected regression
models.
Model validation usually involves checking a candidate model against independent data (data not used in the model building process).
Our discussion of validation will focus primarily on issues that arise in validating regression models for exploratory observational studies.
Three basic ways of validating a regression model are:

- Collection of new data to check the model and its predictive ability.
- Comparison of results with theoretical expectations, earlier empirical results, and simulation results.
- Use of a holdout sample to check the model and its predictive ability.
The best means of model validation is through the collection of new data.
The purpose of collecting new data is to be able to examine whether the regression model developed from the earlier data is still applicable for the new data.
If so, one has assurance about the applicability of the model to data beyond those on which the model is based.
There are a variety of methods of examining the validity
of the regression model against the new data.
One validation method is to reestimate the model form chosen earlier using the new data.
The estimated regression coefficients and various characteristics of the fitted model are then compared for consistency to those of the regression model based on the earlier data.
If the results are consistent, they provide strong support that the chosen regression model is applicable under broader circumstances than those related to the original data.
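As a minimal sketch of this consistency check, the code below refits the same model form on a second data set and lists the two coefficient vectors side by side. The data arrays are synthetic stand-ins for the model-building and newly collected data, and the plain least squares fit via NumPy is just one way to carry out the refit.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_ols(X, y):
    """Least squares fit of y on X with an intercept; returns the coefficients."""
    Xd = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return b

# Synthetic stand-ins: in practice X_old, y_old are the model-building data
# and X_new, y_new are the newly collected data, with the same predictors.
X_old = rng.normal(size=(100, 2))
y_old = 1.0 + 2.0 * X_old[:, 0] - 0.5 * X_old[:, 1] + rng.normal(size=100)
X_new = rng.normal(size=(50, 2))
y_new = 1.0 + 2.0 * X_new[:, 0] - 0.5 * X_new[:, 1] + rng.normal(size=50)

b_old = fit_ols(X_old, y_old)
b_new = fit_ols(X_new, y_new)  # reestimate the same model form on the new data

# Compare the estimated coefficients for consistency.
for j, (bo, bn) in enumerate(zip(b_old, b_new)):
    print(f"b{j}: model-building {bo:7.4f}   new data {bn:7.4f}")
```

In practice one would compare the coefficients relative to their standard errors, not just their point values, before judging the results consistent.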
A second validation method is designed to calibrate the predictive capability of the selected regression model.
When a regression model is developed from given data, it is inevitable that the selected model is chosen, at least in large part, because it fits well the data at hand.
For a different set of random outcomes, one would likely have arrived at a different model in terms of the predictor variables selected and/or their functional forms and the interaction terms present in the model.
A result of this model development process is that the error mean square MSE will tend to understate the inherent variability in making future predictions from the selected model.
A means of measuring the actual predictive capability of the selected regression model is to use this model to predict each case in the new data set and then to calculate the mean of the squared prediction errors, denoted by MSPR (mean squared prediction error):
\begin{align*}
MSPR & =\frac{\sum_{i=1}^{n^{*}}\left(Y_{i}-\hat{Y}_{i}\right)^{2}}{n^*}\qquad(7.11)
\end{align*}
where
\begin{align*}
Y_{i} & \text{ is the value of the response variable in the }i\text{th validation case}\\
\hat{Y}_{i} & \text{ is the predicted value of the }i\text{th validation case based on the model-building data set}\\
n^{*} & \text{ is the number of cases in the validation data set}
\end{align*}
If the mean squared prediction error MSPR is fairly close to MSE based on the regression
fit to the model-building data set, then the error mean square MSE for the selected regression
model is not seriously biased and gives an appropriate indication of the predictive ability of
the model.
If the mean squared prediction error is much larger than MSE, one should rely on the mean squared prediction error as an indicator of how well the selected regression model will predict in the future.
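A minimal sketch of this comparison follows, assuming a coefficient vector b fit to the model-building data and hypothetical arrays X_valid, y_valid holding the validation cases. It computes MSPR exactly as in (7.11), alongside the error mean square MSE = SSE/(n - p) from the model-building fit.

```python
import numpy as np

def design(X):
    """Add the intercept column to a predictor matrix."""
    return np.column_stack([np.ones(len(X)), X])

def mspr(b, X_valid, y_valid):
    """MSPR per (7.11): mean squared prediction error over the n* validation
    cases, using coefficients b fit to the model-building data only."""
    return np.mean((y_valid - design(X_valid) @ b) ** 2)

def mse(b, X_train, y_train):
    """Error mean square from the model-building fit: SSE / (n - p)."""
    resid = y_train - design(X_train) @ b
    return resid @ resid / (len(y_train) - len(b))

# If mspr(b, X_valid, y_valid) is far above mse(b, X_train, y_train), the
# model-building MSE understates the prediction error to expect on new data.
```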
In some cases, theory, simulation results, or previous empirical results may be helpful in
determining whether the selected model is reasonable.
Comparisons of regression coefficients and predictions with theoretical expectations, previous empirical results, or simulation results should be made. Unfortunately, there is often little theory that can be used to validate regression models.
By far the preferred method to validate a regression model is through the collection of new
data. Often, however, this is neither practical nor feasible.
An alternative when the data set is large enough is to split the data into two sets.
The first set, called the model-building set or the training sample, is used to develop the model.
The second data set, called the validation or prediction set, is used to evaluate the reasonableness and predictive ability of the selected model.
This validation procedure is often called cross-validation. Data splitting in effect is an attempt to simulate replication of the study.
The validation data set is used for validation in the same way as when new data are collected.
The regression coefficients can be reestimated for the selected model and then compared for consistency with the coefficients obtained from the model-building data set.
Also, predictions can be made for the data in the validation data set from the regression model developed from the model-building data set, to calibrate the predictive ability of this regression model for the new data.
When the calibration data set is large enough, one can also study how the "good" models considered in the model selection phase fare with the new data.
Data sets are often split equally into model-building and validation data sets. It is important, however, that the model-building data set be sufficiently large so that a reliable model can be developed.
A rule of thumb is that the number of cases should be at least 6 to 10 times the number of variables in the pool of predictor variables.
Thus, when 10 variables are in the pool, the model-building data set should contain at least 60 to 100 cases.
If the entire data set is not large enough under these circumstances for making an equal split, the validation data set will need to be smaller than the model-building data set.
Splits of the data can be made at random. Another possibility is to match cases in pairs and place one of each pair into one of the two split data sets. When data are collected sequentially in time, it is often useful to pick a point in time to divide the data.
A possible drawback of data splitting is that the variances of the estimated regression coefficients developed from the model-building data set will usually be larger than those that would have been obtained from the fit to the entire data set.
If the model-building data set is reasonably large, however, these variances generally will not be that much larger than those for the entire data set. In any case, once the model has been validated, it is customary practice to use the entire data set for estimating the final regression model.
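The whole data-splitting procedure can be sketched in a few lines. The arrays below are a hypothetical full data set, and the equal random split is one of the splitting schemes described above; the final refit on the entire data set follows the customary practice just noted.

```python
import numpy as np

rng = np.random.default_rng(0)

def design(X):
    """Add the intercept column to a predictor matrix."""
    return np.column_stack([np.ones(len(X)), X])

# Hypothetical full data set: n cases, three predictors.
n = 200
X = rng.normal(size=(n, 3))
y = 2.0 + X @ np.array([1.5, -0.8, 0.3]) + rng.normal(size=n)

# Random 50/50 split into model-building and validation sets.
idx = rng.permutation(n)
train, valid = idx[: n // 2], idx[n // 2:]

# Develop the model on the model-building half only.
b, *_ = np.linalg.lstsq(design(X[train]), y[train], rcond=None)

# Calibrate predictive ability: MSPR on the validation half vs. training MSE.
mse = np.sum((y[train] - design(X[train]) @ b) ** 2) / (len(train) - len(b))
mspr = np.mean((y[valid] - design(X[valid]) @ b) ** 2)
print(f"MSE (model-building) = {mse:.3f}   MSPR (validation) = {mspr:.3f}")

# Once the model is validated, refit on the entire data set for the final model.
b_final, *_ = np.linalg.lstsq(design(X), y, rcond=None)
```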