7.5 Model Validation

"All generalizations are false, including this one."
- Mark Twain

The final step in the model-building process is the validation of the selected regression models.

Model validation usually involves checking a candidate model against independent data (data not used in the model building process).

Three basic ways of validating a regression model are:
  1. Collection of new data to check the model and its predictive ability.
  2. Comparison of results with theoretical expectations, earlier empirical results, and simulation results.
  3. Use of a holdout sample to check the model and its predictive ability.

Our discussion of validation will focus primarily on issues that arise in validating regression models for exploratory observational studies.
The best means of model validation is through the collection of new data.

The purpose of collecting new data is to be able to examine whether the regression model developed from the earlier data is still applicable for the new data.

If so, one has assurance about the applicability of the model to data beyond those on which the model is based.
There are a variety of methods of examining the validity of the regression model against the new data.

One validation method is to reestimate the model form chosen earlier using the new data.

The estimated regression coefficients and various characteristics of the fitted model are then compared for consistency to those of the regression model based on the earlier data.

If the results are consistent, they provide strong support that the chosen regression model is applicable under broader circumstances than those related to the original data.
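
To make the consistency check concrete, here is a minimal Python sketch (numpy only; the toy arrays and the fit_ols helper are hypothetical illustrations, not part of the text). It fits the same model form to the original data and to the new data, and compares the two sets of estimated coefficients relative to their standard errors:

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares fit; returns coefficients, their standard
    errors, and the error mean square MSE."""
    n, p = X.shape
    b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    mse = resid @ resid / (n - p)
    se = np.sqrt(np.diag(mse * np.linalg.inv(X.T @ X)))
    return b, se, mse

rng = np.random.default_rng(0)
beta = np.array([1.0, 2.0, -0.5])             # toy "true" coefficients

# Hypothetical original (model-building) data and newly collected data;
# each X matrix includes an intercept column and the same two predictors.
X_old = np.column_stack([np.ones(60), rng.normal(size=(60, 2))])
y_old = X_old @ beta + rng.normal(size=60)
X_new = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
y_new = X_new @ beta + rng.normal(size=40)

b_old, se_old, _ = fit_ols(X_old, y_old)      # fit to original data
b_new, _, _ = fit_ols(X_new, y_new)           # reestimate with new data
for j in range(len(b_old)):
    print(f"b{j}: old {b_old[j]:6.3f}  new {b_new[j]:6.3f}  "
          f"|diff|/se {abs(b_old[j] - b_new[j]) / se_old[j]:4.2f}")
```

Coefficient differences that are small relative to their standard errors are the kind of consistency this check looks for.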

A second validation method is designed to calibrate the predictive capability of the selected regression model.

When a regression model is developed from given data, it is inevitable that the selected model is chosen, at least in large part, because it fits the data at hand well.

For a different set of random outcomes, one might well have arrived at a different model in terms of the predictor variables selected, their functional forms, and the interaction terms present in the model.

A result of this model development process is that the error mean square MSE will tend to understate the inherent variability in making future predictions from the selected model.
A means of measuring the actual predictive capability of the selected regression model is to use this model to predict each case in the new data set and then to calculate the mean of the squared prediction errors, denoted by MSPR, which stands for mean squared prediction error:
\begin{align*}
MSPR & =\frac{\sum_{i=1}^{n^{*}}\left(Y_{i}-\hat{Y}_{i}\right)^{2}}{n^{*}}\qquad(7.11)
\end{align*}
where
\begin{align*}
Y_{i} & \text{ is the value of the response variable in the }i\text{th validation case}\\
\hat{Y}_{i} & \text{ is the predicted value of the }i\text{th validation case based on the model-building data set}\\
n^{*} & \text{ is the number of cases in the validation data set}
\end{align*}

If the mean squared prediction error MSPR is fairly close to MSE based on the regression fit to the model-building data set, then the error mean square MSE for the selected regression model is not seriously biased and gives an appropriate indication of the predictive ability of the model.

If the mean squared prediction error is much larger than MSE, one should rely on the mean squared prediction error as an indicator of how well the selected regression model will predict in the future.
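
The MSPR comparison in (7.11) is straightforward to compute. A minimal sketch, assuming hypothetical model-building data (X, y) and validation data (X_val, y_val):

```python
import numpy as np

rng = np.random.default_rng(1)
beta = np.array([1.0, 2.0, -0.5])             # toy "true" coefficients

# Hypothetical model-building data (n = 60) and validation data (n* = 40).
X = np.column_stack([np.ones(60), rng.normal(size=(60, 2))])
y = X @ beta + rng.normal(size=60)
X_val = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
y_val = X_val @ beta + rng.normal(size=40)

b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)    # fit to model-building data
mse = np.sum((y - X @ b) ** 2) / (len(y) - X.shape[1])

y_hat = X_val @ b                                  # predict each validation case
mspr = np.sum((y_val - y_hat) ** 2) / len(y_val)   # equation (7.11)

print(f"MSE = {mse:.3f}  MSPR = {mspr:.3f}")
# MSPR close to MSE: MSE gives a fair indication of predictive ability.
# MSPR much larger than MSE: rely on MSPR instead.
```
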
In some cases, theory, simulation results, or previous empirical results may be helpful in determining whether the selected model is reasonable.

Comparisons of regression coefficients and predictions with theoretical expectations, previous empirical results, or simulation results should be made. Unfortunately, there is often little theory that can be used to validate regression models.
By far the preferred method of validating a regression model is through the collection of new data. Often, however, collecting new data is not practical or even feasible.

An alternative when the data set is large enough is to split the data into two sets.

The first set, called the model-building set or the training sample, is used to develop the model.

The second data set, called the validation or prediction set, is used to evaluate the reasonableness and predictive ability of the selected model.

This validation procedure is often called cross-validation. Data splitting is, in effect, an attempt to simulate replication of the study.

The validation data set is used for validation in the same way as when new data are collected.

The regression coefficients can be reestimated for the selected model and then compared for consistency with the coefficients obtained from the model-building data set.

Also, predictions can be made for the data in the validation data set from the regression model developed from the model-building data set, to calibrate the predictive ability of this regression model for the new data.
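
Putting these steps together, the following sketch (again numpy only, with a hypothetical full data set) splits the data at random, develops the fit on the model-building half, and then performs both checks on the validation half: reestimating the coefficients and computing MSPR:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical full data set: intercept plus two predictors, n = 100 cases.
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=100)

idx = rng.permutation(len(y))                 # random 50/50 split
train, valid = idx[:50], idx[50:]

def ols(X, y):
    """OLS coefficients and error mean square."""
    b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    mse = np.sum((y - X @ b) ** 2) / (len(y) - X.shape[1])
    return b, mse

b_train, mse_train = ols(X[train], y[train])  # model-building fit
b_valid, _ = ols(X[valid], y[valid])          # reestimate on validation set
mspr = np.mean((y[valid] - X[valid] @ b_train) ** 2)

print("coefficients, model-building set:", np.round(b_train, 3))
print("coefficients, validation set:   ", np.round(b_valid, 3))
print(f"MSE = {mse_train:.3f}  MSPR = {mspr:.3f}")
```
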

When the validation data set is large enough, one can also study how the "good" models considered in the model selection phase fare with the new data.

Data sets are often split equally into model-building and validation data sets. It is important, however, that the model-building data set be sufficiently large so that a reliable model can be developed.

A rule of thumb is that the number of cases should be at least 6 to 10 times the number of variables in the pool of predictor variables.

Thus, when 10 variables are in the pool, the model-building data set should contain at least 60 to 100 cases.

If the entire data set is not large enough to allow an equal split under this guideline, the validation data set will need to be smaller than the model-building data set.

Splits of the data can be made at random. Another possibility is to match cases in pairs and place one member of each pair into each of the two split data sets. When data are collected sequentially in time, it is often useful to pick a point in time to divide the data.
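
The random and time-based strategies are simple to express in code. A brief sketch (the case count and the assumption that rows are stored in time order are hypothetical):

```python
import numpy as np

n = 120                                       # hypothetical number of cases
rng = np.random.default_rng(3)

# Random split: permute the case indices, then divide in two.
idx = rng.permutation(n)
train_random, valid_random = idx[: n // 2], idx[n // 2 :]

# Time-based split: with cases ordered by collection time, the earlier
# cases build the model and the later cases validate it.
cutoff = n // 2                               # chosen division point in time
train_time = np.arange(cutoff)
valid_time = np.arange(cutoff, n)
# (A matched-pairs split would instead pair similar cases and send one
# member of each pair to each set.)
```
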

A possible drawback of data splitting is that the variances of the estimated regression coefficients developed from the model-building data set will usually be larger than those that would have been obtained from the fit to the entire data set.

If the model-building data set is reasonably large, however, these variances generally will not be that much larger than those for the entire data set. In any case, once the model has been validated, it is customary practice to use the entire data set for estimating the final regression model.