Summary Comments on Variable Selection
1. You can not find the "true model."
2. You can find good models, with lower total expected squared error.
3. Statistical variable selection methods are often called DATA SNOOPING, and cause concern over whether the results are replicable.
Definition: "Data Snooping" means discovering artifacts (results specific to the given data that do not generalize) through excessive and/or unreasonable data manipulation.
4. When you use variable selection, you tend to find models that predict the given data well, but the fit statistics (R-square, Root MSE) tend to be overly optimistic for assessing how well the model works for external predictions. (You can assess how well the model works externally by applying it to an external data that was not used to estimate the model.)
5. Because of 3 and 4, the p-values of selected models are generally too small,
implying higher-than expected likelihood of Type I errors.
6. Use of an external validation sample is an excellent way to overcome the difficulties inherent in variable selection.
Strategies for Variable Selection
1. Use subject matter knowledge heavily.
2. Determine, a priori, a primary analysis and a secondary analysis.
Primary - Main variables
Secondary - main + questionable variables
3. Use little data snooping with primary analysis. Be prepared to claim that the model is real and repeatable.
4. Allow data snooping with secondary analysis. Only claim that the model is real and repeatable if there is extremely strong statistical evidence, and corroborating subject matter knowledge.
5. Attach appropriate caveats to any analysis resulting from data snooping:
"This analysis was exploratory in nature, and further study is needed to assess whether the findings are real or artificial."
6. If you have enough data, use a validation sample to validate your data-snooped regression model. If the results are consistent in the validation sample, then you have more confidence in them.
7. Better yet, replicate the study in a fresh context. (Different population, different industry, different historical regime). If the results are consistent, then the theory is validated. Inconsistency means that there is a problem with the theory that needs explaining. Perhaps the theory interacts with the context. Aha! A new theory!