OUTLIERS: WHAT TO DO ABOUT THEM?
Normal distributions essentially never generate extreme outliers. Therefore, if the process we want to study is one that produces extreme outliers, a model that assumes a distribution other than the normal may give better and more realistic answers than a model that assumes normality.
Imagine, for example, that you are developing a model to study a process. If you know that the process will occasionally give you an unusual value, then you need to use a model that occasionally produces outliers. Otherwise, your decisions based on the model will be based on faulty assumptions and might lead to bad recommendations. For example, if you have a model for production, and your model never produces outliers, then you might conclude that the process always proceeds smoothly, and that back-up production systems are a waste of money. On the other hand, if you have a more realistic model that occasionally generates outliers, you may conclude from your analysis that back-up systems are needed because they are cost-effective. Which model is correct? The one that best predicts reality. Which decisions are best? The ones that are based on the best model.
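To put rough numbers on the claim that normal models do not produce extreme outliers, the sketch below compares two-sided tail probabilities under a standard normal distribution and under a Student t distribution with 3 degrees of freedom. The choice of 3 degrees of freedom and the cutoffs 3, 5, and 7 are illustrative assumptions, and the cutoffs are measured in scale units rather than standard deviations.

```python
import math

def normal_two_sided_tail(k):
    """P(|Z| > k) for a standard normal Z."""
    return math.erfc(k / math.sqrt(2))

def t3_two_sided_tail(k):
    """P(|T| > k) for a Student t variable with 3 degrees of freedom.

    Uses the closed-form CDF that exists for the special case of 3 df.
    """
    u = k / math.sqrt(3)
    cdf = 0.5 + (u / (1 + u * u) + math.atan(u)) / math.pi
    return 2 * (1 - cdf)

# Illustrative cutoffs (an assumption made for this example).
for k in (3, 5, 7):
    p_norm = normal_two_sided_tail(k)
    p_t3 = t3_two_sided_tail(k)
    print(f"k={k}: normal {p_norm:.2e}   t(3) {p_t3:.2e}   ratio {p_t3 / p_norm:.1f}")
```

Under the normal model a value beyond 5 scale units is expected less than once in a million observations, while under the heavy-tailed model it happens more than once in a hundred. A production model built on the first assumption will essentially never "see" the exceptions that the second one plans for.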
OLS procedures are optimal under normality, but otherwise they are influenced strongly, and often badly, by outliers. This means that a single observation can have excessive influence on the fitted model, the significance tests, the prediction intervals, etc., and that the results can be very misleading.
Outliers are troublesome because we want our statistical estimates to reflect the main body of the data, not just single observations.
SUGGESTIONS ABOUT WHAT TO DO.
Some or all of these suggestions may be helpful, depending upon your specific study.
1. Identify the outliers. Use univariate analyses (show min, max, skewness and kurtosis), scatterplots, residual plots, normal probability plots, and regression outlier diagnostics including standardized residuals, hat diagonals, Cook's D statistics, etc. These can be examined, e.g., by using the "INFLUENCE" and "R" options in PROC REG's MODEL statement. With ODS GRAPHICS enabled, PROC REG automatically produces some nice outlier diagnostic plots.
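For readers who want to see what the regression diagnostics named above actually compute, here is a minimal sketch for simple (one-predictor) regression in plain Python. It is not a substitute for PROC REG output; the data values and the function name simple_ols_diagnostics are hypothetical, made up for illustration. The standardized residual here is the residual divided by its estimated standard error (sometimes called the internally studentized residual).

```python
import math

def simple_ols_diagnostics(x, y):
    """Fit y = b0 + b1*x by OLS and return per-observation outlier diagnostics."""
    n = len(x)
    p = 2  # number of estimated coefficients: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in resid) / (n - p)
    # Hat diagonals (leverage): how far each x is from the center of the x data.
    hat = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # Standardized residuals: residual divided by its estimated standard error.
    std_resid = [e / math.sqrt(mse * (1 - h)) for e, h in zip(resid, hat)]
    # Cook's D combines residual size and leverage into one influence measure.
    cooks_d = [r ** 2 * h / (p * (1 - h)) for r, h in zip(std_resid, hat)]
    return {"b0": b0, "b1": b1, "hat": hat, "std_resid": std_resid, "cooks_d": cooks_d}

# Hypothetical data: y is roughly 2*x, except the last response is far off the line.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 40.0]

diag = simple_ols_diagnostics(x, y)
rows = zip(diag["hat"], diag["std_resid"], diag["cooks_d"])
for i, (h, r, d) in enumerate(rows, start=1):
    print(f"obs {i:2d}  hat={h:.3f}  std_resid={r:6.2f}  cooks_d={d:.3f}")
```

In this made-up data set the last observation stands out on both the standardized residual and Cook's D. The diagnostics only flag it; deciding what it means is the subject of the next suggestion.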
2. Discuss the outliers very specifically. Are they mistakes? (If so, fix them.) Do they represent unusual circumstances that differ dramatically from the study objectives, for example, the outliers are from Oklahoma and the study involves Texas? (If so, delete them, but explain the reason clearly.) Are they just plain unusual values? (If so, discuss why they are unusual and what the practical applications of knowing this are. Does it suggest other variables that might be included in the model?) Outliers often provide valuable insight into particular observations. Knowing why an observation is an outlier is very important. For example, outlier identification is a key part of quality control. While statistical methods are used to identify outliers, non-statistical (subject matter) theory is needed to explain why outliers are the way that they are.
It is usually not permissible to conclude that an observation is an "unusual circumstance" (e.g., that it is like an "Oklahoma" observation) just because it is an outlier. In other words, it is not permissible to delete outliers automatically, without disclosure, just because they are outliers. This is the same as "sweeping them under the rug," also known as "scientific misconduct" in the extreme case.
Outliers provide interesting case studies. They should always be identified and discussed. They should never be ignored, or "swept under the rug." In any scientific research, full disclosure is the ethical approach, including a disclosure and discussion of the outliers.
In fact, in many analyses the outliers are the most interesting things. A prominent engineer/statistician is known to have claimed that many of his patents were the results of outliers. Specifically, the outliers were unusually good outcomes, and mimicking the conditions under which the outlier occurred resulted in a patent.
3. There are "decision rules" for deciding whether an observation is an outlier, and you can find these in texts. These "rules" miss the point, in my opinion. Do not delete outliers in bunches just because they "failed" one of the "decision rules." In fact, try not to pay too much attention to the so-called "decision rules." Take a more thoughtful approach. If you wish to assess influence, you can compare analyses with and without particular outliers.
You may well come to the conclusion that some of the extreme values are indeed mistakes of calculation, or some other type of mistake in the data, and that the data values should therefore not be trusted. However, simple deletion of the extremes for this reason misses the point that the same errors may exist in the "non-extreme" data values. So, those should be deleted too. Just because data values are not "outliers" does not mean the data are "good." Garbage in, garbage out. If you are going to delete some data because they are mistakes, attempt to identify the source of the mistake itself (e.g., a calculation error) and apply that deletion rule to the entire data set, not just to the outliers.
Outlier analysis requires context. You need to evaluate the context of the unusual observations in terms of the underlying science. Then you can decide what to do, whether to delete them, or use a model that accommodates them.
4. Study whether the outliers can be diminished through a log, rank, or other transformation without harming important model properties such as linearity.
5. Assess the influence of the outliers through deletion. It is best to do this using one-at-a-time deletion (it is generally not a good idea to delete outliers "en masse"). Compare the resulting analyses - do the essential conclusions remain unchanged? Do not hide any analyses - this would be "data snooping" at best, scientific misconduct at worst.
Assess how and whether your final conclusions are altered when outliers are included/excluded. This may involve comparing results of significance tests, or it may involve testing the various models on out-of-sample (validation) data.
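The one-at-a-time deletion described in suggestion 5 can be sketched in a few lines of plain Python. This is a minimal illustration for simple regression only; the data and the function names ols_fit and leave_one_out are hypothetical. It reports how the fitted slope changes when each observation is deleted in turn, which is the kind of comparison that should be disclosed in full rather than hidden.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def leave_one_out(x, y):
    """Refit with each observation deleted in turn.

    Returns a list of (index, slope without that observation, change in slope).
    """
    _, full_slope = ols_fit(x, y)
    results = []
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        _, slope = ols_fit(x_rest, y_rest)
        results.append((i, slope, slope - full_slope))
    return results

# Hypothetical data: y is roughly 2*x, except the last response is far off the line.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 40.0]

loo = leave_one_out(x, y)
print(f"slope with all data: {ols_fit(x, y)[1]:.3f}")
for i, slope, change in loo:
    print(f"delete obs {i + 1:2d}: slope={slope:.3f}  change={change:+.3f}")
```

If the essential conclusion (here, the size of the slope) changes materially when a single observation is removed, that observation deserves the specific discussion called for in suggestion 2.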
6. Model your data generating process as completely as possible using a chosen p(y|x). Thus, if your data generating process spits out an outlier every once in a while, then your model should also spit out an outlier every once in a while! You can do this using maximum likelihood and/or Bayesian methods with an appropriate non-normal distribution for Y|X=x (for example, a t distribution or a mixture distribution). This will make the predictions of your model match reality, and will make your model more useful for planning, e.g., for exceptions as described above. When you choose an appropriately heavy-tailed distribution for Y, it has the effect of automatically downweighting the outliers in the estimate of the regression function. The resulting model provides the additional benefit of allowing you to predict the frequency and size of the outliers (the exceptions).
It is easy to use maximum likelihood estimation with t-distributed errors using PROC MODEL of SAS/ETS and PROC GLIMMIX of SAS/STAT (which also gives you the ability to model repeated measures, hierarchical, and clustering effects simultaneously). To estimate the model "by hand", see my Excel spreadsheet under the "nonnormal" tab.
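To make the "automatic downweighting" concrete, here is a minimal sketch of maximum likelihood for a simple regression with t-distributed errors, written in plain Python. It is not the PROC MODEL or PROC GLIMMIX algorithm and not the spreadsheet method; it uses the standard EM (iteratively reweighted least squares) approach with the degrees of freedom held fixed. The value nu = 4, the iteration count, the data, and the function name t_regression_em are all assumptions made for illustration.

```python
def t_regression_em(x, y, nu=4.0, n_iter=200):
    """ML fit of y = b0 + b1*x + sigma*T, with T ~ t(nu) and nu held fixed.

    Uses EM: each pass does a weighted least-squares fit, then recomputes
    weights that shrink toward zero for observations with large residuals.
    Returns (b0, b1, sigma, weights).
    """
    n = len(x)
    w = [1.0] * n  # equal weights on the first pass reproduce OLS
    for _ in range(n_iter):
        # M-step: weighted least squares with the current weights.
        sw = sum(w)
        x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
        y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
        b1 = sxy / sxx
        b0 = y_bar - b1 * x_bar
        resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        sigma2 = sum(wi * e ** 2 for wi, e in zip(w, resid)) / n
        # E-step: large scaled residuals receive small weights.
        w = [(nu + 1) / (nu + e ** 2 / sigma2) for e in resid]
    return b0, b1, sigma2 ** 0.5, w

# Hypothetical data: y is roughly 2*x, except the last response is far off the line.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 40.0]

b0, b1, sigma, weights = t_regression_em(x, y)
print(f"intercept={b0:.3f}  slope={b1:.3f}  scale={sigma:.3f}")
for i, wi in enumerate(weights, start=1):
    print(f"obs {i:2d}  weight={wi:.4f}")
```

Nothing is deleted here. The outlying observation stays in the data and in the likelihood, but its weight in the estimate of the regression line becomes very small, and the fitted t model still allows for occasional values that far from the line. In practice the degrees of freedom would usually be estimated as well, which this sketch does not attempt.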
7. Use robust regression methods. Quantile regression is a newer technique that may give you exactly what you want: a picture of the distribution of Y (possibly nonnormal) for each X=x. You can model the median, the 5th percentile (e.g., for VaR analysis in finance), or whatever other quantile you want. Other robust methods include Least Median of Squares (the "LMS" subroutine of SAS/IML), Least Absolute Deviation, median smoothers, and M estimation. See here for details and references.
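As a small illustration of quantile regression, the sketch below fits a quantile line for one predictor in plain Python. It relies on the fact that, for a straight-line model, a line minimizing the quantile ("check") loss can be found among the lines passing through two data points, so for a small data set it simply tries every pair. This brute-force search is an assumption chosen for transparency, not how production software works, and the data and function names are hypothetical. With tau = 0.5 the result is the Least Absolute Deviation (median) line.

```python
from itertools import combinations

def check_loss(u, tau):
    """Quantile-regression loss: tau*u for u >= 0, (tau - 1)*u for u < 0."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def quantile_line(x, y, tau=0.5):
    """Return (intercept, slope) minimizing total check loss over two-point lines."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line cannot be written as y = b0 + b1*x
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        loss = sum(check_loss(yk - (b0 + b1 * xk), tau) for xk, yk in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, b0, b1)
    return best[1], best[2]

# Hypothetical data: y is roughly 2*x, except the last response is far off the line.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 40.0]

med_b0, med_b1 = quantile_line(x, y, tau=0.5)
print(f"median line: intercept={med_b0:.3f}  slope={med_b1:.3f}")
```

Because the loss grows linearly rather than quadratically in the residual, the single extreme response has far less pull on the median line than it has on the OLS line. Changing tau (for example to 0.05 or 0.95) targets other parts of the conditional distribution of Y.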
8. In data mining applications, sometimes outliers are deleted en masse, for example by deleting 5% of the observations according to an exceedance criterion. The goals are somewhat different for data mining versus structural modeling. For data mining, the idea is to come up with a "black box" that predicts future observations well. If the 5% rule works for predicting data in the validation sample, fine. But one should also check other rules, e.g., 2%, 10%, 1%, etc. The bottom line for data mining is "future" predictive accuracy (out-of-sample, not in-sample). If you are engaged in data mining and not structural modeling, and if outlier deletion offers demonstrable improvements in out-of-sample prediction accuracy, then do it. But even from the data mining standpoint, one might do better using a more sophisticated tool.
9. Sometimes “Trimming” or “Winsorizing” is used, where a certain percentage of extreme outliers are either simply deleted or replaced with less extreme data values. I would consider these a last resort, as they distort the meaning of the model and its parameters considerably, and the assessment of appropriate standard errors is problematic. Further, the approach misses the fundamental goal of regression, which is to model the distribution of Y for a given X, and instead crams it into a truncated distribution. Worse, these truncated distributions are then assumed normal.
You don't get the right results if you just "delete 5% and proceed as usual." If the process produces outliers, then you should use a model that also produces outliers, because the model is meant to mimic nature.
R-SQUARE IN THE PRESENCE OF OUTLIERS
In cases where there are outliers, R-square is less useful because it is based on squared deviations, and some deviations are extremely large, distorting the R-square value. In cases where the outliers are in the X-space, R-square might be artificially inflated. In cases where the outlier is in the Y-space, R-square might be artificially deflated, and in these cases it may be more appropriate to evaluate prediction accuracy using the median absolute deviations rather than the mean squared deviations.
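The deflation of R-square by a single Y-space outlier can be seen in a small example. The sketch below, in plain Python with hypothetical data and a hypothetical function name fit_and_score, computes R-square, the root mean squared residual, and the median absolute residual for an OLS line, first with and then without one extreme response.

```python
import math
import statistics

def fit_and_score(x, y):
    """Fit an OLS line and return R-square plus two residual-size summaries."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in resid)
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return {
        "r2": 1 - sse / sst,
        "rmse": math.sqrt(sse / n),  # driven by the largest residuals
        "median_abs_resid": statistics.median(abs(e) for e in resid),
    }

# Hypothetical data: y is roughly 2*x, except the last response is far off the line.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 40.0]

with_outlier = fit_and_score(x, y)
without_outlier = fit_and_score(x[:-1], y[:-1])
for label, s in (("with outlier", with_outlier), ("without outlier", without_outlier)):
    print(f"{label:16s} R2={s['r2']:.3f}  RMSE={s['rmse']:.3f}  "
          f"median |resid|={s['median_abs_resid']:.3f}")
```

In this example one observation out of ten pulls R-square well below the value describing the other nine, and the root mean squared residual is roughly double the median absolute residual. The median-based summary is closer to the prediction error for a typical observation, although it is still computed from an OLS line that the outlier has tilted.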