**OUTLIERS: WHAT TO DO ABOUT THEM?**

THE PROBLEM:

Normal distributions will not generate extreme outliers. Therefore, if the process we want to study produces extreme outliers, a model that assumes a non-normal distribution may give better, more realistic answers than a model that assumes normality.

Imagine, for example, that you are developing a model to study a process. If you know that the process will occasionally give you an unusual value, then you need to use a model that occasionally produces outliers. Otherwise, your decisions based on the model will be based on faulty assumptions and might lead to bad recommendations. For example, if you have a model for production, and your model never produces outliers, then you might conclude that the process always proceeds smoothly, and that back-up production systems are a waste of money. On the other hand, if you have a more realistic model that occasionally generates outliers, you may conclude from your analysis that back-up systems are needed because they are cost-effective. Which model is correct? The one that best predicts reality. Which decisions are best? The ones that are based on the best model.

OLS procedures are best under normality, but otherwise are strongly, and often badly, *influenced* by outliers. This means that a single observation can have excessive influence on the fitted model, the significance tests, the prediction intervals, etc., and that the results can be very misleading.

Outliers are troublesome because we want our statistical estimates to reflect
the *main body* of the data, not just single observations.

SUGGESTIONS ABOUT WHAT TO DO.

Some or all of these suggestions may be helpful, depending on your specific study.

1. Identify the outliers. Use univariate analyses (show min, max, skewness, and kurtosis), scatterplots, residual plots, normal probability plots, and regression outlier diagnostics, including standardized residuals, hat diagonals, Cook's D statistics, etc. These can be examined, e.g., by using the "INFLUENCE" and "R" options in PROC REG's MODEL statement. The ODS GRAPHICS output of PROC REG automatically produces some nice outlier diagnostic plots.
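
Outside SAS, the same diagnostics are easy to compute directly. Here is a minimal sketch in Python/numpy on simulated data with one injected outlier (the data and numbers are purely illustrative, not from the text):

```python
import numpy as np

# Simulated regression data with one injected outlier -- purely illustrative
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 30)
y[15] += 8.0  # inject a gross Y-space outlier

X = np.column_stack([np.ones_like(x), x])     # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS fit
resid = y - X @ beta

n, p = X.shape
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # hat diagonals (leverages)
s2 = resid @ resid / (n - p)                   # residual variance estimate
r_std = resid / np.sqrt(s2 * (1 - h))          # internally studentized residuals
cooks_d = (r_std**2 / p) * (h / (1 - h))       # Cook's D

print("largest |standardized residual| at index", np.argmax(np.abs(r_std)))
print("largest Cook's D at index", np.argmax(cooks_d))
```

Both diagnostics flag the injected observation, which is the point of examining them together with the plots.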

2. Discuss the outliers very specifically. Are they mistakes? (If so,
fix.) Do they represent unusual circumstances that differ dramatically from the
study objectives, for example, the outliers are from Oklahoma, and the study
involves Texas? (If so, delete them - but explain the reason clearly.) Are they
just plain unusual values? (If so, discuss why they are unusual and what the practical applications of knowing this are. Does it suggest other
variables that might be included in the model?) Outliers often provide valuable
insight into particular observations. Knowing *why* an observation is an
outlier is very important. For example, outlier identification is a key part of
quality control. While statistical methods are used to identify outliers, non-statistical theory (subject matter) is needed to
explain why outliers are the way that they are.

It is usually *not* permissible to
conclude that an observation is an "unusual circumstance" (e.g., that it is like an "Oklahoma" observation)
just because it is an outlier. In other words, it is not permissible to delete
outliers automatically, without disclosure, just because they are
outliers. This is the same as "sweeping them under the rug,"
also known as "scientific misconduct" in the extreme case.

Outliers provide interesting case studies.
They should *always* be *identified* and *discussed*. They
should *never* be ignored, or "swept under the rug." In any
scientific research, full disclosure is the ethical approach, including a
disclosure and discussion of the outliers.

In fact, in many analyses the outliers are the most interesting things. A
prominent engineer/statistician is known to have claimed that many of his
patents were the results of outliers. Specifically, the outliers were *unusually
good* outcomes, and mimicking the conditions under which the outliers
occurred led to patents.

3. There are "decision rules" for deciding whether an
observation is an outlier, and you can find these in texts. These
"rules" miss the point, in my opinion. Do not delete outliers in
bunches just because they "failed" one of the "decision
rules." In fact, try not to pay too much attention to the so-called
"decision rules." Take a more thoughtful approach. If you
wish to assess influence, you can compare analyses with and without particular
outliers.

You may well conclude that some of the extreme values are indeed mistakes of calculation or other errors in the data, and that those values should therefore not be trusted. However, simply deleting the extremes for this reason misses the point that the same errors may exist in the "non-extreme" data values, so those should be deleted too. Just because data values are not "outliers" does not mean the data are "good." Garbage in, garbage out. If you are going to delete some data because they are mistakes, attempt to identify the source of the mistake itself (e.g., a calculation error) and apply that deletion rule to the entire data set, not just to the outliers.

Outlier analysis requires context. You need to evaluate the context of the unusual observations in terms of the underlying science. Then you can decide what to do, whether to delete them, or use a model that accommodates them.

4. Study whether the outliers can be diminished through a
log, rank, or other transformation without harming important model properties
such as linearity.
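
As a quick illustration of point 4, a log transformation can pull an extreme value back toward the main body of the data (the numbers below are made up for illustration):

```python
import numpy as np

y = np.array([3.0, 4.0, 5.0, 4.5, 3.5, 120.0])  # one extreme value (made-up data)

# On the raw scale the extreme value dwarfs the rest;
# on the log scale it is far less extreme.
print("raw  max/median:", y.max() / np.median(y))                    # roughly 28x
print("log  max/median:", np.log(y).max() / np.median(np.log(y)))    # roughly 3x
```

Whether such a transformation is appropriate still depends on whether it preserves (or improves) linearity and interpretability for your problem.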

5. Assess the influence of the outliers through deletion. It is best to do this
using one-at-a-time deletion (it is generally __not__ a good idea to delete
outliers "en masse"). Compare the resulting analyses - do the
essential conclusions remain unchanged? Do not hide any analyses - this
would be "data snooping" at best, scientific misconduct at worst.

Assess how and whether your final conclusions are altered when outliers are included/excluded. This may involve comparing
results of significance tests, or it may involve testing the various models on
out-of-sample (validation) data.
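
The one-at-a-time deletion idea in point 5 can be sketched as a leave-one-out comparison of fitted slopes. The data here are simulated, with one suspect observation injected at the end:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(20, dtype=float)
y = 1.0 + 0.3 * x + rng.normal(0, 0.5, 20)
y[-1] += 10.0  # one suspect observation at a high-leverage point

def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

full = slope(x, y)
# Refit the model once per observation, with that observation deleted
loo = np.array([slope(np.delete(x, i), np.delete(y, i)) for i in range(len(x))])
shift = np.abs(loo - full)

print(f"full-sample slope: {full:.3f}")
print("largest slope change from deleting index", np.argmax(shift))
```

Deleting the injected observation changes the slope far more than deleting any other point, which is exactly the comparison the text recommends: does the essential conclusion survive the deletion?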

6. Model your data-generating process as completely as possible using a chosen p(y|x). Thus, if your data-generating
process spits out an outlier every once in a while, then your model
should also spit out an outlier every once in a while! You can do this
using maximum likelihood and/or Bayesian methods with an appropriate
non-normal distribution for Y|X=x (for example, a t distribution or a mixture
distribution). This will make the predictions of your model match
reality, and will make your model more useful for planning, e.g., for
exceptions as described above. When you choose an appropriately
heavy-tailed distribution for Y, it has the effect of automatically downweighting the outliers in the estimate of the regression
function. The resulting model provides the additional benefit of allowing
you to predict the frequency and size of the outliers (the exceptions).

It is easy to use maximum likelihood estimation with a t-distributed error term using PROC MODEL of SAS/ETS or PROC GLIMMIX of SAS/STAT (which also gives you the ability to model repeated-measures, hierarchical, and clustering effects simultaneously). To estimate the model "by hand", see my Excel spreadsheet under the "nonnormal" tab.

7. Use robust regression methods. "Quantile regression" is a newer technique that may give you exactly what you want: a picture of the distribution of Y (possibly nonnormal) for each X=x. You can model the median, the 5th percentile (for VaR analysis in finance, e.g.), or whatever other quantile you want. Other robust methods include Least Median of Squares (the "LMS" subroutine of SAS/IML), Least Absolute Deviation, median smoothers, and M estimation. See here for details and references.

8. In data mining applications,
sometimes outliers are deleted en masse, for example by deleting 5% of the
observations according to an exceedance
criterion. The goals are somewhat different for data mining vs structural modeling. For data mining, the idea is
to come up with a "black box" that predicts future observations
well. If the 5% rule works for predicting data in the validation sample,
fine. But one should also check other rules, e.g., 2%, 10%, 1%, etc. The bottom line for data mining is
"future" predictive accuracy (out-of-sample, not
in-sample). If you are engaged in data mining and not structural
modeling, and if outlier deletion offers
demonstrable improvements in out-of-sample prediction accuracy, then do
it. But even from the data mining standpoint, one might do better using a
more sophisticated tool.
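
As a sketch of the quantile-regression idea in point 7: fit a line by minimizing the "check" (pinball) loss at a chosen quantile tau. This is a bare-bones illustration on simulated, skewed data, not a substitute for a dedicated quantile-regression routine:

```python
import numpy as np
from scipy import optimize

# Simulated data with skewed, outlier-prone errors -- illustrative only
rng = np.random.default_rng(3)
x = np.linspace(0, 10, 200)
y = 1.0 + 0.5 * x + rng.exponential(2.0, 200)

def pinball(params, tau):
    """Quantile ('check') loss for a straight-line fit at quantile tau."""
    b0, b1 = params
    u = y - (b0 + b1 * x)
    return np.sum(np.where(u >= 0, tau * u, (tau - 1) * u))

median_fit = optimize.minimize(pinball, [0.0, 0.0], args=(0.5,), method="Nelder-Mead").x
q95_fit = optimize.minimize(pinball, [0.0, 0.0], args=(0.95,), method="Nelder-Mead").x
print("median line      (b0, b1):", median_fit)
print("95th-pct line    (b0, b1):", q95_fit)
```

The two fitted lines share roughly the same slope but different intercepts, giving a picture of the (skewed) distribution of Y at each x rather than just its mean.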

9. Sometimes “Trimming” or “Winsorizing” is
used, where a certain percentage of extreme observations is either simply deleted
(trimming) or replaced with less extreme data values (Winsorizing). I would consider these a last
resort, as they distort the meaning of the model and its parameters
considerably, and the assessment of appropriate standard errors is
problematic. Further, the approach misses the fundamental goal of
regression, which is to model the distribution of Y for a given X, and instead crams it
into a truncated distribution. Worse,
these truncated distributions are then assumed normal.

You don't get the right results if you just "delete 5% and proceed as
usual." If the process produces outliers, then you should use a
model that also produces outliers, because the model is meant to mimic nature.
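
To see concretely what Winsorizing does to a heavy-tailed sample (point 9), here is a small illustration using scipy's winsorize on simulated t-distributed data. Note how clipping the tails shrinks the apparent spread, which is exactly the distortion described above:

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Simulated heavy-tailed sample -- illustrative only
rng = np.random.default_rng(4)
y = rng.standard_t(df=3, size=1000) * 2.0

# Replace the most extreme 5% in each tail with the nearest retained value
yw = np.asarray(winsorize(y, limits=[0.05, 0.05]))

print(f"raw std:        {y.std():.2f}")
print(f"winsorized std: {yw.std():.2f}")  # smaller: the tails have been clipped away
```

An analysis that treats the Winsorized sample as if it came from a normal distribution will understate the frequency and size of the very exceptions the model was supposed to anticipate.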

**Further note:**

**R-Square in the Presence of Outliers**

In cases where there are outliers, R-square is less useful because it is based on squared deviations, and some deviations are extremely large, distorting the R-square value. Where the outliers are in the X-space, R-square might be artificially inflated. Where the outlier is in the Y-space, R-square might be artificially deflated, and in these cases it may be more appropriate to evaluate prediction accuracy using median absolute deviations rather than mean squared deviations.
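
A small simulated illustration of the point above: a single Y-space outlier can deflate R-square dramatically while leaving the median absolute deviation of the residuals nearly unchanged (data and numbers are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.0 * x + rng.normal(0, 1, 50)

def fit_metrics(x, y):
    """OLS fit; return R-square and median absolute deviation of residuals."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    mad = np.median(np.abs(resid))
    return r2, mad

r2_clean, mad_clean = fit_metrics(x, y)

y_out = y.copy()
y_out[25] += 30.0  # one gross Y-space outlier
r2_out, mad_out = fit_metrics(x, y_out)

print(f"clean:   R2={r2_clean:.3f}  MAD={mad_clean:.3f}")
print(f"outlier: R2={r2_out:.3f}  MAD={mad_out:.3f}")
```

The squared-deviation metric is dominated by the single huge residual, while the median-based metric still reflects the main body of the data.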