# 5.5. Robust statistics

Robust statistics were developed because
empirical data presumed to be sampled from normal distributions often
display clearly non-normal characteristics, invalidating analyses that
assume normality. They are usually introduced early in discussions of
measures of central tendency. For example, medians are far more resistant to
the influence of outliers (observations
that are deemed to deviate for reasons that may include measurement error,
mistakes in data entry, etc.) than are means, so the former are considered
more robust. Even a single outlier may adversely
affect a mean, whereas a median remains resistant when up to 50% of observations
are outliers. On the other hand, screening outliers for removal may be
subjective and difficult for highly structured data, where a response variable
may be functionally related to many independent variables. If “outliers” are
removed, the resulting variance estimates are often too small, leading to overly
liberal tests (i.e. *p* values that are too small).
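The contrast between the mean and median can be shown with a quick numerical illustration (a hypothetical sample, not data from the text): a single gross outlier shifts the mean substantially but barely moves the median.

```python
from statistics import mean, median

# Five well-behaved observations (hypothetical data)
clean = [10, 11, 12, 13, 14]

# The same data plus a single outlier (e.g. a data-entry error)
contaminated = clean + [100]

print(mean(clean), median(clean))                # both 12
print(mean(contaminated), median(contaminated))  # mean ~26.7, median 12.5
```

One contaminated value out of six more than doubles the mean, while the median moves only from 12 to 12.5.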

What are the
alternatives when one cannot assume that data are generated by typical
parametric models (e.g. normal, Poisson, binomial distributions)? This may be a
result of contamination (e.g. most of the data come from a normal distribution
with mean μ and variance σ₁², but a small percentage come from a normal
distribution with the same mean μ and variance σ₂², where σ₂² ≫ σ₁²),
a symmetric distribution with heavy tails,
such as a *t* distribution with few degrees of freedom, or some highly
skewed distribution (especially common when there is a hard limit, such as no
negative values, typical of count data and of estimates from analytical
procedures; e.g. titres). Robust statistics are generally applicable
when the distribution from which the data are drawn is symmetric. “Non-parametric”
statistics are typically based on ordering observations by their magnitude, and
are thus more general, but have lower power than either typical parametric
models or robust statistical models. However, robust statistics never “caught
on” to any great degree in the biological sciences; they should be used far
more often (perhaps in most cases where the normal distribution is assumed).
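The contaminated-normal model described above is easy to simulate. The sketch below (with made-up mixture parameters: 95% from N(0, 1), 5% from N(0, 10²)) shows how the heavy-tailed component inflates the classical standard deviation, while a robust scale estimate based on the median absolute deviation (MAD) stays near the scale of the bulk of the data.

```python
import random
from statistics import median, stdev

random.seed(1)

# Contaminated sample: 95% N(0, 1), 5% N(0, 10) (hypothetical mixture)
data = [random.gauss(0, 1) if random.random() < 0.95 else random.gauss(0, 10)
        for _ in range(1000)]

# Classical scale estimate: sample standard deviation
sd = stdev(data)

# Robust scale estimate: normalised MAD (1.4826 makes it consistent
# for the standard deviation at the normal distribution)
m = median(data)
mad = 1.4826 * median(abs(x - m) for x in data)

print(sd, mad)  # sd is inflated well above 1; mad stays near 1
```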

Most
statistics packages have some procedures based on robust statistics; R has
particularly good representation (e.g. the MASS package). All typical
statistical models (e.g. regression, ANOVA, multivariate procedures) have
counterparts using robust statistics. Estimating these models was once
considered difficult (involving iterative solutions, maximisation, etc.), but
they are now estimated quickly. The generalised linear class of models
(GLM) has some overlap with robust statistics, because one can base models on,
e.g. heavy-tailed distributions in some software, but the approach is
different. In general, robust statistics try to diminish effects of
“influential” observations (i.e. outliers). GLMs, once a sampling distribution
is specified (theoretical sampling distributions include highly skewed or
heavy-tailed ones, though what is actually available depends on the software
package), treat all observations as legitimate samples from that
distribution. We recommend analysing data in several different ways if
possible. If they all agree, then one might choose the analysis which best
matches the theory (the sampling distribution best reflecting our knowledge of
the underlying process) of how the data arose. When methods disagree, one must
then determine why they differ and make an educated choice on which to use. For
example, if assuming a normal distribution results in different confidence
limits around means than those obtained using robust statistics, it is likely
that there is serious contamination from outliers that is ignored by assuming a
normal distribution. A recent reference on robust statistics is Maronna *et al*. (2006), while the
classic one is Huber (1981).
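The iterative estimation mentioned above can be sketched for the simplest case, an M-estimate of location in the spirit of Huber. This is a minimal illustration, not code from any package: observations with large residuals (relative to a MAD-based scale) are downweighted rather than removed, and a weighted mean is recomputed until it converges.

```python
from statistics import median

def huber_location(xs, k=1.345, tol=1e-8, max_iter=100):
    """Huber-type M-estimate of location via iteratively reweighted means.

    k = 1.345 is the usual tuning constant giving roughly 95%
    efficiency at the normal distribution.
    """
    mu = median(xs)                                   # robust starting value
    scale = 1.4826 * median(abs(x - mu) for x in xs)  # normalised MAD
    if scale == 0:
        return mu
    for _ in range(max_iter):
        # Huber weights: 1 inside [-k, k], k/|r| outside
        w = [1.0 if abs((x - mu) / scale) <= k else k / abs((x - mu) / scale)
             for x in xs]
        new_mu = sum(wi * xi for wi, xi in zip(w, xs)) / sum(w)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

# A single gross outlier is downweighted, so it barely moves the estimate
print(huber_location([10, 11, 12, 13, 14, 100]))
```

For five well-behaved observations plus one gross outlier, the estimate stays near the bulk of the data, close to the median rather than the heavily shifted mean; robust regression (e.g. `rlm` in the MASS package) applies the same reweighting idea to model residuals.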