# 10.3.1. Descriptive analysis

Any statistical analysis of data should begin with simple data description and presentation using summary statistics, tables and plots. While the main interest may well be in modelling, the initial analysis is still an essential first step. The results of such an analysis, presented in well-designed graphs and/or tables, can reveal unsuspected patterns in the data, and will ensure that the obvious characteristics of the data are clearly understood by readers of any resulting report.

Responses to any simple survey question clearly require this approach, e.g. to determine the proportion of respondents in a postal survey who are currently beekeepers, where this cannot be determined in advance of choosing the sample, or the proportion wishing to remain anonymous, or the proportion experiencing any colony losses over a specified period. The data on any categorical variable can also be presented in a bar chart and/or a contingency table, with frequencies and relative frequencies, for an overview of the responses, the range of values and the most common category. This will also help in identifying invalid responses.
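As a sketch of this kind of tabulation, the frequencies and relative frequencies of a categorical response can be computed directly; the responses below are hypothetical, purely for illustration:

```python
from collections import Counter

# Hypothetical responses to a yes/no survey question ("Did you lose colonies?")
responses = ["yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes", "maybe"]

counts = Counter(responses)                  # absolute frequencies
n = sum(counts.values())
rel = {k: v / n for k, v in counts.items()}  # relative frequencies

for category, freq in counts.most_common():
    print(f"{category:6s} {freq:3d} {freq / n:6.2f}")
# "maybe" is not a valid option for this question, so the table
# immediately flags one invalid response for follow-up
```

The same counts feed directly into a bar chart or a contingency table; inspecting the full list of observed categories is what exposes invalid responses.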

Extending this analysis to more than one categorical variable, e.g. to compare the proportions of losses experienced by respondents in different countries, or by geographical area within a single country, or for different sizes of beekeeping operation, two-way tables are useful. Relevant follow-up tests include chi-squared tests of association or homogeneity, which permit statistical investigation of the possible significance of differences in sample proportions. Even if the observations contributing to each cell of the table are not all independent, these results can inform any subsequent modelling, for example by identifying potential risk factors for colony loss to be included in the model.
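A minimal sketch of such a test, using `scipy.stats.chi2_contingency` on a hypothetical two-way table of country against loss status (the counts are invented for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows = country A/B, columns = lost colonies yes/no
table = np.array([[30, 70],
                  [45, 55]])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
# A small p-value would suggest that the proportion of beekeepers
# experiencing losses differs between the two countries
```

For a 2x2 table, `chi2_contingency` applies Yates' continuity correction by default; the `expected` array gives the cell counts implied by the hypothesis of no association.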

For questions with a quantitative response, of most interest is some measure of a typical or central value. The most appropriate measure depends on the distribution of the numerical responses. Where these are fairly symmetrically distributed, and there are not many extreme atypical values, the best measure is the *mean*, or arithmetical average, of the observations. However, if the distribution of the data is very skewed, and/or there is a fairly large proportion of extreme atypical values, then the mean can be seriously misleading. For example, in the distributions of the number of colonies kept per beekeeper, or of honey yield, the existence of a few very large numbers of colonies kept, or correspondingly high honey yields, has the effect that the mean will give a grossly inflated idea of what a typical value is. The number of lost colonies per beekeeper also tends to have a highly positively skewed distribution. For such cases the *median* is preferred. This is the central observation, or the mid-point between the two central observations, after the data have been arranged in increasing order of magnitude.
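The effect of skewness on these two measures can be illustrated with a small hypothetical sample of colony counts, using only the Python standard library:

```python
import statistics

# Hypothetical colonies kept per beekeeper: mostly small operations,
# plus one very large apiary that skews the distribution
colonies = [2, 3, 3, 4, 5, 5, 6, 8, 10, 250]

mean = statistics.mean(colonies)      # pulled far upward by the single large apiary
median = statistics.median(colonies)  # mid-point of the two central values (5 and 5)
print(mean, median)  # 29.6 vs 5.0
```

Here the mean of 29.6 describes no beekeeper in the sample at all, whereas the median of 5 is a genuinely typical value.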

Almost as important is some measure of the dispersion of the observations around the mean or median, whichever has been chosen as most appropriate. The usual choices are either the *standard deviation* for variables for which the mean is used, or the *inter-quartile range* for situations where the median is the appropriate measure of a typical value. (Any first-level statistics textbook, such as Ott and Longnecker (2009) or Samuels *et al.* (2010), will describe the computation of these quantities.) Confidence intervals based on the mean are *z*-intervals in the case of a large sample, or *t*-intervals for smaller samples. Population means may be compared using Analysis of Variance (ANOVA; see Ott and Longnecker (2009), and Pirk *et al.* (2013)), assuming independent observations and independent samples from normal distributions. For medians, nonparametric confidence intervals and tests are available, including the Kruskal-Wallis test as a nonparametric equivalent of ANOVA. Nonparametric procedures are generally robust to data that do not conform exactly to the assumptions of the test procedure.
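These quantities and tests are all available in standard software. A brief sketch using `scipy.stats` on two artificially generated skewed samples (the Gamma-distributed "losses" below are hypothetical, chosen only to mimic skewed survey data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.gamma(2.0, 5.0, size=50)  # hypothetical skewed sample, group A
b = rng.gamma(2.0, 7.0, size=50)  # hypothetical skewed sample, group B

# Dispersion: standard deviation (mean-based) and inter-quartile range (median-based)
sd = a.std(ddof=1)
iqr = np.percentile(a, 75) - np.percentile(a, 25)

# 95% t-interval for the mean of group A (assumes approximate normality)
ci = stats.t.interval(0.95, df=len(a) - 1, loc=a.mean(), scale=stats.sem(a))

# Kruskal-Wallis test: nonparametric comparison of the two groups
h, p = stats.kruskal(a, b)
print(sd, iqr, ci, p)
```

With strongly skewed samples such as these, the Kruskal-Wallis comparison is the safer choice; the *t*-interval is shown only because such data are often analysed this way when samples are reasonably large.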

Histograms are essential graphical tools for studying the nature of the probability distribution of a quantitative variable, such as the number of colonies kept, the number of colonies lost, or honey yield, and hence for determining whether the distribution is symmetric or skewed. Boxplots can also be useful in this regard. Comparing these between countries, for example, can indicate differences.
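A quick way to inspect such a distribution, even without plotting software, is to bin the data with `numpy.histogram`; the simulated loss counts below are hypothetical, and `matplotlib.pyplot.hist` would draw the equivalent figure:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical simulated colony-loss counts: many zeroes, a long right tail
losses = rng.negative_binomial(1, 0.3, size=200)

counts, edges = np.histogram(losses, bins=10)
# Crude text histogram: the tall first bar reveals the positive skew
for c, left in zip(counts, edges):
    print(f"{left:5.1f} | {'#' * c}")
```

The dominant first bin and the sparse right tail are exactly the visual signature of positive skewness that suggests reporting a median rather than a mean.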

Comparing a histogram visually with the theoretical density functions of a range of possible probability distributions is also a simple first step in selecting and justifying a plausible model for use in more advanced statistical modelling of a dataset. The most frequently used probability model for the distribution of continuous numerical data is the symmetric bell-shaped *Normal* or *Gaussian* distribution. However, for data which are clearly asymmetrical and skewed, the choice is wider. For continuous positive data, the *Gamma* distribution provides a large family of shapes of probability distribution, while the *Beta* distribution can be used for positive data over a finite range between 0 and some given positive value *a*. For skewed count data, the *Negative Binomial* distribution may be appropriate; for example, data describing the number of colony losses contain many zeroes, but may also include some rather high numbers lost. Various tests of goodness of fit can be used to see whether any of these models can be clearly ruled out, but often the final choice is governed by considerations of convenience and mathematical tractability.
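As an illustration of this model-screening step, a candidate distribution can be fitted by maximum likelihood and then checked with a goodness-of-fit test. The sketch below uses a simulated, hypothetical honey-yield sample and a Kolmogorov-Smirnov test; note that the KS p-value is only approximate when the parameters are estimated from the same data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical honey yields (kg), simulated from a skewed distribution
yields = rng.gamma(3.0, 10.0, size=300)

# Fit a Gamma distribution by maximum likelihood (location fixed at 0
# since yields are positive), then test whether the fit can be ruled out
shape, loc, scale = stats.gamma.fit(yields, floc=0)
ks = stats.kstest(yields, "gamma", args=(shape, loc, scale))
print(shape, scale, ks.pvalue)
# A large p-value means the Gamma model cannot be ruled out on this sample
```

The same pattern (fit, then test) applies to the Negative Binomial for counts, although discrete distributions call for a chi-squared goodness-of-fit test on binned counts rather than a KS test.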