10.3.1. Descriptive analysis
Any statistical analysis of data should begin with simple data description and presentation using summary statistics, tables and plots. While the main interest may well be in modelling, the initial analysis is still an essential first step. The results of such analysis with well-designed graphs and/or tables can reveal unsuspected patterns in the data, and will ensure that the obvious characteristics of the data are clearly understood by readers of any resulting report.
Responses to any simple survey question clearly require this approach, e.g. to determine the proportion of respondents in a postal survey who are currently beekeepers (where this cannot be determined before the sample is chosen), the proportion wishing to remain anonymous, or the proportion experiencing any colony losses over a specified period. The data on any categorical variable can also be presented in a bar chart and/or a frequency table, showing frequencies and relative frequencies, to give an overview of the responses, the range of values and the most common category. This will also help in identifying invalid responses.
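As a minimal sketch of such a summary, the following Python fragment tabulates frequencies and relative frequencies for one categorical question and flags unexpected categories. The responses and the set of permitted answers (`VALID`) are hypothetical and serve only as an illustration.

```python
from collections import Counter

# Hypothetical answers to "Are you currently a beekeeper?" (illustrative data only).
responses = ["yes", "yes", "no", "yes", "", "no", "yes", "yes", "maybe", "yes"]
VALID = {"yes", "no"}  # assumed set of permitted answers


def frequency_table(values):
    """Map each category to (frequency, relative frequency), most common first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, n / total) for cat, n in counts.most_common()}


table = frequency_table(responses)
for cat, (n, rel) in table.items():
    label = cat if cat else "(blank)"
    print(f"{label:<8} {n:>3} {rel:>6.1%}")

# Categories outside the permitted set flag responses that need checking.
invalid = [cat for cat in table if cat not in VALID]
print("Categories to check:", invalid)
```

Listing every observed category, rather than only the expected ones, is what allows blank or mistyped answers to surface at this stage.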
To extend this analysis to more than one categorical variable, two-way tables are useful, e.g. to compare the proportions of losses experienced by respondents in different countries, in different geographical areas within a single country, or for different sizes of beekeeping operation. Relevant follow-up tests include chi-squared tests of association or homogeneity, which assess whether the differences in sample proportions are statistically significant. These tests assume that the observations contributing to each cell of the table are independent. Even where this does not hold exactly, the results can still inform any subsequent modelling, for example by identifying potential risk factors for colony loss to be included in the model.
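The chi-squared statistic for a two-way table can be sketched as follows. The counts are hypothetical, the function name `chi_squared_statistic` is introduced here for illustration, and no continuity correction is applied. The closed-form p-value is valid only for one degree of freedom; for larger tables the statistic would be compared with tabulated critical values or computed with a statistical package.

```python
import math


def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under no association between rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Hypothetical counts: rows are two regions, columns are (some loss, no loss).
observed = [[30, 70],
            [50, 50]]
stat, df = chi_squared_statistic(observed)

# For df = 1 only, the upper-tail probability has a closed form via erfc.
p_value = math.erfc(math.sqrt(stat / 2)) if df == 1 else None
print(f"chi-squared = {stat:.2f}, df = {df}, p = {p_value:.4f}")
```

With these illustrative counts the statistic exceeds 3.841, the 5% critical value for one degree of freedom, so the two sample proportions (30% and 50%) would be judged to differ, provided the independence assumption is reasonable.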
For questions with a quantitative response, some measure of a typical or central value is of most interest. The most appropriate measure depends on the distribution of the numerical responses. Where these are fairly symmetrically distributed, and there are not many extreme atypical values, the best measure is the mean, or arithmetic average, of the observations. However, if the distribution of the data is very skewed, and/or there is a fairly large proportion of extreme atypical values, then the mean can be seriously misleading. For example, in the distributions of number of colonies kept per beekeeper, or of honey yield, a few beekeepers with very large numbers of colonies or correspondingly high honey yields will cause the mean to give a grossly inflated idea of what is a typical value. The number of lost colonies per beekeeper also tends to have a highly positively skewed distribution. For such cases the median is preferred. This is the central observation, or the mid-point between the two central observations, after the data have been arranged in increasing order of magnitude.
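The effect of a single extreme value on the mean, compared with the median, can be illustrated with a short sketch. The colony numbers below are hypothetical.

```python
import statistics

# Hypothetical numbers of colonies kept by ten beekeepers; one is a large operation.
colonies = [2, 3, 3, 4, 5, 5, 6, 8, 10, 400]

mean = statistics.mean(colonies)      # pulled upwards by the value 400
median = statistics.median(colonies)  # mid-point of the two central observations

print(f"mean = {mean:.1f}, median = {median:.1f}")  # mean = 44.6, median = 5.0
```

Nine of the ten beekeepers keep ten colonies or fewer, so the median of 5 describes a typical respondent far better than the mean of 44.6.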
Almost as important is some measure of dispersion of the observations around the mean or median, whichever has been chosen as most appropriate. The usual choices are the standard deviation where the mean is used, or the inter-quartile range where the median is the appropriate measure of a typical value. (Any introductory statistics textbook, such as Ott and Longnecker (2009) or Samuels et al. (2010), will describe the computation of these quantities.) Confidence intervals based on the mean are z-intervals in the case of a large sample, or t-intervals for smaller samples. Population means may be compared using Analysis of Variance (ANOVA; see Ott and Longnecker (2009) and Pirk et al. (2013)), assuming independent observations and independent samples from normal distributions. For medians, nonparametric confidence intervals and tests are available, including the Kruskal-Wallis test as a nonparametric equivalent of ANOVA. Because they make fewer distributional assumptions, nonparametric procedures are generally robust to data that do not conform exactly to assumptions such as normality.
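The dispersion measures and a large-sample z-interval might be computed along the following lines. The honey yields are simulated, hypothetical values, and the sample of 200 is assumed large enough for the z-interval to be reasonable. The quartiles use the default method of Python's `statistics.quantiles`; other software may interpolate slightly differently.

```python
import math
import random
import statistics

# Simulated, hypothetical honey yields (kg per colony) for 200 respondents.
rng = random.Random(42)
yields = [rng.gauss(25.0, 6.0) for _ in range(200)]

mean = statistics.mean(yields)
sd = statistics.stdev(yields)  # sample standard deviation (n - 1 denominator)

# Quartiles and inter-quartile range, for use alongside the median.
q1, med, q3 = statistics.quantiles(yields, n=4)
iqr = q3 - q1

# Large-sample 95% z-interval for the population mean.
z = statistics.NormalDist().inv_cdf(0.975)  # approximately 1.96
half_width = z * sd / math.sqrt(len(yields))
ci = (mean - half_width, mean + half_width)

print(f"mean = {mean:.2f}, sd = {sd:.2f}")
print(f"median = {med:.2f}, IQR = {iqr:.2f}")
print(f"95% z-interval for the mean: ({ci[0]:.2f}, {ci[1]:.2f})")
```

For a small sample the multiplier `z` would be replaced by the corresponding quantile of the t-distribution with n - 1 degrees of freedom, which the standard library does not provide.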
Histograms are essential graphical tools for studying the shape of the distribution of a quantitative variable, such as number of colonies kept, number of colonies lost or honey yield, and hence for determining whether it is symmetric or skewed. Boxplots can also be useful in this regard. Comparing histograms or boxplots between countries, for example, can reveal differences in location, spread or shape.
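A rough text histogram is often enough to reveal skewness before any plotting software is used. The sketch below bins hypothetical colony-loss counts into equal-width classes; the helper `histogram_counts` is introduced here for illustration and assumes the data are not all equal.

```python
def histogram_counts(values, n_bins):
    """Return bin edges and counts for equal-width bins spanning the data range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than overflowing.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Hypothetical numbers of colonies lost per beekeeper.
losses = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 6, 9, 15, 20]
edges, counts = histogram_counts(losses, 5)

for left, right, c in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f} | {'#' * c}")
```

The long run of short bars to the right of a tall first bar is the signature of a positively skewed distribution, for which the median and inter-quartile range are the more suitable summaries.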
Comparing a histogram visually with the theoretical density functions of a range of possible probability distributions is also a simple first step in selecting and justifying a plausible model for use in more advanced statistical modelling of a dataset. The most frequently used probability model for continuous numerical data is the symmetric, bell-shaped Normal (Gaussian) distribution. However, for data which are clearly asymmetrical and skewed, the choice is wider. For continuous positive data, the Gamma distribution provides a large family of distributional shapes, while the Beta distribution can be used for positive data over a finite range between 0 and some given positive value a. For skewed count data, the Negative Binomial distribution may be appropriate. For example, data describing the number of colony losses contain many zeroes, but may also include some rather high numbers lost. Various goodness-of-fit tests can be used to see whether any of these models can be clearly ruled out, but often the final choice is governed by considerations of convenience and mathematical tractability.
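One informal way to check whether a Negative Binomial model is plausible for such count data is to match its mean and variance to the sample and compare the implied proportion of zeroes with the observed one. The sketch below does this for hypothetical loss counts. Moment matching is used here purely for simplicity and is an assumption of this illustration, not a recommended fitting method; maximum likelihood and a formal goodness-of-fit test would normally follow.

```python
import statistics

# Hypothetical numbers of colonies lost per beekeeper (illustrative data only).
losses = [0] * 12 + [1] * 5 + [2] * 3 + [3] * 2 + [5, 8, 14]

mean = statistics.mean(losses)
var = statistics.variance(losses)  # sample variance (n - 1 denominator)

if var <= mean:
    raise ValueError("moment matching needs variance > mean (overdispersion)")

# Negative Binomial with size r and success probability p:
#   mean = r(1 - p)/p,  variance = r(1 - p)/p**2,  P(X = 0) = p**r
p = mean / var
r = mean ** 2 / (var - mean)

observed_zero = losses.count(0) / len(losses)
expected_zero = p ** r

print(f"mean = {mean:.2f}, variance = {var:.2f}")
print(f"moment estimates: r = {r:.3f}, p = {p:.3f}")
print(f"proportion of zeroes: observed {observed_zero:.2f}, "
      f"model {expected_zero:.2f}")
```

A sample variance far larger than the sample mean, together with close agreement between the observed and model proportions of zeroes, would support keeping the Negative Binomial among the candidate models.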