# 1. Introduction

Bees are animals and, as such, are inherently variable at the molecular, individual, and population levels. This intrinsic variability means that a researcher needs to separate the variability contained in the measurements, whether obtained by observational or experimental research, into signal and noise. The former may be due to treatments received, bee age, or innate differences in resistance. The latter is largely due to the genetic background, and its phenotypic expression, that characterises individual living organisms. Statistics is the branch of mathematics we use to isolate and quantify the signal and to determine its importance relative to the inherent noise. With an eye toward the statistical analysis to come, and before data collection starts, the researcher should ask:

1) Which variables (VIM, 2008) am I going to measure and what kind of data will those variables generate?

2) What degree of accuracy do I want to achieve and what is the corresponding sample size required?

3) Which statistical analysis will help me to answer my research question? This is related to the question: what kind of underlying process produces data like those I will be collecting?

4) From what population do I want to sample? (What is the statistical population/statistical universe?) For example, do I want to make inferences about the local, national, continental, or worldwide population?
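Question 2 can be made concrete for one simple case. The sketch below assumes that the quantity of interest is a proportion (for example, pathogen prevalence), that bees are sampled independently, and that the normal approximation to the binomial is adequate. The function name `sample_size_for_proportion` and the example values are illustrative assumptions, not taken from this article. It applies the standard formula $n = z^2 p(1-p)/d^2$, where $p$ is the anticipated proportion and $d$ is the desired half-width of the confidence interval.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(expected_p, margin, confidence=0.95):
    """Approximate number of bees needed to estimate a proportion.

    expected_p : anticipated proportion (e.g. a guess at prevalence)
    margin     : desired half-width of the confidence interval
    confidence : two-sided confidence level

    Assumes independent sampling and the normal approximation;
    clustering of bees within colonies is NOT accounted for.
    """
    if not 0.0 < expected_p < 1.0:
        raise ValueError("expected_p must lie strictly between 0 and 1")
    if not 0.0 < margin < 1.0:
        raise ValueError("margin must lie strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")

    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    n = z ** 2 * expected_p * (1.0 - expected_p) / margin ** 2
    return math.ceil(n)  # round up to a whole number of bees


# Worst case (p = 0.5) with a margin of +/- 5 percentage points.
print(sample_size_for_proportion(0.5, 0.05))  # 385
```

Using `expected_p = 0.5` gives the most conservative (largest) sample size; a smaller anticipated proportion, or a wider acceptable margin, reduces the number required. Because bees from the same colony are not independent, a real design would usually need more bees than this figure suggests.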

One function of statistics is to summarise
information to make it more usable and easier to grasp. A second is inductive,
where one makes generalisations based on a subset of a population or based on
repeated observations (through replication or repeated over time). For example,
if 50 workers randomly sampled from 20 colonies all produce 10-hydroxydecanoic acid (10-HDAA, one of the
major components in the mandibular gland, especially in workers; Crewe, 1982; Pirk *et al.*, 2011), one could infer that all workers produce
10-HDAA. An example of inferring a general pattern from repeated observations
would be: If an experiment is repeated 5 times and yields the same result each
time, one makes a generalisation based on this limited number of experiments.
One should keep in mind that, if one is measuring a quantitative variable,
irrespective of how precise measuring instruments are, each experimental
unit/replicate produces a unique data value. A third function of statistics is based
on deductive reasoning and might involve statistical modelling, in the
classical or Bayesian paradigm, to understand the basic processes that produced
the measurements, possibly by incorporating prior information (e.g. predicting
species distributions or phylogenetic relationships/trees; see Kaeker
and Jones, 2003).

In this article we will cover, albeit
incompletely, all three functions of statistics. We have largely focused on research with bee
pathogens, in part because these are of intense practical and theoretical
interest, and in part because of our own backgrounds. However, bee biology rightly
includes a much greater spectrum of research, and for much of it there are
specialised statistical tools. Some of
the ones we discuss are broadly applicable but, by necessity, this section can
only provide an uneven treatment of current statistical methods that might be
used in bee research. In particular, we
do not discuss multivariate methods (other than principal components) or Bayesian
approaches, and we touch only lightly on simulation and resampling methods, all
of which are current fields of investigation in statistics. Molecular and, in particular, genomic
research has spawned substantial new statistical methodology, which is also not covered
here.
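The inductive step in the 10-HDAA example above can be quantified with a short sketch. Assuming the 50 workers form an independent random sample (an assumption that ignores the fact that they come from only 20 colonies), and that all 50 test positive, the exact one-sided lower confidence bound for the true proportion of producers is $\alpha^{1/n}$. The function name `lower_bound_all_positive` is hypothetical and used only for illustration.

```python
def lower_bound_all_positive(n, confidence=0.95):
    """Exact one-sided lower confidence bound for a proportion
    when all n sampled individuals show the trait.

    Solves p**n = alpha for p, i.e. the smallest true proportion
    under which observing n out of n positives is still plausible
    at the chosen confidence level. Assumes independent sampling.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    alpha = 1.0 - confidence
    return alpha ** (1.0 / n)


# 50 of 50 workers produce 10-HDAA.
print(round(lower_bound_all_positive(50), 3))  # 0.942
```

Under these assumptions, 50 positives out of 50 support the statement that at least about 94% of workers produce 10-HDAA, with 95% confidence. This falls short of "all workers", which illustrates why generalisations from a subset are always qualified by the sample size.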

Furthermore, we restrict ourselves here to providing guidelines on statistics for certain kinds of honey bee research, as mentioned above, and refer readers to more detailed sources of information. Fortunately, there are excellent statistical tools available, the most important of which is a good statistician.

The statistics
we describe can be roughly grouped into two main areas, one having to do with
sampling to estimate population characteristics (e.g. pathogen prevalence, the
proportion of infected bees in an apiary or a colony), and the other having to
do with experiments (e.g. comparing treatments, one of which may be a control).
Due to the complex social structure of a bee hive, and the peculiar
developmental and environmental aspects of bee biology, sampling in this
discipline has more components to consider than in most biological fields. Some
statistical topics are relevant to both sampling and experimental studies, such
as sample size and power. Others are primarily of concern for just one of the
areas. For example, when sampling for pathogen prevalence, primary issues
include representativeness, and how or when to sample. For experiments, they
include hypothesis formulation and development of appropriate statistical
models for the processes (which includes testing the models and their assumptions). Of
course, good experiments require representative samples, and also require a
good understanding of sampling. Both areas are important for data acquisition
and analysis. We start with statistical issues related to sampling.