# 3.1.3.1. Documenting and naming honey bee variation

Detecting, documenting and defining previously unknown variation is a demanding task. First, it requires measuring a broad set of characters, so that relevant morphological differences are not missed simply because a particular feature was not measured. Even the conventional set is already a compromise between desirable accuracy and workload, and may be a dangerous restriction, because a hitherto unknown variant might differ crucially in characters that were not considered. Therefore, although the logical recommendation would be to measure all possible characters, we recommend measuring at least the core set of characters presented in Table 3 (bold type), together with the 20 point coordinates of wing morphometry shown in Fig. 2. Any indices or compound measures can be calculated from these measurements. This common minimum also ensures a sufficiently broad base for referencing the new variation against the known character distributions, and provides a sufficiently accurate numerical description of the morphological variation of the bees.

Procedures for analysis are supplied in common statistical software packages, such as SPSS, Systat, Statistica or others. Analysis should be based on colony means of 10 to 15 workers rather than on single workers. Otherwise, because the workers of a colony are related to each other, some degree of pseudo-replication, which is difficult to account for, would be introduced. Only the measurements of the primary characters, not the derived compound measurements or indices, are used for analysis. In the investigation of unknown variation, a first aim is to examine the character sets of the new samples in reference to samples from already identified and confirmed groups.
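The reduction of worker measurements to colony means can be sketched as follows. This is a minimal illustration in plain Python, not a prescribed procedure: the function name `colony_means`, the character names and the values are hypothetical, and only two workers per colony are shown to keep the example short.

```python
from collections import defaultdict
from statistics import mean


def colony_means(workers):
    """Average each primary character over the workers of each colony.

    `workers` is a list of (colony_id, {character: value}) pairs.
    Returns {colony_id: {character: mean value}}.
    """
    by_colony = defaultdict(list)
    for colony_id, characters in workers:
        by_colony[colony_id].append(characters)

    result = {}
    for colony_id, rows in by_colony.items():
        names = rows[0].keys()
        result[colony_id] = {
            name: mean(row[name] for row in rows) for name in names
        }
    return result


# Hypothetical measurements (mm); a real sample would hold 10 to 15 workers
# per colony and the full core set of primary characters.
workers = [
    ("colony_A", {"forewing_length": 9.0, "forewing_width": 3.0}),
    ("colony_A", {"forewing_length": 9.5, "forewing_width": 3.5}),
    ("colony_B", {"forewing_length": 8.0, "forewing_width": 2.5}),
    ("colony_B", {"forewing_length": 8.5, "forewing_width": 3.0}),
]
means = colony_means(workers)
```

Each colony then contributes a single row of mean values to the subsequent analyses, so that related workers are not treated as independent samples.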

There are two main methods (detailed below) to detect whether the samples under study represent one or more groups. The most common and recommended primary method is principal component analysis (PCA), which reduces the number of dimensions (each character is a dimension) to two or three main factor dimensions, in which the positions of the samples can be plotted and inspected visually. In an easy situation, the new samples separate into one or a few clusters distinct from all reference clusters. Quite frequently, however, they only occupy distinct areas of a point cloud and are not clearly set apart, which indicates a not unusual degree of overlap with other groups. For interpretation, plotting several dimensions against each other, combined with labelling of the samples by locality, can help to clarify the positioning.
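As an illustration of this step, the following sketch computes PCA scores with NumPy from a matrix of colony means. It rests on several assumptions that are not part of the protocol: the characters are standardized before analysis, the data are synthetic (two simulated groups of 20 colonies with four characters each), and the function name `pca_scores` is hypothetical. In practice, the statistical packages named above provide the same analysis.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """Project samples onto the leading principal components.

    X is an (n_samples, n_characters) array of colony means.
    Characters are standardized first (an assumption of this sketch) so that
    characters measured on larger scales do not dominate.
    Returns (scores, explained_variance_ratio).
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardized data: the rows of Vt are the component axes.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T
    explained = S**2 / np.sum(S**2)
    return scores, explained[:n_components]


# Synthetic colony means: a reference group and a set of new samples.
rng = np.random.default_rng(0)
reference = rng.normal([9.0, 3.0, 6.0, 2.0], 0.1, size=(20, 4))
new = rng.normal([9.6, 3.3, 6.5, 2.2], 0.1, size=(20, 4))
X = np.vstack([reference, new])

scores, explained = pca_scores(X)
# scores[:, 0] and scores[:, 1] can be plotted against each other,
# with points labelled by group or locality, for visual inspection.
```

The sign of each component is arbitrary, so only the relative positions of the samples in the plot are meaningful.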

A second useful method to identify grouping patterns is k-means clustering, a procedure that separates the samples under analysis into different, predefined numbers of groups and indicates, by a goodness-of-fit test based on an F-statistic, which number of groups fits best. In addition, k-means group memberships can be matched against the geographical distribution of the samples to investigate whether the grouping is consistent with local coherence or ecological zones. Hierarchical clustering procedures may also be used to supplement the analysis, as long as the number of samples is not too high. Distance matrices can be very helpful for clarifying relations to other groups.
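The comparison of different predefined numbers of groups can be sketched as below. The text does not specify which F-based statistic a given package reports; this sketch substitutes the Calinski-Harabasz pseudo-F (between-group over within-group variance) as one plausible choice. The deterministic farthest-point seeding, the function names and the synthetic two-character data are likewise assumptions made for illustration.

```python
import numpy as np


def farthest_point_init(X, k):
    # Deterministic seeding (an assumption of this sketch): start from the
    # sample farthest from the overall mean, then repeatedly add the sample
    # farthest from all centroids chosen so far.
    centroids = [X[np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1))]]
    while len(centroids) < k:
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(d)])
    return np.array(centroids)


def assign(X, centroids):
    # Index of the nearest centroid for every sample.
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)


def kmeans(X, k, n_iter=100):
    """Plain Lloyd iteration; returns (labels, centroids)."""
    centroids = farthest_point_init(X, k)
    for _ in range(n_iter):
        labels = assign(X, centroids)
        # Recompute centroids; keep the old one if a cluster became empty.
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return assign(X, centroids), centroids


def pseudo_f(X, labels, k):
    """Calinski-Harabasz pseudo-F: (BSS / (k - 1)) / (WSS / (n - k))."""
    n = len(X)
    overall = X.mean(axis=0)
    bss = wss = 0.0
    for j in range(k):
        members = X[labels == j]
        if len(members) == 0:
            continue
        c = members.mean(axis=0)
        bss += len(members) * np.sum((c - overall) ** 2)
        wss += np.sum((members - c) ** 2)
    return (bss / (k - 1)) / (wss / (n - k))


# Synthetic colony means forming three well-separated groups.
rng = np.random.default_rng(1)
centers = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in centers])

fit = {}
for k in range(2, 6):
    labels, _ = kmeans(X, k)
    fit[k] = pseudo_f(X, labels, k)
best_k = max(fit, key=fit.get)
```

The resulting group memberships for `best_k` can then be tabulated against sampling localities. With real data the groups are rarely this well separated, and the pseudo-F values for neighbouring numbers of groups may be close.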

If the above methods have led to a sensible group definition, this should finally be verified by discriminant analysis, which determines the significance of the group differences and the accuracy with which samples are re-allocated to their correct groups.
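The two quantities named here can be illustrated with a minimal NumPy sketch. It assumes equal prior probabilities and a pooled within-group covariance matrix, under which allocating each sample to the group with the smallest Mahalanobis distance is equivalent to linear discriminant classification. Wilks' lambda is computed as the basis of the significance test, but its conversion to an approximate F and a p-value is left to a statistical package. The function name `discriminant_summary` and the data are hypothetical.

```python
import numpy as np


def discriminant_summary(X, groups):
    """Return (Wilks' lambda, re-allocation accuracy, predicted groups)."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    centroids = np.array([X[groups == g].mean(axis=0) for g in labels])

    # Within-group (W) and total (T) sums of squares and cross-products.
    W = sum(
        (X[groups == g] - c).T @ (X[groups == g] - c)
        for g, c in zip(labels, centroids)
    )
    D = X - X.mean(axis=0)
    T = D.T @ D
    wilks = np.linalg.det(W) / np.linalg.det(T)

    # Pooled within-group covariance, used for Mahalanobis distances.
    pooled = W / (len(X) - len(labels))
    inv = np.linalg.inv(pooled)
    diff = X[:, None, :] - centroids[None, :, :]
    d2 = np.einsum("nkp,pq,nkq->nk", diff, inv, diff)

    # Allocate each sample to the group with the nearest centroid.
    predicted = labels[d2.argmin(axis=1)]
    accuracy = float(np.mean(predicted == groups))
    return wilks, accuracy, predicted


# Synthetic colony means for a reference group and a putative new group.
rng = np.random.default_rng(2)
reference = rng.normal([9.0, 3.0], 0.1, size=(20, 2))
new = rng.normal([9.6, 3.3], 0.1, size=(20, 2))
X = np.vstack([reference, new])
groups = np.array(["reference"] * 20 + ["new"] * 20)

wilks, accuracy, predicted = discriminant_summary(X, groups)
```

A Wilks' lambda close to 0 indicates strong separation and a value close to 1 indicates none. The accuracy computed here re-allocates the same samples that were used to define the groups and is therefore optimistic; a cross-validated (for example leave-one-out) re-allocation gives a more conservative estimate.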

A major difficulty in finding and verifying groups is to differentiate gradual, clinal change in character distributions from truly distinct groupings. The main caveat is that gaps in the sampling pattern can easily give the impression that two or more distinct groups exist, an impression that may even be confirmed statistically, although the distinctness would disappear had the whole range been sampled evenly (Radloff and Hepburn, 1998). In contrast, true groups are characterized by sudden morphological changes in relation to their geographic origin, which can be verified by geographic plots and relations to physiogeography. However, even where change is clinal over extended regions, it may become necessary to give different names to the different ends of the cline, although they only represent parts of a continuous distribution. There is no obvious general solution to this problem.