10.1 Assessing data quality
Prior to the analysis, some assessment and possible improvement of the quality of the data is essential. This is of utmost importance when these data are to be used in statistical models. Errors of different kinds can easily result in false inferences of general patterns, meaning that effort expended in complex modelling may be largely wasted if the data are unreliable.
As the numerical data are not directly measured but derived from surveys, the means of data collection must be designed in such a way that respondents have limited opportunity to give extreme or erroneous responses. Thorough data validation must precede any modelling procedure, i.e. checking for out-of-range data (invalid responses) and inconsistent responses. The proportion of missing values is also an indicator of data quality. See De Leeuw et al. (2008), chapters 17-22, for an overview of quality control and data validation for survey data.
If the results of data checking suggest that the data are unreliable, then it may be sensible to limit analysis to simple procedures, or else interpret the results of model fitting with some caution. This is also true for small data sets where complex model fitting may not be feasible.
If the selected sample size is known, as for example in a randomized sample, the overall non-response rate can be calculated as a first indicator of quality: a survey with a high non-response rate (a low achieved sample size) may be unrepresentative of the population of interest. Assessing non-response involves comparing the achieved sample size with the planned sample size (see § 9 on choice of sample size).
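As a minimal sketch of this first check, the non-response rate is simply the proportion of selected units that did not respond. The sample sizes below are hypothetical figures, not taken from any actual survey:

```python
# Hypothetical figures: a planned (selected) sample of 2000 respondents,
# of whom 1240 returned a usable questionnaire.
planned_sample_size = 2000
achieved_sample_size = 1240

# Overall non-response rate: the fraction of selected units that did not respond.
non_response_rate = 1 - achieved_sample_size / planned_sample_size
print(f"Non-response rate: {non_response_rate:.1%}")  # → 38.0%
```

A high value here would already suggest caution in generalizing any model results to the population of interest.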
Examining responses to individual questions is also necessary. For each question, several simple quality measures may be calculated:
(1) The missing data rate can be checked (partial non-response).
A high proportion of missing data may indicate inappropriate or sensitive questions, which it will be important to reconsider when designing future surveys. Missing data may be left as missing for the purposes of the analysis, or a data imputation method may be used to replace the missing data with plausible values (De Leeuw et al. (2008), chapter 19).
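The per-question missing data rate can be computed directly from the raw responses. The variable names and values below are illustrative assumptions, with `None` standing for a missing answer:

```python
# Hypothetical survey responses, one list per question; None marks a missing answer.
responses = {
    "colonies_owned": [12, 8, None, 25, 40, None, 3, 17],
    "colonies_lost":  [2, None, None, 5, None, 0, 1, None],
}

# Missing data rate per question (partial non-response).
missing_rate = {
    question: sum(value is None for value in values) / len(values)
    for question, values in responses.items()
}
print(missing_rate)  # {'colonies_owned': 0.25, 'colonies_lost': 0.5}
```

A question such as `colonies_lost` here, with half its answers missing, would be a candidate for redesign in a future survey.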
(2) The proportion of invalid values can be checked. The size of any deviation from the valid range is also of interest.
The response may be a value outside the valid range of responses for that variable, such as a percentage above 100 or a negative number of colonies lost, or it may be a suspiciously extreme value. This problem occurs when the question was not correctly answered by the respondent, or when the data were not correctly captured at the point of data collection or data entry. A question with many invalid answers should probably be reformulated. If there is no way of checking what the correct answer is, the response should be treated as missing data and omitted from the analysis.
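A range check of this kind can be sketched as follows; the reported loss percentages are invented for illustration, and the valid range of 0-100 is assumed from the nature of a percentage variable:

```python
# Hypothetical reported loss percentages; the valid range is 0 to 100.
loss_percent = [5.0, 12.5, 130.0, 0.0, -4.0, 88.0, 250.0, 20.0]

def is_valid(value, low=0.0, high=100.0):
    """A response is valid if it lies within the permitted range."""
    return low <= value <= high

invalid = [v for v in loss_percent if not is_valid(v)]
invalid_proportion = len(invalid) / len(loss_percent)

# The size of the deviation from the valid range is also of interest:
# small overshoots may be data-entry slips, large ones suggest misunderstanding.
deviations = [(v - 100) if v > 100 else -v for v in invalid]
print(invalid_proportion, deviations)  # 0.375 [30.0, 4.0, 150.0]
```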
(3) The proportion of inconsistent values must be checked.
It may be clear from examination of the data that the responses to some questions are inconsistent with the responses to some other questions. For example, the calculation of the number of colonies lost in periods when bee management is practised may give a different answer from the number of colonies stated as having been lost. A variable recording the difference in these two quantities may be used as a filter to remove cases with inconsistent data from analysis.
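The difference-variable filter described above can be sketched as below. The field names and records are hypothetical, standing in for the two ways of arriving at the number of colonies lost:

```python
# Hypothetical records: colonies lost summed over management periods versus
# the total number the respondent stated as lost.
records = [
    {"id": 1, "lost_by_period_sum": 3, "lost_total_stated": 3},
    {"id": 2, "lost_by_period_sum": 5, "lost_total_stated": 2},
    {"id": 3, "lost_by_period_sum": 0, "lost_total_stated": 0},
    {"id": 4, "lost_by_period_sum": 7, "lost_total_stated": 9},
]

# A difference variable acts as a consistency filter: retain only cases
# where the two quantities agree.
for record in records:
    record["difference"] = record["lost_by_period_sum"] - record["lost_total_stated"]

consistent = [r for r in records if r["difference"] == 0]
print([r["id"] for r in consistent])  # → [1, 3]
```

The proportion of cases removed by such a filter is itself a useful quality indicator for the survey.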
These data quality descriptors can be obtained from descriptive analysis, for example using summary statistics including the range of a variable, tabulations, cross-tabulations and histograms.