2.2.1. Quality filtering
Quality control of NGS data will coevolve with sequencing technology. Currently, the following steps are recommended:
- Check for sequence errors. Quality filtering steps are conducted in any of the above pipelines and begin with the analysis of raw sequences to remove those with:
a. low quality scores (sequences with average quality scores less than 25),
b. short length (sequences less than 50% of the expected sequence length, e.g. for 454 Titanium data we recommend removing reads <200 bases),
c. ambiguous base calls (for 454 data, reads with ambiguous base calls are correlated with other errors and should be discarded (Huse et al., 2007); for other platforms, sequences can be trimmed at the first ambiguous base call, then discarded if they no longer meet the specified length criteria),
d. mismatches to the primer sites or barcodes, i.e. reads where there is a sequencing error in the primer sequence or that do not match any of the barcodes used to label which sample each read came from,
e. for 454 data, sequences that have runs of 6 or more of the same nucleotide (homopolymers), which are a characteristic error of 454 sequencing (Schloss et al., 2011).
- Check alignment. As an additional quality control step and to prepare data for downstream analyses, align the sequences against a 16S rRNA database such as SILVA (Quast et al., 2013). Sequences that fall outside the alignment window are discarded. We recommend NAST-based aligners (Schloss, 2009; Caporaso et al., 2010a) as they create high quality alignments when used with the SILVA database.
- Check for chimeric sequences. As long as microbial community analyses continue to rely on PCR methods, chimeric sequences will remain a problem that must be monitored. Chimeras occur when sequences from two separate templates are combined during the reaction, and they artificially inflate diversity estimates. Several chimera checkers have been developed (e.g. Edgar et al., 2011; Haas et al., 2011), and although some chimeras will remain undetectable, the rate can be greatly reduced with these tools. As honey bee gut symbionts are still not well represented in the curated reference databases, we currently recommend using UCHIME in de novo mode, which does not require a reference database (Edgar et al., 2011).
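The sequence error criteria (a) to (e) above can be illustrated with a minimal Python sketch. It is not the implementation used by any of the pipelines mentioned. The function name `rejection_reason`, its parameters and the returned labels are hypothetical. The sketch further assumes that each read begins with its barcode followed directly by a non-degenerate primer, that any character other than A, C, G or T counts as an ambiguous base call, and that read length is measured on the whole read as supplied.

```python
import re
from statistics import mean
from typing import Optional, Sequence

MIN_MEAN_QUALITY = 25       # criterion (a): minimum average quality score
MIN_LENGTH_FRACTION = 0.5   # criterion (b): fraction of the expected length
AMBIGUOUS = re.compile(r"[^ACGT]")        # criterion (c): assumed definition
HOMOPOLYMER = re.compile(r"(.)\1{5,}")    # criterion (e): 6+ identical bases


def rejection_reason(
    seq: str,
    quals: Sequence[int],
    barcodes: Sequence[str],
    primer: str,
    expected_length: int,
    is_454: bool = True,
) -> Optional[str]:
    """Return a label for the first failed criterion, or None if the read passes."""
    seq = seq.upper()

    # (d) the read must start with a known barcode followed by the exact primer
    if not any(seq.startswith(bc + primer) for bc in barcodes):
        return "barcode/primer mismatch"

    # (a) average quality score over the whole read
    if not quals or mean(quals) < MIN_MEAN_QUALITY:
        return "low quality"

    # (c) 454 reads with an ambiguous base are discarded outright;
    #     reads from other platforms are trimmed at the first ambiguous base
    match = AMBIGUOUS.search(seq)
    if match:
        if is_454:
            return "ambiguous base"
        seq = seq[: match.start()]

    # (b) length check, applied after any trimming in step (c)
    if len(seq) < MIN_LENGTH_FRACTION * expected_length:
        return "too short"

    # (e) long homopolymer runs are screened for 454 data only
    if is_454 and HOMOPOLYMER.search(seq):
        return "homopolymer"

    return None
```

For example, with `barcodes=["ACGT"]`, `primer="GGTT"` and `expected_length=100`, a 58-base read with uniform quality 30 and no ambiguous bases passes, whereas the same read with an `N` at position 20 is discarded as 454 data and, on other platforms, is trimmed to 20 bases and then rejected for length.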