2.2.2. Identifying operational taxonomic units (OTUs)
The next step is to cluster sequences into OTUs. No single sequence identity threshold reliably corresponds to bacterial species as defined by phenotype (Schloss and Westcott, 2011). In the absence of such a threshold, the standard practice is to cluster sequences into OTUs at ≥ 97% sequence identity. Several programs are available for this purpose, including PyroTagger (Kunin and Hugenholtz, 2010), CD-HIT (Huang et al., 2010), UCLUST (Edgar, 2010), and mothur (Schloss and Westcott, 2011).
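To make the ≥ 97% criterion concrete, the following is a minimal sketch of greedy, centroid-based clustering. It is not the algorithm used by any of the programs cited above. It assumes that the sequences have already been aligned to equal length, and the function names identity and greedy_otus are hypothetical, introduced only for illustration.

```python
def identity(a, b):
    """Fraction of matching positions between two aligned sequences.

    Assumption: both sequences are non-empty and already aligned
    to the same length.
    """
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_otus(seqs, threshold=0.97):
    """Greedy clustering sketch.

    Each sequence joins the first OTU whose centroid it matches at
    >= threshold identity; otherwise it becomes the centroid of a new OTU.
    Returns a list of OTUs, each a list of member sequences.
    """
    centroids = []
    otus = []
    for seq in seqs:
        for i, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                otus[i].append(seq)
                break
        else:
            # No existing centroid is similar enough: start a new OTU.
            centroids.append(seq)
            otus.append([seq])
    return otus
```

In a greedy scheme of this kind the result depends on the order of the input sequences, which is one reason why different programs can return different OTU counts for the same data.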
Undetected chimeras can persist as spurious OTUs. Because the number of chimeras and sequencing errors increases with sequencing depth in NGS, more deeply sequenced samples tend to contain more spurious OTUs, and it is therefore important to randomly subsample each community to a standardized number of sequences (Schloss et al., 2011). We recommend subsampling a minimum of 1000 sequences per sample, or the highest number that will avoid discarding too many samples. Additionally, many studies exclude rare and/or singleton OTUs from further analysis. One approach is to analyse the data twice, once with singleton OTUs included and once with them excluded, and to report the results of both analyses in the supplementary files of the publication (e.g. Martinson et al., 2012).
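The subsampling and singleton-filtering steps can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of any particular program: each sample is represented as a mapping from OTU identifier to read count, a singleton is taken to mean an OTU with exactly one read across all samples, and the function names subsample and drop_singletons are hypothetical.

```python
import random
from collections import Counter


def subsample(otu_counts, depth, seed=None):
    """Draw `depth` reads at random, without replacement, from one sample.

    otu_counts maps OTU id -> read count. Returns a new mapping, or None
    if the sample has fewer than `depth` reads (i.e. it would be discarded).
    """
    reads = [otu for otu, n in otu_counts.items() for _ in range(n)]
    if len(reads) < depth:
        return None
    rng = random.Random(seed)
    return dict(Counter(rng.sample(reads, depth)))


def drop_singletons(table):
    """Remove OTUs whose total count across all samples is 1.

    table maps sample id -> {OTU id -> read count}.
    Assumption: "singleton" means one read in the whole dataset.
    """
    totals = Counter()
    for counts in table.values():
        totals.update(counts)
    return {
        sample: {otu: n for otu, n in counts.items() if totals[otu] > 1}
        for sample, counts in table.items()
    }
```

Running the downstream analysis once on the subsampled table and once on the output of drop_singletons gives the two result sets described above.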