8.8.4. Assembly of shotgun sequences vs. read mapping
For sequences generated by shotgun sequencing, it is generally desirable to assemble all sequencing reads into contigs (aggregates of nearly identical sequences from the same region and species) prior to statistical analysis, since this can greatly reduce computational needs while retaining vital statistics, including the number of reads per contig. Once the computationally intensive assembly of contigs has taken place (using, for example, the MetaVelvet routine, http://metavelvet.dna.bio.keio.ac.jp/), datasets can be reduced by many orders of magnitude. This is critical if online or ‘cloud’ databases are searched for microbial matches, since data transfer alone for such searches can take days when raw sequence reads are used. In addition, contigs are by definition longer than any individual read and can therefore provide a more secure match to distant taxa. The count of sequenced reads per contig provides a measure of depth that, once scaled to contig length, allows estimates of microbial frequency. Once metagenomic sequences have been assembled, experiments of moderate size can often be analysed at no cost to the user at public resources such as Galaxy (https://main.g2.bx.psu.edu/). As with any complicated statistical procedure, it is quite possible to obtain erroneous matches and statistical results, and researchers are advised to enlist the help of colleagues with current
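The scaling of read counts by contig length mentioned above can be illustrated with a minimal sketch. It assumes that the number of reads per contig and the contig lengths are already available from the assembler or mapper; the function name, the contig names and the counts are hypothetical and serve only as an illustration, not as a prescribed method.

```python
def relative_abundance(contigs):
    """Estimate relative frequency from read counts scaled by contig length.

    contigs: dict mapping contig name -> (read_count, length_in_bp).
    Returns a dict mapping contig name -> fraction of total scaled depth.
    """
    depth = {}
    for name, (reads, length) in contigs.items():
        if length <= 0:
            raise ValueError(f"contig {name!r} has non-positive length")
        # Reads per base pair: longer contigs attract more reads at equal
        # abundance, so raw counts must be divided by length.
        depth[name] = reads / length

    total = sum(depth.values())
    if total == 0:
        return {name: 0.0 for name in depth}
    # Normalise so that the fractions sum to 1 across all contigs.
    return {name: d / total for name, d in depth.items()}


# Hypothetical counts: contig_A has the most reads but is also the longest.
example = {
    "contig_A": (500, 5000),
    "contig_B": (300, 1000),
    "contig_C": (100, 1000),
}
abundance = relative_abundance(example)
for name, fraction in sorted(abundance.items()):
    print(f"{name}\t{fraction:.3f}")
```

In this made-up example contig_A carries the largest raw read count, yet after scaling by length contig_B has the highest estimated frequency, which is why raw counts alone should not be compared across contigs of different length.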
In practice, metagenomic analyses are also carried out by mapping (aligning with high probability) individual sequence reads to members of a reference database, and extremely efficient algorithms (including TopHat, http://genomics.jhu.edu/software.html) have been developed for this purpose. For diagnostic regions with highly conserved sequences (e.g., parts of the rRNA operons), both assembly and mapping are problematic, and query sequences often cannot be placed securely even at the family level. In such cases, it is best to bin sequences at a higher taxonomic level (even order) rather than force matches into a possibly erroneous taxon. Nevertheless, because the number of sequenced microbial genomes is increasing exponentially, even rare and distant taxa tend to have a fully sequenced family member in the public databases, as described below in section 8.8.5.
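One common way to bin an ambiguous read at a higher taxonomic level is to assign it to the lowest rank on which all of its equally good database hits agree. The sketch below illustrates that idea only; it is not the procedure of any particular tool. The rank list, the function name and the example lineages are assumptions made for illustration, and real lineages would be taken from a taxonomy database.

```python
# Ranks checked from broadest to narrowest (an assumed, simplified hierarchy).
RANKS = ("order", "family", "genus", "species")


def lowest_common_rank(lineages):
    """Return (rank, name) for the narrowest rank shared by all hits.

    lineages: list of dicts mapping rank -> taxon name, one per equally
    good hit for a single query sequence. Returns None if the hits do not
    even share the broadest rank, or if there are no hits.
    """
    if not lineages:
        return None
    assigned = None
    for rank in RANKS:
        names = {lineage.get(rank) for lineage in lineages}
        # Stop at the first rank where hits disagree or a name is missing.
        if len(names) != 1 or None in names:
            break
        assigned = (rank, names.pop())
    return assigned


# Illustrative lineages for three hits to a conserved region.
ecoli = {"order": "Enterobacterales", "family": "Enterobacteriaceae",
         "genus": "Escherichia", "species": "Escherichia coli"}
salmonella = {"order": "Enterobacterales", "family": "Enterobacteriaceae",
              "genus": "Salmonella", "species": "Salmonella enterica"}
yersinia = {"order": "Enterobacterales", "family": "Yersiniaceae",
            "genus": "Yersinia", "species": "Yersinia pestis"}

print(lowest_common_rank([ecoli, salmonella]))
print(lowest_common_rank([ecoli, salmonella, yersinia]))
```

With the first two hits the read is binned at family level; adding the third hit, which belongs to a different family, pushes the assignment up to order rather than forcing a species-level call.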