These days metagenome studies seldom stop at the question of who is present in the environment. They venture out to the next big question: what do the microbes present actually do, and how do they do it? For this, merely sequencing marker genes like 16S/18S/ITS is not going to be sufficient. That is where Whole Metagenome Shotgun sequencing comes in: instead of amplifying marker genes, the entire DNA content of the sample is sequenced.
This method usually produces much more data than amplicon-based sequencing, but the initial processing of the data is a bit simpler. That does not mean it will be faster than the amplicon-based technique, just that it has fewer steps, because for such data brute-force searching is the best approach. In practice this means that after the initial QC you directly try to match each sequence against universal databases to identify its closest matches.
QC of raw data/merging of paired end reads:
This is the first and most basic step for any NGS data: we refine the raw output to keep only the highest quality reads. It is essentially the same as the QC step in the amplicon sequencing workflow.
When working with reads sequenced using paired-end (PE) chemistry, it is essential to know that merging the paired ends is the first step. If your question is why, the answer is not quite simple. It is true that in high-throughput sequencing the bases in the middle of a read are of higher quality than those at the ends, but for PE reads, trimming the low-quality ends of the individual reads would reduce the overlap between mates and therefore the chances of merging them, leading to a drastically lower count of reads with acceptable lengths. There are multiple tools that can do this merging, but I like to use PANDAseq, because it uses multiple methods to determine the optimal overlap and gives very good quality output.
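If you script your pipeline in Python, the merge step can be driven from a small wrapper like the sketch below. The file names are placeholders and only the basic PANDAseq options are shown; check `pandaseq -h` for the full list before relying on them.

```python
# Minimal sketch: merging paired-end reads with PANDAseq via Python's subprocess.
# File names are placeholders; verify the flags against your PANDAseq version.
import subprocess

forward = "sample_R1.fastq"    # forward reads (placeholder name)
reverse = "sample_R2.fastq"    # reverse reads (placeholder name)
merged = "sample_merged.fastq"

# -f/-r take the paired FASTQ files, -F asks for FASTQ output, -w names the output file
subprocess.run(
    ["pandaseq", "-f", forward, "-r", reverse, "-F", "-w", merged],
    check=True,
)
```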
After merging the PEs (or directly, if PE chemistry was not used), the next step is to trim the reads and remove their low-quality ends. This trimming is based on PHRED scores. The score signifies the probability of a base call being incorrect: a score of 10 means there is a 1 in 10 chance that the base call is incorrect, while a PHRED score of 60 means there is a 1 in 10^6 (one in a million) chance that the base call is incorrect. So the higher the score, the better the chance that the base is correct. These scores are listed in the quality line of the FASTQ file in ASCII format; the currently used standard in quality scoring is called PHRED+33, meaning the lowest PHRED score (0) is assigned the ASCII symbol corresponding to the number 33. The chart and explanation here (https://www.drive5.com/usearch/manual/quality_score.html) will give you a better understanding of this system. There are tools like FastQC which help you visualize the quality of reads in your FASTQ files, and tools like Trim Galore which can trim your reads down to only the high-quality portions and also take care of primers and adapters.
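To make the PHRED+33 arithmetic concrete, here is a tiny Python illustration of how a quality character converts to a score and an error probability.

```python
# PHRED+33 encoding: each quality character in a FASTQ file maps to a score
# (its ASCII code minus 33), and the score maps to an error probability.
def phred33_to_error_prob(qual_char: str) -> float:
    score = ord(qual_char) - 33      # '!' (ASCII 33) encodes score 0
    return 10 ** (-score / 10)       # P(error) = 10^(-Q/10)

for ch in "!+5?I":                    # these characters encode scores 0, 10, 20, 30, 40
    print(ch, ord(ch) - 33, phred33_to_error_prob(ch))
```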
Note 1: De-multiplexing of the reads needs to be done prior to this step as per platform protocols.
Note 2: For short reads, like those from Illumina or 454 platforms, a quality score of 28 and above is good, while a score of 24-28 is still acceptable. Per-base quality also depends on the sequencing technology and read length, so platforms that produce long reads, like PacBio and Nanopore, will typically report much lower scores. For these long reads the quality assessment changes, and we usually rely on the platform specifications for more clarity.
Read Assignment/Identification:
Now, unlike the amplicon-based technique, we do not need to cluster the reads based on similarity. Or rather, we cannot, because here the reads will not be similar to each other; they may be as unrelated to each other as possible. So we proceed directly to assigning, or annotating, these reads.
The database that is most preferred and used is the nr database from any INSDC member. This database is simply a non-redundant set of all the protein sequences present in the protein databases of the INSDC members (NCBI/EMBL/DDBJ). But it presents a problem of its own: its size. The nr database can be over 50 GB to download and takes up much more space after decompressing, and using this file for alignment requires very pricey infrastructure. So what do we do? Fortunately there are some tools specialized for this very purpose; let's list a few of them:
- BLAST
- Kraken
- DIAMOND
- Kaiju
- WIMP (Nanopore)
While the tools above all match the reads against large reference databases, WIMP, Kraken and Kaiju focus on assigning taxonomy only, whereas BLAST and DIAMOND can potentially provide both taxonomy and function. BLAST and DIAMOND are basically our traditional alignment solutions: they align reads to database sequences and create output files like the ones we see on the BLAST web server, with DIAMOND being much faster and able to process very large datasets more efficiently. Kraken and Kaiju pre-process their reference databases into indexed formats that their tools can use (and that are not human-readable), and their output is only a taxonomic assignment for each read in the form of NCBI taxonomy IDs. The usage of the tools listed above is quite simple and is laid out very nicely in their tutorials.
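As one example of how such a search can be scripted, here is a hedged sketch of a DIAMOND run driven from Python. The file names are placeholders and only the basic flags are shown; consult the DIAMOND documentation for your version before relying on them.

```python
# Minimal sketch of a DIAMOND translated search against nr, driven from Python.
# Paths are placeholders; double-check flags against the DIAMOND docs you have installed.
import subprocess

# Build the DIAMOND database once from the nr protein FASTA (placeholder file name)
subprocess.run(["diamond", "makedb", "--in", "nr.faa", "-d", "nr"], check=True)

# Translated search of the QC'd reads against nr, tabular (BLAST outfmt 6) output
subprocess.run(
    ["diamond", "blastx",
     "-d", "nr",                  # database built above
     "-q", "merged_reads.fastq",  # QC'd/merged reads (placeholder name)
     "-o", "read_hits.tsv",       # tab-separated hits, one line per alignment
     "--outfmt", "6"],
    check=True,
)
```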
After running those tools you will need to visualize the data, and that needs further processing. Taxonomy assigners give their output as just the read ID followed by the taxonomy assigned to that read, and I have a script you can use (Script to create BIOM formatted files using almost any taxonomy assigner.) to create taxonomy profiles from these results. If you use BLAST or DIAMOND you can use a visualization suite like MEGAN to visualize the taxonomy and functions of the reads in your datasets. It is also a good idea to align your data against KEGG, again using BLAST or DIAMOND, for more accurate functional studies.
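If you just want a quick taxonomy profile without the full BIOM route, something like the sketch below works. It assumes a tab-separated file with the assigned taxon in a known column; Kraken and Kaiju add extra columns, so adjust the index to match the tool you actually ran.

```python
# Hedged sketch: turn per-read taxonomy assignments into a count profile.
from collections import Counter
import csv

def taxonomy_profile(path: str, taxon_column: int = 1) -> Counter:
    counts = Counter()
    with open(path) as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row:
                continue
            counts[row[taxon_column]] += 1   # one read contributes one count to its taxon
    return counts

profile = taxonomy_profile("assignments.tsv")   # placeholder file name
for taxon, n_reads in profile.most_common(10):
    print(taxon, n_reads)
```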
Then there are also servers like MG-RAST that can do an end-to-end annotation, starting from raw reads and going all the way to complete visualization. It is a great server for carrying out analysis, but the pipeline is fixed and there is very limited customization available, so it is quite possible that an atypical dataset will be poorly annotated by the pipeline. That said, they have a very active support and maintenance team that can help you address your issues quickly.
What is important to know is that there is no one correct way to carry out analysis of such diverse data; every method is correct as long as you know what your research objective is. As an example, some studies where the read lengths are very short assemble the reads into contigs before identification. This makes sense because very short reads may otherwise give inconclusive assignments, but assembly also increases the chance of creating chimeras, which lead to bad results. Then there are studies with limited infrastructure that carry out gene calling and use the called genes for further analysis, but this can miss atypical genes and leave non-translated segments un-annotated. So it is important to weigh your options against the available infrastructure and your research goals, and to modify, add, or remove steps in any published methodology accordingly (a rough sketch of the assemble-then-call-genes route is shown below).
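As an illustration only, here is one way the assemble-then-call-genes route could look, using MEGAHIT and Prodigal as stand-in tools (neither is prescribed above; swap in whatever assembler and gene caller suits your data). Paths and flags are placeholders to be checked against each tool's documentation.

```python
# One possible "assemble first, then call genes" route; MEGAHIT and Prodigal are
# example tools chosen for the sketch, and file names are placeholders.
import subprocess

# Assemble the QC'd paired-end reads into contigs
subprocess.run(
    ["megahit", "-1", "sample_R1.fastq", "-2", "sample_R2.fastq", "-o", "asm_out"],
    check=True,
)

# Call genes on the contigs in metagenome mode; the predicted proteins can then be
# searched against nr/KEGG with BLAST or DIAMOND instead of the raw reads
subprocess.run(
    ["prodigal", "-i", "asm_out/final.contigs.fa",
     "-a", "predicted_proteins.faa", "-p", "meta"],
    check=True,
)
```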
Binning:
While, as mentioned above, we do not cluster whole metagenome shotgun reads based on similarity, it is possible to group reads based on inherent signatures such as k-mer composition profiles or HMM profiles. Grouping on such signature patterns can segregate the reads into taxonomic buckets (called bins). Reads in a single bin can thus come from a single organism and can even be used to reconstruct entire genomes; there are studies that have successfully reconstructed complete genomes of novel microbes that were abundant in a biome. Some algorithms commonly used for binning can be found on the Wikipedia page: Binning_(metagenomics)
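To give a feel for the composition signal that binners exploit, here is a minimal Python sketch of a tetranucleotide (4-mer) frequency profile. Real binning tools combine such profiles with coverage information and proper clustering; this is only an illustration.

```python
# Tetranucleotide (4-mer) frequency profile of a sequence: sequences with similar
# profiles tend to come from related genomes, which is the signal binners use.
from collections import Counter
from itertools import product

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]   # all 256 tetranucleotides

def tetranucleotide_profile(seq: str) -> list:
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS) or 1
    return [counts[k] / total for k in KMERS]              # normalised frequency vector

# Example: print the first few entries of the profile for a toy sequence
print(tetranucleotide_profile("ATGCGTACGTTAGCATGCATGCGGCTAAT")[:8])
```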