When we look at an environment from a microbiological perspective, the first question that comes to mind is: which microbes are present there? The fastest and most comprehensive way to answer this question is an amplification-based (amplicon) metagenomics approach. It is a fairly well-known fact that certain genes, like the ribosomal RNA genes, are present in every bacterium and archaeon, are conserved enough to be amplified with a single primer pair, yet diverse enough to distinguish different species. Genes like the 16S rRNA gene are often called microbial molecular chronometers, meaning the amount of divergence among these genes is sufficient to study the phylogenetic relatedness of species. The 16S rRNA gene has 9 hypervariable regions, and the sequences outside these regions are relatively more conserved; it is these 9 regions that act as the molecular chronometer. It is usually good practice to sequence the whole 16S gene, but sequencing only the V3-V4 region has been found to give a fairly accurate taxonomic composition in case of budget constraints.
In simple words, sequences like 16S can be used to estimate the taxonomic diversity in a given sample. There are multiple databases in the public domain that have been compiled to identify which 16S sequence belongs to which organism. It is generally believed that these genes give good resolution of OTUs (Operational Taxonomic Units) down to the family level; species-level identification is possible but very unreliable. The most popular databases for 16S genes are:
| Database | Link | Remarks |
| --- | --- | --- |
| Greengenes | https://greengenes.secondgenome.com/ | Last updated in 2013 |
| SILVA | https://www.arb-silva.de/ | Still continuously updated |
| NCBI | ftp://ftp.ncbi.nlm.nih.gov/blast/db/16SMicrobial.tar.gz | Updated daily with sequences submitted to NCBI; may be redundant |
I usually prefer SILVA, as it is specifically curated for this task and has a dedicated team for updating and refining the sequences.
So the basic outline of methodology for this type of study is:
Step 1: DNA extraction
Step 2: DNA Amplification
Step 3: Library Preparation for sequencing
Step 4: Sequencing
Step 5: Data analysis
The first four steps will vary depending on multiple factors like the sequencing platform, the region of the gene to be amplified, the instruments available, etc. I will do my best to find some guidelines on these steps for my readers and add them to the bottom of this post, but for now let us look at the various steps involved in data analysis.
Now before you start processing the data, the main things you should know about your data are:
Which Gene? (16S/18S/ITS)
Which Region of the Gene? (complete/V3-V4/V1-V3)
Platform and Chemistry Used? (Illumina/454/Nanopore, paired-end/mate-pair/none)
Which database do you want to use? (SILVA/Greengenes/NCBI)
After you gather this information, the next step is to decide whether you would like to use BLAST+MEGAN, QIIME, or QIIME2. BLAST is a widely known and familiar method for aligning reads to database sequences, and MEGAN can take care of the rest of the process, so this route is fairly easy and straightforward, but it offers less flexibility in controlling the intermediate steps. On the other hand, both QIIME and QIIME2 give much more control over every step of the processing, and starting with QIIME2 there is also a GUI available to make it more user friendly. I personally find QIIME2 better, but I have extensive experience with QIIME1, and a script of all the steps in QIIME1 processing can be found at (https://www.bioinformatics-india.dev/scripts-and-codes/).
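As a taste of the BLAST+MEGAN route, here is a minimal sketch of the alignment step, assuming the NCBI 16S database from the table above has been downloaded and formatted with `makeblastdb`, and that the file names are hypothetical. The resulting tabular file can then be imported into MEGAN for taxonomic binning.

```bash
# Align merged reads against the NCBI 16S database (file names are hypothetical).
# -outfmt 6 gives tabular output; MEGAN can also import the default pairwise
# text output if you prefer.
blastn -query merged_seqs.fna \
       -db 16SMicrobial \
       -outfmt 6 \
       -max_target_seqs 10 \
       -evalue 1e-10 \
       -out reads_vs_16S.tab
```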
Let us look at the basic steps in data analysis using QIIME.
QC of raw data / merging of paired-end reads:
This is the first and most basic step for any NGS data; here we refine the raw output to keep only the highest-quality reads.
When working with reads sequenced using paired-end (PE) chemistry, it is essential to know that merging the paired ends is the first step. If your question is why, the answer is not quite simple: while it is true that in high-throughput sequencing the bases in the middle of a read are of higher quality than those at the ends, trimming the low-quality ends of individual PE reads would compromise the chance of the trimmed reads overlapping and merging, leading to a drastically lower count of reads of acceptable length. There are multiple tools that can do this merging, but I like to use PANDAseq, because it uses multiple methods to determine the optimal overlap and gives very good quality output.
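A minimal sketch of a PANDAseq merge is shown below; the file names are hypothetical, and it is worth checking `pandaseq -h` on your installation before relying on specific options.

```bash
# Merge forward and reverse reads into a single file of assembled amplicons.
# -F writes FASTQ (keeping quality scores) so the reads can still be
# quality-trimmed in the next step; file names are hypothetical.
pandaseq -f sample_R1.fastq \
         -r sample_R2.fastq \
         -F \
         -w sample_merged.fastq \
         -g sample_pandaseq.log
```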
The next step after merging the PEs (or the first step if PE chemistry was not used) is to trim the reads and remove their low-quality ends. This trimming is done on the basis of PHRED scores. A PHRED score signifies the probability that a base call is incorrect: a score of 10 means there is a 1 in 10 chance the base call is wrong, while a PHRED score of 60 means there is a 1 in 10⁶ (one in a million) chance that the base call is incorrect. So the higher the score, the better the probability that the base is correct. These scores are listed in the quality line of the FASTQ file in ASCII format; the currently used standard in quality scoring is called PHRED+33, meaning the lowest PHRED score (0) is assigned the ASCII character with code 33. The chart and explanation here (https://www.drive5.com/usearch/manual/quality_score.html) will give you a better understanding of this system. There are tools like FastQC which help you visualize the quality of the reads in your FASTQ files, and tools like Trim Galore which can trim your reads down to only the high-quality portions and also take care of primers and adapters.
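A minimal sketch of the QC and trimming step, assuming the merged FASTQ from the previous step and hypothetical file names:

```bash
# Visualize per-base quality, adapter content, etc. (reports go to fastqc_out/)
mkdir -p fastqc_out
fastqc sample_merged.fastq -o fastqc_out/

# Quality- and adapter-trim the reads, discarding bases below Q20,
# and re-run FastQC on the trimmed output
trim_galore --quality 20 --fastqc -o trimmed/ sample_merged.fastq
```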
Note 1: De-multiplexing of the reads needs to be done prior to this step, as per platform protocols.
Note 2: For short reads like those from Illumina or 454 platforms, a quality score of 28 and above is good, while a score of 24-28 is still acceptable. Per-base quality also tends to drop as reads get longer, so technologies that produce long reads, like PacBio and Nanopore, will have much lower per-base scores. For these long reads the quality assessment changes, and we usually rely on the platform specifications for more clarity.
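For the demultiplexing mentioned in Note 1, a minimal QIIME1 sketch is shown below, assuming a separate barcode read file and a standard QIIME mapping file (all file names are hypothetical). Many Illumina facilities deliver data that is already demultiplexed, in which case this step can be skipped.

```bash
# Demultiplex and quality-filter the raw reads using the sample barcodes
# defined in the mapping file (file names are hypothetical)
split_libraries_fastq.py -i lane1_reads.fastq \
                         -b lane1_barcodes.fastq \
                         -m mapping_file.txt \
                         -o demux_out/
```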
Clustering of reads
Once you have a FASTA file with tons of good-quality reads, the next step is to cluster them based on similarity. This is done because reads with 97% similarity are usually considered to come from the same species, and there may be many reads from a single species; spending computational resources on identifying each of them individually is not a good idea, as it would take both more time and more compute. Besides, identifying the taxonomy of one representative from a cluster is enough to label all members of that cluster. QIIME simplifies this by providing modules that carry out this step using multiple algorithms; QIIME calls this step OTU picking. These algorithms can broadly be divided into three classes: the first is reference-based clustering, where clusters are formed based on similarity with sequences in reference databases like SILVA; the second is de novo clustering using algorithms like UCLUST, VSEARCH, swarm, cd-hit, etc.; and the third is a hybrid of de novo and reference-based clustering called open-reference clustering (more info: http://qiime.org/tutorials/otu_picking.html). This is followed by picking a representative from each cluster, which will be used in the next step.
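A minimal sketch of de novo OTU picking and representative-set picking in QIIME1, assuming the trimmed, merged reads have been converted to a single FASTA file (file names are hypothetical):

```bash
# Cluster reads into OTUs at 97% identity using UCLUST (de novo)
pick_otus.py -i merged_seqs.fna -m uclust -s 0.97 -o uclust_otus/

# Pick one representative sequence per OTU for taxonomy assignment
pick_rep_set.py -i uclust_otus/merged_seqs_otus.txt \
                -f merged_seqs.fna \
                -o rep_set.fna
```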
Note 1: Optionally, chimera checking can be done after this step to remove possible chimeric sequences (artificial reads formed by the fusion of two unrelated parent sequences during PCR).
Assigning Taxonomy to the reads
Now that we have a well-defined set of non-redundant representative reads, we can use them to create a taxonomic profile of the sample. This step is basically the alignment of the representative reads to the database sequences and the labeling of the reads with taxonomy. This is where the above-mentioned databases (Greengenes/SILVA/NCBI) come into play. Once assigned, these assignments are formatted into a readable table showing the taxonomic profile of the sample. This formatting takes the representatives and the members of their clusters into account to give a comprehensive profile. Usually this profile is stored in a JSON/HDF5 based format called the BIOM format, which is accepted by a multitude of tools, making downstream analysis easier.
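A minimal sketch of this step in QIIME1, assuming SILVA reference files have been downloaded separately (the reference paths and other file names are hypothetical and depend on the SILVA release used):

```bash
# Assign taxonomy to the representative sequences against the SILVA reference
assign_taxonomy.py -i rep_set.fna \
                   -r silva_rep_set.fna \
                   -t silva_taxonomy.txt \
                   -m uclust \
                   -o taxonomy/

# Combine the OTU map and taxonomy assignments into a BIOM-format OTU table
make_otu_table.py -i uclust_otus/merged_seqs_otus.txt \
                  -t taxonomy/rep_set_tax_assignments.txt \
                  -o otu_table.biom
```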
Further Analysis
With this, the primary analysis of the sample ends. It is typically followed by further analysis steps like beta diversity (diversity between samples), alpha diversity (diversity within a sample), correlation analysis (using PAST), functional prediction (Tax4Fun+KO2Path), etc.
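As a small preview, QIIME1 bundles most of the diversity steps into a single wrapper; a minimal sketch, assuming a phylogenetic tree of the representative sequences and a rarefaction depth chosen from the OTU table (values and file names are hypothetical):

```bash
# Alpha and beta diversity, rarefaction and summary plots in one go
core_diversity_analyses.py -i otu_table.biom \
                           -m mapping_file.txt \
                           -t rep_set.tre \
                           -e 10000 \
                           -o core_diversity/
```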
Each of these analyses requires further processing of the taxonomic profile; let us look at a few of these possibilities in the next post. Till then, have a good day.