Let us look at a few scripts I have written and how they work. (The links take you directly to GitHub.)
- Script to create BIOM formatted files using almost any taxonomy assigner.
  - The script was originally designed to convert WIMP output provided by Oxford Nanopore’s EPI2ME tool.
  - But with minor modifications it can use the output of any taxonomy assigner to create BIOM files.
  - These BIOM files can be converted to HDF5 format and used in any tool that reads this format (e.g. QIIME, LEfSe, etc.)
  - It creates BIOM files for each taxonomic rank, for each domain individually as well as combined.
  - Individual files can be used to assess diversity in individual domains:
    - Bacteria
    - Archaea
    - Viruses
    - Eukaryota (Plants, Animals, Fungi, etc.)
  - One can also supply a list of keywords, which is used to extract the reads matching those keywords as FASTA files. For example, to extract all reads assigned to Pseudomonas into a file for further analysis, add it as a keyword.
  - Script: https://github.com/mbshah/ncim-bioinfo/blob/master/virCodes/nanopore_epi2me_csv_parsev3.pl
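To make the two ideas above concrete (per-rank, per-domain counts and keyword-based read extraction), here is a minimal Python sketch. It is not taken from the Perl script: the `READS` layout, the rank order (domain, phylum, genus) and both function names are assumptions made for illustration, and real taxonomy-assigner output would need to be parsed into this shape first.

```python
from collections import Counter

# Hypothetical, already-parsed assignments: (read ID, sequence, lineage).
# The lineage is assumed to run domain -> phylum -> genus.
READS = [
    ("read1", "ACGT", ["Bacteria", "Proteobacteria", "Pseudomonas"]),
    ("read2", "GGCC", ["Bacteria", "Firmicutes", "Bacillus"]),
    ("read3", "TTAA", ["Bacteria", "Proteobacteria", "Pseudomonas"]),
    ("read4", "CAGT", ["Archaea", "Euryarchaeota", "Methanobrevibacter"]),
]


def counts_by_rank(reads, rank_index, domain=None):
    """Count reads per taxon at one rank, optionally for a single domain."""
    counts = Counter()
    for _read_id, _seq, lineage in reads:
        if domain is not None and lineage[0] != domain:
            continue  # read belongs to another domain
        if rank_index < len(lineage):
            counts[lineage[rank_index]] += 1
    return counts


def fasta_for_keyword(reads, keyword):
    """Return FASTA text for every read whose lineage mentions the keyword."""
    needle = keyword.lower()
    return "".join(
        f">{read_id}\n{seq}\n"
        for read_id, seq, lineage in reads
        if any(needle in taxon.lower() for taxon in lineage)
    )


if __name__ == "__main__":
    print(counts_by_rank(READS, 2))                     # genus level, all domains
    print(counts_by_rank(READS, 1, domain="Bacteria"))  # phylum level, Bacteria only
    print(fasta_for_keyword(READS, "pseudomonas"), end="")
```

Each `Counter` here corresponds to one column of observation counts for one sample; writing it out as an actual BIOM table is left to a BIOM library or converter.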
- QIIME wrapper scripts
  - Scripts that can be used to carry out a basic QIIME analysis up to alpha and beta diversity.
  - All step inputs and outputs are managed by the scripts.
  - They take paired-end FASTQ files directly.
  - Each step can be modified to your liking by opening the file in Notepad or any text editor.
  - end_to_end_qiime.pl: https://github.com/mbshah/ncim-bioinfo/blob/master/end-to-end-qiime.pl
  - end_to_end_qiime_swarm.pl uses the Swarm OTU clustering algorithm, which we found to work better: https://github.com/mbshah/ncim-bioinfo/blob/master/end-to-end-qiime_swarm.pl
  - Requires merge.pl, listed under "Initial QC of raw reads".
- Script to simplify download of sequences from NCBI:
  - The API for downloading sequences from NCBI can be unreliable at times, and downloads can end unexpectedly.
  - So I wrote this script to make downloading easier.
  - It takes into account:
    - Unstable internet connection: sequences are downloaded in batches of 1000, and each batch is retried until it succeeds.
    - Downloading from behind a proxy: make the appropriate changes to line 7 in the script.
    - Continuing from a failed download: if the script ends due to a power failure or any other reason, it can continue from where it left off.
    - Skipping results that are already present in the folder: the same mechanism lets you re-run the script later to update the results with newer annotations.
  - The script was originally designed to download only a single whole-genome sequence per taxonomy ID, in order to build BLAST databases of viral whole genomes only. This filtering can be turned off by changing line 13.
  - Script: https://github.com/mbshah/ncim-bioinfo/blob/master/virCodes/retrieve2.pl
  - Additionally requires:
    - gb_utils_xml.pl to manipulate XML files: https://github.com/mbshah/ncim-bioinfo/blob/master/virCodes/gb_utils_xml.pl
    - dbmaker.pl to create a BLAST database from the FASTA files; commenting out line 160 disables this BLAST database step: https://github.com/mbshah/ncim-bioinfo/blob/master/virCodes/dbmaker.pl
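The retry, resume and skip behaviour described above can be sketched in a few lines of Python. This is an illustration of the pattern only, not a port of retrieve2.pl: `download_in_batches`, the `fetch` callable, the `batch_NNNNN.fasta` file naming and the retry cap are all assumptions (the original script is described as retrying until it succeeds).

```python
import time
from pathlib import Path


def download_in_batches(ids, fetch, out_dir, batch_size=1000,
                        max_retries=5, wait_seconds=0.0):
    """Fetch IDs in batches, retrying failures and skipping finished batches.

    `fetch` is a hypothetical callable that takes a list of IDs and returns
    FASTA text, raising OSError when the connection fails.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        target = out_dir / f"batch_{start // batch_size:05d}.fasta"
        if target.exists():
            continue  # finished in an earlier run: this is the resume/skip step
        for attempt in range(1, max_retries + 1):
            try:
                text = fetch(batch)
                break
            except OSError:
                if attempt == max_retries:
                    raise  # give up after the assumed retry cap
                time.sleep(wait_seconds)
        # Write to a temporary name first so an interrupted run never leaves
        # a half-written file that would later be mistaken for a finished batch.
        tmp = target.with_suffix(".part")
        tmp.write_text(text)
        tmp.replace(target)
        written.append(target)
    return written
```

Because a batch only counts as done once its final file exists, re-running the function after a crash or power failure simply continues with the first missing batch.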
- KO2Path and Path2class
  - Scripts to help interpret the output of Tax4Fun.
  - More details can be found here: https://github.com/mbshah/metgenomics
  - They also create profiles.
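Going by the names KO2Path and Path2class, the underlying idea is presumably to roll KO-level abundances up to pathways and then to pathway classes. The Python sketch below shows one way such a roll-up could work. It is an assumption about the approach rather than a description of the actual scripts; the IDs are toy placeholders, not real KEGG identifiers, and crediting a KO's full abundance to every pathway it maps to is just one possible convention.

```python
from collections import defaultdict


def roll_up(abundances, mapping):
    """Sum the abundance of each child ID into all of its parent IDs.

    Children missing from `mapping` are collected under "unmapped".
    """
    totals = defaultdict(float)
    for child, value in abundances.items():
        for parent in mapping.get(child, ["unmapped"]):
            totals[parent] += value
    return dict(totals)


# Toy placeholder data, not real KEGG identifiers.
ko_abundance = {"KO_1": 0.5, "KO_2": 0.25, "KO_3": 0.25}
ko_to_path = {"KO_1": ["path_A"], "KO_2": ["path_A", "path_B"]}
path_to_class = {"path_A": ["class_X"], "path_B": ["class_X"]}

pathway_profile = roll_up(ko_abundance, ko_to_path)
class_profile = roll_up(pathway_profile, path_to_class)

if __name__ == "__main__":
    print(pathway_profile)
    print(class_profile)
```

Note that with this convention a KO belonging to several pathways is counted once per pathway, so the pathway totals can exceed the sum of the KO abundances.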
- Initial QC of raw reads
  - Merges paired-end reads.
  - Trims using trim_galore.
  - Requires Trim Galore to be installed beforehand.
  - merge.pl: https://github.com/mbshah/ncim-bioinfo/blob/master/merge.pl
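For readers who want to see what the trimming step involves, here is a small Python sketch that pairs up FASTQ files and builds a `trim_galore` command for each pair. It does not reproduce merge.pl: the `_R1`/`_R2` file-naming convention and both function names are assumptions, the read-merging step is not covered, and the command is only constructed, not run.

```python
import re

# Assumed naming convention: <sample>_R1<rest>.fastq[.gz] and its _R2 mate.
R1_PATTERN = re.compile(r"^(?P<sample>.+)_R1(?P<rest>.*\.fastq(\.gz)?)$")


def find_pairs(filenames):
    """Map each sample name to its (R1, R2) files; unpaired files are ignored."""
    names = set(filenames)
    pairs = {}
    for name in sorted(names):
        match = R1_PATTERN.match(name)
        if not match:
            continue
        mate = f"{match['sample']}_R2{match['rest']}"
        if mate in names:
            pairs[match["sample"]] = (name, mate)
    return pairs


def trim_galore_command(r1, r2, out_dir):
    """Build (but do not run) a paired-end Trim Galore command line."""
    return ["trim_galore", "--paired", "--output_dir", str(out_dir), str(r1), str(r2)]


if __name__ == "__main__":
    files = ["s1_R1.fastq", "s1_R2.fastq", "s2_R1.fastq", "notes.txt"]
    for sample, (r1, r2) in find_pairs(files).items():
        print(sample, " ".join(trim_galore_command(r1, r2, "trimmed")))
```

The resulting list can be passed to `subprocess.run` once Trim Galore is installed and on the `PATH`.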
- More scripts, including the ones above, can be found here: https://github.com/mbshah/ncim-bioinfo