Explanations
Purpose of each Nextflow script:
start.nf
This script is designed to initialize the environment needed for proper execution of the main.nf script. The start.nf script only needs to be run once, as it is primarily tasked with downloading test data and databases, and with pre-processing those databases.
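A single one-time run from the repository root might look like the sketch below, assuming Nextflow is installed and on the PATH; any non-default parameters are project-specific and not shown:

    nextflow run start.nf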
main.nf
This script is designed to be the primary analysis tool. main.nf can be run as many times as desired, with various processing options, and can be parallelized for large-scale analysis. main.nf takes in shotgun metagenomic data and quantifies the absolute abundance of antimicrobial resistance genes (ARGs) and 16S rRNA genes for the input data/samples. Input data can be provided as local FASTQ files or fetched from the NCBI Sequence Read Archive when given an SRA accession.
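A hedged example of a typical invocation is sketched below; the --reads and --sra_id parameter names and the paired-end glob pattern are illustrative assumptions, not necessarily the names this pipeline actually defines:

    # local paired-end FASTQ files (pattern is an assumed naming scheme)
    nextflow run main.nf --reads 'data/*_R{1,2}.fastq.gz'

    # fetch input from the NCBI Sequence Read Archive (accession is a placeholder)
    nextflow run main.nf --sra_id SRR0000000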
modules.nf
This script is designed to house all of the processes (functions) needed for the start.nf and main.nf workflows. These processes are the functional code and commands needed to drive the analysis. Individual processes in modules.nf can be called independently of the workflows if users wish to carry out only a fraction of the entire bioinformatic pipeline, or desire a non-automated, step-by-step analysis.
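For example, a single process can be pulled into a small ad hoc workflow with Nextflow's DSL2 include syntax; the process name QUALITY_TRIM below is a hypothetical stand-in for whichever process in modules.nf the user actually needs:

    // run_one_step.nf -- sketch of calling one modules.nf process in isolation
    nextflow.enable.dsl = 2

    include { QUALITY_TRIM } from './modules.nf'

    workflow {
        // pair up the raw reads and pass each pair to the single process
        reads = Channel.fromFilePairs(params.reads)
        QUALITY_TRIM(reads)
    }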
When to use each Nextflow script:
start.nf
This script should be the first script executed after downloading the GitHub repository. If users already have all of their data and processed databases stored locally, this script can be skipped.
main.nf
This script is run when the user wishes to quantify the absolute ARG and 16S rRNA gene abundance in paired-end shotgun metagenomic sequencing data of animal agricultural origin. The pipeline is run for every set of paired-end reads presented to the workflow as input.
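Nextflow's fromFilePairs channel factory is the usual way this per-pair execution is expressed; the glob pattern below is an assumption about the read naming scheme rather than the exact pattern this workflow uses:

    // each matching R1/R2 pair becomes one channel item,
    // so downstream processes run once per sample
    Channel.fromFilePairs('data/*_R{1,2}.fastq.gz')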
modules.nf
This script is only run when the user wishes to execute a single process individually rather than carry out the entire workflow. Otherwise, it serves as a formatted group of functions that are imported by the actual workflows: start.nf and main.nf.
Potential issues with each Nextflow script:
start.nf
-Needs a significant amount of memory and time; a lack of either can cause errors (see the configuration sketch below).
-Database downloads other than the defaults can prove tricky at times.
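One way to give the workflow more headroom, assuming a standard nextflow.config is in use, is to raise the default process resources; the values below are placeholders to adjust for the databases being downloaded:

    // nextflow.config (sketch; values are placeholders)
    process {
        memory = '64 GB'
        time   = '24 h'
    }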
main.nf
-Currently relies on access to my lab's software directory for BBDuk and CD-HIT-EST. Future Docker container development will resolve this issue.
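Once an image exists, Nextflow's built-in container support could replace the shared-directory dependency; the image name below is a placeholder:

    nextflow run main.nf -with-docker mylab/arg-pipeline:latest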
modules.nf
-Some processes have hard-coded Slurm resource allocations, which requires access to an HPC with a Slurm scheduler.
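If those directives are moved into (or overridden from) a nextflow.config, the executor can also be switched away from Slurm for machines without a scheduler; a sketch with hypothetical values:

    // nextflow.config (sketch; values are placeholders)
    process {
        executor = 'local'   // instead of 'slurm', for machines without a scheduler
        cpus     = 4
        memory   = '16 GB'
    }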