Bioinformatics workflow management tool

Do you use a pipeline for biostatistical analysis? Recently, I used one, and although there was a learning curve, it was quite rewarding. I would like to share my experience.

My aim was to set up a pipeline that takes genetic data (either VCF or PLINK files) and clinical data, and then performs longitudinal GWAS with all the necessary Quality Control (QC) steps included. It should perform the following tasks:

  1. Read genetic data.
  2. Standardize the data (split multiallelic sites, lift over to hg38).
  3. Perform basic GWAS QC: sample call rate, genotyping call rate, relatedness check, heterozygosity check, and then population separation.
  4. Finally, conduct a longitudinal analysis (using either a linear mixed-effects model or survival analysis) for each population.
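To make the call-rate part of step 3 concrete, here is a minimal Python sketch. It is not the code from our pipeline: the function names, the 0.95 default thresholds, and the choice to filter samples before variants are all assumptions for illustration. Genotypes are coded as alternate-allele counts, with `None` for a missing call.

```python
from typing import List, Optional, Tuple

# Alternate-allele count (0, 1, 2); None marks a missing call.
Genotype = Optional[int]


def call_rate(calls: List[Genotype]) -> float:
    """Fraction of non-missing genotype calls in a list."""
    if not calls:
        return 0.0
    return sum(c is not None for c in calls) / len(calls)


def filter_by_call_rate(
    matrix: List[List[Genotype]],
    min_sample_rate: float = 0.95,   # assumed threshold, not from the pipeline
    min_variant_rate: float = 0.95,  # assumed threshold, not from the pipeline
) -> Tuple[List[int], List[int]]:
    """Return (kept sample indices, kept variant indices).

    matrix[i][j] is the genotype of sample i at variant j.
    Samples are filtered first; variant call rates are then
    computed on the remaining samples only (an assumed ordering).
    """
    kept_samples = [
        i for i, row in enumerate(matrix) if call_rate(row) >= min_sample_rate
    ]
    n_variants = len(matrix[0]) if matrix else 0
    kept_variants = [
        j
        for j in range(n_variants)
        if call_rate([matrix[i][j] for i in kept_samples]) >= min_variant_rate
    ]
    return kept_samples, kept_variants
```

In practice these filters are usually run with a dedicated tool on the PLINK files; the sketch only shows what the two call-rate checks compute.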

These are fairly routine analyses, so I tried automating them with Nextflow after reading this article (https://www.biorxiv.org/content/10.1101/2020.08.04.236208v1.full). It worked quite well, and my colleague has since taken it over and Dockerized it so that it runs on various platforms (GitHub - michael-ta/longitudinal-GWAS-pipeline: Repository for Nextflow pipeline to perform GWAS with longitudinal capabilities). We are testing this on the AMP-PD platform, so please stay tuned if you are interested. However, my point isn’t to advertise our tool, but to introduce Nextflow to those who do routine bioinformatics analysis. It is relatively easy to learn and makes your analyses standardized and reproducible.
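For anyone new to workflow managers, the core idea is that you declare each step and what it depends on, and the tool works out a valid execution order. Here is a rough Python sketch of that idea using the standard-library `graphlib` module (Python 3.9+). This is not Nextflow syntax, and the step names are hypothetical labels for the task list above.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical step names; each step maps to the set of steps it depends on.
STEPS: Dict[str, Set[str]] = {
    "read_genotypes": set(),
    "read_clinical": set(),
    "standardize": {"read_genotypes"},
    "qc": {"standardize"},
    "split_by_population": {"qc"},
    "longitudinal_analysis": {"split_by_population", "read_clinical"},
}


def execution_order(steps: Dict[str, Set[str]]) -> List[str]:
    """Return one valid order in which every step runs after its dependencies."""
    return list(TopologicalSorter(steps).static_order())


if __name__ == "__main__":
    for name in execution_order(STEPS):
        print(name)
```

A real workflow manager adds much more on top of this (parallel execution, caching and resuming, containers), but the dependency graph is the part that keeps reruns consistent.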

Do you find yourself repeating the same tasks? If you use workflow management software like Nextflow, what would you recommend and what challenges have you encountered?

Great topic! I use a workflow management tool called jetstream, which someone at TGen developed, for my bulk RNA workflow and my Oxford Nanopore long-read workflow. I chose jetstream at the time because the developer sat just around the corner from me at the office and I could easily ask him questions. :slight_smile: I’ve recently switched to Singularity containers, which has made the workflow more portable, and I have ported my workflow between different SLURM-based HPC clusters, but not between SLURM and the cloud.

I do want to do more in Nextflow in the future instead of jetstream, as Nextflow seems to be more of a community standard. It seems to be outpacing Snakemake and the Common Workflow Language (CWL) these days, at least outside of WDL/Cromwell usage at the Broad. I appreciate the linked article, @hirotaka. It is neat to have something empirical in support of Nextflow in addition to word of mouth through the informatics community.

Nextflow actually just had their annual hackathon in Barcelona (as well as virtually), and their annual summit is happening now.

They also have a series of publicly available pipelines through nf-core, and there is an associated Slack channel for user support and development.

Thank you @ehutchins for your comments and input! These are all great points! I haven’t spent much time browsing nf-core, but I should have done so to avoid reinventing the wheel.

One limitation of Nextflow is that Terra does not currently support it, so we cannot use it on the AMP-PD platform. The workaround is to use Google Cloud directly and access the AMP-PD data in the Google bucket.