Do you use a pipeline for biostatistical analysis? Recently, I used one, and although there was a learning curve, it was quite rewarding. I would like to share my experience.
My aim was to set up a pipeline that takes genetic data (either VCF or PLINK files) and clinical data, and then performs longitudinal GWAS with all the necessary Quality Control (QC) steps included. It should perform the following tasks:
- Read genetic data.
- Standardize the data (split multiallelic sites, lift over to hg38).
- Perform basic GWAS QC: sample call rate, variant call rate, relatedness check, heterozygosity check, and then separation into ancestry groups.
- Finally, conduct a longitudinal analysis (using either a linear mixed-effects model or survival analysis) for each population.
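The steps above can be sketched as a small Nextflow (DSL2) workflow. This is only a minimal illustration, not the actual pipeline: the process names, file paths, and tool choices (bcftools for splitting multiallelic sites, PLINK for call-rate and heterozygosity QC) are assumptions I am using for the example.

```nextflow
// Minimal DSL2 sketch of the stages described above (illustrative only).
nextflow.enable.dsl = 2

process STANDARDIZE {
    input:
    path vcf

    output:
    path "standardized.vcf.gz"

    script:
    """
    # Split multiallelic sites into biallelic records
    bcftools norm -m-any ${vcf} -Oz -o standardized.vcf.gz
    # Liftover to hg38 would follow here (e.g. with CrossMap)
    """
}

process GWAS_QC {
    input:
    path vcf

    output:
    path "qc.*"

    script:
    """
    # Sample call rate (--mind), variant call rate (--geno)
    plink --vcf ${vcf} --mind 0.05 --geno 0.05 --make-bed --out qc
    # Heterozygosity check; relatedness is often checked separately (e.g. KING)
    plink --bfile qc --het --out qc
    """
}

workflow {
    STANDARDIZE(Channel.fromPath(params.vcf))
    GWAS_QC(STANDARDIZE.out)
    // Downstream: per-ancestry longitudinal models (mixed-effects or survival)
}
```

Each stage becomes an isolated process with declared inputs and outputs, which is what makes the pipeline easy to resume, parallelize, and containerize later.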
These are fairly standard analyses, and I tried Nextflow to automate them after reading this article (https://www.biorxiv.org/content/10.1101/2020.08.04.236208v1.full). It worked quite well, and a colleague has since taken it over and Dockerized it so that it runs on various platforms (https://github.com/michael-ta/longitudinal-GWAS-pipeline). We are testing it on the AMP-PD platform, so please stay tuned if you are interested. My point, however, isn't to advertise our tool but to introduce Nextflow to those who do routine bioinformatics analyses: it is relatively easy to learn and makes your analyses standardized and reproducible.
Do you find yourself repeating the same tasks? If you use workflow management software like Nextflow, what would you recommend and what challenges have you encountered?