How to Prioritize Targets in the Age of Omics?

As researchers, we’re grappling with the growing influx of omics data and the need for effective target prioritization, as our current Excel trackers fall short. I’m curious how researchers and colleagues approach this challenge. Are there established frameworks or initial screening mechanisms for organizing and filtering omics data? I’d greatly appreciate any insights or connections to individuals with experience in this area, who could share their expertise. @laura.winchester @LauraIbanez @NHatcher114 @vdardov @nameyeh @cruchagac @fbbriggs @kraty @jaeyoon.chung @gdp22 @rochet071369 @SLeslie @ehutchins @amclean @zarshad @omar.mabrouk @marekpiatek @tnanasi
@blehallier @mquinton @KHapponen

3 Likes

Hi @zkurlawala , great question. The scale and complexity of omics data can make it hard to wrangle, especially across multiple conditions or timepoints.

I’m aware of a couple of tools that are widely used and may ease the Excel-induced pressure: Galaxy (open-source, GUI-based, no coding needed) for preprocessing, and Bioconductor (open-source, R-based, built for biological data, especially omics).

Wondering if any of our other community members with interest/expertise in genetic data could weigh in on how they approach this problem? (@waldoe, @mariariverapaz, @peixott, @hamptonl, @mejoh, @jbmchls)

Hello,

Big data is always a complex problem. The possibility of having data at a population level has changed the field, but it is really hard to work with since the bottleneck is no longer generating the data, but analyzing it.

Actually, I am struggling with this problem while working on the Veterans Affairs dataset, a dataset with over 600,000 samples. Our approach is “divide and conquer”: we infer local ancestry in ten different sets, and we are running the regressions in sets of 1,000 variants.
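For readers who want to picture the "divide and conquer" idea, here is a minimal sketch of splitting a variant list into sets of 1,000 and processing one set at a time. It is only an illustration, not the actual pipeline used on that dataset: `run_regression` is a hypothetical placeholder for whatever per-batch analysis you run, and the variant IDs are made up.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def batches(items: Iterable[str], size: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def run_regression(variant_ids: List[str]) -> Dict[str, None]:
    """Hypothetical placeholder: a real version would fit one model per variant."""
    return {variant_id: None for variant_id in variant_ids}


# Made-up variant IDs standing in for a real variant list.
variant_ids = [f"var{i}" for i in range(2500)]

results: Dict[str, None] = {}
batch_sizes: List[int] = []
for batch in batches(variant_ids, size=1000):
    batch_sizes.append(len(batch))
    # Only one batch is held in memory and processed at a time.
    results.update(run_regression(batch))
```

Because each batch is independent, the same loop can be handed to a job scheduler so that batches run in parallel.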

As for tools, I recommend using data science tools that were developed to run on big datasets, such as Pandas. Unfortunately, I am just a regular bioinformatician working in population genetics and genetic epidemiology, so my knowledge is quite limited.

About Excel: avoid it. Several papers (Gene name errors are widespread in the scientific literature - PMC; Gene name errors: Lessons not learned - PMC; Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics - PubMed) show that Excel is a serious problem for the scientific literature. I love the example of the gene September 7, oh sorry, SEPT7, which Excel converts to a date.
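If a gene list has already passed through a spreadsheet, a quick check for date-like entries can catch some of the damage. The sketch below assumes the mangled symbols show up in the common day-month form (for example "7-Sep" for SEPT7); other date formats would need extra patterns, and the function and column names are made up for illustration.

```python
import re

# Matches entries such as "7-Sep" or "1-Mar" (day, hyphen, 3-letter month).
# Assumption: this is the display format the spreadsheet left behind.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)


def find_date_like_symbols(symbols):
    """Return the entries that look like dates rather than gene symbols."""
    return [s for s in symbols if DATE_LIKE.match(s.strip())]


# Made-up example column from an exported spreadsheet.
gene_column = ["TP53", "7-Sep", "SNCA", "1-Mar", "LRRK2"]
suspicious = find_date_like_symbols(gene_column)
```

A check like this only flags the problem. The original symbol cannot always be recovered from the date, so going back to the source file is the safer fix.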

2 Likes

Thanks for the insights, @peixott , it’s great to hear how you and your colleagues are approaching the analysis pipelines for these big datasets. I used Pandas in my PhD for analyzing large-ish MEG datasets, but those were still much smaller than most of these genetic datasets. Do you find Pandas effective at scale?

Hello,

Pandas is a really useful tool, but it’s not perfect. One of the libraries I use in my work, admix-kit, uses a library based on Dask to handle datasets, and it seems to be a better option for working with big data than Pandas.

The biggest challenge is finding the tool you’re comfortable with. You can have the best tool available, but if you don’t know how to use it or don’t like using it, it is useless. You can tighten a screw with a knife, even if it’s not the ideal tool. At the end of the day, what matters is whether the result was delivered (and delivered correctly).

2 Likes

I am fairly naive - but this is a task that is looming in my near future. I have bookmarked this tool to explore: https://mixomics.org/

1 Like

I was going to add mixOmics - thanks Farren! I made a post about it in the past that’s related to this discussion. There are a lot of online resources to help you get started.

With mixOmics, you can start with a list of variables of interest, or as @mdeleeuw pointed out, you can use the sparse Partial Least Squares Discriminant Analysis (sPLS-DA) tool in each *-omics domain to identify features in each *-omics set before putting them all together.

One thing to keep in mind is that data needs to be normalized for covariates within each *-omics type before analyzing across *-omics.
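As a rough illustration of what adjusting for a covariate can look like, the sketch below regresses one feature on one covariate and keeps the residuals. This is just one common approach and an assumption on my part, not necessarily what mixOmics users do in practice; real analyses usually adjust for several covariates at once with a proper regression library. The variable names and numbers are made up.

```python
from statistics import mean


def residualize(values, covariate):
    """Remove the linear effect of one covariate via simple least squares."""
    x_bar, y_bar = mean(covariate), mean(values)
    sxx = sum((x - x_bar) ** 2 for x in covariate)
    if sxx == 0:
        raise ValueError("covariate is constant; nothing to adjust for")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(covariate, values))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual = observed value minus the value predicted from the covariate.
    return [y - (intercept + slope * x) for x, y in zip(covariate, values)]


# Made-up data: one feature measured in four samples, with age as the covariate.
age = [50.0, 60.0, 70.0, 80.0]
protein_level = [1.0, 1.5, 2.1, 2.4]
adjusted = residualize(protein_level, age)
```

The adjusted values have mean zero and no remaining linear association with age, so features from different *-omics layers can be compared without age driving the signal.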

Also, thank you @peixott for bringing up the pitfalls of Excel with receipts, so to speak (meaning literature articles). This continues to be an important issue for researchers to keep in mind.

Does that help answer your question, @zkurlawala ? Thank you for bringing up this important topic.