How to Prioritize Targets in the Age of Omics?

zkurlawala · April 16, 2025, 1:00pm

As researchers, we’re grappling with the growing influx of omics data and the need for effective target prioritization, as our current Excel trackers fall short. I’m curious how researchers and colleagues approach this challenge. Are there established frameworks or initial screening mechanisms for organizing and filtering omics data? I’d greatly appreciate any insights or connections to individuals with experience in this area, who could share their expertise. @laura.winchester @LauraIbanez @NHatcher114 @vdardov @nameyeh @cruchagac @fbbriggs @kraty @jaeyoon.chung @gdp22 @rochet071369 @SLeslie @ehutchins @amclean @zarshad @omar.mabrouk @marekpiatek @tnanasi
@blehallier @mquinton @KHapponen

gginnan · April 16, 2025, 6:54pm

Hi @zkurlawala , great question. The scale and complexity of omics data can make it hard to wrangle, especially across multiple conditions or timepoints.

I’m aware of a couple of tools: Galaxy (open-source, GUI-based, no coding needed) for preprocessing, and Bioconductor (open-source, R-based, aimed at biological data, especially omics). Both are widely used and may ease some of the Excel-induced pressure.

Wondering if any of our other community members with interest/expertise in genetic data could weigh in on how they approach this problem? (@waldoe, @mariariverapaz, @peixott, @hamptonl, @mejoh, @jbmchls)

peixott · April 16, 2025, 6:54pm

Hello,

Big data is always a complex problem. The possibility of having data at a population level has changed the field, but it is really hard to work with since the bottleneck is no longer generating the data, but analyzing it.

Actually, I am struggling with this problem while working on the Veterans Affairs dataset, a dataset with over 600,000 samples. Our approach is “divide and conquer”: we infer local ancestry in ten different sets, and we are running the regressions in sets of 1,000 variants.
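A minimal Python sketch of the batching idea described above. The helper names (`batched`, `run_in_batches`) and the `fit_batch` callback are hypothetical placeholders for whatever regression is actually run; this is not the pipeline used on the Veterans Affairs data.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def run_in_batches(
    variant_ids: Iterable[str],
    fit_batch: Callable[[List[str]], Dict[str, float]],
    batch_size: int = 1000,
) -> Dict[str, float]:
    """Apply `fit_batch` to one set of variants at a time and merge the results."""
    results: Dict[str, float] = {}
    for batch in batched(variant_ids, batch_size):
        # Only one batch is processed at a time, so memory use stays bounded.
        results.update(fit_batch(batch))
    return results
```

Each batch is independent, so in practice the batches could also be submitted as separate cluster jobs rather than looped over in one process.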

As for tools, I recommend data science tools that were developed to run on big datasets, such as Pandas. Unfortunately, I am just a regular bioinformatician who works with population genetics and genetic epidemiology, so my knowledge is really restricted.

About Excel: avoid it. Several papers (Gene name errors are widespread in the scientific literature - PMC; Gene name errors: Lessons not learned - PMC; Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics - PubMed) show that Excel is a huge problem for the scientific literature. I love the example of the gene September 7, oh sorry, SEPT7, which Excel converts to a date.
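As a rough illustration, a gene list that has passed through Excel can be screened for date-like entries before analysis. This sketch assumes the mangled symbols appear in a day-month form such as "7-Sep" (an English-locale rendering; other locales and date formats would need additional patterns), and the function name is hypothetical.

```python
import re
from typing import Iterable, List

# Assumption: a converted symbol looks like "7-Sep" or "1-Mar".
# Other date renderings are not covered by this pattern.
_DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)


def find_date_like_symbols(symbols: Iterable[str]) -> List[str]:
    """Return the entries that look like dates rather than gene symbols."""
    return [s for s in symbols if _DATE_LIKE.match(s.strip())]


suspicious = find_date_like_symbols(["SEPT7", "7-Sep", "TP53", "1-Mar", "MARCH1"])
print(suspicious)
```

A check like this only flags likely damage; the original symbol still has to be recovered from the source file.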

gginnan · April 16, 2025, 6:54pm

Thanks for the insights, @peixott , it’s great to hear how you and your colleagues are approaching the analysis pipelines for these big datasets. I used Pandas in my PhD for analyzing large-ish MEG datasets, but still those were much smaller than most of these genetic datasets. You find Pandas effective at scale?

peixott · April 16, 2025, 6:54pm

Hello,

Pandas is a really useful tool, but it’s not perfect. One of the libraries I use in my work, admix-kit, uses a library based on Dask to handle datasets, and it seems to be a better option for working with big data than Pandas.

The biggest challenge is finding the tool you’re comfortable with. You can have the best tool available, but if you don’t know how to use it or don’t like using it, it is useless. You can tighten a screw with a knife, even if it’s not the ideal tool. At the end of the day, what matters is whether the result was delivered (and delivered correctly).

fbbriggs · May 16, 2025, 12:13pm

I am fairly naive - but this is a task that is looming in my near future. I have bookmarked this tool to explore: https://mixomics.org/

ehutchins · June 3, 2025, 4:03pm

I was going to add mixOmics - thanks Farren! I made a post about it in the past that’s related to this discussion. There are a lot of online resources to help you get started.

With mixOmics, you can start with a list of variables of interest, or as @mdeleeuw pointed out, you can use the sparse Partial Least Squares Discriminant Analysis (sPLS-DA) tool in each *-omics domain to identify features in each *-omics set before putting them all together.

One thing to keep in mind is that data needs to be normalized for covariates within each *-omics type before analyzing across *-omics.
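One common way to adjust for covariates within a single omics type is to regress each feature on the covariates and keep the residuals. The sketch below shows that idea with NumPy; it is one possible reading of "normalized for covariates", not the mixOmics procedure, and the function name and example covariates (age, batch) are assumptions for illustration.

```python
import numpy as np


def residualize(features: np.ndarray, covariates: np.ndarray) -> np.ndarray:
    """Remove the linear effect of covariates from each feature.

    features:   samples x features matrix for one omics type
    covariates: samples x covariates matrix (numeric, e.g. age, batch indicator)
    """
    # Add an intercept column so feature means are also removed.
    design = np.column_stack([np.ones(len(covariates)), covariates])
    # Ordinary least squares fit of every feature on the design matrix at once.
    coef, *_ = np.linalg.lstsq(design, features, rcond=None)
    return features - design @ coef


# Synthetic example: three features that partly depend on age.
rng = np.random.default_rng(0)
age = rng.normal(60, 10, size=50)
batch = rng.integers(0, 2, size=50).astype(float)
covs = np.column_stack([age, batch])
data = 0.5 * age[:, None] + rng.normal(size=(50, 3))
adjusted = residualize(data, covs)
```

The adjusted matrices from each omics type would then be the inputs to the cross-omics step.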

Also, thank you @peixott for bringing up the pitfalls of Excel with receipts, so to speak (meaning literature articles). This continues to be an important issue for researchers to keep in mind.

Does that help answer your question, @zkurlawala ? Thank you for bringing up this important topic.

zkurlawala · September 23, 2025, 5:24pm

Thank you all for your responses. @ehutchins @fbbriggs @peixott @gginnan @Bradford @bmarebwa I haven’t used mixOmics or similar tools before and don’t fully understand their capabilities. Thanks for sharing. I plan to discuss them with our data science advisors to better evaluate their advantages and limitations.

I have 2 additional questions on this topic. I realize the answers may be complex, but I want to raise them given the importance of considering the eventual translational value of omics data:

1. How can we identify promising targets within omics datasets that are suitable for biomarker discovery, particularly when two proteomics datasets on the same samples can yield different results? How do we build confidence in a target that looks robust in one dataset but is not significant in another?
2. How can we develop a framework of criteria, beyond p-value, FDR, and effect size, that also incorporates biological relevance?

fbbriggs · September 23, 2025, 6:36pm

Hey @zkurlawala , to Q1: it raises several study design questions: i) were confounders appropriately considered, ii) do the study populations differ in other ways (e.g., geography, inclusion/exclusion criteria, treatment histories), and iii) were samples processed differently (e.g., different tubes, reagents, etc.)? Also, in a discovery-replication framework, a 1-sided p-value is appropriate for the replication analyses. If the effects are in opposite directions, then it is likely a spurious association or a biomarker that is promising only for specific subgroups. For Q2: fascinating suggestion, but the challenge is how to define biological relevance while accepting that current knowledge/databases are incomplete, with a preponderance of information for specific relationships/tissues/processes. In the general causal criteria framework ( Bradford Hill criteria - Wikipedia ), biological plausibility is useful to guide interpretations; I think it would be difficult to consider it within some Bayesian framework…
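To make the replication point under Q1 concrete, here is a small Python sketch of turning a two-sided replication p-value into a one-sided one in the direction seen in discovery. It assumes a symmetric test statistic (e.g., z or t), and the function name is hypothetical.

```python
def replication_p_one_sided(
    discovery_effect: float,
    replication_effect: float,
    replication_p_two_sided: float,
) -> float:
    """One-sided replication p-value in the direction of the discovery effect.

    Assumes the two-sided p-value comes from a symmetric test statistic.
    """
    if not 0.0 <= replication_p_two_sided <= 1.0:
        raise ValueError("p-value must be between 0 and 1")
    if discovery_effect == 0:
        raise ValueError("discovery effect has no direction")
    same_direction = (discovery_effect > 0) == (replication_effect > 0)
    half = replication_p_two_sided / 2.0
    # Same direction: half the two-sided p. Opposite direction: the complement.
    return half if same_direction else 1.0 - half


concordant = replication_p_one_sided(0.40, 0.20, 0.08)
discordant = replication_p_one_sided(0.40, -0.20, 0.08)
```

With an opposite-direction effect the one-sided p-value becomes large, which matches the point that such a result does not count as replication.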
