Deriving comorbidities from the PPMI data for Parkinson's sub-typing

vcatterson · July 18, 2025, 4:44pm

Today I’d like to talk about identifying comorbidities within the PPMI dataset. Comorbidities are any condition which exists alongside the primary disease, and can include those that often occur in concert with PD, such as anxiety, depression, REM sleep behavioural disorder, daytime sleepiness, and bowel and bladder dysfunction. There are other conditions which are less strongly linked with PD, although we could speculate a mechanism by which they share an origin, such as autonomic dysfunction causing hypertension. And there are many diseases for which there appears to be no link or common mechanism, such as cancer.

When sub-typing Parkinson’s, or looking for common patient journeys through the disease, the clusters of comorbidities can be illuminating. For this reason, I started digging into the comorbidity data within PPMI, with the goal of creating a table of data for the PD and Healthy Control patients which could be used as part of a clustering process.

Where are comorbidities in PPMI?

There are three tables in PPMI which touch on comorbidities:

General physical exam: This identifies the body part where a clinician found an “abnormality”. Data is recorded using a controlled ontology with terms such as skin, lungs, psychiatric, or neurological.
Concomitant medications log: This identifies all non-levodopa-equivalent medications prescribed, with a controlled ontology for the indication.
Other clinical features: This table identifies features often associated with PD, using a controlled ontology with terms such as depression, anxiety, cognitive fluctuations, and dyskinesia.

Note, there is another table called the Medical Conditions Log, which contains two fields. The first is a free text field, which is difficult to regularize because it doesn’t use a controlled ontology. The second identifies a hospital department or category, such as “dermatological” or “pulmonary”, but I found the meaning of this table hard to discern from the documentation. As a result, I focused on the three tables listed above.

Correlation between data tables

So the general physical exam, concomitant medications, and other clinical features all give a view into comorbidities, and we might expect overlap or agreement between them. For example, the medications ontology contains “Anxiety”, the clinical features records the presence of anxiety (FEATANXITY), and the physical exam records psychiatric conditions. If someone experiences anxiety, we may expect this to be recorded across all three data columns.

However, this is not always the case! The table below shows the Pearson correlation coefficient calculated for the pairwise comparison of these three columns. We can see that there is relatively low correlation, ranging from 0.13 between the feature and general physical exam, up to 0.32 between medication and general physical exam. We may have a priori expected a value closer to 1, representing higher correlation.

It’s not just psychiatric conditions which display relatively low concordance. The table below shows correlation between medications for constipation, the feature of bowel dysfunction (FEATBWLDYS), and gastrointestinal abnormality on the physical exam. The correlations are even lower here than for anxiety! Of course, there are more types of bowel dysfunction than just constipation, and more types of gastrointestinal abnormality than just bowel dysfunction, but the highest correlation is actually between gastrointestinal abnormality at physical exam and medication for constipation.

This pattern repeats for all other comorbidities which are covered by two or more of these three data tables. My interpretation is that not all conditions are under treatment, not all conditions are medicated, and not all symptoms are present at physical exam. As a result, a given comorbidity may be captured by only one or two of the data tables.

Representing comorbidities

If we want to include comorbidities within analysis or modelling, it may be more useful to consider these three data tables as giving hints towards a “tendency to comorbidity” rather than a binary presence/absence. Since there is no perfect overlap of ontology terms between the three sources (ie bowel dysfunction is more than just constipation), we can’t simply combine categories with a voting scheme. An alternative approach is to perform Principal Component Analysis (PCA) on all three data tables, to uncover the principal components of comorbidity within the data.

When I did this, I discovered that 30 components explained 90% of the variance within the data (see below). Including 30 features within a machine learning model is better than including all 76 of the raw data columns, as the information content within a given component is higher, and the ML model is more likely to correctly determine the relationship between the input and the outcome being predicted.

As a final step, we can examine the loadings on each input feature that comprises a given component. In this way we can uncover some meaning behind the component. For example, the plot below shows that component 2 is driven primarily by a neurological abnormality at the physical exam (PE: Neurological), and medications for pain.

Summary

Comorbidities may be a crucial element of understanding sub-types of Parkinson’s, but there are multiple data tables in PPMI which give a view on whether or not a given comorbidity is present. Instead of focusing only on medications, physical exam, or other clinical features, we can combine all three together using PCA, and select only the components which explain the most variance within the data. In this way, we can capture more nuanced information about comorbidities, and hopefully unlock deeper understanding of Parkinson’s.

What do others think? Have you worked with comorbidities in PPMI? How did your thinking and workflow differ? Please let me know below!

gginnan · July 22, 2025, 2:55pm

Thanks for this, @vcatterson! It can seem like a daunting task to tease apart the various subtypes and comorbidities present in the wide spectrum of PD patients’ physiologies, especially given the lack of developed ontologies in that space. Thanks for sharing how you’re approaching this problem!

Curious if others who work with PPMI data have developed different methods or approaches to analyzing comorbidity data? @pbchan @stnava @nameyeh @alberto.imarisio @ZhengjieYang @maylabel @KHapponen @Jonason @GaboPG97

danieltds · July 30, 2025, 7:50pm

Hi Victoria,

Thanks for the detailed explanation on how you handled comorbidities in the PPMI dataset. I’ve also played around with this data before, but definitely not at the level you did.

In my Useful PPMI Clinical Codes repo, there’s a script that pulls some comorbidities and meds from free text. It’s not as thorough as your approach, but it did the job for a few basic things I wanted to look into. For example, when GLP-1 agonists started getting attention as possibly protective for PD, I used this method to extract that info, and that worked quite fine. It does take some manual checking and isn’t perfect, which is why I liked your method, as you showed how combining different ways of getting at the data can really make the most of it.

Good job and thanks for bringing the topic!

vcatterson · August 5, 2025, 4:46pm

Thanks for highlighting your repo, Daniel, and demonstrating how you did something similar! It’s great to share perspectives on how to clean and use this data, as we all need to do slightly different things with it.

The GLP-1 agonist work is also very interesting! Thanks for sharing!

Topic		Replies	Views
An Introduction to the PPMI Dataset Accessing and Understanding Data ppmi , data-access , featured-content	4	270	August 17, 2023
Which are the correct identifiers for patients in the PPMI cohort? Analyzing and Reusing Data meta , ppmi , data-interpretation , documentation	6	139	August 8, 2023
Prior experience with PDBP biological and omics data Accessing and Understanding Data how-to , data-interpretation , pdbp , omics-data	6	63	August 18, 2023
Useful PPMI Clinical Codes - Code Available Analyzing and Reusing Data ppmi , clinical-data , code	3	77	March 1, 2025
Replacing Clinico-Pathologic Convergence With Multimodal Data Divergence Ideas and Inspiration subtyping , data-interpretation , multimodal	5	50	January 16, 2024

Deriving comorbidities from the PPMI data for Parkinson's sub-typing

Where are comorbidities in PPMI?

Correlation between data tables

Representing comorbidities

Summary

Related topics