Hello everyone,
I am analysing the different subgroups available in PPMI’s curated data, and there are 6 people in the Healthy cohort with a genetic subgroup: 2 have the PRKN subgroup, 3 have the GBA subgroup, and 1 has the PINK1 subgroup. All of them have a primary diagnosis (PRIMDIAG) of “No PD nor other neurological disorder” (17) at baseline.
Looking at the description of the Prodromal cohort, it seems these genetic risk factors would put these people in the Prodromal cohort rather than the healthy cohort. I understand there was some change in terms of PPMI inclusion criteria in the past, but I thought after those changes everything got harmonised, so I don’t understand why these people got assigned to the Healthy cohort.
However, I’m actually wondering whether there has been some error in the generation of the curated dataset, because when I look for these people in the “Participant_Status” CSV, the corresponding `ENRL…` variables are all either zero or empty, which would mean no mutation was recorded and thus the subgroup definition in the curated dataset is wrong. Also, in the “Data Dictionary” tab of the curated dataset, the original variables used for the subgroup are indicated to include `CON…`, which don’t seem to be available anywhere in LONI. This once again makes me believe the subgroup came from some old generation that is no longer correct and not consistent with the available CSVs in PPMI (so I’m either missing something or this needs to be corrected).
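For anyone who wants to reproduce this kind of check, here is a minimal sketch. It assumes the “Participant_Status” CSV has a participant ID column called `PATNO` and that the enrollment indicators are the columns whose names start with `ENRL`. The column names `ENRL_A` and `ENRL_B`, the participant IDs, and the helper names are placeholders made up for illustration, not the real PPMI names.

```python
import csv
import io

def enrl_flags_all_empty(row, prefix="ENRL"):
    """Return True if every column starting with `prefix` is empty or zero."""
    values = [(v or "").strip() for k, v in row.items() if k.startswith(prefix)]
    return all(v in ("", "0", "0.0") for v in values)

# Toy stand-in for the Participant_Status CSV (placeholder columns and IDs).
SAMPLE = """PATNO,COHORT,ENRL_A,ENRL_B
1001,Healthy,0,
1002,Healthy,,
1003,Prodromal,1,0
"""

def participants_without_enrl_flag(csv_text, ids):
    """List the IDs in `ids` for which no enrollment indicator is set."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [r["PATNO"] for r in reader
            if r["PATNO"] in ids and enrl_flags_all_empty(r)]

no_flag = participants_without_enrl_flag(SAMPLE, {"1001", "1002", "1003"})
print(no_flag)
```

With the real file you would read the CSV from disk instead of `SAMPLE` and pass the IDs of the participants you want to inspect.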
Would anyone be able to help me understand what might have happened here? Thanks!
3 Likes
This is a great question! I don’t have an answer, unfortunately, but will be tracking this post as I’m really interested in the outcome.
2 Likes
Hi, just a quick update on this topic. I was looking at the new curated dataset from a few weeks ago (2025-11-12), and I can see that the “subgroup” entry was updated. The “Original Variables” no longer include the `CON…` variables, the original datasets used now include “iu_genetic_consensus”, and the Derivation Notes now have a clarification saying: “Genetic consensus variant data from the ‘iu_genetic_consensus’ file are used when available. If genetic consensus data is missing, use enrollment (ENRL) indicator.”
After checking the “iu_genetic_consensus” CSV, I can now see where the genetic subgroups for these healthy people came from, so at least this part is clearer now (I could swear I didn’t see this file in LONI before, but its version date is around a week before I created this topic, so I guess I just missed it).
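The derivation rule quoted above reads like a simple fallback, which can be sketched as below. This is only my reading of the Derivation Notes, not the actual curation code. The field values, the treatment of `None` and empty strings as “missing”, and the function names are all assumptions for illustration.

```python
def is_missing(value):
    """Treat None and empty strings as missing (an assumption)."""
    return value is None or value == ""

def derive_subgroup(consensus, enrl):
    """Use the genetic consensus value when available, else the ENRL indicator."""
    if not is_missing(consensus):
        return consensus
    return None if is_missing(enrl) else enrl

def sources_disagree(consensus, enrl):
    """True when consensus carries a variant that the enrollment data does not."""
    if is_missing(consensus):
        return False
    return is_missing(enrl) or consensus != enrl

# Toy records: (participant label, consensus variant, enrollment indicator).
records = [
    ("A", "GBA", ""),      # variant only in the consensus file
    ("B", "", "LRRK2"),    # consensus missing, fall back to enrollment
    ("C", "PRKN", "PRKN"), # both sources agree
    ("D", None, None),     # nothing recorded anywhere
]

derived = {pid: derive_subgroup(c, e) for pid, c, e in records}
flagged = [pid for pid, c, e in records if sources_disagree(c, e)]
print(derived, flagged)
```

Participant “A” is the pattern described in this thread: a subgroup comes out of the consensus file even though the enrollment indicators are empty.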
However, I believe some things are still not clear: how is it possible that this “genetic consensus” data is different from the “Participant_Status” CSV? And, in this case, shouldn’t this still mean that if these people have these genetic risks at baseline, they should have gone into the Prodromal cohort instead?
I feel I’m missing some key methodological information.
4 Likes
I have another follow-up to this issue. I feel there are some inconsistencies in the COHORT assignment, and these 6 healthy people are just part of a deeper problem we do not understand…
I further discovered 74 people in the Prodromal cohort whose primary diagnosis (variable PRIMDIAG) at baseline is 1, meaning “Idiopathic PD”. I also discovered 5 people in the Prodromal cohort whose primary diagnosis at baseline is 5, meaning “Dementia with Lewy bodies”. How can these people be assigned to a cohort that is supposed to be for people at risk of Parkinson’s and that, according to the information online, requires “No clinical diagnosis of PD or other parkinsonism or of dementia”?
Finally, I also discovered 7 people in the SWEDD cohort whose primary diagnosis at baseline is 17, meaning “No PD nor other neurological disorder”. If the SWEDD cohort is defined as “participants who were clinically diagnosed with PD but had a normal DAT SPECT on visual inspection”, how can these people be in the SWEDD cohort?
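A minimal sketch of how such a COHORT vs. PRIMDIAG tally at baseline could be done is below. The PRIMDIAG codes (1, 5, 17) and their meanings are the ones mentioned above. The cohort labels as plain strings, the toy rows, and the function names are assumptions for illustration, and the real files may encode COHORT differently.

```python
from collections import Counter

# PRIMDIAG codes mentioned in this thread.
PRIMDIAG_LABELS = {
    1: "Idiopathic PD",
    5: "Dementia with Lewy bodies",
    17: "No PD nor other neurological disorder",
}

# Cohort/diagnosis pairs that look inconsistent with the cohort descriptions.
UNEXPECTED = {("Prodromal", 1), ("Prodromal", 5), ("SWEDD", 17)}

def crosstab(rows):
    """Count baseline rows per (COHORT, PRIMDIAG) pair."""
    return Counter((r["COHORT"], r["PRIMDIAG"]) for r in rows)

def unexpected_counts(rows):
    """Keep only the pairs listed in UNEXPECTED."""
    return {pair: n for pair, n in crosstab(rows).items() if pair in UNEXPECTED}

# Toy baseline rows, one per participant (made-up data).
baseline = [
    {"COHORT": "Prodromal", "PRIMDIAG": 1},
    {"COHORT": "Prodromal", "PRIMDIAG": 1},
    {"COHORT": "Prodromal", "PRIMDIAG": 5},
    {"COHORT": "SWEDD", "PRIMDIAG": 17},
    {"COHORT": "Healthy", "PRIMDIAG": 17},
]

for (cohort, code), n in sorted(unexpected_counts(baseline).items()):
    print(cohort, code, PRIMDIAG_LABELS[code], n)
```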
I will appreciate any help/opinion on this issue. I still haven’t gotten an answer from PPMI regarding the initial topic of the healthy people with genetic risks.
2 Likes
just from my own experience with the prodromal cohort, we get people without a diagnosis that we give a research diagnosis to. so someone can sign up as technically prodromal, because they don’t carry a clinical diagnosis and they may not notice their symptoms actually. then we see them and voila, they do have motor symptoms enough for a diagnosis but we’re not their clinician and we don’t diagnose people or provide care during a research visit. so we put down a research diagnosis and advise them to follow up with their doctor etc. since they don’t have a clinical diagnosis, they can fit the bill and get included in the study still. we also change diagnoses at times, we’ve had people where we diagnosed them with an atypical parkinsonian but it was a research diagnosis and we may wait out to see the progression to see if it was indeed accurate or not. the current pd diagnostic criteria no longer excludes dementia either, so someone can have dementia prior to pd and get called pd still based on that criteria. weird, doesn’t make sense, controversial, but that’s just what we operate on… i only recently joined the data collection for ppmi though, it’s been 2-3 months, so the actual ppmi crew might have a better insight, but that’s just my experience so far… we haven’t had any swedd so i’m not sure how that group works yet
3 Likes
Thanks for this explanation, I think it makes sense! Doesn’t this mean there should be better documentation on how these different research/clinical diagnoses are assigned, and by whom? The way PPMI is documented suggests there is some sort of harmonised/centralised way to assign people to different cohorts/subgroups/diagnoses, and if that’s not the case, it makes these inconsistencies in the data a bit confusing (why would things be documented in one way but then done in another?)
I’m just a computer scientist so I have no idea whether what you are describing is controversial or not. All I’m saying is that it would be important to have clear documentation on how things are assigned (whether they are controversial or not). I guess this has a clear impact on how researchers understand what’s in the data, and thus ultimately on how research can be correctly conducted on this dataset. I really appreciate your explanation here, I’m just thinking out loud about why I find this very confusing and why it should lead to clearer documentation in PPMI.
4 Likes
FIRST OF ALL, YOU CANNOT SAY “I’M JUST A COMPUTER SCIENTIST” AS IF IT’S LESS THAN SOMETHING. no self-degradation allowed on my watch! you’re an important part of this field and all our contributions matter!!!
i’m not gonna claim to know the ppmi documentation as well but in general you can’t really name clinicians who are not part of the study as the diagnosticians, it’s a bit unfair to them to have their names out in the open when they may have just seen a participant once during a primary care visit and without the PPMI details that get collected. many people get diagnosed without imaging or any of the newer biomarker studies, it’s based on motor symptoms and then an elimination process to make sure other potentially reversible diseases are not to blame. there’s a reason why the diagnostic criteria keep getting updated and there are debates about whether it should be defined from the biological standpoint and not the clinical profile. it’s not because clinicians are completely horrible at recognizing symptoms, it’s because Parkinson’s is an elimination type of diagnosis, we have the red flags on the criteria to look out for, we have a bunch of different possible causes of symptoms to consider and you need access to tests and procedures to be able to weed everything out. PPMI funds many biomarker efforts thankfully, but those initiatives are not common in clinical practice yet. also it does make a difference to be seen by a specialist in a research study who has been trained for years and gotten the experience to notice subtle symptoms. in a primary care environment (which many people have access to, rather than a specialist clinic) some symptoms are easy to miss in the early phases. when you don’t have a clinical appointment with someone, you just cannot give them a diagnosis that will change their life forever. you don’t have the proper resources or time, and they don’t have the proper expectations walking into a research visit.
so you can’t just change their diagnosis, participants have access to some of their results, imagine seeing a clinical diagnosis that you weren’t aware of in a research portal… when we see urgent things (stroke on imaging, blood pressure/diabetes indicators etc), we do advise them to alert their clinician and act right away, that’s a whole different scenario of course. especially for the prodromal cohort, everyone i’ve seen so far has come through ads, they see an ad on social media and sign up. some with family history, some without any idea about Parkinson’s but fit the bill for prodromal inclusion criteria. i’m not gonna put a clinical diagnosis on any of them, i advise them to establish care with their primary care provider for routine controls if they have concerns etc. but unless they come with a clinical diagnosis, it’s not my place to clinically diagnose them, it’s a research visit.
it’s just the human aspect of the study i guess, data from human trials are never ever perfect for a reason. our rating scale scorings don’t even match perfectly all the time, things are subjective. this is why we need all hands on deck to improve the diagnosis, it’s obvious that everyone from different backgrounds brings in unique perspectives that can help crack the code.
6 Likes
I don’t have anything major to add, other than to thank you both for continuing this discussion! Thanks to @tiago.azevedo for highlighting more “unexpected” cases in the data, and many thanks to @ecebayram for explaining more about the research diagnosis process for PPMI! This is very valuable information.
Ece, I take your point that the researcher can only perform a research diagnosis, and doesn’t have the ongoing relationship with the participant to give a clinical diagnosis. But I also agree with Tiago that the details of the true meaning of the COHORT and PRIMDIAG labels are very opaque to those of us working with the data. Maybe the answer is to enhance the PPMI documentation to explain more about the research diagnosis “flow chart”, ie what is taken into account by the researcher to end up with a specific label in each case? Maybe this is something for the Data Modalities task force to discuss more this year?
2 Likes
i do feel someone who’s more on top of the ppmi documentation will have waaaaay better insight than my second hand information too
who knows, maybe i’m completely wrong!
2 Likes
Was taking a look at the PPMI Data User Guide & Groups/Subgroups Guidance Document (which live here) and would suggest also reviewing in case you haven’t already.
In reviewing the former, Sections 3.2-3 note:
In reviewing the user guide and annotated data dictionary I’m less clear on how the PRIMDIAG (“most likely diagnosis”) value is generated. It looks to be a value that has been updated over time and mapped from several different places: PRIMDXPD, DIAGQUES, PRODDIAG, CLINDX, which look to be variables from other CRFs or prior versions of the EDC.
Part of me wonders if the discrepancies you’re seeing between COHORT and PRIMDIAG in some instances could be because the former was previously determined by the consensus committee, with the latter coming from the researcher administering the visit?
Agree with @ecebayram’s statement that no study’s data is perfect and also that diagnosing participants is based on observations and best practice diagnostic criteria at the time of the study visit, which certainly has evolved over the period in which PPMI has been operating.
Anyway! I can ping someone at the Data Management Core to see if they have any insight into this and get back to the group.
4 Likes
Thanks, Josh! I did read the “Participant categorization” description, but it has been a while.
It’s always helpful to get a refresher on the exact text.
It’s very helpful to get your thoughts on PRIMDIAG too: I also wondered if any discrepancies were a bit of a legacy from the consensus committee days.
And of course, you’re right that no study’s data is perfect, but PPMI provides such a unique and comprehensive resource for the community! I think these questions about the precise meaning of certain fields are simply a result of PPMI being SO broad and SO longitudinal in scope that documentation is many times harder than for smaller studies!
3 Likes
The part I want to highlight, and I’m agreeing with @jgottesman here, is that COHORT is based on study enrollment and doesn’t change over time, whereas PRIMDIAG can change from visit to visit because it’s linked to the clinician’s notes at that visit. At least that is my understanding.
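If that understanding is right, one way to see it in the data is to track PRIMDIAG per participant across visits and list who changes. Here is a minimal sketch, assuming one row per participant visit with a `PATNO` ID, a `PRIMDIAG` code, and a numeric `visit_order` field. `visit_order`, the IDs, and the function names are hypothetical, so in practice you would need to map the real visit codes to an ordering yourself.

```python
from collections import defaultdict

def primdiag_history(visits):
    """Map each participant to their PRIMDIAG codes in visit order."""
    history = defaultdict(list)
    for v in sorted(visits, key=lambda v: (v["PATNO"], v["visit_order"])):
        history[v["PATNO"]].append(v["PRIMDIAG"])
    return dict(history)

def changed_diagnosis(visits):
    """Participants whose PRIMDIAG takes more than one value over time."""
    return sorted(p for p, codes in primdiag_history(visits).items()
                  if len(set(codes)) > 1)

# Toy longitudinal rows (made-up IDs); COHORT would stay fixed for each PATNO.
visits = [
    {"PATNO": "2001", "visit_order": 2, "PRIMDIAG": 1},
    {"PATNO": "2001", "visit_order": 0, "PRIMDIAG": 17},
    {"PATNO": "2001", "visit_order": 1, "PRIMDIAG": 17},
    {"PATNO": "2002", "visit_order": 0, "PRIMDIAG": 1},
    {"PATNO": "2002", "visit_order": 1, "PRIMDIAG": 1},
]

print(primdiag_history(visits))
print(changed_diagnosis(visits))
```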
I looked at this very closely years ago before the current harmonization, and actually published an SOP detailing the analysis groups that we used for our analyses at the time, and now that data is represented in the COHORT and subgroup categories.
4 Likes
Oh and one note on the Prodromal cohort. In the older legacy version, there were people that enrolled under the genetic registry, and some of these participants were migrated over to the PD or Prodromal cohorts.
I think some of the discrepancies can be explained by the 2010 legacy study and the ongoing 2020 study. I find it helpful to look at the inclusion/exclusion criteria for both.
For instance, there is this bit for the exclusion criteria for the Prodromal cohort:
Current or active clinically significant neurological disorder or psychiatric disorder (in the opinion of the Investigator).
5 Likes
Thanks, @ehutchins ! @tiago.azevedo, I was able to chat with a member of the PPMI study team who informed me:
“…you are correct that cohort is determined at time of initial Clinical consent/enrollment and stays the same for the duration of the participation, regardless of a new research, clinical diagnosis, or consensus dx following initial consent.
…the other 2 additional diagnosis CRFs could change during a ppts time of follow up – the primdiag Primary Research Diagnosis (made by PI) and newclindx Clinical Diagnosis (reported by ppt if a new clinical dx was made outside the research setting).
Hope this helps!
6 Likes
Very helpful! Thanks @ehutchins and @jgottesman for the extra context 
2 Likes