Which are the correct identifiers for patients in the PPMI cohort?

danieltds · July 23, 2023, 10:12pm

This question might be basic, but I think it is nonetheless important. The PPMI dataset (official website) contains several columns describing a patient’s diagnosis. The main identifiers are in the table called “Participant_Status”: “COHORT” and “COHORT_DEFINITION.”

However, this table also includes the variables “CONCOHORT” and “CONCOHORT_DEFINITION,” which represent, according to the dataset’s code list, “Cohort per Consensus Committee” and “Decoded Value for CONCOHORT.” I have read that there are regular consensus meetings to discuss certain aspects of patients and determine their diagnosis. I believe these two latter variables are the most accurate and up-to-date to use when selecting patients according to their cohort, but I wanted some confirmation.

Here I will show images of the number of patients in each category, as of the data downloaded on July 12.

COHORT_DEFINITION

CONCOHORT_DEFINITION

Additionally, the dataset contains the variable “PRIMDIAG” (Most likely primary diagnosis) in the “Primary_Clinical_Diagnosis” table. I couldn’t find more information on how this variable is determined, but I believe it represents a clinical diagnosis made at each follow-up visit by the physician examining the patient. This column has the following possible responses:

Does this column have any validity for defining patients or for research purposes?

jgottesman · July 25, 2023, 4:47pm

Will preface this by saying that I have very little experience with PPMI data but… would suggest taking a look at the recently updated PPMI User Guide: link.

I believe @rooparajan @vdardov @ehutchins @gdp22 @fbbriggs and @hirotaka have all worked with PPMI data to varying degrees – how have you handled this?

ehutchins · July 25, 2023, 9:06pm

All great questions! Prior to the Sept. 2021 data harmonization, I wrote an SOP in order to define genetic and disease status in the PPMI cohort, because we wanted to stratify based on some of the information that was obtained after enrollment.

The part that is most relevant to your question - I used primary diagnosis in order to flag participants that either:

Rule 5.1 - have a different disease other than control or PD listed
Rule 5.2 - are prodromal at screening/baseline and progress to PD at a later visit
Rule 5.3 - were enrolled in the genetic registry/cohort as genetic unaffected and are prodromal or PD at a later visit

One thing when looking at PPMI data that is super important to keep in mind - there are many visits and dates, so whenever you pull data, you want to look at not just the PATNO column but also the EVENT_ID column and use those two columns to bin your data.
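As a minimal sketch of binning by both keys, the pandas snippet below joins two toy tables on PATNO and EVENT_ID. The tables, the SCORE column, and all values are invented for illustration; only the PATNO, EVENT_ID, and PRIMDIAG column names come from this thread.

```python
import pandas as pd

# Toy stand-ins for two PPMI tables; every value here is invented.
diagnosis_by_visit = pd.DataFrame({
    "PATNO": [1001, 1001, 1002],
    "EVENT_ID": ["BL", "V04", "BL"],
    "PRIMDIAG": [1, 1, 17],
})
scores_by_visit = pd.DataFrame({
    "PATNO": [1001, 1001, 1002, 1002],
    "EVENT_ID": ["BL", "V04", "BL", "V04"],
    "SCORE": [12, 15, 3, 4],  # hypothetical measurement column
})

# Join on both keys so that each row is one participant at one visit.
# validate raises an error if either table repeats a (PATNO, EVENT_ID) pair.
merged = scores_by_visit.merge(
    diagnosis_by_visit,
    on=["PATNO", "EVENT_ID"],
    how="left",
    validate="one_to_one",
)

# Rows per participant-visit pair; anything above 1 would signal duplicates.
rows_per_visit = merged.groupby(["PATNO", "EVENT_ID"]).size()
print(merged)
```

Joining on PATNO alone would pair every visit of one table with every visit of the other, which silently multiplies rows.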

All that said, this was prior to the data harmonization in September 2021 for the study expansion. There are two things to consider here:
1. COHORTS/CONCOHORTS
The original study enrollment cohorts and the updated cohorts. From the data dictionary:

MOD_NAME	ITM_NAME	PAG_NAME	DSCR
PATIENT_STATUS	COHORT	NULL	Enrollment Cohort
PATIENT_STATUS	COHORT_DEFINITION	NULL	Decoded Value for COHORT
PATIENT_STATUS	COMMENTS	NULL	Comments
PATIENT_STATUS	CONCOHORT	NULL	Cohort per Consensus Committee
PATIENT_STATUS	CONCOHORT_DEFINITION	NULL	Decoded Value for CONCOHORT

So yes - “COHORT” and “COHORT_DEFINITION” represent the original enrollment criteria/assignments. “CONCOHORT” and “CONCOHORT_DEFINITION” represent the analytic data cohort assignments. Mostly - some SWEDD participants are doing follow-ups but they are not enrolling more, and some genetic cohort participants have moved over to the Prodromal or PD Concohort groups (though there is some historical difference in the frequency of data collection depending on the type of data).

MOD_NAME	ITM_NAME	CODE	DECODE
PATIENT_STATUS	COHORT	1	Parkinson’s Disease
PATIENT_STATUS	COHORT	2	Healthy Control
PATIENT_STATUS	COHORT	3	SWEDD
PATIENT_STATUS	COHORT	4	Prodromal
PATIENT_STATUS	COHORT	7	Genetic Registry - PD
PATIENT_STATUS	COHORT	8	Genetic Registry - Unaffected
PATIENT_STATUS	COHORT	9	Early Imaging (original study participants only)

MOD_NAME	ITM_NAME	CODE	DECODE
PATIENT_STATUS	CONCOHORT	0	non-PD, non-Prodromal, non-HC (participants to be excluded)
PATIENT_STATUS	CONCOHORT	1	Parkinson’s Disease
PATIENT_STATUS	CONCOHORT	2	Healthy Control
PATIENT_STATUS	CONCOHORT	3	SWEDD
PATIENT_STATUS	CONCOHORT	4	Prodromal
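To see how enrollment and consensus assignments differ, one option is to decode both columns and cross-tabulate them. The sketch below uses the code lists quoted above (apostrophes normalized to ASCII) on a toy table; the PATNO values and code assignments are invented, and the *_LABEL column names are hypothetical.

```python
import pandas as pd

# Decode maps copied from the code lists quoted above.
COHORT_DECODE = {
    1: "Parkinson's Disease",
    2: "Healthy Control",
    3: "SWEDD",
    4: "Prodromal",
    7: "Genetic Registry - PD",
    8: "Genetic Registry - Unaffected",
    9: "Early Imaging (original study participants only)",
}
CONCOHORT_DECODE = {
    0: "non-PD, non-Prodromal, non-HC (participants to be excluded)",
    1: "Parkinson's Disease",
    2: "Healthy Control",
    3: "SWEDD",
    4: "Prodromal",
}

# Toy participant table; PATNO values and codes are invented.
status = pd.DataFrame({
    "PATNO": [1, 2, 3, 4, 5],
    "COHORT": [1, 2, 4, 1, 3],
    "CONCOHORT": [1, 2, 1, 0, 3],
})
status["COHORT_LABEL"] = status["COHORT"].map(COHORT_DECODE)
status["CONCOHORT_LABEL"] = status["CONCOHORT"].map(CONCOHORT_DECODE)

# Enrollment cohort (rows) against consensus cohort (columns).
table = pd.crosstab(status["COHORT_LABEL"], status["CONCOHORT_LABEL"])

# Participants whose consensus code differs from their enrollment code.
# Codes 0-4 share meanings across both lists; enrollment codes 7-9 have no
# CONCOHORT counterpart, so they would always show up here.
changed = status[status["COHORT"] != status["CONCOHORT"]]
print(table)
print(changed[["PATNO", "COHORT_LABEL", "CONCOHORT_LABEL"]])
```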

2. ANALYTIC COHORT
As part of the data harmonization in September 2021, there are analytic cohort assignments as well. This is a separate doc from LONI: PPMI_Consensus_Committee_Analytic_Datasets_08May2023.xlsx, available under:

START HERE:
Quick start
Consensus Committee Analytic Dataset

This is the document that has the consensus diagnosis and what is currently recommended by the committee to be used for all analyses. There are supporting docs with more descriptions under “Quick Start” as well. I’m not sure how it is related to primary diagnosis, but I’d be interested to see if someone has crosschecked the two documents.

I know that is a lot of information… hopefully that answered your question!

fbbriggs · July 26, 2023, 4:46pm

@ehutchins this is amazing - thank you for sharing!

hirotaka · July 27, 2023, 5:42pm

I agree with @fbbriggs, @ehutchins’s summary is really great!

I am also trying to familiarize myself with the new cohort assignment. Although the code book for PATIENT_STATUS has 7 values for COHORT, as @ehutchins showed, some of these values are not in the actual dataset, as @danieltds showed.

In my case, I need to analyze those who joined the study through the ordinary mechanism separately from those who were selectively recruited because they carry specific mutations. The latter group, the genetically enriched group, is very specific in terms of risk, population structure, and lifestyle, so the study would be confounded if the two were not separated. However, participants from these different recruitment mechanisms are mixed together in “Parkinson’s Disease” or “Prodromal” in the current assignment at the “Cohort” level. So far, the variable I refer to is the “Subgroup” column in the Consensus_Committee_Analytic_Datasets. Here is the comparison between AMP-PD’s study_arm (ver2.5) and PPMI’s new “Cohort” and “Subgroup” from the consensus datasets.

By looking at the Subgroup level, I could retrieve the original recruitment mechanisms, although there are a few exceptions. Also, as you see, there are people who are in AMP-PD but not in the Consensus datasets. Ideally, having a list of everyone who joined PPMI along with their recruitment arms would be great. Maybe it is available in LONI and I haven’t looked enough.
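A comparison of this kind can be sketched with an outer merge plus pandas' `indicator` flag, which marks the participants found in only one source. Everything in the toy tables is invented, including the arm and subgroup labels, and using PATNO as a key shared by both sources is an assumption; the real AMP-PD identifier may need to be mapped first.

```python
import pandas as pd

# Toy stand-ins; IDs and labels are invented for illustration.
amp_pd = pd.DataFrame({
    "PATNO": [1, 2, 3, 4],
    "study_arm": ["PD", "Prodromal", "Genetic PD", "Healthy Control"],
})
consensus = pd.DataFrame({
    "PATNO": [1, 2, 3, 5],
    "Cohort": ["PD", "Prodromal", "PD", "PD"],
    "Subgroup": ["Sporadic", "Hyposmia", "Genetic", "Sporadic"],
})

# The outer merge keeps everyone; the _merge column records the source.
both = amp_pd.merge(consensus, on="PATNO", how="outer", indicator=True)

only_amp = both.loc[both["_merge"] == "left_only", "PATNO"].tolist()
only_consensus = both.loc[both["_merge"] == "right_only", "PATNO"].tolist()

# Cross-tabulate recruitment arm against Subgroup for shared participants
# (rows with a missing value on either side are dropped by crosstab).
comparison = pd.crosstab(both["study_arm"], both["Subgroup"])
print(comparison)
print("AMP-PD only:", only_amp, "| Consensus only:", only_consensus)
```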

For PRIMDIAG, I am also not sure about its validity or its relationship with the Consensus assignment but, if I recall correctly, the value can change over time. You might get more info from the study protocol and the case report form here.

This is a very important topic, so please keep us updated on your data exploration @danieltds!

danieltds · July 30, 2023, 8:25pm

Thank you very much @ehutchins and @hirotaka! Your answers were very clarifying and explanatory, and I was able to see that there are more details that I still need to learn about patient designation. I’ll take advantage of my response to confirm whether I’ve understood correctly what you told us and to summarize the steps. If you notice any mistakes or have additional information, just let me know!

After all, how should I recognize each individual person in the PPMI?
1 - Go to LONI where the PPMI data is and download all the files available under ‘START_HERE’
2 - Download any data you deem necessary and want to analyze that contains the PATNO of the participants
3 - Use some technique to join databases based on PATNO using the (1) database that you want to analyze and the (2) specific spreadsheet of your interest present in PPMI_Consensus_Committee_Analytic_Datasets_08May2023 (downloaded in START_HERE - for example: using the spreadsheet of patients with PD)
4 - If you are analyzing PD, for example: If you aim to select patients with clinically manifest PD regardless of their genetic status, use the CONPD column for this. If you want to choose patients with idiopathic PD, use this same column but exclude individuals who also have positive results (1) in other genetic columns ([‘CONLRRK2’, ‘CONGBA’, ‘CONSNCA’, ‘CONPRKN’, ‘CONPINK1’])

More information can be found in the document called Guide to the PPMI analytic dataset (also present in START_HERE).
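Steps 3 and 4 can be sketched in pandas as follows. The two DataFrames are toy stand-ins (one for the PD sheet of the consensus workbook, one for the table being analyzed) with invented values, and the VALUE column is hypothetical; only PATNO, CONPD, and the genetic column names come from the steps above.

```python
import pandas as pd

GENETIC_COLS = ["CONLRRK2", "CONGBA", "CONSNCA", "CONPRKN", "CONPINK1"]

# Toy stand-in for the PD sheet of the consensus workbook (values invented).
consensus_pd = pd.DataFrame({
    "PATNO": [1, 2, 3, 4],
    "CONPD": [1, 1, 1, 0],
    "CONLRRK2": [0, 1, 0, 0],
    "CONGBA": [0, 0, 0, 0],
    "CONSNCA": [0, 0, 0, 0],
    "CONPRKN": [0, 0, None, 0],  # one missing flag, on purpose
    "CONPINK1": [0, 0, 0, 0],
})
# Toy stand-in for the table you want to analyze.
my_data = pd.DataFrame({"PATNO": [1, 2, 3, 5], "VALUE": [0.1, 0.2, 0.3, 0.5]})

is_pd = consensus_pd["CONPD"] == 1
# A missing genetic flag counts as "not positive" here; that is a choice.
has_variant = consensus_pd[GENETIC_COLS].eq(1).any(axis=1)

all_pd = consensus_pd.loc[is_pd, ["PATNO"]]
idiopathic_pd = consensus_pd.loc[is_pd & ~has_variant, ["PATNO"]]

# The inner join keeps only participants present in both tables.
analysis = my_data.merge(
    idiopathic_pd, on="PATNO", how="inner", validate="many_to_one"
)
print(analysis)
```

Participant 3 is kept as idiopathic only because the missing CONPRKN value is treated as not positive; requiring all five flags to be explicitly 0 would drop that participant and change the count.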

Specific Questions:
1 - The document PPMI_Consensus_Committee_Analytic_Datasets_08May2023 mentions there being 788 idiopathic PD, however, after selecting all patients with CONPD = 1 and excluding patients with a value of 1 in the genetic columns described above, I ended up with 791 patients - why did I have 3 more than described?
2 - From what I could see, the variables in Primary_Clinical_Diagnosis represent a clinical diagnosis given by the local investigator at the time they evaluated the patient, at each follow-up. It seems that the PPMI committee reviews each case where there is diagnostic doubt and reaches a consensus on whether the patient actually has PD. The number of suspected alternative diagnoses ends up being much higher than the number of changes the PPMI committee actually concludes, which, according to the document, was only 2. Attached, I leave a preliminary analysis with the latest available data for each patient regarding their “Primary Clinical Diagnosis,” so you can see that very few of these alternative diagnoses are adopted.
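Taking the latest available record per patient can be sketched as below. The table is a toy stand-in with invented values, and the INFODT column holding the assessment date as "MM/YYYY" is an assumption about the file layout, so check it against the actual download.

```python
import pandas as pd

# Toy stand-in for Primary_Clinical_Diagnosis; all values are invented.
# INFODT as an "MM/YYYY" assessment date is an assumption.
diag = pd.DataFrame({
    "PATNO": [1, 1, 1, 2, 2],
    "EVENT_ID": ["BL", "V04", "V06", "BL", "V04"],
    "INFODT": ["01/2015", "01/2016", "01/2017", "03/2015", "03/2016"],
    "PRIMDIAG": [1, 1, 1, 1, 5],
})
diag["INFODT_PARSED"] = pd.to_datetime(diag["INFODT"], format="%m/%Y")

# Keep the most recent record for each participant.
latest = diag.sort_values("INFODT_PARSED").groupby("PATNO").tail(1)

# Participants whose recorded diagnosis code differed between visits.
n_distinct = diag.groupby("PATNO")["PRIMDIAG"].nunique()
changed = n_distinct[n_distinct > 1].index.tolist()

print(latest[["PATNO", "EVENT_ID", "PRIMDIAG"]])
print("Diagnosis changed over time for PATNO:", changed)
```

Keeping the parsed date next to the diagnosis also makes it easier to compare the timing of a PRIMDIAG entry against other data points.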

I took the opportunity to combine the data from patients defined as without PD in PPMI_Consensus_Committee_Analytic_Datasets_08May2023 and who were not part of the genetic group with that of Primary Clinical Diagnosis, obtaining the following result:

My opinion: Using Primary Clinical Diagnosis to explore differences between PD patients and those with suspected alternative diagnoses is possible, but the PPMI committee most often does not agree with the change of diagnosis, which makes this analysis very exploratory.

Also, an additional question: is there any problem with revealing participants’ PATNOs in my posts on this forum? Does it go against any data sharing rule?

ehutchins · August 8, 2023, 11:45pm

Thank you for the additional information, @hirotaka ! That is an interesting point about prodromal - I appreciate the breakdown of the Subgroup, as I would consider the Prodromal Genetic group to be vastly different from Prodromal hyposmia or RBD.

RE: PRIMDIAG. I agree with @hirotaka and @danieltds - if using this data point, I would check the reported date on this in comparison to your other data points, as it can change over time.

@danieltds RE: PATNOs. As a general rule I do not share PATNO information or anything that you need a registered account to access, because I tend to err on the side of caution. This is point 3 in the PPMI DUA:

I will require anyone on my team who utilizes these data or anyone with whom I share
these data to comply with this Data Use Agreement by registering with the PPMI database
and agreeing to these terms.
