Hi,
my student and I are looking to leverage machine learning models and the PPMI dataset to examine the progression of various symptoms. As a first step, we linked each feature to its description in the data dictionary. However, we ran into a problem. We found features in the data that are not in the dictionary so we don’t know what they mean.
Here is the list of features we could not find: ‘MCAALTTM’, ‘MCACUBE’, ‘MCACLCKC’, ‘MCACLCKN’, ‘MCACLCKH’, ‘MCALION’, ‘MCARHINO’, ‘MCACAMEL’, ‘MCAFDS’, ‘MCABDS’, ‘MCASNTNC’, ‘MCAABSTR’, ‘MCAREC1’, ‘MCAREC2’, ‘MCAREC3’, ‘MCAREC4’, ‘MCAREC5’, ‘MCADATE’, ‘MCAMONTH’, ‘MCAYR’, ‘MCADAY’, ‘MCAPLACE’, ‘MCACITY’, ‘CLCKPII’, ‘CLCK2HND’, ‘CLCKNMRK’, ‘CLCKNUIN’, ‘CLCKALNU’, ‘CLCKNUSP’, ‘CLCKNUED’, ‘LNS1A’, ‘LNS1B’, ‘LNS1C’, ‘LNS2A’, ‘LNS2B’, ‘LNS2C’, ‘LNS3A’, ‘LNS3B’, ‘LNS3C’, ‘LNS4A’, ‘LNS4B’, ‘LNS4C’, ‘LNS5A’, ‘LNS5B’, ‘LNS5C’, ‘LNS6A’, ‘LNS6B’, ‘LNS6C’.
They seem to be individual items of certain tests (e.g., MCACUBE may be the cube drawing item from the MoCA) and they are all categorical.
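The check described above (finding data columns with no entry in the data dictionary) can be sketched in a few lines of plain Python. This is only a minimal illustration: the function name `undocumented_features` and the sample column names are hypothetical, and it assumes the dictionary's variable names have already been read into a list.

```python
def undocumented_features(data_columns, dictionary_names):
    """Return the data columns that have no entry in the data dictionary.

    Names are compared case-insensitively and with surrounding whitespace
    stripped, since exports and dictionaries do not always agree on casing.
    """
    documented = {name.strip().upper() for name in dictionary_names}
    return [col for col in data_columns if col.strip().upper() not in documented]


# Illustrative values only, not the real PPMI dictionary contents.
columns = ["PATNO", "MCACUBE", "LNS1A"]
dictionary = ["patno", "EVENT_ID"]
missing = undocumented_features(columns, dictionary)
print(missing)  # ['MCACUBE', 'LNS1A']
```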
Does anyone know where we could find the definition for these features?
Another question I would like to ask is: if those are in fact individual items of clinical scales, should we include them in our machine learning models since the scales are usually validated as a whole, not on an individual item basis?
Indeed, those variable names aren’t present in the data dictionary that comes with our downloaded datasets. However, most of the variables you listed appear to be individual items from the MoCA exam. I think that everything up to “CLCKNUED” is from the MoCA. After that, I’m not sure which datasets the other columns come from.
Regarding the definition of those terms, I think you need to look at both the “Montreal_Cognitive_Assessment__MoCA__” dataset and the original scale (can check here). The columns seem to follow the order in which each test is performed.
Regarding how you should treat these variables, it depends on your research interest. For example, not all types of dementia and cognitive deficits are the same. Patients with Alzheimer’s tend to present with more pronounced deficits in the memory and temporal and spatial orientation domains, with other domains also becoming impaired as the disease progresses. Regarding PD, I don’t know if there is a specific consensus, but as far as I recall, executive function, visuospatial cognition and attention are the most affected domains. Therefore, one option is to sum some of those individual items to calculate a score for each of those subdomains, and to evaluate how each MoCA domain predicts your outcome of interest, especially the domains most associated with PD-related cognitive decline. However, as you’ve mentioned, the scale as a whole is what we mostly utilize.
If you’re interested in this topic, I’ve found this article that could help you: Parkinson disease-associated cognitive impairment | Nature Reviews Disease Primers
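The subdomain-sum idea above could look roughly like the sketch below. The grouping in `MOCA_DOMAINS` is an assumption inferred from the variable names in this thread, not an official PPMI or MoCA mapping, and it covers only the columns listed in the question, so the sums may be partial. It also assumes each item is stored as a number where higher means better, which should be verified against the CRF before use.

```python
# Hypothetical grouping of MoCA item columns into subdomains, guessed from
# the variable names; verify against the CRF before relying on it.
MOCA_DOMAINS = {
    "visuospatial_executive": ["MCAALTTM", "MCACUBE", "MCACLCKC", "MCACLCKN", "MCACLCKH"],
    "naming": ["MCALION", "MCARHINO", "MCACAMEL"],
    "delayed_recall": ["MCAREC1", "MCAREC2", "MCAREC3", "MCAREC4", "MCAREC5"],
    "orientation": ["MCADATE", "MCAMONTH", "MCAYR", "MCADAY", "MCAPLACE", "MCACITY"],
}


def domain_scores(record, domains=MOCA_DOMAINS):
    """Sum item scores per subdomain for one visit record (a dict).

    A subdomain is scored as None if any of its items is missing, so that
    a partial sum is never mistaken for a complete one.
    """
    scores = {}
    for domain, items in domains.items():
        values = [record.get(item) for item in items]
        if any(value is None for value in values):
            scores[domain] = None
        else:
            scores[domain] = sum(values)
    return scores


# Made-up record for illustration; not real participant data.
visit = {"MCALION": 1, "MCARHINO": 0, "MCACAMEL": 1}
print(domain_scores(visit)["naming"])  # 2
```

Returning `None` for incomplete subdomains is a deliberate choice here; imputing or dropping such visits is a separate modelling decision.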
I also think that these are individual items from the MoCA, but I was wondering if anyone had official documentation on this. Also, as you said, the ones after “CLCKNUED” don’t seem to refer to the MoCA.
Regarding their inclusion in machine learning models, I am concerned about the explainability and interpretability of the models. For example, if an important feature of the future model is the total MoCA score, we can say that cognitive impairment is an important factor because the total score of the MoCA provides a measure of cognitive impairment. However, if an important feature of the model is the score for the cube drawing item in the MoCA, what does this mean? On its own, it’s not a measure of cognitive impairment or executive function. So, do you think having clear definitions of each feature in your dataset is necessary prior to moving to the actual data analysis?
I think that, if your focus is on evaluating cognitive impairment as a whole, I would just use the total MoCA score, and not the individual items or the cognitive domains within the score.
Thanks for the question @jf.daneault and feedback @danieltds !
As a reminder to other users who may be working with PPMI data, the data dictionary (and other guidance) can be accessed from the PPMI website (available here; screenshot below).
The annotated data dictionary has descriptions for most PPMI features, but some of them are better defined than others. Regarding the features you listed in your question, I would recommend tracing them back using the Clinical AM3 Case Report Form (CRF) Packet (available here; screenshot below).
Within the CRF, you can view the forms themselves which correspond to the fields you are referring to. There you can find the Montreal Cognitive Assessment (MoCA; pgs 110-111; screenshot below). You were correct in assuming that variable names beginning with ‘MC’ are referring to MoCA field values. For example, ‘MCAALTTM’ refers to the first MoCA question (‘Alternating Trail Making’).
Similarly, the variable names beginning with ‘LNS’ derive from the Letter-Number Sequencing CRF (pg 42; screenshot below). So for example, ‘LNS3C’ would be question 3c from the LNS CRF.
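As a rough illustration of the naming pattern described above, the sketch below tags a variable name with the CRF it most likely comes from. The helper `describe` and the prefix table are hypothetical. Only the MoCA and LNS prefixes are covered because those are the ones confirmed in this thread; anything else (for example the ‘CLCK’ columns) is flagged for a manual lookup in the CRF packet.

```python
import re

# Prefix -> CRF name, limited to the two forms confirmed in this thread.
CRF_BY_PREFIX = {
    "MCA": "Montreal Cognitive Assessment (MoCA)",
    "LNS": "Letter-Number Sequencing",
}

# LNS names follow the pattern LNS<question number><part letter>, e.g. LNS3C.
LNS_PATTERN = re.compile(r"^LNS(\d+)([A-Z])$")


def describe(name):
    """Return a short, best-guess description of where a variable comes from."""
    match = LNS_PATTERN.match(name)
    if match:
        number, part = match.groups()
        return f"Letter-Number Sequencing, question {number}{part.lower()}"
    for prefix, form in CRF_BY_PREFIX.items():
        if name.startswith(prefix):
            return f"{form} item"
    return "unresolved: look up in the CRF packet"


print(describe("LNS3C"))  # Letter-Number Sequencing, question 3c
```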
Hope this helps. Please also feel free to contact resources@michaeljfox.org with questions about PPMI (or other MJFF-sponsored) datasets!
That’s a great answer, Greg! Thank you for providing this input! I didn’t know that!