Hi, everyone!
I wanted to summarize our progress with the Parkinson’s Insight Engine (PIE). First, I want to thank @vcatterson and @ehutchins for their excellent work on this project. Victoria’s work on concomitant medications is hugely helpful.
PIE aims to provide a framework for getting up and running with analysis of the PPMI data as quickly as possible by automating the tedious parts (data loading, preprocessing, feature engineering, etc.) so researchers can focus on what matters: deriving insights from the data.
There’s still much work to be done on this project, but I’m happy that we’ve nearly finished the data loading and merging portion of this data processing pipeline, as it’s easily the most involved part. I imagine the remaining portions will be much easier to implement. Please let me know if you would like to contribute to this project!
Parkinson’s Insight Engine (PIE) - Project Summary
Overview
The Parkinson’s Insight Engine (PIE) is a Python framework designed to load, process, and integrate data from the Parkinson’s Progression Markers Initiative (PPMI) dataset. The framework provides a modular, extensible architecture for working with the diverse data types in the PPMI dataset, including clinical assessments, biospecimens, imaging, and more.
Key Components
1. Data Loader System
The core of the framework is a unified data loading system that:
- Provides a consistent interface for accessing all PPMI data types
- Allows selective loading of specific data modalities
- Handles directory structure and file format variations
- Implements robust error handling and logging
The main entry point is the DataLoader class in data_loader.py, which coordinates loading from various specialized loaders.
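To give a feel for the intended interface, here is a hypothetical usage sketch; the module path, constructor arguments, and modality names are assumptions, not the final API:

```python
# Hypothetical usage sketch -- the module path, constructor arguments, and
# modality names below are assumptions, not the final PIE API.
from pie.data_loader import DataLoader

# Point the loader at a local copy of the PPMI download and select modalities.
loader = DataLoader(data_dir="./PPMI")
data = loader.load(modalities=["subject_characteristics", "motor_assessments"])

# Each modality comes back as a pandas DataFrame keyed by patient and visit.
print(data["motor_assessments"].head())
```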
2. Specialized Data Loaders
The framework includes specialized loaders for different data modalities:
- Subject Characteristics (sub_char_loader.py): Loads demographic and baseline data
- Medical History (med_hist_loader.py): Loads medical history tables as separate dataframes
- Motor Assessments (motor_loader.py): Loads and merges MDS-UPDRS and other motor assessment data
- Non-Motor Assessments (non_motor_loader.py): Loads cognitive, psychiatric, and other non-motor assessments
- Biospecimen Data (biospecimen_loader.py): Loads proteomics, metabolomics, and other biomarker data
Each loader handles the specific nuances of its data type, including (see the sketch after this list):
- Finding and filtering relevant files
- Transforming data into analysis-ready formats
- Merging related tables when appropriate
- Handling duplicates and inconsistencies
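As a rough illustration of this shared find/transform/dedupe pattern, a simplified loader might look like the sketch below; the directory name, function name, and deduplication key are assumptions, not the actual implementation:

```python
# Simplified loader sketch, assuming the shared find/transform/dedupe pattern.
# The directory name and deduplication key are assumptions.
import glob
import os

import pandas as pd

def load_subject_characteristics(data_dir: str) -> pd.DataFrame:
    """Find the relevant CSVs, concatenate them, and drop duplicate rows."""
    pattern = os.path.join(data_dir, "Subject_Characteristics", "*.csv")
    frames = [pd.read_csv(path) for path in glob.glob(pattern)]
    if not frames:
        raise FileNotFoundError(f"No files matched {pattern}")
    combined = pd.concat(frames, ignore_index=True)
    # PATNO is the PPMI patient identifier; keep one row per patient.
    return combined.drop_duplicates(subset=["PATNO"])
```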
3. Data Processing Features
The loaders implement several sophisticated data processing techniques, illustrated in the sketch after this list:
- Smart Merging: Merges on [PATNO, EVENT_ID] when both are available, or just PATNO when necessary
- Column Deduplication: Intelligently combines duplicate columns that appear during merges
- Data Pivoting: Transforms long-format data (e.g., test name/value pairs) into wide-format (one column per test)
- Consistent Naming: Standardizes column names and adds prefixes to prevent collisions
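Here is a minimal sketch of the merge and pivot logic, assuming pandas dataframes with PPMI’s PATNO (patient ID) and EVENT_ID (visit) columns; the helper names and the TESTNAME/TESTVALUE column names are illustrative, not the actual implementation:

```python
# Sketch of smart merging and pivoting; helper names and the TESTNAME/TESTVALUE
# columns are illustrative assumptions.
import pandas as pd

def smart_merge(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Merge on [PATNO, EVENT_ID] when both frames carry EVENT_ID, else on PATNO."""
    keys = ["PATNO", "EVENT_ID"]
    if "EVENT_ID" not in left.columns or "EVENT_ID" not in right.columns:
        keys = ["PATNO"]
    return left.merge(right, on=keys, how="outer", suffixes=("", "_dup"))

def pivot_tests(df: pd.DataFrame) -> pd.DataFrame:
    """Turn long-format test name/value pairs into one column per test."""
    wide = df.pivot_table(
        index=["PATNO", "EVENT_ID"],
        columns="TESTNAME",
        values="TESTVALUE",
        aggfunc="first",
    )
    return wide.reset_index()
```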
4. Logging and Diagnostics
The framework includes comprehensive logging (see the setup sketch after this list) to track:
- Files being processed
- Data transformation steps
- Warnings about missing or problematic data
- Summary statistics of loaded data
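A minimal setup sketch using Python’s standard library logging module; the logger name and the example messages are illustrative, not actual PIE output:

```python
# Minimal logging setup sketch using the standard library; the logger name
# and the example messages are illustrative, not actual PIE output.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
)
logger = logging.getLogger("pie")

# The kinds of events the framework records:
logger.info("Loaded %d rows of motor assessment data", 4213)
logger.warning("EVENT_ID missing in %d rows; merging on PATNO only", 12)
```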
Current Implementation Status
So far, we have implemented:
- Core Framework:
  - Unified DataLoader class with modality selection
  - Consistent error handling and logging patterns
  - Support for different directory structures
- Completed Loaders:
  - Subject characteristics loader
  - Medical history loader
  - Motor assessments loader
  - Non-motor assessments loader
  - Initial biospecimen loader (Project 151 pQTL data); I’m currently working on this one
- Data Processing:
  - Column deduplication logic (sketched below)
  - Suffix sanitization for merged dataframes
  - Pivot table transformations for test/value pairs
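As a sketch of the column deduplication idea, assuming merges tag duplicate columns with a "_dup" suffix (the suffix and helper name are assumptions):

```python
# Sketch of column deduplication, assuming merges tag duplicates with a
# "_dup" suffix; the suffix and helper name are assumptions.
import pandas as pd

def dedupe_columns(df: pd.DataFrame, suffix: str = "_dup") -> pd.DataFrame:
    """Fold each '<col>_dup' column back into '<col>': keep the original value,
    fill gaps from the duplicate, then drop the extra column."""
    for col in [c for c in df.columns if c.endswith(suffix)]:
        base = col[: -len(suffix)]
        if base in df.columns:
            df[base] = df[base].combine_first(df[col])
            df = df.drop(columns=[col])
        else:
            df = df.rename(columns={col: base})
    return df
```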
Next Steps
Planned enhancements include:
- Additional Biospecimen Loaders: Implement loaders for remaining biospecimen data types
- Imaging Data Support: Add loaders for MRI, DaTscan, and other imaging modalities, and integrate the MATLAB/shell-script code @AmgadDroby and his team have written into this Python framework
- Wearables Data: Implement loaders for accelerometer and other sensor data
- Data Validation: Add validation checks to ensure data quality
- Data Preprocessing: Develop standardized preprocessing pipelines
- Analysis Modules: Feature extraction, feature selection, classification/regression, and data visualization
- Documentation: Create comprehensive documentation and usage examples