Hi, everyone!
I wanted to summarize our progress with the Parkinson’s Insight Engine (PIE). First, I want to thank @vcatterson and @ehutchins for their excellent work on this project. Victoria’s work on concomitant medications is hugely helpful.
PIE aims to provide a framework for getting up and running with analysis of the PPMI data as quickly as possible by automating the tedious parts (data loading, preprocessing, feature engineering, etc.) so researchers can focus on what matters: deriving insights from the data.
There’s still much work to be done on this project, but I’m happy that we’ve nearly finished the data loading and merging portion of this data processing pipeline, as it’s easily the most involved part. I imagine the remaining portions will be much easier to implement. Please let me know if you would like to contribute to this project!
Parkinson’s Insight Engine (PIE) - Project Summary
Overview
The Parkinson’s Insight Engine (PIE) is a Python framework designed to load, process, and integrate data from the Parkinson’s Progression Markers Initiative (PPMI) dataset. The framework provides a modular, extensible architecture for working with the diverse data types in the PPMI dataset, including clinical assessments, biospecimens, imaging, and more.
Key Components
1. Data Loader System
The core of the framework is a unified data loading system that:
- Provides a consistent interface for accessing all PPMI data types
- Allows selective loading of specific data modalities
- Handles directory structure and file format variations
- Implements robust error handling and logging
The main entry point is the DataLoader class in data_loader.py, which coordinates loading from various specialized loaders.
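To give a feel for the intended interface, here is a hypothetical usage sketch; the module path, constructor arguments, and modality names are assumptions, not the final API:

```python
# Hypothetical usage sketch -- the module path, constructor arguments, and
# modality names below are assumptions, not the final PIE API.
from pie.data_loader import DataLoader

# Point the loader at a local copy of the PPMI download and select modalities.
loader = DataLoader(data_dir="./PPMI")
data = loader.load(modalities=["subject_characteristics", "motor_assessments"])

# Each modality comes back as a pandas DataFrame keyed by patient and visit.
print(data["motor_assessments"].head())
```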
2. Specialized Data Loaders
The framework includes specialized loaders for different data modalities:
- Subject Characteristics (sub_char_loader.py): Loads demographic and baseline data
- Medical History (med_hist_loader.py): Loads medical history tables as separate dataframes
- Motor Assessments (motor_loader.py): Loads and merges MDS-UPDRS and other motor assessment data
- Non-Motor Assessments (non_motor_loader.py): Loads cognitive, psychiatric, and other non-motor assessments
- Biospecimen Data (biospecimen_loader.py): Loads proteomics, metabolomics, and other biomarker data
Each loader handles the specific nuances of its data type, including (see the sketch after this list):
- Finding and filtering relevant files
- Transforming data into analysis-ready formats
- Merging related tables when appropriate
- Handling duplicates and inconsistencies
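As a rough illustration of this shared find/transform/dedupe pattern, a simplified loader might look like the sketch below; the directory name, function name, and deduplication key are assumptions, not the actual implementation:

```python
# Simplified loader sketch, assuming the shared find/transform/dedupe pattern.
# The directory name and deduplication key are assumptions.
import glob
import os

import pandas as pd

def load_subject_characteristics(data_dir: str) -> pd.DataFrame:
    """Find the relevant CSVs, concatenate them, and drop duplicate rows."""
    pattern = os.path.join(data_dir, "Subject_Characteristics", "*.csv")
    frames = [pd.read_csv(path) for path in glob.glob(pattern)]
    if not frames:
        raise FileNotFoundError(f"No files matched {pattern}")
    combined = pd.concat(frames, ignore_index=True)
    # PATNO is the PPMI patient identifier; keep one row per patient.
    return combined.drop_duplicates(subset=["PATNO"])
```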
3. Data Processing Features
The loaders implement several sophisticated data processing techniques, illustrated in the sketch after this list:
- Smart Merging: Merges on [PATNO, EVENT_ID] when both are available, or just PATNO when necessary
- Column Deduplication: Intelligently combines duplicate columns that appear during merges
- Data Pivoting: Transforms long-format data (e.g., test name/value pairs) into wide-format (one column per test)
- Consistent Naming: Standardizes column names and adds prefixes to prevent collisions
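Here is a minimal sketch of the merge and pivot logic, assuming pandas dataframes with PPMI’s PATNO (patient ID) and EVENT_ID (visit) columns; the helper names and the TESTNAME/TESTVALUE column names are illustrative, not the actual implementation:

```python
# Sketch of smart merging and pivoting; helper names and the TESTNAME/TESTVALUE
# columns are illustrative assumptions.
import pandas as pd

def smart_merge(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Merge on [PATNO, EVENT_ID] when both frames carry EVENT_ID, else on PATNO."""
    keys = ["PATNO", "EVENT_ID"]
    if "EVENT_ID" not in left.columns or "EVENT_ID" not in right.columns:
        keys = ["PATNO"]
    return left.merge(right, on=keys, how="outer", suffixes=("", "_dup"))

def pivot_tests(df: pd.DataFrame) -> pd.DataFrame:
    """Turn long-format test name/value pairs into one column per test."""
    wide = df.pivot_table(
        index=["PATNO", "EVENT_ID"],
        columns="TESTNAME",
        values="TESTVALUE",
        aggfunc="first",
    )
    return wide.reset_index()
```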
4. Logging and Diagnostics
The framework includes comprehensive logging (see the setup sketch after this list) to track:
- Files being processed
- Data transformation steps
- Warnings about missing or problematic data
- Summary statistics of loaded data
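A minimal setup sketch using Python’s standard library logging module; the logger name and the example messages are illustrative, not actual PIE output:

```python
# Minimal logging setup sketch using the standard library; the logger name
# and the example messages are illustrative, not actual PIE output.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
)
logger = logging.getLogger("pie")

# The kinds of events the framework records:
logger.info("Loaded %d rows of motor assessment data", 4213)
logger.warning("EVENT_ID missing in %d rows; merging on PATNO only", 12)
```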
Current Implementation Status
So far, we have implemented:
- Core Framework:
  - Unified DataLoader class with modality selection
  - Consistent error handling and logging patterns
  - Support for different directory structures
- Completed Loaders:
  - Subject characteristics loader
  - Medical history loader
  - Motor assessments loader
  - Non-motor assessments loader
  - Initial biospecimen loader (Project 151 pQTL data); I’m currently working on this one
- Data Processing:
  - Column deduplication logic (sketched below)
  - Suffix sanitization for merged dataframes
  - Pivot table transformations for test/value pairs
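As a sketch of the column deduplication idea, assuming merges tag duplicate columns with a "_dup" suffix (the suffix and helper name are assumptions):

```python
# Sketch of column deduplication, assuming merges tag duplicates with a
# "_dup" suffix; the suffix and helper name are assumptions.
import pandas as pd

def dedupe_columns(df: pd.DataFrame, suffix: str = "_dup") -> pd.DataFrame:
    """Fold each '<col>_dup' column back into '<col>': keep the original value,
    fill gaps from the duplicate, then drop the extra column."""
    for col in [c for c in df.columns if c.endswith(suffix)]:
        base = col[: -len(suffix)]
        if base in df.columns:
            df[base] = df[base].combine_first(df[col])
            df = df.drop(columns=[col])
        else:
            df = df.rename(columns={col: base})
    return df
```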
Next Steps
Planned enhancements include:
- Additional Biospecimen Loaders: Implement loaders for remaining biospecimen data types
- Imaging Data Support: Add loaders for MRI, DaTscan, and other imaging modalities, and integrate the MATLAB/shell-script code @AmgadDroby and his team have written into this Python framework
- Wearables Data: Implement loaders for accelerometer and other sensor data
- Data Validation: Add validation checks to ensure data quality
- Data Preprocessing: Develop standardized preprocessing pipelines
- Analysis Modules: Feature extraction, feature selection, classification/regression, and data visualization
- Documentation: Create comprehensive documentation and usage examples