Parkinson's Insight Engine (PIE)

Hi, everyone!

I wanted to summarize our progress with the Parkinson’s Insight Engine (PIE). First, I want to thank @vcatterson and @ehutchins for their excellent work on this project. Victoria’s work on concomitant medications is hugely helpful.

PIE aims to provide a framework for getting up and running with analysis of the PPMI data as quickly as possible by automating the tedious parts (data loading, preprocessing, feature engineering, etc.), so the researcher can focus on what matters: deriving insights from the data.

There’s still much work to be done on this project, but I’m happy that we’ve nearly finished the data loading and merging portion of this data processing pipeline, as it’s easily the most involved part. I imagine the remaining portions will be much easier to implement. Please let me know if you would like to contribute to this project!

Parkinson’s Insight Engine (PIE) - Project Summary

Overview

The Parkinson’s Insight Engine (PIE) is a Python framework designed to load, process, and integrate data from the Parkinson’s Progression Markers Initiative (PPMI) dataset. The framework provides a modular, extensible architecture for working with the diverse data types in the PPMI dataset, including clinical assessments, biospecimens, imaging, and more.

Key Components

1. Data Loader System

The core of the framework is a unified data loading system that:

  • Provides a consistent interface for accessing all PPMI data types
  • Allows selective loading of specific data modalities
  • Handles directory structure and file format variations
  • Implements robust error handling and logging

The main entry point is the DataLoader class in data_loader.py, which coordinates loading from various specialized loaders.
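To illustrate the coordinator pattern described above, here is a minimal, self-contained sketch of a unified loader dispatching to specialized per-modality loaders. All names here (the `LOADERS` registry, the `modalities` parameter, the stand-in loader functions and their return values) are illustrative assumptions, not PIE's actual API:

```python
import logging

logger = logging.getLogger("pie")

# Illustrative stand-ins for the specialized loaders; in PIE these live in
# modules such as sub_char_loader.py and motor_loader.py.
def load_subject_characteristics(data_dir):
    logger.info("Loading subject characteristics from %s", data_dir)
    return {"PATNO": [1001, 1002], "SEX": ["M", "F"]}

def load_motor_assessments(data_dir):
    logger.info("Loading motor assessments from %s", data_dir)
    return {"PATNO": [1001, 1002], "NP3TOT": [12, 27]}

class DataLoader:
    """Coordinates specialized loaders behind one consistent interface."""

    # Registry mapping modality names to loader functions.
    LOADERS = {
        "subject_characteristics": load_subject_characteristics,
        "motor": load_motor_assessments,
    }

    def __init__(self, data_dir):
        self.data_dir = data_dir

    def load(self, modalities=None):
        """Load the requested modalities (all registered ones by default)."""
        modalities = modalities or list(self.LOADERS)
        results = {}
        for name in modalities:
            loader = self.LOADERS.get(name)
            if loader is None:
                logger.warning("Unknown modality %r; skipping", name)
                continue
            results[name] = loader(self.data_dir)
        return results

loader = DataLoader("./PPMI")
data = loader.load(modalities=["motor"])
print(sorted(data))  # only the requested modality is loaded
```

The registry design makes selective loading and error handling uniform: adding a new modality is one entry in the table, and unknown names are logged rather than raising.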

2. Specialized Data Loaders

The framework includes specialized loaders for different data modalities:

  • Subject Characteristics (sub_char_loader.py): Loads demographic and baseline data

  • Medical History (med_hist_loader.py): Loads medical history tables as separate dataframes

  • Motor Assessments (motor_loader.py): Loads and merges MDS-UPDRS and other motor assessment data

  • Non-Motor Assessments (non_motor_loader.py): Loads cognitive, psychiatric, and other non-motor assessments

  • Biospecimen Data (biospecimen_loader.py): Loads proteomics, metabolomics, and other biomarker data

Each loader handles the specific nuances of its data type, including:

  • Finding and filtering relevant files
  • Transforming data into analysis-ready formats
  • Merging related tables when appropriate
  • Handling duplicates and inconsistencies
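As a sketch of the first step (finding and filtering relevant files), a specialized loader might locate its CSVs by regex pattern rather than exact filename, since PPMI exports often carry date-stamped names. The function name, patterns, and filenames below are illustrative assumptions, not PIE's actual code:

```python
import re
import tempfile
from pathlib import Path

def find_relevant_files(data_dir, patterns):
    """Return CSV files under data_dir whose names match any regex pattern."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return sorted(
        path for path in Path(data_dir).rglob("*.csv")
        if any(rx.search(path.name) for rx in compiled)
    )

# Example: build a small fake directory tree and filter it.
tmp = Path(tempfile.mkdtemp())
for name in ["MDS-UPDRS_Part_III_12Jan2024.csv", "Demographics_05Feb2024.csv"]:
    (tmp / name).touch()

motor_files = find_relevant_files(tmp, [r"MDS-UPDRS"])
print([p.name for p in motor_files])  # ['MDS-UPDRS_Part_III_12Jan2024.csv']
```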

3. Data Processing Features

The loaders implement several sophisticated data processing techniques:

  • Smart Merging: Merges on [PATNO, EVENT_ID] when both are available, or just PATNO when necessary
  • Column Deduplication: Intelligently combines duplicate columns that appear during merges
  • Data Pivoting: Transforms long-format data (e.g., test name/value pairs) into wide-format (one column per test)
  • Consistent Naming: Standardizes column names and adds prefixes to prevent collisions
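The first, second, and fourth techniques can be sketched together with pandas. The `smart_merge` helper below is an illustrative assumption of the approach (merge keys chosen per-frame, prefixing to avoid collisions, duplicate columns combined), not PIE's actual implementation:

```python
import pandas as pd

def smart_merge(left, right, prefix=None):
    """Merge on [PATNO, EVENT_ID] when both frames have EVENT_ID,
    otherwise fall back to PATNO alone."""
    keys = ["PATNO"]
    if "EVENT_ID" in left.columns and "EVENT_ID" in right.columns:
        keys.append("EVENT_ID")
    # Consistent naming: prefix non-key columns to prevent collisions.
    if prefix:
        right = right.rename(
            columns={c: f"{prefix}_{c}" for c in right.columns if c not in keys}
        )
    merged = left.merge(right, on=keys, how="outer", suffixes=("", "_dup"))
    # Column deduplication: fill gaps in the original column from its twin.
    for col in [c for c in merged.columns if c.endswith("_dup")]:
        base = col[: -len("_dup")]
        merged[base] = merged[base].combine_first(merged[col])
        merged = merged.drop(columns=col)
    return merged

visits = pd.DataFrame({"PATNO": [1, 1], "EVENT_ID": ["BL", "V04"], "NP3TOT": [10, 14]})
labs = pd.DataFrame({"PATNO": [1, 1], "EVENT_ID": ["BL", "V04"], "RESULT": [0.8, 0.9]})
print(smart_merge(visits, labs, prefix="LAB").columns.tolist())
# ['PATNO', 'EVENT_ID', 'NP3TOT', 'LAB_RESULT']
```

Merging a visit-level table with a subject-level table (no `EVENT_ID`) falls back to `PATNO` alone, so static attributes are broadcast across a subject's visits.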

4. Logging and Diagnostics

The framework includes comprehensive logging to track:

  • Files being processed
  • Data transformation steps
  • Warnings about missing or problematic data
  • Summary statistics of loaded data

Current Implementation Status

So far, we have implemented:

  1. Core Framework:

  • Unified DataLoader class with modality selection
  • Consistent error handling and logging patterns
  • Support for different directory structures

  2. Completed Loaders:

  • Subject characteristics loader
  • Medical history loader
  • Motor assessments loader
  • Non-motor assessments loader
  • Initial biospecimen loader (Project 151 pQTL data; still in progress)

  3. Data Processing:

  • Column deduplication logic
  • Suffix sanitization for merged dataframes
  • Pivot table transformations for test/value pairs
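The pivot transformation (long test name/value pairs to one column per test) can be sketched with pandas. The column names and values below are illustrative placeholders, not actual PPMI fields or measurements:

```python
import pandas as pd

# Long-format biospecimen results: one row per (patient, visit, test).
long_df = pd.DataFrame({
    "PATNO": [1001, 1001, 1002, 1002],
    "EVENT_ID": ["BL", "BL", "BL", "BL"],
    "TESTNAME": ["ALPHA_SYN", "TOTAL_TAU", "ALPHA_SYN", "TOTAL_TAU"],
    "TESTVALUE": [1520.5, 168.0, 1401.2, 151.3],
})

# Wide format: one column per test, indexed by patient and visit.
wide_df = (
    long_df
    .pivot_table(index=["PATNO", "EVENT_ID"],
                 columns="TESTNAME",
                 values="TESTVALUE",
                 aggfunc="first")  # keep the first value if duplicates exist
    .reset_index()
)
wide_df.columns.name = None  # drop the residual "TESTNAME" axis label
print(wide_df.columns.tolist())  # ['PATNO', 'EVENT_ID', 'ALPHA_SYN', 'TOTAL_TAU']
```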

Next Steps

Planned enhancements include:

  • Additional Biospecimen Loaders: Implement loaders for remaining biospecimen data types
  • Imaging Data Support: Add loaders for MRI, DaTscan, and other imaging modalities. Integrate the MATLAB/shell script code @AmgadDroby and his team have written into this Python framework
  • Wearables Data: Implement loaders for accelerometer and other sensor data
  • Data Validation: Add validation checks to ensure data quality
  • Data Preprocessing: Develop standardized preprocessing pipelines
  • Analysis Modules: Feature extraction, feature selection, classification/regression, and data visualization
  • Documentation: Create comprehensive documentation and usage examples

This initiative seems great, @cameronreidhamilton! In all my analyses of this data, the cleaning, harmonization, and integration steps have indeed taken me quite some time, so a tool that handles them will be invaluable for all future analyses. Do you have an expected release date, so that I can see this in action?

I also suggest one possible enhancement (for a later version): some code I created as part of the Data Modality and Methodology Task Force project would be worth incorporating into that pipeline. Here follows the link


Hi Everyone!
PIE is now ready to use. Please check it out! If you run into any issues, please file them on the GitHub repo and I will resolve them.

I’m excited to see what you think.

Thanks!
Cameron
