Parkinson's Disease Detection
Oct 2022 → Jan 2023
A comparison of nine machine learning classifiers trained on biomedical voice features to detect Parkinson's disease from sustained vowel recordings, identifying which algorithm best captures the acoustic signatures of the condition.

Not the Neurons in ML models
Overview
Parkinson's disease affects motor control, including the fine muscles of the larynx, producing measurable changes in the frequency, amplitude, and harmonic structure of the voice. This project trains nine supervised classifiers and one unsupervised method on a dataset of 195 voice recordings to determine which approach best identifies the disease from acoustic features alone.
Problem
Parkinson's disease degrades phonation in ways that are not audible to the untrained ear but are quantifiable in the signal. Detecting these changes early can support diagnosis before motor symptoms become severe. The challenge is determining which classification approach reads these acoustic signatures most reliably, given a small, high-dimensional dataset.
Approach
Rather than training a single model and reporting its accuracy, this project trains nine supervised algorithms and one unsupervised method on the same dataset with the same evaluation methodology. Because the data and the evaluation are held fixed, differences in score reflect the algorithms rather than the setup. That makes the comparison fair and helps explain why some approaches work better than others for this type of problem.
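A minimal sketch of what "same dataset, same evaluation methodology" can look like in scikit-learn. The write-up does not state whether a hold-out split or cross-validation was used, so the 5-fold stratified cross-validation here is an assumption, as are all hyperparameters. The data is a synthetic stand-in generated with `make_classification`, not the real voice recordings, so the printed scores say nothing about the project's results. Only the seven scikit-learn models named in this write-up are included; XGBoost and LightGBM provide scikit-learn-compatible classifiers (`XGBClassifier`, `LGBMClassifier`) that could be added to the same dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (assumption): same shape as the voice data, roughly 75% positives.
X, y = make_classification(
    n_samples=195, n_features=22, n_informative=8,
    weights=[0.25, 0.75], random_state=0,
)

# One fixed splitter shared by every model, so each sees identical folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Scale-sensitive models are wrapped in a pipeline so scaling is fit per fold.
models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "K-Nearest Neighbour": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Perceptron": make_pipeline(StandardScaler(), Perceptron(random_state=0)),
    "Gaussian Naive Bayes": GaussianNB(),
}


def compare_models(models, X, y, cv):
    """Return {name: (mean accuracy, std of accuracy)} over the shared folds."""
    results = {}
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        results[name] = (scores.mean(), scores.std())
    return results


results = compare_models(models, X, y, cv)
for name, (mean, std) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(f"{name:22s} {mean:.3f} +/- {std:.3f}")
```

Reporting the standard deviation next to the mean matters on a dataset this small, because a gap of a few points between two models can fall within fold-to-fold noise.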
The dataset consists of 195 voice recordings from 31 subjects (23 with Parkinson's, 8 without). Each recording is described by 22 acoustic features (24 columns once the recording identifier and the health-status label are counted): fundamental frequency measures, jitter measures for cycle-to-cycle frequency variation, shimmer measures for amplitude variation, noise-to-harmonics and harmonics-to-noise ratios, and nonlinear dynamical measures derived from chaos and fractal geometry theory.
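To make the jitter and shimmer features concrete, the sketch below computes the common "local" variants: the mean absolute difference between consecutive cycles, divided by the mean over all cycles. This is an illustration only. The dataset's values were produced by dedicated voice-analysis software whose exact algorithms may differ, and the function names and example numbers are hypothetical.

```python
import numpy as np


def _relative_cycle_variation(values):
    """Mean absolute difference between consecutive values, relative to their mean."""
    values = np.asarray(values, dtype=float)
    if values.size < 2:
        raise ValueError("need at least two cycles")
    return np.mean(np.abs(np.diff(values))) / np.mean(values)


def local_jitter(periods):
    """Cycle-to-cycle variation of the glottal period (a frequency perturbation)."""
    return _relative_cycle_variation(periods)


def local_shimmer(amplitudes):
    """Cycle-to-cycle variation of the peak amplitude (an amplitude perturbation)."""
    return _relative_cycle_variation(amplitudes)


# Hypothetical example: four consecutive cycles of a sustained vowel.
periods_s = [0.0100, 0.0102, 0.0099, 0.0101]   # period of each cycle, in seconds
peak_amplitudes = [0.80, 0.78, 0.83, 0.79]     # peak amplitude of each cycle

print(f"jitter:  {100 * local_jitter(periods_s):.2f} %")
print(f"shimmer: {100 * local_shimmer(peak_amplitudes):.2f} %")
```

A perfectly periodic voice would score zero on both measures; less stable phonation pushes the values up.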
The supervised models are Random Forest, Logistic Regression, Decision Tree, K-Nearest Neighbour, Support Vector Machine, Perceptron, Gaussian Naive Bayes, LightGBM, and XGBoost. K-Means Clustering is included as an unsupervised baseline to reveal whether the two populations are geometrically separable without labels.
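The unsupervised baseline can be sketched as follows: cluster the standardised features into two groups without showing the labels, then check afterwards how well the clusters line up with the true classes. The data is again a synthetic stand-in, not the voice recordings, and the choice of the adjusted Rand index as the agreement measure is an assumption rather than something stated in this write-up.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 195 rows, 22 features, roughly 75% positives.
X, y = make_classification(
    n_samples=195, n_features=22, n_informative=8,
    weights=[0.25, 0.75], random_state=0,
)

# K-Means relies on Euclidean distance, so features are put on a common scale first.
X_scaled = StandardScaler().fit_transform(X)

# Labels are never passed to the clustering step.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X_scaled)

# Adjusted Rand index: about 0 for chance-level grouping, 1 for a perfect match.
ari = adjusted_rand_score(y, clusters)

# Cluster ids are arbitrary, so take the better of the two possible label mappings.
raw_match = (clusters == y).mean()
agreement = max(raw_match, 1.0 - raw_match)

print(f"adjusted Rand index: {ari:.3f}")
print(f"best-mapping agreement: {agreement:.3f}")
```

An index near zero would suggest the two populations do not form separate clusters in feature space, even if a supervised model can still find a boundary between them.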
Results
The Support Vector Machine achieved the highest accuracy of the models compared, exceeding 80%. This is consistent with the broader literature on SVMs applied to the MDVP voice dataset. An SVM finds a maximum-margin separating hyperplane, and with a kernel it does so in an implicit high-dimensional feature space, which suits the geometry that acoustic biomarkers produce.
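A hedged sketch of a kernel SVM set up for data of this shape. The write-up does not say which kernel, hyperparameters, or split were used, so the RBF kernel, the small grid over `C` and `gamma`, and the 75/25 stratified hold-out are all assumptions. The data is synthetic, so the printed numbers are not the project's results. A majority-class baseline is included because the classes are imbalanced, which gives the accuracy figure a reference point.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in (assumption): 195 rows, 22 features, roughly 75% positives.
X, y = make_classification(
    n_samples=195, n_features=22, n_informative=8,
    weights=[0.25, 0.75], random_state=0,
)

# Stratify so the train and test sets keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0,
)

# Scaling lives inside the pipeline, so it is fit on training folds only.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Illustrative grid: C trades margin width against training errors,
# gamma controls how local the RBF kernel is.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", 0.01, 0.1],
}
search = GridSearchCV(svm, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Reference point: always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

svm_acc = search.score(X_test, y_test)
baseline_acc = baseline.score(X_test, y_test)

print("best params:", search.best_params_)
print(f"SVM test accuracy:      {svm_acc:.3f}")
print(f"majority-class baseline: {baseline_acc:.3f}")
```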
The 195-sample dataset is small enough that differences in performance between models say something about how well each model's inductive bias fits the structure of the data. Seeing an SVM outperform a perceptron first-hand on a small, high-dimensional dataset is different from reading about it in a textbook.
Tech Stack
- Language: Python
- Libraries: Scikit-learn, XGBoost, LightGBM, NumPy, Pandas
- Dataset: UCI Parkinsons dataset (UCI Machine Learning Repository)