A Comparative Study of Principal Component Analysis with Ensemble Learning for Classification of Medical Data
Abstract
Dimensionality reduction is a critical component in the analysis of medical data, particularly when addressing challenges such as multicollinearity, noise, and high-dimensional feature spaces that can degrade classification performance. While principal component analysis (PCA) is a traditional choice, its utility in medical datasets is often hindered by sensitivity to outliers and corrupted observations and by low interpretability, since each principal component is a linear combination of all original variables. This research compares PCA, robust PCA (RPCA), and sparse PCA (SPCA), each combined with random forest (RF) and extremely randomized trees (ERT) classifiers. A simulation study revealed that while all PCA variants struggle with low class separation, RPCA and SPCA significantly outperform standard PCA in the presence of outliers. The empirical study used a diabetes dataset that underwent thorough preprocessing, including median imputation, normalization, and the synthetic minority over-sampling technique (SMOTE) to address class imbalance. The RPCA regularization parameter and the SPCA sparsity parameter were selected by cross-validation based on the area under the receiver operating characteristic (ROC) curve (AUC), while RF and ERT hyperparameters were optimized using a two-stage random and grid search approach. The empirical results show that the RPCA-ERT model performed best, achieving an accuracy of 0.8954 and a sensitivity of 0.9434, underscoring its effectiveness in managing contaminated medical data.
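As a rough illustration of the kind of workflow summarized above, the following minimal sketch chains median imputation, scaling, sparse PCA, and an extremely randomized trees classifier in scikit-learn, and picks the SPCA sparsity parameter by cross-validated AUC. It is not the authors' implementation. The synthetic data, the use of `StandardScaler` for the normalization step, the number of components, and the candidate `alpha` values are all assumptions made for illustration. SMOTE (available in the separate imbalanced-learn package) and RPCA (not part of scikit-learn) are omitted to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import SparsePCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the diabetes dataset from the study).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Blank out about 5% of entries so the median-imputation step has work to do.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan


def build_pipeline(alpha):
    """Impute -> scale -> sparse PCA -> extremely randomized trees."""
    return Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),  # assumed form of "normalization"
        ("spca", SparsePCA(n_components=4, alpha=alpha, random_state=0)),
        ("ert", ExtraTreesClassifier(n_estimators=100, random_state=0)),
    ])


# Compare candidate sparsity parameters by mean cross-validated AUC.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_by_alpha = {}
for alpha in (0.1, 0.5):  # hypothetical candidate values
    scores = cross_val_score(
        build_pipeline(alpha), X, y, cv=cv, scoring="roc_auc"
    )
    auc_by_alpha[alpha] = scores.mean()

best_alpha = max(auc_by_alpha, key=auc_by_alpha.get)
print("mean AUC per alpha:", auc_by_alpha)
print("selected alpha:", best_alpha)
```

Wrapping every step in a single `Pipeline` means the imputer, scaler, and SPCA are refit on each training fold, so no information from the held-out fold leaks into the AUC estimate. Any resampling step such as SMOTE would likewise need to be applied to the training folds only.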