Abstract
Bias in Machine Learning Associated With Weak Baselines, Data Leakage, and Inadequate Measures Reporting
Randall J. Ellis,1 Chirag J. Patel1
Objective
New omics modalities hold promise for biomarker discovery for disease prediction. Accessible biobank samples have fueled a large literature, some of which may be irreproducible, generalize poorly, or report inflated results due to data leakage,1 ie, spurious relationships between input and target variables that arise as artifacts of data collection, sampling, or preprocessing. These biases frequently cause models developed in one context to fail to generalize to real-world contexts. Here, we demonstrated and quantified bias in biomedical multiomic machine learning results due to irrelevant or unreported baselines, data leakage, and inadequate performance metrics reporting.
Design
In November 2024, using data from the UK Biobank, we conducted a cohort study of all 607 disease outcomes defined in the dataset, using blood plasma proteomics data (collected from 2006 to 2010) and demographics data. We demonstrated how simple demographics (age, sex, education) compare competitively with putatively novel models that incorporate an expansive number of omics input features, how randomly chosen omics features compare with those chosen according to data-driven feature selection methods, how data leakage affects performance (eg, normalizing the data before making train-test cross-validation splits), and how presenting the area under the receiver operating characteristic curve (AUROC) alone gives a biased view of performance. We presented additional case studies using baseline risk calculators for cardiovascular disease and atherosclerotic cardiovascular disease to assess how omics factors compared with traditional risk scores for associations with 10-year risk of heart disease or stroke. We used a stratified cross-validation approach and assessed the AUROC, sensitivity, specificity, and positive and negative predictive values.
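The leakage example named above (normalizing before making cross-validation splits) can be illustrated with a minimal sketch. It is not the study's pipeline: the function names, the single synthetic feature `protein`, the 20% prevalence, and the choice of 5 folds are all assumptions made for illustration. The point it shows is that the scaling parameters are estimated from the training rows of each fold only and then applied to the held-out rows.

```python
import random
import statistics


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds so each fold has a similar class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal each class round-robin
    return folds


def fit_scaler(values):
    """Estimate (mean, std) from the given values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # guard against zero variance
    return mean, std


def scale(values, mean, std):
    return [(v - mean) / std for v in values]


def leakage_free_cv(feature, labels, k=5, seed=0):
    """Yield per-fold scaled data; the scaler never sees the held-out rows."""
    for test_idx in stratified_folds(labels, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        mean, std = fit_scaler([feature[i] for i in train_idx])
        train_scaled = scale([feature[i] for i in train_idx], mean, std)
        test_scaled = scale([feature[i] for i in test_idx], mean, std)
        yield train_scaled, test_scaled, train_idx, test_idx


# Synthetic stand-in data (hypothetical): 20 cases, 80 controls, one feature.
rng = random.Random(42)
labels = [1] * 20 + [0] * 80
protein = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]

# Leaky variant, shown only for contrast: statistics computed on ALL rows,
# so information from every future test fold enters the preprocessing.
leaky_mean, leaky_std = fit_scaler(protein)

for fold, (train_x, test_x, train_idx, test_idx) in enumerate(
    leakage_free_cv(protein, labels, k=5)
):
    n_cases = sum(labels[i] for i in test_idx)
    print(f"fold {fold}: {len(test_idx)} held-out rows, {n_cases} cases")
```

The same principle would apply to any fitted preprocessing step (imputation, feature selection, batch correction): it is fitted inside each training fold rather than once on the full dataset.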
Results
Across a majority of the 607 disease outcomes, demographic baselines performed competitively (0%-10% difference in mean AUROC) in comparison with the combination of demographics and omics factors when looking at the variability of performance across cross-validation folds, which is underreported in the literature. We found that data leakage influenced predictive performance (mean AUROC increases of 5%-30%). Accounting for disease prevalence and class balance played a significant role in the interpretation of machine learning results, particularly for positive predictive value. Omics factors contributed marginal increases in performance for cardiovascular disease and atherosclerotic cardiovascular disease compared with traditional risk score calculators (less than 5% increase in mean AUROC).
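The dependence of positive predictive value on prevalence follows from Bayes' rule and can be checked with a short calculation. The sketch below uses hypothetical operating characteristics (sensitivity and specificity both 0.80) and hypothetical prevalences; none of these numbers come from the study.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by a test's operating point and the prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Same classifier, different disease prevalence (illustrative values only).
for prevalence in (0.50, 0.10, 0.01):
    ppv, npv = predictive_values(0.80, 0.80, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

With these assumed inputs, the PPV is 0.80 in a balanced sample but falls below 0.05 at 1% prevalence, even though sensitivity, specificity, and therefore the AUROC are unchanged.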
Conclusions
The robustness and transparency of biobank-based omics research would be enhanced if several practices were followed: (1) rigorous and relevant baselines would provide critical tests of candidate models to justify their clinical use and value, (2) preventing data leakage would preclude inflated results, and (3) comprehensive performance metrics reporting would give transparent measures of progress for researchers in the omics community. Our results complement other work showing the impact of data leakage1 and the marginal improvements of plasma proteomics beyond demographic baselines.2
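As one possible reading of practice (3), the sketch below reports the AUROC together with sensitivity, specificity, and positive and negative predictive values for each fold, and then the mean and standard deviation across folds. The helper names, the 0.5 decision threshold, and the two toy folds are assumptions for illustration and are not taken from the study.

```python
import statistics


def auroc(scores, labels):
    """Probability that a random case outranks a random control (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def threshold_metrics(scores, labels, threshold=0.5):
    """Confusion-matrix metrics at a fixed decision threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    tn = sum(1 for s, y in pairs if s < threshold and y == 0)
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "ppv": tp / (tp + fp) if tp + fp else nan,
        "npv": tn / (tn + fn) if tn + fn else nan,
    }


def summarize_folds(fold_reports):
    """Mean and sample SD of every metric across folds."""
    summary = {}
    for key in fold_reports[0]:
        values = [report[key] for report in fold_reports]
        summary[key] = (statistics.fmean(values), statistics.stdev(values))
    return summary


# Two toy folds of held-out predictions (hypothetical scores and labels).
toy_folds = [
    ([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]),
    ([0.9, 0.7, 0.4, 0.1], [1, 1, 0, 0]),
]
reports = []
for scores, labels in toy_folds:
    report = {"auroc": auroc(scores, labels)}
    report.update(threshold_metrics(scores, labels, threshold=0.5))
    reports.append(report)

for metric, (mean, sd) in summarize_folds(reports).items():
    print(f"{metric}: mean={mean:.3f}, SD={sd:.3f}")
```

Reporting the spread across folds alongside the mean makes it easier to judge whether a difference between a baseline and a candidate model exceeds fold-to-fold variability.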
References
1. Kapoor S, Narayanan A. Leakage and the reproducibility crisis in machine-learning-based science. Patterns. 2023;4(9):100804. doi:10.1016/j.patter.2023.100804
2. Deng YT, You J, He Y, et al. Atlas of the plasma proteome in health and disease in 53,026 adults. Cell. 2025;188(1):253-271. doi:10.1016/j.cell.2024.10.045
1Department of Biomedical Informatics, Harvard Medical School, Boston, MA, US, chirag_patel@hms.harvard.edu.
Conflict of Interest Disclosures
None reported.
Funding/Support
This work was supported by grants from the National Institute on Aging (RF1AG074372) and National Library of Medicine (5T15LM007092-33).
Role of the Funder/Sponsor
The funders were not involved in the development or implementation of this study.
Acknowledgment
We acknowledge the support of our funders and Audrey Airaud for early-stage input on these experiments.