Abstract
Using GPT to Identify Changes in Clinical Trial Outcomes Registered on ClinicalTrials.gov
Xiangji Ying,1 Colby J. Vorland,2 Kiran Ninan,1 Jean-Pierre Oberste,1 Andrew W. Brown,3 Riaz Qureshi,4 Sirui Zhang,5 Nicholas J. DeVito,6 Matthew Page,7 Ian J. Saldanha,8 Halil Kilicoglu,9 Evan Mayo-Wilson1
Objective
Registries, such as ClinicalTrials.gov, and guidelines, including SPIRIT 2025 (Standard Protocol Items: Recommendations for Interventional Trials) and CONSORT 2025 (Consolidated Standards of Reporting Trials), define clinical trial outcomes using 5 elements: domain, measurement, metric, aggregation method, and time point. Changes between prospective registration and results reporting can introduce bias. Readers can manually compare trial documents, but doing so is resource intensive. Automated methods could facilitate checking and improve peer review. We used a chatbot to define outcomes registered on ClinicalTrials.gov, identify changes between prospective registrations and registry results, and describe those changes.
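As an illustration only (this record type is our own shorthand for the 5-element framework, not a registry schema or part of the study protocol), an outcome can be represented as a simple structured record:

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    """One trial outcome, following the 5-element framework (SPIRIT/CONSORT).
    Field names and example values are illustrative, not the registry's schema."""
    domain: str        # e.g., "depressive symptoms"
    measurement: str   # e.g., "Hamilton Depression Rating Scale"
    metric: str        # e.g., "change from baseline"
    aggregation: str   # e.g., "mean"
    time_point: str    # e.g., "8 weeks"

# Hypothetical example
example = Outcome(
    domain="depressive symptoms",
    measurement="Hamilton Depression Rating Scale",
    metric="change from baseline",
    aggregation="mean",
    time_point="8 weeks",
)
```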
Design
We conducted a cross-sectional study using prospectively registered, completed randomized clinical trials from ClinicalTrials.gov (January 2000-January 2024). Building on the 5-element framework, we developed rules and categories to define outcomes and outcome changes. We used GPT-4o, o1, and o3-mini (OpenAI) and developed and optimized prompts using a training set of 225 trials (2221 outcomes). We validated performance on 150 trials (1459 outcomes). We divided the task into 16 structured subtasks across 3 steps: (1) defining outcomes in both versions, (2) matching outcomes, and (3) detecting changes in outcome elements. We provided cosine and Jaccard distances to help the chatbot match and compare outcomes. We evaluated all outcomes using the 3 chatbot models in January 2025 and selected the answer given by at least 2 of the 3 models. Two human raters independently evaluated the chatbot results. We present preliminary findings based on unreconciled ratings of changes (final findings will be presented at the conference).
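As a rough sketch of the similarity signals supplied to the chatbot and of the 2-of-3 consensus rule described above (the tokenization, vectorization, and function names below are simplified assumptions, not the study's actual implementation):

```python
from collections import Counter
from math import sqrt

def cosine_distance(a: str, b: str) -> float:
    """Cosine distance between bag-of-words token counts (simplified)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return 1.0 - (dot / norm if norm else 0.0)

def jaccard_distance(a: str, b: str) -> float:
    """Jaccard distance between token sets (simplified)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return 1.0 - (len(sa & sb) / len(union) if union else 0.0)

def consensus(answers: list[str]) -> str | None:
    """Return the answer given by at least 2 of the 3 models, else no consensus."""
    value, count = Counter(answers).most_common(1)[0]
    return value if count >= 2 else None
```

Both distances range from 0 (identical token profiles) to 1 (no shared tokens), giving a coarse signal about which registered and reported outcomes likely correspond to each other.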
Results
The 150 validation trials reported a mean (SD) of 6.9 (5.0) outcomes in prospective registrations and 8.4 (6.0) outcomes in final versions. The final versions included 832 outcomes that matched the prospective registrations and 428 additional outcomes, and they omitted 199 outcomes from the prospective versions. We achieved 99.8% (95% CI, 99.4%-99.9%) accuracy in matching outcomes between versions (Table 25-1142). Accuracy for detecting changes was 87.9% (95% CI, 85.5%-90.1%) for measurement, 88.1% (95% CI, 85.7%-90.2%) for metric, 87.2% (95% CI, 84.8%-89.4%) for aggregation method, 94.5% (95% CI, 92.7%-95.9%) for cutoff, and 94.8% (95% CI, 93.1%-96.2%) for time point. On average, the large language models completed the entire process (ie, defining, matching, comparing) in approximately 2 minutes per trial. Human raters took 27 minutes per trial to evaluate the chatbot's responses.
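For reference (the abstract does not state the interval method; the Wilson score interval and the counts below are assumptions for illustration), a 95% CI for an accuracy estimate can be computed from the number of correct judgments and the total evaluated:

```python
from math import sqrt

def wilson_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score 95% CI for a proportion (eg, accuracy of change detection)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

# Hypothetical example: 1283 of 1459 outcome elements judged correct
print(wilson_ci(1283, 1459))
```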
Conclusions
A prompt-based approach was highly accurate in defining clinical trial outcomes and identifying outcome changes in ClinicalTrials.gov. This approach could be expanded to identify changes between registrations and manuscripts. Although it did not achieve perfect accuracy, the approach could help editors and peer reviewers detect likely discrepancies that warrant further review.
1Department of Epidemiology, University of North Carolina Gillings School of Global Public Health, Chapel Hill, NC, US, evan.mayo-wilson@unc.edu; 2Department of Epidemiology and Biostatistics, Indiana University School of Public Health–Bloomington, IN, US; 3Department of Biostatistics, University of Arkansas for Medical Sciences; Arkansas Children’s Research Institute, Little Rock, AR, US; 4Department of Ophthalmology, School of Medicine; Department of Epidemiology, School of Public Health, University of Colorado Anschutz Medical Campus, Denver, CO, US; 5Department of Epidemiology, School of Public Health, Brown University, Providence, RI, US; 6The Bennett Institute for Applied Data Science, Nuffield Department of Primary Care Health Sciences, University of Oxford, Oxford, UK; 7Methods in Evidence Synthesis Unit, School of Public Health and Preventive Medicine, Monash University, Melbourne, Victoria, Australia; 8Departments of Epidemiology (Primary) and Health Policy and Management (Joint), Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, US; 9School of Information Sciences, University of Illinois at Urbana-Champaign, IL, US.
Conflict of Interest Disclosures
None reported.
Funding/Support
National Library of Medicine, National Institutes of Health (R01LM014079).
Role of Funder/Sponsor
The funder had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the abstract; or the decision to submit the abstract for presentation. The views expressed do not necessarily reflect the views of the National Institutes of Health.
Additional Information
Our protocol is publicly available, and we will share prompts, code, and datasets on the Open Science Framework (https://osf.io/2tyh3/).
