Abstract
Quality and Comprehensiveness of Peer Reviews of Journal Submissions Produced by Large Language Models vs Humans
Fares Alahdab,1,2,3 Juan Franco,3,4 Helen Macdonald,4 Sara Schroter4
Objective
Peer reviewer fatigue is on the rise: reviewers are overburdened with a growing number of manuscripts and associated tasks, making it challenging for journal editors to secure a sufficient number of high-quality reviews.1 Despite the potential benefits of using large language models (LLMs) in editorial and peer review processes2 and to support reviewers with their tasks, it is not yet known how well LLMs perform peer review. Additionally, covert use of LLMs in peer review is increasingly suspected, with limited knowledge of its positive or negative impact. We aim to compare the quality and comprehensiveness of peer reviews produced by 5 LLMs with those produced by human peer reviewers for research submissions.
Design
A comparative study of peer reviews produced by 5 LLMs (GPT-4o, GPT o3, Claude 3.5, Gemini 1.5 Pro, and Gemini 2.0 Flash) and by 2 human peer reviewers for research submissions between May 2024 and June 2025. Manuscripts were uploaded to Google’s Vertex AI within a private and secure BMJ Group workspace for using LLMs; the same prompt was used for all submissions (single-shot prompting). An experienced editor rated all LLM and human reviews using the Review Quality Instrument (RQI),3 a tool for assessing review quality that comprises 8 questions, each rated on a 5-point Likert scale. Secondary outcomes include a comprehensiveness score, based on the elements of the manuscript that the reviews focused on, and whether the reviews were evaluative in nature. Additionally, once the first decision has been made on the manuscripts, we plan to invite authors to complete a survey about their perceptions of both the LLM review reports and the peer reviewer reports for their manuscripts. We also plan to have a second editor perform the ratings and to compute interrater agreement. Reviews produced by LLMs are not included in decision-making for the manuscripts. Wilcoxon rank sum tests were used to compare LLM and human scores for each RQI item, as illustrated in the sketch below.
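For illustration only, the following is a minimal sketch (in Python) of the per-item comparison described above: a Wilcoxon rank sum test applied to LLM vs human ratings for a single RQI item. The Likert scores shown are hypothetical values invented for this example; this is not the study’s analysis code.

    # Wilcoxon rank sum test comparing LLM vs human ratings on one RQI item
    # (hypothetical 5-point Likert scores, for illustration only)
    from scipy.stats import ranksums

    llm_item_scores = [4, 4, 5, 4, 3]    # hypothetical LLM ratings
    human_item_scores = [3, 2, 3, 4, 2]  # hypothetical human ratings

    statistic, p_value = ranksums(llm_item_scores, human_item_scores)
    print(f"Wilcoxon rank sum statistic = {statistic:.2f}, P = {p_value:.3f}")

In the analysis described above, this comparison is repeated for each of the 8 RQI items.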
Results
Preliminary data include 35 reviews (25 by LLMs and 10 by peer reviewers) of 5 BMJ submissions. Across the 8 RQI items, LLM-generated reviews had higher mean (SD) scores than human reviews on identifying strengths and weaknesses (4.12 [0.53] vs 2.70 [0.95]; P < .001); providing useful comments on the writing, organization, tables, and figures (3.72 [0.79] vs 1.80 [0.63]; P < .001); and constructiveness (4.00 [0.58] vs 3.00 [1.05]; P = .004) (Table 25-1146). Complete data will include 200 submissions to 4 BMJ journals, the secondary outcomes, and authors’ feedback on the LLM reviews.
Conclusions
In these preliminary data, LLM-generated reviews matched or exceeded human reviews on several key dimensions of review quality. The full analysis will shed more light on the potential value of LLM-generated peer reviews and how they could complement the work of human peer reviewers.
References
1. Publons. Global State of Peer Review Report. https://publons.com/community/gspr
2. Liang W, Zhang Y, Cao H, et al. Can large language models provide useful feedback on research papers? A large-scale empirical analysis. arXiv. Preprint posted October 2, 2023. doi:10.48550/arXiv.2310.01783
3. van Rooyen S, Black N, Godlee F. Development of the review quality instrument (RQI) for assessing peer reviews of manuscripts. J Clin Epidemiol. 1999;52(7):625-629. doi:10.1016/s0895-4356(99)00047-5
1University of Missouri-Columbia, Columbia, MO, US, fares.alahdab@health.missouri.edu; 2University of Texas Health Science Center, Houston, TX, US; 3BMJ Evidence-Based Medicine, London, UK; 4BMJ, London, UK.
Conflict of Interest Disclosures
None reported.