AI-Augmented Peer Review, Collaboration Dynamics, and Human Reviewer Performance
Ashia Livaudais,1 Dmitri Iourovitski1
Abstract
Objective
As AI models improve in quality and affordability, their role in scientific evaluation grows increasingly relevant.1,2 We investigated (1) the quality of reviews produced by AI tools and humans, alone or in combination (ie, AI augmented); (2) reviewer accuracy at distinguishing AI-generated and AI-augmented peer reviews from human reviews; and (3) whether awareness of AI augmentation affected human reviewers’ perceptions of review quality.
Design
This mixed-methods study was conducted from July to September 2024 and received institutional ethical approval. We defined peer review subtasks3 (eg, evaluation of methodological rigor) by analyzing reviews posted for a selection of 50 manuscripts on OpenReview. We selected 100 manuscripts in physics, mathematics, and machine learning and identified 133 participants across 3 continents without conflicts of interest to review the manuscripts and evaluate review quality. Participants were randomly assigned roles. The 60 reviewers and 73 meta-reviewers included 48 senior/faculty researchers, 13 journal editors, 42 industry researchers, 23 graduate students, and 7 master’s degree students. Reviewers provided peer review reports for selected manuscripts, and meta-reviewers rated peer review quality on a 0 to 5 scale, with higher scores indicating higher quality. Manuscripts were anonymized and reviewed by Symby (Symby Labs), a new AI tool fine-tuned for scientific evaluation; GPT-4 (OpenAI); and Claude 3 Sonnet (Anthropic), each alone and in combination with humans (AI augmented), as well as by humans alone. Half of the reviewers were informed that their reviews might be AI augmented, regardless of actual augmentation. Half of the meta-reviewers were informed of potential AI involvement and were also asked to detect AI-generated reviews. Statistical analysis used 1-way analysis of variance (ANOVA) to compare review quality scores, with post hoc Tukey honestly significant difference tests. Independent-samples t tests compared review quality scores between informed and uninformed groups. χ² testing assessed AI detection accuracy.
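For readers who wish to replicate the analysis pipeline, the sketch below runs the reported tests in Python with SciPy and statsmodels on simulated placeholder scores. The condition names, group sizes, effect sizes, and the 50/50 chance baseline for the χ² test are illustrative assumptions, not the study’s data.

    # Sketch of the abstract's statistical analyses on simulated placeholder
    # data; group names, sizes, and distributions are assumptions.
    import numpy as np
    from scipy import stats
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    rng = np.random.default_rng(0)

    # Hypothetical review quality scores (0-5 scale) for three of the seven
    # review conditions compared in the study.
    conditions = {
        "symby_plus_human": rng.normal(4.2, 0.5, 20).clip(0, 5),
        "human_only": rng.normal(3.3, 0.5, 20).clip(0, 5),
        "gpt4_only": rng.normal(3.1, 0.5, 20).clip(0, 5),
    }

    # 1-way ANOVA comparing review quality across conditions.
    f_stat, p_anova = stats.f_oneway(*conditions.values())
    print(f"ANOVA: F = {f_stat:.2f}, P = {p_anova:.3f}")

    # Post hoc Tukey honestly significant difference tests.
    scores = np.concatenate(list(conditions.values()))
    labels = np.repeat(list(conditions), [len(v) for v in conditions.values()])
    print(pairwise_tukeyhsd(scores, labels))

    # Independent-samples t test: informed vs uninformed reviewers.
    informed = rng.normal(3.6, 0.5, 30)
    uninformed = rng.normal(3.2, 0.5, 30)
    t_stat, p_t = stats.ttest_ind(informed, uninformed)
    print(f"t test: t = {t_stat:.2f}, P = {p_t:.3f}")

    # χ² test of AI-detection accuracy against a 50/50 chance baseline
    # (assumed; the abstract does not state the comparison distribution).
    chi2, p_chi = stats.chisquare([39, 61], f_exp=[50, 50])
    print(f"chi-square: chi2 = {chi2:.2f}, P = {p_chi:.3f}")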
Results
Mean review quality scores were highest for AI-augmented human reviews: Symby + human (4.2), GPT-4 + human (3.9), Symby alone (3.8), Claude 3 Sonnet + human (3.5), Claude 3 Sonnet alone (3.4), human alone (3.3), and GPT-4 alone (3.1). Human reviewers informed that their reviews might be AI augmented produced reviews that received higher scores than those of uninformed reviewers (3.6 vs 3.2; t58 = 2.4, P = .02). Informed meta-reviewers gave lower scores overall (3.0 vs 3.7; t71 = −3.1, P = .003). Meta-reviewers across all disciplines had 39% accuracy in distinguishing AI-generated reviews (χ² = 0.6; P = .44).
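The reported test statistics can be checked against their P values directly. In the snippet below, the degrees of freedom for the t tests follow from the reported subscripts (58 and 71); 1 degree of freedom for the χ² test is our assumption, since the abstract does not report it.

    # Recover two-sided P values from the reported test statistics.
    from scipy import stats

    p_reviewers = 2 * stats.t.sf(2.4, df=58)  # ~0.020, matches P = .02
    p_meta = 2 * stats.t.sf(3.1, df=71)       # ~0.003, matches P = .003
    p_detect = stats.chi2.sf(0.6, df=1)       # ~0.44 under an assumed 1 df

    print(f"{p_reviewers:.3f} {p_meta:.3f} {p_detect:.2f}")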
Conclusions
We found that AI-augmented human reviews were rated higher in quality than human-only and AI-only reviews, suggesting that AI-augmented human review could provide feedback on par with, or exceeding, that of humans alone. Awareness of potential AI augmentation affected both reviewers’ outputs and meta-reviewers’ ratings. Meta-reviewers did not accurately distinguish AI-generated reviews. Study limitations include a convenience sample drawn from 3 quantitative disciplines, review quality measured with a simple scale, no accounting for clustering of reviews within manuscripts, English-language-only participation, and participants’ awareness of being studied.
References
1. Latona GR, Ribeiro MH, Davidson TR, Veselovsky V, West R. The AI review lottery: widespread AI-assisted peer reviews boost paper scores and acceptance rates. arXiv. Posted online May 3, 2024. doi:10.48550/arXiv.2405.02150
2. Checco A, Bracciale L, Loreti P, et al. AI-assisted peer review. Humanit Soc Sci Commun. 2021;8(25). doi:10.1057/s41599-020-00703-8
3. Bornmann L. Scientific peer review. Annu Rev Inf Sci Technol. 2011;45(1):197-245. doi:10.1002/aris.2011.1440450112
1Symby Labs, Huntsville, AL, US (livaudais@symbyai.com).
Conflict of Interest Disclosures
Although the specific version of the Symby tool described in this manuscript is not being commercialized, the authors are engaged in ongoing research and development that may lead to future commercial applications based on similar technological approaches and methodologies.
Acknowledgment
We extend our gratitude to the 133 participants—ranging from master’s degree students to senior editors—who contributed their time and expertise to this study. We also acknowledge the OpenReview platform for providing access to review data that informed the design of our study.