Abstract

Automatic Classification of Peer Review Recommendation

Diego Kozlowski,1 Clara Boothby,2 Rosemary Steup,2 Pei-Ying Chen,2 Vincent Larivière,3,4 Cassidy R. Sugimoto5

Objective

Peer review plays a fundamental role in scholarly publishing, but its legitimacy has been increasingly questioned. A growing literature discusses how reviewers’ demographic characteristics and biases might lead to disparities in research dissemination.1,2 Because the extent to which reviewers determine the outcomes of papers may vary, it is important to examine the relationship between reviewers’ recommendations and editors’ decisions. However, reviewer recommendations are often embedded in the text of the review rather than reported separately. This work proposes a method for the automatic detection of recommendations based on review text.

Design

The automatic classification used a rule-based algorithm that searched for the presence of 1 or more phrases signaling the reviewer’s recommendation: accept, minor revision, major revision, or reject, as defined in the hand-coding process. The algorithm considered the different combinations of signal phrases to determine the outcome. The list of signal phrases was built iteratively over 3 rounds of hand-coding and fuzzy matching of sentences, while the combinations were defined to maximize precision on the hand-coded cases. This study used Publons’ data set, which contained 3,310,791 reviews from 25,934 journals; 600 cases were hand-coded, of which a subset of 200 reviews was used to evaluate performance. The gender of reviewers was inferred by matching first and last names to curated lists of country-specific gendered names, including lists from the US Census.3
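To make the rule-based design concrete, the following is a minimal sketch in Python. The signal phrases and the combination rule shown are invented placeholders for illustration, not the study’s actual list, which was built iteratively over 3 rounds of hand-coding and fuzzy matching.

```python
from typing import Optional

# Hypothetical signal phrases per category; placeholders, not the study's list.
SIGNAL_PHRASES = {
    "accept": ["recommend acceptance", "accept this manuscript"],
    "minor revision": ["minor revision", "minor changes"],
    "major revision": ["major revision", "substantial revision"],
    "reject": ["recommend rejection", "should be rejected"],
}

def classify_review(text: str) -> Optional[str]:
    """Return a recommendation label, or None if no signal phrase is found."""
    text = text.lower()
    hits = {category for category, phrases in SIGNAL_PHRASES.items()
            if any(phrase in text for phrase in phrases)}
    if not hits:
        return None  # most reviews contain no explicit recommendation
    if len(hits) == 1:
        return hits.pop()
    # Combination rules (invented here) resolve multi-category matches; the
    # study defined its combinations to maximize precision on hand-coded cases.
    if hits == {"accept", "minor revision"}:
        return "minor revision"  # e.g., "acceptable after minor revisions"
    return None  # leave ambiguous combinations unclassified
```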

Results

On the test set, 81% of the recommendations assigned by the algorithm were correct according to hand coding (n = 149). Because the inclusion of additional signal phrases was associated with lower accuracy, this figure may represent an upper bound for our experiment, given the limits of the current data and the idiosyncrasies of peer review language. Nonetheless, the algorithm’s accuracy was comparable to the rate of agreement between human hand coders (n = 60 [88%]). Over the full data set, 14.3% of reviews were assigned a recommendation by this method (n = 473,443), comparable to the 18.3% of hand-coded reviews that contained an explicit recommendation (n = 399). From these results, we concluded that the inclusion of an explicit recommendation remains relatively uncommon in peer review, with the majority of peer reviewers leaving the final decision on the manuscript to the editor, though there was large variation between journals. Initial results also showed gender differences in reviewing behavior, with higher retrieval rates among reviewers identified as men.
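The two reported quantities differ: accuracy is computed only over reviews to which the algorithm assigned a recommendation, while the retrieval rate is the share of all reviews that received any assignment. A minimal sketch of this distinction, assuming the classify_review function from the sketch above and hypothetical hand-coded gold labels:

```python
def evaluate(review_texts, gold_labels):
    """Accuracy over assigned cases and retrieval rate over all cases."""
    predictions = [classify_review(text) for text in review_texts]
    assigned = [(p, g) for p, g in zip(predictions, gold_labels) if p is not None]
    retrieval_rate = len(assigned) / len(predictions)    # e.g., 14.3% on full data
    accuracy = (sum(p == g for p, g in assigned) / len(assigned)
                if assigned else 0.0)                    # e.g., 81% on n = 149
    return accuracy, retrieval_rate
```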

Conclusions

This work is among the first benchmarks for automatic classification of review recommendations on a large-scale, cross-domain database. Though preliminary, it paves the way for future developments, including studies of potential biases and inequalities in scholarly publishing through examination of the relationship between reviewer characteristics and review outcomes.

References

1. Lee CJ, Sugimoto CR, Zhang G, Cronin B. Bias in peer review. J Am Soc Inf Sci Technol. 2013;64(1):2-17. doi:10.1002/asi.22784

2. Sun M, Barry Danfa J, Teplitskiy M. Does double-blind peer review reduce bias? Evidence from a top computer science conference. J Assoc Inf Sci Technol. 2022;73(6):811-819. doi:10.1002/asi.24582

3. Larivière V, Ni C, Gingras Y, Cronin B, Sugimoto CR. Bibliometrics: global gender disparities in science. Nature. 2013;504:211-213. doi:10.1038/504211a

1Faculty of Science, Technology and Medicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg; 2Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, USA, crboothb@iu.edu; 3École de bibliothéconomie et des sciences de l’information, Université de Montréal, Montréal, QC, Canada; 4Observatoire des Sciences et des Technologies, Université du Québec à Montréal, Montréal, QC, Canada; 5School of Public Policy, Georgia Institute of Technology, Atlanta, GA, USA

Conflict of Interest Disclosures

None reported.

Funding/Support

This work was supported by the Doctoral Training Unit on Data-Driven Computational Modelling and Applications (DRIVEN), which is funded by the Luxembourg National Research Fund under the PRIDE programme (PRIDE17/12252781).

Acknowledgment

We thank Brad Demarest and Chaoqun Ni for contributions to an earlier phase of the project. We also thank the team at Publons who implemented the gender inference algorithm on their data before giving us access.

Additional Information

Diego Kozlowski is a co–corresponding author.

Poster