Counterfactual Evaluation of Peer Review Assignment Strategies in Computer Science and Artificial Intelligence

Martin Saveski,1 Steven Jecmen,2 Nihar Shah,2 Johan Ugander1


Artificial intelligence (AI) is now widely used to assign reviewers to papers.1 The assignment relies on 3 key sources of data1: (1) AI-computed similarities between the text of the submitted paper and reviewers’ past articles, (2) reviewer-provided preferences expressing which papers they would like to review, and (3) overlap between the paper’s topics as specified by authors and reviewers’ self-reported areas of expertise. However, it is unknown which of these sources, or which combination of them, leads to the best reviewer assignment outcomes.


To assign reviewers to papers, 2 venues recently used randomized algorithms2 designed to combat fraud: the 2021 Theory and Practice of Differential Privacy (TPDP) Workshop, with 35 reviewers and 95 full papers, and the 2022 Association for the Advancement of Artificial Intelligence (AAAI) Conference on Artificial Intelligence, with 3145 reviewers and 8450 full papers. To compute an overall similarity for each reviewer-paper pair, TPDP weighted the AI-computed text similarities by wtext (range, 0-1) and reviewers’ preferences by 1 − wtext; AAAI weighted the AI-computed text similarities by wtext (range, 0-1) and the overlap between the papers’ and reviewers’ topical areas by 1 − wtext (reviewers’ preferences were also used in the AAAI assignment but were not considered in this study). The randomized assignment2 then maximized the total similarity of assigned reviewer-paper pairs, subject to the probability of any reviewer being assigned to any paper being at most 0.5 in TPDP and 0.52 in AAAI. In this study, the randomization in the assignment was leveraged to estimate the counterfactual quality of alternative assignment strategies. The study investigated how the overall quality of the reviewer-paper assignment was affected by (1) introducing randomness in the assignment process and (2) varying the weights of the different sources of information. The quality of any counterfactual reviewer-paper assignment was measured using reviewers’ self-reported expertise and confidence in their review.
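
The weighting of the 2 data sources described above can be illustrated with a minimal sketch. It assumes both sources are already scaled to the same range and stored as reviewer-by-paper tables; the function name `combine_similarities` and all scores are hypothetical and are not taken from either venue's assignment system.

```python
def combine_similarities(text_sim, other_sim, w_text):
    """Blend two reviewer-by-paper score tables into one overall similarity.

    text_sim  : AI-computed text similarities (rows = reviewers, columns = papers)
    other_sim : second source (preferences in TPDP, topical overlap in AAAI)
    w_text    : weight on the text similarities; the second source gets 1 - w_text
    """
    if not 0.0 <= w_text <= 1.0:
        raise ValueError("w_text must lie in [0, 1]")
    return [
        [w_text * t + (1.0 - w_text) * o for t, o in zip(t_row, o_row)]
        for t_row, o_row in zip(text_sim, other_sim)
    ]


# Hypothetical scores for 2 reviewers (rows) and 3 papers (columns).
text_sim = [[0.9, 0.2, 0.4], [0.1, 0.8, 0.5]]
preferences = [[0.5, 0.0, 1.0], [0.0, 1.0, 0.5]]

equal_weight = combine_similarities(text_sim, preferences, w_text=0.5)
text_heavy = combine_similarities(text_sim, preferences, w_text=0.8)
```

Changing `w_text` only changes the similarity table passed to the assignment step; the cap on assignment probabilities is applied separately when the randomized assignment is computed.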


The results are tabulated in Table 26.3 First, introducing randomness by limiting the probability of any reviewer-paper assignment led to a marginal reduction in assignment quality for TPDP and a slightly larger reduction in AAAI. Second, for TPDP, placing more weight on the AI-computed text similarities (wtext = 0.8) instead of equally weighting the text similarities and the reviewers’ preferences (wtext = 0.5) resulted in a higher reviewer-paper assignment quality. Third, for AAAI, placing more weight on the AI-computed text similarities (wtext = 0.75) instead of equally weighting the text similarity and the reviewer-paper topical area overlap (wtext = 0.5) led to a similar assignment quality.
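
Comparisons of this kind depend on estimating the quality of an assignment strategy that was never deployed. One standard way to do so, shown here only as an illustrative sketch and not as the estimator used in this study, is self-normalized importance weighting: each observed review is reweighted by the ratio of the pair's assignment probability under the alternative strategy to its probability under the deployed one. The function name and all numbers are hypothetical. Pairs with zero probability under the deployed strategy are never observed, and this sketch does not account for them.

```python
def estimate_counterfactual_quality(observations):
    """Self-normalized importance-weighted mean of an observed outcome.

    Each observation is a tuple (outcome, p_deployed, p_alternative) for one
    reviewer-paper pair that was actually assigned:
      outcome       : e.g. the reviewer's self-reported expertise
      p_deployed    : probability of the pair under the deployed strategy
      p_alternative : probability of the pair under the alternative strategy
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for outcome, p_deployed, p_alternative in observations:
        if p_deployed <= 0.0:
            raise ValueError("observed pairs must have positive deployed probability")
        weight = p_alternative / p_deployed
        weighted_sum += weight * outcome
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("alternative strategy has no overlap with observed pairs")
    return weighted_sum / weight_total


# Hypothetical observations: (expertise score, p_deployed, p_alternative).
observed = [(4.0, 0.5, 0.5), (2.0, 0.5, 0.25), (5.0, 0.25, 0.5)]
estimate = estimate_counterfactual_quality(observed)
```

When the alternative strategy assigns every observed pair the same probability as the deployed one, all weights equal 1 and the estimate reduces to the plain mean of the observed outcomes.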


Randomness in the reviewer assignments can help improve AI-based automated assignment by enabling counterfactual analysis of alternative assignment strategies, in addition to its original goal of mitigating fraud, but leads to a small reduction in assignment quality.


1. Shah N. Challenges, experiments, and computational solutions in peer review. Commun ACM. 2022;65(6):76-87. doi:10.1145/3528086

2. Jecmen S, Zhang H, Liu R, Shah N, Conitzer V, Fang F. Mitigating manipulation in peer review via randomized reviewer assignments. Adv Neural Inf Process Syst. 2020;33:12533-12545.

3. Imbens GW, Manski CF. Confidence intervals for partially identified parameters. Econometrica. 2004;72(6):1845-1857. doi:10.1111/j.1468-0262.2004.00555.x

1Stanford University, Stanford, CA, USA; 2Carnegie Mellon University, Pittsburgh, PA, USA, nihars@cs.cmu.edu

Conflict of Interest Disclosures

None reported.


This work was supported by US National Science Foundation CAREER award 1942124, which supports research on the fundamentals of learning from people with applications to peer review.


We thank Gautam Kamath and Rachel Cummings for allowing us to conduct this study in TPDP and Melisa Bok and Celeste Martinez Gomez from OpenReview.net for helping with the APIs of OpenReview.net.