Development of a Global Dataset for Peer Review in Astronomy

Vicente Amado Olivo,1 Wolfgang Kerzendorf1


Major astronomical observatories receive thousands of proposals per year from astronomers seeking telescope time. The Space Telescope Science Institute alone receives approximately 1000 proposals per year for the Hubble Space Telescope, and this number is projected to double now that the James Webb Space Telescope has launched.2 In astronomy, a Time Allocation Committee (TAC) reviews all proposals submitted for the use of a telescope and identifies appropriate experts to review each proposal. The objective of this study was to develop a database of all active astronomers and their publications to assist in identifying experts for the peer review of observing proposals, expanding on work by Kerzendorf et al1 and Strolger et al.2

Design This database creation and modeling study expanded the reviewer pool to astronomers worldwide, rather than relying solely on the TAC’s personal networks. The Semantic Scholar Open Research Corpus (S2ORC) data set was used to create a preliminary database of authors, their full-text publications, and associated metadata. Experts for peer review were identified systematically from each astronomer’s body of work (ie, scientific publications). An author’s publications and the observing proposal were represented numerically using machine learning models to identify the astronomers whose expertise most closely matched the proposal. Several name-based techniques for disambiguating author names were compared. Authors whose full names contained more than 3 words were excluded owing to formatting issues (methods to address this issue are being investigated). A preliminary prototype using machine learning and natural language processing models was tested on 918 proposals from the European Southern Observatory (meaningful metrics for evaluating expertise are being researched).
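The matching step described above can be illustrated with a minimal sketch. The abstract does not specify which machine learning models were used, so this example substitutes a plain TF-IDF representation with cosine similarity as a stand-in. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_reviewers`) and the toy texts are hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and keep purely alphabetic tokens (a deliberate simplification).
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        term_freq = Counter(toks)
        vectors.append({
            # Smoothed inverse document frequency; never zero or negative.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_reviewers(proposal, author_texts):
    # author_texts maps an author label to the concatenated text of
    # that author's publications (hypothetical input format).
    names = list(author_texts)
    vectors = tfidf_vectors([proposal] + [author_texts[n] for n in names])
    proposal_vec = vectors[0]
    scores = [(name, cosine(proposal_vec, vec))
              for name, vec in zip(names, vectors[1:])]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy usage with made-up texts.
author_texts = {
    "author_a": "supernova spectra radiative transfer",
    "author_b": "exoplanet transit photometry",
}
ranking = rank_reviewers("spectra of a nearby supernova", author_texts)
```

Here `ranking` lists authors from most to least similar to the proposal text. A production system would likely replace the TF-IDF step with whichever text representation the study settles on; the ranking logic would stay the same.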


The S2ORC data set, which consists of 12 million full-text publications, was filtered to astronomy publications only using the publications’ arXiv identifiers. The resulting database contains 212,839 publications from 1991 to 2020 and a total of 1,801,916 nonunique authors. Three author name disambiguation algorithms were compared: the first-initial, all-initials, and hybrid methods.3 The 3 methods were validated using an initial subset of 1538 ORCID identifiers matched to astronomers. The contamination rate was defined as the percentage of validated astronomers whose identity became compromised owing to merging or splitting of names. The contamination rates of the 3 methods were 1.77%, 15.52%, and 2.02%, respectively.
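As an illustration of the validation described above, the following sketch builds first-initial and all-initials name keys and computes a contamination rate against ORCID-labeled records. It is written under stated assumptions: names are taken to be in "Given Middle Surname" order with the last token as the surname (which mishandles multiword surnames), the hybrid method is omitted because its rules are not described here, and the record identifiers and names are invented for the example.

```python
from collections import defaultdict


def split_name(full_name):
    # Assumption: the last token is the surname, the rest are given names.
    parts = full_name.replace(".", " ").split()
    return parts[-1].lower(), [p.lower() for p in parts[:-1]]


def first_initial_key(full_name):
    # Surname plus the first given-name initial, e.g. "smith_j".
    surname, given = split_name(full_name)
    if not given:
        return surname
    return f"{surname}_{given[0][0]}"


def all_initials_key(full_name):
    # Surname plus every given-name initial, e.g. "smith_ja".
    surname, given = split_name(full_name)
    if not given:
        return surname
    return surname + "_" + "".join(g[0] for g in given)


def contamination_rate(records, key_fn):
    # records: iterable of (orcid, name_as_published) pairs.
    # An astronomer counts as compromised if their names are split across
    # several keys, or if one of their keys is merged with another ORCID.
    keys_by_orcid = defaultdict(set)
    orcids_by_key = defaultdict(set)
    for orcid, name in records:
        key = key_fn(name)
        keys_by_orcid[orcid].add(key)
        orcids_by_key[key].add(orcid)
    compromised = 0
    for orcid, keys in keys_by_orcid.items():
        is_split = len(keys) > 1
        is_merged = any(len(orcids_by_key[k]) > 1 for k in keys)
        if is_split or is_merged:
            compromised += 1
    return 100.0 * compromised / len(keys_by_orcid)


# Toy records with placeholder identifiers (not real ORCID iDs).
records = [
    ("id1", "Jane A. Smith"),
    ("id1", "Jane Smith"),
    ("id2", "Maria Lopez"),
    ("id3", "Mark Lopez"),
    ("id4", "Wei Chen"),
]
rate_first = contamination_rate(records, first_initial_key)
rate_all = contamination_rate(records, all_initials_key)
```

In this toy data the first-initial key merges the two Lopez authors, while the all-initials key additionally splits the author who publishes both with and without a middle initial. This shows how splitting can push the all-initials rate above the first-initial rate, in line with the ordering reported above.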


The database expands the possible reviewer pool from the several hundred astronomers known to the TAC to all active astronomers worldwide. A larger pool of reviewers allows for more accurate expertise matching.


1. Kerzendorf WE, Patat F, Bordelon D, van de Ven G, Pritchard TA. Distributed peer review enhanced with natural language processing and machine learning. Nat Astron. 2020;4(7):711-717. doi:10.1038/s41550-020-1038-y

2. Strolger LG, Porter S, Lagerstrom J, Weissman S, Reid IN, Garcia M. The Proposal Auto-Categorizer and Manager for time allocation review at Space Telescope Science Institute. AJ. 2017;153(4):181. doi:10.3847/1538-3881/aa6112

3. Milojević S. Accuracy of simple, initials-based methods for author name disambiguation. J Informetrics. 2013;7(4):767-773. doi:10.1016/j.joi.2013.06.006

1Michigan State University, East Lansing, MI, USA, amadovic@msu.edu

Conflict of Interest Disclosures

None reported.


Funding was received from the Space Telescope Science Institute–Hubble Space Telescope Science Policies Group.

Role of Funder/Sponsor

The funder collaborated with members from the Space Telescope Science Institute to oversee the development of the database.