Accuracy and Precision of a Neural Network Author Name Disambiguator
Abstract
Vicente Amado Olivo,1 Wolfgang Kerzendorf,1 Nutan Chen,2 Joshua V. Shields,1 Bangjing Lu,1 Andreas Flörs3
Objective
Identifying peer reviewers is becoming more difficult because of a surge in submissions and declining acceptance rates for review invitations.1 Responses to a 2024 IOP Publishing online survey suggest that the peer review workload is unevenly distributed, with 10% of reviewers conducting 50% of all reviews, while early-career researchers and scholars from underrepresented regions remain underused despite being willing to review.2 Reviewer identification tools have been introduced to assist editors, but these tools are proprietary and closed source, which limits their accessibility and transparency for the broader scientific community. To expand the global pool of peer reviewers, open-source, AI-powered methods are needed to uniquely identify researchers within the expanding scientific literature and match them with appropriate review opportunities. Current author name disambiguation methods often rely on extensive data features, such as institutional affiliations, email addresses, or publication venues, which are not consistently available across digital libraries. We describe the development of a name disambiguation tool that requires fewer data features.
Design
Given the lack of existing author disambiguation tools for the Smithsonian Astrophysical Observatory/National Aeronautics and Space Administration Astrophysics Data System, we developed the Neural Author Name Disambiguator (NAND), a similarity-based neural network, and trained it on pairs of publications whose authors shared the same name and that were labeled with Open Researcher and Contributor ID (ORCID) identifiers. The training dataset included 2,698,778 pairs balanced between classes (0 if they shared the same ORCID and 1 if they had different ORCIDs), 553,496 unique publications, and 125,486 ORCID profiles of authors from the 2020 ORCID open data snapshot (an annual public data release). NAND was trained to disambiguate publications with minimal features (ie, author name, title, and abstract). We validated NAND performance using standard classification metrics: accuracy (percentage of correctly classified publication pairs), precision (true positives / [true positives + false positives]), recall (true positives / [true positives + false negatives]), and F1 scores (true positives / [true positives + 0.5(false positives + false negatives)]).
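The four validation metrics defined above can be written out directly from confusion counts. The following is a minimal sketch and not the NAND evaluation code; the names `Confusion`, `confusion`, and `pair_metrics` are hypothetical, and because the abstract does not state which label is treated as the positive class, that choice is left as the assumed `positive` parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion counts for a binary pair classifier."""
    tp: int
    fp: int
    fn: int
    tn: int


def confusion(y_true, y_pred, positive=1):
    """Count outcomes over publication pairs.

    `positive` is an assumption: the abstract does not say whether
    0 (same ORCID) or 1 (different ORCIDs) is the positive class.
    """
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return Confusion(tp, fp, fn, tn)


def _ratio(numerator, denominator):
    # Return 0.0 instead of raising when a denominator is empty.
    return numerator / denominator if denominator else 0.0


def pair_metrics(c):
    """Accuracy, precision, recall, and F1 as defined in the text."""
    total = c.tp + c.fp + c.fn + c.tn
    return {
        "accuracy": _ratio(c.tp + c.tn, total),
        "precision": _ratio(c.tp, c.tp + c.fp),
        "recall": _ratio(c.tp, c.tp + c.fn),
        "f1": _ratio(c.tp, c.tp + 0.5 * (c.fp + c.fn)),
    }
```

The F1 expression used here, true positives / [true positives + 0.5(false positives + false negatives)], is algebraically the same as the harmonic mean of precision and recall.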
Results
NAND achieved a mean (SD) 94.64% (0.04%) accuracy on test set pairs of authors within the same name block (eg, J. Smith or Y. Wang). Mean (SD) precision, recall, and F1 scores were 96.67% (0.05%), 95.21% (0.11%), and 95.94% (0.03%), respectively.
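To illustrate what a name block and a labeled pair mean here, the sketch below groups records by first initial and surname and labels every within-block pair using the convention given in the Design section (0 for the same ORCID, 1 for different ORCIDs). It is an illustration under stated assumptions, not the NAND data pipeline: the "Surname, Given" name format, the record fields `id`, `author`, and `orcid`, and the functions `name_block` and `labeled_pairs` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def name_block(author_name):
    """Map an assumed 'Surname, Given' string to a block key such as 'j. smith'."""
    surname, _, given = author_name.partition(",")
    given = given.strip()
    initial = given[0].lower() if given else ""
    return f"{initial}. {surname.strip().lower()}"


def labeled_pairs(records):
    """Build labeled publication pairs within each name block.

    Each record is a dict with hypothetical keys 'id', 'author', 'orcid'.
    Label 0 means the two records share an ORCID; label 1 means they differ.
    """
    blocks = defaultdict(list)
    for record in records:
        blocks[name_block(record["author"])].append(record)

    pairs = []
    for block in blocks.values():
        # Only records in the same block are ever compared.
        for a, b in combinations(block, 2):
            label = 0 if a["orcid"] == b["orcid"] else 1
            pairs.append((a["id"], b["id"], label))
    return pairs


# Placeholder records; the ORCID values are made-up tokens, not real identifiers.
records = [
    {"id": "p1", "author": "Smith, John", "orcid": "orcid-A"},
    {"id": "p2", "author": "Smith, J.", "orcid": "orcid-A"},
    {"id": "p3", "author": "Smith, Jane", "orcid": "orcid-B"},
    {"id": "p4", "author": "Wang, Y.", "orcid": "orcid-C"},
]
pairs = labeled_pairs(records)
```

Restricting comparisons to a block keeps the number of candidate pairs manageable and focuses evaluation on the hard cases, in which different researchers share a surname and first initial.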
Conclusions
Combining the disambiguation model with semantic expertise matching techniques could establish a practical framework for identifying qualified and willing reviewers across global institutions.3 The open framework may help to reduce the burden on overused reviewers by expanding the global pool of available reviewers. While results are promising, this analysis was limited to physics publications; further validation is needed to assess generalizability to other scientific domains.
References
1. Publons. 2018 Global State of Peer Review. Accessed January 29, 2025. https://publons.com/static/Publons-Global-State-Of-Peer-Review-2018.pdf
2. Brigham L, Brent-Jones E, Coombs A, et al. State of peer review 2024. Accessed January 29, 2025. https://ioppublishing.org/state-of-peer-review-2024-results/
3. Kerzendorf WE, Patat F, Bordelon D, van de Ven G, Pritchard TA. Distributed peer review enhanced with natural language processing and machine learning. Nat Astron. 2020;4:711-717. doi:10.1038/s41550-020-1038-y
1Michigan State University, East Lansing, MI, US, amadovic@msu.edu; 2Volkswagen Group, Wolfsburg, Germany; 3GSI Helmholtzzentrum für Schwerionenforschung, Darmstadt, Germany.
Conflict of Interest Disclosures
None reported.
Funding/Support
This work is supported in part by the National Science Foundation Research Traineeship Program (DGE-2152014) to Vicente Amado Olivo. Additionally, we gratefully acknowledge the European Space Agency for funding support through a traineeship for Vicente Amado Olivo.
Additional Information
We acknowledge the support and guidance from Markus Kissler-Patig and Jan Reerink at the European Space Agency. The authors used the free versions of ChatGPT (OpenAI) and Claude (Anthropic) for brainstorming and editing for logical flow. We take responsibility for the integrity of the content generated.