Accuracy and Precision of a Neural Network Author Name Disambiguator

Abstract

Vicente Amado Olivo,¹ Wolfgang Kerzendorf,¹ Nutan Chen,² Joshua V. Shields,¹ Bangjing Lu,¹ Andreas Flörs³

Objective

Identifying peer reviewers is becoming more difficult because of a surge in submissions and declining acceptance rates for review invitations.¹ Responses to a 2024 IOP Publishing online survey suggest that the peer review workload is unevenly distributed, with 10% of reviewers conducting 50% of all reviews, while early-career researchers and scholars from underrepresented regions remain underused despite being willing to review.² Reviewer identification tools have been introduced to assist editors, but these tools are proprietary and closed source, which limits their accessibility and transparency for the broader scientific community. To expand the global pool of peer reviewers, open-source, AI-powered methods are needed to uniquely identify researchers within the expanding scientific literature and match them with appropriate review opportunities. Current author name disambiguation methods often rely on extensive data features, such as institutional affiliations, email addresses, or publication venues, which are not consistently available across digital libraries. We describe the development of a name disambiguation tool that requires fewer data features.

Design

Given the lack of existing author disambiguation tools for the Smithsonian Astrophysical Observatory/National Aeronautics and Space Administration Astrophysics Data System, we developed the Neural Author Name Disambiguator (NAND), a similarity-based neural network, and trained it on pairs of publications by authors sharing the same name, with labels derived from Open Researcher and Contributor ID (ORCID) identifiers. The training dataset included 2,698,778 pairs balanced between the 2 classes (0 if the publications shared the same ORCID and 1 if they had different ORCIDs), 553,496 unique publications, and 125,486 ORCID profiles of authors from the 2020 ORCID open data snapshot (an annual public data release). NAND was trained to disambiguate publications with minimal features (ie, author name, title, and abstract). We validated NAND performance using standard classification metrics: accuracy (percentage of correctly classified publication pairs), precision (true positives / [true positives + false positives]), recall (true positives / [true positives + false negatives]), and F1 score (true positives / [true positives + 0.5(false positives + false negatives)]).
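The pair labeling described above can be illustrated with a minimal sketch. It assumes a hypothetical record layout of (publication ID, author name, ORCID) and a hypothetical name block key made of the first initial plus the last name; neither is confirmed as the NAND implementation, and the ORCID strings are made-up placeholders. Only the label convention (0 for the same ORCID, 1 for different ORCIDs) comes from the abstract.

```python
from collections import defaultdict
from itertools import combinations


def name_block(author_name):
    """Hypothetical block key: first initial plus last name, e.g. 'J. Smith'."""
    parts = author_name.split()
    return f"{parts[0][0].upper()}. {parts[-1]}"


def labeled_pairs(records):
    """Build labeled publication pairs within each name block.

    records: iterable of (pub_id, author_name, orcid) tuples (assumed layout).
    Label convention from the abstract: 0 = same ORCID, 1 = different ORCIDs.
    """
    blocks = defaultdict(list)
    for pub_id, author_name, orcid in records:
        blocks[name_block(author_name)].append((pub_id, orcid))

    pairs = []
    for block, pubs in blocks.items():
        # Only publications in the same name block are compared.
        for (id_a, orcid_a), (id_b, orcid_b) in combinations(pubs, 2):
            label = 0 if orcid_a == orcid_b else 1
            pairs.append((block, id_a, id_b, label))
    return pairs


# Made-up example records; the ORCID values are placeholders, not real profiles.
records = [
    ("pub1", "Jane Smith", "0000-0001-0000-0001"),
    ("pub2", "J. Smith", "0000-0001-0000-0001"),
    ("pub3", "John Smith", "0000-0002-0000-0002"),
    ("pub4", "Yi Wang", "0000-0003-0000-0003"),
]

for pair in labeled_pairs(records):
    print(pair)
```

In this toy input, the three "J. Smith" publications yield three pairs, and the single "Y. Wang" publication yields none because a name block needs at least 2 publications to form a pair.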

Results

NAND achieved a mean (SD) accuracy of 94.64% (0.04%) on test set publication pairs within the same name block (eg, J. Smith or Y. Wang). Mean (SD) precision, recall, and F1 scores were 96.67% (0.05%), 95.21% (0.11%), and 95.94% (0.03%), respectively.
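For readers who want to reproduce these metrics on their own predictions, the definitions given in the Design section can be sketched as follows. The function name and the confusion-matrix counts are hypothetical and chosen only for illustration; they are not NAND results, and the abstract does not state which class was treated as positive.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the four metrics from confusion-matrix counts.

    The formulas follow the definitions in the Design section; the F1
    expression tp / (tp + 0.5 * (fp + fn)) is algebraically the harmonic
    mean of precision and recall.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": tp / (tp + 0.5 * (fp + fn)),
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the reported values are means over repeated runs, the mean F1 need not exactly equal the harmonic mean of the mean precision and mean recall.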

Conclusions

Combining the disambiguation model with semantic expertise matching techniques could establish a practical framework for identifying qualified and willing reviewers across global institutions.³ The open framework may help to reduce the burden on overused reviewers by expanding the global pool of available reviewers. While results are promising, this analysis was limited to physics publications; further validation is needed to assess generalizability to other scientific domains.

References

1. Publons. 2018 Global State of Peer Review. Accessed January 29, 2025. https://publons.com/static/Publons-Global-State-Of-Peer-Review-2018.pdf

2. Brigham L, Brent-Jones E, Coombs A, et al. State of peer review 2024. Accessed January 29, 2025. https://ioppublishing.org/state-of-peer-review-2024-results/

3. Kerzendorf WE, Patat F, Bordelon D, van de Ven G, Pritchard TA. Distributed peer review enhanced with natural language processing and machine learning. Nat Astron. 2020;4:711-717. doi:10.1038/s41550-020-1038-y

¹Michigan State University, East Lansing, MI, US, amadovic@msu.edu; ²Volkswagen Group, Wolfsburg, Germany; ³GSI Helmholtzzentrum für Schwerionenforschung, Darmstadt, Germany.

Conflict of Interest Disclosures

None reported.

Funding/Support

This work is supported in part by the National Science Foundation Research Traineeship Program (DGE-2152014) to Vicente Amado Olivo. Additionally, we gratefully acknowledge the European Space Agency for funding support through a traineeship for Vicente Amado Olivo.

Additional Information

We acknowledge the support and guidance from Markus Kissler-Patig and Jan Reerink at the European Space Agency. The authors used the free versions of ChatGPT (OpenAI) and Claude (Anthropic) for brainstorming and editing for logical flow. We take responsibility for the integrity of the content generated.

Meeting Information

10th Congress information available here