Peer Review Congress

Science Journal Abstracts Misregistered in the Crossref Database

Abstract

Qinyue Liu,1 Yagmur Öztürk,1 Cyril Labbé1

Objective

Abstracts of scientific publications are widely used in scientometrics and text mining because they can be extracted in large quantities from bibliometric databases and convey key information about the publications. However, we found abstracts misregistered in Crossref for articles published by Science and the Proceedings of the National Academy of Sciences (PNAS). We aimed to estimate the proportion of misregistered abstracts in our dataset and to measure the textual similarity between the correct abstracts and the misregistered texts. A previous study also found misregistered metadata in Crossref.1

Design

While working on citation analysis, we developed a script to automatically collect a dataset of citations extracted from articles published between 2016 and 2024 in The Lancet, Cell, and Joule. Each citation context was paired with the cited abstract, extracted mainly via the Crossref API. The dataset comprises 19,822 citation context-abstract pairs. We noticed that certain abstracts on Crossref were misregistered: for Science, an introductory section that cites the current article, and for PNAS, a section called Significance, were registered in place of the abstract. We developed a script to detect these cases automatically by searching for distinctive features (the string “et al” for Science and “Significance” for PNAS). For the Science cases, we also manually collected the correct abstracts and used the Sentence-BERT (SBERT)2 model to calculate the textual similarity between the misregistered and correct abstracts.
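The detection step described above can be sketched as a simple string-matching heuristic. This is an illustrative reconstruction, not the authors' actual script; the function name, input format, and example texts are assumptions.

```python
# Hypothetical sketch of the detection heuristic: flag a Crossref-registered
# "abstract" as likely misregistered when it contains the journal-specific
# distinctive feature described in the study.

def is_likely_misregistered(journal: str, abstract: str) -> bool:
    """Return True if the abstract shows the journal's distinctive feature."""
    if journal == "Science":
        # Misregistered introductory sections cite the current article,
        # so they typically contain "et al".
        return "et al" in abstract
    if journal == "PNAS":
        # Misregistered Significance sections carry that label at the start.
        return abstract.lstrip().startswith("Significance")
    return False

# Toy citation context-abstract records standing in for the real dataset.
records = [
    ("Science", "In this issue, Doe et al. report a new catalyst that ..."),
    ("PNAS", "Significance Understanding protein folding is central to ..."),
    ("Science", "We measured reaction rates under varying pressure and ..."),
]
flagged = [r for r in records if is_likely_misregistered(*r)]
```

Here `flagged` would contain the first two records, mirroring how the study filtered its 19,822 pairs before manual verification.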

Results

Of the 821 abstracts of Science publications, 402 (48%) contained the distinctive feature. We manually verified the results: 400 of the 402 were misregistered. The 2 correctly registered abstracts belonged to Report-type publications for which there was no other text to be misregistered. Of the 669 PNAS abstracts, 243 (36%) began with “Significance.” Moreover, most of the misregistered abstracts in our dataset had high textual similarity scores (cosine similarity) with the correct abstracts: scores ranged from 0.58 to 0.99, with a mean of 0.83. This occurs because the misregistered texts summarize the article and mention its key points, and thus resemble the original abstracts.
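The similarity scores above are cosine similarities between SBERT sentence embeddings. As a minimal, self-contained illustration of the metric itself (the toy 3-dimensional vectors below stand in for real SBERT embeddings, which have hundreds of dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings of a misregistered text and the correct abstract.
misregistered = [0.9, 0.3, 0.1]
correct = [0.8, 0.4, 0.2]
score = cosine_similarity(misregistered, correct)  # near 1.0 = very similar
```

A score close to 1 indicates near-identical semantic content, which is why texts that summarize the same article score highly even when their wording differs.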

Conclusions

Metadata misregistration in large repositories like Crossref can compromise data quality. Although our feature-based method detected misregistered abstracts within this dataset, its scalability requires validation on larger datasets.

References

1. Cioffi A, Coppini S, Massari A, et al. Identifying and correcting invalid citations due to DOI errors in Crossref data. Scientometrics. 2022;127:3593-3612. doi:10.1007/s11192-022-04367-w

2. Reimers N, Gurevych I. Sentence-BERT: Sentence embeddings using Siamese BERT-Networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics; 2019:3982-3992. doi:10.18653/v1/D19-1410

1Université Grenoble Alpes, French National Centre for Scientific Research, Grenoble INP, Laboratoire d’Informatique de Grenoble, Grenoble, France, qinyue.liu@univ-grenoble-alpes.fr.

Conflict of Interest Disclosures

None reported.

Funding/Support

The NanoBubbles project has received Synergy grant funding from the European Research Council, within the European Union’s Horizon 2020 program, grant 951393.

Role of Funder/Sponsor

The European Research Council had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the abstract; or the decision to submit the abstract for presentation.
