Abstract
Screening Articles for Tortured Phrases With a Regular Expressions–Based Detector
Alexandre Clausse,1 Guillaume Cabanac,1,2 Pascal Cuxac,3 Cyril Labbé4
Objective
The Problematic Paper Screener (PPS) features 10 detectors to identify problematic articles in the scientific record.1 The detectors search through metadata, references, or textual contents. The tortured phrases2 detector uses a list of more than 7000 identified suspect expressions (Table 25-1159) to query a database (Dimensions; Digital Science) with approximately 130 million scientific articles.3 The PPS displays the screening results on a public website. In this qualitative study, we introduce an alternative screening algorithm based on regular expressions (regex) and benchmark its effectiveness. Publishers are welcome to use this stand-alone algorithm that does not require a subscription to the database that we used.
Design
In May 2025, we designed an algorithm that uses regex based on the PPS fingerprints list in the database to capture different forms of tortured phrases, such as “man-made consciousness,” “profound learning AND deep learning,” and ‘“128 pieces”~5’ [128 bits] (ie, a tortured phrase with components in a 5-word sliding window). This approach is independent from search engines, as it matches fingerprints against the textual content of each article, ignoring its metadata, figure and table contents (including captions), and references. We benchmarked this new approach on the Hindawi extensible markup language (XML) corpus, which contains several hundred articles with tortured phrases, focusing on articles published from 2020 to 2022, the period with the most PPS-flagged articles (n = 3400).
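The three fingerprint forms named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `compile_fingerprint` and `screen` are hypothetical, fingerprints are assumed to be written with straight double quotes, words in a plain phrase are assumed to be separated by whitespace or hyphens, and the sliding window is assumed to be order-insensitive.

```python
import re

# Assumption: a "word" is any run of letters, digits, or underscores.
TOKEN = re.compile(r"\w+")


def compile_fingerprint(fp):
    """Turn one fingerprint string into a function text -> bool (hypothetical helper)."""
    fp = fp.strip()

    # Form 3: "w1 w2"~N, all components inside an N-word sliding window.
    prox = re.fullmatch(r'"([^"]+)"~(\d+)', fp)
    if prox:
        words = [w.lower() for w in TOKEN.findall(prox.group(1))]
        window = int(prox.group(2))

        def match_window(text):
            tokens = [t.lower() for t in TOKEN.findall(text)]
            last_start = max(len(tokens) - window + 1, 1)
            return any(
                all(w in tokens[i:i + window] for w in words)
                for i in range(last_start)
            )

        return match_window

    # Form 2: "A AND B", every sub-fingerprint must occur somewhere in the text.
    if " AND " in fp:
        parts = [compile_fingerprint(p) for p in fp.split(" AND ")]
        return lambda text: all(part(text) for part in parts)

    # Form 1: plain phrase, tolerant of hyphen or whitespace between words.
    words = [re.escape(w) for w in TOKEN.findall(fp)]
    pattern = re.compile(r"\b" + r"[\s\-]+".join(words) + r"\b", re.IGNORECASE)
    return lambda text: pattern.search(text) is not None


def screen(text, fingerprints):
    """Return the fingerprints found in the article body text."""
    return [fp for fp in fingerprints if compile_fingerprint(fp)(text)]


if __name__ == "__main__":
    body = ("We apply profound learning, also called deep learning, "
            "to man made consciousness with a key of 128 or more pieces.")
    fps = ["man-made consciousness",
           "profound learning AND deep learning",
           '"128 pieces"~5']
    for hit in screen(body, fps):
        print(hit)
```

In this sketch, the caller is expected to pass only the article body, with metadata, figure and table contents, and references already stripped, which mirrors the scope described above.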
Results
The regex-based algorithm flagged 2455 problematic articles, with 1948 also flagged by the PPS, yielding a 58% overlap. After removing duplicates, 401 articles flagged by the PPS were missing from the Hindawi XML corpus, amounting to 3371 problematic articles in 139 different journals. The evaluation is ongoing; we analyzed the top 200 results and found 48 false-positive results due to database indexing issues related to special characters and figures and to PPS querying issues. We also found 52 false-negative results due to issues with the fingerprints list and because we excluded figures, tables, and references. Overall, we obtained 100 true-positive results, as both detectors (ie, the PPS and the regex-based algorithm) extracted expressions on the fingerprints list from the same articles.
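The reported figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that the 58% overlap is taken relative to the 3371 problematic articles; the abstract does not state the denominator, so this reading is an assumption, and the variable names are illustrative only.

```python
# Counts taken from the Results paragraph above.
regex_flagged = 2455      # articles flagged by the regex-based algorithm
both_flagged = 1948       # of those, also flagged by the PPS
problematic_total = 3371  # problematic articles after removing duplicates

# Assumption: overlap is measured against the 3371 problematic articles.
overlap_vs_total = both_flagged / problematic_total
# For comparison, the share of regex-flagged articles also flagged by the PPS.
overlap_vs_regex = both_flagged / regex_flagged

print(f"Overlap vs problematic total: {overlap_vs_total:.1%}")
print(f"Overlap vs regex-flagged:     {overlap_vs_regex:.1%}")

# Breakdown of the 200 results analyzed so far.
sample = {"false_positive": 48, "false_negative": 52, "true_positive": 100}
sample_size = sum(sample.values())
print(f"Analyzed results: {sample_size}")
```

Under that assumption the first ratio rounds to the 58% reported, and the three categories of the analyzed sample add up to the 200 results examined.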
Conclusions
The regex-based algorithm yielded results comparable to the current PPS screening process. Other configurations should be tested, such as including figures, tables, and references in the screening, as tortured phrases can also appear there. We invite publishers to use this regex-based approach to screen incoming manuscripts. We noticed several querying issues with the database that we used, and some fingerprints need to be redesigned. We will continue the validation assessment, as there are still 3351 articles to be reassessed. This research could contribute to improving the regex and updating the PPS fingerprints list, while offering detailed feedback and bug reports to the database developer.
References
1. Cabanac G, Labbé C, Magazinov A. The ‘Problematic Paper Screener’ automatically selects suspect publications for post-publication (re)assessment. arXiv. Preprint posted online October 7, 2022. doi:10.48550/arXiv.2210.04895
2. Cabanac G, Labbé C, Magazinov A. Tortured phrases: A dubious writing style emerging in science—evidence of critical issues affecting established journals. arXiv. Preprint posted online July 12, 2021. doi:10.48550/arXiv.2107.06751
3. Herzog C, Hook D, Konkiel S. Bringing down barriers between scientometricians and data. Quantitative Science Studies. 2020;1:387-395. doi:10.1162/qss_a_00020
1Université de Toulouse, IRIT (UMR 5505), Toulouse, France, alexandre.clausse@irit.fr; 2Institut Universitaire de France, Paris, France; 3INIST CNRS, UAR 67, Vandoeuvre-lès-Nancy, France; 4Université Grenoble Alpes, CNRS, Grenoble INP, LIG UMR 5217, Grenoble, France.
Conflicts of Interest Disclosures
Guillaume Cabanac is the administrator of the Problematic Paper Screener, a public platform that uses metadata from Digital Science and PubPeer via no-cost agreements. Guillaume Cabanac and Cyril Labbé have been in touch with most of the major publishers and their integrity officers, offering pro bono consulting regarding detection tools to various actors in the field, including Clear Skies, Morressier, River Valley, Signals, and STM.
Funding/Support
Alexandre Clausse, Guillaume Cabanac, and Cyril Labbé received funding from the European Research Council. Guillaume Cabanac received funding from the Institut Universitaire de France.
