Abstract

Data Repurpose in AI Studies and Scientific Outcomes

Yulin Yu,1 Yong-Yeol Ahn,2 Daniel Romero1,2,3,4

Objective

Datasets are critical in artificial intelligence (AI) research, serving as the foundation for training, evaluating, and developing models. As interest in accelerating AI-driven discoveries grows, it becomes important to examine how strategically repurposing datasets, that is, using them for research topics different from their original context, relates to scientific outcomes. Previous research suggests that combining existing knowledge in novel ways, or recombinant novelty, is associated with greater scientific influence.1,2 While recent studies have shown that combining datasets in unexpected ways correlates with impact, they have not examined how datasets interact with the topics they are used to study.3 To address this gap, we draw on the theory of transformational creativity, in which the accepted conceptual space is modified by altering or removing existing dimensions; drug repurposing is one example. Similarly, AI datasets can be repurposed for new research topics. This study addressed 2 questions: (1) are published articles that repurpose datasets associated with higher scientific disruption or recognition? and (2) do repurposed uses of datasets influence how future articles apply the data, and is this repurpose adoption associated with greater scientific outcomes? We conceptualized scientific impact along 2 dimensions: (1) the extent to which an article is recognized and immediately used by research communities, which can be measured by citations, and (2) the degree to which the research article disrupts existing research paradigms.

Design

We analyzed 13,637 published AI articles collected from Papers With Code and linked to the OpenAlex database from 2015 to 2021. We quantified data repurposing by comparing the semantic content of each article with that of previous articles using the same datasets. The number of previous articles using the same dataset varied across articles, but we required at least 1 prior reuse. Articles were embedded using sentence-level language models. We assessed the relationship between repurposing and scientific disruption, defined as the extent to which an article with repurposed data is cited instead of its references, using ordinary least squares regression. We used negative binomial regression to analyze citation counts. Control variables included dataset novelty, article novelty, author information (team size and experience), publication year, citation/reference counts, and disciplinary fields. These variables help account for potential confounders; for example, more recent articles or datasets may naturally have more prior reuse by other studies. We measured repurpose adoption by comparing each article’s similarity to prior vs future articles using the same datasets. A positive score suggests that future articles adopted the new topical use.
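The measures described in this section can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the function names are hypothetical, repurposing is taken as 1 minus the cosine similarity between an article's embedding and the centroid of prior articles using the same dataset, adoption is taken as similarity to the future centroid minus similarity to the prior centroid, and disruption follows the commonly used disruption index form (n_i - n_j) / (n_i + n_j + n_k).

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D embedding vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def repurposing_score(article_vec, prior_vecs):
    """Assumed form: 1 - similarity to the centroid of prior same-dataset articles."""
    prior_vecs = np.asarray(prior_vecs, dtype=float)
    if prior_vecs.shape[0] < 1:
        # Mirrors the stated requirement of at least 1 prior reuse.
        raise ValueError("need at least one prior article using the dataset")
    return 1.0 - cosine(article_vec, prior_vecs.mean(axis=0))


def adoption_score(article_vec, prior_vecs, future_vecs):
    """Assumed form: similarity to future articles minus similarity to prior ones.

    A positive value suggests later articles moved toward this article's topic.
    """
    prior_centroid = np.asarray(prior_vecs, dtype=float).mean(axis=0)
    future_centroid = np.asarray(future_vecs, dtype=float).mean(axis=0)
    return cosine(article_vec, future_centroid) - cosine(article_vec, prior_centroid)


def disruption_index(n_focal_only, n_both, n_refs_only):
    """Assumed disruption index: (n_i - n_j) / (n_i + n_j + n_k).

    n_focal_only: later articles citing the focal article but none of its references
    n_both:       later articles citing the focal article and its references
    n_refs_only:  later articles citing only the focal article's references
    """
    total = n_focal_only + n_both + n_refs_only
    if total == 0:
        raise ValueError("no citing articles")
    return (n_focal_only - n_both) / total


if __name__ == "__main__":
    # Toy 2-D embeddings; real sentence embeddings have hundreds of dimensions.
    article = [1.0, 0.0]
    prior = [[0.0, 1.0]]
    future = [[1.0, 0.0]]
    print(repurposing_score(article, prior))
    print(adoption_score(article, prior, future))
    print(disruption_index(8, 2, 10))
```

In this toy case the article is orthogonal to prior work and identical to later work, so both the repurposing and adoption scores reach their maximum in the example. The regression step (ordinary least squares for disruption, negative binomial for citations) is not shown here.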

Results

Greater data repurposing was significantly associated with higher disruption scores, corresponding to a 10.2%-SD increase (P < .001) (Figure 25-1139, A). In contrast, data repurposing was slightly negatively associated with citation counts, corresponding to a 1.7%-SD decrease, but this association was not statistically significant (P > .05) (Figure 25-1139, B). Articles with 1-SD higher repurpose adoption were modestly more disruptive (+2%; P < .001) and showed a stronger positive association with future citations (P < .001). This study is limited by its reliance on the OpenAlex database, which may have incomplete metadata.

Conclusions

In this study, repurposing datasets was associated with greater disruption but not with greater initial recognition; the negative association with citations was small and not statistically significant. However, when novel uses of data were adopted in future research, the articles that introduced them were associated with both greater disruption and more citations. These findings have implications for data sharing and research training aimed at promoting innovative reuse of data.

References

1. Uzzi B, Mukherjee S, Stringer M, Jones B. Atypical combinations and scientific impact. Science. 2013;342(6157):468-472. doi:10.1126/science.1240474

2. Leahey E, Lee J, Funk RJ. What types of novelty are most disruptive? Am Sociol Rev. 2023;88(3):562-597. doi:10.1177/00031224231168074

3. Yu Y, Romero DM. Does the use of unusual combinations of datasets contribute to greater scientific impact? Proc Natl Acad Sci U S A. 2024;121(15):e2318482121. doi:10.1073/pnas.2402802121

1School of Information, University of Michigan, Ann Arbor, US, yulinyu@umich.edu; 2Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, US; 3Center for the Study of Complex Systems, University of Michigan, Ann Arbor, US; 4Computer Science and Engineering Division, University of Michigan, Ann Arbor, US.

Conflict of Interest Disclosures

None reported.