PreprintMatch: a new tool to match manuscripts across multiple similarity metrics
Presenting author:
Introduction: Preprints are often posted to open-access servers before or alongside publication in a refereed journal, but it is not always recorded whether and when a preprint is subsequently published in a traditional journal.
Methods: We developed software to identify matches between bioRxiv preprints and PubMed papers using cosine similarity between word-vector representations of the title and abstract, and Jaccard similarity between author lists. A support-vector machine was trained on these similarity scores to classify candidate preprint–paper pairs as matches or non-matches. On a randomly sampled validation set of 979 preprints, our tool achieved F1 = 0.994, compared with F1 = 0.643 for bioRxiv's own matching.
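The two similarity features described above can be sketched as follows. This is a minimal illustration, not the tool's implementation: the tool uses word-vector representations of the title and abstract, whereas here simple bag-of-words counts stand in for them, and the function names are hypothetical.

```python
# Hypothetical sketch of the two similarity features: cosine similarity
# over text (bag-of-words counts stand in for the tool's word vectors)
# and Jaccard similarity over author sets.
import math
from collections import Counter


def cosine_sim(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def jaccard_sim(authors_a, authors_b) -> float:
    """Jaccard similarity: |intersection| / |union| of two author sets."""
    a, b = set(authors_a), set(authors_b)
    return len(a & b) / len(a | b) if a | b else 0.0


# Example: a preprint and paper sharing two of three authors.
title_score = cosine_sim("genome analysis of model organisms",
                         "genome analysis of common model organisms")
author_score = jaccard_sim(["A. Smith", "B. Jones"],
                           ["A. Smith", "B. Jones", "C. Lee"])
```

In the pipeline described above, scores like these (for title, abstract, and authors) would then be passed to the trained support-vector machine to decide whether the pair is a match.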
Results: In the validation set, bioRxiv matched only 53% of true matches; the rest were false negatives, most of which involved titles that changed between preprint and paper. The tool was then run on all bioRxiv preprints. The median time from preprint release to journal publication was 31 weeks, and preprints that took longer to be published were less similar to the final paper across all measured metrics. On average, 0.25 authors were added between preprint and paper publication; preprints from Chinese institutions added authors significantly more often than those from the U.S., U.K., and Germany, but not Russia.
Conclusions: We believe this novel tool and dataset could be a valuable asset for preprint and paper repositories and for researchers conducting analyses on preprints and how they evolve into papers.