Issues and Events
Experimenting with plagiarism detection on the arXiv

When plagiarism is suspected, the submission will be flagged, and the authors will get a message saying "your article has x% overlap with article 'a.' Do you really want to do this?" says Cornell University physicist Paul Ginsparg, the creator and overseer of the arXiv. The authors whose papers were copied from will not be notified. "This will be a fun experiment," Ginsparg says. "Will we train people to be more clever and to make more word changes? Or will there be a real change in their behavior?"

Behavior did change when University of Virginia physicist Louis Bloomfield began using software to see if his students were cheating. Checking new arXiv submissions is a good idea, Bloomfield says. "People should know it's not okay to steal. It's not even okay to publish your own stuff over and over." After he reported students who had copied, they were prosecuted. Forty-five students either left the university or were found guilty, and three degrees were revoked. "I was immersed in seemingly endless honor trials. Two years of my life were burned up. There's a lot of trouble when you open this can of worms. Plagiarism shouldn't be tolerated, but you need a professional organization to handle the heat."

The arXiv's automated scanning for overlapping text is a refinement of an algorithm used last year by Cornell computer science graduate student Daria Sorokina to look at the server's then nearly 300 000 documents. The algorithm assigns unique numbers to word sequences and then compares those numbers across documents. Common phrases such as "this work was supported in part by" are excluded. "There is nothing new about document fingerprinting," says Cornell computer scientist Johannes Gehrke, an adviser on the project. "The novelty here was the application to the arXiv." In the study, about 10% of arXiv manuscripts had text blocks that overlapped with other documents.
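
The fingerprinting idea described above (number each word sequence, drop common phrases, compare the numbers across documents) can be illustrated with a minimal Python sketch. It is not the arXiv's code. The seven-word sequence length, the truncated SHA-1 hash, the overlap measure, and every function name here are assumptions made for illustration; the article does not give those details.

```python
import hashlib
import re

# Hypothetical settings: the article does not state the real ones.
SEQUENCE_LENGTH = 7
COMMON_PHRASES = ["this work was supported in part by"]


def words(text):
    """Lowercase the text and return its words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_on_phrases(tokens, phrases):
    """Cut each occurrence of a common phrase out of the token list.

    Returns the runs of tokens left between the cuts, so that no word
    sequence is built across a removed phrase.
    """
    phrase_tokens = [words(p) for p in phrases]
    runs, current, i = [], [], 0
    while i < len(tokens):
        for p in phrase_tokens:
            if p and tokens[i:i + len(p)] == p:
                runs.append(current)
                current = []
                i += len(p)
                break
        else:
            current.append(tokens[i])
            i += 1
    runs.append(current)
    return runs


def sequence_id(sequence):
    """Map a sequence of words to a 64-bit integer (assumed hash choice)."""
    digest = hashlib.sha1(" ".join(sequence).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprint(text, k=SEQUENCE_LENGTH, common_phrases=COMMON_PHRASES):
    """Return the set of ids of all k-word sequences outside common phrases."""
    ids = set()
    for run in split_on_phrases(words(text), common_phrases):
        for i in range(len(run) - k + 1):
            ids.add(sequence_id(run[i:i + k]))
    return ids


def overlap(fp_a, fp_b):
    """Fraction of document A's word-sequence ids that also occur in B."""
    return len(fp_a & fp_b) / len(fp_a) if fp_a else 0.0
```

Under these assumptions, two manuscripts that share a copied paragraph have many ids in common, while two that share only a funding acknowledgment have none, because the acknowledgment is cut out before any numbers are assigned.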
After removing instances of authors reusing parts of their own text, different collaborators on a single project using the same text in separate conference abstracts, and other apparent false positives, less than 1% of manuscripts were still suspect, says Sorokina. Close examination of 20 pairs of documents with some of the highest levels of overlap exposed 16 as plagiarism. "In one case, an author copied descriptions of five or six methods that he was comparing," says Sorokina. "He didn't cite the sources. But the work of comparing was his own."

One of the most common types of plagiarism found was the lifting of introductory or background material, especially in PhD theses, says Ginsparg. "The surprising thing is that people submit to the same database where they found [what they copied]. It's mind boggling, given the existence of Google, given the existence of searching on full text, that people wouldn't have an intuition that they would be caught."

"Some of it is different ethical norms," Ginsparg adds. "People in different countries, with different intellectual backgrounds, will sometimes argue that what they are doing is completely correct." The reassuring thing, he adds, "is that the most creative people, who are generating the ideas, don't have to start from someone else's article as a template. We'd be very surprised if authors of prominence showed up as perpetrators as opposed to victims."

Document fingerprinting catches only word-for-word plagiarism. But work is under way in the data-mining community on author identification and detection of the flow of ideas, says Gehrke. "Detecting content-based similarities with more sophisticated methods on a macroscale will be the next step."

In addition to implementing a check on new submissions to the arXiv, Ginsparg is talking to the editors of Physical Review Letters about applying the method to that journal and other American Physical Society publications.
"More work needs to be done to include papers outside of the arXiv, and to go across journals," says Marty Blume, the recently retired APS editor-in-chief. "We have 30 000 submissions a year. We'll have to see how much [of the editors'] time it takes to run. And if we do it, what do we do with the results?"

Toni Feder