Increasing plagiarism
2025-05-14 04:55:38.624789+02 by Dan Lyke
A friend today was showing me how he's getting audio processing code out of Google Gemini, and I had to wonder just how much of it was gonna lead to copyright issues. Anyway...
Colin Gordon @csgordon@discuss.systems
When you submit a paper to an ACM journal, it gets run through TurnItIn (yes, really) and the editors in chief have to look at the report and decide if there are plagiarism concerns. Most submissions have a small percentage (~5%) of verbatim-matching text, from a wide variety of sources. The matches are usually small turns of phrase, technical phrases, affiliations, or ACM copyright text 😛 The exceptions are generally extended versions of conference papers, where obviously large chunks of the extension match the original publication.
But recently I've noticed an up-tick, so far only in the wildly-out-of-scope papers that get desk rejected (mostly papers about using LLMs for NLP) of a high percentage of the paper's text (~30%) being flagged as matching, still from a wide variety of sources, but much larger chunks. A long phrase from here, most of a sentence from there, etc., from very scattered sources across different far-ranging fields. This seems unlikely to be from authors picking up phrases they like from papers they actually encountered. I can't help but think these papers have a high fraction of LLM-generated text, and that LLM-generated text on similar topics tends to output a lot of phrases and sentences repeatedly in aggregate, and these patterns are now getting picked up by traditional plagiarism checkers since there's so much LLM-generated text in the world now.