Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents.
We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing's performance is within 33% of the lower bound. Finally, we also give experimental results on Web data, and report experience with MOSS, a widely-used plagiarism detection service.
Some applications of the algorithms described in this paper have been patented by the Regents of the University of California. However, as the patent states (in claim one), the winnowing technique is only patented insofar as it relates to `comparing the contents of a query document to the content on the World Wide Web'.
It is somewhat interesting that the obvious claims one and two of the patent application do not appear in the issued patent. However this may be because they are patented elsewhere. Finally, please note that, in general, patenting algorithms is quite controversial.
Please note that the code given in Figure 5 of the paper has (at least) two bugs: line 21 should read
for(int i=(r-1+w)%w; i!=r; i=(i-1+w)%w)
and not
for(int i=(r-1)%w; i!=r; i=(i-1+w)%w)
The other mistake is harder to correct: essentially it is an error on line 4 to load the buffer with INT_MAX's. The correct code should load the buffer with next_hash()'s and set both r and min equal to w-1. Finally, inside the main loop (beginning on line 13) the `advance' (the first two lines of the block) should be moved to the end, after the `if-else' statement.
If you have any questions or corrections please contact me via email.