On Jul 11, 2011, at 3:00am, Luca Natti wrote:

> Something that also gives false positives is ok,
> because we can check by hand for the final decision on the doc.
>
> I need more specific directions with some examples,
> because we have little time to implement this.

See "Winnowing: Local Algorithms for Document Fingerprinting" by Schleimer, Wilkerson and Aiken, and any papers on MOSS, an implementation of the ideas in that paper for detecting plagiarism.

-- Ken

> On Mon, Jul 11, 2011 at 10:35 AM, Em <[email protected]> wrote:
>
>> Hi Luca,
>>
>> how about quoting another researcher's work? Are you also interested in
>> the amount of quotes in respect to the whole document? I think it is not
>> impossible to let an algorithm find out whether some subsequences in
>> both documents are correctly marked, but it might be hard. Depending on
>> your business-case you might find out that there will be a lot of
>> false-positives when judging someone's work as plagiarism.
>>
>> Another idea to find out similarity between the content of two documents
>> is implemented in Nutch. Fortunately I found a piece of documentation in
>> the solr-api-docs where you can read about it:
>>
>> http://lucene.apache.org/solr/api/org/apache/solr/update/processor/TextProfileSignature.html
>>
>> You could do something like that for content-blocks of a document
>> (several sentences or a fixed window of words). This way you are able to
>> find out similarities between documents where the author has rewritten a
>> part of another researcher's work.
>> This way you are able to find out phrases where the
>> longest-common-subsequence is small but a human would see the
>> similarities between both documents and the possibility of a plagiarism.
>>
>> Regards,
>> Em
>>
>> On 11.07.2011 09:15, Luca Natti wrote:
>>> yes, i'm interested in plagiarism applied to research papers, university
>>> notes, thesis.
>>> Any theory and *best* snippets of code/examples is very appreciated!
>>> thanks in advance for your help,
>>>
>>>
>>> On Sat, Jul 9, 2011 at 5:14 PM, Andrew Clegg <[email protected]> wrote:
>>>
>>>> If 'puzzling' means direct plagiarism, then some sort of
>>>> longest-common-subsequence might be a better metric.
>>>>
>>>> If this isn't what the OP meant, then sorry! 'Puzzling' is a new term for
>>>> me.
>>>>
>>>> On Friday, 8 July 2011, Sergey Bartunov <[email protected]> wrote:
>>>>> You may start from http://en.wikipedia.org/wiki/Latent_semantic_analysis
>>>>>
>>>>> On 8 July 2011 12:47, Luca Natti <[email protected]> wrote:
>>>>>> Is there a way to compute similarity between docs?
>>>>>> And similarity by paragraphs?
>>>>>>
>>>>>> What We want to tell is if a research paper is original or made by
>>>>>> "puzzling" other works.
>>>>>>
>>>>>> thanks!
>>>>>>
>>>>>
>>>>
>>>> --
>>>>
>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>>>
>>>
>>

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions
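
Since the original question asked for a concrete example, here is a rough Python sketch of the winnowing idea from the paper cited above. It is not MOSS itself and not code from the paper. The function names, the defaults k=5 and w=4, the use of a truncated MD5 instead of a rolling hash, and the Jaccard-style score at the end are all assumptions made for illustration.

```python
import hashlib
import re


def kgram_hashes(text, k):
    """Hash every k-character substring of the normalised text."""
    # Normalise: lowercase, then drop everything except letters and digits,
    # so that spacing and punctuation changes do not hide a match.
    norm = re.sub(r"[^a-z0-9]", "", text.lower())
    hashes = []
    for i in range(len(norm) - k + 1):
        # Assumption: truncated MD5 stands in for the rolling hash
        # a real implementation would use for speed.
        digest = hashlib.md5(norm[i:i + k].encode("utf-8")).digest()
        hashes.append(int.from_bytes(digest[:4], "big"))
    return hashes


def winnow(text, k=5, w=4):
    """Return the set of (hash, position) fingerprints picked by winnowing."""
    hashes = kgram_hashes(text, k)
    fingerprints = set()
    # Slide a window of w consecutive k-gram hashes over the document.
    # Texts with fewer than w k-grams yield no fingerprints in this sketch.
    for start in range(max(len(hashes) - w + 1, 0)):
        window = hashes[start:start + w]
        smallest = min(window)
        # On ties, keep the rightmost occurrence of the minimum.
        offset = w - 1 - window[::-1].index(smallest)
        # The set removes repeats when neighbouring windows pick the same hash.
        fingerprints.add((smallest, start + offset))
    return fingerprints


def resemblance(a, b, k=5, w=4):
    """Jaccard overlap of fingerprint hash values (an illustrative score)."""
    fa = {h for h, _ in winnow(a, k, w)}
    fb = {h for h, _ in winnow(b, k, w)}
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

With these settings, any run of at least w + k - 1 = 8 normalised characters shared by two texts produces at least one common fingerprint. So calling resemblance(paragraph, other_paragraph) on each pair of paragraphs gives a per-paragraph signal that can be checked by hand. Larger k reduces noise from common words; larger w gives fewer fingerprints per document.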
