On Jul 11, 2011, at 3:00am, Luca Natti wrote:

> Something that also gives false positives is OK,
> because we can check by hand for the final decision on the doc.
> 
> I need more specific directions with some examples,
> because we have little time to implement this.

See "Winnowing: Local Algorithms for Document Fingerprinting" by Schleimer,
Wilkerson and Aiken.

Also see any papers on MOSS, an implementation of the ideas contained in that
paper, for detecting plagiarism.
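
As a starting point, here is a minimal Python sketch of the winnowing idea: hash every k-gram of the normalized text, slide a window of w consecutive hashes, and keep the minimum hash of each window (the rightmost one on ties). The function names, the defaults k=5 and w=4, the use of truncated MD5 as the hash, and the Jaccard score at the end are all assumptions made for illustration; they are not taken from the paper or from MOSS.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and keep only ASCII letters and digits (an assumed normalization)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def kgram_hashes(text, k):
    """Hash every k-character substring of the normalized text."""
    s = normalize(text)
    return [
        # First 8 bytes of MD5 as an integer; any stable hash would do here.
        int.from_bytes(hashlib.md5(s[i:i + k].encode("utf-8")).digest()[:8], "big")
        for i in range(len(s) - k + 1)
    ]


def winnow(text, k=5, w=4):
    """Return the set of (hash, position) fingerprints selected by winnowing."""
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    # Text shorter than one full window: treat all hashes as a single window.
    w = min(w, len(hashes))
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # On ties, pick the rightmost occurrence of the minimum.
        offset = w - 1 - window[::-1].index(smallest)
        selected.add((smallest, start + offset))
    return selected


def similarity(doc_a, doc_b, k=5, w=4):
    """Jaccard overlap of the two fingerprint hash sets (an assumed scoring choice)."""
    a = {h for h, _ in winnow(doc_a, k, w)}
    b = {h for h, _ in winnow(doc_b, k, w)}
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    original = "The quick brown fox jumps over the lazy dog."
    suspect = "As noted before, the quick brown fox jumps over a sleeping cat."
    print(similarity(original, suspect))
```

With these settings, any shared run of at least w + k - 1 = 8 normalized characters produces at least one common fingerprint, so copied passages show up even when the surrounding text differs. The recorded positions can be used to point a human reviewer at the matching regions, and running the same functions per paragraph gives a paragraph-level comparison.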

-- Ken

> On Mon, Jul 11, 2011 at 10:35 AM, Em <[email protected]> wrote:
> 
>> Hi Luca,
>> 
>> What about quotations of another researcher's work? Are you also interested
>> in the amount of quoted text relative to the whole document? I think it is
>> not impossible to let an algorithm find out whether some subsequences in
>> both documents are correctly marked as quotes, but it might be hard.
>> Depending on your business case, you might find that there will be a lot of
>> false positives when judging someone's work as plagiarism.
>> 
>> Another idea to find out similarity between the content of two documents
>> is implemented in Nutch. Fortunately I found a piece of documentation in
>> the solr-api-docs where you can read about it:
>> 
>> http://lucene.apache.org/solr/api/org/apache/solr/update/processor/TextProfileSignature.html
>> 
>> You could do something like that for content-blocks of a document
>> (several sentences or a fixed window of words). This way you are able to
>> find out similarities between documents where the author has rewritten a
>> part of another researcher's work.
>> This way you are able to find phrases where the
>> longest-common-subsequence is small but a human would see the
>> similarities between both documents and the possibility of plagiarism.
>> 
>> Regards,
>> Em
>> 
>> Am 11.07.2011 09:15, schrieb Luca Natti:
>>> Yes, I'm interested in plagiarism detection applied to research papers,
>>> university notes, and theses.
>>> Any theory and *best* snippets of code/examples are much appreciated!
>>> thanks in advance for your help,
>>> 
>>> 
>>> On Sat, Jul 9, 2011 at 5:14 PM, Andrew Clegg <[email protected]> wrote:
>>> 
>>>> If 'puzzling' means direct plagiarism, then some sort of
>>>> longest-common-subsequence might be a better metric.
>>>> 
>>>> If this isn't what the OP meant, then sorry! 'Puzzling' is a new term
>>>> for me.
>>>> 
>>>> On Friday, 8 July 2011, Sergey Bartunov <[email protected]> wrote:
>>>>> You may start from http://en.wikipedia.org/wiki/Latent_semantic_analysis
>>>>> 
>>>>> On 8 July 2011 12:47, Luca Natti <[email protected]> wrote:
>>>>>> Is there a way to compute similarity between docs?
>>>>>> And similarity by paragraphs?
>>>>>> 
>>>>>> What we want to tell is whether a research paper is original or made by
>>>>>> "puzzling" together other works.
>>>>>> 
>>>>>> thanks!
>>>>>> 
>>>>> 
>>>> 
>>>> --
>>>> 
>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>>> 
>>> 
>> 

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions





