I've seen people doing all kinds of things to detect this. A few
directions to research (rough sketches of each follow the list):

- suffix trees / suffix arrays to detect longest common substrings
(these only find exact matches, though),
- bioinformatics, in particular sequence alignment as used in gene
sequencing, to detect long near-matching sequences (a relaxation of
the above; I'm not familiar with any particular algorithms offhand,
but I imagine this is a well-explored space given the funds they
receive ;),
- techniques for fuzzy matching / near-duplicate detection, combined
with arbitrary document chunking or method-specific representations
(for example shingles).
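
To make the first point concrete, here's a rough Python sketch of the
suffix-array idea: concatenate the two documents with a sentinel, sort
the suffixes, and the longest common substring shows up as the maximal
common prefix of two adjacent suffixes that come from different
documents. Naive O(n^2 log n) construction, purely illustrative (the
example texts are made up):

# Longest common substring of two documents via a (naively built)
# suffix array. Fine for illustration, not for production.

def longest_common_substring(a: str, b: str) -> str:
    sep = "\x00"                  # sentinel assumed absent from both texts
    text = a + sep + b
    n = len(text)
    # Suffix array: start indices of all suffixes, sorted lexicographically.
    sa = sorted(range(n), key=lambda i: text[i:])

    def lcp(i: int, j: int) -> int:
        """Length of the common prefix of the suffixes starting at i and j."""
        k = 0
        while i + k < n and j + k < n and text[i + k] == text[j + k]:
            k += 1
        return k

    best_len, best_pos = 0, 0
    for x, y in zip(sa, sa[1:]):
        # Only pairs where one suffix starts in `a` and the other in `b`
        # witness a *shared* substring. The sentinel occurs once, so a
        # common prefix can never cross it.
        if (x < len(a)) != (y < len(a)):
            k = lcp(x, y)
            if k > best_len:
                best_len, best_pos = k, x
    return text[best_pos:best_pos + best_len]

print(longest_common_substring("the quick brown fox", "a quick brown dog"))
# -> " quick brown "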
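
For the alignment direction, the classic algorithm is Smith-Waterman
(local alignment). A toy version over words rather than characters, so
a lightly edited passage still scores high; the match/mismatch/gap
scores are arbitrary choices I made for illustration:

# Smith-Waterman local alignment between two word sequences.
# Returns the best local-alignment score (higher = longer/closer match).

def smith_waterman(words_a, words_b, match=2, mismatch=-1, gap=-1):
    rows, cols = len(words_a) + 1, len(words_b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if words_a[i - 1] == words_b[j - 1] else mismatch
            h[i][j] = max(0,                     # local: never go negative
                          h[i - 1][j - 1] + s,   # (mis)match
                          h[i - 1][j] + gap,     # gap in words_b
                          h[i][j - 1] + gap)     # gap in words_a
            best = max(best, h[i][j])
    return best

a = "the results clearly show a significant improvement".split()
b = "results show a very significant improvement".split()
print(smith_waterman(a, b))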
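
And a minimal sketch of shingling: represent each document as a set of
word k-grams and compare the sets with Jaccard similarity. Real
systems hash the shingles and use MinHash/LSH to make this scale; this
is just the core idea:

# Word shingles + Jaccard similarity for near-duplicate detection.

def shingles(text: str, k: int = 3) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(s1: set, s2: set) -> float:
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

doc1 = "we propose a novel method for detecting near duplicate documents"
doc2 = "we propose a new method for detecting near duplicate papers"
print(round(jaccard(shingles(doc1), shingles(doc2)), 2))
# -> 0.33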

These should yield tons of reading material to start with (Google
Scholar, CiteSeer). Sorry for not being more specific.

Dawid

On Mon, Jul 11, 2011 at 9:15 AM, Luca Natti <[email protected]> wrote:
> yes, I'm interested in plagiarism detection applied to research papers,
> university notes, and theses.
> Any theory and good snippets of code/examples are very much appreciated!
> thanks in advance for your help,
>
>
> On Sat, Jul 9, 2011 at 5:14 PM, Andrew Clegg <[email protected]> wrote:
>
>> If 'puzzling' means direct plagiarism, then some sort of
>> longest-common-subsequence might be a better metric.
>>
>> If this isn't what the OP meant, then sorry! 'Puzzling' is a new term for
>> me.
>>
>> On Friday, 8 July 2011, Sergey Bartunov <[email protected]> wrote:
>> > You may start from http://en.wikipedia.org/wiki/Latent_semantic_analysis
>> >
>> > On 8 July 2011 12:47, Luca Natti <[email protected]> wrote:
>> >> Is there a way to compute similarity between docs?
>> >> And similarity by paragraphs?
>> >>
>> >> What we want to tell is whether a research paper is original or made
>> >> by "puzzling" together other works.
>> >>
>> >> thanks!
>> >>
>> >
>>
>> --
>>
>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>
>
