Re: Document Similarity Algorithm at Solr/Lucene

Lance Norskog Wed, 07 Aug 2013 16:40:42 -0700

Block-quoting and plagiarism are two different questions.

Block-quoting is simple: break the text apart into sentences or even paragraphs and make them separate documents. Make facets of the post-analysis text. Now just pull counts of facets and block quotes will be clear.
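
To make the idea concrete, here is a rough sketch in plain Python rather than Solr itself. It mimics what the facet counts would tell you: which analyzed sentences occur in more than one source document. The names normalize, split_sentences, and shared_sentences are made up for this illustration, the regex splitter is a naive stand-in for a real sentence detector, and the lowercase/word-token step only approximates an analysis chain.

```python
import re
from collections import defaultdict

def normalize(sentence):
    # Rough stand-in for an analysis chain: lowercase, keep only word tokens.
    return " ".join(re.findall(r"\w+", sentence.lower()))

def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def shared_sentences(docs, min_docs=2):
    # Map each normalized sentence to the set of document ids containing it,
    # then keep only the sentences found in at least min_docs documents.
    seen = defaultdict(set)
    for doc_id, text in docs.items():
        for sentence in split_sentences(text):
            key = normalize(sentence)
            if key:
                seen[key].add(doc_id)
    return {k: sorted(v) for k, v in seen.items() if len(v) >= min_docs}

# Hypothetical sample data for illustration only.
docs = {
    "post_a": "Solr is built on Lucene. It adds faceting and a REST-like API.",
    "post_b": "I read this yesterday. Solr is built on Lucene! I agree.",
}
print(shared_sentences(docs))
```

In Solr the same effect would come from faceting on the analyzed sentence field and looking for values whose count is greater than one; the dictionary here only plays the role of those facet counts.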

Mahout has a scalable implementation of n-gram based document similarity. It calculates distances between all documents and identifies clusters of similar documents. This is a much more general technique and may help you find "obfuscated" plagiarism.
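
As a small-scale illustration of n-gram based similarity, the sketch below compares word trigrams with Jaccard similarity over every pair of documents. This is not Mahout's code or API; shingles, jaccard, and similar_pairs are hypothetical helpers, and the trigram size and 0.3 threshold are arbitrary assumptions. It reports similar pairs only and does not do the clustering step.

```python
import re
from itertools import combinations

def shingles(text, n=3):
    # Set of word n-grams (shingles) from the lowercased text.
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    # Jaccard similarity: shared shingles divided by all distinct shingles.
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def similar_pairs(docs, n=3, threshold=0.3):
    # Score every pair of documents and keep those at or above the threshold.
    sets = {doc_id: shingles(text, n) for doc_id, text in docs.items()}
    pairs = []
    for x, y in combinations(sorted(sets), 2):
        score = jaccard(sets[x], sets[y])
        if score >= threshold:
            pairs.append((x, y, score))
    return pairs

# Hypothetical sample data for illustration only.
docs = {
    "essay_1": "the quick brown fox jumps over the lazy dog",
    "essay_2": "the quick brown fox leaps over the lazy dog",
    "essay_3": "search engines build an inverted index of terms",
}
print(similar_pairs(docs))
```

Changing a single word still leaves most trigrams shared, which is why this catches lightly reworded copies that exact sentence matching misses. Comparing all pairs is quadratic in the number of documents, so this only works for small sets; that is where a scalable implementation comes in.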


Lance

On 07/23/2013 02:33 AM, Furkan KAMACI wrote:

Hi;

Sometimes a large part of one document also appears in another, as in
student plagiarism or when one blog post quotes another.
Do Solr/Lucene or related libraries (UIMA, OpenNLP, etc.) have any class to
detect it?
