BTW, how does Solr's MoreLikeThis component work? Which algorithm does it use under the hood?

2013/7/24 Roman Chyla <roman.ch...@gmail.com>

> This paper contains an excellent algorithm for plagiarism detection, but
> beware: the published version had a mistake in the algorithm - look for
> corrections - I can't find them now, but I know they have been published
> (perhaps by one of the co-authors). You could do it with Solr by creating an
> index of hashes, with the twist of storing the position of the original text
> (the source of the hash) together with the token, and the Solr highlighting
> would do the rest for you :)
>
> roman
>
>
> On Tue, Jul 23, 2013 at 11:07 AM, Shashi Kant <sk...@sloan.mit.edu> wrote:
>
> > Here is a paper that I found useful:
> > http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
> >
> >
> > On Tue, Jul 23, 2013 at 10:42 AM, Furkan KAMACI <furkankam...@gmail.com>
> > wrote:
> > > Thanks for your comments.
> > >
> > > 2013/7/23 Tommaso Teofili <tommaso.teof...@gmail.com>
> > >
> > >> if you need a specialized algorithm for detecting blogpost plagiarism /
> > >> quotations (which are different tasks IMHO) I think you have 2 options:
> > >> 1. implement a dedicated one based on your features / metrics / domain
> > >> 2. try to fine tune an existing algorithm that is flexible enough
> > >>
> > >> If I were to do it with Solr I'd probably do something like:
> > >> 1. index "original" blogposts in Solr (possibly using Jack's suggestion
> > >>    about ngrams / shingles)
> > >> 2. do MLT queries with "candidate blogposts copies" text
> > >> 3. get the first, say, 2-3 hits
> > >> 4. mark it as quote / plagiarism
> > >> 5. eventually train a classifier to help you mark other texts as quote /
> > >>    plagiarism
> > >>
> > >> HTH,
> > >> Tommaso
> > >>
> > >>
> > >> 2013/7/23 Furkan KAMACI <furkankam...@gmail.com>
> > >>
> > >> > Actually I need a specialized algorithm. I want to use that algorithm
> > >> > to detect duplicate blog posts.
> > >> >
> > >> > 2013/7/23 Tommaso Teofili <tommaso.teof...@gmail.com>
> > >> >
> > >> > > Hi,
> > >> > >
> > >> > > You may leverage and / or improve the MLT component [1].
> > >> > >
> > >> > > HTH,
> > >> > > Tommaso
> > >> > >
> > >> > > [1] : http://wiki.apache.org/solr/MoreLikeThis
> > >> > >
> > >> > >
> > >> > > 2013/7/23 Furkan KAMACI <furkankam...@gmail.com>
> > >> > >
> > >> > > > Hi;
> > >> > > >
> > >> > > > Sometimes a huge part of a document may exist in another document,
> > >> > > > as in student plagiarism or the quotation of a blog post in another
> > >> > > > blog post. Do Solr/Lucene or their libraries (UIMA, OpenNLP, etc.)
> > >> > > > have any class to detect it?
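
To make the shingle idea from the thread a bit more concrete, here is a rough sketch in plain Python of how word shingles with stored positions could be used to measure how much of a candidate post also appears in an original. This is only an illustration under my own assumptions: it is not how Solr's MoreLikeThis works internally, and it is not the algorithm from the paper linked above. The function names (shingles, containment, matched_positions) and the shingle size k=3 are made up for the example.

```python
import re


def shingles(text, k=3):
    """Map each k-word shingle to the word offsets where it starts.

    Keeping the offsets mirrors the suggestion of storing the position
    of the original text alongside each token.
    """
    words = re.findall(r"\w+", text.lower())
    index = {}
    for pos in range(len(words) - k + 1):
        index.setdefault(tuple(words[pos:pos + k]), []).append(pos)
    return index


def containment(candidate, original, k=3):
    """Fraction of the candidate's distinct shingles found in the original."""
    cand = shingles(candidate, k)
    orig = shingles(original, k)
    if not cand:
        return 0.0
    shared = [s for s in cand if s in orig]
    return len(shared) / len(cand)


def matched_positions(candidate, original, k=3):
    """Word offsets in the original where a shared shingle starts."""
    cand = shingles(candidate, k)
    orig = shingles(original, k)
    return sorted(p for s, ps in orig.items() if s in cand for p in ps)


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    candidate = "a quick brown fox jumps high"
    print(containment(candidate, original))
    print(matched_positions(candidate, original))
```

A high containment score would flag the candidate for a closer look, and the matched offsets say which part of the original to highlight. In a Solr setup the shingling would presumably happen in the analysis chain at index time rather than in client code, so treat this purely as a way to reason about the scoring.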