Re: Document Similarity Algorithm at Solr/Lucene

Furkan KAMACI Thu, 25 Jul 2013 01:18:52 -0700

BTW, how does Solr's MoreLikeThis component work? Which algorithm does it use
under the hood?



2013/7/24 Roman Chyla <roman.ch...@gmail.com>

> This paper contains an excellent algorithm for plagiarism detection, but
> beware: the published version had a mistake in the algorithm. Look for the
> corrections. I can't find them now, but I know they have been published
> (perhaps by one of the co-authors). You could do it with Solr by creating an
> index of hashes, with the twist of storing the position of the original text
> (the source of the hash) together with the token; Solr highlighting
> would then do the rest for you :)
>
> roman
>
>
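To make the "index of hashes plus positions" idea concrete, here is a minimal sketch in the spirit of winnowing-style fingerprinting. It is an illustration only: it does not include the published corrections mentioned above, and the function names, the md5-based hash, and the values of k and window are all assumptions chosen for the example, not anything taken from Solr or from the paper.

```python
import hashlib

def normalize(text):
    """Keep lowercase alphanumerics and remember each one's offset in the original."""
    chars, offsets = [], []
    for i, ch in enumerate(text):
        if ch.isalnum():
            chars.append(ch.lower())
            offsets.append(i)
    return "".join(chars), offsets

def winnow(text, k=5, window=4):
    """Return (hash, original_offset) fingerprints; k and window are illustrative."""
    clean, offsets = normalize(text)
    # Hash every k-character window and keep the offset where it starts
    # in the ORIGINAL text, so a match can be located (or highlighted) later.
    hashes = [
        (int(hashlib.md5(clean[i:i + k].encode("utf-8")).hexdigest()[:8], 16), offsets[i])
        for i in range(len(clean) - k + 1)
    ]
    if not hashes:
        return []
    fingerprints = []
    for start in range(max(1, len(hashes) - window + 1)):
        win = hashes[start:start + window]
        # Smallest hash in the window; on ties prefer the rightmost occurrence.
        best = min(win, key=lambda hp: (hp[0], -hp[1]))
        if not fingerprints or fingerprints[-1] != best:
            fingerprints.append(best)
    return fingerprints

def shared_fingerprints(original, candidate, k=5, window=4):
    """List (candidate_offset, [original_offsets]) for fingerprints found in both texts."""
    index = {}
    for h, pos in winnow(original, k, window):
        index.setdefault(h, []).append(pos)
    return [(pos, index[h]) for h, pos in winnow(candidate, k, window) if h in index]
```

In a real setup the dictionary would be replaced by the search index, with each hash stored as a token and its offset kept alongside it; the in-memory version above only shows the data that would need to be stored.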
> On Tue, Jul 23, 2013 at 11:07 AM, Shashi Kant <sk...@sloan.mit.edu> wrote:
>
> > Here is a paper that I found useful:
> > http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
> >
> >
> > On Tue, Jul 23, 2013 at 10:42 AM, Furkan KAMACI <furkankam...@gmail.com>
> > wrote:
> > > Thanks for your comments.
> > >
> > > 2013/7/23 Tommaso Teofili <tommaso.teof...@gmail.com>
> > >
> > >> If you need a specialized algorithm for detecting blog post plagiarism /
> > >> quotations (which are different tasks IMHO), I think you have 2 options:
> > >> 1. implement a dedicated one based on your features / metrics / domain
> > >> 2. try to fine-tune an existing algorithm that is flexible enough
> > >>
> > >> If I were to do it with Solr I'd probably do something like:
> > >> 1. index "original" blogposts in Solr (possibly using Jack's suggestion
> > >> about ngrams / shingles)
> > >> 2. do MLT queries with "candidate blogposts copies" text
> > >> 3. get the first, say, 2-3 hits
> > >> 4. mark it as quote / plagiarism
> > >> 5. eventually train a classifier to help you mark other texts as quote /
> > >> plagiarism
> > >>
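Steps 1 to 4 above can be tried out without a Solr instance. The sketch below is an in-memory stand-in, not what MoreLikeThis does internally: it uses word shingles and a simple containment score in place of an MLT query. The function names, the shingle size, the hit limit, and the 0.5 threshold are all assumptions made for illustration.

```python
def shingles(text, n=3):
    """Set of n-word shingles from lowercased, whitespace-split text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def containment(candidate, original, n=3):
    """Fraction of the candidate's shingles that also occur in the original."""
    cand = shingles(candidate, n)
    if not cand:
        return 0.0
    return len(cand & shingles(original, n)) / len(cand)

def top_hits(candidate, originals, n=3, limit=3, threshold=0.5):
    """Return up to `limit` (doc_id, score) pairs scoring at least `threshold`."""
    scored = [(containment(candidate, text, n), doc_id)
              for doc_id, text in originals.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score >= threshold]
    # Highest score first; break ties by document id for a stable order.
    scored.sort(key=lambda sd: (-sd[0], sd[1]))
    return [(doc_id, score) for score, doc_id in scored[:limit]]

# Hypothetical data: two "original" posts and one candidate copy.
originals = {
    "post-a": "the quick brown fox jumps over the lazy dog",
    "post-b": "solr is a search server built on lucene",
}
candidate = "he wrote that the quick brown fox jumps over the lazy dog yesterday"
hits = top_hits(candidate, originals)  # candidates to mark as quote / plagiarism
```

Containment is used here rather than a symmetric measure because a short quotation inside a long post should still score against its source; whether that is the right choice depends on whether quotes or full copies are the target.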
> > >> HTH,
> > >> Tommaso
> > >>
> > >>
> > >>
> > >> 2013/7/23 Furkan KAMACI <furkankam...@gmail.com>
> > >>
> > >> > Actually I need a specialized algorithm. I want to use that algorithm to
> > >> > detect duplicate blog posts.
> > >> >
> > >> > 2013/7/23 Tommaso Teofili <tommaso.teof...@gmail.com>
> > >> >
> > >> > > Hi,
> > >> > >
> > >> > > I think you may leverage and / or improve the MLT component [1].
> > >> > >
> > >> > > HTH,
> > >> > > Tommaso
> > >> > >
> > >> > > [1] : http://wiki.apache.org/solr/MoreLikeThis
> > >> > >
> > >> > >
> > >> > > 2013/7/23 Furkan KAMACI <furkankam...@gmail.com>
> > >> > >
> > >> > > > Hi;
> > >> > > >
> > >> > > > Sometimes a large part of a document may exist in another document, as
> > >> > > > in student plagiarism or the quotation of one blog post in another blog
> > >> > > > post. Do Solr/Lucene or their libraries (UIMA, OpenNLP, etc.) have any
> > >> > > > class to detect it?
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
>
