Re: Document Comparison with Mahout

dc tech Wed, 07 Jul 2010 19:07:03 -0700

Whole-document similarity is unlikely to work here. The typical case is a
term paper for a class, where all the papers will look similar anyway (same
topic, many shared words, etc.). One approach (suggested in a book; I do not
recall the title now) is to take a sample of text fragments from document 1
and use those fragments as queries against the larger corpus. Plagiarism is
suggested if m of the n fragments match in another document, assuming the
cheater is smart enough not to have copied the entire document. Key
questions would be:
- how many text fragments (n) to take from the document under consideration
(call it doc 1), what fragment size to use, and how to extract them (e.g. at
sentence breaks)
- how many matches constitute a possible match (e.g. out of 10 fragments,
flag a match when 6 show up in a different document)
- one pass only or multiple passes
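
To make the m-of-n idea concrete, here is a rough single-pass sketch in plain
Python. It is not Mahout, Solr or Nutch code. All function names are made up
for illustration, the sentence splitter is a naive regex, and "match" is
assumed to mean a verbatim substring hit after lowercasing and whitespace
cleanup. A real setup would more likely run the fragments as phrase queries
against a search index.

```python
import random
import re


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def normalize(text):
    """Lowercase and collapse whitespace so trivial edits do not hide a match."""
    return " ".join(text.lower().split())


def sample_fragments(doc, n, seed=0):
    """Pick up to n sentence fragments from doc 1 (hypothetical helper)."""
    sentences = split_sentences(doc)
    rng = random.Random(seed)
    return rng.sample(sentences, min(n, len(sentences)))


def suspicious_documents(doc, corpus, n=10, m=6, seed=0):
    """Return {doc_id: hit_count} for corpus docs matching at least m fragments.

    corpus is assumed to be a dict mapping a document id to its text.
    """
    fragments = [normalize(f) for f in sample_fragments(doc, n, seed)]
    hits = {}
    for doc_id, other in corpus.items():
        haystack = normalize(other)
        # Count how many sampled fragments appear verbatim in this document.
        count = sum(1 for f in fragments if f in haystack)
        if count >= m:
            hits[doc_id] = count
    return hits
```

The defaults n=10 and m=6 just mirror the example numbers above; the right
values, and whether to do a second pass on flagged documents, are exactly
the open questions listed.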


Hope that helps.





On Wed, Jul 7, 2010 at 2:19 PM, Grant Ingersoll <[email protected]> wrote:

> How do you want to determine copy?  Strictly or loosely?  Solr and Nutch
> have some deduplication capabilities, including fuzzy matching.  They
> probably could be brought into Mahout, too.
>
> -Grant
>
> On Jul 7, 2010, at 10:23 AM, JAGANADH G wrote:
>
> > Dear All
> >
> > Is there any way or algorithm available to compare two documents?
> > E.g. check if doc "A" is a copy (plagiarised version) of document "B".
> >
> > With regards
> >
> > --
> > **********************************
> > JAGANADH G
> > http://jaganadhg.freeflux.net/blog
>
>
