Whole-document similarity is unlikely to work here. The typical case is a term paper for a class, where all the papers will be similar anyway (many similar words, etc.).

One approach (suggested in a book whose title I do not recall right now) is to take a sample of text fragments from document 1 and use those fragments as queries against the larger corpus. Plagiarism may be suggested if m of the n fragments match in another document. This assumes the cheater is smart and has at least not copied the entire document.

Key questions would be:

- how many text fragments (n) to take from the document under consideration (call it doc 1), what fragment size to use, and how to extract them (e.g. at sentence breaks)
- how many matches constitute a possible match (e.g. out of 10 fragments, a match is when 6 show up in a different document)
- one pass only or multiple passes
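To make the fragment idea concrete, here is a rough sketch in plain Python. It is only an illustration under my own assumptions, not the method from the book: fragments are whole sentences, a "match" is an exact substring hit after lowercasing and whitespace cleanup, and it does a single pass. The function names and the defaults n=10 and m=6 are made up for the example. In practice you would probably send each fragment as a phrase query to a search index rather than scan the raw text.

```python
import random
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial edits do not hide a match."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def extract_fragments(text, min_words=5):
    """Split at sentence-ending punctuation; keep sentences with enough words."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [normalize(s) for s in sentences if len(s.split()) >= min_words]


def suspicious_documents(doc, corpus, n=10, m=6, seed=0):
    """Sample up to n fragments from doc and return {name: hits} for every
    corpus document that contains at least m of them.

    corpus is assumed to be a dict mapping a document name to its text.
    """
    fragments = extract_fragments(doc)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = rng.sample(fragments, min(n, len(fragments)))

    flagged = {}
    for name, text in corpus.items():
        haystack = normalize(text)
        # Exact substring test stands in for a phrase query against an index.
        hits = sum(1 for fragment in sample if fragment in haystack)
        if hits >= m:
            flagged[name] = hits
    return flagged
```

Exact matching will miss fragments that were lightly reworded, so a fuzzier comparison per fragment may be needed, and the n and m values would have to be tuned on real papers.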
Hope that helps.

On Wed, Jul 7, 2010 at 2:19 PM, Grant Ingersoll <[email protected]> wrote:

> How do you want to determine copy? Strictly or loosely? Solr and Nutch
> have some deduplication capabilities, including fuzzy matching. They
> probably could be brought into Mahout, too.
>
> -Grant
>
> On Jul 7, 2010, at 10:23 AM, JAGANADH G wrote:
>
> > Dear All
> >
> > Is there any way or algo available to compare tow documents.
> > Eg. Check if doc "A" is a copy (palagirised version) of document "B".
> >
> > With regards
> >
> > --
> > **********************************
> > JAGANADH G
> > http://jaganadhg.freeflux.net/blog
