Re: Document Comparison with Mahout

JAGANADH G Wed, 07 Jul 2010 23:26:50 -0700

On Thu, Jul 8, 2010 at 7:35 AM, dc tech <[email protected]> wrote:


> Document similarity is unlikely to work, since the typical case is a term
> paper for a class, where all the papers will be similar (many shared words,
> etc.). One approach (suggested in a book whose title I do not recall now) is
> to take a sample of text fragments from document 1 and use those fragments
> as queries against the larger corpus. Plagiarism may be suggested if m of
> the n fragments match, assuming the cheater is smart and has at least not
> copied the entire document. Key questions would be:
> - how many text fragments (n) to take from the document under consideration
>   (call it doc 1), what fragment size, and which extraction technique (e.g.
>   sentence breaks)
> - how many matches constitute a possible match (e.g. out of 10 fragments,
>   a match is when 6 show up in a different document)
> - one pass only or multiple passes
>

I got the same idea from some research papers.
Somewhere I saw that LSI (latent semantic indexing) can also be useful for
this, but I don't know the details.
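
The fragment-query idea quoted above could be sketched roughly as follows.
This is only a minimal illustration in plain Python, not Mahout code: the
function names (split_sentences, sample_fragments, suspicious_documents), the
naive sentence splitter, and the exact-phrase matching are all assumptions
made for the example. A real setup would more likely run the fragments as
phrase queries against a search index, and n and m would need tuning.

```python
import random
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def normalize(text):
    """Lowercase and keep only alphanumeric tokens, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def sample_fragments(doc, n, seed=0):
    """Pick up to n sentences from doc to use as query fragments."""
    sentences = split_sentences(doc)
    rng = random.Random(seed)  # fixed seed so a run is repeatable
    return rng.sample(sentences, min(n, len(sentences)))


def suspicious_documents(doc, corpus, n=10, m=6, seed=0):
    """Return {doc_id: hits} for corpus documents matching at least m fragments.

    corpus is a dict mapping a document id to its text (hypothetical layout).
    """
    fragments = [normalize(f) for f in sample_fragments(doc, n, seed)]
    fragments = [f for f in fragments if f]  # drop fragments with no tokens
    flagged = {}
    for doc_id, other in corpus.items():
        # Pad with spaces so a fragment only matches on whole-word boundaries.
        haystack = " " + normalize(other) + " "
        hits = sum(1 for f in fragments if " " + f + " " in haystack)
        if hits >= m:
            flagged[doc_id] = hits
    return flagged
```

Exact phrase matching will miss lightly reworded copies, so fragment size
matters: shorter fragments survive small edits better but also produce more
accidental matches between term papers on the same topic.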
-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog
