On Thu, Jul 8, 2010 at 7:35 AM, dc tech <[email protected]> wrote:
> Document similarity is unlikely to work, as the typical case is a term paper
> for a class, where the papers will all be similar (many similar words, etc.).
> One approach (suggested in a book; I do not recall the title now) is to take
> a sample of text fragments from document 1 and use those fragments as queries
> against the larger corpus. Plagiarism may be suggested if m of the n
> fragments match, assuming the cheater is smart and has at least not copied
> the entire document. Key questions would be:
> - how many text fragments (n) to take from the document under consideration
>   (call it doc 1), plus the fragment size and extraction technique (e.g.
>   sentence breaks)
> - how many matches constitute a possible match (e.g. out of 10 fragments, a
>   match is when 6 show up in a different document)
> - one pass only or multiple passes
>

I got the same idea from some research papers. Somewhere I also saw that LSI
(latent semantic indexing) can be useful for this, but I don't know the
details.

-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog
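
P.S. A rough sketch of the fragment-sampling idea quoted above, in case it helps the discussion. This is only an illustration under assumptions: the function names (`normalize`, `sample_fragments`, `suspicious_documents`) are made up for this example, fragments are whole sentences split on `.`, `!` or `?`, and a "query" is a plain substring check on normalized text rather than a phrase query against a real search index. The defaults n=10 and m=6 are just the numbers from the example in the quoted message.

```python
import random
import re


def normalize(text):
    """Lowercase and keep only alphanumeric words, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def sample_fragments(doc, n, min_words=5, seed=0):
    """Split doc on sentence breaks and sample up to n distinct sentences.

    Sentences shorter than min_words words are skipped, since very short
    fragments would match almost anything. (min_words is an assumption.)
    """
    sentences = re.split(r"(?<=[.!?])\s+", doc)
    candidates = [normalize(s) for s in sentences]
    candidates = [s for s in candidates if len(s.split()) >= min_words]
    candidates = list(dict.fromkeys(candidates))  # drop duplicates, keep order
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    return rng.sample(candidates, min(n, len(candidates)))


def suspicious_documents(doc, corpus, n=10, m=6):
    """Return {doc_id: hits} for corpus documents matching >= m fragments.

    corpus is a dict mapping a document id to its text.
    """
    fragments = sample_fragments(doc, n)
    flagged = {}
    for doc_id, text in corpus.items():
        # Pad with spaces so a fragment only matches on word boundaries.
        haystack = " " + normalize(text) + " "
        hits = sum(1 for frag in fragments if " " + frag + " " in haystack)
        if hits >= m:
            flagged[doc_id] = hits
    return flagged
```

With a real corpus, the inner loop would presumably be replaced by one phrase query per fragment against the index, counting hits per returned document. Exact matching like this only catches verbatim copying; reworded passages would need something fuzzier.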
