subject:"Detection of index dublicates in Lucene"

Re: Detection of index dublicates in Lucene

2007-07-30 Thread karl wettin

On 30 Jul 2007, at 14:43, Grant Ingersoll wrote: I believe Nutch has a duplicate detection algorithm. I don't know how easy it would be to run independently on a Lucene index. There have also been a bunch of near-duplicate ideas that have been presented on the forums before. This is one of t

Re: Detection of index dublicates in Lucene

2007-07-30 Thread Grant Ingersoll

I believe Nutch has a duplicate detection algorithm. I don't know how easy it would be to run independently on a Lucene index. -Grant On Jul 29, 2007, at 2:18 AM, Dmitry wrote: We are trying to find out whether there is any implementation for Lucene that detects duplicates in an index. Assuming we have a set of doc

Re: Detection of index dublicates in Lucene

2007-07-30 Thread Michael Stoppelman

A couple of thoughts here... You could hash (e.g., MD5) all the documents in your index and eliminate duplicates that way. Just pick one of the docs in the hash bucket as the non-dup document and then delete the other dups. This could be run as a batch job to eliminate the duplicates in an off-line p
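The hash-bucket idea in the message above can be sketched roughly as follows. This is only an illustration in plain Python, not Lucene code: `docs` is a hypothetical list of (doc_id, text) pairs standing in for the stored contents of an index, and `find_duplicate_ids` is a made-up helper name. It only catches documents whose text is byte-for-byte identical; near-duplicates would need a different technique.

```python
import hashlib
from collections import defaultdict


def find_duplicate_ids(docs):
    """Return ids of documents whose text exactly matches an earlier document.

    `docs` is a hypothetical iterable of (doc_id, text) pairs; it stands in
    for whatever stored field would be read back from the real index.
    """
    buckets = defaultdict(list)
    for doc_id, text in docs:
        # Documents with identical text land in the same hash bucket.
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        buckets[digest].append(doc_id)

    to_delete = []
    for ids in buckets.values():
        # Keep the first document seen in each bucket; the rest are dups.
        to_delete.extend(ids[1:])
    return to_delete


docs = [
    (1, "the quick brown fox"),
    (2, "a different document"),
    (3, "the quick brown fox"),
]
print(find_duplicate_ids(docs))  # prints [3]
```

The returned ids are the candidates an offline batch job would then delete from the index. Normalizing the text first (for example lowercasing or collapsing whitespace) would widen what counts as a duplicate, at the cost of a looser definition of equality.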

Detection of index dublicates in Lucene

2007-07-28 Thread Dmitry

We are trying to find out whether there is any implementation for Lucene that detects duplicates in an index. Assume we have a set of documents, where a document is a bunch of words. After we have created indexes for the same document, we need to know that all indexes will be unique for a specific document (lexical equivalence).