On 30 Jul 2007, at 14:43, Grant Ingersoll wrote:
I believe Nutch has a duplicate detection algorithm. I don't know
how easy it would be to run independently on a Lucene index.
A number of ideas for near-duplicate detection have also been
presented on the forums before.
This is one of t
I believe Nutch has a duplicate detection algorithm. I don't know
how easy it would be to run independently on a Lucene index.
-Grant
On Jul 29, 2007, at 2:18 AM, Dmitry wrote:
We are trying to find out whether there is any implementation for
Lucene that detects duplicates in an index.
Assuming we have a set of doc
A couple of thoughts here...
You could hash (e.g. with MD5) all the documents in your index and eliminate
duplicates that way. Just pick one of the docs in each hash bucket as
the non-dup document and then delete the other dups. This could be run as a
batch job to eliminate the duplicates in an off-line p
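
To make the hash-bucket idea above concrete, here is a minimal sketch in Python. It is not Lucene or Nutch code: the function name find_exact_duplicates and the in-memory dict of document id to text are assumptions made for illustration, standing in for whatever you would read out of the index. It only catches exact (byte-identical) duplicates, not near-duplicates.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(docs):
    """Group documents by MD5 digest of their text.

    `docs` is a hypothetical mapping of doc id -> document text
    (an assumption for this sketch, not a Lucene structure).
    Returns (keep, delete): one id kept per hash bucket, the rest
    marked for deletion.
    """
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        buckets[digest].append(doc_id)

    keep, delete = [], []
    for ids in buckets.values():
        keep.append(ids[0])      # first doc seen is the non-dup document
        delete.extend(ids[1:])   # everything else in the bucket is a dup
    return keep, delete


if __name__ == "__main__":
    sample = {1: "a bunch of words", 2: "other words", 3: "a bunch of words"}
    keep, delete = find_exact_duplicates(sample)
    print("keep:", keep)
    print("delete:", delete)
```

In a batch job, the ids in the delete list would then be removed from the index by whatever deletion mechanism your setup uses. Documents that differ only in whitespace or case would hash differently here, so some normalization before hashing may be worth considering.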
We are trying to find out whether there is any implementation for Lucene
that detects duplicates in an index.
Assume we have a set of documents, where a document is a bunch of words.
After we have created indexes for the same document, we need to know that all
indexes will be unique for a specific document (lexical equivalence).