On Jul 28, 2011, at 8:49am, Rich Heimann wrote:

> All,
> 
> I am curious whether Lucene and/or Mahout can identify duplicate documents? I am
> having trouble with many redundant docs in my corpus, which is causing
> inflated values and forcing users to process and reprocess much of the
> material. Can the redundancy be removed or managed in some sense by either
> Lucene at ingestion or Mahout at post-processing? The Vector Space Model
> seems to be notionally similar to PCA or Factor Analysis, which both have
> similar ambitions. Thoughts?

Nutch has a TextProfileSignature class that creates a hash that is somewhat 
resilient to minor text changes between documents, so near-duplicates tend to 
collide on the same signature.
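The idea behind that kind of signature can be sketched in a few lines. This is a hedged, simplified illustration of the profile-hash approach, not Nutch's exact algorithm; the function name and the `quant_rate`/`min_token_len` parameters are my own choices:

```python
import hashlib
import re
from collections import Counter

def profile_signature(text, quant_rate=0.01, min_token_len=2):
    """Profile-style signature (inspired by, not identical to, Nutch's
    TextProfileSignature): quantize token frequencies so that small
    wording changes still produce the same hash."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()
    # Quantization step: rare tokens (quantized frequency 0) are dropped,
    # so a few changed words don't change the signature.
    quant = max(int(max(counts.values()) * quant_rate), 2)
    profile = sorted((tok, freq // quant) for tok, freq in counts.items()
                     if freq // quant > 0)
    canonical = " ".join(f"{tok}:{q}" for tok, q in profile)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```

Two documents that differ only in low-frequency tokens hash to the same value, while a genuinely different document hashes differently.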

Assuming you have such a hash, then it's trivial to use a Hadoop workflow to 
remove duplicates.
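The dedup step itself is just a keep-one-per-key grouping; the same logic is what a Hadoop reduce over the signature key would do. A minimal in-memory sketch (the function name and the `signature_fn` parameter are illustrative, not from any library):

```python
def dedupe_by_signature(docs, signature_fn):
    """Keep the first document seen for each signature.

    docs: iterable of (doc_id, text) pairs.
    signature_fn: any function mapping text to a hashable signature.
    """
    seen = {}
    for doc_id, text in docs:
        sig = signature_fn(text)
        # setdefault keeps the first occurrence; later duplicates are dropped.
        seen.setdefault(sig, (doc_id, text))
    return list(seen.values())
```

In a real Hadoop job the map phase would emit (signature, document) pairs and the reducer would keep one document per signature.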

Solr also supports removing duplicates at index time - see 
http://wiki.apache.org/solr/Deduplication
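For reference, the configuration on that wiki page wires a SignatureUpdateProcessorFactory into an update chain in solrconfig.xml, along these lines (the field names here are illustrative; adjust to your schema):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With overwriteDupes=true, a new document whose signature matches an existing one replaces it rather than being indexed alongside it.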

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions
