On Jul 28, 2011, at 8:49am, Rich Heimann wrote:

> All,
> 
> I am curious whether Lucene and/or Mahout can identify duplicate documents? I am
> having trouble with many redundant docs in my corpus, which is causing
> inflated values and forcing users to process and reprocess much of the
> material. Can the redundancy be removed or managed in some sense by either
> Lucene at ingestion or Mahout at post-processing? The Vector Space Model
> seems to be notionally similar to PCA or Factor Analysis, which both have
> similar ambitions. Thoughts?

Nutch has a TextProfileSignature class that creates a hash that is somewhat 
resilient to minor text changes between documents, so near-duplicates tend to 
collide on the same signature.
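The idea behind that kind of signature can be sketched in a few lines. This is a hedged, simplified illustration of the profile-hash approach, not Nutch's exact algorithm; the function name and the `quant_rate`/`min_token_len` parameters are my own choices:

```python
import hashlib
import re
from collections import Counter

def profile_signature(text, quant_rate=0.01, min_token_len=2):
    """Profile-style signature (inspired by, not identical to, Nutch's
    TextProfileSignature): quantize token frequencies so that small
    wording changes still produce the same hash."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()
    # Quantization step: rare tokens (quantized frequency 0) are dropped,
    # so a few changed words don't change the signature.
    quant = max(int(max(counts.values()) * quant_rate), 2)
    profile = sorted((tok, freq // quant) for tok, freq in counts.items()
                     if freq // quant > 0)
    canonical = " ".join(f"{tok}:{q}" for tok, q in profile)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```

Two documents that differ only in low-frequency tokens hash to the same value, while a genuinely different document hashes differently.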

Assuming you have such a hash, then it's trivial to use a Hadoop workflow to 
remove duplicates.
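The dedup step itself is just a keep-one-per-key grouping; the same logic is what a Hadoop reduce over the signature key would do. A minimal in-memory sketch (the function name and the `signature_fn` parameter are illustrative, not from any library):

```python
def dedupe_by_signature(docs, signature_fn):
    """Keep the first document seen for each signature.

    docs: iterable of (doc_id, text) pairs.
    signature_fn: any function mapping text to a hashable signature.
    """
    seen = {}
    for doc_id, text in docs:
        sig = signature_fn(text)
        # setdefault keeps the first occurrence; later duplicates are dropped.
        seen.setdefault(sig, (doc_id, text))
    return list(seen.values())
```

In a real Hadoop job the map phase would emit (signature, document) pairs and the reducer would keep one document per signature.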

Solr also supports removing duplicates at index time - see 
http://wiki.apache.org/solr/Deduplication
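For reference, the configuration on that wiki page wires a SignatureUpdateProcessorFactory into an update chain in solrconfig.xml, along these lines (the field names here are illustrative; adjust to your schema):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With overwriteDupes=true, a new document whose signature matches an existing one replaces it rather than being indexed alongside it.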

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions
