On Jul 28, 2011, at 8:49am, Rich Heimann wrote:

> All,
>
> I am curious if Lucene and/or Mahout can identify duplicate documents? I am
> having trouble with many redundant docs in my corpus, which is causing
> inflated values and burdening users with processing and reprocessing much
> of the material. Can the redundancy be removed or managed in some sense by
> either Lucene at ingestion or Mahout at post-processing? The Vector Space
> Model seems to be notionally similar to PCA or Factor Analysis, both of
> which have similar ambitions. Thoughts???
Nutch has a TextProfileSignature class that creates a hash which is somewhat resilient to minor text changes between documents. Assuming you have such a hash, it's trivial to use a Hadoop workflow to remove duplicates. Solr also supports removing duplicates - see http://wiki.apache.org/solr/Deduplication

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions
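On the Solr side, deduplication is configured in solrconfig.xml via a SignatureUpdateProcessorFactory in an update processor chain - the wiki page above has the details. A sketch along the lines of the wiki example (field names here are placeholders for your own schema):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">true</bool>
    <!-- fields to compute the signature over; replace with your schema's fields -->
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With overwriteDupes set to true, documents that produce the same signature overwrite each other at index time, so the duplicates never reach your users. A fuzzy signature class (TextProfileSignature) can be swapped in for the exact-match one shown here.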
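To make the idea concrete, here is a rough Python sketch of a TextProfileSignature-style fuzzy hash - this is not Nutch's exact implementation, just the general technique it uses: tokenize, count terms, drop rare terms and quantize the remaining counts (so small edits don't change the profile), then hash the sorted profile. The parameter names and thresholds below are illustrative assumptions, not Nutch's defaults.

```python
import hashlib
import re
from collections import Counter

def text_profile_signature(text, min_token_len=2, quant_rate=0.01):
    """Fuzzy fingerprint in the spirit of Nutch's TextProfileSignature:
    near-duplicate documents usually produce the same hash."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [t for t in tokens if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()
    # Quantize frequencies relative to the most frequent term, so that
    # small count changes do not alter the profile.
    quant = max(2, int(counts.most_common(1)[0][1] * quant_rate))
    profile = sorted(
        (tok, (cnt // quant) * quant)
        for tok, cnt in counts.items()
        if cnt >= quant  # drop rare terms, where minor edits show up
    )
    blob = " ".join(f"{tok}:{cnt}" for tok, cnt in profile)
    return hashlib.md5(blob.encode()).hexdigest()
```

Once every document carries such a signature, deduplication reduces to grouping on the signature (e.g. the key of a Hadoop reduce step) and keeping one document per group.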
