Mahout also has an LSH implementation that can help with this.

On Thu, Jul 28, 2011 at 9:37 AM, Ken Krugler <[email protected]> wrote:
> On Jul 28, 2011, at 8:49am, Rich Heimann wrote:
>
> > All,
> >
> > I am curious whether Lucene and/or Mahout can identify duplicate
> > documents. I am having trouble with many redundant docs in my corpus,
> > which is causing inflated values and forcing users to process and
> > reprocess much of the material. Can the redundancy be removed or
> > managed in some sense by either Lucene at ingestion or Mahout at
> > post-processing? The Vector Space Model seems notionally similar to
> > PCA or Factor Analysis, which both have similar ambitions. Thoughts?
>
> Nutch has a TextProfileSignature class that creates a hash which is
> somewhat resilient to minor text changes between documents.
>
> Assuming you have such a hash, it's trivial to use a Hadoop workflow
> to remove duplicates.
>
> Solr also supports removing duplicates - see
> http://wiki.apache.org/solr/Deduplication
>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom data mining solutions
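To make the TextProfileSignature idea concrete: rather than hashing the raw bytes, you hash a quantized term-frequency profile, so small edits and rare terms don't change the signature. Below is a minimal sketch of that idea, not Nutch's actual algorithm - the tokenization, quantization constants, and minimum token length here are all made up for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

public class TextProfileSketch {

    // Hash a document's quantized "term profile" rather than its raw bytes,
    // so that minor edits and low-frequency terms don't change the signature.
    public static String signature(String text) {
        try {
            Map<String, Integer> freq = new HashMap<>();
            for (String tok : text.toLowerCase().split("\\W+")) {
                if (tok.length() < 3) continue;       // ignore very short tokens
                freq.merge(tok, 1, Integer::sum);
            }
            int maxFreq = freq.values().stream().max(Integer::compare).orElse(1);
            int quant = Math.max(2, maxFreq / 8);     // coarser buckets for longer docs

            // Sorted for stability; terms whose frequency quantizes to zero are dropped.
            List<String> profile = new ArrayList<>();
            for (Map.Entry<String, Integer> e : new TreeMap<>(freq).entrySet()) {
                int q = (e.getValue() / quant) * quant;
                if (q > 0) profile.add(e.getKey() + ":" + q);
            }

            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(String.join(" ", profile).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Two documents that differ only in rare terms end up with the same profile string, hence the same hash, which is exactly the "resilient to minor text changes" property Ken describes.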
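Once each document carries such a signature, the dedup step Ken mentions is just "group by signature, keep one document per group" - which is all a Hadoop reducer keyed on the signature would do. The same logic in plain Java (doc IDs and signature strings here are illustrative):

```java
import java.util.*;

public class DedupSketch {

    // Given docId -> signature, keep only the first document seen per signature.
    // This is the per-key logic a Hadoop reducer would apply at scale.
    public static List<String> dedup(Map<String, String> docIdToSignature) {
        Map<String, String> firstPerSig = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : docIdToSignature.entrySet()) {
            firstPerSig.putIfAbsent(e.getValue(), e.getKey()); // keep first doc per signature
        }
        return new ArrayList<>(firstPerSig.values());
    }
}
```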
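The Solr route is configured in solrconfig.xml via SignatureUpdateProcessorFactory, as described on the Deduplication wiki page linked above. Roughly like the following - the field names are illustrative and need to match your schema:

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <!-- fields to compute the signature over; adjust to your schema -->
    <str name="fields">title,body</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With overwriteDupes=true, documents producing an existing signature overwrite the earlier copy at index time, so the dedup happens at ingestion rather than post-processing.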
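On the LSH suggestion at the top of the thread: minhash is one common scheme behind LSH for near-duplicate detection. The sketch below is not Mahout's implementation, just the core idea - documents with high Jaccard overlap of their token sets agree in many signature slots, so hashing bands of the signature buckets near-duplicates together (the hash family and K=16 here are made up for illustration):

```java
import java.util.*;

public class MinHashSketch {
    static final int K = 16;  // number of hash functions = signature length

    // Minhash signature: slot i holds the minimum of hash_i over all tokens.
    public static int[] signature(Set<String> tokens) {
        int[] sig = new int[K];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String t : tokens) {
            int h = t.hashCode();
            for (int i = 0; i < K; i++) {
                // cheap family of K hash functions derived from one base hash
                int hi = h * (2 * i + 1) + i * 0x9E3779B9;
                if (hi < sig[i]) sig[i] = hi;
            }
        }
        return sig;
    }
}
```

Note this only produces the signatures; the "locality-sensitive" part comes from bucketing documents by bands of the signature so that only likely near-duplicates are compared pairwise.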
