Three different answers, for different levels of one question: how similar are these documents?
If they have exactly the same bytes, the Solr/Lucene deduplication technique will work, and it is very fast. (I don't remember whether it is a Lucene or a Solr feature.) If they have "minor text changes", different metadata, etc., the Nutch/Hadoop job may work. If they are rearranged, plagiarized, etc., the Mahout LSA/LSI tools (can't find LSH as an acronym) are the most useful.

Order of execution: the Solr/Lucene deduplication feature can be done one document at a time, almost entirely in memory. I don't know about the Nutch/Hadoop idea. The LSA/LSI tools very definitely need all (or most) of the documents to build a model, and then test each document against the model. Since this is a numerical comparison, there will be a failure rate, both ways: false positives and false negatives. False positives throw away valid documents; false negatives leave duplicates in the corpus.

On 7/28/11, Ted Dunning <[email protected]> wrote:
> Mahout also has an LSH implementation that can help with this.
>
> On Thu, Jul 28, 2011 at 9:37 AM, Ken Krugler
> <[email protected]> wrote:
>
>>
>> On Jul 28, 2011, at 8:49am, Rich Heimann wrote:
>>
>> > All,
>> >
>> > I am curious if Lucene and/or Mahout can identify duplicate documents? I am
>> > having trouble with many redundant docs in my corpus, which is causing
>> > inflated values and an expense on users to process and reprocess much of the
>> > material. Can the redundancy be removed or managed in some sense my either
>> > Lucene at ingestion or Mahout at post-processing? The Vector Space Model
>> > seems to be notional similar to PCA or Factor Analysis, which both have
>> > similar ambitions. Thoughts???
>>
>> Nutch has a TextProfileSignature class that creates a hash which is
>> somewhat resilient to minor text changes between documents.
>>
>> Assuming you have such a hash, then it's trivial to use a Hadoop workflow
>> to remove duplicates.
>>
>> Or Solr supports removing duplicates as well - see
>> http://wiki.apache.org/solr/Deduplication
>>
>> -- Ken
>>
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> custom data mining solutions
>>
>

-- 
Lance Norskog
[email protected]
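
For the exact-bytes case discussed in this thread, here is a minimal sketch in Python of hash-based deduplication done one document at a time. It is not the Solr/Lucene implementation: the function name dedupe_exact, the (doc_id, content) tuple layout, and the choice of SHA-1 are all assumptions made for illustration. It only shows why the approach is fast and needs little memory, since it keeps one fixed-size digest per distinct document rather than the documents themselves.

```python
import hashlib
from typing import Iterable, Iterator, Tuple


def dedupe_exact(docs: Iterable[Tuple[str, bytes]]) -> Iterator[Tuple[str, bytes]]:
    """Yield only the first document seen for each distinct byte content.

    Hypothetical helper for illustration; not part of Solr, Lucene, or Nutch.
    """
    seen = set()  # one fixed-size digest per distinct document
    for doc_id, content in docs:
        # SHA-1 is an arbitrary choice here; any stable hash of the bytes works.
        digest = hashlib.sha1(content).digest()
        if digest in seen:
            continue  # byte-for-byte duplicate of an earlier document
        seen.add(digest)
        yield doc_id, content


docs = [
    ("a", b"hello world"),
    ("b", b"hello world"),   # exact duplicate of "a"
    ("c", b"hello  world"),  # differs by one extra space, so it is kept
]
kept = [doc_id for doc_id, _ in dedupe_exact(docs)]
print(kept)  # ['a', 'c']
```

Document "c" survives because a hash of the raw bytes treats any change, even a single space, as a different document. That is the gap a signature that tolerates minor text changes, such as the Nutch TextProfileSignature mentioned in the quoted message, is meant to cover.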
