Hey Lance, LSH is a hashing mechanism: http://en.wikipedia.org/wiki/Locality-sensitive_hashing
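[Editor's note: a minimal, hypothetical sketch of one common flavor of LSH (random hyperplanes), just to illustrate the idea behind the link above. This is not Mahout's implementation; the class name and the toy term-frequency vectors are made up for illustration.]

import java.util.Random;

/**
 * Toy random-hyperplane LSH: documents whose term vectors point in similar
 * directions get nearly identical bit signatures, so candidate duplicates
 * can be found by comparing small signatures instead of full vectors.
 */
public class SimpleLsh {

    private final double[][] hyperplanes;   // one random hyperplane per signature bit

    public SimpleLsh(int numBits, int dimensions, long seed) {
        Random rnd = new Random(seed);
        hyperplanes = new double[numBits][dimensions];
        for (double[] h : hyperplanes) {
            for (int d = 0; d < h.length; d++) {
                h[d] = rnd.nextGaussian();
            }
        }
    }

    /** Signature bit i is set iff the vector lies on the positive side of hyperplane i. */
    public long signature(double[] vector) {
        long sig = 0L;
        for (int i = 0; i < hyperplanes.length; i++) {
            double dot = 0.0;
            for (int d = 0; d < vector.length; d++) {
                dot += hyperplanes[i][d] * vector[d];
            }
            if (dot >= 0) {
                sig |= 1L << i;
            }
        }
        return sig;
    }

    /** Hamming distance between signatures approximates the angle between documents. */
    public static int distance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        SimpleLsh lsh = new SimpleLsh(32, 5, 42L);
        double[] doc1 = {3, 0, 1, 2, 0};     // toy term-frequency vector
        double[] doc2 = {3, 0, 1, 2, 1};     // near-duplicate of doc1
        double[] doc3 = {0, 4, 0, 0, 3};     // unrelated document
        System.out.println(distance(lsh.signature(doc1), lsh.signature(doc2))); // small
        System.out.println(distance(lsh.signature(doc1), lsh.signature(doc3))); // larger
    }
}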
Ted implemented something like this to hash vectors for training SGD Logistic Regression.

Chris

On Jul 28, 2011, at 3:43 PM, Lance Norskog wrote:

> Three different answers, for different levels of one question: how
> similar are these documents?
>
> If they have the same exact bytes, the Solr/Lucene deduplication
> technique will work, and is very fast. (I don't remember if it is a
> Lucene or Solr feature.)
>
> If they have "minor text changes", different metadata, etc., the
> Nutch/Hadoop job may work.
>
> If they are rearranged, plagiarized, etc., the Mahout LSA/LSI tools
> (can't find LSH as an acronym) are the most useful.
>
> Order of execution: the Solr/Lucene deduplication feature can be done
> one document at a time, almost entirely in memory. I don't know about
> the Nutch/Hadoop idea. The LSA/LSI tools very definitely need all (or
> most) of the documents to build a model, then test each document
> against the model. Since this is a numerical comparison, there will be
> a failure rate both ways: false positives and false negatives. False
> positives throw away valid documents.
>
> On 7/28/11, Ted Dunning <[email protected]> wrote:
>> Mahout also has an LSH implementation that can help with this.
>>
>> On Thu, Jul 28, 2011 at 9:37 AM, Ken Krugler
>> <[email protected]> wrote:
>>
>>> On Jul 28, 2011, at 8:49am, Rich Heimann wrote:
>>>
>>>> All,
>>>>
>>>> I am curious whether Lucene and/or Mahout can identify duplicate
>>>> documents? I am having trouble with many redundant docs in my corpus,
>>>> which is causing inflated values and an expense on users to process
>>>> and reprocess much of the material. Can the redundancy be removed or
>>>> managed in some sense by either Lucene at ingestion or Mahout at
>>>> post-processing? The Vector Space Model seems to be notionally similar
>>>> to PCA or Factor Analysis, which both have similar ambitions. Thoughts???
>>>
>>> Nutch has a TextProfileSignature class that creates a hash which is
>>> somewhat resilient to minor text changes between documents.
>>>
>>> Assuming you have such a hash, then it's trivial to use a Hadoop
>>> workflow to remove duplicates.
>>>
>>> Or Solr supports removing duplicates as well - see
>>> http://wiki.apache.org/solr/Deduplication
>>>
>>> -- Ken
>>>
>>> --------------------------
>>> Ken Krugler
>>> +1 530-210-6378
>>> http://bixolabs.com
>>> custom data mining solutions
>>
> --
> Lance Norskog
> [email protected]

Chris Schilling
Sr. Data Mining Engineer
Clever Sense, Inc.
"Curating the World Around You"
--------------------------------------------------------------
Winner of the 2011 Fortune Brainstorm Start-up Idol
Wanna join the Clever Team? We're hiring!
--------------------------------------------------------------
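[Editor's note: Ken's point above is that once every document carries a signature (exact hash, or a fuzzier one such as Nutch's TextProfileSignature), deduplication reduces to a group-by in Hadoop. The sketch below is hypothetical: the class name, the tab-separated "signature \t docId \t text" record layout, and the input/output paths are made up for illustration, and it simply keeps one record per signature.]

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Sketch of a dedup job: group records by a precomputed signature and
 * keep one representative per group.
 */
public class DedupBySignature {

    public static class SigMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumes each input line starts with the document's signature, then a tab.
            String line = value.toString();
            int tab = line.indexOf('\t');
            if (tab > 0) {
                context.write(new Text(line.substring(0, tab)),
                              new Text(line.substring(tab + 1)));
            }
        }
    }

    public static class KeepOneReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text signature, Iterable<Text> docs, Context context)
                throws IOException, InterruptedException {
            // All docs sharing a signature are treated as duplicates; emit only the first.
            for (Text doc : docs) {
                context.write(signature, doc);
                break;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "dedup-by-signature");
        job.setJarByClass(DedupBySignature.class);
        job.setMapperClass(SigMapper.class);
        job.setReducerClass(KeepOneReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}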
