We also have a minhash implementation of some sort that I don't know much about.
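The general minhash idea is easy enough to sketch, though. This is only a rough illustration of the technique under my own assumptions (word-trigram shingles, ad-hoc hash mixing), not the actual Mahout code:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Random;
    import java.util.Set;

    // Toy minhash: documents sharing most of their word trigrams get signatures
    // that agree in most positions, so near-duplicates can be spotted cheaply.
    public class MinHashSketch {

        private final int[] multipliers;
        private final int[] offsets;

        public MinHashSketch(int numHashes, long seed) {
            Random rng = new Random(seed);
            multipliers = new int[numHashes];
            offsets = new int[numHashes];
            for (int i = 0; i < numHashes; i++) {
                multipliers[i] = rng.nextInt(Integer.MAX_VALUE - 1) + 1; // non-zero
                offsets[i] = rng.nextInt(Integer.MAX_VALUE);
            }
        }

        // Minimum hash value per hash function over the document's trigram shingles.
        public int[] signature(String text) {
            Set<String> shingles = new HashSet<String>();
            String[] words = text.toLowerCase().split("\\s+");
            for (int i = 0; i + 2 < words.length; i++) {
                shingles.add(words[i] + " " + words[i + 1] + " " + words[i + 2]);
            }
            int[] sig = new int[multipliers.length];
            Arrays.fill(sig, Integer.MAX_VALUE);
            for (String shingle : shingles) {
                int h = shingle.hashCode();
                for (int i = 0; i < sig.length; i++) {
                    int mixed = multipliers[i] * h + offsets[i]; // cheap re-hash per function
                    if (mixed < sig[i]) {
                        sig[i] = mixed;
                    }
                }
            }
            return sig;
        }

        // Fraction of matching positions estimates the Jaccard similarity of the shingle sets.
        public static double estimatedSimilarity(int[] a, int[] b) {
            int same = 0;
            for (int i = 0; i < a.length; i++) {
                if (a[i] == b[i]) same++;
            }
            return same / (double) a.length;
        }
    }

Two documents whose signatures agree in most positions share most of their shingles with high probability; LSH takes this one step further by banding the signature so candidate pairs can be found without comparing every pair of documents.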
On Thu, Jul 28, 2011 at 4:33 PM, Chris Schilling <[email protected]> wrote:

> Hey Lance,
>
> LSH is a hashing mechanism:
> http://en.wikipedia.org/wiki/Locality-sensitive_hashing
>
> Ted implemented something like this to hash vectors for training SGD
> Logistic Regression.
>
> Chris
>
> On Jul 28, 2011, at 3:43 PM, Lance Norskog wrote:
>
> > Three different answers, for different levels of one question: how
> > similar are these documents?
> >
> > If they have exactly the same bytes, the Solr/Lucene deduplication
> > technique will work, and is very fast. (I don't remember if it is a
> > Lucene or Solr feature.)
> >
> > If they have "minor text changes", different metadata, etc., the
> > Nutch/Hadoop job may work.
> >
> > If they are rearranged, plagiarized, etc., the Mahout LSA/LSI tools
> > (can't find LSH as an acronym) are the most useful.
> >
> > Order of execution: the Solr/Lucene deduplication feature can be done
> > one document at a time, almost entirely in memory. I don't know about
> > the Nutch/Hadoop idea. The LSA/LSI tools very definitely need all (or
> > most) of the documents to build a model, then test each document
> > against the model. Since this is a numerical comparison, there will be
> > a failure rate both ways: false positives and false negatives. False
> > positives throw away valid documents.
> >
> > On 7/28/11, Ted Dunning <[email protected]> wrote:
> >> Mahout also has an LSH implementation that can help with this.
> >>
> >> On Thu, Jul 28, 2011 at 9:37 AM, Ken Krugler <[email protected]> wrote:
> >>
> >>> On Jul 28, 2011, at 8:49am, Rich Heimann wrote:
> >>>
> >>>> All,
> >>>>
> >>>> I am curious whether Lucene and/or Mahout can identify duplicate
> >>>> documents? I am having trouble with many redundant docs in my corpus,
> >>>> which is causing inflated values and forcing users to process and
> >>>> reprocess much of the material. Can the redundancy be removed or
> >>>> managed in some sense by either Lucene at ingestion or Mahout at
> >>>> post-processing? The Vector Space Model seems to be notionally similar
> >>>> to PCA or Factor Analysis, which both have similar ambitions. Thoughts?
> >>>
> >>> Nutch has a TextProfileSignature class that creates a hash which is
> >>> somewhat resilient to minor text changes between documents.
> >>>
> >>> Assuming you have such a hash, then it's trivial to use a Hadoop
> >>> workflow to remove duplicates.
> >>>
> >>> Or Solr supports removing duplicates as well - see
> >>> http://wiki.apache.org/solr/Deduplication
> >>>
> >>> -- Ken
> >>>
> >>> --------------------------
> >>> Ken Krugler
> >>> +1 530-210-6378
> >>> http://bixolabs.com
> >>> custom data mining solutions
> >
> > --
> > Lance Norskog
> > [email protected]
>
> Chris Schilling
> Sr. Data Mining Engineer
> Clever Sense, Inc.
> "Curating the World Around You"
> --------------------------------------------------------------
> Winner of the 2011 Fortune Brainstorm Start-up Idol
>
> Wanna join the Clever Team? We're hiring!
> --------------------------------------------------------------
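
On Ken's point about the hash-plus-Hadoop route: once every document carries a signature, the dedup step is just a group-by on that signature. A single-machine sketch of the same logic a reducer would apply (the map layout and names here are made up for illustration):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Keep the first document seen for each signature; report everything else as a duplicate.
    public class SignatureDedup {

        public static List<String> findDuplicates(Map<String, String> signatureByDocId) {
            Map<String, String> keeperBySignature = new HashMap<String, String>();
            List<String> duplicates = new ArrayList<String>();
            for (Map.Entry<String, String> entry : signatureByDocId.entrySet()) {
                String docId = entry.getKey();
                String signature = entry.getValue();
                if (keeperBySignature.containsKey(signature)) {
                    duplicates.add(docId); // another doc with this signature was already kept
                } else {
                    keeperBySignature.put(signature, docId);
                }
            }
            return duplicates;
        }
    }

In a Hadoop job the signature would be the map output key, so each reducer call sees all documents with the same signature and keeps one of them.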
