Hey Lance, LSH is a hashing mechanism: http://en.wikipedia.org/wiki/Locality-sensitive_hashing
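[Editor's note: a minimal, hypothetical sketch of one common flavor of LSH (random hyperplanes), just to illustrate the idea behind the link above. This is not Mahout's implementation; the class name and the toy term-frequency vectors are made up for illustration.]

import java.util.Random;

/**
 * Toy random-hyperplane LSH: documents whose term vectors point in similar
 * directions get nearly identical bit signatures, so candidate duplicates
 * can be found by comparing small signatures instead of full vectors.
 */
public class SimpleLsh {

    private final double[][] hyperplanes;   // one random hyperplane per signature bit

    public SimpleLsh(int numBits, int dimensions, long seed) {
        Random rnd = new Random(seed);
        hyperplanes = new double[numBits][dimensions];
        for (double[] h : hyperplanes) {
            for (int d = 0; d < h.length; d++) {
                h[d] = rnd.nextGaussian();
            }
        }
    }

    /** Signature bit i is set iff the vector lies on the positive side of hyperplane i. */
    public long signature(double[] vector) {
        long sig = 0L;
        for (int i = 0; i < hyperplanes.length; i++) {
            double dot = 0.0;
            for (int d = 0; d < vector.length; d++) {
                dot += hyperplanes[i][d] * vector[d];
            }
            if (dot >= 0) {
                sig |= 1L << i;
            }
        }
        return sig;
    }

    /** Hamming distance between signatures approximates the angle between documents. */
    public static int distance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        SimpleLsh lsh = new SimpleLsh(32, 5, 42L);
        double[] doc1 = {3, 0, 1, 2, 0};     // toy term-frequency vector
        double[] doc2 = {3, 0, 1, 2, 1};     // near-duplicate of doc1
        double[] doc3 = {0, 4, 0, 0, 3};     // unrelated document
        System.out.println(distance(lsh.signature(doc1), lsh.signature(doc2))); // small
        System.out.println(distance(lsh.signature(doc1), lsh.signature(doc3))); // larger
    }
}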
Ted implemented something like this to hash vectors for training SGD Logistic Regression.

Chris

On Jul 28, 2011, at 3:43 PM, Lance Norskog wrote:

> Three different answers, for different levels of one question: how
> similar are these documents?
>
> If they have the same exact bytes, the Solr/Lucene deduplication
> technique will work, and is very fast. (I don't remember if it is a
> Lucene or Solr feature.)
>
> If they have "minor text changes", different metadata, etc., the
> Nutch/Hadoop job may work.
>
> If they are rearranged, plagiarized, etc., the Mahout LSA/LSI tools
> (can't find LSH as an acronym) are the most useful.
>
> Order of execution: the Solr/Lucene deduplication feature can be done
> one document at a time, almost entirely in memory. I don't know about
> the Nutch/Hadoop idea. The LSA/LSI tools very definitely need all (or
> most) of the documents to build a model, then test each document
> against the model. Since this is a numerical comparison, there will be
> a failure rate both ways: false positives and false negatives. False
> positives throw away valid documents.
>
> On 7/28/11, Ted Dunning <[email protected]> wrote:
>> Mahout also has an LSH implementation that can help with this.
>>
>> On Thu, Jul 28, 2011 at 9:37 AM, Ken Krugler
>> <[email protected]> wrote:
>>
>>> On Jul 28, 2011, at 8:49am, Rich Heimann wrote:
>>>
>>>> All,
>>>>
>>>> I am curious whether Lucene and/or Mahout can identify duplicate
>>>> documents? I am having trouble with many redundant docs in my corpus,
>>>> which is causing inflated values and an expense on users to process
>>>> and reprocess much of the material. Can the redundancy be removed or
>>>> managed in some sense by either Lucene at ingestion or Mahout at
>>>> post-processing? The Vector Space Model seems to be notionally similar
>>>> to PCA or Factor Analysis, which both have similar ambitions. Thoughts???
>>>
>>> Nutch has a TextProfileSignature class that creates a hash which is
>>> somewhat resilient to minor text changes between documents.
>>>
>>> Assuming you have such a hash, then it's trivial to use a Hadoop
>>> workflow to remove duplicates.
>>>
>>> Or Solr supports removing duplicates as well - see
>>> http://wiki.apache.org/solr/Deduplication
>>>
>>> -- Ken
>>>
>>> --------------------------
>>> Ken Krugler
>>> +1 530-210-6378
>>> http://bixolabs.com
>>> custom data mining solutions
>>
> --
> Lance Norskog
> [email protected]

Chris Schilling
Sr. Data Mining Engineer
Clever Sense, Inc.
"Curating the World Around You"
--------------------------------------------------------------
Winner of the 2011 Fortune Brainstorm Start-up Idol
Wanna join the Clever Team? We're hiring!
--------------------------------------------------------------
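[Editor's note: Ken's point above is that once every document carries a signature (exact hash, or a fuzzier one such as Nutch's TextProfileSignature), deduplication reduces to a group-by in Hadoop. The sketch below is hypothetical: the class name, the tab-separated "signature \t docId \t text" record layout, and the input/output paths are made up for illustration, and it simply keeps one record per signature.]

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Sketch of a dedup job: group records by a precomputed signature and
 * keep one representative per group.
 */
public class DedupBySignature {

    public static class SigMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumes each input line starts with the document's signature, then a tab.
            String line = value.toString();
            int tab = line.indexOf('\t');
            if (tab > 0) {
                context.write(new Text(line.substring(0, tab)),
                              new Text(line.substring(tab + 1)));
            }
        }
    }

    public static class KeepOneReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text signature, Iterable<Text> docs, Context context)
                throws IOException, InterruptedException {
            // All docs sharing a signature are treated as duplicates; emit only the first.
            for (Text doc : docs) {
                context.write(signature, doc);
                break;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "dedup-by-signature");
        job.setJarByClass(DedupBySignature.class);
        job.setMapperClass(SigMapper.class);
        job.setReducerClass(KeepOneReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}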
