We also have a minhash implementation of some sort that I don't know much
about.
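
For reference, here is a minimal sketch of the MinHash idea (illustrative
only, not the actual Mahout code): reduce each document to a set of
shingles, keep the minimum hash value per random seed, and the fraction
of slots where two signatures agree estimates the Jaccard similarity of
the two shingle sets.

import java.util.Arrays;
import java.util.Random;
import java.util.Set;

public class MinHashSketch {
    // XOR with a random seed is a crude hash family, but it is
    // enough to illustrate the technique.
    private final int[] seeds;

    public MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        seeds = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            seeds[i] = rnd.nextInt();
        }
    }

    // One minimum hash value per seed, taken over all shingles.
    public int[] signature(Set<String> shingles) {
        int[] sig = new int[seeds.length];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String s : shingles) {
            for (int i = 0; i < seeds.length; i++) {
                sig[i] = Math.min(sig[i], s.hashCode() ^ seeds[i]);
            }
        }
        return sig;
    }

    // Fraction of agreeing slots approximates Jaccard similarity.
    public static double estimateJaccard(int[] a, int[] b) {
        int same = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) same++;
        }
        return (double) same / a.length;
    }
}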

On Thu, Jul 28, 2011 at 4:33 PM, Chris Schilling
<[email protected]> wrote:

> Hey Lance,
>
> LSH is a hashing mechanism:
> http://en.wikipedia.org/wiki/Locality-sensitive_hashing
>
> Ted implemented something like this to hash vectors for training SGD
> Logistic Regression.
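>
> (A toy sketch of the feature-hashing idea, not Mahout's actual encoder:
> each term is hashed straight to an index in a fixed-width vector, so
> training needs no term dictionary. The class name and shape here are
> made up for illustration.)
>
> import java.util.List;
>
> public class HashingEncoder {
>     private final int numFeatures;
>
>     public HashingEncoder(int numFeatures) {
>         this.numFeatures = numFeatures;
>     }
>
>     // Hash each term into [0, numFeatures); collisions simply add
>     // together, which SGD training tolerates in practice.
>     public double[] encode(List<String> terms) {
>         double[] vector = new double[numFeatures];
>         for (String term : terms) {
>             int idx = (term.hashCode() & 0x7fffffff) % numFeatures;
>             vector[idx] += 1.0;
>         }
>         return vector;
>     }
> }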
>
> Chris
>
> On Jul 28, 2011, at 3:43 PM, Lance Norskog wrote:
>
> > Three different answers, for different levels of one question: how
> > similar are these documents?
> >
> > If they have the same exact bytes, the Solr/Lucene deduplication
> > technique will work, and is very fast. (I don't remember if it is a
> > Lucene or Solr feature.)
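> >
> > (The core of the exact-bytes case is just a content digest kept in a
> > set; a toy version, not the actual Solr/Lucene code:)
> >
> > import java.nio.charset.StandardCharsets;
> > import java.security.MessageDigest;
> > import java.util.HashSet;
> > import java.util.Set;
> >
> > public class ExactDedup {
> >     private final Set<String> seen = new HashSet<String>();
> >
> >     // Returns true if an identical document was seen before.
> >     public boolean isDuplicate(String content) throws Exception {
> >         MessageDigest md5 = MessageDigest.getInstance("MD5");
> >         byte[] digest = md5.digest(content.getBytes(StandardCharsets.UTF_8));
> >         StringBuilder hex = new StringBuilder();
> >         for (byte b : digest) {
> >             hex.append(String.format("%02x", b));
> >         }
> >         return !seen.add(hex.toString());
> >     }
> > }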
> >
> > If they have "minor text changes", different metadata etc., the
> > Nutch/Hadoop job may work.
> >
> > If they are rearranged, plagiarized, etc. the Mahout LSA/LSI tools
> > (can't find LSH as an acronym) are the most useful.
> >
> > Order of execution: the Solr/Lucene deduplication feature can be done
> > one document at a time, almost entirely in memory. I don't know about
> > the Nutch/Hadoop idea. The LSA/LSI tools very definitely need all (or
> > most) of the documents to build a model, then test each document
> > against the model. Since this is a numerical comparison, there will be
> > a failure rate both ways: false positives and false negatives. False
> > positives throw away valid documents.
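> >
> > (That numerical test usually comes down to a cosine-similarity
> > threshold on the reduced vectors; where you set the threshold is
> > exactly the false-positive / false-negative trade-off. A toy version:)
> >
> > public class CosineDuplicateTest {
> >     public static double cosine(double[] a, double[] b) {
> >         double dot = 0, normA = 0, normB = 0;
> >         for (int i = 0; i < a.length; i++) {
> >             dot += a[i] * b[i];
> >             normA += a[i] * a[i];
> >             normB += b[i] * b[i];
> >         }
> >         return dot / (Math.sqrt(normA) * Math.sqrt(normB));
> >     }
> >
> >     // A higher threshold means fewer false positives (valid documents
> >     // thrown away) but more false negatives (duplicates kept).
> >     public static boolean isDuplicate(double[] a, double[] b, double threshold) {
> >         return cosine(a, b) >= threshold;
> >     }
> > }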
> >
> >
> >
> > On 7/28/11, Ted Dunning <[email protected]> wrote:
> >> Mahout also has an LSH implementation that can help with this.
> >>
> >> On Thu, Jul 28, 2011 at 9:37 AM, Ken Krugler
> >> <[email protected]> wrote:
> >>
> >>>
> >>> On Jul 28, 2011, at 8:49am, Rich Heimann wrote:
> >>>
> >>>> All,
> >>>>
> >>>> I am curious whether Lucene and/or Mahout can identify duplicate
> >>>> documents. I am having trouble with many redundant docs in my
> >>>> corpus, which is causing inflated values and forcing users to
> >>>> process and reprocess much of the material. Can the redundancy be
> >>>> removed or managed in some sense by either Lucene at ingestion or
> >>>> Mahout at post-processing? The Vector Space Model seems to be
> >>>> notionally similar to PCA or Factor Analysis, which both have
> >>>> similar ambitions. Thoughts?
> >>>
> >>> Nutch has a TextProfileSignature class that creates a hash which is
> >>> somewhat resilient to minor text changes between documents.
> >>>
> >>> Assuming you have such a hash, it's trivial to use a Hadoop workflow
> >>> to remove duplicates.
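> >>>
> >>> (A bare-bones sketch of such a workflow, assuming input lines of the
> >>> form "signature<TAB>docId"; the job groups documents by signature and
> >>> keeps one per group. Illustrative only, not production code.)
> >>>
> >>> import java.io.IOException;
> >>> import org.apache.hadoop.io.LongWritable;
> >>> import org.apache.hadoop.io.Text;
> >>> import org.apache.hadoop.mapreduce.Mapper;
> >>> import org.apache.hadoop.mapreduce.Reducer;
> >>>
> >>> public class SignatureDedup {
> >>>     public static class SigMapper
> >>>             extends Mapper<LongWritable, Text, Text, Text> {
> >>>         @Override
> >>>         protected void map(LongWritable key, Text value, Context ctx)
> >>>                 throws IOException, InterruptedException {
> >>>             String[] parts = value.toString().split("\t", 2);
> >>>             if (parts.length == 2) {
> >>>                 ctx.write(new Text(parts[0]), new Text(parts[1]));
> >>>             }
> >>>         }
> >>>     }
> >>>
> >>>     public static class KeepOneReducer
> >>>             extends Reducer<Text, Text, Text, Text> {
> >>>         @Override
> >>>         protected void reduce(Text sig, Iterable<Text> docIds, Context ctx)
> >>>                 throws IOException, InterruptedException {
> >>>             // Keep the first document seen per signature, drop the rest.
> >>>             for (Text docId : docIds) {
> >>>                 ctx.write(sig, docId);
> >>>                 break;
> >>>             }
> >>>         }
> >>>     }
> >>> }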
> >>>
> >>> Or Solr supports removing duplicates as well - see
> >>> http://wiki.apache.org/solr/Deduplication
> >>>
> >>> -- Ken
> >>>
> >>> --------------------------
> >>> Ken Krugler
> >>> +1 530-210-6378
> >>> http://bixolabs.com
> >>> custom data mining solutions
> >>>
> >>
> >
> >
> > --
> > Lance Norskog
> > [email protected]
>
> Chris Schilling
> Sr. Data Mining Engineer
> Clever Sense, Inc.
> "Curating the World Around You"
> --------------------------------------------------------------
> Winner of the 2011 Fortune Brainstorm Start-up Idol
>
> Wanna join the Clever Team? We're hiring!
> --------------------------------------------------------------
>
>
