Mahout also has an LSH implementation that can help with this.
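
In case it helps to see the shape of the idea, here is a rough, untested
min-hash sketch in plain Java (the class and method names are mine, not
Mahout's actual API). Documents whose signatures largely agree are
near-duplicate candidates; banding the signature (comparing a few positions
at a time) is what turns this into LSH, so you only compare documents that
collide in at least one band.

    import java.util.Arrays;
    import java.util.Random;

    // Illustrative only: min-hash signatures over word 3-shingles.
    public class MinHashSketch {

        private final int numHashes;
        private final int[] seeds;

        public MinHashSketch(int numHashes, long randomSeed) {
            this.numHashes = numHashes;
            this.seeds = new int[numHashes];
            Random rnd = new Random(randomSeed);
            for (int i = 0; i < numHashes; i++) {
                seeds[i] = rnd.nextInt();
            }
        }

        // One min-hash value per hash function, computed over word 3-shingles.
        public int[] signature(String text) {
            String[] tokens = text.toLowerCase().split("\\s+");
            int[] sig = new int[numHashes];
            Arrays.fill(sig, Integer.MAX_VALUE);
            for (int t = 0; t + 3 <= tokens.length; t++) {
                String shingle = tokens[t] + " " + tokens[t + 1] + " " + tokens[t + 2];
                int base = shingle.hashCode();
                for (int i = 0; i < numHashes; i++) {
                    int h = mix(base ^ seeds[i]);
                    if (h < sig[i]) {
                        sig[i] = h;
                    }
                }
            }
            return sig;
        }

        // Fraction of matching positions estimates the Jaccard similarity
        // of the two documents' shingle sets.
        public static double estimatedSimilarity(int[] a, int[] b) {
            int same = 0;
            for (int i = 0; i < a.length; i++) {
                if (a[i] == b[i]) same++;
            }
            return (double) same / a.length;
        }

        // Cheap integer mixer so each seed behaves like a different hash function.
        private static int mix(int x) {
            x ^= (x >>> 16);
            x *= 0x85ebca6b;
            x ^= (x >>> 13);
            return x;
        }
    }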

On Thu, Jul 28, 2011 at 9:37 AM, Ken Krugler <[email protected]> wrote:

>
> On Jul 28, 2011, at 8:49am, Rich Heimann wrote:
>
> > All,
> >
> > I am curious if Lucene and/or Mahout can identify duplicate documents? I
> > am having trouble with many redundant docs in my corpus, which is causing
> > inflated values and placing a burden on users to process and reprocess
> > much of the material. Can the redundancy be removed or managed in some
> > sense by either Lucene at ingestion or Mahout at post-processing? The
> > Vector Space Model seems to be notionally similar to PCA or Factor
> > Analysis, which both have similar ambitions. Thoughts???
>
> Nutch has a TextProfileSignature class that creates a hash which is
> somewhat resilient to minor text changes between documents.
>
> Assuming you have such a hash, it's trivial to use a Hadoop workflow to
> remove duplicates.
>
> Or Solr supports removing duplicates as well - see
> http://wiki.apache.org/solr/Deduplication
>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom data mining solutions
>
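
Following up on the Hadoop suggestion above: once every document carries a
content signature (TextProfileSignature, or the min-hash sketch earlier in
this thread), the dedup step is really just a group-by on that signature,
keeping one doc per group. A rough, untested sketch; the tab-separated
"docId<TAB>signature" input format is just an assumption for illustration.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Illustrative only: keep one document per signature.
    public class SignatureDedupJob {

        // Emit signature -> docId so all copies of a document meet at one reducer.
        public static class SignatureMapper extends Mapper<Object, Text, Text, Text> {
            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] parts = value.toString().split("\t", 2);
                if (parts.length == 2) {
                    context.write(new Text(parts[1]), new Text(parts[0]));
                }
            }
        }

        // Keep the first docId seen for each signature; the rest are duplicates.
        public static class KeepOneReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text signature, Iterable<Text> docIds, Context context)
                    throws IOException, InterruptedException {
                for (Text docId : docIds) {
                    context.write(docId, signature);
                    break;
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "signature-dedup");
            job.setJarByClass(SignatureDedupJob.class);
            job.setMapperClass(SignatureMapper.class);
            job.setReducerClass(KeepOneReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }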
