Hi John,

Yeah, I'm going for near-duplicate detection. Thanks for your advice; I'll
look into those algorithms and give it a go!

Cheers,
-dcf


On Tue, Apr 16, 2013 at 3:52 AM, John Conwell <[email protected]> wrote:

> Depends on what kind of deduplication you're trying to do.  Do you want exact
> dup detection?  Or near-dup detection?
>
> For exact dups you don't need Mahout.  Just run each doc through a mapper
> that computes an MD5 hash of the doc and emits the MD5 hash value as the
> mapper key and the doc id as the mapper value.  The reducer will then
> receive together all the documents that have the same MD5 hash value.
>
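To make the exact-dup idea concrete, here is a minimal single-process sketch in plain Java. It only imitates the mapper and reducer steps with an in-memory map and does not use the Hadoop or Mahout APIs; the class and method names (ExactDuplicates, md5Hex, duplicateGroups) and the sample item names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: in-memory stand-in for the map/reduce job described above.
class ExactDuplicates {

    // "Map" step: hash the document text. The hex digest plays the role of the mapper key.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // "Reduce" step: collect the doc ids that share a hash and keep only
    // the groups with more than one member, i.e. the exact duplicates.
    static Map<String, List<String>> duplicateGroups(Map<String, String> docsById) {
        Map<String, List<String>> byHash = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : docsById.entrySet()) {
            byHash.computeIfAbsent(md5Hex(e.getValue()), k -> new ArrayList<>())
                  .add(e.getKey());
        }
        byHash.values().removeIf(ids -> ids.size() < 2);
        return byHash;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a1", "red widget");
        docs.put("b7", "red widget");
        docs.put("b9", "blue widget");
        System.out.println(duplicateGroups(docs).values()); // prints [[a1, b7]]
    }
}
```

Note that this only catches byte-for-byte identical text; a difference in case or whitespace produces a different hash, so any normalisation would have to happen before hashing.
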
> If you want to do a near-dup analysis, you can go with an n-gram shingling
> analysis.  I don't think there is anything built into Mahout that does this,
> but you can use Mahout's n-gram generation and specify a very low
> log-likelihood score so most/all of the n-grams get emitted.  Then use this
> n-gram data in your shingling algorithm.  There are several known shingling
> algorithms out there; just google them and implement one.
>
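As a rough illustration of what a shingling comparison can look like, here is a small sketch in plain Java. It is only one of the possible approaches: it uses character n-grams and Jaccard similarity, does not touch Mahout's n-gram generation, and the names (ShingleSimilarity, shingles, jaccard, nearDuplicate) as well as the shingle size and threshold are assumptions chosen for the example.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: character-shingle comparison of two short strings.
class ShingleSimilarity {

    // Break text into overlapping character n-grams ("shingles")
    // after lowercasing and collapsing runs of whitespace.
    static Set<String> shingles(String text, int n) {
        String norm = text.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
        Set<String> out = new HashSet<>();
        if (norm.length() <= n) {
            out.add(norm); // too short for more than one shingle: use the whole string
            return out;
        }
        for (int i = 0; i + n <= norm.length(); i++) {
            out.add(norm.substring(i, i + n));
        }
        return out;
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        int union = a.size() + b.size() - inter.size();
        return (double) inter.size() / union;
    }

    // Two names count as near duplicates when their shingle sets overlap enough.
    static boolean nearDuplicate(String x, String y, int n, double threshold) {
        return jaccard(shingles(x, n), shingles(y, n)) >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(jaccard(shingles("abcd", 3), shingles("bcde", 3)));
        System.out.println(nearDuplicate("Red  Widget", "red widget", 3, 0.8));
    }
}
```

Comparing every pair of items this way grows quadratically with the number of items, so for two large datasets the pairwise step would likely need to be replaced by something that narrows down candidate pairs first.
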
>
>
>
>
> On Mon, Apr 15, 2013 at 6:38 AM, xdcfff <[email protected]> wrote:
>
> > Hi all,
> >
> > Just looking for some general guidance on how I would approach this task.
> >
> > If I have two datasets containing items, what is currently the best way to
> > detect duplicates between them using Mahout? I intend to match based on
> > item name text similarity to begin with.
> >
> > I'm willing to write Java wherever necessary, but I just want to be sure to
> > avoid "re-coding the wheel" as such.
> >
> > Cheers,
> > -dcf
> >
>
>
>
> --
>
> Thanks,
> John C
>
