Depends on what kind of deduplication you're trying to do. Do you want exact-dup detection, or near-dup detection?
For exact dups you don't need Mahout. Just run each doc through a mapper that computes an MD5 hash of the doc and emits the hash value as the mapper key and the doc id as the mapper value. The reducer will then pull together all the documents that share the same MD5 hash.

If you want to do near-dup analysis, you can go with n-gram shingling. I don't think there is anything built into Mahout that does this, but you can use Mahout's n-gram generation and specify a very low Log Likelihood score so that most/all of the n-grams get emitted. Then use that n-gram data in your shingling algorithm. There are several well-known shingling algorithms out there; just google them and implement one.

On Mon, Apr 15, 2013 at 6:38 AM, xdcfff <[email protected]> wrote:
> Hi all,
>
> Just looking for some general guidance on how I would approach this task.
>
> If I have two datasets containing items, what is currently the best way to
> detect duplicates between them using Mahout? I intend on matching based on
> item name text similarity to begin with.
>
> I'm willing to write Java wherever necessary, but I just want to be sure to
> avoid "re-coding the wheel" as such.
>
> Cheers,
> -dcf

--
Thanks,
John C
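P.S. The exact-dup logic above can be sketched without any Hadoop boilerplate. This is a minimal, hypothetical illustration of the map step (MD5 of the doc as key) and the reduce step (group doc ids by hash), not actual Mapper/Reducer API code; class and method names are made up for the example:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExactDupFinder {

    // "Map" step: the MD5 hash of the document text becomes the key.
    static String md5(String doc) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(doc.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    // "Reduce" step: pull together all doc ids that share the same hash.
    static Map<String, List<String>> groupByHash(Map<String, String> docsById) throws Exception {
        Map<String, List<String>> groups = new HashMap<>();
        for (Map.Entry<String, String> e : docsById.entrySet()) {
            groups.computeIfAbsent(md5(e.getValue()), k -> new ArrayList<>()).add(e.getKey());
        }
        return groups;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc1", "the quick brown fox");
        docs.put("doc2", "the quick brown fox");      // exact duplicate of doc1
        docs.put("doc3", "a completely different doc");
        for (List<String> ids : groupByHash(docs).values()) {
            if (ids.size() > 1) System.out.println("duplicates: " + ids);
        }
    }
}
```

In the real job the groupByHash loop disappears: you emit (md5, docId) from the mapper and Hadoop's shuffle does the grouping for you, so any reducer that receives more than one value has found a duplicate set.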
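And for the near-dup side, here is a rough sketch of one of the well-known shingling approaches (w-shingling plus Jaccard similarity). The tokenization and names here are invented for the example; in practice you would feed in the n-grams Mahout generates rather than splitting on whitespace inline:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ShingleSimilarity {

    // Build the set of word-level k-shingles (k consecutive tokens) for a document.
    static Set<String> shingles(String doc, int k) {
        String[] tokens = doc.toLowerCase().split("\\s+");
        Set<String> out = new HashSet<>();
        for (int i = 0; i + k <= tokens.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + k)));
        }
        return out;
    }

    // Jaccard similarity of two shingle sets: |A intersect B| / |A union B|.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> a = shingles("the quick brown fox jumps over the lazy dog", 3);
        Set<String> b = shingles("the quick brown fox leaps over the lazy dog", 3);
        // Near-duplicate pairs score high; flag anything above some threshold, e.g. 0.5.
        System.out.printf("similarity = %.2f%n", jaccard(a, b));
    }
}
```

Comparing every pair this way is quadratic, so at scale the usual move is to MinHash the shingle sets so that candidate pairs can be found in a single grouping pass, much like the MD5 job above.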
