Hi,

I would like to sense check an approach to near-duplicate detection of
documents using Mahout.  After some basic research I've implemented a
basic proof which works effectively on a small corpus.

I have taken the following pre-processing steps:

1) Parse the document
2) Remove unnecessary tokens
3) Split by sentence
4) Create w-shingles from sentence tokens
5) Hash shingles
6) Minhash hashes
7) Jaccard Similarity adjusting for number of hash functions used in minhash

In order to scale this I will be doing the following:

1) Use M/R for all steps
2) Avoid adding exact duplicate documents to similarity matrix
3) Constructing an (additional) LSH matrix (threshold >=0.2) splitting
into buckets
4) Split the similarity job by blocks of document keys for each mapper
5) Every document in the minhash matrix gets submitted to every mapper
6) Each mapper queries the LSH matrix to look for candidates for
matching against
7) Each mapper matches against candidates in it's block and writes out
a key (docid) and a vector of all similar documents ({docid, score})
8) The reducer then combines the results from each mapper into the
final similarity matrix

I've only really used Mahout so far for doing the minhash stuff but
would like and can't find an LSH implementation.  To avoid
re-inventing the wheel I was looking for general pointers as to the
efficacy of my approach in the first instance and then any guidance on
how best to implement using the rest of mahout.

I've gone through MIA and felt the the rowsimilarityjob was a
possibility, however I understand that a JIRA has been raised to make
this potentially less general and in it's current form it may not
match my performance/cost criteria (i.e. high/low).

Any help is greatly appreciated.

Thanks in advance.

Niall

Reply via email to