Hi,
My quick-and-dirty, non-optimized solution would be as follows:
MAPPER
=======
OUTPUT from Mapper
<Key = Sorted List {HASH1,HASH2,HASH3,HASH4}> <Value = DOCID1~HASH1 HASH2 HASH3 HASH4>
<Key = Sorted List {HASH1,HASH2,HASH3,HASH4}> <Value = DOCID2~HASH5 HASH3 HASH1 HASH4>
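As a rough illustration of the mapper step, here is a minimal sketch in plain Python rather than the Hadoop API. The function name `map_line` and the comma-joined key format are assumptions made for this example; the key is the sorted hash list and the value is the doc ID, a `~`, and the hashes in their original order.

```python
def map_line(line):
    """Turn one input line 'DOCID HASH HASH ...' into a (key, value) pair.

    Hypothetical helper: the key is the sorted hash list joined by commas,
    the value is 'DOCID~' followed by the hashes as they appeared.
    """
    doc_id, *hashes = line.split()
    key = ",".join(sorted(hashes))
    value = doc_id + "~" + " ".join(hashes)
    return key, value


print(map_line("DOCID1 HASH1 HASH2 HASH3 HASH4"))
```

Note that two lines map to the same key only when they contain exactly the same hashes, regardless of order.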
REDUCER
========
Iterate over the keys.
For a key, say {HASH1,HASH2,HASH3,HASH4}:
Concatenate the collection of values using some StringBuilder kind of class.
Output
KEY = {DOCID1 DOCID2} value = null
KEY = {DOCID3 DOCID5} value = null
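The mapper and reducer outlined above could be simulated locally along these lines. This is only a sketch in plain Python, not Hadoop code: the names `map_line`, `reduce_group`, and `run` are made up, the mapper value is simplified to just the doc ID, the sample lines are invented, and skipping keys that have a single document is my assumption.

```python
from collections import defaultdict


def map_line(line):
    # Key: sorted hash list; value: the doc ID (a simplification of DOCID~hashes).
    doc_id, *hashes = line.split()
    return ",".join(sorted(hashes)), doc_id


def reduce_group(key, doc_ids):
    # Join all doc IDs that arrived under the same key; the value stays null.
    return " ".join(sorted(doc_ids)), None


def run(lines):
    # Stand-in for the shuffle phase: group mapper values by key.
    groups = defaultdict(list)
    for line in lines:
        key, doc_id = map_line(line)
        groups[key].append(doc_id)
    # Assumption: only keys with two or more documents produce a record.
    return [reduce_group(k, v) for k, v in sorted(groups.items()) if len(v) > 1]


# Made-up sample lines, not the data from the original question.
sample = [
    "DOCID1 HASH1 HASH2 HASH3 HASH4",
    "DOCID2 HASH4 HASH3 HASH2 HASH1",
    "DOCID3 HASH7 HASH8",
]
print(run(sample))
```

Because the key is the full sorted hash list, this groups only documents whose hash sets are identical; documents that share some but not all hashes end up under different keys.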
Hope I have understood your problem correctly. If not, sorry about that.
sanjay
From: parnab kumar <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, June 14, 2013 7:06 AM
To: "[email protected]" <[email protected]>
Subject: How to design the mapper and reducer for the following problem
I have an input file in which each line corresponds to a document. Each document is
identified by some fingerprints. For example, a line in the input file
has the following form:
input:
---------------------
DOCID1 HASH1 HASH2 HASH3 HASH4
DOCID2 HASH5 HASH3 HASH1 HASH4
The output of the MapReduce job should be the pairs of DOCIDs that have a
threshold number of hashes in common.
output:
--------------------------
DOCID1 DOCID2
DOCID3 DOCID5
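To make the requirement concrete, here is a small non-distributed Python sketch that compares every pair of documents directly. It illustrates the expected result only and is not a MapReduce design. The function name `similar_pairs`, the threshold value of 3, and the reading of "a threshold number" as "at least that many" are all assumptions, since the question does not specify them.

```python
from itertools import combinations


def similar_pairs(lines, threshold):
    # Parse each line into a doc ID and its set of hashes.
    docs = {}
    for line in lines:
        doc_id, *hashes = line.split()
        docs[doc_id] = set(hashes)
    # Keep every pair whose hash sets overlap in at least `threshold` hashes.
    return [
        (a, b)
        for a, b in combinations(sorted(docs), 2)
        if len(docs[a] & docs[b]) >= threshold
    ]


lines = [
    "DOCID1 HASH1 HASH2 HASH3 HASH4",
    "DOCID2 HASH5 HASH3 HASH1 HASH4",
]
# DOCID1 and DOCID2 share HASH1, HASH3 and HASH4, so they pass a threshold of 3.
print(similar_pairs(lines, threshold=3))
```

With a threshold of 4 the same two lines would produce no pair, since only three hashes overlap.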