Hi

My quick-and-dirty, non-optimized solution would be as follows:

MAPPER
=======
OUTPUT from Mapper
    <Key = Sorted List {HASH1,HASH2,HASH3,HASH4}>    <Value = DOCID1~HASH1 HASH2 HASH3 HASH4>
    <Key = Sorted List {HASH1,HASH3,HASH4,HASH5}>    <Value = DOCID2~HASH5 HASH3 HASH1 HASH4>

REDUCER
========
Iterate over keys
For a key = (say) {HASH1,HASH2,HASH3,HASH4}
     Collect the DOCIDs from the values into one string (e.g. with a StringBuilder)

Output
KEY = {DOCID1 DOCID2}  value = null
KEY = {DOCID3 DOCID5} value = null
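
A rough sketch of the above in plain Python (no Hadoop involved), in case it helps. The function names are made up for illustration, and a dict stands in for the shuffle/sort phase. Note that keying on the whole sorted hash list only groups documents whose hash lists are identical, so the two sample documents end up under different keys. The last function is a variant that is not part of the solution above: it keys on each single hash and counts shared hashes per pair, which is one possible way to apply a threshold.

```python
from collections import defaultdict
from itertools import combinations


def parse_line(line):
    """Split 'DOCID HASH HASH ...' into (doc_id, [hashes])."""
    doc_id, *hashes = line.split()
    return doc_id, hashes


def map_exact(line):
    """Mapper as described above: the key is the sorted hash list."""
    doc_id, hashes = parse_line(line)
    yield tuple(sorted(hashes)), doc_id


def reduce_exact(key, doc_ids):
    """Reducer as described above: join the DOCIDs that share one key."""
    doc_ids = sorted(doc_ids)
    if len(doc_ids) > 1:
        yield " ".join(doc_ids)


def run_exact(lines):
    """Run mapper and reducer locally; the dict replaces shuffle/sort."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_exact(line):
            groups[key].append(value)
    out = []
    for key in sorted(groups):
        out.extend(reduce_exact(key, groups[key]))
    return out


def pairs_sharing(lines, threshold):
    """Variant (an assumption, not from the solution above): key on each
    single hash, then count how many hashes every document pair shares."""
    docs_by_hash = defaultdict(set)
    for line in lines:
        doc_id, hashes = parse_line(line)
        for h in set(hashes):
            docs_by_hash[h].add(doc_id)
    shared = defaultdict(int)
    for doc_ids in docs_by_hash.values():
        for pair in combinations(sorted(doc_ids), 2):
            shared[pair] += 1
    return sorted(" ".join(p) for p, n in shared.items() if n >= threshold)


lines = [
    "DOCID1   HASH1 HASH2 HASH3 HASH4",
    "DOCID2   HASH5 HASH3 HASH1 HASH4",
]
print(run_exact(lines))         # [] since the two sorted hash lists differ
print(pairs_sharing(lines, 3))  # ['DOCID1 DOCID2'] (HASH1, HASH3, HASH4 shared)
```

The per-hash variant generates every pair for each hash, so a hash that appears in very many documents would get expensive; on a real cluster that would need some care.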

I hope I have understood your problem correctly. If not, sorry about that.

sanjay

From: parnab kumar <[email protected]>
Reply-To: [email protected]
Date: Friday, June 14, 2013 7:06 AM
To: [email protected]
Subject: How to design the mapper and reducer for the following problem

The input is a file in which each line corresponds to a document. Each document
is identified by some fingerprints. For example, a line in the input file
is of the following form:

input:
---------------------
DOCID1   HASH1 HASH2 HASH3 HASH4
DOCID2   HASH5 HASH3 HASH1 HASH4

The MapReduce job should output the pairs of DOCIDs that share at least a
threshold number of hashes.

output:
--------------------------
DOCID1 DOCID2
DOCID3 DOCID5
