You basically have a "record similarity scoring and linking" problem -- common
in data-quality software like ours. Conceptually it is a cross-product of all
records, counting the number of hash keys each pair has in common and then
outputting the pairs that exceed a threshold. That is very slow for large data
because the intermediate data set -- or at least the number of iterations -- is
N-squared: roughly half a trillion candidate pairs for a million documents.
If you have assurance that the frequency of any given HASH value is low -- low
enough that all the DOCIDs containing a given hash key fit into memory -- it can
be done as follows (a rough code sketch follows the list):
1) Mapper1 outputs one tuple per hash, with the hash as key: {HASH1, DOCID},
{HASH2, DOCID}, {HASH3, DOCID}, {HASH4, DOCID} for each record in your sample input
2) Reducer1 loads all tuples with same HASH into memory.
3) Reducer1 outputs all tuples { DOCID1, DOCID2, HASH } that share the
hash key (nested loop, but only output where DOCID1 < DOCID2)
4) Mapper2 loads the tuples output by Reducer1 and treats { DOCID1, DOCID2 } as the key
5) Reducer2 counts {DOCID1,DOCID2} instances and outputs the DOCID pairs for
those exceeding the threshold.
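
Here is a minimal sketch of that two-job pipeline in Java (new-style
org.apache.hadoop.mapreduce API), assuming whitespace-delimited input lines of
the form "DOCID HASH1 HASH2 ...". Driver/job wiring is omitted, the THRESHOLD
constant is a placeholder for your real cutoff, and for brevity Reducer1 emits a
count of 1 per pair rather than the hash itself -- job 2 only needs to count
occurrences, not see which hashes matched:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SimilarDocs {

  static final int THRESHOLD = 2;  // placeholder -- set to your real cutoff

  // Job 1, step 1: emit (HASH, DOCID) for every hash in the record
  public static class HashMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] f = value.toString().trim().split("\\s+");
      for (int i = 1; i < f.length; i++) {
        ctx.write(new Text(f[i]), new Text(f[0]));   // key = hash, value = docid
      }
    }
  }

  // Job 1, steps 2-3: load all DOCIDs for one HASH, emit each pair once
  public static class PairReducer extends Reducer<Text, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void reduce(Text hash, Iterable<Text> docids, Context ctx)
        throws IOException, InterruptedException {
      List<String> ids = new ArrayList<>();
      for (Text d : docids) ids.add(d.toString());   // assumes this fits in memory
      for (int i = 0; i < ids.size(); i++) {
        for (int j = i + 1; j < ids.size(); j++) {
          String a = ids.get(i), b = ids.get(j);
          // emit each pair in one canonical order so that DOCID1 < DOCID2
          String pair = a.compareTo(b) < 0 ? a + "\t" + b : b + "\t" + a;
          ctx.write(new Text(pair), ONE);
        }
      }
    }
  }

  // Job 2, step 4: key the job-1 output on the (DOCID1, DOCID2) pair
  public static class PairCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] f = value.toString().split("\t");     // DOCID1, DOCID2, 1 from job 1
      ctx.write(new Text(f[0] + "\t" + f[1]), ONE);
    }
  }

  // Job 2, step 5: count shared hashes per pair, keep those meeting the threshold
  public static class ThresholdReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text pair, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable c : counts) total += c.get();
      if (total >= THRESHOLD) ctx.write(pair, new IntWritable(total));
    }
  }
}

Normalizing each pair to DOCID1 < DOCID2 in PairReducer is what makes (A,B) and
(B,A) land on the same key in job 2, so a pair that shares several hashes is
counted once per shared hash rather than twice.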
If you have no such assurance, make Mapper1 a map-only job, and replace
Reducer1 with a new job that joins by HASH. Joins are not standardized in MR
but can be done with MultipleInputs, and of course Pig has this built in.
Searching on "Hadoop join" will give you some ideas of how to implement it in
straight MR; a generic reduce-side join skeleton is sketched below.
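
For illustration only, here is a generic reduce-side join skeleton built on
MultipleInputs, shown joining two tab-delimited inputs on their first field
(the class names and argument layout are mine, not a standard recipe). Adapting
it to the HASH self-join, and adding a secondary sort so only one side has to
be buffered per key, is left to you:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceSideJoin {

  // Each mapper tags its records with a side marker so the reducer can tell them apart.
  public static class LeftMapper extends Mapper<LongWritable, Text, Text, Text> {
    protected void map(LongWritable k, Text v, Context ctx)
        throws IOException, InterruptedException {
      String[] f = v.toString().split("\t", 2);      // assumes at least two columns
      ctx.write(new Text(f[0]), new Text("L\t" + f[1]));
    }
  }
  public static class RightMapper extends Mapper<LongWritable, Text, Text, Text> {
    protected void map(LongWritable k, Text v, Context ctx)
        throws IOException, InterruptedException {
      String[] f = v.toString().split("\t", 2);
      ctx.write(new Text(f[0]), new Text("R\t" + f[1]));
    }
  }

  // Buffers both sides for one key and emits the cross product; with a secondary
  // sort the "L" records would arrive first and only that side needs buffering.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    protected void reduce(Text key, Iterable<Text> vals, Context ctx)
        throws IOException, InterruptedException {
      List<String> left = new ArrayList<>(), right = new ArrayList<>();
      for (Text v : vals) {
        String[] f = v.toString().split("\t", 2);
        (f[0].equals("L") ? left : right).add(f[1]);
      }
      for (String l : left)
        for (String r : right)
          ctx.write(key, new Text(l + "\t" + r));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "reduce-side join");
    job.setJarByClass(ReduceSideJoin.class);
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, LeftMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, RightMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

In Pig the equivalent is a one-line JOIN, which is usually the quicker path if
you already run Pig.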
John
From: parnab kumar [mailto:[email protected]]
Sent: Friday, June 14, 2013 8:06 AM
To: [email protected]
Subject: How to design the mapper and reducer for the following problem
An input file where each line corresponds to a document. Each document is
identified by some fingerprints. For example, a line in the input file
is of the following form:
input:
---------------------
DOCID1 HASH1 HASH2 HASH3 HASH4
DOCID2 HASH5 HASH3 HASH1 HASH4
The output of the MapReduce job should be the pairs of DOCIDs which share at
least a threshold number of HASHes in common.
output:
--------------------------
DOCID1 DOCID2
DOCID3 DOCID5