The input is a file in which each line corresponds to a document. Each document is identified by a set of fingerprints (hashes). For example, the lines of the input file have the following form:
input:
---------------------
DOCID1 HASH1 HASH2 HASH3 HASH4
DOCID2 HASH5 HASH3 HASH1 HASH4

The MapReduce job should output every pair of DOCIDs that have at least a threshold number of HASH values in common.

output:
--------------------------
DOCID1 DOCID2
DOCID3 DOCID5
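One possible way to structure this is sketched below in plain Python. It only simulates the map, shuffle, and reduce phases in memory and does not use any Hadoop API. The function names `map_doc` and `similar_pairs` are hypothetical, and the sketch assumes that "threshold" means "at least this many shared hashes" and that fields are separated by whitespace. The idea is to invert the input so that each hash maps to the documents containing it, emit every pair of documents under the same hash, and then count how often each pair occurs.

```python
from collections import defaultdict
from itertools import combinations


def map_doc(line):
    """Map step: emit (hash, doc_id) for every distinct fingerprint on a line."""
    doc_id, *hashes = line.split()
    for h in set(hashes):  # ignore repeated hashes within one document
        yield h, doc_id


def similar_pairs(lines, threshold):
    """Return sorted (doc_a, doc_b) pairs sharing at least `threshold` hashes."""
    # Shuffle: group document IDs by hash (an inverted index).
    docs_by_hash = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        for h, doc_id in map_doc(line):
            docs_by_hash[h].add(doc_id)

    # First reduce: each pair of documents listed under one hash shares that hash.
    shared = defaultdict(int)
    for doc_ids in docs_by_hash.values():
        for pair in combinations(sorted(doc_ids), 2):
            shared[pair] += 1

    # Second reduce: keep only the pairs that reach the threshold.
    return sorted(pair for pair, count in shared.items() if count >= threshold)


lines = [
    "DOCID1 HASH1 HASH2 HASH3 HASH4",
    "DOCID2 HASH5 HASH3 HASH1 HASH4",
]
print(similar_pairs(lines, threshold=3))  # [('DOCID1', 'DOCID2')]
```

In a real job the two reduce steps would typically become two chained MapReduce jobs: the first keyed by hash, the second keyed by document pair. Note that a hash shared by many documents produces a quadratic number of pairs, so very common hashes may need to be filtered out first.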
