Hi,
I have a set of hashes. Each hash is a 32-bit integer (stored as a Long). Two
hashes are similar if their Hamming distance is less than or equal to 2.
I need to group together hashes that are mutually similar to one another,
i.e. each line of the output file should contain mutually similar keys.
I implemented a custom Writable, and its compareTo method looks as
follows:
    public int compareTo(Object o) {
        Long thisHash = this.hash;
        Long thatHash = ((DocumentHash) o).hash;
        // Treat hashes within Hamming distance 2 as equal keys.
        if (hammingDist(thisHash, thatHash) <= 2) {
            return 0;
        }
        return thisHash.compareTo(thatHash);
    }
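
The hammingDist helper is not shown above. A minimal sketch, assuming it
simply counts the bit positions in which the two hashes differ (the class
name HammingSketch is only for illustration and is not part of the job):

```java
class HammingSketch {
    // Number of bit positions in which a and b differ:
    // XOR leaves a 1 wherever the bits disagree, bitCount counts those 1s.
    public static int hammingDist(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        // 1011 and 1000 differ in the two lowest bits.
        System.out.println(hammingDist(0b1011L, 0b1000L)); // prints 2
    }
}
```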
In the map function I emit the custom Writable as the key, and in the reduce
function I group by key.
I checked the output file and exhaustively tested the hashes manually, and
found that on most lines the hashes are mutually similar. However, I found
that some hashes, even though they are similar to a group, do not appear on
that group's line of the output.
For example, consider the following hashes:
HASH1 = 69215512
HASH2 = 69215512
HASH3 = 69215512
HASH4 = 69215568
All 4 hashes above are mutually similar, i.e. within distance 2 of each
other. Still, in the output file I found two separate records, with HASH1
and HASH2 on one line and HASH3 and HASH4 on another, as follows:
HASH4 HASH3
HASH1 HASH2
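
The manual check of these four values can be reproduced with a short
program. This is only a sketch: it assumes the Hamming distance is the
number of differing bits, and the names MutualSimilarityCheck and
mutuallySimilar are illustrative, not part of the actual job.

```java
class MutualSimilarityCheck {
    // Number of bit positions in which a and b differ.
    public static int hammingDist(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // True if every pair of hashes is within maxDist of each other.
    public static boolean mutuallySimilar(long[] hashes, int maxDist) {
        for (int i = 0; i < hashes.length; i++) {
            for (int j = i + 1; j < hashes.length; j++) {
                if (hammingDist(hashes[i], hashes[j]) > maxDist) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // HASH1..HASH4 as listed in this post.
        long[] hashes = {69215512L, 69215512L, 69215512L, 69215568L};
        System.out.println(mutuallySimilar(hashes, 2)); // prints true
    }
}
```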
Can someone explain why this happens?
Thanks,
Parnab.