The "similar" relation is not transitive: if a is similar to b and b is similar to c, a may still not be similar to c. How do you define the groups in that case?
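To make this concrete, here is a minimal sketch with three made-up values a, b and c. The class name SimilarityDemo and the static compare method are hypothetical; compare only mirrors the rule in the quoted compareTo, and hammingDist is assumed to count differing bits. The Comparable contract requires that x.compareTo(y) == 0 implies x and y compare the same way against every other z, and this rule breaks that:

```java
// Hypothetical demo class; compare() mirrors the rule in the quoted compareTo.
class SimilarityDemo {
    // Number of bit positions in which x and y differ (assumed meaning of hammingDist).
    static int hammingDist(long x, long y) {
        return Long.bitCount(x ^ y);
    }

    // Treats two hashes as "equal" when they are within Hamming distance 2,
    // otherwise falls back to numeric order.
    static int compare(long x, long y) {
        if (hammingDist(x, y) <= 2) {
            return 0;
        }
        return Long.compare(x, y);
    }

    public static void main(String[] args) {
        long a = 0b0000L, b = 0b0011L, c = 0b1111L;
        System.out.println(compare(a, b)); // a and b differ in 2 bits, so "equal"
        System.out.println(compare(b, c)); // b and c differ in 2 bits, so "equal"
        System.out.println(compare(a, c)); // a and c differ in 4 bits, so not "equal"
    }
}
```

Here a "equals" b and b "equals" c, but a does not "equal" c. Any sort or grouping step that relies on compareTo can then give results that depend on the order in which keys happen to be compared.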
Regards,
Stanley Shi

On Sat, Jun 21, 2014 at 2:51 AM, parnab kumar <parnab.2...@gmail.com> wrote:

> Hi,
>
> I have a set of hashes. Each hash is a 32-bit long integer. Two hashes
> are similar if their Hamming distance is less than or equal to 2.
>
> I need to group together hashes that are mutually similar to one another,
> i.e. each line of the output file should contain mutually similar keys.
>
> I implemented a custom Writable, and its compareTo method looks as
> follows:
>
>     public int compareTo(Object o) {
>         Long thisHash = this.hash;
>         Long thatHash = ((DocumentHash) o).hash;
>         if (hammingDist(thisHash, thatHash) <= 2) {
>             return 0;
>         }
>         return thisHash.compareTo(thatHash);
>     }
>
> In the map function I emit the custom Writable as the key, and in the
> reduce I group by the keys.
>
> I checked the output file and exhaustively tested the hashes manually, and
> found that most hashes in each line are mutually similar. However, I found
> that some hashes, even though they are similar to a group, are not in the
> output for that group.
>
> For example, consider the following hashes:
>
> HASH1 = 69215512
> HASH2 = 69215512
> HASH3 = 69215512
> HASH4 = 69215568
>
> All 4 hashes above are mutually similar and within distance 2 of
> each other. Still, in the output file I found two separate records, where
> HASH1 and HASH2 occur in one line and HASH3 and HASH4 occur in another,
> as follows:
>
> HASH4 HASH3
> HASH1 HASH2
>
> Can someone explain why this happens?
>
> Thanks,
> Parnab.
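On the grouping question itself, one possible way to make the groups well defined is to take the transitive closure of "similar": put two hashes in the same group whenever a chain of similar hashes connects them. The sketch below does this in memory with a union-find structure. It is only an illustration under that assumption, not a MapReduce job; the class HashGrouper and its methods are hypothetical names. Members of a group built this way are not guaranteed to be pairwise within distance 2, so it is a different requirement from "mutually similar".

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of Hadoop or of the original job.
class HashGrouper {
    // Number of bit positions in which x and y differ.
    static int hammingDist(long x, long y) {
        return Long.bitCount(x ^ y);
    }

    // Returns the representative index of i's group, halving paths on the way.
    static int find(int[] parent, int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];
            i = parent[i];
        }
        return i;
    }

    // Groups hashes into connected components of the "distance <= maxDist" graph.
    // Compares every pair, so the cost is quadratic in the number of hashes.
    static Collection<List<Long>> group(long[] hashes, int maxDist) {
        int n = hashes.length;
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) {
            parent[i] = i;
        }
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (hammingDist(hashes[i], hashes[j]) <= maxDist) {
                    // Merge the two groups by linking one representative to the other.
                    parent[find(parent, i)] = find(parent, j);
                }
            }
        }
        Map<Integer, List<Long>> groups = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            groups.computeIfAbsent(find(parent, i), k -> new ArrayList<>()).add(hashes[i]);
        }
        return groups.values();
    }

    public static void main(String[] args) {
        // Two of the hash values from the example plus an unrelated value.
        long[] hashes = {69215512L, 69215568L, 0L};
        System.out.println(group(hashes, 2));
    }
}
```

With this definition, the chain 0b0000, 0b0011, 0b1111 ends up in a single group even though its first and last members are 4 bits apart. Whether that is acceptable depends on what the groups are used for.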