Sorry, forget about my previous comment about 2. ----- For general purpose, sometime we might need to hash on value, so we might need an extra parameter , hashOnKey ?
------ This is not correct, the hash key selection is based on mappers input values. if the mapper input value is vector, then we need to select by vector key or value (using value() or index()), not by mapper key or mapper value. the extra parameter might look like hashKeyOnMapperKey (or hashkeyOnMapperValue) Sam ________________________________ From: sam wu <[email protected]> To: [email protected] Sent: Friday, June 15, 2012 5:38 PM Subject: MinHash implementation thoughts I am playing with MinHashMapper, and have the following 2 penny thoughts 1. Efficiency in the map(..), around line 85 ---------- for (int i = 0; i < numHashFunctions; i++) { for (Vector.Element ele : featureVector) { ----* int value = (int) ele.get(); ----** bytesToHash[0] = (byte) (value >> 24); ..... } } } ----------- the featureVector (which can be random sparse vector) iterator ---* goes through all elements (including nonzero). When I test ASF email program (from Grant's ...). Each email has ~30000 elements. By using iteratorNonZero(), I reduce from 30000 to around 50 operations (both for converting from int to byte[], and hash computation) for some email.??? 2. which field key to hash we read from sequence file, so we have key and value. If we read from tfidf doc sequenceFile, then key will be termId, value will be tfidf value. What to hash for MinHash ? I think we need to hash on termId, instead of tfidf (meaningless for doc MinHash ?) so we need to get value = ele.index() ---** When I test the current code, I get almost all value = 0 For general purpose, sometime we might need to hash on value, so we might need an extra parameter , hashOnKey ? 3. How to compute custerID ? I thought LSH normally has r(rows in a band) and b (band). numberHashFunction = r *b #clusterId = b so we can easily compute probability 1- exp( (1-exp(s,r)),b) for (int i = 0; i < b; i++) { for (int j = 0; j < r; j++) { clusterIdBuilder.append(minHashValues[i * b + j ]).append('-'); } .... } the current implementation works in some way, but ?? This is Friday late afternoon, Hope I am still making sense. BR Sam
