I have been playing with MinHashMapper and have the following two-penny thoughts.

1. Efficiency

In map(..), around line 85:
----------
for (int i = 0; i < numHashFunctions; i++) {
      for (Vector.Element ele : featureVector) {         ----*
        int value = (int) ele.get();                                  ----**
         bytesToHash[0] = (byte) (value >> 24);
        .....
        }
      }
    }
-----------
The featureVector iterator (---*) goes through all elements, including the
zeros, even though the vector can be a random-access sparse vector.
When I tested with the ASF email program (from Grant's ...), each email had
~30000 elements.
By using iterateNonZero() instead,
I reduced the work from 30000 to around 50 operations (each one an int to
byte[] conversion plus a hash computation) for some emails. Does that sound right?
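
To make the difference concrete, here is a minimal sketch. It does not use Mahout's Vector class; the sparse vector is mimicked with a plain Map of non-zero entries, and the class and method names (SparseIterationSketch, visitAll, visitNonZero, sampleEmail) are made up for illustration only.

```java
import java.util.Map;
import java.util.TreeMap;

// Stand-in for a sparse feature vector: only non-zero entries are stored.
// This is NOT Mahout's Vector; it only mimics the two iteration styles.
final class SparseIterationSketch {

    // Dense-style loop: touches every index in [0, cardinality), zeros included.
    static int visitAll(int cardinality) {
        int visited = 0;
        for (int index = 0; index < cardinality; index++) {
            visited++; // a byte[] conversion and a hash would run here, even for zeros
        }
        return visited;
    }

    // Sparse-style loop: touches only the stored (non-zero) entries.
    static int visitNonZero(Map<Integer, Double> nonZeros) {
        int visited = 0;
        for (Map.Entry<Integer, Double> entry : nonZeros.entrySet()) {
            visited++; // same work, but only for terms present in the document
        }
        return visited;
    }

    // Builds a hypothetical email vector with `terms` non-zero entries
    // spread evenly over `cardinality` possible term ids.
    static Map<Integer, Double> sampleEmail(int terms, int cardinality) {
        Map<Integer, Double> nonZeros = new TreeMap<>();
        int stride = cardinality / terms;
        for (int k = 0; k < terms; k++) {
            nonZeros.put(k * stride, 1.0 + k);
        }
        return nonZeros;
    }
}
```

With sampleEmail(50, 30000), visitAll(30000) counts 30000 visits while visitNonZero counts 50, and that count is repeated once per hash function.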

2. Which field to hash
We read from a sequence file, so we have a key and a value.
If we read from the tf-idf document sequence file, then
the key will be the termId and the value will be the tf-idf weight.
Which one should MinHash hash?
I think we need to hash the termId instead of the tf-idf weight (which seems
meaningless for document MinHash?),
so at ---** we need
value = ele.index()

When I test the current code, almost every value comes out as 0.

For general-purpose use we might sometimes need to hash the value, so we might
need an extra parameter, e.g. hashOnKey?
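
A rough sketch of what such a switch could look like. The hashOnKey flag and the names HashFieldSketch, fieldToHash and toBytes are hypothetical, not existing Mahout API, and the last three bytes in toBytes are my assumption about what the elided lines of the loop do.

```java
final class HashFieldSketch {

    // Picks the integer to hash for one vector element.
    // hashOnKey == true  -> hash the term id (the element index)
    // hashOnKey == false -> hash the truncated weight, as the (int) cast does today
    static int fieldToHash(int index, double weight, boolean hashOnKey) {
        return hashOnKey ? index : (int) weight;
    }

    // Big-endian split of an int into 4 bytes before hashing.
    // Only the first byte is shown in the quoted loop; the rest is assumed.
    static byte[] toBytes(int value) {
        return new byte[] {
            (byte) (value >> 24),
            (byte) (value >> 16),
            (byte) (value >> 8),
            (byte) value
        };
    }
}
```

Note that a weight such as 0.37 truncates to 0 under the (int) cast, so every weight below 1 feeds the same input to the hash. Zero elements visited by the full iterator do the same.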

3. How to compute clusterId?

I thought LSH normally has r (rows per band) and b (bands), with

numHashFunctions = r * b

number of clusterIds = b

so we can easily compute the probability that two documents with
similarity s share at least one clusterId: 1 - (1 - s^r)^b
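
As a quick check of the banding probability 1 - (1 - s^r)^b, here is a small sketch. The class and method names are made up, and s is taken to be the Jaccard similarity of the two documents.

```java
final class LshProbabilitySketch {

    // Probability that two documents with similarity s agree on all r rows
    // of at least one of the b bands, i.e. 1 - (1 - s^r)^b.
    static double collisionProbability(double s, int r, int b) {
        double bandMatches = Math.pow(s, r);           // all r rows of one band agree
        double noBandMatches = Math.pow(1.0 - bandMatches, b); // every band disagrees
        return 1.0 - noBandMatches;
    }
}
```

For example, with r = 5 and b = 20 (100 hash functions), documents with s = 0.8 collide in some band with probability of roughly 0.9996.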

for (int i = 0; i < b; i++) {
  for (int j = 0; j < r; j++) {
     // band i starts at offset i * r
     clusterIdBuilder.append(minHashValues[i * r + j]).append('-');
  }
   ....
}
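
A self-contained version of the banding idea, as a sketch only: it assumes band i covers minHashValues[i * r] through minHashValues[i * r + r - 1], and the names BandedClusterIdSketch and clusterIds are hypothetical. What the mapper does with each id afterwards is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class BandedClusterIdSketch {

    // Splits r * b min-hash values into b bands of r rows each and
    // joins the values of each band into one cluster id string.
    static List<String> clusterIds(int[] minHashValues, int r, int b) {
        if (r <= 0 || b <= 0 || minHashValues.length != r * b) {
            throw new IllegalArgumentException("expected r * b min-hash values with r, b > 0");
        }
        List<String> ids = new ArrayList<>(b);
        for (int i = 0; i < b; i++) {
            StringBuilder clusterIdBuilder = new StringBuilder();
            for (int j = 0; j < r; j++) {
                // band i starts at offset i * r
                clusterIdBuilder.append(minHashValues[i * r + j]).append('-');
            }
            // drop the trailing '-'
            clusterIdBuilder.setLength(clusterIdBuilder.length() - 1);
            ids.add(clusterIdBuilder.toString());
        }
        return ids;
    }
}
```

With six min-hash values 1..6, r = 2 and b = 3, this gives the ids "1-2", "3-4" and "5-6", one per band.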

The current implementation works after a fashion, but I am not sure about it.


It is late Friday afternoon,

so I hope I am still making sense.


BR

Sam
