Re: MinHash implementation thoughts

chsz wu Mon, 18 Jun 2012 08:11:03 -0700

Sorry, forget about my previous comment about 2.
-----
For general purpose, sometime we might need to hash on value, so we might
need an extra parameter , hashOnKey ?


------
This is not correct, the hash key selection is based on mappers input values.
if the mapper input value is vector, then we need to select by vector key or 
value (using value() or index()), not 
by mapper key or mapper value.

the extra parameter might look like  hashKeyOnMapperKey (or 
hashkeyOnMapperValue)



Sam


________________________________
 From: sam wu <[email protected]>
To: [email protected] 
Sent: Friday, June 15, 2012 5:38 PM
Subject: MinHash implementation thoughts
 
I am playing with MinHashMapper, and have the following 2 penny thoughts

1. Efficiency

in the map(..), around line 85
----------
for (int i = 0; i < numHashFunctions; i++) {
      for (Vector.Element ele : featureVector) {         ----*
        int value = (int) ele.get();                                  ----**
         bytesToHash[0] = (byte) (value >> 24);
        .....
        }
      }
    }
-----------
the featureVector (which can be random sparse vector) iterator ---* goes
through all elements (including nonzero).
When I test ASF email program (from Grant's ...). Each email has ~30000
elements.
By using iteratorNonZero(),
I reduce from 30000 to around 50 operations (both for converting from int
to byte[], and hash computation) for some email.???

2. which field key to hash
we read from sequence file, so we have key and value.
If we read from tfidf doc sequenceFile, then
key will be termId, value will be tfidf value.
What to hash for MinHash ?
I think we need to hash on termId, instead of tfidf (meaningless for doc
MinHash ?)
so we need to get
value = ele.index() ---**

When I test the current code, I get almost all value = 0

For general purpose, sometime we might need to hash on value, so we might
need an extra parameter , hashOnKey ?

3. How to compute custerID ?

I thought LSH normally has r(rows in a band) and b (band).
numberHashFunction = r *b

#clusterId = b

so we can easily compute probability 1- exp( (1-exp(s,r)),b)

for (int i = 0; i < b; i++) {
  for (int j = 0; j < r; j++) {
     clusterIdBuilder.append(minHashValues[i * b + j ]).append('-');
  }
   ....
}

the current implementation works in some way, but ??


This is Friday late afternoon,

Hope I am still making sense.


BR

Sam

Re: MinHash implementation thoughts

Reply via email to