Re: MinHash implementation thoughts

Lance Norskog Tue, 19 Jun 2012 02:06:33 -0700

Some Mahout people are not sure that the MinHash implementation is
right. You should try your versions of how the algorithms should work.


On Mon, Jun 18, 2012 at 8:10 AM, chsz wu <[email protected]> wrote:
> Sorry, forget about my previous comment about 2.
> -----
> For general purpose, sometime we might need to hash on value, so we might
> need an extra parameter , hashOnKey ?
>
> ------
> This is not correct, the hash key selection is based on mappers input values.
> if the mapper input value is vector, then we need to select by vector key or 
> value (using value() or index()), not
> by mapper key or mapper value.
>
> the extra parameter might look like  hashKeyOnMapperKey (or 
> hashkeyOnMapperValue)
>
>
>
> Sam
>
>
> ________________________________
>  From: sam wu <[email protected]>
> To: [email protected]
> Sent: Friday, June 15, 2012 5:38 PM
> Subject: MinHash implementation thoughts
>
> I am playing with MinHashMapper, and have the following 2 penny thoughts
>
> 1. Efficiency
>
> in the map(..), around line 85
> ----------
> for (int i = 0; i < numHashFunctions; i++) {
>       for (Vector.Element ele : featureVector) {         ----*
>         int value = (int) ele.get();                                  ----**
>          bytesToHash[0] = (byte) (value >> 24);
>         .....
>         }
>       }
>     }
> -----------
> the featureVector (which can be random sparse vector) iterator ---* goes
> through all elements (including nonzero).
> When I test ASF email program (from Grant's ...). Each email has ~30000
> elements.
> By using iteratorNonZero(),
> I reduce from 30000 to around 50 operations (both for converting from int
> to byte[], and hash computation) for some email.???
>
> 2. which field key to hash
> we read from sequence file, so we have key and value.
> If we read from tfidf doc sequenceFile, then
> key will be termId, value will be tfidf value.
> What to hash for MinHash ?
> I think we need to hash on termId, instead of tfidf (meaningless for doc
> MinHash ?)
> so we need to get
> value = ele.index() ---**
>
> When I test the current code, I get almost all value = 0
>
> For general purpose, sometime we might need to hash on value, so we might
> need an extra parameter , hashOnKey ?
>
> 3. How to compute custerID ?
>
> I thought LSH normally has r(rows in a band) and b (band).
> numberHashFunction = r *b
>
> #clusterId = b
>
> so we can easily compute probability 1- exp( (1-exp(s,r)),b)
>
> for (int i = 0; i < b; i++) {
>   for (int j = 0; j < r; j++) {
>      clusterIdBuilder.append(minHashValues[i * b + j ]).append('-');
>   }
>    ....
> }
>
> the current implementation works in some way, but ??
>
>
> This is Friday late afternoon,
>
> Hope I am still making sense.
>
>
> BR
>
> Sam



-- 
Lance Norskog
[email protected]

Re: MinHash implementation thoughts

Reply via email to