Re: MinHash implementation thoughts

Sean Owen Tue, 19 Jun 2012 02:21:08 -0700

Lance, what are you referring to here?
Sam if you have some patch to suggest that would be good, open a JIRA issue.


On Tue, Jun 19, 2012 at 10:05 AM, Lance Norskog <[email protected]> wrote:
> Some Mahout people are not sure that the MinHash implementation is
> right. You should try your versions of how the algorithms should work.
>
> On Mon, Jun 18, 2012 at 8:10 AM, chsz wu <[email protected]> wrote:
>> Sorry, forget about my previous comment about 2.
>> -----
>> For general purpose, sometime we might need to hash on value, so we might
>> need an extra parameter , hashOnKey ?
>>
>> ------
>> This is not correct, the hash key selection is based on mappers input values.
>> if the mapper input value is vector, then we need to select by vector key or 
>> value (using value() or index()), not
>> by mapper key or mapper value.
>>
>> the extra parameter might look like  hashKeyOnMapperKey (or 
>> hashkeyOnMapperValue)
>>
>>
>>
>> Sam
>>
>>
>> ________________________________
>>  From: sam wu <[email protected]>
>> To: [email protected]
>> Sent: Friday, June 15, 2012 5:38 PM
>> Subject: MinHash implementation thoughts
>>
>> I am playing with MinHashMapper, and have the following 2 penny thoughts
>>
>> 1. Efficiency
>>
>> in the map(..), around line 85
>> ----------
>> for (int i = 0; i < numHashFunctions; i++) {
>>       for (Vector.Element ele : featureVector) {         ----*
>>         int value = (int) ele.get();                                  ----**
>>          bytesToHash[0] = (byte) (value >> 24);
>>         .....
>>         }
>>       }
>>     }
>> -----------
>> the featureVector (which can be random sparse vector) iterator ---* goes
>> through all elements (including nonzero).
>> When I test ASF email program (from Grant's ...). Each email has ~30000
>> elements.
>> By using iteratorNonZero(),
>> I reduce from 30000 to around 50 operations (both for converting from int
>> to byte[], and hash computation) for some email.???
>>
>> 2. which field key to hash
>> we read from sequence file, so we have key and value.
>> If we read from tfidf doc sequenceFile, then
>> key will be termId, value will be tfidf value.
>> What to hash for MinHash ?
>> I think we need to hash on termId, instead of tfidf (meaningless for doc
>> MinHash ?)
>> so we need to get
>> value = ele.index() ---**
>>
>> When I test the current code, I get almost all value = 0
>>
>> For general purpose, sometime we might need to hash on value, so we might
>> need an extra parameter , hashOnKey ?
>>
>> 3. How to compute custerID ?
>>
>> I thought LSH normally has r(rows in a band) and b (band).
>> numberHashFunction = r *b
>>
>> #clusterId = b
>>
>> so we can easily compute probability 1- exp( (1-exp(s,r)),b)
>>
>> for (int i = 0; i < b; i++) {
>>   for (int j = 0; j < r; j++) {
>>      clusterIdBuilder.append(minHashValues[i * b + j ]).append('-');
>>   }
>>    ....
>> }
>>
>> the current implementation works in some way, but ??
>>
>>
>> This is Friday late afternoon,
>>
>> Hope I am still making sense.
>>
>>
>> BR
>>
>> Sam
>
>
>
> --
> Lance Norskog
> [email protected]

Re: MinHash implementation thoughts

Reply via email to