Re: MinHash implementation thoughts

chsz wu Tue, 19 Jun 2012 22:00:28 -0700

Not surprised with  the non-promising result.
As I said in my first email,


if the mapper input sequence file is key (docId), value (vector of term-tfIdf). 
then the hash key should be ~~ value->iteratorNonZero()->getNext()->getIndex(),
  getIndex() return term(Id) , right hash key
 getValue() return tfIdf , not meaningful for hash.

Sam 


________________________________
 From: Lance Norskog <[email protected]>
To: [email protected]; chsz wu <[email protected]> 
Sent: Tuesday, June 19, 2012 8:18 PM
Subject: Re: MinHash implementation thoughts
 
Sean-

http://www.lucidimagination.com/search/document/1d24eb86fcf3b8ab#b90e74b43cfb4c42
http://www.lucidimagination.com/search/document/ffc7e312b87088dc



On Tue, Jun 19, 2012 at 8:54 AM, chsz wu <[email protected]> wrote:
> I think for minhash to be generic and efficient, 3 things needed to be done 
> at least,
> 1. Prepare term-doc matrix to minimize hash computation (one more MR job).
>    Quite often we have doc-term data, for each term we have to averagely hash 
> average_doc_freq times,
>    With term-doc matrix, we reduce  per term hash to one.
> 2. Where is the hash key and docId
>    With Hadoop, mapper has input key and value. We need to map out Id and 
> hash_key.
>    To be general, we need to pass in a hashKeyMap function to extract id and 
> hashkey.
>    the combined clusterId(**_**_...) can be rehashed one more time to reduce 
> size
> 3. Post processing, sort by similarity probability (numbers of hash_match), 
> kind of sorting common friends by the number of friends in common.
>    Another MR job.
>
> I'll try to look at how to make a patch, and do my best.
>
> Sam
>
>
>
>
>
> ________________________________
>  From: Sean Owen <[email protected]>
> To: [email protected]
> Sent: Tuesday, June 19, 2012 2:20 AM
> Subject: Re: MinHash implementation thoughts
>
> Lance, what are you referring to here?
> Sam if you have some patch to suggest that would be good, open a JIRA issue.
>
> On Tue, Jun 19, 2012 at 10:05 AM, Lance Norskog <[email protected]> wrote:
>> Some Mahout people are not sure that the MinHash implementation is
>> right. You should try your versions of how the algorithms should work.
>>
>> On Mon, Jun 18, 2012 at 8:10 AM, chsz wu <[email protected]> wrote:
>>> Sorry, forget about my previous comment about 2.
>>> -----
>>> For general purpose, sometime we might need to hash on value, so we might
>>> need an extra parameter , hashOnKey ?
>>>
>>> ------
>>> This is not correct, the hash key selection is based on mappers input 
>>> values.
>>> if the mapper input value is vector, then we need to select by vector key 
>>> or value (using value() or index()), not
>>> by mapper key or mapper value.
>>>
>>> the extra parameter might look like  hashKeyOnMapperKey (or 
>>> hashkeyOnMapperValue)
>>>
>>>
>>>
>>> Sam
>>>
>>>
>>> ________________________________
>>>  From: sam wu <[email protected]>
>>> To: [email protected]
>>> Sent: Friday, June 15, 2012 5:38 PM
>>> Subject: MinHash implementation thoughts
>>>
>>> I am playing with MinHashMapper, and have the following 2 penny thoughts
>>>
>>> 1. Efficiency
>>>
>>> in the map(..), around line 85
>>> ----------
>>> for (int i = 0; i < numHashFunctions; i++) {
>>>       for (Vector.Element ele : featureVector) {         ----*
>>>         int value = (int) ele.get();                                  ----**
>>>          bytesToHash[0] = (byte) (value >> 24);
>>>         .....
>>>         }
>>>       }
>>>     }
>>> -----------
>>> the featureVector (which can be random sparse vector) iterator ---* goes
>>> through all elements (including nonzero).
>>> When I test ASF email program (from Grant's ...). Each email has ~30000
>>> elements.
>>> By using iteratorNonZero(),
>>> I reduce from 30000 to around 50 operations (both for converting from int
>>> to byte[], and hash computation) for some email.???
>>>
>>> 2. which field key to hash
>>> we read from sequence file, so we have key and value.
>>> If we read from tfidf doc sequenceFile, then
>>> key will be termId, value will be tfidf value.
>>> What to hash for MinHash ?
>>> I think we need to hash on termId, instead of tfidf (meaningless for doc
>>> MinHash ?)
>>> so we need to get
>>> value = ele.index() ---**
>>>
>>> When I test the current code, I get almost all value = 0
>>>
>>> For general purpose, sometime we might need to hash on value, so we might
>>> need an extra parameter , hashOnKey ?
>>>
>>> 3. How to compute custerID ?
>>>
>>> I thought LSH normally has r(rows in a band) and b (band).
>>> numberHashFunction = r *b
>>>
>>> #clusterId = b
>>>
>>> so we can easily compute probability 1- exp( (1-exp(s,r)),b)
>>>
>>> for (int i = 0; i < b; i++) {
>>>   for (int j = 0; j < r; j++) {
>>>      clusterIdBuilder.append(minHashValues[i * b + j ]).append('-');
>>>   }
>>>    ....
>>> }
>>>
>>> the current implementation works in some way, but ??
>>>
>>>
>>> This is Friday late afternoon,
>>>
>>> Hope I am still making sense.
>>>
>>>
>>> BR
>>>
>>> Sam
>>
>>
>>
>> --
>> Lance Norskog
>> [email protected]



-- 
Lance Norskog
[email protected]

Re: MinHash implementation thoughts

Reply via email to