I think for minhash to be generic and efficient, 3 things needed to be done at least, 1. Prepare term-doc matrix to minimize hash computation (one more MR job). Quite often we have doc-term data, for each term we have to averagely hash average_doc_freq times, With term-doc matrix, we reduce per term hash to one. 2. Where is the hash key and docId With Hadoop, mapper has input key and value. We need to map out Id and hash_key. To be general, we need to pass in a hashKeyMap function to extract id and hashkey. the combined clusterId(**_**_...) can be rehashed one more time to reduce size 3. Post processing, sort by similarity probability (numbers of hash_match), kind of sorting common friends by the number of friends in common. Another MR job.
I'll try to look at how to make a patch, and do my best. Sam ________________________________ From: Sean Owen <[email protected]> To: [email protected] Sent: Tuesday, June 19, 2012 2:20 AM Subject: Re: MinHash implementation thoughts Lance, what are you referring to here? Sam if you have some patch to suggest that would be good, open a JIRA issue. On Tue, Jun 19, 2012 at 10:05 AM, Lance Norskog <[email protected]> wrote: > Some Mahout people are not sure that the MinHash implementation is > right. You should try your versions of how the algorithms should work. > > On Mon, Jun 18, 2012 at 8:10 AM, chsz wu <[email protected]> wrote: >> Sorry, forget about my previous comment about 2. >> ----- >> For general purpose, sometime we might need to hash on value, so we might >> need an extra parameter , hashOnKey ? >> >> ------ >> This is not correct, the hash key selection is based on mappers input values. >> if the mapper input value is vector, then we need to select by vector key or >> value (using value() or index()), not >> by mapper key or mapper value. >> >> the extra parameter might look like hashKeyOnMapperKey (or >> hashkeyOnMapperValue) >> >> >> >> Sam >> >> >> ________________________________ >> From: sam wu <[email protected]> >> To: [email protected] >> Sent: Friday, June 15, 2012 5:38 PM >> Subject: MinHash implementation thoughts >> >> I am playing with MinHashMapper, and have the following 2 penny thoughts >> >> 1. Efficiency >> >> in the map(..), around line 85 >> ---------- >> for (int i = 0; i < numHashFunctions; i++) { >> for (Vector.Element ele : featureVector) { ----* >> int value = (int) ele.get(); ----** >> bytesToHash[0] = (byte) (value >> 24); >> ..... >> } >> } >> } >> ----------- >> the featureVector (which can be random sparse vector) iterator ---* goes >> through all elements (including nonzero). >> When I test ASF email program (from Grant's ...). Each email has ~30000 >> elements. >> By using iteratorNonZero(), >> I reduce from 30000 to around 50 operations (both for converting from int >> to byte[], and hash computation) for some email.??? >> >> 2. which field key to hash >> we read from sequence file, so we have key and value. >> If we read from tfidf doc sequenceFile, then >> key will be termId, value will be tfidf value. >> What to hash for MinHash ? >> I think we need to hash on termId, instead of tfidf (meaningless for doc >> MinHash ?) >> so we need to get >> value = ele.index() ---** >> >> When I test the current code, I get almost all value = 0 >> >> For general purpose, sometime we might need to hash on value, so we might >> need an extra parameter , hashOnKey ? >> >> 3. How to compute custerID ? >> >> I thought LSH normally has r(rows in a band) and b (band). >> numberHashFunction = r *b >> >> #clusterId = b >> >> so we can easily compute probability 1- exp( (1-exp(s,r)),b) >> >> for (int i = 0; i < b; i++) { >> for (int j = 0; j < r; j++) { >> clusterIdBuilder.append(minHashValues[i * b + j ]).append('-'); >> } >> .... >> } >> >> the current implementation works in some way, but ?? >> >> >> This is Friday late afternoon, >> >> Hope I am still making sense. >> >> >> BR >> >> Sam > > > > -- > Lance Norskog > [email protected]
