Lance, what are you referring to here? Sam if you have some patch to suggest that would be good, open a JIRA issue.
On Tue, Jun 19, 2012 at 10:05 AM, Lance Norskog <[email protected]> wrote: > Some Mahout people are not sure that the MinHash implementation is > right. You should try your versions of how the algorithms should work. > > On Mon, Jun 18, 2012 at 8:10 AM, chsz wu <[email protected]> wrote: >> Sorry, forget about my previous comment about 2. >> ----- >> For general purpose, sometime we might need to hash on value, so we might >> need an extra parameter , hashOnKey ? >> >> ------ >> This is not correct, the hash key selection is based on mappers input values. >> if the mapper input value is vector, then we need to select by vector key or >> value (using value() or index()), not >> by mapper key or mapper value. >> >> the extra parameter might look like hashKeyOnMapperKey (or >> hashkeyOnMapperValue) >> >> >> >> Sam >> >> >> ________________________________ >> From: sam wu <[email protected]> >> To: [email protected] >> Sent: Friday, June 15, 2012 5:38 PM >> Subject: MinHash implementation thoughts >> >> I am playing with MinHashMapper, and have the following 2 penny thoughts >> >> 1. Efficiency >> >> in the map(..), around line 85 >> ---------- >> for (int i = 0; i < numHashFunctions; i++) { >> for (Vector.Element ele : featureVector) { ----* >> int value = (int) ele.get(); ----** >> bytesToHash[0] = (byte) (value >> 24); >> ..... >> } >> } >> } >> ----------- >> the featureVector (which can be random sparse vector) iterator ---* goes >> through all elements (including nonzero). >> When I test ASF email program (from Grant's ...). Each email has ~30000 >> elements. >> By using iteratorNonZero(), >> I reduce from 30000 to around 50 operations (both for converting from int >> to byte[], and hash computation) for some email.??? >> >> 2. which field key to hash >> we read from sequence file, so we have key and value. >> If we read from tfidf doc sequenceFile, then >> key will be termId, value will be tfidf value. >> What to hash for MinHash ? >> I think we need to hash on termId, instead of tfidf (meaningless for doc >> MinHash ?) >> so we need to get >> value = ele.index() ---** >> >> When I test the current code, I get almost all value = 0 >> >> For general purpose, sometime we might need to hash on value, so we might >> need an extra parameter , hashOnKey ? >> >> 3. How to compute custerID ? >> >> I thought LSH normally has r(rows in a band) and b (band). >> numberHashFunction = r *b >> >> #clusterId = b >> >> so we can easily compute probability 1- exp( (1-exp(s,r)),b) >> >> for (int i = 0; i < b; i++) { >> for (int j = 0; j < r; j++) { >> clusterIdBuilder.append(minHashValues[i * b + j ]).append('-'); >> } >> .... >> } >> >> the current implementation works in some way, but ?? >> >> >> This is Friday late afternoon, >> >> Hope I am still making sense. >> >> >> BR >> >> Sam > > > > -- > Lance Norskog > [email protected]
