Some Mahout people are not sure that the MinHash implementation is right. You should try your versions of how the algorithms should work.
On Mon, Jun 18, 2012 at 8:10 AM, chsz wu <[email protected]> wrote: > Sorry, forget about my previous comment about 2. > ----- > For general purpose, sometime we might need to hash on value, so we might > need an extra parameter , hashOnKey ? > > ------ > This is not correct, the hash key selection is based on mappers input values. > if the mapper input value is vector, then we need to select by vector key or > value (using value() or index()), not > by mapper key or mapper value. > > the extra parameter might look like hashKeyOnMapperKey (or > hashkeyOnMapperValue) > > > > Sam > > > ________________________________ > From: sam wu <[email protected]> > To: [email protected] > Sent: Friday, June 15, 2012 5:38 PM > Subject: MinHash implementation thoughts > > I am playing with MinHashMapper, and have the following 2 penny thoughts > > 1. Efficiency > > in the map(..), around line 85 > ---------- > for (int i = 0; i < numHashFunctions; i++) { > for (Vector.Element ele : featureVector) { ----* > int value = (int) ele.get(); ----** > bytesToHash[0] = (byte) (value >> 24); > ..... > } > } > } > ----------- > the featureVector (which can be random sparse vector) iterator ---* goes > through all elements (including nonzero). > When I test ASF email program (from Grant's ...). Each email has ~30000 > elements. > By using iteratorNonZero(), > I reduce from 30000 to around 50 operations (both for converting from int > to byte[], and hash computation) for some email.??? > > 2. which field key to hash > we read from sequence file, so we have key and value. > If we read from tfidf doc sequenceFile, then > key will be termId, value will be tfidf value. > What to hash for MinHash ? > I think we need to hash on termId, instead of tfidf (meaningless for doc > MinHash ?) > so we need to get > value = ele.index() ---** > > When I test the current code, I get almost all value = 0 > > For general purpose, sometime we might need to hash on value, so we might > need an extra parameter , hashOnKey ? > > 3. How to compute custerID ? > > I thought LSH normally has r(rows in a band) and b (band). > numberHashFunction = r *b > > #clusterId = b > > so we can easily compute probability 1- exp( (1-exp(s,r)),b) > > for (int i = 0; i < b; i++) { > for (int j = 0; j < r; j++) { > clusterIdBuilder.append(minHashValues[i * b + j ]).append('-'); > } > .... > } > > the current implementation works in some way, but ?? > > > This is Friday late afternoon, > > Hope I am still making sense. > > > BR > > Sam -- Lance Norskog [email protected]
