Sean- http://www.lucidimagination.com/search/document/1d24eb86fcf3b8ab#b90e74b43cfb4c42 http://www.lucidimagination.com/search/document/ffc7e312b87088dc
On Tue, Jun 19, 2012 at 8:54 AM, chsz wu <[email protected]> wrote: > I think for minhash to be generic and efficient, 3 things needed to be done > at least, > 1. Prepare term-doc matrix to minimize hash computation (one more MR job). > Quite often we have doc-term data, for each term we have to averagely hash > average_doc_freq times, > With term-doc matrix, we reduce per term hash to one. > 2. Where is the hash key and docId > With Hadoop, mapper has input key and value. We need to map out Id and > hash_key. > To be general, we need to pass in a hashKeyMap function to extract id and > hashkey. > the combined clusterId(**_**_...) can be rehashed one more time to reduce > size > 3. Post processing, sort by similarity probability (numbers of hash_match), > kind of sorting common friends by the number of friends in common. > Another MR job. > > I'll try to look at how to make a patch, and do my best. > > Sam > > > > > > ________________________________ > From: Sean Owen <[email protected]> > To: [email protected] > Sent: Tuesday, June 19, 2012 2:20 AM > Subject: Re: MinHash implementation thoughts > > Lance, what are you referring to here? > Sam if you have some patch to suggest that would be good, open a JIRA issue. > > On Tue, Jun 19, 2012 at 10:05 AM, Lance Norskog <[email protected]> wrote: >> Some Mahout people are not sure that the MinHash implementation is >> right. You should try your versions of how the algorithms should work. >> >> On Mon, Jun 18, 2012 at 8:10 AM, chsz wu <[email protected]> wrote: >>> Sorry, forget about my previous comment about 2. >>> ----- >>> For general purpose, sometime we might need to hash on value, so we might >>> need an extra parameter , hashOnKey ? >>> >>> ------ >>> This is not correct, the hash key selection is based on mappers input >>> values. >>> if the mapper input value is vector, then we need to select by vector key >>> or value (using value() or index()), not >>> by mapper key or mapper value. >>> >>> the extra parameter might look like hashKeyOnMapperKey (or >>> hashkeyOnMapperValue) >>> >>> >>> >>> Sam >>> >>> >>> ________________________________ >>> From: sam wu <[email protected]> >>> To: [email protected] >>> Sent: Friday, June 15, 2012 5:38 PM >>> Subject: MinHash implementation thoughts >>> >>> I am playing with MinHashMapper, and have the following 2 penny thoughts >>> >>> 1. Efficiency >>> >>> in the map(..), around line 85 >>> ---------- >>> for (int i = 0; i < numHashFunctions; i++) { >>> for (Vector.Element ele : featureVector) { ----* >>> int value = (int) ele.get(); ----** >>> bytesToHash[0] = (byte) (value >> 24); >>> ..... >>> } >>> } >>> } >>> ----------- >>> the featureVector (which can be random sparse vector) iterator ---* goes >>> through all elements (including nonzero). >>> When I test ASF email program (from Grant's ...). Each email has ~30000 >>> elements. >>> By using iteratorNonZero(), >>> I reduce from 30000 to around 50 operations (both for converting from int >>> to byte[], and hash computation) for some email.??? >>> >>> 2. which field key to hash >>> we read from sequence file, so we have key and value. >>> If we read from tfidf doc sequenceFile, then >>> key will be termId, value will be tfidf value. >>> What to hash for MinHash ? >>> I think we need to hash on termId, instead of tfidf (meaningless for doc >>> MinHash ?) >>> so we need to get >>> value = ele.index() ---** >>> >>> When I test the current code, I get almost all value = 0 >>> >>> For general purpose, sometime we might need to hash on value, so we might >>> need an extra parameter , hashOnKey ? >>> >>> 3. How to compute custerID ? >>> >>> I thought LSH normally has r(rows in a band) and b (band). >>> numberHashFunction = r *b >>> >>> #clusterId = b >>> >>> so we can easily compute probability 1- exp( (1-exp(s,r)),b) >>> >>> for (int i = 0; i < b; i++) { >>> for (int j = 0; j < r; j++) { >>> clusterIdBuilder.append(minHashValues[i * b + j ]).append('-'); >>> } >>> .... >>> } >>> >>> the current implementation works in some way, but ?? >>> >>> >>> This is Friday late afternoon, >>> >>> Hope I am still making sense. >>> >>> >>> BR >>> >>> Sam >> >> >> >> -- >> Lance Norskog >> [email protected] -- Lance Norskog [email protected]
