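Yep. By concatenating p hash keys (one from each of p hash functions) for each user, the probability that any 2 users agree on the concatenated hash key becomes S(ui,uj)^p, which makes the clusters more refined. S(ui,uj) is the Jaccard coefficient (the similarity coefficient) between users ui and uj. Since S(ui,uj) <= 1, a larger p means fewer chance collisions, so you get more and smaller clusters whose members are more similar on average.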
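
To make the S(ui,uj)^p claim concrete, here is a minimal Python sketch that estimates how often two users end up with the same concatenated key. It is not Mahout's code: the names random_minhash, cluster_key and agreement_rate are made up for illustration, and each hash function is modelled as an ideal random ranking of the items (an assumption that gives exact min-wise independence), not as the hash functions MinhashDriver actually uses.

```python
import random


def jaccard(a, b):
    """Jaccard coefficient S(a, b) = |a & b| / |a | b|."""
    return len(a & b) / len(a | b)


def random_minhash(universe, rng):
    """Build one idealised min-hash function from a random ranking of the items.

    Assumption for illustration: a fresh random rank per item stands in for
    a real hash function.
    """
    rank = {item: rng.random() for item in universe}
    # The min-hash of a set is its lowest-ranked item.
    return lambda items: min(items, key=rank.__getitem__)


def cluster_key(items, hash_functions):
    """Concatenate one min-hash value per hash function into a single key."""
    return "-".join(str(h(items)) for h in hash_functions)


def agreement_rate(u, v, key_groups, trials=20000, seed=42):
    """Estimate how often u and v get the same concatenated key.

    key_groups plays the role of p: the number of hashes joined into one key.
    """
    rng = random.Random(seed)
    universe = sorted(u | v)
    hits = 0
    for _ in range(trials):
        fns = [random_minhash(universe, rng) for _ in range(key_groups)]
        if cluster_key(u, fns) == cluster_key(v, fns):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    u = set(range(1, 9))    # items seen by user ui
    v = set(range(5, 13))   # items seen by user uj
    s = jaccard(u, v)
    for p in (1, 2, 3):
        print(p, agreement_rate(u, v, p), s ** p)
```

Each printed row shows p, the estimated agreement rate and S^p side by side. Under the idealised assumption above the two should be close, and both shrink as p grows, which is why a higher keyGroups value splits users into more, tighter clusters.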

On Tue, Nov 8, 2011 at 12:20 PM, Grant Ingersoll <[email protected]> wrote:

> From MAHOUT-344 from the patch author:
>
> The idea behind keyGroups is to concatenate hashes from multiple hash
> functions reduce the probability of collision between 2 users that agreed
> on 1 or more individual hash values. This essentially improves the average
> similarity of users in a cluster.
>
> -Grant
>
> On Nov 7, 2011, at 8:54 PM, Suneel Marthi wrote:
>
> > Do we have an answer for this?
> >
> > Sent from my iPhone
> >
> > On Nov 2, 2011, at 7:20 AM, Grant Ingersoll <[email protected]> wrote:
> >
> >> What's the Minhash key groups value used for in the MinhashDriver? I
> >> mean, I see it is used for building up the key out of the hashed values,
> >> but what's the significance of different values for it? The default is 2,
> >> what does it mean practically speaking if I choose, say, 10? AFAICT, it
> >> would mean that I would have more clusters, assuming that we still meet
> >> the minimum cluster size imposed by the reducer?
> >>
> >> Thanks,
> >> Grant
