Could this project be done with symbol sequences instead of hash codes? The advantage of symbol sequences is that you can unpack them.
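
To make the question concrete, here is a rough sketch of the idea, assuming integer item ids and simple linear hash functions. It is not the code from MAHOUT-344; every name in it (`make_hash_funcs`, `minhash_signature`, `cluster_keys`, `agreement`) is made up for illustration. Each cluster key is kept as a tuple of the p individual minhash values (a "symbol sequence") rather than one concatenated hash code, so it can still be unpacked afterwards. The linear hashes only approximate min-wise independent permutations, so the S(ui,uj)^p relationship holds only roughly here.

```python
import random

PRIME = 2_147_483_647  # modulus for the illustrative linear hash functions


def make_hash_funcs(num_funcs, seed=42):
    """Build num_funcs hash functions h(x) = (a*x + b) mod PRIME (an assumption)."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
              for _ in range(num_funcs)]
    return [lambda x, a=a, b=b: (a * x + b) % PRIME for a, b in params]


def minhash_signature(items, hash_funcs):
    """One minhash value per hash function: the minimum hash over the item set."""
    return [min(h(x) for x in items) for h in hash_funcs]


def cluster_keys(signature, key_groups):
    """Group consecutive minhash values into tuples of length key_groups.

    Keeping a tuple instead of a concatenated hash means each key can be
    unpacked back into its individual minhash values.
    """
    last_start = len(signature) - key_groups + 1
    return [tuple(signature[i:i + key_groups])
            for i in range(0, last_start, key_groups)]


def jaccard(a, b):
    """Jaccard coefficient of two sets."""
    return len(a & b) / len(a | b)


def agreement(a, b, hash_funcs, key_groups):
    """Fraction of cluster keys on which two users agree."""
    keys_a = cluster_keys(minhash_signature(a, hash_funcs), key_groups)
    keys_b = cluster_keys(minhash_signature(b, hash_funcs), key_groups)
    return sum(x == y for x, y in zip(keys_a, keys_b)) / len(keys_a)


# Two hypothetical users whose item sets overlap.
user_1 = set(range(0, 80))
user_2 = set(range(20, 100))
funcs = make_hash_funcs(120)

print("jaccard:", jaccard(user_1, user_2))
for p in (1, 2, 3):
    print("keyGroups =", p, "agreement:", agreement(user_1, user_2, funcs, p))
```

With larger key groups the fraction of shared keys can only stay the same or drop, which is the "more refined clusters" effect described in the thread below.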

On Tue, Nov 8, 2011 at 9:54 AM, Vishal Santoshi <[email protected]> wrote:

> Yep.
>
> By concatenating p hash-keys (generated from p functions) for each user,
> the probability that any 2 users will agree on a concatenated hash key is
> S(ui,uj)^p, which makes the clusters more refined.
> S(ui,uj) is the Jaccard coefficient (the similarity coefficient).
>
> On Tue, Nov 8, 2011 at 12:20 PM, Grant Ingersoll <[email protected]> wrote:
>
> > From MAHOUT-344 from the patch author:
> >
> > The idea behind keyGroups is to concatenate hashes from multiple hash
> > functions to reduce the probability of collision between 2 users that
> > agreed on 1 or more individual hash values. This essentially improves
> > the average similarity of users in a cluster.
> >
> > -Grant
> >
> > On Nov 7, 2011, at 8:54 PM, Suneel Marthi wrote:
> >
> > > Do we have an answer for this?
> > >
> > > Sent from my iPhone
> > >
> > > On Nov 2, 2011, at 7:20 AM, Grant Ingersoll <[email protected]> wrote:
> > >
> > >> What's the Minhash key groups value used for in the MinhashDriver? I
> > >> mean, I see it is used for building up the key out of the hashed
> > >> values, but what's the significance of different values for it? The
> > >> default is 2, what does it mean practically speaking if I choose,
> > >> say, 10? AFAICT, it would mean that I would have more clusters,
> > >> assuming that we still meet the minimum cluster size imposed by the
> > >> reducer?
> > >>
> > >> Thanks,
> > >> Grant

--
Lance Norskog
[email protected]
