You can do LSH on real-valued vectors - the 1's and 0's are just the +/- signs of projections onto randomly chosen hyperplanes.
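A minimal sketch of that sign-of-projection scheme in Python (the names, dimensions, and seed here are illustrative, not taken from any Mahout code):

```python
# Random-hyperplane LSH (signed random projections): each bit of the
# signature is the sign of the vector's projection onto a random hyperplane.
# Vectors with small cosine distance tend to get similar signatures.
import numpy as np

def lsh_signature(vec, hyperplanes):
    """Return a 0/1 signature: 1 where the projection is positive."""
    return (hyperplanes @ vec > 0).astype(int)

rng = np.random.default_rng(42)
dim, n_bits = 8, 16
planes = rng.standard_normal((n_bits, dim))  # random hyperplane normals

a = rng.standard_normal(dim)
b = a + 0.05 * rng.standard_normal(dim)      # near-duplicate of a
c = -a                                       # exactly opposite direction

sig_a, sig_b, sig_c = (lsh_signature(v, planes) for v in (a, b, c))
# Hamming distance between signatures approximates the angle between vectors:
print((sig_a != sig_b).sum())  # typically small: a and b are nearly parallel
print((sig_a != sig_c).sum())  # all bits flip: c's projections are negated
```

Since `c = -a`, every projection changes sign and the signatures differ in all bits, while the near-duplicate `b` keeps almost all of `a`'s bits.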
Ullman's book is a great reference for this, and it also covers how to choose the parameters.

On Wed, Apr 13, 2011 at 12:43 AM, ke xie <[email protected]> wrote:
> Ok, I will try to implement a non-distributed one. Actually I have a
> python version now.
>
> But I have a problem. When doing min-hash, the matrix should be either
> 1 or 0 before applying the hash functions. Then what about rating data?
> If the matrix is filled with numbers from 1 to 5, should we convert them
> using some threshold, setting a rating to 1 if it is above the threshold?
>
> This is the reference I read about LSH. Check it out (chapter 3):
> http://infolab.stanford.edu/~ullman/mmds.html
>
> On Wed, Apr 13, 2011 at 3:25 PM, Ted Dunning <[email protected]> wrote:
>
> > Sure.
> >
> > LSH is a fine candidate for parallelism and scaling.
> >
> > I would recommend starting small and testing as you go rather than
> > leaping into a parallelized full-fledged implementation. Look for other
> > open-source implementations of LSH algorithms.
> >
> > Be warned that parameter selection for LSH can be pretty tricky (so I
> > hear, anyway). You should pick a reasonable and realistic test problem
> > so that you can experiment with it.
> >
> >
> > On Wed, Apr 13, 2011 at 12:19 AM, ke xie <[email protected]> wrote:
> >
> >> Can we implement one and contribute it to the Mahout project? Any
> >> suggestions?
> >>
>
>
> --
> Name: Ke Xie Eddy
> Research Group of Information Retrieval
> State Key Laboratory of Intelligent Technology and Systems
> Tsinghua University
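The thresholding idea asked about above can be sketched as follows; the threshold value (ratings of 4 or more count as 1) and the salted-hash simulation of random permutations are illustrative assumptions, not a prescribed choice:

```python
# Sketch: binarize 1-5 ratings with a threshold, then min-hash the
# resulting item sets. The threshold (>= 4 means "liked") is an assumption.
import random

def binarize(ratings, threshold=4):
    """Keep only the items rated at or above the threshold."""
    return {item for item, r in ratings.items() if r >= threshold}

def minhash_signature(items, n_hashes=50, seed=0):
    """One min-hash per 'permutation', simulated with salted hash functions."""
    rnd = random.Random(seed)
    salts = [rnd.getrandbits(32) for _ in range(n_hashes)]
    return [min(hash((salt, it)) for it in items) for salt in salts]

u1 = {"a": 5, "b": 4, "c": 1, "d": 5}
u2 = {"a": 5, "b": 5, "c": 2, "e": 1}
s1, s2 = (minhash_signature(binarize(u)) for u in (u1, u2))
# The fraction of matching min-hashes estimates the Jaccard similarity of
# the binarized item sets ({a, b, d} vs {a, b} here: true Jaccard = 2/3).
est = sum(x == y for x, y in zip(s1, s2)) / len(s1)
print(round(est, 2))
```

Note that thresholding throws away the rating magnitudes, which is exactly why the sign-of-projection LSH above is the more natural fit for real-valued data.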
