OK, I will try to implement a non-distributed one. Actually, I have a Python version now.
But I have a problem. In min-hash, every entry of the matrix should be either 1 or 0 before the hash functions are applied. So what about rating data? If the matrix is filled with ratings from 1 to 5, should we pick a threshold and convert a rating to 1 if it is above that threshold?

This is the reference I read about LSH. Check it out (chapter 3):
http://infolab.stanford.edu/~ullman/mmds.html

On Wed, Apr 13, 2011 at 3:25 PM, Ted Dunning <[email protected]> wrote:

> Sure.
>
> LSH is a fine candidate for parallelism and scaling.
>
> I would recommend starting small and testing as you go rather than leaping
> into a parallelized full-fledged implementation. Look for other open-source
> implementations of LSH algorithms.
>
> Be warned that the parameter selection for LSH can be pretty tricky (so I
> hear, anyway). You should pick a reasonable and realistic test problem so
> that you can experiment with that.
>
>
> On Wed, Apr 13, 2011 at 12:19 AM, ke xie <[email protected]> wrote:
>
>> Can we implement one and contribute into the mahout project? Any
>> suggestions?
>>
>

--
Name: Ke Xie Eddy
Research Group of Information Retrieval
State Key Laboratory of Intelligent Technology and Systems
Tsinghua University
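
To make the thresholding question above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the Python version mentioned above and not anything from Mahout. The assumptions: ratings are stored as {user: {item_id: rating}} with integer item IDs, a rating counts as 1 only if it is strictly greater than a threshold of 3, and the hash functions are the common (a*x + b) mod p family. All function names are hypothetical.

```python
import random

def binarize(ratings, threshold=3):
    """Turn {user: {item_id: rating}} into {user: set of item_ids rated above threshold}.

    The threshold value and the strict '>' comparison are assumptions.
    """
    return {user: {item for item, r in items.items() if r > threshold}
            for user, items in ratings.items()}

def make_hash_funcs(num_hashes, prime=2_147_483_647, seed=0):
    """Build (a, b, p) triples for hash functions h(x) = (a*x + b) mod p."""
    rng = random.Random(seed)
    return [(rng.randrange(1, prime), rng.randrange(0, prime), prime)
            for _ in range(num_hashes)]

def minhash_signature(item_ids, hash_funcs):
    """Min-hash signature of a non-empty set of integer item IDs."""
    return [min((a * x + b) % p for x in item_ids) for a, b, p in hash_funcs]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

if __name__ == "__main__":
    # Made-up toy ratings, for illustration only.
    ratings = {
        "u1": {1: 5, 2: 4, 3: 1, 4: 5},
        "u2": {1: 4, 2: 5, 3: 2, 4: 4},
        "u3": {3: 5, 5: 5, 6: 4},
    }
    liked = binarize(ratings)
    funcs = make_hash_funcs(50)
    sigs = {u: minhash_signature(items, funcs) for u, items in liked.items()}
    print(estimate_jaccard(sigs["u1"], sigs["u2"]))
    print(estimate_jaccard(sigs["u1"], sigs["u3"]))
```

Note that binarizing this way throws away the difference between a low rating and no rating at all, so the choice of threshold is itself a parameter worth testing on a small, realistic data set.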
