It takes a truly gargantuan amount of data to justify map-reducing LSH. You can get very far with a plain single-machine implementation.
On Wed, Apr 13, 2011 at 5:57 AM, Sebastian Schelter <[email protected]> wrote:
> They are using PLSI, which we already tried to implement in
> https://issues.apache.org/jira/browse/MAHOUT-106. We didn't get it scalable.
> As far as I remember the paper, they are doing a nasty trick when sending
> data to the reducers in a certain step so that they only have to load a
> certain portion of the data into memory. I'm not sure this can be
> replicated in Hadoop (I would love to be proven wrong, though).
>
> They are also using LSH to cluster users by Jaccard coefficient; don't we
> already have code for this in org.apache.mahout.clustering.minhash?
>
> --sebastian
>
> On 13.04.2011 10:49, Sean Owen wrote:
>>
>> One of the three approaches that they combine is latent semantic indexing
>> -- that is what I was referring to.
>>
>> On Wed, Apr 13, 2011 at 8:33 AM, Ted Dunning <[email protected]> wrote:
>>
>>> Sean,
>>>
>>> Do you mean LSI (latent semantic indexing)? Or LSH (locality-sensitive
>>> hashing)?
>>>
>>> (Are you a victim of aggressive error correction?)
>>>
>>> (Or am I the victim of too little?)
>>>
>>>
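To make the "plain single-machine implementation" point concrete: MinHash-based LSH over Jaccard similarity really is small in code. The sketch below is a hypothetical illustration in Python, not the org.apache.mahout.clustering.minhash code; the function names, user ids, and the universal-hash parameters are all made up for the example.

```python
import random
from collections import defaultdict

PRIME = 2147483647  # Mersenne prime, modulus for the universal hash family

def minhash_signature(items, perms):
    # One min-hash per (a, b) "permutation"; items are integer ids.
    return tuple(min((a * x + b) % PRIME for x in items) for a, b in perms)

def estimate_jaccard(sig1, sig2):
    # The fraction of agreeing signature positions estimates the
    # Jaccard coefficient of the underlying sets.
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

def lsh_buckets(signatures, bands, rows):
    # Band the signatures: keys sharing any one band land in the same
    # bucket, yielding candidate pairs without an all-pairs comparison.
    buckets = defaultdict(set)
    for key, sig in signatures.items():
        for b in range(bands):
            band = sig[b * rows:(b + 1) * rows]
            buckets[(b, band)].add(key)
    return buckets

random.seed(42)
perms = [(random.randrange(1, PRIME), random.randrange(PRIME))
         for _ in range(128)]

# Toy "users" as sets of item ids (hypothetical data).
users = {
    "u1": set(range(0, 100)),
    "u2": set(range(20, 120)),   # true Jaccard with u1 = 80/120
    "u3": set(range(500, 600)),  # disjoint from u1
}
sigs = {u: minhash_signature(s, perms) for u, s in users.items()}
buckets = lsh_buckets(sigs, bands=32, rows=4)
```

With 32 bands of 4 rows, two sets at Jaccard 2/3 collide in at least one bucket with near certainty, while disjoint sets almost never do; that banding trade-off is what lets a single machine skip the quadratic comparison.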
