LSH has many different flavors (based on the different similarity metric). Normally Minhash, which is good for if you have boolean (yes-no, 0-1) features, and in the case of k-shingle, it fits well. In the latent topcis model, like ALS, the feature is no longer 0-1. I think Random Hyperplane (cosine-similarity for LSH) will be better.
Another thought for finding NN, is to steal some idea from Ted's previous "K-Means Cluster at Scale", projection search for nearest cluster ( how to efficient to find k-NN centroids for a new vector). One TreeSet per feature with HeadSet & TailSet. Not sure will this scale to hugh data ? BTW, I recalled this streaming K-Means will be rolled into Mahout 0.8, but I didn't find it. is this in the pipeline ? Sam On Fri, Jul 6, 2012 at 3:18 AM, Jens Grivolla <[email protected]> wrote: > Maybe locality-sensitive hashing can help to get candidates before > calculating the exact distance? > > Bye, > Jens > > > On 07/06/2012 11:35 AM, Sean Owen wrote: > >> Here's one I've been puzzling over for a bit. In a factorization based >> on the SVD or what have you, you reconstruct the approximate original >> matrix (well, one row) by multiplying the matrices back together and >> looking for the largest elements. This is essentially multiplying a >> user feature vector by the entire item-feature matrix to reconstruct >> one approximate row of the input. >> >> That's not necessarily so slow, but it's not the fastest thing. I want >> to speed it up. It seems like there ought to be some shortcut, even if >> it means a probabilistic approach that could get it slightly wrong at >> times. >> >> I say so because most item feature vectors are nowhere near the user >> feature vector in feature space. Their dot product is not going to be >> the largest. In fact, given the user feature vector you can say >> exactly where in feature space (which direction) you want to look for >> the top items. For example, if the user feature vector is (2,1) you >> are looking for item vector (x,y) that maximizes 2x+y and that is >> largest in the direction of (2,1). >> >> When feature space is 50+-dimensional though, I'm having a hard time >> thinking of an efficient way to index those item feature vectors such >> that one could quickly find a few buckets of items to check and with >> high confidence have found the best recommendations. The bases I have >> are not necessarily orthogonal let alone orthonormal either. I bet, >> hope, someone will have an insight? >> >> You could cluster the items with k-means, quickly, I suppose. I was >> hoping for something less heavy. >> >> Sean >> >> > > >
