It is critical to use randomized projections here in order to get the
dimension independent characteristics.

On Fri, Jul 6, 2012 at 11:32 AM, Sean Owen <[email protected]> wrote:

> LSH is probably my ticket, thanks all. I tried a form of this, but
> just used the basis of the feature space to define the hyperplanes
> because I was lazy and experimenting. I didn't work well in the sense
> that the best recommendations were not hashed together unless you had
> fairly few buckets (i.e., not much speedup). But -- I imagine I did
> something wrong.
>
> On Fri, Jul 6, 2012 at 7:01 PM, sam wu <[email protected]> wrote:
> > LSH has many different flavors (based on the different similarity
> metric).
> > Normally Minhash, which is good for if you have boolean (yes-no, 0-1)
> > features, and in the case of k-shingle, it fits well.
> > In the latent topcis model, like ALS, the feature is no longer 0-1. I
> think
> > Random Hyperplane (cosine-similarity for LSH) will be better.
> >
> > Another thought for finding NN, is to steal some idea from Ted's previous
> > "K-Means Cluster at Scale", projection search for nearest cluster ( how
> to
> > efficient to find k-NN centroids for a new vector). One TreeSet per
> feature
> > with HeadSet & TailSet. Not sure will this scale to hugh data ?
> > BTW, I recalled this streaming K-Means will be rolled into Mahout 0.8,
> but
> > I didn't find it. is this in the pipeline ?
> >
> > Sam
> >
> >
> > On Fri, Jul 6, 2012 at 3:18 AM, Jens Grivolla <[email protected]>
> wrote:
> >
> >> Maybe locality-sensitive hashing can help to get candidates before
> >> calculating the exact distance?
> >>
> >> Bye,
> >> Jens
> >>
> >>
> >> On 07/06/2012 11:35 AM, Sean Owen wrote:
> >>
> >>> Here's one I've been puzzling over for a bit. In a factorization based
> >>> on the SVD or what have you, you reconstruct the approximate original
> >>> matrix (well, one row) by multiplying the matrices back together and
> >>> looking for the largest elements. This is essentially multiplying a
> >>> user feature vector by the entire item-feature matrix to reconstruct
> >>> one approximate row of the input.
> >>>
> >>> That's not necessarily so slow, but it's not the fastest thing. I want
> >>> to speed it up. It seems like there ought to be some shortcut, even if
> >>> it means a probabilistic approach that could get it slightly wrong at
> >>> times.
> >>>
> >>> I say so because most item feature vectors are nowhere near the user
> >>> feature vector in feature space. Their dot product is not going to be
> >>> the largest. In fact, given the user feature vector you can say
> >>> exactly where in feature space (which direction) you want to look for
> >>> the top items. For example, if the user feature vector is (2,1) you
> >>> are looking for item vector (x,y) that maximizes 2x+y and that is
> >>> largest in the direction of (2,1).
> >>>
> >>> When feature space is 50+-dimensional though, I'm having a hard time
> >>> thinking of an efficient way to index those item feature vectors such
> >>> that one could quickly find a few buckets of items to check and with
> >>> high confidence have found the best recommendations. The bases I have
> >>> are not necessarily orthogonal let alone orthonormal either. I bet,
> >>> hope, someone will have an insight?
> >>>
> >>> You could cluster the items with k-means, quickly, I suppose. I was
> >>> hoping for something less heavy.
> >>>
> >>> Sean
> >>>
> >>>
> >>
> >>
> >>
>

Reply via email to