One more thought.
Cosine similarity kinds of measure the ratio of different feature
preference. In recommendation job, I think  ratio of feature preference is
more relevant than the score itself ( kind reducing bias impact, some
people rank score higher,..)

Sam

On Fri, Jul 6, 2012 at 9:01 AM, sam wu <[email protected]> wrote:

> LSH has many different flavors (based on the different similarity metric).
> Normally Minhash, which is good for if you have boolean (yes-no, 0-1)
> features, and in the case of k-shingle, it fits well.
> In the latent topcis model, like ALS, the feature is no longer 0-1. I
> think Random Hyperplane (cosine-similarity for LSH) will be better.
>
> Another thought for finding NN, is to steal some idea from Ted's previous
> "K-Means Cluster at Scale", projection search for nearest cluster ( how to
> efficient to find k-NN centroids for a new vector). One TreeSet per feature
> with HeadSet & TailSet. Not sure will this scale to hugh data ?
> BTW, I recalled this streaming K-Means will be rolled into Mahout 0.8, but
> I didn't find it. is this in the pipeline ?
>
> Sam
>
>
> On Fri, Jul 6, 2012 at 3:18 AM, Jens Grivolla <[email protected]> wrote:
>
>> Maybe locality-sensitive hashing can help to get candidates before
>> calculating the exact distance?
>>
>> Bye,
>> Jens
>>
>>
>> On 07/06/2012 11:35 AM, Sean Owen wrote:
>>
>>> Here's one I've been puzzling over for a bit. In a factorization based
>>> on the SVD or what have you, you reconstruct the approximate original
>>> matrix (well, one row) by multiplying the matrices back together and
>>> looking for the largest elements. This is essentially multiplying a
>>> user feature vector by the entire item-feature matrix to reconstruct
>>> one approximate row of the input.
>>>
>>> That's not necessarily so slow, but it's not the fastest thing. I want
>>> to speed it up. It seems like there ought to be some shortcut, even if
>>> it means a probabilistic approach that could get it slightly wrong at
>>> times.
>>>
>>> I say so because most item feature vectors are nowhere near the user
>>> feature vector in feature space. Their dot product is not going to be
>>> the largest. In fact, given the user feature vector you can say
>>> exactly where in feature space (which direction) you want to look for
>>> the top items. For example, if the user feature vector is (2,1) you
>>> are looking for item vector (x,y) that maximizes 2x+y and that is
>>> largest in the direction of (2,1).
>>>
>>> When feature space is 50+-dimensional though, I'm having a hard time
>>> thinking of an efficient way to index those item feature vectors such
>>> that one could quickly find a few buckets of items to check and with
>>> high confidence have found the best recommendations. The bases I have
>>> are not necessarily orthogonal let alone orthonormal either. I bet,
>>> hope, someone will have an insight?
>>>
>>> You could cluster the items with k-means, quickly, I suppose. I was
>>> hoping for something less heavy.
>>>
>>> Sean
>>>
>>>
>>
>>
>>
>

Reply via email to