I have thought about this problem before and have read several posts about
it. Sean Owen is right that the math doesn't care what the things are. In
practice, though, I think a better approach is to compute a separate
similarity for each kind of data and then combine those individual
similarities into a final one.
That is, for users with two different kinds of data, you first derive two
similarities, one from the Amazon data and one from the YouTube video data,
and then take a weighted sum of the two to get the final user-user
similarity matrix.
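To make the idea concrete, here is a rough sketch in plain Python rather than
Mahout code. It assumes each data source is just a set of item IDs per user,
and it uses the Tanimoto coefficient as the per-source similarity. The names
(tanimoto, combined_similarity, most_similar), the toy data, and the weights
are all made up for illustration; the weights are something you would have to
tune yourself.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient between two sets of item IDs."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def combined_similarity(user_a, user_b, sources, weights):
    """Weighted sum of the per-source similarities between two users.

    sources: {source_name: {user_id: set_of_item_ids}}
    weights: {source_name: weight}, assumed here to sum to 1.
    """
    return sum(
        weights[name]
        * tanimoto(prefs.get(user_a, set()), prefs.get(user_b, set()))
        for name, prefs in sources.items()
    )


def most_similar(user, sources, weights, n=10):
    """Return the n users most similar to `user` as (user_id, score) pairs."""
    all_users = set().union(*(prefs.keys() for prefs in sources.values()))
    scored = [
        (combined_similarity(user, other, sources, weights), other)
        for other in all_users
        if other != user
    ]
    # Highest score first; ties broken by user ID for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(other, score) for score, other in scored[:n]]


# Hypothetical toy data: books bought and videos watched per user.
sources = {
    "books": {"u1": {"b1", "b2", "b3"}, "u2": {"b2", "b3"}, "u3": {"b9"}},
    "videos": {"u1": {"v1", "v2"}, "u2": {"v5"}, "u3": {"v1", "v2"}},
}

equal = most_similar("u1", sources, {"books": 0.5, "videos": 0.5})
books_heavy = most_similar("u1", sources, {"books": 0.7, "videos": 0.3})
```

With equal weights, u3 (same videos as u1) ranks above u2 (overlapping
books); with the books-heavy weights the order flips. That is the kind of
control you get from keeping the two similarities separate.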
LinkedIn's example:
http://www.quora.com/How-does-LinkedIns-recommendation-system-work
When describing how they compute the similarity of people's LinkedIn
profiles, the speaker said:
"Here in order to compute overall of similarity between me and Adil, we are
first computing similarity between our specialties, our skills, our titles and
other attribute."
and "Now we somehow need to combine the similarity score in the vector to a
single number". There are some pictures in the post that can help you
understand it.
Does anyone agree with this approach?
Thanks!
zhongliang
At 2012-07-04 15:42:16,"Sean Owen" <[email protected]> wrote:
>The best default answer is to put them all in one model. The math
>doesn't care what the things are. Unless you have a strong reason to
>weight one data set I wouldn't. If you do, then two models is best. It
>is hard to weight a subset of the data within most similarity
>functions. I don't think it would in Pearson for instance but could
>work in Tanimoto.
>
>On Wed, Jul 4, 2012 at 1:20 AM, Ken Krugler <[email protected]>
>wrote:
>> Hi all,
>>
>> I'm curious what approaches are recommended for generating user-user
>> similarity, when I've got two (or more) distinct types of item data, both of
>> which are fairly large.
>>
>> E.g. let's say I had a set of users where I knew both (a) what books they
>> had bought on Amazon, and (b) what YouTube videos they had watched.
>>
>> For each user, I want to find the 10 most similar other users.
>>
>> - I could create two separate models, find the nearest 30 users for each
>> user, and combine (maybe with weighting) the results.
>> - I could toss all of the data into one model - and I could use a value of
>> < 1.0 for whichever type of preference is less important.
>>
>> Any other suggestions? Input on the above two approaches?
>>
>> Thanks!
>>
>> -- Ken
>>
>> --------------------------
>> Ken Krugler
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Mahout & Solr
>>
>>
>>
>>