Hello folks - I think I'm running into an issue with my user data being too sparse for my item-item similarity calculations. A typical item_id in my data has about 2000 links to other items, but for any given pair of items, very few users have viewed both products.
For example, take two items, 1244 and 2319: there are only three users in common between them, i.e. only three users who viewed both items. I'm assigning preference values to the different types of actions in my data, and since all three users performed the same action on both items, they all have the same preference value. Maybe I just need to start with a bigger data set, to get more links between items across different actions and spread out the generated similarities?

I'm using EuclideanDistanceSimilarity for the final computation, and I think this is leading to a huge number of "1" values being returned: nearly 72% of my item-item similarities are 1.0. I feel that this is invalid, but I'm not quite sure of the best way to attack it. There are similarities of 1 where the items do not appear to be similar at all, and the best explanation I've come up with for how the 1 arises is that only one user links the two items, and since that single user has the same preference value on both, the Euclidean distance over the overlap is 0, which maps to a similarity of 1.

How many item-user-item combinations per item pair does it take to get good output? Sorry if I'm not quite describing my problem in the proper terms.

--Matthew Runo
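PS - in case it helps, here's a tiny self-contained program that I believe reproduces the behavior using Mahout's in-memory GenericDataModel. The class name, user IDs, and the 3.0 preference value are just placeholders standing in for my real data, where every overlapping user got the same preference from the same action:

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class SameActionSimilarityDemo {
  public static void main(String[] args) throws TasteException {
    FastByIDMap<PreferenceArray> prefs = new FastByIDMap<PreferenceArray>();
    // Three users, each performing the same action (mapped to
    // preference 3.0) on both item 1244 and item 2319.
    for (long userID = 1L; userID <= 3L; userID++) {
      PreferenceArray user = new GenericUserPreferenceArray(2);
      user.setUserID(0, userID); // sets the user ID for all entries
      user.setItemID(0, 1244L);
      user.setValue(0, 3.0f);
      user.setItemID(1, 2319L);
      user.setValue(1, 3.0f);
      prefs.put(userID, user);
    }
    DataModel model = new GenericDataModel(prefs);
    ItemSimilarity similarity = new EuclideanDistanceSimilarity(model);
    // Every overlapping user has identical values on the two items, so
    // the Euclidean distance between the items is 0 and this prints 1.0.
    System.out.println(similarity.itemSimilarity(1244L, 2319L));
  }
}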
