Hello folks - I think I'm running into an issue with my user data being too sparse for my item-item similarity calculations. A typical item_id in my data has about 2000 links to other items, but for any given pair of items, very few users have viewed both products.
For example, take two items, 1244 and 2319: there are only three users in common between them, i.e. only three users who viewed both items. I'm assigning preference values to the different types of actions in my data, and since all three users performed the same action on both items, they all have the same preference value. Maybe I just need to start with a bigger data set, to get more links between items across different actions and spread out the generated similarities?

I'm using EuclideanDistanceSimilarity for the final computation, and I think this is leading to a huge number of "1" values being returned: nearly 72% of my item-item similarities are 1.0. I feel that this is invalid, but I'm not quite sure of the best way to attack it. There are similarities of 1 where the items do not appear to be similar at all, and the best explanation I've come up with for how the 1 arises is that only one user links the two items, and since that single user has the same preference value on both, the Euclidean distance over the overlap is 0, which maps to a similarity of 1.

How many item-user-item combinations per item pair does it take to get good output? Sorry if I'm not quite describing my problem in the proper terms.

--Matthew Runo
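PS - in case it helps, here's a tiny self-contained program that I believe reproduces the behavior using Mahout's in-memory GenericDataModel. The class name, user IDs, and the 3.0 preference value are just placeholders standing in for my real data, where every overlapping user got the same preference from the same action:

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class SameActionSimilarityDemo {
  public static void main(String[] args) throws TasteException {
    FastByIDMap<PreferenceArray> prefs = new FastByIDMap<PreferenceArray>();
    // Three users, each performing the same action (mapped to
    // preference 3.0) on both item 1244 and item 2319.
    for (long userID = 1L; userID <= 3L; userID++) {
      PreferenceArray user = new GenericUserPreferenceArray(2);
      user.setUserID(0, userID); // sets the user ID for all entries
      user.setItemID(0, 1244L);
      user.setValue(0, 3.0f);
      user.setItemID(1, 2319L);
      user.setValue(1, 3.0f);
      prefs.put(userID, user);
    }
    DataModel model = new GenericDataModel(prefs);
    ItemSimilarity similarity = new EuclideanDistanceSimilarity(model);
    // Every overlapping user has identical values on the two items, so
    // the Euclidean distance between the items is 0 and this prints 1.0.
    System.out.println(similarity.itemSimilarity(1244L, 2319L));
  }
}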
