Matthew, I was running into a similar issue with my data. I discussed it with Sean Owen offline, and his advice was, in a nutshell, to use the log-likelihood similarity metric. Since you describe your users as having only links, I assume you are not dealing with preference data. With boolean data, the log-likelihood metric works very well (at least in my case, where I am also dealing with very sparse data). How do your results look if you try the log-likelihood approach?
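To make the suggestion concrete, here is a rough sketch of what a log-likelihood similarity computes for boolean data. It is plain Python rather than Mahout code, and the function names are my own. The final mapping of the ratio into the range [0, 1) is, as far as I understand it, how Mahout's LogLikelihoodSimilarity scales the score, but treat that as an assumption rather than a reference.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is taken to be 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy: N*log(N) - sum(k*log(k)).
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    # 2x2 contingency table of user counts:
    #   k11 = users linked to both items
    #   k12 = users linked to item A only
    #   k21 = users linked to item B only
    #   k22 = users linked to neither
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def llr_similarity(users_a, users_b, total_users):
    # users_a / users_b are sets of user IDs linked to each item.
    both = len(users_a & users_b)
    only_a = len(users_a) - both
    only_b = len(users_b) - both
    neither = total_users - both - only_a - only_b
    llr = log_likelihood_ratio(both, only_a, only_b, neither)
    # Assumed scaling into [0, 1); larger means stronger association.
    return 1.0 - 1.0 / (1.0 + llr)
```

The useful property is that the score depends on how surprising the overlap is given how popular each item is, not on preference values. Two items that each reach half of all users and overlap exactly as much as chance predicts score about 0, while two rare items seen by the same few users score close to 1.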
Hope this helps,
Chris

On Feb 16, 2011, at 2:24 PM, Matthew Runo wrote:

> Hello folks -
>
> (I think that) I'm running into an issue with my user data being too
> sparse with my item-item similarity calculations. A typical item_id in
> my data might have about 2000 links to other items, but very few
> "combinations" of users have viewed the same products.
>
> For example we have two items, 1244 and 2319 - and there are only
> three users in common between them.
>
> So, there's only those three users who viewed both items. I'm
> assigning preferences to different types of actions in my data.. and
> since all three users did the same action towards the item, they have
> the same preference value. Maybe I just need to start with a bigger
> set of data to get more links between items in different "actions" in
> order to spread out the generated similarities? I'm using the
> EuclideanDistanceSimilarity to do the final computation.
>
> I think this is leading to a huge number of "1" values being returned.
> Nearly 72% of my item-item similarities are 1.0. I feel that this is
> invalid, but I'm not quite sure of the best way to attack it.
>
> There are some similarities of 1 where the items do not appear to be
> similar at all, and the best I've been able to come up with as to how
> the 1 came around was that there was only one user who had a link
> between them and so that one user.
>
> How many item-user-item combinations per item pair does it take to get
> good output?
>
> Sorry if I'm not quite describing my problem in the proper terms..
>
> --Matthew Runo
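
On the large share of 1.0 values described in the quoted message: a small sketch shows why a Euclidean-style similarity gives 1.0 whenever the users two items have in common all carry the same preference value. This is not Mahout's EuclideanDistanceSimilarity; the 1 / (1 + distance) mapping is one common choice that I am assuming for illustration, and the preference values and user IDs below are made up. Only the item IDs 1244 and 2319 come from the message.

```python
import math


def euclidean_similarity(prefs_a, prefs_b):
    # prefs_a / prefs_b map user ID -> preference value for one item.
    # Only users who have a preference for BOTH items are compared.
    common = prefs_a.keys() & prefs_b.keys()
    if not common:
        return float("nan")  # no overlap, similarity undefined
    dist = math.sqrt(sum((prefs_a[u] - prefs_b[u]) ** 2 for u in common))
    # Assumed mapping: distance 0 -> 1.0, larger distance -> closer to 0.
    return 1.0 / (1.0 + dist)


# Hypothetical data: three shared users, all with the same action value.
item_1244 = {"u1": 3.0, "u2": 3.0, "u3": 3.0, "u4": 1.0}
item_2319 = {"u1": 3.0, "u2": 3.0, "u3": 3.0, "u9": 5.0}

similarity = euclidean_similarity(item_1244, item_2319)
```

Because every shared user has identical values for both items, the distance is 0 and the similarity is exactly 1.0, no matter how few users overlap. A single shared user produces the same result, which would explain the unrelated item pairs scoring 1.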
