So I've only processed a tiny fraction of my data with the LogLikelihoodSimilarity but already the output looks a lot better.
Do you think there's any benefit to storing things with small similarities? For example, would it make sense to just filter out things that are - say - less than 0.5? I would probably not recommend items that are so dissimilar.

-Matthew Runo

On Wed, Feb 16, 2011 at 2:39 PM, Matthew Runo <[email protected]> wrote:
> Thank you for that suggestion. I have a few different actions that
> users can do.. "view", "add to cart", and "buy" which I've assigned
> different preference values to. Perhaps it would be better to simply
> use boolean yes/no in my case?
>
> I'll give the log likelihood stuff a try tonight and I'll report back
> in case anyone else runs into this issue.
>
> -Matthew Runo
>
> On Wed, Feb 16, 2011 at 2:31 PM, Chris Schilling <[email protected]> wrote:
>> Mathew,
>>
>> I was running into a similar issue with my data. I discussed it with Sean
>> Owen offline and his advice was, in a nutshell, to use the log-likelihood
>> similarity metric. Since you describe your users as having only links, I
>> assume you are not dealing with preference data. So, with the boolean data,
>> the log-likelihood metric works very well (in my case, which I am also
>> dealing with very sparse data). How do your results look if you try the
>> likelihood approach?
>>
>> Hope this helps,
>> Chris
>>
>>
>> On Feb 16, 2011, at 2:24 PM, Matthew Runo wrote:
>>
>>> Hello folks -
>>>
>>> (I think that) I'm running into an issue with my user data being too
>>> sparse with my item-item similarity calculations. A typical item_id in
>>> my data might have about 2000 links to other items, but very few
>>> "combinations" of users have viewed the same products.
>>>
>>> For example we have two items, 1244 and 2319 - and there are only
>>> three users in common between them.
>>>
>>> So, there's only those three users who viewed both items. I'm
>>> assigning preferences to different types of actions in my data.. and
>>> since all three users did the same action towards the item, they have
>>> the same preference value. Maybe I just need to start with a bigger
>>> set of data to get more links between items in different "actions" in
>>> order to spread out the generated similarities? I'm using the
>>> EuclideanDistanceSimilarity to do the final computation.
>>>
>>> I think this is leading to a huge number of "1" values being returned.
>>> Nearly 72% of my item-item similarities are 1.0. I feel that this is
>>> invalid, but I'm not quite sure of the best way to attack it.
>>>
>>> There are some similarities of 1 where the items do not appear to be
>>> similar at all, and the best I've been able to come up with as to how
>>> the 1 came around was that there was only one user who had a link
>>> between them and so that one user.
>>>
>>> How many item-user-item combinations per item pair does it take to get
>>> good output?
>>>
>>> Sorry if I'm not quite describing my problem in the proper terms..
>>>
>>> --Matthew Runo
>>
>>
>
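
For anyone finding this thread later, here is a rough Python sketch (not Mahout code) of the two behaviours discussed above: why a distance-based similarity returns 1.0 when the few co-occurring users all have the same preference value, and how a log-likelihood ratio on boolean data takes the amount of overlap into account. Everything here is an assumption for illustration: the function names are made up, the Euclidean similarity is a simplified 1 / (1 + distance) form rather than Mahout's exact formula, and the 1 - 1 / (1 + LLR) mapping is my understanding of how the ratio is usually turned into a similarity. The last function shows the kind of threshold filter asked about at the top of the thread.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 contingency table.

    k11: users linked to both items, k12: only item A,
    k21: only item B, k22: neither item.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def llr_similarity(users_a, users_b, num_users):
    """Boolean-data similarity in [0, 1) from the sets of users per item."""
    both = len(users_a & users_b)
    only_a = len(users_a) - both
    only_b = len(users_b) - both
    neither = num_users - both - only_a - only_b
    llr = log_likelihood_ratio(both, only_a, only_b, neither)
    return 1.0 - 1.0 / (1.0 + llr)


def euclidean_similarity(prefs_a, prefs_b):
    """Simplified 1 / (1 + distance) over users who rated both items."""
    common = prefs_a.keys() & prefs_b.keys()
    if not common:
        return 0.0
    dist = math.sqrt(sum((prefs_a[u] - prefs_b[u]) ** 2 for u in common))
    return 1.0 / (1.0 + dist)


def filter_similarities(sims, threshold=0.5):
    """Keep only item pairs whose similarity meets the threshold."""
    return {pair: s for pair, s in sims.items() if s >= threshold}


if __name__ == "__main__":
    # Three shared users, all with the same (hypothetical) preference value:
    # the distance is zero, so the similarity is 1.0 regardless of how
    # little evidence there is.
    views = {101: 1.0, 102: 1.0, 103: 1.0}
    print(euclidean_similarity(views, dict(views)))

    # Same overlap of three users, but treated as boolean data against a
    # hypothetical population of 100000 users with 2000 viewers per item.
    item_a = set(range(0, 2000))
    item_b = set(range(1997, 3997))
    print(llr_similarity(item_a, item_b, 100000))
```

The threshold in filter_similarities is only a placeholder; whether 0.5 is a sensible cut-off depends on how the similarities are distributed in the real data.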
