Sean,

Thanks for your quick reply. Switching to a Jaccard coefficient based ItemSimilarity already improved things tremendously.

> You can change the estimation to account for "certainty" in some way.
> For example, you could divide the estimate by the weighted standard
> deviation of that series that was averaged to make the estimate. The
> result is no longer an estimate of a rating, but is probably going to
> give much more sane results.

In order to do this, which part of the code should I change? Is it
https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/recommender/Recommender.html#estimatePreference%28long,%20long%29 ?
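[Editor's note: the adjustment Sean describes can be sketched independently of Mahout. This is a minimal illustration, not Mahout API; the class and method names are hypothetical, and the `1.0 +` in the denominator is an assumption added to avoid dividing by zero when the averaged series has zero variance (e.g. a single rating).]

```java
// Sketch of a "certainty"-adjusted estimate: take the weighted average of
// a series of ratings, then divide it by (1 + weighted standard deviation)
// so that estimates built from inconsistent ratings score lower.
// Weights would typically be item-item similarities.
public class CertaintyAdjustedEstimate {

  static double weightedAverage(double[] ratings, double[] weights) {
    double num = 0.0;
    double den = 0.0;
    for (int i = 0; i < ratings.length; i++) {
      num += weights[i] * ratings[i];
      den += weights[i];
    }
    return num / den;
  }

  static double weightedStdDev(double[] ratings, double[] weights) {
    double mean = weightedAverage(ratings, weights);
    double num = 0.0;
    double den = 0.0;
    for (int i = 0; i < ratings.length; i++) {
      double d = ratings[i] - mean;
      num += weights[i] * d * d;
      den += weights[i];
    }
    return Math.sqrt(num / den);
  }

  // No longer an estimated rating, just a ranking score: a 5.0 average over
  // ratings that all agree keeps its value, while a noisy series is damped.
  static double adjustedEstimate(double[] ratings, double[] weights) {
    return weightedAverage(ratings, weights)
        / (1.0 + weightedStdDev(ratings, weights));
  }
}
```

Note this only penalizes *inconsistency*, not the *count* of ratings averaged; adding a count-based shrinkage term would be a further variation on the same idea.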
________________________________
From: Sean Owen <[email protected]>
To: [email protected]; a a <[email protected]>
Sent: Friday, July 13, 2012 1:35 PM
Subject: Re: item-based recommendation with custom similarity

This is too much code to ask people to debug in detail, but I get the gist of it. I am guessing that this is happening: the 2 War movies were rated 5.0, and were only tagged War. This means that any other movie tagged only War is estimated to be 5.0, given this similarity definition. And then that's hard for anything else to beat.

You could say the problem is that a simple weighted average doesn't account for the number of items that were averaged. An average of 5.0 over 1-2 items is far less meaningful than an average of 5.0 over 100 items. This isn't normally much of an issue when users have rated a decent number of items, and when items have nonzero similarity to most or all others. Here, most item-item pairs have no similarity.

You can change the estimation to account for "certainty" in some way. For example, you could divide the estimate by the weighted standard deviation of the series that was averaged to make the estimate. The result is no longer an estimate of a rating, but is probably going to give much more sane results.

While it wouldn't really solve the problem by itself, I would also recommend you change the similarity to a simple Jaccard coefficient computed from genres: just the intersection size divided by the union size. You're doing something like that already; this is just its logical conclusion.

On Fri, Jul 13, 2012 at 2:01 PM, a a <[email protected]> wrote:
> Hello,
>
> I am trying to implement an item-based recommender with a custom
> ItemSimilarity.
> I've used the MovieLens data for the test, and the item similarity uses the
> movie genre to create the similarity value.
>
> I've followed the advice in the book and wrote a very simple app to see it in action.
>
> When I run the code, the results that I get back do not make a lot of sense.
>
> For example, below are the recommendations I get for user 1:
> 1450 : 5.0 -> '1450 Prisoner of the Mountains, 1996, War'
> 1289 : 5.0 -> '1289 Koyaanisqatsi, 1983, Documentary War'
> 760 : 5.0 -> '760 Stalingrad, 1993, War'
> 632 : 5.0 -> '632 Land and Freedom, 1995, War'
> 665 : 5.0 -> '665 Underground, 1995, War'
> Movies
> Watched:53 > Crime:2, Adventure:5, Action:5, War:2, Fantasy:3, Romance:6, Animation:39, Children's:20, Sci-Fi:3, Musical:14, Comedy:14, Thriller:3
>
> In the last line of the log we see that the user has watched a lot of movies
> with genres Animation, Children's, and Musical, yet the recommendations are all from
> the genre War.
> I've repeated the test for many different users, and all the recommendations
> that I got were out of line with the user history.
>
> Can anyone tell me what I'm doing wrong?
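[Editor's note: the Jaccard coefficient Sean suggests above, intersection size over union size of the two genre sets, can be sketched in a few lines of plain Java. This is an illustration only; `GenreJaccard` and its method are hypothetical names, not the Mahout `ItemSimilarity` interface, though the value returned is in [0, 1] and so fits Mahout's expected similarity range.]

```java
import java.util.HashSet;
import java.util.Set;

// Jaccard coefficient over two genre sets: |A intersect B| / |A union B|.
// Two movies sharing all genres score 1.0; sharing none scores 0.0.
public class GenreJaccard {

  static double jaccard(Set<String> a, Set<String> b) {
    if (a.isEmpty() && b.isEmpty()) {
      return 0.0; // convention chosen here: no genres at all means no similarity
    }
    Set<String> intersection = new HashSet<>(a); // copy so inputs are untouched
    intersection.retainAll(b);
    Set<String> union = new HashSet<>(a);
    union.addAll(b);
    return (double) intersection.size() / union.size();
  }
}
```

With this definition, a movie tagged only War is no longer maximally similar to every other War-tagged movie regardless of that movie's other genres, which is exactly the failure mode in the log above.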
