My first guess, not knowing all the details, is that the similarity metric is not reporting a similarity for many pairs of items, so the item-item connections are few and far between. This means that the estimated preference for a new item may be based on just 2 item-item similarities. Though the result is calculated as a weighted average, it's possible to hit on an item that happens to be similar to just 2 of the user's items; if those 2 items were both rated highly (10), the estimate comes out as 10 as well.
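To make the arithmetic concrete, here is a minimal Python sketch of a similarity-weighted average. It is not Mahout's actual code: the function name `estimate_preference`, the item IDs, and the similarity values are all made up for illustration, and treating NaN as "no known similarity" is an assumption of the sketch.

```python
import math


def estimate_preference(user_prefs, similarities):
    """Similarity-weighted average of a user's ratings (illustrative only).

    user_prefs:   dict of item_id -> rating given by the user
    similarities: dict of item_id -> similarity between the candidate item
                  and that rated item; a missing entry or NaN is treated as
                  "no known similarity" (an assumption of this sketch)
    """
    weighted_sum = 0.0
    total_similarity = 0.0
    for item_id, rating in user_prefs.items():
        sim = similarities.get(item_id, math.nan)
        if math.isnan(sim):
            continue  # skip items with no known similarity
        weighted_sum += sim * rating
        total_similarity += sim
    if total_similarity == 0.0:
        return math.nan  # nothing to base an estimate on
    return weighted_sum / total_similarity


# Hypothetical user: two items rated 10, three rated low.
user_prefs = {"A": 10.0, "B": 10.0, "C": 2.0, "D": 1.0, "E": 3.0}

# Three hypothetical candidates that are only connected to A and B.
candidate_x = {"A": 0.5, "B": 0.5}
candidate_y = {"A": 0.25, "B": 0.25}
candidate_z = {"A": 0.5, "B": 0.5, "C": 0.0}  # 0.0 adds no weight

print(estimate_preference(user_prefs, candidate_x))  # 10.0
print(estimate_preference(user_prefs, candidate_y))  # 10.0
print(estimate_preference(user_prefs, candidate_z))  # 10.0
```

In this sketch a similarity of 0.0 adds nothing to either the numerator or the denominator, so only the two nonzero neighbours decide the estimate, and every candidate connected to the same two items gets the same value.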
The fact that you're seeing so many identical estimates supports that conclusion. It's partly because, I suppose, your similarity metric returns just a few distinct values too, so the maths work out the same for many items.

While there are a number of ways you could attack and hack around the issue, my diagnosis, if I'm right, is that this is probably just not an effective content-based similarity metric for you. The maths aren't working out so well, and intuitively I don't think genre alone tells you much about movie similarity.

Tanimoto will return NaN for item similarity when there is no intersection or union, that is, when both sets are empty. But I agree there is a bit of an asymmetry here, since for users you get NaN when there is no intersection. While mathematically the result can be 0, in practice NaN is more consistent with how other similarity metrics think of "no relation at all".

On Fri, Dec 31, 2010 at 11:37 AM, Ahmet Arslan <[email protected]> wrote:
> Hello Mahout community,
>
> I have a custom ItemSimilarity.
>
> Items can have multiple genre info. And some items do not have genre info
> available. This custom similarity uses Jaccard coefficient over two genre
> sets. (A intersect B) / (A union B), so its range is 0 to 1.
>
> A is genre set of item1
> B is genre set of item2
>
> If one of the items does not have genre info, it returns 0.0.
>
> And user,item,pref triples contain pref values between 0.010416667 and 10.0
>
> I am using GenericItemBasedRecommender with getAllOtherItems method overridden.
>
> I have custom IDRescorer that filters out some items. It does not do
> rescoring.
>
> When I asked for recommendations for a user, I see that top 65 items have the
> same estimated preference. Even some results have perfect estimated pref
> value 10.
> Is this normal to have same estimated pref values for so many items?
>
> For example:
> top 63 has score of 4.015476
> 63 to 96 has score of 3.9160492
> 97 has 3.8527777
> 98 to 100 has 3.472611
>
> Another question is, TanimotoCoefficientSimilarity never returns Double.NaN
> for item similarity. However it does for user similarity. How to choose
> between 0.0 and Double.NaN? What will be the difference in terms of estimated
> pref?
>
> Thanks.
