Hi, I've noticed a problem in the non-Hadoop (Taste) version of the recommender package. The problem is in the AbstractSimilarity class (in package org.apache.mahout.cf.taste.impl.similarity).
This class is the base class for computing the similarity values between vectors of users or items. It assumes that the similarity between two vectors is computed using only the commonly rated items/users. Consider the following two vectors:

V1: <_, 3, 4, _, 2>
V2: <3, 5, _, 2, 4>

where "_" means no rating. For these two vectors, the cosine or Pearson similarity is computed on the following reduced vectors:

<3, 2>
<5, 4>

However, if the number of common ratings is small, the similarity value is very unreliable. This is indeed what happens if you run the code on the Movielens dataset and measure recall: the results are very bad.

There can be two solutions:

1. There should be a parameter n that sets the minimum number of common ratings needed to compute a similarity; with fewer than n common ratings, the system should return NaN.
2. The similarity should be computed using all the ratings. For the two vectors above, the cosine similarity should be (3*5+2*4)/(sqrt(3^2+4^2+2^2)*sqrt(3^2+5^2+2^2+4^2)), that is, the dot product over the common ratings divided by the product of the full vector norms.

Tevfik
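
To make the two proposals concrete, here is a minimal sketch in plain Java. It is not Mahout code and does not use the AbstractSimilarity API: the class name SimilaritySketch, the method names, the minCommon parameter (the "n" from solution 1), and the use of Double.NaN to mark a missing rating are all assumptions made for illustration.

```java
/** Hypothetical sketch, not part of Mahout. A missing rating is encoded as Double.NaN. */
final class SimilaritySketch {

    private SimilaritySketch() {}

    /**
     * Solution 1: cosine over the commonly rated positions only.
     * Returns NaN when fewer than minCommon positions are rated in both vectors.
     */
    static double cosineOnCommon(double[] a, double[] b, int minCommon) {
        checkSameLength(a, b);
        double dot = 0.0, sqA = 0.0, sqB = 0.0;
        int common = 0;
        for (int i = 0; i < a.length; i++) {
            if (Double.isNaN(a[i]) || Double.isNaN(b[i])) {
                continue; // skip positions not rated in both vectors
            }
            dot += a[i] * b[i];
            sqA += a[i] * a[i];
            sqB += b[i] * b[i];
            common++;
        }
        if (common < minCommon) {
            return Double.NaN; // too little overlap to trust the value
        }
        return ratio(dot, sqA, sqB);
    }

    /**
     * Solution 2: dot product over the common positions, but each norm
     * is taken over all ratings present in that vector.
     */
    static double cosineOnAll(double[] a, double[] b) {
        checkSameLength(a, b);
        double dot = 0.0, sqA = 0.0, sqB = 0.0;
        for (int i = 0; i < a.length; i++) {
            boolean hasA = !Double.isNaN(a[i]);
            boolean hasB = !Double.isNaN(b[i]);
            if (hasA) sqA += a[i] * a[i];
            if (hasB) sqB += b[i] * b[i];
            if (hasA && hasB) dot += a[i] * b[i];
        }
        return ratio(dot, sqA, sqB);
    }

    /** dot / (sqrt(sqA) * sqrt(sqB)), or NaN if either norm is zero. */
    private static double ratio(double dot, double sqA, double sqB) {
        double denom = Math.sqrt(sqA) * Math.sqrt(sqB);
        return denom == 0.0 ? Double.NaN : dot / denom;
    }

    private static void checkSameLength(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
    }
}
```

For V1 and V2 above, cosineOnCommon with minCommon = 2 gives 23/(sqrt(13)*sqrt(41)), about 0.996, while cosineOnAll gives 23/(sqrt(29)*sqrt(54)), about 0.581. With minCommon = 3 the first method returns NaN, since only two positions are rated in both vectors.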