Hi Greg,

Do you see a difference in the actual similarity values that are computed?
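
One quick way to check: something along these lines (an untested sketch; the
data file name and item IDs are placeholders you'd have to adapt) will print
what the in-memory code computes for a few item pairs, which you can then diff
against the same pairs in the part-files written by ItemSimilarityJob:

import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class SimilaritySpotCheck {
  public static void main(String[] args) throws Exception {
    // same wiring as in your non-distributed run
    DataModel dataModel = new FileDataModel(new File("ml1m.csv")); // placeholder path
    ItemSimilarity similarity =
        new GenericItemSimilarity(new TanimotoCoefficientSimilarity(dataModel), dataModel);

    // a handful of item pairs (placeholder IDs); print the in-memory Tanimoto values
    long[][] pairs = { {1L, 2L}, {1L, 10L}, {2L, 10L} };
    for (long[] pair : pairs) {
      System.out.println(pair[0] + "\t" + pair[1] + "\t"
          + similarity.itemSimilarity(pair[0], pair[1]));
    }
  }
}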

--sebastian

On 24.11.2011 02:14, Greg H wrote:
> Hello,
> 
> I've been using Mahout's item-based recommender on several different
> implicit datasets. At first I computed the item-item similarities by simply
> passing an ItemSimilarity and a DataModel to the GenericItemSimilarity
> class, but lately I've been using ItemSimilarityJob to calculate them on a
> Hadoop cluster. However, I've found that there is a significant difference
> in the results of my experiments depending on which method I use to
> calculate the similarities.
> 
> For example, when I use the public MovieLens 1M dataset (which I've
> converted into an implicit dataset), calculating the similarities with:
> 
> ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel,
>     new GenericItemSimilarity(new TanimotoCoefficientSimilarity(dataModel), dataModel));
> 
> gives 0.25 precision when I split each user's data 80%/20% and look at
> only the top 5 recommended items. However, when I compute the similarities
> with ItemSimilarityJob using the following command:
> 
> hadoop jar mahout-core-0.6-SNAPSHOT-job.jar \
>   org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
>   -Dmapred.input.dir=ml1m -Dmapred.output.dir=ml1m-output \
>   --similarityClassname SIMILARITY_TANIMOTO_COEFFICIENT \
>   --maxSimilaritiesPerItem 10000 --maxPrefsPerUser 10000 --booleanData true
> 
> I only get 0.23 precision. The results differ even more significantly on
> some of the other datasets I've been working with. I know that
> ItemSimilarityJob prunes some items, so I've tried many different settings
> for maxSimilaritiesPerItem and maxPrefsPerUser; although this improves the
> results, it still never matches the non-distributed version. Shouldn't
> the results be the same no matter which version I use to calculate the
> similarities?
> 
> Thank you,
> Greg
> 
