I've been using the Mahout itemsimilarity job for a while, with good results. I 
read that the new spark-itemsimilarity job is typically faster, by a factor of 
10, so I wanted to give it a try. I must be doing something wrong because, on
the same EMR infrastructure and the same data, the Spark job is slower than the
old one (about 16 min vs 6 min). I took a small sample dataset (766k rating
pairs) to compare numbers. These are the results:

Input ratings: http://download.zaudera.com/public/ratings

Infrastructure: emr-4.7.2 (spark 1.6.2, mahout 0.12.2)

Old itemsimilarity:

$ mahout itemsimilarity --input ratings --output itemsimilarity \
    --booleanData TRUE --maxSimilaritiesPerItem 10 \
    --similarityClassname SIMILARITY_COOCCURRENCE
[5m54s]

(logs: http://download.zaudera.com/public/itemsimilarity.out)

New spark-itemsimilarity:

$ mahout spark-itemsimilarity --input ratings --output spark-itemsimilarity \
    --maxSimilaritiesPerItem 10 --master yarn-client
[15m51s]

(logs: http://download.zaudera.com/public/spark-itemsimilarity.out)

Any ideas? Thanks!
