I've been using the Mahout itemsimilarity job for a while, with good results. I read that the new spark-itemsimilarity job is typically about 10 times faster, so I wanted to give it a try. I must be doing something wrong, because on the same EMR infrastructure and the same data the Spark job is slower than the old one (about 16 min vs. 6 min). I took a small sample dataset (766k rating pairs) to compare numbers. These are the results:
Input ratings: http://download.zaudera.com/public/ratings
Infrastructure: emr-4.7.2 (spark 1.6.2, mahout 0.12.2)

Old itemsimilarity:
$ mahout itemsimilarity --input ratings --output itemsimilarity --booleanData TRUE --maxSimilaritiesPerItem 10 --similarityClassname SIMILARITY_COOCCURRENCE
[5m54s] (logs: http://download.zaudera.com/public/itemsimilarity.out)

New spark-itemsimilarity:
$ mahout spark-itemsimilarity --input ratings --output spark-itemsimilarity --maxSimilaritiesPerItem 10 --master yarn-client
[15m51s] (logs: http://download.zaudera.com/public/spark-itemsimilarity.out)

Any ideas? Thanks!
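P.S. To put a number on the gap: a small sketch that converts the two wall-clock times quoted above into seconds and computes the slowdown. The helper name to_seconds and the variable names are only for illustration, and the script assumes durations written as <minutes>m<seconds>s.

```shell
#!/bin/sh
# Convert a duration such as 5m54s into seconds.
# Hypothetical helper; assumes the <minutes>m<seconds>s form used above.
to_seconds() {
    minutes=${1%%m*}     # text before the first "m"
    rest=${1#*m}         # text after the first "m"
    seconds=${rest%s}    # drop the trailing "s"
    echo $(( minutes * 60 + seconds ))
}

old=$(to_seconds 5m54s)     # itemsimilarity run
new=$(to_seconds 15m51s)    # spark-itemsimilarity run

# Ratio scaled by 100 so plain sh integer arithmetic is enough.
ratio_x100=$(( new * 100 / old ))
printf 'old=%ds new=%ds slowdown=%d.%02dx\n' \
    "$old" "$new" $(( ratio_x100 / 100 )) $(( ratio_x100 % 100 ))
```

So on this sample the Spark job is roughly 2.7 times slower, rather than the 10 times faster I was expecting.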