AWS EMR is usually not very well suited for Spark. Spark get’s most of it’s 
speed from in-memory calculations. So to see speed gains you have to have 
enough memory. Also partitioning will help in many cases. If you read in data 
from a single file—that partitioning will usually follow the calculation 
throughout all intermediate steps. If the data is from a single file the 
partition may be 1 and therefor it will only use one machine. The most recent 
Mahout snapshot (therefore the next release) allows you to pass in the 
partitioning for each event pair (this is only in the library use, not CLI). To 
get this effect in the current release, try splitting the input into multiple 
files.

I’m. probably the one that reported the 10x speed up and used input from Kafka 
DStreams, which causes very small default partition sizes. Also other 
comparisons for other calculations give a similar speedup result. There is 
little question about Spark being much faster—when used the way it is meant to 
be.

I use Mahout as a library all the time in the Universal Recommender implemented 
in Apache PredictionIO. As a library we get greater control than the CLI. The 
CLI is really only a proof of concept, not really meant for production.

BTW there is a significant algorithm benefit of the code behind 
spark-itemsimilarity that is probably more important than the speed increase 
and that is Correlated Cross-Occurrence, which allows the use of many 
indicators of user taste, not just the primary/conversion event, which is all 
any other CF-style recommender that I know of can use.


On Sep 22, 2016, at 1:49 AM, Arnau Sanchez <pyar...@gmail.com> wrote:

I've been using the Mahout itemsimilarity job for a while, with good results. I 
read that the new spark-itemsimilarity job is typically faster, by a factor of 
10, so I wanted to give it a try. I must be doing something wrong because, with 
the same EMR infrastructure, the spark job is slower than the old one (6 min vs 
16 min) working on the same data. I took a small sample dataset (766k rating 
pairs) to compare numbers, this is the result:

Input ratings: http://download.zaudera.com/public/ratings

Infrastructure: emr-4.7.2 (spark 1.6.2, mahout 0.12.2)

Old itemsimilarity:

$ mahout itemsimilarity --input ratings --output itemsimilarity --booleanData 
TRUE --maxSimilaritiesPerItem 10 --similarityClassname SIMILARITY_COOCCURRENCE
[5m54s]

(logs: http://download.zaudera.com/public/itemsimilarity.out)

New spark-itemsimilarity:

$ mahout spark-itemsimilarity --input ratings --output spark-itemsimilarity 
--maxSimilaritiesPerItem 10 --master yarn-client
[15m51s]

(logs: http://download.zaudera.com/public/spark-itemsimilarity.out)

Any ideas? Thanks!

Reply via email to