Hi Arnau,
Unfortunately, the links to your log files don't work for me. Are you sure
you set up Spark correctly? That can be a bit tricky on YARN; sometimes one
machine just sits idle...
Best,
Sebastian
On 25.09.2016 18:01, Pat Ferrel wrote:
AWS EMR is usually not very well suited for Spark. Spark gets most of its
speed from in-memory calculations, so to see speed gains you have to have
enough memory. Partitioning will also help in many cases. The partitioning of
the input data usually carries through all intermediate steps of the
calculation. If the data comes from a single file, the number of partitions
may be 1, and the job will therefore use only one machine. The most recent
Mahout snapshot (and therefore the next release) allows you to pass in the
partitioning for each event pair (this is only available in library use, not
in the CLI). To get this effect in the current release, try splitting the
input into multiple files.
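As a rough illustration of the "split the input into multiple files" workaround, here is a minimal Python sketch that distributes the lines of one local file round-robin over several part files. The function name, the part-file naming scheme, and the number of parts are assumptions made for this example; nothing here is part of Mahout or Spark.

```python
import os


def split_into_parts(src_path, out_dir, n_parts):
    """Distribute the lines of src_path round-robin over n_parts files.

    Hypothetical helper for illustration only. Returns the list of part
    file paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    part_paths = [os.path.join(out_dir, "part-%05d" % i) for i in range(n_parts)]
    outs = [open(p, "w") for p in part_paths]
    try:
        with open(src_path) as src:
            for line_no, line in enumerate(src):
                # Line k goes to part k mod n_parts, so the parts end up
                # with nearly equal line counts.
                outs[line_no % n_parts].write(line)
    finally:
        for f in outs:
            f.close()
    return part_paths
```

The resulting directory would then be uploaded and given as the job input in place of the single file. How the files actually map to Spark partitions depends on file sizes and the input format, so the partition count is worth checking in the Spark UI rather than assuming.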
I’m probably the one who reported the 10x speedup; I used input from Kafka
DStreams, which produce very small default partition sizes. Comparisons for
other calculations give similar speedups. There is little question that Spark
is much faster when used the way it is meant to be.
I use Mahout as a library all the time in the Universal Recommender implemented
in Apache PredictionIO. Used as a library, it gives us greater control than the
CLI does. The CLI is really only a proof of concept, not meant for production.
BTW, the code behind spark-itemsimilarity has a significant algorithmic benefit
that is probably more important than the speed increase: Correlated
Cross-Occurrence, which allows the use of many indicators of user taste, not
just the primary/conversion event, which is all that any other CF-style
recommender I know of can use.
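To make the idea of cross-occurrence concrete, here is a small Python sketch that counts how many users link an item from the primary event (for example, purchases) with an item from a secondary event (for example, views). It is only an illustration of the raw counting step: Mahout additionally applies a log-likelihood ratio test to keep only the significant pairs, which this sketch leaves out. The function name and the sample users and items are made up for the example.

```python
from collections import defaultdict
from itertools import product


def cross_occurrence(primary, secondary):
    """Count users who have both a primary item and a secondary item.

    primary, secondary: dict mapping user -> set of items for that event.
    Returns a dict mapping (primary_item, secondary_item) -> user count.
    Illustrative only: no log-likelihood ratio filtering is applied.
    """
    counts = defaultdict(int)
    for user, p_items in primary.items():
        s_items = secondary.get(user, ())
        for p, s in product(p_items, s_items):
            counts[(p, s)] += 1
    return dict(counts)


# Hypothetical sample data: purchases are the primary (conversion) event,
# views are a secondary indicator of taste.
purchases = {"u1": {"ipad"}, "u2": {"ipad", "nexus"}, "u3": {"nexus"}}
views = {"u1": {"iphone"}, "u2": {"iphone", "galaxy"}, "u3": {"galaxy"}}

# Ordinary cooccurrence is the special case where both events are the same.
purchase_purchase = cross_occurrence(purchases, purchases)
purchase_view = cross_occurrence(purchases, views)
```

With this sample data, purchase_view links "ipad" to "iphone" for two users and to "galaxy" for one, so a secondary event contributes evidence about the primary one.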
On Sep 22, 2016, at 1:49 AM, Arnau Sanchez <pyar...@gmail.com> wrote:
I've been using the Mahout itemsimilarity job for a while, with good results. I
read that the new spark-itemsimilarity job is typically faster, by a factor of
10, so I wanted to give it a try. I must be doing something wrong because, with
the same EMR infrastructure, the Spark job is slower than the old one (6 min vs
16 min) working on the same data. I took a small sample dataset (766k rating
pairs) to compare numbers; these are the results:
Input ratings: http://download.zaudera.com/public/ratings
Infrastructure: emr-4.7.2 (spark 1.6.2, mahout 0.12.2)
Old itemsimilarity:
$ mahout itemsimilarity --input ratings --output itemsimilarity --booleanData
TRUE --maxSimilaritiesPerItem 10 --similarityClassname SIMILARITY_COOCCURRENCE
[5m54s]
(logs: http://download.zaudera.com/public/itemsimilarity.out)
New spark-itemsimilarity:
$ mahout spark-itemsimilarity --input ratings --output spark-itemsimilarity
--maxSimilaritiesPerItem 10 --master yarn-client
[15m51s]
(logs: http://download.zaudera.com/public/spark-itemsimilarity.out)
Any ideas? Thanks!