Re: spark-itemsimilarity slower than itemsimilarity

2016-10-03 Thread Pat Ferrel
Except for reading the input, it now takes ~5 minutes to train. On Sep 30, 2016, at 5:12 PM, Pat Ferrel wrote: Yeah, I bet Sebastian is right. I see no reason not to try running with --master local[4] or some number of cores on localhost. This will avoid all
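A minimal sketch of such a local run, assuming the standard spark-itemsimilarity driver options; the HDFS paths are placeholders, not values from this thread:

# Sketch only: run the Mahout Spark driver on 4 local cores so nothing
# is serialized between machines. Paths below are placeholders.
mahout spark-itemsimilarity \
  --input hdfs:///user/arnau/ratings-clean \
  --output hdfs:///user/arnau/indicators \
  --master local[4]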

Re: spark-itemsimilarity slower than itemsimilarity

2016-09-30 Thread Pat Ferrel
Yeah, I bet Sebastian is right. I see no reason not to try running with --master local[4] or some number of cores on localhost. This will avoid all serialization. With times that low and data that small, there is no benefit to separate machines. We are using this with ~1TB of data. Using Mahout as a

Re: spark-itemsimilarity slower than itemsimilarity

2016-09-30 Thread Sebastian
Hi Arnau, I don't think that you can expect any speedups in your setup: your input data is way too small, and I think you are running only two concurrent tasks. Maybe you should try a larger sample of your data and more machines. At the moment, it seems to me that the overheads of running in a

Re: spark-itemsimilarity slower than itemsimilarity

2016-09-30 Thread Arnau Sanchez
Hi! Here you go: "ratings-clean" contains only pairs of (user, product) for those products with 4 or more user interactions (770k -> 465k): https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0 The results: 1 part of 465k: 3m41.361s 5 parts of 100k: 4m20.785s 24 parts of
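A sketch of how such a filter could be produced, assuming tab-separated (user, product) pairs; the column positions and file names are assumptions, not from the original mail:

# First pass counts interactions per product, second pass keeps only
# lines whose product has 4 or more interactions.
awk -F'\t' 'NR==FNR { count[$2]++; next } count[$2] >= 4' \
  ratings.tsv ratings.tsv > ratings-clean.tsv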

Re: spark-itemsimilarity slower than itemsimilarity

2016-09-29 Thread Sebastian
Hi Arnau, I had a look at your ratings file and it's kind of strange. It's pretty tiny (770k ratings, 8 MB), but it has more than 250k distinct items. Out of these, only 50k have more than 3 interactions. So I think the first thing you should do is throw out all the items with so few

Re: spark-itemsimilarity slower than itemsimilarity

2016-09-29 Thread Arnau Sanchez
A Dropbox link now: https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0 And here is the script I use to test different sizes/partitions (example: 10 parts of 10k): #!/bin/sh set -e -u mkdir -p ratings-split rm -rf ratings-split/part* hdfs dfs -rm -r ratings-split

Re: spark-itemsimilarity slower than itemsimilarity

2016-09-29 Thread Sebastian
Hi Arnau, The links to your logfiles don't work for me, unfortunately. Are you sure you set up Spark correctly? That can be a bit tricky in YARN settings; sometimes one machine idles around... Best, Sebastian On 25.09.2016 18:01, Pat Ferrel wrote: AWS EMR is usually not very well suited for
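Two standard YARN CLI commands (generic, not specific to this thread) that can help confirm whether every node is actually running containers:

# List all NodeManagers with their running-container counts; an idle
# machine shows up with zero containers.
yarn node -list -all
# List running applications; the tracking URL leads to the Spark UI,
# where per-executor task counts can be checked.
yarn application -list -appStates RUNNING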

Re: spark-itemsimilarity slower than itemsimilarity

2016-09-29 Thread Arnau Sanchez
The scaling issues with EMR+Spark may explain the weird performance I am seeing with Mahout's spark-itemsimilarity. I compared the running times with different numbers of partitions: the more partitions I feed the job, the more parallel processes it creates on the nodes and the more RAM it uses (some 100GB

Re: spark-itemsimilarity slower than itemsimilarity

2016-09-28 Thread Pat Ferrel
The problem with EMR is that the Spark driver often needs to be as big as the executors, and that is not handled by EMR. EMR worked fine for Hadoop MapReduce because the driver usually did not have to be scaled vertically. I suppose you could say EMR would work but does not solve the whole
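If one does stay on EMR, driver and executor sizes can at least be set explicitly at cluster creation. A rough sketch using the EMR configurations mechanism; the memory values, instance type, and release label are made-up examples:

# Sketch only: size the Spark driver explicitly, since EMR's defaults
# assume a small driver. All values here are illustrative.
cat > spark-config.json <<'EOF'
[{"Classification": "spark-defaults",
  "Properties": {"spark.driver.memory": "16g",
                 "spark.executor.memory": "16g"}}]
EOF
aws emr create-cluster --applications Name=Spark \
  --release-label emr-5.0.0 --use-default-roles \
  --instance-type r3.2xlarge --instance-count 4 \
  --configurations file://spark-config.json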

Re: spark-itemsimilarity slower than itemsimilarity

2016-09-26 Thread Arnau Sanchez
On Sun, 25 Sep 2016 09:01:43 -0700 Pat Ferrel wrote: > AWS EMR is usually not very well suited for Spark. What infrastructure would you recommend? Some EC2 instances provide lots of memory (though maybe not at the most competitive price: r3.8xlarge, 244 GB RAM). My

Re: spark-itemsimilarity slower than itemsimilarity

2016-09-25 Thread Pat Ferrel
AWS EMR is usually not very well suited for Spark. Spark gets most of its speed from in-memory calculations, so to see speed gains you have to have enough memory. Also, partitioning will help in many cases. If you read in data from a single file, that partitioning will usually follow the
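One simple way to influence that initial partitioning, along the lines of the split script Arnau posted: break the single input file into several pieces before uploading. A sketch assuming GNU split; file and directory names are placeholders:

# Split the ratings file into 10 line-aligned chunks so Spark starts
# with 10 input partitions instead of one.
mkdir -p ratings-split
split -n l/10 ratings.tsv ratings-split/part-
hdfs dfs -put ratings-split /user/arnau/ratings-split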

spark-itemsimilarity slower than itemsimilarity

2016-09-22 Thread Arnau Sanchez
I've been using the Mahout itemsimilarity job for a while, with good results. I read that the new spark-itemsimilarity job is typically faster, by a factor of 10, so I wanted to give it a try. I must be doing something wrong because, with the same EMR infrastructure, the Spark job is slower
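For context, the two jobs being compared look roughly like this; the paths, the similarity measure, and the Spark master URL are placeholders rather than values from the original mail:

# Old Hadoop MapReduce job
mahout itemsimilarity \
  --input hdfs:///user/arnau/ratings \
  --output hdfs:///user/arnau/sims-mr \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD

# New Spark-based job
mahout spark-itemsimilarity \
  --input hdfs:///user/arnau/ratings \
  --output hdfs:///user/arnau/sims-spark \
  --master yarn-client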