Except for reading the input, it now takes ~5 minutes to train.
On Sep 30, 2016, at 5:12 PM, Pat Ferrel wrote:
Yeah, I bet Sebastian is right. I see no reason not to try running with
--master local[4] or some number of cores on localhost. This will avoid all
serialization. With times that low and small data there is no benefit to
separate machines.
We are using this with ~1TB of data. Using Mahout as a
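For reference, such a local run could look roughly like this (a sketch: the
--input/--output/--master flags are spark-itemsimilarity options, but the
paths and core count here are placeholders):

# Run the whole job in one JVM on 4 local cores, avoiding executor
# serialization. Input/output paths are placeholders.
mahout spark-itemsimilarity \
  --input ratings-clean \
  --output similarity-out \
  --master local[4]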
Hi Arnau,
I don't think that you can expect any speedups in your setup; your input
data is way too small and I think you are running only two concurrent tasks.
Maybe you should try a larger sample of your data and more machines.
At the moment, it seems to me that the overheads of running in a
Hi!
Here you go: "ratings-clean" contains only pairs of (user, product) for those
products with 4 or more user interactions (770k -> 465k):
https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0
The results:
1 part of 465k: 3m41.361s
5 parts of 100k: 4m20.785s
24 parts of
Hi Arnau,
I had a look at your ratings file and it's kind of strange. It's pretty
tiny (770k ratings, 8MB), but it has more than 250k distinct items. Out
of these, only 50k have more than 3 interactions.
So I think the first thing that you should do is throw out all the items
with so few
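One way to do that filtering from the shell (a sketch: it assumes a
two-column "user,item" CSV named ratings; adjust the separator and column
index to your format):

# Pass 1 counts interactions per item; pass 2 keeps only rows whose
# item appears at least 4 times.
awk -F',' 'NR==FNR { count[$2]++; next }
           count[$2] >= 4' ratings ratings > ratings-clean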
A Dropbox link now:
https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0
And here is the script I use to test different sizes/partitions (example: 10
parts of 10k):
#!/bin/sh
set -e -u
# Recreate the local staging directory for the split files.
mkdir -p ratings-split
rm -rf ratings-split/part*
# -f so a missing HDFS directory does not abort under "set -e".
hdfs dfs -rm -r -f ratings-split
# Assumption: split the first 100k ratings into 10 parts of 10k lines
# each and upload them to HDFS.
head -n 100000 ratings-clean | split -l 10000 - ratings-split/part-
hdfs dfs -put ratings-split ratings-split
Hi Arnau,
The links to your logfiles don't work for me, unfortunately. Are you sure
you set up Spark correctly? That can be a bit tricky in YARN settings;
sometimes one machine just idles around...
Best,
Sebastian
On 25.09.2016 18:01, Pat Ferrel wrote:
AWS EMR is usually not very well suited for
The scaling issues with EMR+Spark may explain the weird performance I am seeing
with Mahout's spark-itemsimilarity. I compared the running times with different
partitions: the more partitions I feed the job, the more parallel processes it
creates in the nodes, and the more RAM it uses (some 100GB
The problem with EMR is that the Spark driver often needs to be as big as the
executors, and that is not handled by EMR. EMR worked fine for Hadoop
MapReduce because the driver usually did not have to be scaled vertically. I
suppose you could say EMR would work but does not solve the whole
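To make that concrete, here is a sketch of sizing the driver yourself with
standard spark-submit flags (the class, jar, and memory sizes below are
placeholders):

# Give the driver as much memory as an executor; EMR's defaults will
# not do this for you. Class and jar names are hypothetical.
spark-submit \
  --master yarn \
  --driver-memory 16g \
  --executor-memory 16g \
  --class com.example.ItemSimilarityJob \
  itemsimilarity-job.jar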
On Sun, 25 Sep 2016 09:01:43 -0700 Pat Ferrel wrote:
> AWS EMR is usually not very well suited for Spark.
What infrastructure would you recommend? Some EC2 instances provide lots of
memory (though maybe not at the most competitive price: r3.8xlarge, 244 GB
RAM).
My
AWS EMR is usually not very well suited for Spark. Spark gets most of its
speed from in-memory calculations. So to see speed gains you have to have
enough memory. Also partitioning will help in many cases. If you read in data
from a single file, that partitioning will usually follow the
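A quick way to check the layout Spark will inherit (a sketch; the path is a
placeholder):

# List the files and HDFS blocks backing the input; a single small
# file on one block typically becomes a single input partition.
hdfs fsck /user/hadoop/ratings-clean -files -blocks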
I've been using the Mahout itemsimilarity job for a while, with good results. I
read that the new spark-itemsimilarity job is typically faster, by a factor of
10, so I wanted to give it a try. I must be doing something wrong because, with
the same EMR infrastructure, the Spark job is slower