Hi Arnau,
I had a look at your ratings file and it's kind of strange. It's pretty
tiny (770k ratings, 8MB), but it has more than 250k distinct items. Out
of these, only 50k have more than 3 interactions.
So I think the first thing you should do is throw out all the items
with so few interactions.
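A quick way to do that pruning from the shell (a sketch; it assumes CSV rows of user,item,rating with the item in the second column, and `ratings.csv` as a placeholder filename):

```shell
# Sketch, not from the original mail: assumes "user,item,rating" CSV,
# item in column 2, and a placeholder filename ratings.csv.
# First pass counts interactions per item; second pass keeps only
# rows whose item has more than 3 interactions.
awk -F, 'NR==FNR { count[$2]++; next } count[$2] > 3' \
    ratings.csv ratings.csv > ratings-filtered.csv
```

Reading the file twice is cheap at 8MB, and it avoids holding the rows themselves in memory.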
A Dropbox link now:
https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0
And here is the script I use to test different sizes/partitions (example: 10
parts of 10k):
#!/bin/sh
set -e -u
mkdir -p ratings-split
rm -rf ratings-split/part*
# -f so the script doesn't abort under set -e when the HDFS dir doesn't exist yet
hdfs dfs -rm -r -f ratings-split
# (assumed continuation -- the mail is cut off here; ratings.csv is a placeholder)
# split the ratings into 10k-line parts and push them to HDFS
split -l 10000 -d ratings.csv ratings-split/part-
hdfs dfs -put ratings-split ratings-split
Hi Arnau,
The links to your logfiles unfortunately don't work for me. Are you sure
you set up Spark correctly? That can be a bit tricky on YARN;
sometimes one machine just idles around...
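One way to rule out that kind of misconfiguration is to pin the executor layout explicitly when submitting, and then check in the YARN UI that every node actually receives executors (a sketch; the jar, class name, and sizes below are placeholders, not from this thread):

```shell
# Hypothetical submit command: class, jar, and resource sizes are
# placeholders. Fixing the executor count/cores/memory makes it easy
# to spot in the YARN ResourceManager UI when a node sits idle.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  --class com.example.ItemSimilarityJob \
  my-job.jar
```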
Best,
Sebastian
On 25.09.2016 18:01, Pat Ferrel wrote:
AWS EMR is usually not very well suited for
The scaling issues with EMR+Spark may explain the weird performance I am seeing
with Mahout's spark-itemsimilarity. I compared the running times with different
numbers of partitions: the more partitions I feed the job, the more parallel
processes it creates on the nodes, and the more RAM it uses (some 100GB