Re: spark-itemsimilarity slower than itemsimilarity

2016-09-29 Thread Sebastian
Hi Arnau, I had a look at your ratings file and it's kind of strange. It's pretty tiny (770k ratings, 8 MB), but it has more than 250k distinct items. Out of these, only 50k have more than 3 interactions. So I think the first thing that you should do is throw out all the items with so few
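For illustration, the pruning step suggested above could be sketched roughly as follows. This is a minimal sketch, not anything posted in the thread: the `(user, item, rating)` row layout, the function name `filter_rare_items`, and the default threshold of 4 (that is, "more than 3 interactions") are assumptions chosen for the example.

```python
from collections import Counter


def filter_rare_items(rows, min_interactions=4):
    """Keep only rows whose item occurs at least `min_interactions` times.

    `rows` is assumed to be an iterable of (user, item, rating) tuples;
    this layout is hypothetical and may differ from the real ratings file.
    """
    rows = list(rows)  # materialize so the data can be scanned twice
    # First pass: count how many interactions each item has.
    counts = Counter(item for _, item, _ in rows)
    # Second pass: drop rows that belong to rarely seen items.
    return [row for row in rows if counts[row[1]] >= min_interactions]


if __name__ == "__main__":
    sample = [
        ("u1", "a", 1.0), ("u2", "a", 1.0), ("u3", "a", 1.0), ("u4", "a", 1.0),
        ("u1", "b", 1.0),  # item "b" has a single interaction and is dropped
    ]
    for row in filter_rare_items(sample):
        print(row)
```

Running such a filter once before the similarity job would reduce the number of distinct items the job has to handle; whether that alone explains the running times discussed in this thread is not something the sketch can show.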

Re: spark-itemsimilarity slower than itemsimilarity

2016-09-29 Thread Arnau Sanchez
A Dropbox link now: https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0
And here is the script I use to test different sizes/partitions (example: 10 parts of 10k):
    #!/bin/sh
    set -e -u
    mkdir -p ratings-split
    rm -rf ratings-split/part*
    hdfs dfs -rm -r ratings-split

Re: spark-itemsimilarity slower than itemsimilarity

2016-09-29 Thread Sebastian
Hi Arnau, Unfortunately, the links to your log files don't work for me. Are you sure you set up Spark correctly? That can be a bit tricky in YARN settings; sometimes one machine idles around... Best, Sebastian On 25.09.2016 18:01, Pat Ferrel wrote: AWS EMR is usually not very well suited for

Re: spark-itemsimilarity slower than itemsimilarity

2016-09-29 Thread Arnau Sanchez
The scaling issues with EMR+Spark may explain the weird performance I am seeing with Mahout's spark-itemsimilarity. I compared the running times with different partitions: the more partitions I feed the job, the more parallel processes it creates in the nodes, the more RAM it uses (some 100GB