Hi Arnau,

I don't think you can expect any speedups in your setup: your input data is way too small, and I think you only run two concurrent tasks. Maybe you should try a larger sample of your data and more machines.

At the moment, it seems to me that the overhead of running in a distributed setting (task scheduling, serialization, ...) totally dominates the actual computation.
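For what it's worth, a quick way to see how much parallelism you actually get is to check the number of partitions of the input. This is just a sketch in a plain spark-shell (the path and partition count here are made up, it's not the Mahout driver itself):

  // Load the ratings with an explicit lower bound on partitions, so that
  // more than two tasks can run concurrently.
  val ratings = sc.textFile("ratings-clean", minPartitions = 16)

  // If this prints 1 or 2, the per-task scheduling and serialization
  // overhead will easily dominate such a small (8MB) input.
  println(s"partitions: ${ratings.getNumPartitions}")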

Best,
Sebastian

On 30.09.2016 11:11, Arnau Sanchez wrote:
Hi!

Here you go: "ratings-clean" contains only pairs of (user, product) for those 
products with 4 or more user interactions (770k -> 465k):

https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0

The results:

1 part of 465k:   3m41.361s
5 parts of 100k:  4m20.785s
24 parts of 20k: 10m44.375s
47 parts of 10k: 17m39.385s

On Fri, 30 Sep 2016 00:09:13 +0200 Sebastian <s...@apache.org> wrote:

Hi Arnau,

I had a look at your ratings file and it's kind of strange. It's pretty
tiny (770k ratings, 8MB), but it has more than 250k distinct items. Out
of these, only 50k have more than 3 interactions.

So I think the first thing you should do is throw out all the items
with so few interactions. Item similarity computations are pretty
sensitive to the number of unique items; maybe that's why you don't see
much difference in the run times.
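A rough sketch of that filtering step in a spark-shell (I'm assuming the ratings file holds tab-separated "user<TAB>item" lines; paths and format are guesses on my part):

  // Parse (item, user) pairs from the ratings file.
  val pairs = sc.textFile("ratings").map { line =>
    val Array(user, item) = line.split("\t")
    (item, user)
  }

  // Keep only items with at least 4 interactions.
  val frequentItems = pairs.mapValues(_ => 1)
    .reduceByKey(_ + _)
    .filter { case (_, count) => count >= 4 }

  // Join back and write out the cleaned (user, item) pairs.
  pairs.join(frequentItems)
    .map { case (item, (user, _)) => s"$user\t$item" }
    .saveAsTextFile("ratings-clean")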

-s


On 29.09.2016 22:17, Arnau Sanchez wrote:
 --input ratings --output spark-itemsimilarity --maxSimilaritiesPerItem 10
