Hi!
Here you go: "ratings-clean" contains only the (user, product) pairs for products with 4 or more user interactions (770k -> 465k pairs):
https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0
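The filtering step described above might look roughly like this (a minimal sketch; the actual file format and cleaning code are not shown in the thread):

```python
from collections import Counter

def clean_ratings(pairs):
    """Keep only (user, product) pairs whose product has >= 4 interactions."""
    counts = Counter(product for _, product in pairs)
    return [(u, p) for u, p in pairs if counts[p] >= 4]

# Toy example: product "b" has only one interaction, so its pair is dropped.
pairs = [("u1", "a"), ("u2", "a"), ("u3", "a"), ("u4", "a"), ("u5", "b")]
print(clean_ratings(pairs))
```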
The results:
1 part of 465k: 3m41.361s
5 parts of 100k: 4m20.785s
24 parts of

Hi Arnau,
I don't think that you can expect any speedups in your setup: your input
data is way too small, and I think you are running only two concurrent tasks.
Maybe you should try a larger sample of your data and more machines.
At the moment, it seems to me that the overheads of running in a

Yeah, I bet Sebastian is right. I see no reason not to try running with
--master local[4], or some other number of cores on localhost. This will avoid all
network serialization. With runtimes that low and data that small, there is no
benefit to using separate machines.

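Running locally might look like the following (a sketch; the script name `ratings_job.py` is a placeholder, not from the thread):

```shell
# local[4] runs Spark in a single JVM with 4 worker threads,
# avoiding cluster scheduling and network serialization overhead.
spark-submit --master local[4] ratings_job.py
```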
We are using this with ~1TB of data. Using Mahout as a