Re: spark-itemsimilarity slower than itemsimilarity

2016-09-30 Thread Arnau Sanchez
Hi! Here you go: "ratings-clean" contains only the (user, product) pairs for products with 4 or more user interactions (770k -> 465k): https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0 The results: 1 part of 465k: 3m41.361s; 5 parts of 100k: 4m20.785s; 24 parts of
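
For readers who want to reproduce the kind of filtering described in this message, here is a minimal sketch. It is not the script actually used in the thread: the function name `filter_min_interactions` is hypothetical, and it assumes that "4 or more user interactions" means at least 4 distinct users per product.

```python
from collections import Counter


def filter_min_interactions(pairs, min_users=4):
    """Keep (user, product) pairs whose product was seen by >= min_users distinct users.

    Assumption: an "interaction" is counted once per distinct (user, product) pair.
    """
    # Deduplicate first so repeated events from one user count only once.
    unique_pairs = set(pairs)
    users_per_product = Counter(product for _, product in unique_pairs)
    # Preserve the original order of the input pairs.
    return [(user, product) for user, product in pairs
            if users_per_product[product] >= min_users]
```

Calling it on the raw list of pairs returns the reduced list, which can then be written out as the input file for either similarity job.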

Re: spark-itemsimilarity slower than itemsimilarity

2016-09-30 Thread Sebastian
Hi Arnau, I don't think you can expect any speedup in your setup: your input data is way too small, and I think you are running only two concurrent tasks. Maybe you should try a larger sample of your data and more machines. At the moment, it seems to me that the overheads of running in a

Re: spark-itemsimilarity slower than itemsimilarity

2016-09-30 Thread Pat Ferrel
Yeah, I bet Sebastian is right. I see no reason not to try running with --master local[4] or some number of cores on localhost. This will avoid all serialization. With times that low and small data there is no benefit to separate machines. We are using this with ~1TB of data. Using Mahout as a