Hi! Here you go: "ratings-clean" contains only pairs of (user, product) for those products with 4 or more user interactions (770k -> 465k):
https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0

The results:

1 part of 465k:   3m41.361s
5 parts of 100k:  4m20.785s
24 parts of 20k:  10m44.375s
47 parts of 10k:  17m39.385s

On Fri, 30 Sep 2016 00:09:13 +0200 Sebastian <s...@apache.org> wrote:

> Hi Arnau,
>
> I had a look at your ratings file and it's kind of strange. It's pretty
> tiny (770k ratings, 8MB), but it has more than 250k distinct items. Out
> of these, only 50k have more than 3 interactions.
>
> So I think the first thing you should do is throw out all the items
> with so few interactions. Item similarity computations are pretty
> sensitive to the number of unique items; maybe that's why you don't see
> much difference in the run times.
>
> -s
>
> On 29.09.2016 22:17, Arnau Sanchez wrote:
> > --input ratings --output spark-itemsimilarity --maxSimilaritiesPerItem 10
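For reference, the cleaning step described at the top (keeping only (user, product) pairs whose product has 4 or more interactions, 770k -> 465k) can be sketched roughly like this. This is just an illustrative sketch: the function name and the in-memory list-of-tuples representation are my own assumptions, not part of the actual pipeline.

```python
# Sketch of the "ratings-clean" filtering step: drop every pair whose
# product has fewer than min_interactions total interactions.
from collections import Counter

def clean_ratings(pairs, min_interactions=4):
    """pairs is a list of (user, product) tuples; returns the filtered list."""
    # First pass: count interactions per product.
    counts = Counter(product for _, product in pairs)
    # Second pass: keep only pairs whose product is frequent enough.
    return [(u, p) for u, p in pairs if counts[p] >= min_interactions]
```

Fewer unique items should help here, since (as Sebastian notes below) the item-similarity computation is sensitive to the number of distinct items.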