Hi!

Here you go: "ratings-clean" contains only pairs of (user, product) for those 
products with 4 or more user interactions (770k -> 465k):

https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0
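The source doesn't show how the filtering was done; one standard way to sketch it is a two-pass awk over the ratings file (assuming tab-separated "user<TAB>product" lines; file names and demo data here are illustrative):

```shell
# Demo data: tab-separated "user<TAB>product" pairs (illustrative).
printf 'u1\ta\nu2\ta\nu3\ta\nu4\ta\nu5\tb\n' > ratings

# First pass counts interactions per product; second pass keeps only
# the pairs whose product has 4 or more interactions.
awk -F'\t' 'NR==FNR { n[$2]++; next } n[$2] >= 4' ratings ratings > ratings-clean

cat ratings-clean   # product "b" (only 1 interaction) is dropped
```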

The results:

1 part of 465k:   3m41.361s
5 parts of 100k:  4m20.785s
24 parts of 20k: 10m44.375s
47 parts of 10k: 17m39.385s
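The email doesn't say how the parts were produced; a plain way to sketch it is GNU split with a fixed line count (2 lines per part below for illustration; the runs above used 100k/20k/10k-line parts):

```shell
# Demo: split a cleaned ratings file into fixed-size parts.
printf '1\n2\n3\n4\n5\n' > ratings-clean
mkdir -p parts
split -l 2 ratings-clean parts/part-

ls parts   # part-aa part-ab part-ac
```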

On Fri, 30 Sep 2016 00:09:13 +0200 Sebastian <s...@apache.org> wrote:

> Hi Arnau,
> 
> I had a look at your ratings file and it's kind of strange. It's pretty 
> tiny (770k ratings, 8MB), but it has more than 250k distinct items. Out 
> of these, only 50k have more than 3 interactions.
> 
> So I think the first thing you should do is throw out all the items 
> with so few interactions. Item similarity computations are pretty 
> sensitive to the number of unique items; maybe that's why you don't see 
> much difference in the run times.
> 
> -s
> 
> 
> On 29.09.2016 22:17, Arnau Sanchez wrote:
> >  --input ratings --output spark-itemsimilarity --maxSimilaritiesPerItem 10  
