Except for reading the input, it now takes ~5 minutes to train.

On Sep 30, 2016, at 5:12 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

Yeah, I bet Sebastian is right. I see no reason not to try running with 
--master local[4], or however many cores you have on localhost. This will avoid 
all serialization. With run times that low and data that small, there is no 
benefit to separate machines.
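
To make that concrete, here is a minimal sketch of a local run when the 
computation is driven from your own Scala code rather than the CLI; the file 
name and the tab-separated (user, item) layout are assumptions about your data. 
As far as I recall, the spark-itemsimilarity driver also accepts a --master 
option that can be given the same local[4] value.

    // Minimal sketch: run everything on 4 local cores, no cluster, no network
    // serialization. Only stock Spark APIs are used here.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("item-similarity-local")
      .setMaster("local[4]")                  // 4 cores on localhost
    val sc = new SparkContext(conf)

    // (user, item) interaction pairs, kept in 4 partitions to match the cores;
    // the file name and tab-separated format are placeholders for your data
    val interactions = sc.textFile("ratings-clean.tsv", minPartitions = 4)
      .map { line =>
        val Array(user, item) = line.split("\t")
        (user, item)
      }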

We are using this with ~1TB of data. Using Mahout as a library, we also set the 
partitioning to 4x the number of cores in the cluster with --conf 
spark.parallelism=384; on 3 very large machines it completes in 42 minutes. We 
also have 11 events, so 11 different input matrices.
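
As a rough illustration, the library call looks something like the sketch below 
for just two of the event types. The class and method names 
(IndexedDatasetSpark, SimilarityAnalysis.cooccurrencesIDSs) and their 
signatures are as I remember them from Mahout's Spark bindings, so treat them 
as assumptions and check them against your Mahout version; the RDDs are 
placeholders.

    // Hedged sketch: LLR cross-occurrence over multiple event types with
    // Mahout used as a library. API names/signatures are from memory.
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import org.apache.mahout.math.cf.SimilarityAnalysis
    import org.apache.mahout.math.indexeddataset.IndexedDataset
    import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark

    val sc: SparkContext = ???                    // your existing context

    // one RDD[(userID, itemID)] per event type; the first is the primary event
    val purchases: RDD[(String, String)] = ???    // placeholder: primary event
    val views: RDD[(String, String)] = ???        // placeholder: secondary event

    val purchaseIDS = IndexedDatasetSpark(purchases)(sc)
    // secondary matrices reuse the primary matrix's user dictionary (BiMap)
    val viewIDS = IndexedDatasetSpark(views, Some(purchaseIDS.rowIDs))(sc)

    // returns one indicator (cross-occurrence) matrix per input event type
    val indicators = SimilarityAnalysis.cooccurrencesIDSs(
      Array[IndexedDataset](purchaseIDS, viewIDS))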

I’m testing a way to train on only what is used in model creation: take 
pairRDDs as input and filter out every interaction that is not from a user in 
the primary matrix. With 1TB this reduces the data to a very manageable size 
and dramatically shrinks the BiMaps (which must be held in memory). The effect 
may be less pronounced on other data, but it will make the computation more 
efficient. With the Universal Recommender in the PredictionIO template we still 
use all user input in the queries, so you lose no precision and get real-time 
user interactions in your queries. If you want a more modern recommender than 
the old Mahout MapReduce one, I’d strongly suggest you consider it, or at least 
use it as a model for building a recommender with Mahout. actionml.com/docs/ur
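
The filtering itself is plain Spark. Here is a rough sketch, with placeholder 
RDD names, of keeping only the secondary-event interactions that come from 
users present in the primary matrix:

    import org.apache.spark.rdd.RDD

    // placeholders: (userID, itemID) pairs per event type
    val primary: RDD[(String, String)] = ???     // defines the user set
    val secondary: RDD[(String, String)] = ???   // e.g. a view or search event

    // users that appear in the primary matrix
    val primaryUsers = primary.keys.distinct().map(user => (user, ()))

    // keep only secondary interactions from those users; a join keyed on user
    // does it (for a small user set, broadcasting a Set and filtering is cheaper)
    val filteredSecondary: RDD[(String, String)] =
      secondary.join(primaryUsers).map { case (user, (item, _)) => (user, item) }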


On Sep 30, 2016, at 10:39 AM, Sebastian <s...@apache.org> wrote:

Hi Arnau,

I don't think that you can expect any speedups in your setup; your input data 
is way too small, and I think you are running only two concurrent tasks. Maybe 
you should try a larger sample of your data and more machines.

At the moment, it seems to me that the overheads of running in a distributed 
setting (task scheduling, serialization...) totally dominate the computation.

Best,
Sebastian

On 30.09.2016 11:11, Arnau Sanchez wrote:
> Hi!
> 
> Here you go: "ratings-clean" contains only pairs of (user, product) for those 
> products with 4 or more user interactions (770k -> 465k):
> 
> https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0
> 
> The results:
> 
> 1 part of 465k:   3m41.361s
> 5 parts of 100k:  4m20.785s
> 24 parts of 20k:  10m44.375s
> 47 parts of 10k: 17m39.385s
> 
> On Fri, 30 Sep 2016 00:09:13 +0200 Sebastian <s...@apache.org> wrote:
> 
>> Hi Arnau,
>> 
>> I had a look at your ratings file and it's kind of strange. It's pretty
>> tiny (770k ratings, 8MB), but it has more than 250k distinct items. Out
>> of these, only 50k have more than 3 interactions.
>> 
>> So I think the first thing you should do is throw out all the items
>> with so few interactions. Item similarity computations are pretty
>> sensitive to the number of unique items; maybe that's why you don't see
>> much difference in the run times.
>> 
>> -s
>> 
>> 
>> On 29.09.2016 22:17, Arnau Sanchez wrote:
>>> --input ratings --output spark-itemsimilarity --maxSimilaritiesPerItem 10

