Re: spark-itemsimilarity slower than itemsimilarity

2016-10-03 Thread Pat Ferrel
Except for reading the input, it now takes ~5 minutes to train.


On Sep 30, 2016, at 5:12 PM, Pat Ferrel  wrote:

Yeah, I bet Sebastian is right. I see no reason not to try running with 
--master local[4] or some number of cores on localhost. This will avoid all 
serialization. With times that low and small data there is no benefit to 
separate machines.

We are using this with ~1TB of data. Using Mahout as a lib we also set the 
partitioning to 4x the number of cores in the cluster with --conf 
spark.parallelism=384; on 3 very large machines it completes in 42 minutes. We 
also have 11 events so 11 different input matrices. 

I’m testing a way to train on only what is used in the model creation, which is 
a way to take pairRDDs as input and filter out every interaction that is not 
from a user in the primary matrix. With 1TB this reduces the data to a very 
manageable number and reduces the size of the BiMaps dramatically (they require 
memory). It may not have as pronounced an effect on different data but will 
make the computation more efficient. With the Universal Recommender in the 
PredictionIO template we still use all user input in the queries so you lose 
no precision and get real-time user interactions in your queries. If you want a 
more modern recommender than the old Mahout MapReduce I’d strongly suggest you 
consider it or at least use it as a model for building a recommender with 
Mahout. actionml.com/docs/ur


On Sep 30, 2016, at 10:39 AM, Sebastian  wrote:

Hi Arnau,

I don't think you can expect any speedups in your setup: your input data is way 
too small, and I think you are running only two concurrent tasks. Maybe you should 
try a larger sample of your data and more machines.

At the moment, it seems to me that the overheads of running in a distributed 
setting (task scheduling, serialization...) totally dominate the computation.

Best,
Sebastian

On 30.09.2016 11:11, Arnau Sanchez wrote:
> Hi!
> 
> Here you go: "ratings-clean" contains only pairs of (user, product) for those 
> products with 4 or more user interactions (770k -> 465k):
> 
> https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0
> 
> The results:
> 
> 1 part of 465k:   3m41.361s
> 5 parts of 100k:  4m20.785s
> 24 parts of 20k: 10m44.375s
> 47 parts of 10k: 17m39.385s
> 
> On Fri, 30 Sep 2016 00:09:13 +0200 Sebastian  wrote:
> 
>> Hi Arnau,
>> 
>> I had a look at your ratings file and it's kind of strange. It's pretty
>> tiny (770k ratings, 8MB), but it has more than 250k distinct items. Out
>> of these, only 50k have more than 3 interactions.
>> 
>> So I think the first thing that you should do is throw out all the items
>> with so few interactions. Item similarity computations are pretty
>> sensitive to the number of unique items; maybe that's why you don't see
>> much difference in the run times.
>> 
>> -s
>> 
>> 
>> On 29.09.2016 22:17, Arnau Sanchez wrote:
>>> --input ratings --output spark-itemsimilarity --maxSimilaritiesPerItem 10




Re: spark-itemsimilarity slower than itemsimilarity

2016-09-30 Thread Pat Ferrel
Yeah, I bet Sebastian is right. I see no reason not to try running with 
--master local[4] or some number of cores on localhost. This will avoid all 
serialization. With times that low and small data there is no benefit to 
separate machines.
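
For a quick check, something like the following should be enough (same flags you are already using, just with a local master; the 4 in local[4] is whatever number of cores the box has):

$ mahout spark-itemsimilarity --input ratings --output spark-itemsimilarity \
    --maxSimilaritiesPerItem 10 --master local[4]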

We are using this with ~1TB of data. Using Mahout as a lib we also set the 
partitioning to 4x the number of cores in the cluster with --conf 
spark.parallelism=384; on 3 very large machines it completes in 42 minutes. We 
also have 11 events so 11 different input matrices. 
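
That kind of submit looks roughly like the sketch below; the application class and jar names are just placeholders, and note that the standard Spark property is spark.default.parallelism:

$ spark-submit --master yarn --deploy-mode client \
    --conf spark.default.parallelism=384 \
    --class com.example.TrainRecommender my-recommender-assembly.jar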

I’m testing a way to train on only what is used in the model creation, which is 
a way to take pairRDDs as input and filter out every interaction that is not 
from a user in the primary matrix. With 1TB this reduces the data to a very 
manageable number and reduces the size of the BiMaps dramatically (they require 
memory). It may not have as pronounced an effect on different data but will 
make the computation more efficient. With the Universal Recommender in the 
PredictionIO template we still use all user input in the queries so you lose 
no precision and get real-time user interactions in your queries. If you want a 
more modern recommender than the old Mahout MapReduce I’d strongly suggest you 
consider it or at least use it as a model for building a recommender with 
Mahout. actionml.com/docs/ur
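
The filtering idea itself is simple. As a rough sketch on plain text event files (the file names and the comma-separated user,item layout are assumptions here; the real code works on pairRDDs), it amounts to keeping only the secondary-event rows whose user also appears in the primary events:

# build the set of users seen in the primary (conversion) events,
# then keep only secondary events coming from those users
awk -F, 'NR==FNR { users[$1]; next } $1 in users' \
    primary-events.csv secondary-events.csv > secondary-events-filtered.csv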


On Sep 30, 2016, at 10:39 AM, Sebastian  wrote:

Hi Arnau,

I don't think you can expect any speedups in your setup: your input data is way 
too small, and I think you are running only two concurrent tasks. Maybe you should 
try a larger sample of your data and more machines.

At the moment, it seems to me that the overheads of running in a distributed 
setting (task scheduling, serialization...) totally dominate the computation.

Best,
Sebastian

On 30.09.2016 11:11, Arnau Sanchez wrote:
> Hi!
> 
> Here you go: "ratings-clean" contains only pairs of (user, product) for those 
> products with 4 or more user interactions (770k -> 465k):
> 
> https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0
> 
> The results:
> 
> 1 part of 465k:   3m41.361s
> 5 parts of 100k:  4m20.785s
> 24 parts of 20k: 10m44.375s
> 47 parts of 10k: 17m39.385s
> 
> On Fri, 30 Sep 2016 00:09:13 +0200 Sebastian  wrote:
> 
>> Hi Arnau,
>> 
>> I had a look at your ratings file and it's kind of strange. It's pretty
>> tiny (770k ratings, 8MB), but it has more than 250k distinct items. Out
>> of these, only 50k have more than 3 interactions.
>> 
>> So I think the first thing that you should do is throw out all the items
>> with so few interactions. Item similarity computations are pretty
>> sensitive to the number of unique items; maybe that's why you don't see
>> much difference in the run times.
>> 
>> -s
>> 
>> 
>> On 29.09.2016 22:17, Arnau Sanchez wrote:
>>> --input ratings --output spark-itemsimilarity --maxSimilaritiesPerItem 10



Re: spark-itemsimilarity slower than itemsimilarity

2016-09-30 Thread Sebastian

Hi Arnau,

I don't think you can expect any speedups in your setup: your input data is 
way too small, and I think you are running only two concurrent tasks. 
Maybe you should try a larger sample of your data and more machines.


At the moment, it seems to me that the overheads of running in a 
distributed setting (task scheduling, serialization...) totally dominate 
the computation.


Best,
Sebastian

On 30.09.2016 11:11, Arnau Sanchez wrote:

Hi!

Here you go: "ratings-clean" contains only pairs of (user, product) for those 
products with 4 or more user interactions (770k -> 465k):

https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0

The results:

1 part of 465k:   3m41.361s
5 parts of 100k:  4m20.785s
24 parts of 20k: 10m44.375s
47 parts of 10k: 17m39.385s

On Fri, 30 Sep 2016 00:09:13 +0200 Sebastian  wrote:


Hi Arnau,

I had a look at your ratings file and it's kind of strange. It's pretty
tiny (770k ratings, 8MB), but it has more than 250k distinct items. Out
of these, only 50k have more than 3 interactions.

So I think the first thing that you should do is throw out all the items
with so few interactions. Item similarity computations are pretty
sensitive to the number of unique items; maybe that's why you don't see
much difference in the run times.

-s


On 29.09.2016 22:17, Arnau Sanchez wrote:

 --input ratings --output spark-itemsimilarity --maxSimilaritiesPerItem 10


Re: spark-itemsimilarity slower than itemsimilarity

2016-09-30 Thread Arnau Sanchez
Hi!

Here you go: "ratings-clean" contains only pairs of (user, product) for those 
products with 4 or more user interactions (770k -> 465k):

https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0

The results:

1 part of 465k:   3m41.361s
5 parts of 100k:  4m20.785s
24 parts of 20k: 10m44.375s
47 parts of 10k: 17m39.385s

On Fri, 30 Sep 2016 00:09:13 +0200 Sebastian  wrote:

> Hi Arnau,
> 
> I had a look at your ratings file and it's kind of strange. It's pretty 
> tiny (770k ratings, 8MB), but it has more than 250k distinct items. Out 
> of these, only 50k have more than 3 interactions.
> 
> So I think the first thing that you should do is throw out all the items 
> with so few interactions. Item similarity computations are pretty 
> sensitive to the number of unique items; maybe that's why you don't see 
> much difference in the run times.
> 
> -s
> 
> 
> On 29.09.2016 22:17, Arnau Sanchez wrote:
> >  --input ratings --output spark-itemsimilarity --maxSimilaritiesPerItem 10  


Re: spark-itemsimilarity slower than itemsimilarity

2016-09-29 Thread Sebastian

Hi Arnau,

I had a look at your ratings file and it's kind of strange. It's pretty 
tiny (770k ratings, 8MB), but it has more than 250k distinct items. Out 
of these, only 50k have more than 3 interactions.


So I think the first thing that you should do is throw out all the items 
with so few interactions. Item similarity computations are pretty 
sensitive to the number of unique items; maybe that's why you don't see 
much difference in the run times.
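
A quick way to do that on the raw ratings file (assuming comma-separated user,item pairs; adjust the field separator to whatever your file actually uses) is a two-pass awk:

# first pass counts interactions per item,
# second pass keeps only items with at least 4 interactions
awk -F, 'NR==FNR { count[$2]++; next } count[$2] >= 4' ratings ratings > ratings-clean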


-s


On 29.09.2016 22:17, Arnau Sanchez wrote:

 --input ratings --output spark-itemsimilarity --maxSimilaritiesPerItem 10


Re: spark-itemsimilarity slower than itemsimilarity

2016-09-29 Thread Arnau Sanchez
A Dropbox link now:

https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0

And here is the script I use to test different sizes/partitions (example: 10 
parts of 10k):

#!/bin/bash
set -e -u

mkdir -p ratings-split
rm -rf ratings-split/part*
hdfs dfs -rm -r ratings-split spark-itemsimilarity temp
cat ~/input_ratings/part* 2>/dev/null | head -n100k | split -l10k -d - "ratings-split/part-"
hdfs dfs -mkdir -p ratings-split
hdfs dfs -copyFromLocal ratings-split
time mahout spark-itemsimilarity --input ratings-split/ --output spark-itemsimilarity \
  --maxSimilaritiesPerItem 10 --master yarn-client |& tee spark-itemsimilarity.out

Thanks!

On Thu, 29 Sep 2016 19:46:03 +0200 Arnau Sanchez  wrote:

> Hi Sebastian,
> 
> That's weird, it works here. Anyway, a Dropbox link:
> 
> https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0
> 
> Thanks!
> 
> On Thu, 29 Sep 2016 18:50:23 +0200 Sebastian  wrote:
> 
> > Hi Arnau,
> > 
> > The links to your logfiles don't work for me unfortunately. Are you sure 
> > you correctly setup Spark? That can be a bit tricky in YARN settings, 
> > sometimes one machine idles around...
> > 
> > Best,
> > Sebastian
> > 
> > On 25.09.2016 18:01, Pat Ferrel wrote:  
> > > AWS EMR is usually not very well suited for Spark. Spark gets most of 
> > > its speed from in-memory calculations. So to see speed gains you have to 
> > > have enough memory. Also partitioning will help in many cases. If you 
> > > read in data from a single file—that partitioning will usually follow the 
> > > calculation throughout all intermediate steps. If the data is from a 
> > > single file the partition may be 1 and therefore it will only use one 
> > > machine. The most recent Mahout snapshot (therefore the next release) 
> > > allows you to pass in the partitioning for each event pair (this is only 
> > > in the library use, not CLI). To get this effect in the current release, 
> > > try splitting the input into multiple files.
> > >
> > > I’m probably the one that reported the 10x speedup and used input from 
> > > Kafka DStreams, which causes very small default partition sizes. Also 
> > > other comparisons for other calculations give a similar speedup result. 
> > > There is little question about Spark being much faster—when used the way 
> > > it is meant to be.
> > >
> > > I use Mahout as a library all the time in the Universal Recommender 
> > > implemented in Apache PredictionIO. As a library we get greater control 
> > > than the CLI. The CLI is really only a proof of concept, not really meant 
> > > for production.
> > >
> > > BTW there is a significant algorithm benefit of the code behind 
> > > spark-itemsimilarity that is probably more important than the speed 
> > > increase and that is Correlated Cross-Occurrence, which allows the use of 
> > > many indicators of user taste, not just the primary/conversion event, 
> > > which is all any other CF-style recommender that I know of can use.
> > >
> > >
> > > On Sep 22, 2016, at 1:49 AM, Arnau Sanchez  wrote:
> > >
> > > I've been using the Mahout itemsimilarity job for a while, with good 
> > > results. I read that the new spark-itemsimilarity job is typically 
> > > faster, by a factor of 10, so I wanted to give it a try. I must be doing 
> > > something wrong because, with the same EMR infrastructure, the spark job 
> > > is slower than the old one (6 min vs 16 min) working on the same data. I 
> > > took a small sample dataset (766k rating pairs) to compare numbers, this 
> > > is the result:
> > >
> > > Input ratings: http://download.zaudera.com/public/ratings
> > >
> > > Infrastructure: emr-4.7.2 (spark 1.6.2, mahout 0.12.2)
> > >
> > > Old itemsimilarity:
> > >
> > > $ mahout itemsimilarity --input ratings --output itemsimilarity 
> > > --booleanData TRUE --maxSimilaritiesPerItem 10 --similarityClassname 
> > > SIMILARITY_COOCCURRENCE
> > > [5m54s]
> > >
> > > (logs: http://download.zaudera.com/public/itemsimilarity.out)
> > >
> > > New spark-itemsimilarity:
> > >
> > > $ mahout spark-itemsimilarity --input ratings --output 
> > > spark-itemsimilarity --maxSimilaritiesPerItem 10 --master yarn-client
> > > [15m51s]
> > >
> > > (logs: http://download.zaudera.com/public/spark-itemsimilarity.out)
> > >
> > > Any ideas? Thanks!
> > >


Re: spark-itemsimilarity slower than itemsimilarity

2016-09-29 Thread Sebastian

Hi Arnau,

The links to your logfiles don't work for me unfortunately. Are you sure 
you correctly setup Spark? That can be a bit tricky in YARN settings, 
sometimes one machine idles around...


Best,
Sebastian

On 25.09.2016 18:01, Pat Ferrel wrote:

AWS EMR is usually not very well suited for Spark. Spark gets most of its 
speed from in-memory calculations. So to see speed gains you have to have 
enough memory. Also partitioning will help in many cases. If you read in data 
from a single file—that partitioning will usually follow the calculation 
throughout all intermediate steps. If the data is from a single file the 
partition may be 1 and therefore it will only use one machine. The most recent 
Mahout snapshot (therefore the next release) allows you to pass in the 
partitioning for each event pair (this is only in the library use, not CLI). To 
get this effect in the current release, try splitting the input into multiple 
files.

I’m probably the one that reported the 10x speedup and used input from Kafka 
DStreams, which causes very small default partition sizes. Also other 
comparisons for other calculations give a similar speedup result. There is 
little question about Spark being much faster—when used the way it is meant to 
be.

I use Mahout as a library all the time in the Universal Recommender implemented 
in Apache PredictionIO. As a library we get greater control than the CLI. The 
CLI is really only a proof of concept, not really meant for production.

BTW there is a significant algorithm benefit of the code behind 
spark-itemsimilarity that is probably more important than the speed increase 
and that is Correlated Cross-Occurrence, which allows the use of many 
indicators of user taste, not just the primary/conversion event, which is all 
any other CF-style recommender that I know of can use.


On Sep 22, 2016, at 1:49 AM, Arnau Sanchez  wrote:

I've been using the Mahout itemsimilarity job for a while, with good results. I 
read that the new spark-itemsimilarity job is typically faster, by a factor of 
10, so I wanted to give it a try. I must be doing something wrong because, with 
the same EMR infrastructure, the spark job is slower than the old one (6 min vs 
16 min) working on the same data. I took a small sample dataset (766k rating 
pairs) to compare numbers, this is the result:

Input ratings: http://download.zaudera.com/public/ratings

Infrastructure: emr-4.7.2 (spark 1.6.2, mahout 0.12.2)

Old itemsimilarity:

$ mahout itemsimilarity --input ratings --output itemsimilarity --booleanData 
TRUE --maxSimilaritiesPerItem 10 --similarityClassname SIMILARITY_COOCCURRENCE
[5m54s]

(logs: http://download.zaudera.com/public/itemsimilarity.out)

New spark-itemsimilarity:

$ mahout spark-itemsimilarity --input ratings --output spark-itemsimilarity 
--maxSimilaritiesPerItem 10 --master yarn-client
[15m51s]

(logs: http://download.zaudera.com/public/spark-itemsimilarity.out)

Any ideas? Thanks!



Re: spark-itemsimilarity slower than itemsimilarity

2016-09-29 Thread Arnau Sanchez
The scaling issues with EMR+Spark may explain the weird performance I am seeing 
with Mahout's spark-itemsimilarity. I compared the running times with different 
partitions: the more partitions I feed the job, the more parallel processes it 
creates in the nodes, the more RAM it uses (some 100GB in total), which looks 
promising... but the larger the elapsed time. With an input of 100k ratings:

1 part of 100k  -> 1m50.080s
5 parts of 20k  -> 2m03.376s
10 parts of 10k -> 2m48.722s
20 parts of 5k  -> 4m22.107s

This also happens for bigger inputs. With >= 1M ratings I am just not able to 
finish the task, it's so slow.

> > If you use the temp Spark strategy

We run a EMR cluster 24/7. With Spark we expected a faster data-processing so 
we could return more up-to-date recommendations.

Thanks for your hints, Pat, much appreciated.

On Wed, 28 Sep 2016 15:23:27 -0700 Pat Ferrel  wrote:

> The problem with EMR is that the Spark driver often needs to be as big as the 
> executors, and EMR does not handle that. EMR worked fine for Hadoop 
> MapReduce because the driver usually did not have to be scaled vertically. I 
> suppose you could say EMR would work but does not solve the whole scaling 
> problem.
> 
> What we use at my consulting group is custom orchestration code written in 
> Terraform that creates a driver of the same size as the Spark executors. We 
> then install Spark using Docker and Docker Swarm. Since the driver machine 
> needs to have your application on it this will always require some custom 
> installation. For storage HDFS is used and it needs to be permanent whereas 
> Spark may only be needed periodically. We save money by creating Spark 
> Executors and Driver, using them to store to a permanent HDFS, then 
> destroying the Spark Driver and Executors. This way model creation with 
> spark-itemsimilarity or some app that uses Mahout as a Library, will only be 
> paid for at some small duty cycle. 
> 
> If you use the temp Spark strategy and create a model every week and need 4 
> r3.8xlarge to do it in 1 hour you only pay 1/168th of what you would for a 
> permanent cluster. This brings the cost to a quite reasonable range. You are 
> very unlikely to need machines that large anyway but you could afford it if 
> you only pay for the time they are actually used.
> 
>  
> On Sep 26, 2016, at 12:30 AM, Arnau Sanchez  wrote:
> 
> On Sun, 25 Sep 2016 09:01:43 -0700 Pat Ferrel  wrote:
> 
> > AWS EMR is usually not very well suited for Spark.  
> 
> What infrastructure would you recommend? Some EC2 instances provide lots of 
> memory (though maybe not with the most competitive price: r3.8xlarge, 244Gb 
> RAM).
> 
> My fault, I forgot to specify my original EMR setup: MASTER m3.xlarge (15Gb), 
> 2 CORE r3.xlarge (30.5Gb), 2 TASK c4.xlarge (7.5Gb).
> 
> > If the data is from a single file the partition may be 1 and therefore it 
> > will only use one machine.   
> 
> Indeed, I experienced that also for MR itemsimilarity, it yielded different 
> times -and results- for different partitions. I'll do more tests on that. 
> 
> > The CLI is really only a proof of concept, not really meant for production. 
> >  
> 
> Noted.
> 
> > BTW there is a significant algorithm benefit of the code behind 
> > spark-itemsimilarity that is probably more important than the speed 
> > increase and that is Correlated Cross-Occurrence  
> 
> Great! I have yet to compare improvements in the recommendations themselves, 
> I'll have this in mind.
> 
> Thanks for your help.


Re: spark-itemsimilarity slower than itemsimilarity

2016-09-28 Thread Pat Ferrel
The problem with EMR is that the Spark driver often needs to be as big as the 
executors, and EMR does not handle that. EMR worked fine for Hadoop 
MapReduce because the driver usually did not have to be scaled vertically. I 
suppose you could say EMR would work but does not solve the whole scaling 
problem.

What we use at my consulting group is custom orchestration code written in 
Terraform that creates a driver of the same size as the Spark executors. We 
then install Spark using Docker and Docker Swarm. Since the driver machine 
needs to have your application on it this will always require some custom 
installation. For storage HDFS is used and it needs to be permanent whereas 
Spark may only be needed periodically. We save money by creating Spark 
Executors and Driver, using them to store to a permanent HDFS, then destroying 
the Spark Driver and Executors. This way model creation with 
spark-itemsimilarity or some app that uses Mahout as a Library, will only be 
paid for at some small duty cycle. 

If you use the temp Spark strategy and create a model every week and need 4 
r3.8xlarge to do it in 1 hour you only pay 1/168th of what you would for a 
permanent cluster. This brings the cost to a quite reasonable range. You are 
very unlikely to need machines that large anyway but you could afford it if you 
only pay for the time they are actually used.

 
On Sep 26, 2016, at 12:30 AM, Arnau Sanchez  wrote:

On Sun, 25 Sep 2016 09:01:43 -0700 Pat Ferrel  wrote:

> AWS EMR is usually not very well suited for Spark.

What infrastructure would you recommend? Some EC2 instances provide lots of 
memory (though maybe not with the most competitive price: r3.8xlarge, 244Gb 
RAM).

My fault, I forgot to specify my original EMR setup: MASTER m3.xlarge (15Gb), 2 
CORE r3.xlarge (30.5Gb), 2 TASK c4.xlarge (7.5Gb).

> If the data is from a single file the partition may be 1 and therefore it will 
> only use one machine. 

Indeed, I experienced that also for MR itemsimilarity, it yielded different 
times -and results- for different partitions. I'll do more tests on that. 

> The CLI is really only a proof of concept, not really meant for production.

Noted.

> BTW there is a significant algorithm benefit of the code behind 
> spark-itemsimilarity that is probably more important than the speed increase 
> and that is Correlated Cross-Occurrence

Great! I have yet to compare improvements in the recommendations themselves, 
I'll have this in mind.

Thanks for your help.



Re: spark-itemsimilarity slower than itemsimilarity

2016-09-26 Thread Arnau Sanchez
On Sun, 25 Sep 2016 09:01:43 -0700 Pat Ferrel  wrote:

> AWS EMR is usually not very well suited for Spark.

What infrastructure would you recommend? Some EC2 instances provide lots of 
memory (though maybe not with the most competitive price: r3.8xlarge, 244Gb 
RAM).

My fault, I forgot to specify my original EMR setup: MASTER m3.xlarge (15Gb), 2 
CORE r3.xlarge (30.5Gb), 2 TASK c4.xlarge (7.5Gb).

> If the data is from a single file the partition may be 1 and therefore it will 
> only use one machine. 

Indeed, I experienced that also for MR itemsimilarity, it yielded different 
times -and results- for different partitions. I'll do more tests on that. 

> The CLI is really only a proof of concept, not really meant for production.

Noted.

> BTW there is a significant algorithm benefit of the code behind 
> spark-itemsimilarity that is probably more important than the speed increase 
> and that is Correlated Cross-Occurrence

Great! I have yet to compare improvements in the recommendations themselves, 
I'll have this in mind.

Thanks for your help.


Re: spark-itemsimilarity slower than itemsimilarity

2016-09-25 Thread Pat Ferrel
AWS EMR is usually not very well suited for Spark. Spark gets most of its 
speed from in-memory calculations. So to see speed gains you have to have 
enough memory. Also partitioning will help in many cases. If you read in data 
from a single file—that partitioning will usually follow the calculation 
throughout all intermediate steps. If the data is from a single file the 
partition may be 1 and therefore it will only use one machine. The most recent 
Mahout snapshot (therefore the next release) allows you to pass in the 
partitioning for each event pair (this is only in the library use, not CLI). To 
get this effect in the current release, try splitting the input into multiple 
files.
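
Concretely, in the current release that just means splitting the file before putting it on HDFS, something along these lines (the paths and the split size are only examples):

$ mkdir -p ratings-split
$ split -l 100000 -d ratings ratings-split/part-
$ hdfs dfs -copyFromLocal ratings-split
$ mahout spark-itemsimilarity --input ratings-split/ --output spark-itemsimilarity \
    --maxSimilaritiesPerItem 10 --master yarn-client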

I’m probably the one that reported the 10x speedup and used input from Kafka 
DStreams, which causes very small default partition sizes. Also other 
comparisons for other calculations give a similar speedup result. There is 
little question about Spark being much faster—when used the way it is meant to 
be.

I use Mahout as a library all the time in the Universal Recommender implemented 
in Apache PredictionIO. As a library we get greater control than the CLI. The 
CLI is really only a proof of concept, not really meant for production.

BTW there is a significant algorithm benefit of the code behind 
spark-itemsimilarity that is probably more important than the speed increase 
and that is Correlated Cross-Occurrence, which allows the use of many 
indicators of user taste, not just the primary/conversion event, which is all 
any other CF-style recommender that I know of can use.


On Sep 22, 2016, at 1:49 AM, Arnau Sanchez  wrote:

I've been using the Mahout itemsimilarity job for a while, with good results. I 
read that the new spark-itemsimilarity job is typically faster, by a factor of 
10, so I wanted to give it a try. I must be doing something wrong because, with 
the same EMR infrastructure, the spark job is slower than the old one (6 min vs 
16 min) working on the same data. I took a small sample dataset (766k rating 
pairs) to compare numbers, this is the result:

Input ratings: http://download.zaudera.com/public/ratings

Infrastructure: emr-4.7.2 (spark 1.6.2, mahout 0.12.2)

Old itemsimilarity:

$ mahout itemsimilarity --input ratings --output itemsimilarity --booleanData 
TRUE --maxSimilaritiesPerItem 10 --similarityClassname SIMILARITY_COOCCURRENCE
[5m54s]

(logs: http://download.zaudera.com/public/itemsimilarity.out)

New spark-itemsimilarity:

$ mahout spark-itemsimilarity --input ratings --output spark-itemsimilarity 
--maxSimilaritiesPerItem 10 --master yarn-client
[15m51s]

(logs: http://download.zaudera.com/public/spark-itemsimilarity.out)

Any ideas? Thanks!



spark-itemsimilarity slower than itemsimilarity

2016-09-22 Thread Arnau Sanchez
I've been using the Mahout itemsimilarity job for a while, with good results. I 
read that the new spark-itemsimilarity job is typically faster, by a factor of 
10, so I wanted to give it a try. I must be doing something wrong because, with 
the same EMR infrastructure, the spark job is slower than the old one (6 min vs 
16 min) working on the same data. I took a small sample dataset (766k rating 
pairs) to compare numbers, this is the result:

Input ratings: http://download.zaudera.com/public/ratings

Infrastructure: emr-4.7.2 (spark 1.6.2, mahout 0.12.2)

Old itemsimilarity:

$ mahout itemsimilarity --input ratings --output itemsimilarity --booleanData 
TRUE --maxSimilaritiesPerItem 10 --similarityClassname SIMILARITY_COOCCURRENCE
[5m54s]

(logs: http://download.zaudera.com/public/itemsimilarity.out)

New spark-itemsimilarity:

$ mahout spark-itemsimilarity --input ratings --output spark-itemsimilarity 
--maxSimilaritiesPerItem 10 --master yarn-client
[15m51s]

(logs: http://download.zaudera.com/public/spark-itemsimilarity.out)

Any ideas? Thanks!