The scaling issues with EMR+Spark may explain the weird performance I am seeing 
with Mahout's spark-itemsimilarity. I compared running times with different 
partition counts: the more partitions I feed the job, the more parallel 
processes it creates on the nodes and the more RAM it uses (some 100 GB in 
total), which looks promising... but the elapsed time only grows. With an 
input of 100k ratings:

1 part of 100k  -> 1m50.080s
5 parts of 20k  -> 2m03.376s
10 parts of 10k -> 2m48.722s
20 parts of 5k  -> 4m22.107s

This also happens with bigger inputs. With >= 1M ratings the job is so slow 
that I am simply unable to finish it.
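The shape of those numbers suggests a roughly fixed per-partition cost (task scheduling, shuffle files) swamping any gain from extra parallelism on such a small input. A toy cost model (my own assumption about the mechanism, with made-up illustrative constants, not measurements from this cluster) reproduces the pattern:

```python
# Toy model (an assumption, not Spark internals): each partition adds a
# roughly fixed scheduling/shuffle overhead, while the useful work is
# split across the available cores. All constants are illustrative.

def modeled_elapsed(startup_s, work_s, overhead_s, n_partitions, n_cores):
    """Estimated job time: fixed startup + parallelized work + per-partition tax."""
    parallel_work = work_s / min(n_partitions, n_cores)
    return startup_s + parallel_work + overhead_s * n_partitions

# On a tiny input the overhead term dominates, so more partitions = slower:
for parts in (1, 5, 10, 20):
    print(parts, modeled_elapsed(startup_s=90, work_s=20, overhead_s=8,
                                 n_partitions=parts, n_cores=8))
```

With a large enough input the work term would dominate instead, which is why more partitions normally help rather than hurt.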

> > If you use the temp Spark strategy

We run an EMR cluster 24/7. With Spark we expected faster data processing, so 
we could return more up-to-date recommendations.
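That said, the temp-Spark strategy Pat describes in his mail below is dramatically cheaper than an always-on cluster. A quick sketch of his 1/168 arithmetic (the per-instance price is a placeholder, not a quoted AWS rate):

```python
# Back-of-envelope for the weekly-build-vs-always-on comparison.
# HOURLY_PRICE_USD is a placeholder on-demand rate; plug in current pricing.
HOURLY_PRICE_USD = 2.66
MACHINES = 4
HOURS_PER_WEEK = 7 * 24  # 168

weekly_temp_cost = MACHINES * HOURLY_PRICE_USD * 1             # one 1-hour build
weekly_always_on = MACHINES * HOURLY_PRICE_USD * HOURS_PER_WEEK

print(weekly_temp_cost, weekly_always_on, weekly_temp_cost / weekly_always_on)
```

Whatever the actual instance price, the ratio stays 1/168 because it cancels out.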

Thanks for your hints, Pat, much appreciated.

On Wed, 28 Sep 2016 15:23:27 -0700 Pat Ferrel <p...@occamsmachete.com> wrote:

> The problem with EMR is that the Spark driver often needs to be as big as the 
> executors, and that is not handled by EMR. EMR worked fine for Hadoop 
> MapReduce because the driver usually did not have to be scaled vertically. I 
> suppose you could say EMR works, but it does not solve the whole scaling 
> problem.
> 
> What we use at my consulting group is custom orchestration code written in 
> Terraform that creates a driver machine of the same size as the Spark 
> executors. We then install Spark using Docker and Docker Swarm. Since the 
> driver machine needs to have your application on it, this will always require 
> some custom installation. For storage we use HDFS, which needs to be 
> permanent, whereas Spark may only be needed periodically. We save money by 
> creating the Spark driver and executors, using them to write to the permanent 
> HDFS, then destroying them. This way, model creation with 
> spark-itemsimilarity, or with some app that uses Mahout as a library, is only 
> paid for at some small duty cycle. 
> 
> If you use the temp Spark strategy and create a model every week, needing 4 
> r3.8xlarge instances to do it in 1 hour, you only pay 1/168th of what you 
> would for a permanent cluster. This brings the cost into quite a reasonable 
> range. You are very unlikely to need machines that large anyway, but you 
> could afford them if you only pay for the time they are actually used.
> 
>  
> On Sep 26, 2016, at 12:30 AM, Arnau Sanchez <pyar...@gmail.com> wrote:
> 
> On Sun, 25 Sep 2016 09:01:43 -0700 Pat Ferrel <p...@occamsmachete.com> wrote:
> 
> > AWS EMR is usually not very well suited for Spark.  
> 
> What infrastructure would you recommend? Some EC2 instances provide lots of 
> memory, though maybe not at the most competitive price (r3.8xlarge, 244 GB 
> RAM).
> 
> My fault, I forgot to specify my original EMR setup: MASTER m3.xlarge (15 GB), 
> 2 CORE r3.xlarge (30.5 GB), 2 TASK c4.xlarge (7.5 GB).
> 
> > If the data is from a single file the partition count may be 1, and 
> > therefore it will only use one machine.   
> 
> Indeed, I saw that with MR itemsimilarity as well: it yielded different times 
> (and results) for different partition counts. I'll do more tests on that. 
> 
> > The CLI is really only a proof of concept, not really meant for production. 
> >  
> 
> Noted.
> 
> > BTW there is a significant algorithm benefit of the code behind 
> > spark-itemsimilarity that is probably more important than the speed 
> > increase and that is Correlated Cross-Occurrence  
> 
> Great! I have yet to compare improvements in the recommendations themselves; 
> I'll keep this in mind.
> 
> Thanks for your help.
