The problem with EMR is that the Spark driver often needs to be as large as the 
executors, and EMR does not handle that. EMR worked fine for Hadoop MapReduce 
because the driver rarely had to be scaled vertically. I suppose you could say 
EMR would work, but it does not solve the whole scaling problem.

What we use at my consulting group is custom orchestration code written in 
Terraform that creates a driver of the same size as the Spark executors. We 
then install Spark using Docker and Docker Swarm. Since the driver machine 
needs to have your application on it, this will always require some custom 
installation. For storage we use HDFS, which needs to be permanent, whereas 
Spark may only be needed periodically. We save money by creating the Spark 
executors and driver, using them to write to the permanent HDFS, then 
destroying them. This way model creation with spark-itemsimilarity, or with 
some app that uses Mahout as a library, is only paid for at some small duty 
cycle.

If you use this temporary-Spark strategy and create a model every week, and you 
need 4 r3.8xlarge instances to do it in 1 hour, you pay only 1/168th of what a 
permanent cluster would cost (1 hour out of the 168 in a week). That brings the 
cost into a quite reasonable range. You are very unlikely to need machines that 
large anyway, but you could afford them if you only pay for the time they are 
actually used.
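A quick back-of-envelope check of that ratio (the ~$2.66/hr on-demand rate for r3.8xlarge is an assumption; plug in your region's price):

```python
# Duty-cycle cost comparison: permanent vs. ephemeral Spark cluster.

HOURLY_RATE = 2.66       # USD per r3.8xlarge per hour (assumed)
NODES = 4
HOURS_PER_WEEK = 168

# Permanent cluster: all nodes run all week.
permanent = NODES * HOURLY_RATE * HOURS_PER_WEEK

# Ephemeral cluster: nodes exist only for one 1-hour training run per week.
ephemeral = NODES * HOURLY_RATE * 1

print(f"permanent: ${permanent:.2f}/week")          # $1787.52/week
print(f"ephemeral: ${ephemeral:.2f}/week")          # $10.64/week
print(f"savings ratio: 1/{permanent / ephemeral:.0f}")  # 1/168
```

The ratio is independent of instance size and count, so the same 1/168 holds whatever hardware the job needs.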

On Sep 26, 2016, at 12:30 AM, Arnau Sanchez <pyar...@gmail.com> wrote:

On Sun, 25 Sep 2016 09:01:43 -0700 Pat Ferrel <p...@occamsmachete.com> wrote:

> AWS EMR is usually not very well suited for Spark.

What infrastructure would you recommend? Some EC2 instances provide lots of 
memory (though maybe not with the most competitive price: r3.8xlarge, 244Gb 
RAM).

My fault, I forgot to specify my original EMR setup: MASTER m3.xlarge (15Gb), 2 
CORE r3.xlarge (30.5Gb), 2 TASK c4.xlarge (7.5Gb).

> If the data is from a single file the partition may be 1 and therefor it will 
> only use one machine. 

Indeed, I saw that with MR itemsimilarity as well: it yielded different 
times (and results) for different numbers of partitions. I'll run more tests on 
that.

> The CLI is really only a proof of concept, not really meant for production.

Noted.

> BTW there is a significant algorithm benefit of the code behind 
> spark-itemsimilarity that is probably more important than the speed increase 
> and that is Correlated Cross-Occurrence

Great! I have yet to compare improvements in the recommendations themselves, 
I'll have this in mind.

Thanks for your help.
