On Sun, 25 Sep 2016 09:01:43 -0700 Pat Ferrel <p...@occamsmachete.com> wrote:

> AWS EMR is usually not very well suited for Spark.

What infrastructure would you recommend? Some EC2 instances provide lots of
memory, though maybe not at the most competitive price (e.g. r3.8xlarge, with
244 GB RAM).

My fault, I forgot to specify my original EMR setup: MASTER m3.xlarge (15 GB),
2 CORE r3.xlarge (30.5 GB), 2 TASK c4.xlarge (7.5 GB).

> If the data is from a single file the partition may be 1 and therefore it will 
> only use one machine. 

Indeed, I saw that with MR itemsimilarity as well: it yielded different run
times, and different results, for different numbers of partitions. I'll do more
tests on that.

> The CLI is really only a proof of concept, not really meant for production.

Noted.

> BTW there is a significant algorithm benefit of the code behind 
> spark-itemsimilarity that is probably more important than the speed increase 
> and that is Correlated Cross-Occurrence

Great! I have yet to compare the quality of the recommendations themselves, but
I'll keep this in mind.

Thanks for your help.
