On Sun, 25 Sep 2016 09:01:43 -0700 Pat Ferrel <p...@occamsmachete.com> wrote:
> AWS EMR is usually not very well suited for Spark.

What infrastructure would you recommend? Some EC2 instances provide lots of memory, though maybe not at the most competitive price (r3.8xlarge, 244 GB RAM).

My fault, I forgot to specify my original EMR setup: MASTER m3.xlarge (15 GB), 2 CORE r3.xlarge (30.5 GB), 2 TASK c4.xlarge (7.5 GB).

> If the data is from a single file the partition may be 1 and therefor it will
> only use one machine.

Indeed, I saw that with MR itemsimilarity as well: it yielded different times, and different results, for different partitionings. I'll do more tests on that.

> The CLI is really only a proof of concept, not really meant for production.

Noted.

> BTW there is a significant algorithm benefit of the code behind
> spark-itemsimilarity that is probably more important than the speed increase
> and that is Correlated Cross-Occurrence

Great! I have yet to compare improvements in the recommendations themselves; I'll keep this in mind.

Thanks for your help.