Hi, I agree with Steve: just start with vanilla Spark on EMR.
You can see point #4 here for dynamic allocation of executors:
https://blogs.aws.amazon.com/bigdata/post/Tx6J5RM20WPG5V/Building-a-Recommendation-Engine-with-Spark-ML-on-Amazon-EMR-using-Zeppelin

Note that with dynamic allocation of executors, jobs take a bit of time to start running. As an alternative, you can set maximizeResourceAllocation when starting the EMR cluster, so that the maximum possible resources are allocated to executors as soon as the cluster starts, as mentioned here:
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html

If you are trying to load a large amount of data onto the Spark master node for graphing or exploratory analysis using Matlab, seaborn or bokeh, it is better to increase the driver memory by recreating the Spark context.

Regards,
Gourav Sengupta

On Mon, May 2, 2016 at 12:54 AM, Teng Qiu <teng...@gmail.com> wrote:
> Hi, here we made several optimizations for accessing s3 from spark:
> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
>
> such as:
> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando#diff-d579db9a8f27e0bbef37720ab14ec3f6R133
>
> you can deploy our spark package using our docker image, just simply:
>
> docker run -d --net=host \
>   -e START_MASTER="true" \
>   -e START_WORKER="true" \
>   -e START_WEBAPP="true" \
>   -e START_NOTEBOOK="true" \
>   registry.opensource.zalan.do/bi/spark:1.6.2-6
>
> a jupyter notebook will be running on port 8888
>
> have fun
>
> Best,
> Teng
>
> 2016-04-29 12:37 GMT+02:00 Steve Loughran <ste...@hortonworks.com>:
> >
> > On 28 Apr 2016, at 22:59, Alexander Pivovarov <apivova...@gmail.com> wrote:
> >
> > Spark works well with S3 (read and write). However it's recommended to set
> > spark.speculation true (it's expected that some tasks fail if you read large
> > S3 folder, so speculation should help)
> >
> >
> > I must disagree.
> >
> > Speculative execution has >1 executor running the query, with whoever
> > finishes first winning.
> > However, "finishes first" is implemented in the output committer, by
> > renaming the attempt's output directory to the final output directory:
> > whoever renames first wins.
> > This relies on rename() being implemented in the filesystem client as an
> > atomic transaction.
> > Unfortunately, S3 doesn't do renames. Instead every file gets copied to one
> > with the new name, then the old file is deleted; an operation that takes time
> > O(data * files)
> >
> > If you have more than one executor trying to commit the work simultaneously,
> > your output will be a mess of both executions, without anything detecting and
> > reporting it.
> >
> > Where did you find this recommendation to set speculation=true?
> >
> > -Steve
> >
> > see also: https://issues.apache.org/jira/browse/SPARK-10063
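
For reference, the EMR options discussed at the top of this message could be written down as cluster configuration roughly as in the sketch below. It is only an illustration: the function names and the "4g" driver memory value are made up for this example, while the "spark" and "spark-defaults" classifications and the property names come from the EMR and Spark configuration docs. Following Steve's point about S3 commits, spark.speculation is left off.

```python
import json


def max_allocation_config():
    """EMR configuration asking the cluster to give executors the maximum
    possible resources at startup (hypothetical helper name)."""
    return [
        {
            "Classification": "spark",
            "Properties": {"maximizeResourceAllocation": "true"},
        }
    ]


def spark_defaults_config(driver_memory=None):
    """EMR configuration for dynamic allocation with speculation disabled
    (hypothetical helper name). driver_memory is optional, e.g. "4g"."""
    properties = {
        # Dynamic allocation needs the external shuffle service.
        "spark.dynamicAllocation.enabled": "true",
        "spark.shuffle.service.enabled": "true",
        # Kept off: concurrent commits to S3 are not safe without atomic rename.
        "spark.speculation": "false",
    }
    if driver_memory is not None:
        # Assumed example value; size it to the data pulled onto the driver.
        properties["spark.driver.memory"] = driver_memory
    return [{"Classification": "spark-defaults", "Properties": properties}]


if __name__ == "__main__":
    # Print JSON that could be saved to a file and supplied at cluster creation.
    print(json.dumps(max_allocation_config(), indent=2))
    print(json.dumps(spark_defaults_config(driver_memory="4g"), indent=2))
```

The printed JSON is the kind of document that can be passed when the cluster is created, for example through the --configurations option of aws emr create-cluster. Treat the two functions as alternatives to try separately, as described above.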