Hi,

I agree with Steve: just start with vanilla Spark on EMR.

See point #4 here for dynamic allocation of executors:
https://blogs.aws.amazon.com/bigdata/post/Tx6J5RM20WPG5V/Building-a-Recommendation-Engine-with-Spark-ML-on-Amazon-EMR-using-Zeppelin
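
For reference, here is a minimal sketch of the Spark properties involved in dynamic allocation, rendered as spark-submit flags. The property names are standard Spark settings; the executor bounds and the script name my_job.py are placeholder values I picked for illustration, not something taken from the linked post.

```python
# Standard Spark property names; the values are hypothetical placeholders.
dynamic_allocation = {
    "spark.dynamicAllocation.enabled": "true",
    # On YARN, dynamic allocation needs the external shuffle service.
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",    # assumed lower bound
    "spark.dynamicAllocation.maxExecutors": "20",   # assumed upper bound
}


def to_submit_flags(props):
    """Turn a dict of Spark properties into a flat list of --conf flags."""
    flags = []
    for key, value in sorted(props.items()):
        flags += ["--conf", "{}={}".format(key, value)]
    return flags


# my_job.py is a hypothetical application script.
command = ["spark-submit"] + to_submit_flags(dynamic_allocation) + ["my_job.py"]
print(" ".join(command))
```

The same key/value pairs could equally go into spark-defaults.conf instead of the command line.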

Note that with dynamic allocation of executors, jobs take a little while
to start running. You can therefore also pass a setting to EMR when the
cluster is created, so that executors are given the maximum possible
resources from the start: maximizeResourceAllocation, as described here:
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html
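
As a rough sketch, the configuration object for this setting can be built as below. It uses the "spark" classification with the maximizeResourceAllocation property from the EMR guide linked above; the helper name emr_spark_configurations is just something I made up for the example.

```python
import json


def emr_spark_configurations(maximize=True):
    """Build the EMR configuration list that toggles maximizeResourceAllocation.

    Hypothetical helper; only the classification and property name
    come from the EMR documentation.
    """
    return [
        {
            "Classification": "spark",
            "Properties": {
                "maximizeResourceAllocation": "true" if maximize else "false",
            },
        }
    ]


# Print the JSON so it can be saved to a file and given to EMR.
print(json.dumps(emr_spark_configurations(), indent=2))
```

The printed JSON can be saved to a file and supplied through the --configurations option of aws emr create-cluster when the cluster is launched.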

If you are loading a large amount of data onto the Spark master node for
graphing or exploratory analysis with Matlab, seaborn or bokeh, it is better
to increase the driver memory by recreating the Spark context.


Regards
Gourav Sengupta



On Mon, May 2, 2016 at 12:54 AM, Teng Qiu <teng...@gmail.com> wrote:

> Hi, here we made several optimizations for accessing s3 from spark:
>
> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
>
> such as:
>
> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando#diff-d579db9a8f27e0bbef37720ab14ec3f6R133
>
> you can deploy our Spark package using our Docker image, simply run:
>
> docker run -d --net=host \
>            -e START_MASTER="true" \
>            -e START_WORKER="true" \
>            -e START_WEBAPP="true" \
>            -e START_NOTEBOOK="true" \
>            registry.opensource.zalan.do/bi/spark:1.6.2-6
>
>
> a Jupyter notebook will be running on port 8888
>
>
> have fun
>
> Best,
>
> Teng
>
> 2016-04-29 12:37 GMT+02:00 Steve Loughran <ste...@hortonworks.com>:
> >
> > On 28 Apr 2016, at 22:59, Alexander Pivovarov <apivova...@gmail.com> wrote:
> >
> > Spark works well with S3 (read and write). However it's recommended to
> > set spark.speculation true (it's expected that some tasks fail if you
> > read a large S3 folder, so speculation should help)
> >
> >
> >
> > I must disagree.
> >
> > Speculative execution has >1 executor running the query, with whoever
> > finishes first winning.
> > however, "finishes first" is implemented in the output committer, by
> > renaming the attempt's output directory to the final output directory:
> > whoever renames first wins.
> > This relies on rename() being implemented in the filesystem client as an
> > atomic transaction.
> > Unfortunately, S3 doesn't do renames. Instead every file gets copied to
> > one with the new name, then the old file is deleted; an operation that
> > takes time O(data * files)
> >
> > If you have more than one executor trying to commit the work
> > simultaneously, your output will be a mess of both executions, without
> > anything detecting and reporting it.
> >
> > Where did you find this recommendation to set speculation=true?
> >
> > -Steve
> >
> > see also: https://issues.apache.org/jira/browse/SPARK-10063
