Re: Spark on AWS

Teng Qiu Sun, 01 May 2016 16:55:58 -0700

Hi, we have made several optimizations for accessing S3 from Spark:
https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando


such as:
https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando#diff-d579db9a8f27e0bbef37720ab14ec3f6R133

You can deploy our Spark package using our Docker image; simply run:

docker run -d --net=host \
           -e START_MASTER="true" \
           -e START_WORKER="true" \
           -e START_WEBAPP="true" \
           -e START_NOTEBOOK="true" \
           registry.opensource.zalan.do/bi/spark:1.6.2-6


A Jupyter notebook will then be running on port 8888.


have fun

Best,

Teng

2016-04-29 12:37 GMT+02:00 Steve Loughran <ste...@hortonworks.com>:
>
> On 28 Apr 2016, at 22:59, Alexander Pivovarov <apivova...@gmail.com> wrote:
>
> > Spark works well with S3 (read and write). However it's recommended to set
> > spark.speculation true (it's expected that some tasks fail if you read a
> > large S3 folder, so speculation should help)
>
>
>
> I must disagree.
>
> Speculative execution has >1 executor running the query, with whoever
> finishes first winning.
> However, "finishes first" is implemented in the output committer, by
> renaming the attempt's output directory to the final output directory:
> whoever renames first wins.
> This relies on rename() being implemented in the filesystem client as an
> atomic transaction.
> Unfortunately, S3 doesn't do renames. Instead, every file gets copied to
> one with the new name, then the old file is deleted: an operation that
> takes time O(data * files)
>
> If you have more than one executor trying to commit the work simultaneously,
> your output will be a mess of both executions, without anything detecting and
> reporting it.
>
> Where did you find this recommendation to set speculation=true?
>
> -Steve
>
> see also: https://issues.apache.org/jira/browse/SPARK-10063
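
The commit behaviour described in the quoted message can be illustrated with a small local-filesystem sketch. It is not Spark's or Hadoop's committer code: the helper names (write_attempt, commit_by_rename, copy_one, demo) and the file names are made up for illustration, and the interleaving of the two copy-based attempts is scripted by hand rather than produced by real concurrency. The first half commits two attempts with a single directory rename, so only the first one wins; the second half emulates a store without rename by copying files one at a time, and the result is a mix of both attempts.

```python
import os
import shutil
import tempfile


def write_attempt(root, name, parts):
    """Create an attempt directory with one file per part, each tagged with the attempt name."""
    attempt_dir = os.path.join(root, name)
    os.makedirs(attempt_dir)
    for part in parts:
        with open(os.path.join(attempt_dir, part), "w") as f:
            f.write(name)
    return attempt_dir


def commit_by_rename(attempt_dir, final_dir):
    """Commit with a single directory rename; return True if this attempt won."""
    try:
        os.rename(attempt_dir, final_dir)
        return True
    except OSError:
        # The destination already exists and is non-empty:
        # another attempt committed first, so this one loses.
        return False


def copy_one(attempt_dir, final_dir, part):
    """One step of a copy-based 'rename': copy a single file into the destination."""
    os.makedirs(final_dir, exist_ok=True)
    shutil.copyfile(os.path.join(attempt_dir, part), os.path.join(final_dir, part))


def read_output(final_dir):
    """Map each output file name to the attempt name stored inside it."""
    result = {}
    for part in sorted(os.listdir(final_dir)):
        with open(os.path.join(final_dir, part)) as f:
            result[part] = f.read()
    return result


def demo():
    parts = ["part-00000", "part-00001"]
    with tempfile.TemporaryDirectory() as root:
        # Rename-based commit: the first rename wins, the second is rejected.
        a = write_attempt(root, "attempt_a", parts)
        b = write_attempt(root, "attempt_b", parts)
        out_rename = os.path.join(root, "out_rename")
        won = [commit_by_rename(a, out_rename), commit_by_rename(b, out_rename)]
        renamed = read_output(out_rename)

        # Copy-based commit with a hand-scripted interleaving of two attempts.
        c = write_attempt(root, "attempt_c", parts)
        d = write_attempt(root, "attempt_d", parts)
        out_copy = os.path.join(root, "out_copy")
        copy_one(c, out_copy, parts[0])
        copy_one(d, out_copy, parts[0])  # overwrites attempt_c's file
        copy_one(d, out_copy, parts[1])
        copy_one(c, out_copy, parts[1])  # overwrites attempt_d's file
        copied = read_output(out_copy)
    return won, renamed, copied


if __name__ == "__main__":
    print(demo())
```

In the rename case the losing attempt gets an error it can act on. In the copy case nothing fails, which matches the point above that the mixed output goes undetected.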

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
