Hi,

we have made several optimizations for accessing S3 from Spark:
https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
such as:
https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando#diff-d579db9a8f27e0bbef37720ab14ec3f6R133

You can deploy our Spark package using our Docker image:

    docker run -d --net=host \
      -e START_MASTER="true" \
      -e START_WORKER="true" \
      -e START_WEBAPP="true" \
      -e START_NOTEBOOK="true" \
      registry.opensource.zalan.do/bi/spark:1.6.2-6

A Jupyter notebook will be running on port 8888.

Have fun!

Best,
Teng

2016-04-29 12:37 GMT+02:00 Steve Loughran <ste...@hortonworks.com>:

>
> On 28 Apr 2016, at 22:59, Alexander Pivovarov <apivova...@gmail.com> wrote:
>
> Spark works well with S3 (read and write). However it's recommended to set
> spark.speculation true (it's expected that some tasks fail if you read large
> S3 folder, so speculation should help)
>
>
>
> I must disagree.
>
> Speculative execution has >1 executor running the query, with whoever
> finishes first winning.
> However, "finishes first" is implemented in the output committer, by
> renaming the attempt's output directory to the final output directory:
> whoever renames first wins.
> This relies on rename() being implemented in the filesystem client as an
> atomic transaction.
> Unfortunately, S3 doesn't do renames. Instead every file gets copied to one
> with the new name, then the old file is deleted; an operation that takes time
> O(data * files)
>
> If you have more than one executor trying to commit the work simultaneously,
> your output will be a mess of both executions, without anything detecting and
> reporting it.
>
> Where did you find this recommendation to set speculation=true?
>
> -Steve
>
> see also: https://issues.apache.org/jira/browse/SPARK-10063
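
To illustrate the commit race described in the quoted message, here is a toy Python sketch. It is not Spark or S3 code: the dict-backed "store", the key layout under `out/_temporary/attempt_N/`, and the helper names `rename_by_copy`, `make_store` and `run` are all made up for this example. It only models the claim that a "rename" done as copy-then-delete per file is not atomic, so two attempts committing at the same time can interleave.

```python
def rename_by_copy(store, src_prefix, dst_prefix):
    """Emulate a directory 'rename' on a store without rename: copy, then delete.

    Written as a generator that yields after every per-object step, so the
    caller can interleave two renames the way concurrent committers might.
    """
    for key in sorted(k for k in store if k.startswith(src_prefix)):
        store[dst_prefix + key[len(src_prefix):]] = store[key]  # copy object
        yield
        del store[key]  # delete the source object
        yield


def make_store():
    # Two task attempts (hypothetical layout) wrote the same part files.
    # The stored value records which attempt produced the file.
    return {
        "out/_temporary/attempt_0/part-00000": "attempt_0",
        "out/_temporary/attempt_0/part-00001": "attempt_0",
        "out/_temporary/attempt_1/part-00000": "attempt_1",
        "out/_temporary/attempt_1/part-00001": "attempt_1",
    }


def run(schedule):
    """Run both commits; `schedule` says whose step runs next ('A' or 'B')."""
    store = make_store()
    commits = {
        "A": rename_by_copy(store, "out/_temporary/attempt_0/", "out/"),
        "B": rename_by_copy(store, "out/_temporary/attempt_1/", "out/"),
    }
    for who in schedule:
        next(commits[who], None)  # advance that committer by one step
    return store


if __name__ == "__main__":
    # One commit after the other: output comes from a single attempt.
    print(run("AAAABBBB"))
    # Interleaved commits: output is a mix of both attempts.
    print(run("ABBBAABA"))
```

With the interleaved schedule, `out/part-00000` ends up coming from attempt_1 and `out/part-00001` from attempt_0, and nothing in the store records that the result is mixed. With an atomic rename only one attempt's directory would become the final output.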