Hi Reynold, thanks for the response. Yes, speculation mode needs some coordination. Regarding job failure, correct me if I'm wrong: if one of the jobs fails, the client code will be "notified" by an exception (or something similar), so the client can decide to re-submit the action (job); i.e., it won't be a "silent" failure.
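The re-submit pattern described above can be sketched generically (a toy illustration, not Spark API -- `action` stands in for any Spark action such as `rdd.saveAsTextFile(...)`, which surfaces job failure to the driver as an exception after all task retries are exhausted):

```scala
// Toy sketch: a failed Spark job is not silent -- the action throws on the
// driver, so client code can catch the exception and decide to re-submit.
// Nothing here is Spark API; `action` models any action that throws on failure.
def submitWithRetry[T](maxAttempts: Int)(action: () => T): T = {
  var attempt = 0
  var result: Option[T] = None
  var lastError: Throwable = null
  while (result.isEmpty && attempt < maxAttempts) {
    attempt += 1
    try result = Some(action())
    catch { case e: Exception => lastError = e } // job failed "loudly"
  }
  // Give up after maxAttempts: propagate the failure, never swallow it.
  result.getOrElse(throw lastError)
}
```

The point is only that the client stays in control: the failure reaches it as an exception, and re-submission is a deliberate choice, not automatic recovery of half-written output.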
On 26 February 2016 at 11:50, Reynold Xin <[email protected]> wrote:

> It could lose data in speculation mode, or if any job fails.
>
> On Fri, Feb 26, 2016 at 3:45 AM, Igor Berman <[email protected]> wrote:
>
>> Takeshi, do you know the reason why they wanted to remove this committer
>> in SPARK-10063? The JIRA has no info inside.
>> As far as I understand, the direct committer can't be used when either of
>> these two is true:
>> 1. speculation mode
>> 2. append mode (i.e. not creating a new version of the data but appending
>> to existing data)
>>
>> On 26 February 2016 at 08:24, Takeshi Yamamuro <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Great work!
>>> What is the concrete performance gain of the committer on s3?
>>> I'd like to know.
>>>
>>> I think there is no direct committer for files because these kinds of
>>> committers carry a risk of data loss (see: SPARK-10063).
>>> Until this is resolved, ISTM files cannot support direct commits.
>>>
>>> thanks,
>>>
>>> On Fri, Feb 26, 2016 at 8:39 AM, Teng Qiu <[email protected]> wrote:
>>>
>>>> yes, should be this one:
>>>> https://gist.github.com/aarondav/c513916e72101bbe14ec
>>>>
>>>> then it needs to be set in spark-defaults.conf:
>>>> https://github.com/zalando/spark/commit/3473f3f1ef27830813c1e0b3686e96a55f49269c#diff-f7a46208be9e80252614369be6617d65R13
>>>>
>>>> On Friday, 26 February 2016, Yin Yang wrote:
>>>> > The header of DirectOutputCommitter.scala says Databricks.
>>>> > Did you get it from Databricks?
>>>> > On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu <[email protected]> wrote:
>>>> >>
>>>> >> interested in this topic as well -- why is the DirectFileOutputCommitter
>>>> >> not included?
>>>> >> we added it in our fork, under
>>>> >> core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala
>>>> >> moreover, this DirectFileOutputCommitter does not work for insert
>>>> >> operations in HiveContext, since the committer is called by Hive
>>>> >> (i.e. it uses dependencies from the hive package).
>>>> >> we made a hack to fix this; you can take a look:
>>>> >> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
>>>> >>
>>>> >> it may give other Spark contributors some ideas for a better way to use s3.
>>>> >>
>>>> >> 2016-02-22 23:18 GMT+01:00 igor.berman <[email protected]>:
>>>> >>>
>>>> >>> Hi,
>>>> >>> I wanted to know whether anybody uses DirectFileOutputCommitter or the
>>>> >>> like, especially when working with s3.
>>>> >>> I know that there is one impl in the Spark distro for the Parquet
>>>> >>> format, but not for files -- why?
>>>> >>>
>>>> >>> Imho, it can bring a huge performance boost.
>>>> >>> Using the default FileOutputCommitter with s3 has big overhead at the
>>>> >>> commit stage, when all parts are copied one by one from _temporary to
>>>> >>> the destination dir, which is a bottleneck when the number of
>>>> >>> partitions is high.
>>>> >>>
>>>> >>> Also, I wanted to know whether there are any problems when using
>>>> >>> DirectFileOutputCommitter:
>>>> >>> if writing one partition directly fails in the middle, will Spark
>>>> >>> notice this and fail the job (say, after all retries)?
>>>> >>>
>>>> >>> thanks in advance
>>>> >>>
>>>> >>> --
>>>> >>> View this message in context:
>>>> >>> http://apache-spark-user-list.1001560.n3.nabble.com/DirectFileOutputCommiter-tp26296.html
>>>> >>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
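For reference, the spark-defaults.conf setting Teng points at would look something like the fragment below. The class name is taken from the fork path quoted above (core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala); the exact property depends on which output API is in use, so treat this as a sketch of the linked commit, not a verified config:

```
# Sketch: route the old mapred API to the direct committer from the fork above.
# The spark.hadoop.* prefix forwards the property into the Hadoop Configuration.
spark.hadoop.mapred.output.committer.class  org.apache.spark.mapred.DirectOutputCommitter
```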
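The commit-stage overhead igor.berman describes can be modeled with plain file operations (a self-contained toy using java.nio, not the real Hadoop OutputCommitter API): the default committer writes each part under _temporary and moves it to the destination at commit time -- and on S3 a "move" is itself a full copy plus delete, once per partition -- while a direct committer writes straight to the destination and skips that step, at the cost of possibly leaving partial files behind on failure:

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

// Toy model of the two commit strategies (plain local files, not Hadoop API).

// Default FileOutputCommitter style: write under _temporary, move at commit.
// On S3 this per-part move is a copy + delete -- the bottleneck in the thread.
def writeViaTemporary(dest: Path, part: String, data: String): Unit = {
  val tmpDir = dest.resolve("_temporary")
  Files.createDirectories(tmpDir)
  val tmpFile = Files.write(tmpDir.resolve(part), data.getBytes)
  Files.move(tmpFile, dest.resolve(part), StandardCopyOption.REPLACE_EXISTING)
}

// Direct committer style: write straight to the final location, no commit copy.
// Trade-off: a task that dies mid-write leaves a partial file in dest, which
// is exactly the data-loss risk cited in SPARK-10063.
def writeDirect(dest: Path, part: String, data: String): Unit = {
  Files.createDirectories(dest)
  Files.write(dest.resolve(part), data.getBytes)
}
```

With N partitions the first strategy performs N extra moves at commit time, which is why the thread reports the gain growing with the number of partitions.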
