Hi Reynold,
thanks for the response.
Yes, speculation mode needs some coordination.
Regarding job failure:
correct me if I'm wrong - if one of the jobs fails, the client code will be
"notified" by an exception or something similar, so the client can decide to
re-submit the action (job), i.e. it won't be a "silent" failure.
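To make that client-side pattern concrete, here is a minimal sketch in Scala. It only assumes that a failed action surfaces as an exception on the driver; `withResubmit` and `runAction` are illustrative names, not a Spark API:

```scala
// Sketch: a failed Spark action throws (e.g. a SparkException) back into
// driver code, so the client can catch it and decide to re-submit.
// `runAction` stands in for any action, e.g. () => rdd.saveAsTextFile(path).
object ResubmitSketch {
  def withResubmit[T](maxAttempts: Int)(runAction: () => T): T = {
    var attempt = 1
    while (true) {
      try {
        return runAction() // failure is NOT silent: the exception lands here
      } catch {
        case e: Exception if attempt < maxAttempts =>
          println(s"attempt $attempt failed: ${e.getMessage}; re-submitting")
          attempt += 1
      }
    }
    throw new IllegalStateException("unreachable")
  }
}
```

The point is simply that the decision to retry stays with the client: the failure propagates as an exception rather than being swallowed.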


On 26 February 2016 at 11:50, Reynold Xin <[email protected]> wrote:

> It could lose data in speculation mode, or if any job fails.
>
> On Fri, Feb 26, 2016 at 3:45 AM, Igor Berman <[email protected]>
> wrote:
>
>> Takeshi, do you know the reason why they wanted to remove this committer
>> in SPARK-10063?
>> The JIRA has no info inside.
>> As far as I understand, the direct committer can't be used when either of
>> the two is true:
>> 1. speculation mode
>> 2. append mode (i.e. not creating a new version of the data but appending
>> to existing data)
>>
>> On 26 February 2016 at 08:24, Takeshi Yamamuro <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> Great work!
>>> What is the concrete performance gain of the committer on s3?
>>> I'd like to know.
>>>
>>> I think there is no direct committer for files because these kinds of
>>> committers risk losing data (see SPARK-10063).
>>> Until this is resolved, it seems to me that files cannot support direct
>>> commits.
>>>
>>> thanks,
>>>
>>>
>>>
>>> On Fri, Feb 26, 2016 at 8:39 AM, Teng Qiu <[email protected]> wrote:
>>>
>>>> yes, should be this one
>>>> https://gist.github.com/aarondav/c513916e72101bbe14ec
>>>>
>>>> then you need to set it in spark-defaults.conf:
>>>> https://github.com/zalando/spark/commit/3473f3f1ef27830813c1e0b3686e96a55f49269c#diff-f7a46208be9e80252614369be6617d65R13
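For reference, the spark-defaults.conf entries look roughly like the sketch below. The property key and class name are assumptions based on the file path mentioned later in this thread (core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala) - verify them against the linked commit. Per the discussion above, speculation must stay off for a direct committer to be safe:

```properties
# Illustrative spark-defaults.conf fragment (check keys against the commit above)
# Route old-mapred-API output through the direct committer:
spark.hadoop.mapred.output.committer.class  org.apache.spark.mapred.DirectOutputCommitter
# Direct committers are unsafe with speculative execution (see SPARK-10063):
spark.speculation  false
```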
>>>>
>>>> On Friday, 26 February 2016, Yin Yang wrote:
>>>> > The header of DirectOutputCommitter.scala says Databricks.
>>>> > Did you get it from Databricks?
>>>> > On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu <[email protected]> wrote:
>>>> >>
>>>> >> interested in this topic as well - why is the DirectFileOutputCommitter
>>>> not included?
>>>> >> we added it in our fork,
>>>> under
>>>> core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala
>>>> >> moreover, this DirectFileOutputCommitter does not work for insert
>>>> operations in HiveContext, since the committer is called by hive
>>>> (meaning it uses dependencies from the hive package)
>>>> >> we made a hack to fix this, you can take a look:
>>>> >>
>>>> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
>>>> >>
>>>> >> it may give other spark contributors some ideas for finding a better
>>>> way to use s3.
>>>> >>
>>>> >> 2016-02-22 23:18 GMT+01:00 igor.berman <[email protected]>:
>>>> >>>
>>>> >>> Hi,
>>>> >>> Wanted to understand if anybody uses DirectFileOutputCommitter or
>>>> similar,
>>>> >>> especially when working with s3?
>>>> >>> I know that there is one impl in the spark distro for the parquet
>>>> format, but not
>>>> >>> for files - why?
>>>> >>>
>>>> >>> Imho, it can bring a huge performance boost.
>>>> >>> Using the default FileOutputCommitter with s3 has a big overhead at
>>>> the commit stage,
>>>> >>> when all parts are copied one-by-one to the destination dir from
>>>> _temporary,
>>>> >>> which is a bottleneck when the number of partitions is high.
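To illustrate where that overhead comes from: the default commit stage is, in essence, a per-file move of every task's output from _temporary into the destination dir - cheap as a rename on HDFS, but a full copy per object on s3. A simplified sketch (the directory layout and names here are illustrative; the real logic lives in Hadoop's FileOutputCommitter):

```scala
import java.nio.file.{Files, Path, StandardCopyOption}
import scala.jdk.CollectionConverters._

// Sketch of the commit stage: move each part file written under
// _temporary/<attempt>/ into the destination directory, one by one.
// On s3 each "move" is a copy + delete, so cost grows with partition count.
object CommitSketch {
  def commitJob(dest: Path): Seq[String] = {
    val tmp = dest.resolve("_temporary")
    val parts = Files.walk(tmp).iterator().asScala
      .filter(Files.isRegularFile(_))
      .toSeq.sortBy(_.getFileName.toString)
    parts.foreach { p =>
      // the per-partition step that becomes the bottleneck on s3
      Files.move(p, dest.resolve(p.getFileName), StandardCopyOption.REPLACE_EXISTING)
    }
    parts.map(_.getFileName.toString)
  }
}
```

A direct committer skips this whole loop by writing each part straight to the destination - which is exactly why a mid-write failure or a speculative duplicate task can leave partial data behind.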
>>>> >>>
>>>> >>> Also, wanted to know if there are some problems when using
>>>> >>> DirectFileOutputCommitter?
>>>> >>> If writing one partition directly fails in the middle, will spark
>>>> >>> notice this and fail the job (say, after all retries)?
>>>> >>>
>>>> >>> thanks in advance
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> --
>>>> >>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/DirectFileOutputCommiter-tp26296.html
>>>> >>> Sent from the Apache Spark User List mailing list archive at
>>>> Nabble.com.
>>>> >>>
>>>> >>>
>>>> ---------------------------------------------------------------------
>>>> >>> To unsubscribe, e-mail: [email protected]
>>>> >>> For additional commands, e-mail: [email protected]
>>>> >>>
>>>> >>
>>>> >
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>
