Hi! I tried setting spark.task.maxFailures to 1 (with this patch applied: https://github.com/apache/incubator-spark/pull/245) and started a job. After some time, I killed all the JVMs running on one of the two workers. I expected the Spark job to fail; instead, it resubmitted the tasks to the worker that was still alive, and the job succeeded.
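
For completeness, this is roughly how I am setting it (a minimal sketch in the system-property style from the configuration docs; the property has to be set before the SparkContext is created, and the master URL and app name here are just placeholders):

    import org.apache.spark.SparkContext

    object FailFastTest {
      def main(args: Array[String]) {
        // Must be set before the SparkContext is constructed,
        // otherwise the scheduler keeps the default (4).
        System.setProperty("spark.task.maxFailures", "1")

        // Placeholder master URL and app name.
        val sc = new SparkContext("spark://master:7077", "fail-fast-test")

        // ... job body ...

        sc.stop()
      }
    }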
Is there some other way I can make a Spark job fail fast?

Grega
--
*Grega Kešpret*
Analytics engineer

Celtra — Rich Media Mobile Advertising
celtra.com | @celtramobile

On Thu, Nov 28, 2013 at 5:50 PM, Grega Kešpret <[email protected]> wrote:

> Thanks!
>
> Grega
>
> On Thu, Nov 28, 2013 at 3:40 PM, Prashant Sharma <[email protected]> wrote:
>
>> Did you mean spark.task.maxFailures?
>> http://spark.incubator.apache.org/docs/latest/configuration.html
>>
>> On Thu, Nov 28, 2013 at 7:58 PM, Grega Kešpret <[email protected]> wrote:
>>
>>> Bumping this thread, so it gets attention.
>>>
>>> Grega
>>>
>>> On Tue, Nov 26, 2013 at 12:26 PM, Grega Kešpret <[email protected]> wrote:
>>>
>>>> Also, is there a way to tell Spark that it should not resubmit
>>>> failed stages/tasks, but fail fast in case any fetch failure occurs?
>>>>
>>>> Grega
>>>>
>>>> On Mon, Nov 25, 2013 at 9:58 AM, Grega Kešpret <[email protected]> wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> We use Spark to process logs in batches and persist the end result
>>>>> in a db. Last week, we re-ran the job on the same data a couple of
>>>>> times, only to find that one run had more results than the rest.
>>>>> Digging through the logs, we found out that a task had been lost
>>>>> and marked for resubmission.
>>>>>
>>>>> I marked the lines here:
>>>>> https://gist.github.com/gregakespret/7541805#file-spark-fetch-failure-L1432-L1509
>>>>>
>>>>> Because of that, one block of data was processed twice and the
>>>>> final result was not correct.
>>>>>
>>>>> My question is: how can we catch such occurrences in the code, so
>>>>> that we can do an effective rollback and discard the data that will
>>>>> get recomputed?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Grega
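
P.S. Regarding the rollback question in the first message below: one thing we are considering on our side is making the db writes idempotent, so that a recomputed partition overwrites the rows written by the earlier attempt instead of appending new ones. A rough sketch (logs is the input RDD; process, openConnection and upsert are placeholders for our own code, not Spark API):

    val written = logs.mapPartitionsWithIndex { (partition, records) =>
      val db = openConnection()  // placeholder for our db layer
      var count = 0
      records.zipWithIndex.foreach { case (record, offset) =>
        // Deterministic key: the same input record always maps to the
        // same row, however many times this partition is recomputed, so
        // an upsert (e.g. INSERT ... ON DUPLICATE KEY UPDATE) cannot
        // double-count it.
        val rowKey = partition + "-" + offset
        db.upsert(rowKey, process(record))
        count += 1
      }
      db.close()
      Iterator(count)
    }
    println("rows written: " + written.reduce(_ + _))

This only helps as long as the keys are deterministic across attempts (i.e. a recomputed partition reads the same input in the same order), so it is a mitigation for the double-counting, not a true rollback.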
