Hi! I tried setting spark.task.maxFailures to 1 (with this patch applied: https://github.com/apache/incubator-spark/pull/245) and started a job. After some time, I killed all the JVMs running on one of the two workers. I expected the Spark job to fail; instead, it resubmitted the tasks to the worker that was still alive, and the job succeeded.
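
For completeness, this is roughly how I am setting it (a minimal sketch in the system-property style from the configuration docs; the property has to be set before the SparkContext is created, and the master URL and app name here are just placeholders):

    import org.apache.spark.SparkContext

    object FailFastTest {
      def main(args: Array[String]) {
        // Must be set before the SparkContext is constructed,
        // otherwise the scheduler keeps the default (4).
        System.setProperty("spark.task.maxFailures", "1")

        // Placeholder master URL and app name.
        val sc = new SparkContext("spark://master:7077", "fail-fast-test")

        // ... job body ...

        sc.stop()
      }
    }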
Is there some other way I can make a Spark job fail fast?

Grega
--
*Grega Kešpret*
Analytics engineer

Celtra — Rich Media Mobile Advertising
celtra.com | @celtramobile

On Thu, Nov 28, 2013 at 5:50 PM, Grega Kešpret <[email protected]> wrote:

> Thanks!
>
> Grega
>
> On Thu, Nov 28, 2013 at 3:40 PM, Prashant Sharma <[email protected]> wrote:
>
>> Did you mean spark.task.maxFailures?
>> http://spark.incubator.apache.org/docs/latest/configuration.html
>>
>> On Thu, Nov 28, 2013 at 7:58 PM, Grega Kešpret <[email protected]> wrote:
>>
>>> Bumping this thread, so it gets attention.
>>>
>>> Grega
>>>
>>> On Tue, Nov 26, 2013 at 12:26 PM, Grega Kešpret <[email protected]> wrote:
>>>
>>>> Also, is there a way to tell Spark that it should not resubmit
>>>> failed stages/tasks, but fail fast in case any fetch failure occurs?
>>>>
>>>> Grega
>>>>
>>>> On Mon, Nov 25, 2013 at 9:58 AM, Grega Kešpret <[email protected]> wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> We use Spark to process logs in batches and persist the end result
>>>>> in a db. Last week, we re-ran the job on the same data a couple of
>>>>> times, only to find that one run had more results than the rest.
>>>>> Digging through the logs, we found out that a task had been lost
>>>>> and marked for resubmission.
>>>>>
>>>>> I marked the lines here:
>>>>> https://gist.github.com/gregakespret/7541805#file-spark-fetch-failure-L1432-L1509
>>>>>
>>>>> Because of that, one block of data was processed twice and the
>>>>> final result was not correct.
>>>>>
>>>>> My question is: how can we catch such occurrences in the code, so
>>>>> that we can do an effective rollback and discard the data that will
>>>>> get recomputed?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Grega
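
P.S. Regarding the rollback question in the first message below: one thing we are considering on our side is making the db writes idempotent, so that a recomputed partition overwrites the rows written by the earlier attempt instead of appending new ones. A rough sketch (logs is the input RDD; process, openConnection and upsert are placeholders for our own code, not Spark API):

    val written = logs.mapPartitionsWithIndex { (partition, records) =>
      val db = openConnection()  // placeholder for our db layer
      var count = 0
      records.zipWithIndex.foreach { case (record, offset) =>
        // Deterministic key: the same input record always maps to the
        // same row, however many times this partition is recomputed, so
        // an upsert (e.g. INSERT ... ON DUPLICATE KEY UPDATE) cannot
        // double-count it.
        val rowKey = partition + "-" + offset
        db.upsert(rowKey, process(record))
        count += 1
      }
      db.close()
      Iterator(count)
    }
    println("rows written: " + written.reduce(_ + _))

This only helps as long as the keys are deterministic across attempts (i.e. a recomputed partition reads the same input in the same order), so it is a mitigation for the double-counting, not a true rollback.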
