On Thu, Jan 23, 2014 at 12:44 AM, Patrick Wendell <[email protected]> wrote:
> What makes you think it isn't retrying the task?

Because the output misses some rows.

> By default it tries three times... it only prints the error once though.
> In this case, if your cluster doesn't have any datanodes, it's likely that
> it failed several times.

You are probably right. So in the missing-rows case, Spark tried 3 times and
gave up. Still, in the web UI there is no way to tell whether a failed task
was correctly executed after a retry or was given up on. I'm not sure if this
can be detected in the log.

> On Wed, Jan 22, 2014 at 4:04 PM, Aureliano Buendia <[email protected]> wrote:
> > Hi,
> >
> > I've written about this issue before, but there was no reply.
> >
> > It seems that when a task fails due to Hadoop IO errors, Spark does not
> > retry that task, and only reports it as a failed task, carrying on with
> > the other tasks. As an example:
> >
> > WARN ClusterTaskSetManager: Loss was due to java.io.IOException
> > java.io.IOException: All datanodes x.x.x.x:50010 are bad. Aborting...
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2100(DFSClient.java:2589)
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2793)
> >
> > I think almost all Spark applications need to have 0 failed tasks in
> > order to produce a meaningful result.
> >
> > These IO errors are not usually repeatable, and they might not occur
> > after a retry. Is there a setting in Spark to enforce a retry upon such
> > failed tasks?
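
For reference, the per-task retry limit mentioned above is controlled by the
"spark.task.maxFailures" configuration property. A minimal sketch of raising
it, assuming a Spark version that supports SparkConf (0.9 or later); the value
8 is only illustrative:

    // Raise the number of allowed failures per task before the job is aborted.
    // Property name is real; the chosen value and app name are placeholders.
    import org.apache.spark.{SparkConf, SparkContext}

    object RetryConfigExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("retry-config-example")
          // How many times a single task may fail before Spark gives up on it.
          .set("spark.task.maxFailures", "8")

        val sc = new SparkContext(conf)
        // ... run jobs as usual; tasks hitting transient HDFS errors now get
        // more attempts before the stage is aborted.
        sc.stop()
      }
    }

Note this only raises the retry budget; it does not help with distinguishing,
in the web UI, tasks that eventually succeeded from tasks that were abandoned.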
