Hi,
I've written about this issue before, but there was no reply.
It seems that when a task fails due to Hadoop I/O errors, Spark does not retry
that task; it only reports it as a failed task and carries on with the other
tasks. As an example:
WARN ClusterTaskSetManager: Loss was due to java.io.IOException
java.io.IOException: All datanodes x.x.x.x:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2100(DFSClient.java:2589)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2793)
I think almost all Spark applications need zero failed tasks in order
to produce a meaningful result.
These I/O errors are usually not repeatable, and they might not occur again after
a retry. Is there a setting in Spark to enforce a retry of such failed tasks?