Hi,

I've written about this issue before, but there was no reply.

It seems that when a task fails due to Hadoop I/O errors, Spark does not retry
that task; it only reports it as a failed task and carries on with the other
tasks. As an example:

WARN ClusterTaskSetManager: Loss was due to java.io.IOException
java.io.IOException: All datanodes x.x.x.x:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2100(DFSClient.java:2589)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2793)


I think almost all Spark applications need to have zero failed tasks in order
to produce a meaningful result.

These I/O errors are usually transient and would likely not recur on a retry.
Is there a setting in Spark to enforce a retry of such failed tasks?
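For concreteness, the kind of thing I'm hoping for is sketched below. I'm
assuming a configuration property like spark.task.maxFailures governs how many
times a task is re-attempted before the stage is aborted; the property name,
the default value, and the app name are my guesses, so please correct me if
the actual knob is different:

    import org.apache.spark.{SparkConf, SparkContext}

    // Rough sketch: raise the assumed per-task retry limit so that a
    // transient HDFS error does not immediately count the task as failed.
    val conf = new SparkConf()
      .setAppName("hdfs-retry-example")      // placeholder app name
      .set("spark.task.maxFailures", "8")    // assumed property; default may be 4
    val sc = new SparkContext(conf)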
