Hi,
I've written about this issue before, but there was no reply.
It seems that when a task fails due to Hadoop I/O errors, Spark does not retry
that task; it only reports it as a failed task and carries on with the other
tasks. As an example:
WARN ClusterTaskSetManager: Loss was due to java.io.IOException
java.io.IOException: All datanodes x.x.x.x:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2100(DFSClient.java:2589)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2793)
I think almost all Spark applications need zero failed tasks in order
to produce a meaningful result.
These I/O errors are usually not repeatable, and they might not occur again after
a retry. Is there a setting in Spark to enforce a retry of such failed tasks?