On Thu, Jan 23, 2014 at 12:44 AM, Patrick Wendell <[email protected]> wrote:
> What makes you think it isn't retrying the task?

Because the output misses some rows.

> By default it tries three times... it only prints the error once though.
> In this case, if your cluster doesn't have any datanodes, it's likely that
> it failed several times.

You are probably right. So in the missing-rows case, Spark tried 3 times and
gave up. Still, in the web UI there is no way to tell whether a failed task
was correctly executed after a retry or was given up on. I'm not sure if this
can be detected in the log.

> On Wed, Jan 22, 2014 at 4:04 PM, Aureliano Buendia <[email protected]> wrote:
> > Hi,
> >
> > I've written about this issue before, but there was no reply.
> >
> > It seems that when a task fails due to Hadoop IO errors, Spark does not
> > retry that task, and only reports it as a failed task, carrying on with
> > the other tasks. As an example:
> >
> > WARN ClusterTaskSetManager: Loss was due to java.io.IOException
> > java.io.IOException: All datanodes x.x.x.x:50010 are bad. Aborting...
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2100(DFSClient.java:2589)
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2793)
> >
> > I think almost all Spark applications need to have 0 failed tasks in
> > order to produce a meaningful result.
> >
> > These IO errors are not usually repeatable, and they might not occur
> > after a retry. Is there a setting in Spark to enforce a retry upon such
> > failed tasks?
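
For reference, the per-task retry limit mentioned above is controlled by the
"spark.task.maxFailures" configuration property. A minimal sketch of raising
it, assuming a Spark version that supports SparkConf (0.9 or later); the value
8 is only illustrative:

    // Raise the number of allowed failures per task before the job is aborted.
    // Property name is real; the chosen value and app name are placeholders.
    import org.apache.spark.{SparkConf, SparkContext}

    object RetryConfigExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("retry-config-example")
          // How many times a single task may fail before Spark gives up on it.
          .set("spark.task.maxFailures", "8")

        val sc = new SparkContext(conf)
        // ... run jobs as usual; tasks hitting transient HDFS errors now get
        // more attempts before the stage is aborted.
        sc.stop()
      }
    }

Note this only raises the retry budget; it does not help with distinguishing,
in the web UI, tasks that eventually succeeded from tasks that were abandoned.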
