Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
Oh, by default it's set to 0L. I'll try setting it to 3 immediately. Thanks for the help!

Jianshi

On Mon, Mar 16, 2015 at 11:32 PM, Jianshi Huang wrote:
> Thanks Shixiong!
>
> Very strange that our tasks were retried on the same executor again and
> again. I'll check spark.scheduler.executorTaskBlacklistTime.
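For reference, a minimal sketch of applying that setting through SparkConf, assuming a Spark 1.x build where spark.scheduler.executorTaskBlacklistTime is an undocumented, millisecond-valued scheduler property; the 30000 ms value and the app name are illustrative, not taken from this thread:

    import org.apache.spark.{SparkConf, SparkContext}

    object BlacklistFailedExecutors {
      def main(args: Array[String]): Unit = {
        // Time (ms) during which a task that failed on an executor will not be
        // re-scheduled on that same executor. The default of 0L means retries
        // can land on the same executor again and again.
        val conf = new SparkConf()
          .setAppName("blacklist-failed-executors")
          .set("spark.scheduler.executorTaskBlacklistTime", "30000")
        val sc = new SparkContext(conf)

        // ... job code ...

        sc.stop()
      }
    }

The same property can also be passed at submit time with --conf; either way it only steers retries away from the executor that already failed, it does not free any disk space on that node.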

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
Thanks Shixiong!

Very strange that our tasks were retried on the same executor again and again. I'll check spark.scheduler.executorTaskBlacklistTime.

Jianshi

On Mon, Mar 16, 2015 at 6:02 PM, Shixiong Zhu wrote:
> There are 2 cases for "No space left on device":
>
> 1. Some tasks which use large temp space cannot run on any node.

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Shixiong Zhu
There are 2 cases for "No space left on device":

1. Some tasks which use large temp space cannot run on any node.
2. The free space of the datanodes is not balanced. Some tasks which use large temp space cannot run on several nodes, but they can run successfully on other nodes. Because most of our ca...
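To make case 2 concrete, here is a hypothetical Scala sketch (not from this thread) that compares usable space across a node's scratch directories; the SPARK_LOCAL_DIRS lookup and the /tmp fallback are assumptions for illustration:

    import java.io.File

    object LocalDirSpaceCheck {
      def main(args: Array[String]): Unit = {
        // Scratch directories, roughly the way a worker would pick them up:
        // SPARK_LOCAL_DIRS if set, otherwise /tmp (an assumed fallback).
        val localDirs = sys.env.getOrElse("SPARK_LOCAL_DIRS", "/tmp").split(",")

        localDirs.map(_.trim).foreach { dir =>
          // Report remaining space per directory so an unbalanced node stands out.
          val freeGb = new File(dir).getUsableSpace.toDouble / (1024L * 1024 * 1024)
          println(f"$dir%-40s free: $freeGb%.1f GB")
        }
      }
    }

Run on each worker, a report like this makes the "some nodes are nearly full, others are not" pattern visible before it surfaces as repeated task failures.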

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
I created a JIRA: https://issues.apache.org/jira/browse/SPARK-6353

On Mon, Mar 16, 2015 at 5:36 PM, Jianshi Huang wrote:
> Hi,
>
> We're facing "No space left on device" errors lately from time to time.
> The job will fail after retries. Obviously in such a case, retrying won't be
> helpful.
>
> Sure, the problem is in the datanodes, but I'm wondering if the Spark
> driver can handle it and decommission the problematic datanode before
> retrying...

Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
Hi,

We're facing "No space left on device" errors lately from time to time. The job will fail after retries. Obviously in such a case, retrying won't be helpful.

Sure, the problem is in the datanodes, but I'm wondering if the Spark driver can handle it and decommission the problematic datanode before retrying...