Hi,

One of the rationales behind killing the app can be to avoid data skew.

I have created this issue (https://issues.apache.org/jira/browse/SPARK-6735) to provide an option for disabling this behaviour, as well as making the number of executor failures relative to a window duration. I will upload the PR shortly.
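Roughly, the idea is something like the following minimal sketch (the class and property names here are placeholders for illustration, not the actual patch):

    import scala.collection.mutable

    // Sketch: count only the executor failures that happened inside a
    // sliding time window, instead of every failure since app start.
    class ExecutorFailureTracker(
        maxFailures: Int,     // cf. spark.yarn.max.executor.failures
        windowMillis: Long,   // <= 0 means no window, i.e. current behaviour
        enabled: Boolean) {   // option to disable app-kill entirely

      private val failureTimes = mutable.Queue[Long]()

      def recordFailure(now: Long = System.currentTimeMillis()): Unit =
        failureTimes.enqueue(now)

      // True if the application should be killed for too many recent failures.
      def shouldFailApp(now: Long = System.currentTimeMillis()): Boolean = {
        if (!enabled) return false
        if (windowMillis > 0) {
          // Expire failures that fell out of the window, so sparse failures
          // over a long-running job never accumulate up to the threshold.
          while (failureTimes.nonEmpty && now - failureTimes.head > windowMillis) {
            failureTimes.dequeue()
          }
        }
        failureTimes.size > maxFailures
      }
    }

With, say, maxFailures = 3 and a one-hour window, three executors lost within five minutes would still kill the app, while three lost over three days would not.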
Thanks,
Twinkle

On Tue, Apr 7, 2015 at 2:02 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> What's the advantage of killing an application for lack of resources?
>
> I think the rationale behind killing an app based on executor failures is
> that, if we see a lot of them in a short span of time, it means there's
> probably something going wrong in the app or on the cluster.
>
> On Wed, Apr 1, 2015 at 7:08 PM, twinkle sachdeva <
> twinkle.sachd...@gmail.com> wrote:
>
>> Hi,
>>
>> Thanks Sandy.
>>
>> Another way to look at this is: would we like our long-running
>> application to die?
>>
>> Say we create a window of around 10 batches, and we are using
>> incremental kinds of operations inside our application, so a restart is
>> relatively costly. Should the criterion for failing the application be a
>> maximum number of executor failures, or should we have some parameter
>> around a minimum number of executors being available for some time x?
>>
>> That is, if the application is not able to keep a minimum of n executors
>> within a period of time x, then we should fail the application.
>>
>> Adding a time factor here will allow some window for Spark to get more
>> executors allocated if some of them fail.
>>
>> Thoughts please.
>>
>> Thanks,
>> Twinkle
>>
>> On Wed, Apr 1, 2015 at 10:19 PM, Sandy Ryza <sandy.r...@cloudera.com>
>> wrote:
>>
>>> That's a good question, Twinkle.
>>>
>>> One solution could be to allow a maximum number of failures within any
>>> given time span, e.g. a max-failures-per-hour property.
>>>
>>> -Sandy
>>>
>>> On Tue, Mar 31, 2015 at 11:52 PM, twinkle sachdeva <
>>> twinkle.sachd...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> In Spark on YARN, there is a property, "spark.yarn.max.executor.failures",
>>>> which controls the maximum number of executor failures an application
>>>> will survive.
>>>>
>>>> If the number of executor failures (due to any reason, like OOM or
>>>> machine failure) exceeds this value, then the application quits.
>>>>
>>>> For a short-duration Spark job this looks fine, but for long-running
>>>> jobs, since it does not take duration into account, it can lead to the
>>>> same treatment for two very different scenarios:
>>>> 1. Executors failing within 5 minutes.
>>>> 2. Executors failing sparsely, where at some point even a single
>>>> executor failure (which the application could have survived) makes the
>>>> application quit.
>>>>
>>>> Sending this to the community to hear what kind of behaviour/strategy
>>>> you think would be suitable for long-running Spark jobs or Spark
>>>> Streaming jobs.
>>>>
>>>> Thanks and Regards,
>>>> Twinkle