Re: Strategy regarding maximum number of executor failures for long running jobs / Spark Streaming jobs

2015-04-07 Thread twinkle sachdeva
Hi,

One of the rationales behind killing the app can be to avoid skewness in
the data.

I have created this issue (https://issues.apache.org/jira/browse/SPARK-6735)
to provide options for disabling this behaviour, as well as for making the
number of executor failures relative to a window duration.

I will upload the PR shortly.
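
Purely as an illustration of the shape such options could take (a sketch only;
apart from spark.yarn.max.executor.failures itself, the property names below
are hypothetical until the PR lands):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Existing behaviour: a single lifetime cap on executor failures.
      .set("spark.yarn.max.executor.failures", "10")
      // Hypothetical switch to disable the kill-on-failures behaviour entirely.
      .set("spark.yarn.max.executor.failures.enabled", "false")
      // Hypothetical window: only failures within the last hour count towards the cap.
      .set("spark.yarn.executor.failures.window", "1h")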

Thanks,
Twinkle


On Tue, Apr 7, 2015 at 2:02 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

 What's the advantage of killing an application for lack of resources?

 I think the rationale behind killing an app based on executor failures is
 that, if we see a lot of them in a short span of time, it means there's
 probably something going wrong in the app or on the cluster.

 On Wed, Apr 1, 2015 at 7:08 PM, twinkle sachdeva 
 twinkle.sachd...@gmail.com wrote:

 Hi,

 Thanks Sandy.


 Another way to look at this is: would we like our long running
 application to die at all?

 Say we create a window of around 10 batches and we are using incremental
 operations inside our application, so a restart is relatively costly. Should
 the criterion for failing the application be a maximum number of executor
 failures, or should we instead have a parameter around a minimum number of
 executors being available for some time x?

 So, if the application is not able to hold a minimum of n executors
 within a period of time x, then we should fail the application.

 Adding a time factor here gives Spark some window to get more executors
 allocated if some of them fail.

 Thoughts please.

 Thanks,
 Twinkle


 On Wed, Apr 1, 2015 at 10:19 PM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 That's a good question, Twinkle.

 One solution could be to allow a maximum number of failures within any
 given time span.  E.g. a max failures per hour property.

 -Sandy

 On Tue, Mar 31, 2015 at 11:52 PM, twinkle sachdeva 
 twinkle.sachd...@gmail.com wrote:

 Hi,

 In Spark over YARN, there is a property spark.yarn.max.executor.failures
 which controls the maximum number of executor failures an application will
 survive.

 If the number of executor failures (due to any reason, like OOM or
 machine failure) exceeds this value, then the application quits.

 For short-duration Spark jobs this looks fine, but for long running jobs
 it does not take the duration into account, so it can lead to the same
 treatment for two very different scenarios:
 1. executors failing within 5 minutes.
 2. executors failing sparsely, where at some point even a single executor
 failure (which the application could otherwise have survived) makes the
 application quit.

 Sending this to the community to hear what kind of behaviour / strategy
 they think would be suitable for long running Spark jobs or Spark Streaming
 jobs.

 Thanks and Regards,
 Twinkle







Re: Strategy regarding maximum number of executor failures for long running jobs / Spark Streaming jobs

2015-04-06 Thread Sandy Ryza
What's the advantage of killing an application for lack of resources?

I think the rationale behind killing an app based on executor failures is
that, if we see a lot of them in a short span of time, it means there's
probably something going wrong in the app or on the cluster.

On Wed, Apr 1, 2015 at 7:08 PM, twinkle sachdeva twinkle.sachd...@gmail.com
 wrote:

 Hi,

 Thanks Sandy.


 Another way to look at this is: would we like our long running
 application to die at all?

 Say we create a window of around 10 batches and we are using incremental
 operations inside our application, so a restart is relatively costly. Should
 the criterion for failing the application be a maximum number of executor
 failures, or should we instead have a parameter around a minimum number of
 executors being available for some time x?

 So, if the application is not able to hold a minimum of n executors
 within a period of time x, then we should fail the application.

 Adding a time factor here gives Spark some window to get more executors
 allocated if some of them fail.

 Thoughts please.

 Thanks,
 Twinkle


 On Wed, Apr 1, 2015 at 10:19 PM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 That's a good question, Twinkle.

 One solution could be to allow a maximum number of failures within any
 given time span.  E.g. a max failures per hour property.

 -Sandy

 On Tue, Mar 31, 2015 at 11:52 PM, twinkle sachdeva 
 twinkle.sachd...@gmail.com wrote:

 Hi,

 In Spark over YARN, there is a property spark.yarn.max.executor.failures
 which controls the maximum number of executor failures an application will
 survive.

 If the number of executor failures (due to any reason, like OOM or
 machine failure) exceeds this value, then the application quits.

 For short-duration Spark jobs this looks fine, but for long running jobs
 it does not take the duration into account, so it can lead to the same
 treatment for two very different scenarios:
 1. executors failing within 5 minutes.
 2. executors failing sparsely, where at some point even a single executor
 failure (which the application could otherwise have survived) makes the
 application quit.

 Sending this to the community to hear what kind of behaviour / strategy
 they think would be suitable for long running Spark jobs or Spark Streaming
 jobs.

 Thanks and Regards,
 Twinkle






Strategy regarding maximum number of executor failures for long running jobs / Spark Streaming jobs

2015-04-01 Thread twinkle sachdeva
Hi,

In Spark over YARN, there is a property spark.yarn.max.executor.failures
which controls the maximum number of executor failures an application will
survive.

If the number of executor failures (due to any reason, like OOM or
machine failure) exceeds this value, then the application quits.

For short-duration Spark jobs this looks fine, but for long running jobs
it does not take the duration into account, so it can lead to the same
treatment for two very different scenarios:
1. executors failing within 5 minutes.
2. executors failing sparsely, where at some point even a single executor
failure (which the application could otherwise have survived) makes the
application quit.
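
A minimal sketch of why both scenarios above get identical treatment today
(illustrative only, not the actual YARN ApplicationMaster code): the check is
a plain lifetime counter with no time dimension.

    // Illustrative only: a lifetime counter cannot tell a burst of failures
    // apart from failures spread thinly over days or weeks.
    class LifetimeFailureCounter(maxFailures: Int) {
      private var failures = 0

      /** Record one executor failure; returns true if the application should quit. */
      def recordFailure(): Boolean = {
        failures += 1
        failures > maxFailures
      }
    }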

Sending this to the community to hear what kind of behaviour / strategy
they think would be suitable for long running Spark jobs or Spark Streaming
jobs.

Thanks and Regards,
Twinkle


Re: Strategy regarding maximum number of executor failures for long running jobs / Spark Streaming jobs

2015-04-01 Thread twinkle sachdeva
Hi,

Thanks Sandy.


Another way to look at this is: would we like our long running application
to die at all?

Say we create a window of around 10 batches and we are using incremental
operations inside our application, so a restart is relatively costly. Should
the criterion for failing the application be a maximum number of executor
failures, or should we instead have a parameter around a minimum number of
executors being available for some time x?

So, if the application is not able to hold a minimum of n executors
within a period of time x, then we should fail the application.

Adding a time factor here gives Spark some window to get more executors
allocated if some of them fail.
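
A minimal sketch of what such a criterion could look like (illustrative only,
not Spark internals; minExecutors and graceMs stand for the n and x above):

    // Fail the application only if it has held fewer than `minExecutors`
    // executors continuously for longer than `graceMs`.
    class MinExecutorPolicy(minExecutors: Int, graceMs: Long) {
      private var belowSince: Option[Long] = None

      /** Returns true if the application should be failed. */
      def shouldFail(currentExecutors: Int, nowMs: Long = System.currentTimeMillis()): Boolean = {
        if (currentExecutors >= minExecutors) {
          belowSince = None                      // healthy again, reset the clock
          false
        } else {
          if (belowSince.isEmpty) belowSince = Some(nowMs)
          nowMs - belowSince.get > graceMs       // give YARN time to re-allocate
        }
      }
    }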

Thoughts please.

Thanks,
Twinkle


On Wed, Apr 1, 2015 at 10:19 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

 That's a good question, Twinkle.

 One solution could be to allow a maximum number of failures within any
 given time span.  E.g. a max failures per hour property.

 -Sandy

 On Tue, Mar 31, 2015 at 11:52 PM, twinkle sachdeva 
 twinkle.sachd...@gmail.com wrote:

 Hi,

 In Spark over YARN, there is a property spark.yarn.max.executor.failures
 which controls the maximum number of executor failures an application will
 survive.

 If the number of executor failures (due to any reason, like OOM or
 machine failure) exceeds this value, then the application quits.

 For short-duration Spark jobs this looks fine, but for long running jobs
 it does not take the duration into account, so it can lead to the same
 treatment for two very different scenarios:
 1. executors failing within 5 minutes.
 2. executors failing sparsely, where at some point even a single executor
 failure (which the application could otherwise have survived) makes the
 application quit.

 Sending this to the community to hear what kind of behaviour / strategy
 they think would be suitable for long running Spark jobs or Spark Streaming
 jobs.

 Thanks and Regards,
 Twinkle





Re: Strategy regarding maximum number of executor failures for long running jobs / Spark Streaming jobs

2015-04-01 Thread Sandy Ryza
That's a good question, Twinkle.

One solution could be to allow a maximum number of failures within any
given time span.  E.g. a max failures per hour property.
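
A minimal sketch of such a windowed check (illustrative only; the property
name and the wiring into the ApplicationMaster are left out):

    import scala.collection.mutable

    // Keep failure timestamps and only count the ones inside the sliding window.
    class WindowedFailureTracker(maxFailures: Int, windowMs: Long) {
      private val failures = mutable.Queue[Long]()

      /** Record one executor failure; returns true if the application should be killed. */
      def recordFailure(nowMs: Long = System.currentTimeMillis()): Boolean = {
        failures.enqueue(nowMs)
        // Drop failures that have fallen out of the window.
        while (failures.nonEmpty && nowMs - failures.head > windowMs) {
          failures.dequeue()
        }
        failures.size > maxFailures
      }
    }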

-Sandy

On Tue, Mar 31, 2015 at 11:52 PM, twinkle sachdeva 
twinkle.sachd...@gmail.com wrote:

 Hi,

 In Spark over YARN, there is a property spark.yarn.max.executor.failures
 which controls the maximum number of executor failures an application will
 survive.

 If the number of executor failures (due to any reason, like OOM or
 machine failure) exceeds this value, then the application quits.

 For short-duration Spark jobs this looks fine, but for long running jobs
 it does not take the duration into account, so it can lead to the same
 treatment for two very different scenarios:
 1. executors failing within 5 minutes.
 2. executors failing sparsely, where at some point even a single executor
 failure (which the application could otherwise have survived) makes the
 application quit.

 Sending this to the community to hear what kind of behaviour / strategy
 they think would be suitable for long running Spark jobs or Spark Streaming
 jobs.

 Thanks and Regards,
 Twinkle