[jira] [Commented] (SPARK-19755) Blacklist is always active for MesosCoarseGrainedSchedulerBackend. As result - scheduler cannot create an executor after some time.

2018-02-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370002#comment-16370002
 ] 

Apache Spark commented on SPARK-19755:
--

User 'IgorBerman' has created a pull request for this issue:
https://github.com/apache/spark/pull/20640

> Blacklist is always active for MesosCoarseGrainedSchedulerBackend. As result 
> - scheduler cannot create an executor after some time.
> ---
>
> Key: SPARK-19755
> URL: https://issues.apache.org/jira/browse/SPARK-19755
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Scheduler
>Affects Versions: 2.1.0
> Environment: mesos, marathon, docker - driver and executors are 
> dockerized.
>Reporter: Timur Abakumov
>Priority: Major
>
> When for some reason task fails - MesosCoarseGrainedSchedulerBackend 
> increased failure counter for a slave where that task was running.
> When counter is >=2 (MAX_SLAVE_FAILURES) mesos slave is excluded.  
> Over time  scheduler cannot create a new executor - every slave is is in the 
> blacklist.  Task failure not necessary related to host health- especially for 
> long running stream apps.
> If accepted as a bug: possible solution is to use: spark.blacklist.enabled to 
> make that functionality optional and if it make sense   MAX_SLAVE_FAILURES 
> also can be configurable.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19755) Blacklist is always active for MesosCoarseGrainedSchedulerBackend. As result - scheduler cannot create an executor after some time.

2018-02-19 Thread Igor Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369227#comment-16369227
 ] 

Igor Berman commented on SPARK-19755:
-

This Jira is very relevant for the case when running with  dynamic allocation 
turned on, where starting and stopping executors is part of natural lifecycle 
of the driver. The chances to fail when starting executor are increasing(e.g. 
due to transient port collisions)

 

The threshold of 2 seems too low and artificial for this usecases. I've 
observed situation where at some point almost 1/3 of mesos-slave nodes are 
marked as blacklisted(but they were ok). This creates situation where the 
cluster has free resources but frameworks can't use them since they actively 
decline offers from the master.

> Blacklist is always active for MesosCoarseGrainedSchedulerBackend. As result 
> - scheduler cannot create an executor after some time.
> ---
>
> Key: SPARK-19755
> URL: https://issues.apache.org/jira/browse/SPARK-19755
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Scheduler
>Affects Versions: 2.1.0
> Environment: mesos, marathon, docker - driver and executors are 
> dockerized.
>Reporter: Timur Abakumov
>Priority: Major
>
> When for some reason task fails - MesosCoarseGrainedSchedulerBackend 
> increased failure counter for a slave where that task was running.
> When counter is >=2 (MAX_SLAVE_FAILURES) mesos slave is excluded.  
> Over time  scheduler cannot create a new executor - every slave is is in the 
> blacklist.  Task failure not necessary related to host health- especially for 
> long running stream apps.
> If accepted as a bug: possible solution is to use: spark.blacklist.enabled to 
> make that functionality optional and if it make sense   MAX_SLAVE_FAILURES 
> also can be configurable.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19755) Blacklist is always active for MesosCoarseGrainedSchedulerBackend. As result - scheduler cannot create an executor after some time.

2017-04-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965497#comment-15965497
 ] 

Apache Spark commented on SPARK-19755:
--

User 'timout' has created a pull request for this issue:
https://github.com/apache/spark/pull/17619

> Blacklist is always active for MesosCoarseGrainedSchedulerBackend. As result 
> - scheduler cannot create an executor after some time.
> ---
>
> Key: SPARK-19755
> URL: https://issues.apache.org/jira/browse/SPARK-19755
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Scheduler
>Affects Versions: 2.1.0
> Environment: mesos, marathon, docker - driver and executors are 
> dockerized.
>Reporter: Timur Abakumov
>
> When for some reason task fails - MesosCoarseGrainedSchedulerBackend 
> increased failure counter for a slave where that task was running.
> When counter is >=2 (MAX_SLAVE_FAILURES) mesos slave is excluded.  
> Over time  scheduler cannot create a new executor - every slave is is in the 
> blacklist.  Task failure not necessary related to host health- especially for 
> long running stream apps.
> If accepted as a bug: possible solution is to use: spark.blacklist.enabled to 
> make that functionality optional and if it make sense   MAX_SLAVE_FAILURES 
> also can be configurable.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19755) Blacklist is always active for MesosCoarseGrainedSchedulerBackend. As result - scheduler cannot create an executor after some time.

2017-03-17 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930909#comment-15930909
 ] 

Kay Ousterhout commented on SPARK-19755:


I'm closing this because the configs you're proposing adding already exist: 
spark.blacklist.enabled already exists to turn of all blacklisting (this is 
false by default, so the fact that you're seeing blacklisting behavior means 
that your configuration enables blacklisting), and 
spark.blacklist.maxFailedTaskPerExecutor is the other thing you proposed 
adding.  All of the blacklisting parameters are listed here: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/package.scala#L101

Feel free to re-open this if I've misunderstood and the existing configs don't 
address the issues you're seeing!

> Blacklist is always active for MesosCoarseGrainedSchedulerBackend. As result 
> - scheduler cannot create an executor after some time.
> ---
>
> Key: SPARK-19755
> URL: https://issues.apache.org/jira/browse/SPARK-19755
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Scheduler
>Affects Versions: 2.1.0
> Environment: mesos, marathon, docker - driver and executors are 
> dockerized.
>Reporter: Timur Abakumov
>
> When for some reason task fails - MesosCoarseGrainedSchedulerBackend 
> increased failure counter for a slave where that task was running.
> When counter is >=2 (MAX_SLAVE_FAILURES) mesos slave is excluded.  
> Over time  scheduler cannot create a new executor - every slave is is in the 
> blacklist.  Task failure not necessary related to host health- especially for 
> long running stream apps.
> If accepted as a bug: possible solution is to use: spark.blacklist.enabled to 
> make that functionality optional and if it make sense   MAX_SLAVE_FAILURES 
> also can be configurable.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org