[jira] [Commented] (SPARK-16554) Spark should kill executors when they are blacklisted

2017-01-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831005#comment-15831005
 ] 

Apache Spark commented on SPARK-16554:
--

User 'jsoltren' has created a pull request for this issue:
https://github.com/apache/spark/pull/16650

> Spark should kill executors when they are blacklisted
> -
>
> Key: SPARK-16554
> URL: https://issues.apache.org/jira/browse/SPARK-16554
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler
>Reporter: Imran Rashid
>Assignee: Jose Soltren
>
> SPARK-8425 will allow blacklisting faulty executors and nodes.  However, 
> these blacklisted executors will continue to run.  This is bad for a few 
> reasons:
> (1) Even with faulty hardware, if the cluster is under-utilized, Spark 
> may be able to request another executor on a different node.
> (2) If there is a faulty disk (the most common case of faulty hardware), the 
> cluster manager may be able to allocate another executor on the same node, if 
> it can exclude the bad disk.  (YARN will do this with its disk-health 
> checker.)
> With dynamic allocation, this may seem less critical, as a blacklisted 
> executor will stop running new tasks and eventually get reclaimed.  However, 
> if there is cached data on those executors, they will not get killed until 
> {{spark.dynamicAllocation.cachedExecutorIdleTimeout}} expires, which is 
> (effectively) infinite by default.
> Users may not *always* want to kill bad executors, so this must be 
> configurable to some extent.  At a minimum, it should be possible to enable / 
> disable it; perhaps the executor should be killed after it has been 
> blacklisted a configurable {{N}} times.
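For reference, the enable/disable knob described above can be sketched as a spark-defaults.conf fragment. The name {{spark.blacklist.killBlacklistedExecutors}} is the one the eventual implementation shipped (Spark 2.2); treat the exact names and defaults here as assumptions to be checked against the configuration docs for your Spark version:

```
# Illustrative spark-defaults.conf fragment (names per the Spark 2.2 docs)
spark.blacklist.enabled                            true
spark.blacklist.killBlacklistedExecutors           true   # default: false
# Without killing, a blacklisted executor holding cached data lingers until:
spark.dynamicAllocation.cachedExecutorIdleTimeout  1h     # default: infinity
```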



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16554) Spark should kill executors when they are blacklisted

2017-01-18 Thread Jose Soltren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828522#comment-15828522
 ] 

Jose Soltren commented on SPARK-16554:
--

Builds on some BlacklistTracker changes that should land first.




[jira] [Commented] (SPARK-16554) Spark should kill executors when they are blacklisted

2017-01-18 Thread Jose Soltren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828520#comment-15828520
 ] 

Jose Soltren commented on SPARK-16554:
--

I have some changes ready, but I'm going to wait for 
https://github.com/apache/spark/pull/16346 to land before sending a pull request, 
to avoid churn. Hopefully this happens in the next day or two.




[jira] [Commented] (SPARK-16554) Spark should kill executors when they are blacklisted

2016-11-30 Thread Jose Soltren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709080#comment-15709080
 ] 

Jose Soltren commented on SPARK-16554:
--

This is a nice idea but I'm not sure how feasible it is.

If we're going to kill an executor, it is because there has already been a task 
failure that makes the executor suspect. There could be a network problem, or 
lost or corrupt local storage. Even if we could recover hosted blocks off the 
executor, I'm not certain how trustworthy they would be.

There would need to be some mechanism in place to determine whether recovered 
blocks are valid, and perhaps a maximum number of tries or a timeout for this 
step as well. I'm not certain how often this would provide a benefit compared 
to simply recomputing using lineage.

I'll continue to think through this as I look further at the code.




[jira] [Commented] (SPARK-16554) Spark should kill executors when they are blacklisted

2016-11-28 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15703851#comment-15703851
 ] 

Mridul Muralidharan commented on SPARK-16554:
-

It would also be good if we could move currently hosted blocks off the node 
before it is killed, as a best-effort measure.
This would reduce recomputation (if there is a single copy of an RDD block) 
and/or prevent under-replication (if block replication > 1).
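In outline, the best-effort migration described above could look like the sketch below. This is illustrative pseudologic in Python, not Spark's BlockManager API; every name in it ({{Block}}, {{plan_migration}}, the executor IDs) is hypothetical. Single-replica blocks are handled first, since losing them forces recomputation from lineage, while replicated blocks merely become under-replicated:

```python
# Hypothetical sketch of "migrate blocks off a blacklisted executor before
# killing it" -- NOT Spark's real API; all types and names are illustrative.
from dataclasses import dataclass

@dataclass
class Block:
    block_id: str
    replicas: int  # executors currently holding a copy of this block

def plan_migration(blocks, healthy_executors):
    """Plan best-effort copies of a doomed executor's blocks.

    Returns (block_id, target_executor) pairs. Single-replica blocks are
    migrated first: losing them means recomputation from lineage, whereas
    replicated blocks only drop below their desired replication level.
    """
    if not healthy_executors:
        return []  # nowhere to move data; fall back to lineage recomputation
    ordered = sorted(blocks, key=lambda b: b.replicas)
    return [(b.block_id, healthy_executors[i % len(healthy_executors)])
            for i, b in enumerate(ordered)]

blocks = [Block("rdd_0_1", replicas=2), Block("rdd_0_0", replicas=1)]
print(plan_migration(blocks, ["exec-7", "exec-9"]))
# [('rdd_0_0', 'exec-7'), ('rdd_0_1', 'exec-9')]
```

As the comment notes, any real implementation would also need a validity check on the copied blocks and a cap (retries or a timeout) so a bad executor cannot stall its own shutdown.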
