[jira] [Commented] (SPARK-24387) Heartbeat-timeout executor is added back and used again

2018-06-11 Thread Rui Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509214#comment-16509214
 ] 

Rui Li commented on SPARK-24387:


Yes, blacklisting can be used to avoid the issue. But blacklisting can be 
turned off, or configured to be more tolerant, so it would be better to have a 
solution that does not depend on it.

> Heartbeat-timeout executor is added back and used again
> ---
>
> Key: SPARK-24387
> URL: https://issues.apache.org/jira/browse/SPARK-24387
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: Rui Li
> Priority: Major
>
> In our job, when there's only one task and one executor running, the 
> executor's heartbeat is lost and the driver decides to remove it. However, 
> the executor is added again and the task's retry attempt is scheduled to that 
> executor almost immediately after the executor is marked as lost.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24387) Heartbeat-timeout executor is added back and used again

2018-06-11 Thread Jiang Xingbo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16508832#comment-16508832
 ] 

Jiang Xingbo commented on SPARK-24387:
--

{quote}So I think there's a race condition that the backend may make offers 
before killing the executor. And since this is the only executor left, it's 
offered to the TaskScheduler and the retried task is scheduled to it.{quote}
IIUC, removing an executor due to a heartbeat timeout is treated as a 
SlaveLost, which should produce a task failure for each task running on that 
executor and therefore blacklist the task from running on that executor again. 
So why can the executor be offered to the retried task again?
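
The reasoning in this comment can be illustrated with a toy model. The sketch below is not Spark code: `eligible_executors`, `failed_on`, and `blacklist_enabled` are hypothetical names chosen for illustration. It only shows that a per-task blacklist would filter out the executor the task already failed on, and that nothing filters it when blacklisting is off.

```python
def eligible_executors(offers, failed_on, blacklist_enabled):
    """Filter the offered executors for one retried task.

    offers: executor IDs currently offered by the backend.
    failed_on: executor IDs on which this task has already failed.
    blacklist_enabled: hypothetical switch standing in for the blacklist setting.
    """
    if not blacklist_enabled:
        # Without blacklisting, every offered executor is acceptable.
        return list(offers)
    # With blacklisting, skip executors where this task already failed.
    return [e for e in offers if e not in failed_on]


with_blacklist = eligible_executors(["1100"], {"1100"}, blacklist_enabled=True)
without_blacklist = eligible_executors(["1100"], {"1100"}, blacklist_enabled=False)
print(with_blacklist)     # []
print(without_blacklist)  # ['1100']
```

In this model the retried task can land on executor 1100 only when the blacklist is disabled or tolerant enough to allow it.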




[jira] [Commented] (SPARK-24387) Heartbeat-timeout executor is added back and used again

2018-06-02 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498942#comment-16498942
 ] 

Apache Spark commented on SPARK-24387:
--

User 'lirui-apache' has created a pull request for this issue:
https://github.com/apache/spark/pull/21486




[jira] [Commented] (SPARK-24387) Heartbeat-timeout executor is added back and used again

2018-05-28 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492563#comment-16492563
 ] 

Rui Li commented on SPARK-24387:


Instead of letting HeartbeatReceiver tell TaskScheduler that the executor is 
lost, I'm wondering whether it makes sense to let 
CoarseGrainedSchedulerBackend call executorLost in the killExecutors method. At 
that point the executor has been marked as pending-to-remove and won't be 
offered again.
[~kayousterhout], [~vanzin] would you mind sharing your thoughts? Thanks.
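
A minimal sketch of the ordering proposed in this comment, assuming a toy backend. `ToyBackend`, `pending_to_remove`, `kill_executors`, and the `on_executor_lost` callback are hypothetical stand-ins, not the real CoarseGrainedSchedulerBackend API. The point is only that the executor is marked before the scheduler is notified, so offers revived during the notification skip it.

```python
class ToyBackend:
    """Hypothetical stand-in for a scheduler backend; not Spark's real class."""

    def __init__(self, executors):
        self.executors = set(executors)
        self.pending_to_remove = set()

    def make_offers(self):
        # Executors marked pending-to-remove are never offered.
        return sorted(self.executors - self.pending_to_remove)

    def kill_executors(self, executor_ids, on_executor_lost):
        # Step 1: mark the executors first.
        self.pending_to_remove.update(executor_ids)
        # Step 2: only then tell the scheduler they are lost.
        for executor_id in sorted(executor_ids):
            on_executor_lost(executor_id)


backend = ToyBackend(["1100"])
offers_seen = []

# The callback models the scheduler reviving offers while handling the loss.
backend.kill_executors(["1100"], lambda _id: offers_seen.append(backend.make_offers()))
print(offers_seen)  # [[]]
```

Because the mark happens inside the same call that triggers the notification, no separate asynchronous kill needs to win a race in this model.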




[jira] [Commented] (SPARK-24387) Heartbeat-timeout executor is added back and used again

2018-05-25 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490309#comment-16490309
 ] 

Rui Li commented on SPARK-24387:


When HeartbeatReceiver finds that the executor's heartbeat has timed out, it 
informs the TaskScheduler and kills the executor asynchronously. When 
TaskScheduler handles the lost executor, it tries to revive offers from the 
backend. So I 
think there's a race condition that the backend may make offers before killing 
the executor. And since this is the only executor left, it's offered to the 
TaskScheduler and the retried task is scheduled to it.

And when killing a heartbeat-timeout executor, we expect a replacement executor 
to be launched. But when the new executor is launched, there's no task for it 
to run. So it's kept idle until killed by dynamic allocation.
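
The suspected race can be modelled deterministically. The sketch below is a toy, not Spark code: `ToyBackend`, `handle_heartbeat_timeout`, and the `kill_first` flag are hypothetical names, and the asynchronous kill is represented simply by the order of two calls.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBackend:
    """Hypothetical stand-in for the scheduler backend; not Spark's real class."""
    alive: set = field(default_factory=set)

    def make_offers(self):
        # Offer every executor the backend still believes is alive.
        return sorted(self.alive)

    def kill_executor(self, executor_id):
        self.alive.discard(executor_id)


def handle_heartbeat_timeout(backend, executor_id, kill_first):
    """Return the executors offered for the retried task.

    kill_first=False models the suspected ordering: offers are revived
    before the asynchronous kill has reached the backend.
    """
    if kill_first:
        backend.kill_executor(executor_id)
        return backend.make_offers()
    offers = backend.make_offers()      # the lost executor is still listed
    backend.kill_executor(executor_id)  # the kill arrives too late
    return offers


racy = handle_heartbeat_timeout(ToyBackend({"1100"}), "1100", kill_first=False)
safe = handle_heartbeat_timeout(ToyBackend({"1100"}), "1100", kill_first=True)
print(racy)  # ['1100']
print(safe)  # []
```

With a single executor, the racy ordering leaves the timed-out executor as the only candidate, which matches the behaviour described above.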




[jira] [Commented] (SPARK-24387) Heartbeat-timeout executor is added back and used again

2018-05-25 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490293#comment-16490293
 ] 

Rui Li commented on SPARK-24387:


A snippet of the log w/ some fields masked:
{noformat}
[Stage 2:==>(199 + 1) / 200]
18/05/20 05:37:07 WARN HeartbeatReceiver: Removing executor 1100 with no recent heartbeats: 345110 ms exceeds timeout 30 ms
18/05/20 05:37:07 ERROR YarnClusterScheduler: Lost executor 1100 on HOSTA: Executor heartbeat timed out after 345110 ms
18/05/20 05:37:07 WARN TaskSetManager: Lost task 55.0 in stage 2.0 (TID 12080, HOSTA, executor 1100): ExecutorLostFailure (executor 1100 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 345110 ms
18/05/20 05:37:07 INFO DAGScheduler: Executor lost: 1100 (epoch 2)
18/05/20 05:37:07 INFO DAGScheduler: Host added was in lost list earlier: HOSTA
18/05/20 05:37:07 INFO TaskSetManager: Starting task 55.1 in stage 2.0 (TID 12225, HOSTA, executor 1100, partition 55, PROCESS_LOCAL, 6227 bytes)
{noformat}
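
To look for the same pattern in other driver logs, a small scan along these lines may help. It is a sketch under assumptions: the regular expressions are derived only from the message shapes in the snippet above, one log record per line is assumed, and `find_reuse_after_loss` is a hypothetical helper name.

```python
import re

# Patterns based solely on the two message shapes shown in the log snippet.
LOST = re.compile(r"Lost executor (\d+) on")
STARTED = re.compile(
    r"Starting task (\S+) in stage (\S+) \(TID \d+, [^,]+, executor (\d+),"
)


def find_reuse_after_loss(lines):
    """Return (task, executor) pairs started on an executor already reported lost."""
    lost = set()
    reused = []
    for line in lines:
        match = LOST.search(line)
        if match:
            lost.add(match.group(1))
            continue
        match = STARTED.search(line)
        if match and match.group(3) in lost:
            reused.append((match.group(1), match.group(3)))
    return reused


sample = [
    "18/05/20 05:37:07 ERROR YarnClusterScheduler: Lost executor 1100 on HOSTA: "
    "Executor heartbeat timed out after 345110 ms",
    "18/05/20 05:37:07 INFO TaskSetManager: Starting task 55.1 in stage 2.0 "
    "(TID 12225, HOSTA, executor 1100, partition 55, PROCESS_LOCAL, 6227 bytes)",
]
reused = find_reuse_after_loss(sample)
print(reused)  # [('55.1', '1100')]
```

Any pair it reports is a task attempt that started on an executor ID after that ID was logged as lost, which is the symptom described in this issue.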
