[jira] [Commented] (SPARK-24387) Heartbeat-timeout executor is added back and used again
[ https://issues.apache.org/jira/browse/SPARK-24387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509214#comment-16509214 ]

Rui Li commented on SPARK-24387:

Yes, blacklisting can be used to avoid the issue. But blacklisting can be turned off, or configured to be more tolerant, so it's better to have a more reliable solution.

> Heartbeat-timeout executor is added back and used again
> ---
>
> Key: SPARK-24387
> URL: https://issues.apache.org/jira/browse/SPARK-24387
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: Rui Li
> Priority: Major
>
> In our job, when there's only one task and one executor running, the
> executor's heartbeat is lost and driver decides to remove it. However, the
> executor is added again and the task's retry attempt is scheduled to that
> executor, almost immediately after the executor is marked as lost.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24387) Heartbeat-timeout executor is added back and used again
[ https://issues.apache.org/jira/browse/SPARK-24387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16508832#comment-16508832 ]

Jiang Xingbo commented on SPARK-24387:

{quote}So I think there's a race condition that the backend may make offers before killing the executor. And since this is the only executor left, it's offered to the TaskScheduler and the retried task is scheduled to it.{quote}

IIUC, removing an executor due to heartbeat timeout is treated as a SlaveLost, which causes a task failure for each task running on that executor and therefore blacklists the task from running on that executor again. So why can the executor be offered to the retried task again?
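
The blacklisting argument in this comment can be modeled with a small sketch. It is a toy model, not Spark's implementation: `ToyTaskBlacklist` and its methods are hypothetical names, and the two constructor parameters only loosely mirror the Spark 2.x settings `spark.blacklist.enabled` and `spark.blacklist.task.maxTaskAttemptsPerExecutor`. It shows that one recorded failure keeps the retry off the executor only when blacklisting is enabled and allows a single attempt per executor.

```python
class ToyTaskBlacklist:
    """Hypothetical per-task blacklist; a simplification, not Spark's classes."""

    def __init__(self, enabled=True, max_attempts_per_executor=1):
        self.enabled = enabled
        self.max_attempts_per_executor = max_attempts_per_executor
        self.failures = {}  # (task, executor) -> number of failed attempts

    def record_failure(self, task, executor):
        key = (task, executor)
        self.failures[key] = self.failures.get(key, 0) + 1

    def can_run(self, task, executor):
        # With blacklisting off, every executor stays eligible.
        if not self.enabled:
            return True
        failed = self.failures.get((task, executor), 0)
        return failed < self.max_attempts_per_executor


def retry_allowed(blacklist):
    # Task 55 fails once on executor 1100, then asks to run there again.
    blacklist.record_failure(55, "1100")
    return blacklist.can_run(55, "1100")


strict = retry_allowed(ToyTaskBlacklist())
tolerant = retry_allowed(ToyTaskBlacklist(max_attempts_per_executor=2))
disabled = retry_allowed(ToyTaskBlacklist(enabled=False))
print(strict, tolerant, disabled)
```

Only the strict configuration rejects the retry. The other two leave executor 1100 eligible.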
[jira] [Commented] (SPARK-24387) Heartbeat-timeout executor is added back and used again
[ https://issues.apache.org/jira/browse/SPARK-24387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498942#comment-16498942 ]

Apache Spark commented on SPARK-24387:

User 'lirui-apache' has created a pull request for this issue:
https://github.com/apache/spark/pull/21486
[jira] [Commented] (SPARK-24387) Heartbeat-timeout executor is added back and used again
[ https://issues.apache.org/jira/browse/SPARK-24387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492563#comment-16492563 ]

Rui Li commented on SPARK-24387:

Instead of letting HeartbeatReceiver tell TaskScheduler that the executor is lost, I'm wondering whether it makes sense to let CoarseGrainedSchedulerBackend call executorLost in the killExecutors method. At that point the executor has already been marked as pending-to-remove and won't be offered again. [~kayousterhout], [~vanzin], would you mind sharing your thoughts? Thanks.
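
The ordering proposed in this comment can be sketched as a toy model. `ToyScheduler` and `ToyBackend` are hypothetical stand-ins, not Spark classes, and the method names only echo the ones mentioned above. The sketch assumes the backend filters pending-to-remove executors out of its offers, and shows that marking the executor before notifying the scheduler keeps the retried task off it.

```python
class ToyScheduler:
    """Hypothetical stand-in for TaskScheduler; names are illustrative only."""

    def __init__(self):
        self.backend = None
        self.assignments = []  # (task, executor) pairs

    def executor_lost(self, executor_id, pending_tasks):
        # Losing an executor triggers a fresh round of offers for the retries.
        offers = self.backend.make_offers()
        for task in pending_tasks:
            if offers:
                self.assignments.append((task, offers[0]))


class ToyBackend:
    """Hypothetical stand-in for CoarseGrainedSchedulerBackend."""

    def __init__(self, scheduler, executors):
        self.scheduler = scheduler
        self.alive = set(executors)
        self.pending_to_remove = set()
        scheduler.backend = self

    def make_offers(self):
        # Executors already marked for removal are never offered.
        return sorted(self.alive - self.pending_to_remove)

    def kill_executors(self, executor_ids, pending_tasks):
        self.pending_to_remove.update(executor_ids)  # 1. mark first
        for executor_id in executor_ids:             # 2. then notify the scheduler
            self.scheduler.executor_lost(executor_id, pending_tasks)
        self.alive.difference_update(executor_ids)   # 3. the kill completes last


scheduler = ToyScheduler()
backend = ToyBackend(scheduler, ["1100"])
backend.kill_executors(["1100"], ["task 55.1"])
print(scheduler.assignments)
```

With executor 1100 as the only executor, the retry receives no offer and waits for a replacement instead of landing on the executor being killed.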
[jira] [Commented] (SPARK-24387) Heartbeat-timeout executor is added back and used again
[ https://issues.apache.org/jira/browse/SPARK-24387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490309#comment-16490309 ]

Rui Li commented on SPARK-24387:

When HeartbeatReceiver finds that an executor's heartbeat has timed out, it informs the TaskScheduler and kills the executor asynchronously. When TaskScheduler handles the lost executor, it tries to revive offers from the backend. So I think there's a race condition that the backend may make offers before killing the executor. And since this is the only executor left, it's offered to the TaskScheduler and the retried task is scheduled to it.

Also, when killing a heartbeat-timeout executor, we expect a replacement executor to be launched. But by the time the new executor is launched, there's no task for it to run, so it's kept idle until it is killed by dynamic allocation.
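
The race described above can be illustrated with a deterministic toy model. This is a sketch under stated assumptions, not Spark code: `ToyBackend` and `handle_heartbeat_timeout` are hypothetical names, and the `kill_runs_first` flag stands in for whichever of the two asynchronous steps happens to run first.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBackend:
    """Hypothetical stand-in for the scheduler backend; not Spark's API."""
    alive: set = field(default_factory=set)

    def make_offers(self):
        # Offers every executor the backend still considers alive.
        return sorted(self.alive)

    def kill(self, executor_id):
        self.alive.discard(executor_id)


def handle_heartbeat_timeout(backend, executor_id, kill_runs_first):
    """Return the executor chosen for the retried task, or None.

    kill_runs_first models which asynchronous step wins: the kill request,
    or the revive-offers round triggered by handling the lost executor.
    """
    if kill_runs_first:
        backend.kill(executor_id)
        offers = backend.make_offers()
    else:
        offers = backend.make_offers()  # the lost executor is still listed
        backend.kill(executor_id)
    return offers[0] if offers else None


racy = handle_heartbeat_timeout(ToyBackend({"1100"}), "1100", kill_runs_first=False)
safe = handle_heartbeat_timeout(ToyBackend({"1100"}), "1100", kill_runs_first=True)
print(racy, safe)  # prints "1100 None"
```

When the offer round wins, the retry lands on executor 1100 even though the driver has just declared it lost. When the kill wins, there is nothing to offer until a replacement registers.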
[jira] [Commented] (SPARK-24387) Heartbeat-timeout executor is added back and used again
[ https://issues.apache.org/jira/browse/SPARK-24387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490293#comment-16490293 ]

Rui Li commented on SPARK-24387:

A snippet of the log w/ some fields masked:
{noformat}
[Stage 2:==>(199 + 1) / 200]18/05/20 05:37:07 WARN HeartbeatReceiver: Removing executor 1100 with no recent heartbeats: 345110 ms exceeds timeout 30 ms
18/05/20 05:37:07 ERROR YarnClusterScheduler: Lost executor 1100 on HOSTA: Executor heartbeat timed out after 345110 ms
18/05/20 05:37:07 WARN TaskSetManager: Lost task 55.0 in stage 2.0 (TID 12080, HOSTA, executor 1100): ExecutorLostFailure (executor 1100 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 345110 ms
18/05/20 05:37:07 INFO DAGScheduler: Executor lost: 1100 (epoch 2)
18/05/20 05:37:07 INFO DAGScheduler: Host added was in lost list earlier: HOSTA
18/05/20 05:37:07 INFO TaskSetManager: Starting task 55.1 in stage 2.0 (TID 12225, HOSTA, executor 1100, partition 55, PROCESS_LOCAL, 6227 bytes)
{noformat}
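
One way to spot this pattern in a driver log is to look for a task that starts on an executor already reported as lost. The sketch below is an illustration only: `find_reuse` and the two regular expressions are hypothetical helpers written against the message formats in the snippet above, which may differ in other Spark versions. The sample text is three lines copied from that snippet.

```python
import re

LOG = """\
18/05/20 05:37:07 ERROR YarnClusterScheduler: Lost executor 1100 on HOSTA: Executor heartbeat timed out after 345110 ms
18/05/20 05:37:07 INFO DAGScheduler: Executor lost: 1100 (epoch 2)
18/05/20 05:37:07 INFO TaskSetManager: Starting task 55.1 in stage 2.0 (TID 12225, HOSTA, executor 1100, partition 55, PROCESS_LOCAL, 6227 bytes)
"""

# Patterns assume the message formats shown in the log snippet above.
LOST = re.compile(r"Lost executor (\d+) on")
STARTED = re.compile(r"Starting task (\S+) in stage \S+ \(TID \d+, \S+ executor (\d+),")


def find_reuse(log_text):
    """Return (task, executor) pairs where a task started on a lost executor."""
    lost = set()
    reused = []
    for line in log_text.splitlines():
        lost_match = LOST.search(line)
        if lost_match:
            lost.add(lost_match.group(1))
        start_match = STARTED.search(line)
        if start_match and start_match.group(2) in lost:
            reused.append((start_match.group(1), start_match.group(2)))
    return reused


print(find_reuse(LOG))  # prints [('55.1', '1100')]
```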