Shixiong Zhu created SPARK-27348:
------------------------------------

             Summary: HeartbeatReceiver doesn't remove lost executors from CoarseGrainedSchedulerBackend
                 Key: SPARK-27348
                 URL: https://issues.apache.org/jira/browse/SPARK-27348
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.0
            Reporter: Shixiong Zhu

When a heartbeat timeout happens in HeartbeatReceiver, it doesn't remove the lost executor from CoarseGrainedSchedulerBackend. When an executor's connection is not shut down gracefully, CoarseGrainedSchedulerBackend may not receive a disconnect event. In this case, CoarseGrainedSchedulerBackend still thinks the lost executor is alive and may ask TaskScheduler to run tasks on it. Such a task will never finish, and the job will hang forever.
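
The failure mode described above can be illustrated with a toy model. This is only a sketch and not Spark code: `ToySchedulerBackend`, `ToyHeartbeatReceiver`, `simulate`, the `notify_backend` flag, and the 120-second timeout are all hypothetical names and values chosen for illustration. The `notify_backend=True` branch shows one possible remedy (having the heartbeat side tell the backend about the expiry); it is not claimed to be the fix adopted in Spark.

```python
# Toy model of the reported behaviour; NOT Spark code. All names are hypothetical.


class ToySchedulerBackend:
    """Stands in for CoarseGrainedSchedulerBackend: tracks executors it believes are alive."""

    def __init__(self):
        self.alive_executors = set()

    def register(self, executor_id):
        self.alive_executors.add(executor_id)

    def remove_executor(self, executor_id):
        # Reached on a disconnect event or an explicit removal request.
        self.alive_executors.discard(executor_id)

    def offer_targets(self):
        # Executors the backend would still consider for new tasks.
        return sorted(self.alive_executors)


class ToyHeartbeatReceiver:
    """Stands in for HeartbeatReceiver: expires executors whose heartbeats stop."""

    def __init__(self, backend, timeout_s, notify_backend):
        self.backend = backend
        self.timeout_s = timeout_s
        self.notify_backend = notify_backend  # hypothetical switch for the remedy
        self.last_seen = {}

    def heartbeat(self, executor_id, now):
        self.last_seen[executor_id] = now

    def expire_dead_hosts(self, now):
        expired = [e for e, t in self.last_seen.items() if now - t > self.timeout_s]
        for executor_id in expired:
            del self.last_seen[executor_id]
            if self.notify_backend:
                # Without this call the backend never learns about the loss.
                self.backend.remove_executor(executor_id)
        return expired


def simulate(notify_backend):
    backend = ToySchedulerBackend()
    receiver = ToyHeartbeatReceiver(backend, timeout_s=120, notify_backend=notify_backend)
    for executor_id in ("exec-1", "exec-2"):
        backend.register(executor_id)
        receiver.heartbeat(executor_id, now=0)
    # exec-2 keeps heartbeating; exec-1 goes silent and no disconnect event arrives.
    receiver.heartbeat("exec-2", now=100)
    receiver.expire_dead_hosts(now=150)
    return backend.offer_targets()
```

In this sketch, `simulate(False)` returns `['exec-1', 'exec-2']`: the heartbeat side has expired `exec-1`, yet the backend would still offer it for tasks, which matches the hang described in the report. `simulate(True)` returns `['exec-2']`.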