[jira] [Assigned] (SPARK-17929) Deadlock when AM restart and send RemoveExecutor on reset

2016-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17929:


Assignee: Apache Spark

> Deadlock when AM restart and send RemoveExecutor on reset
> -
>
> Key: SPARK-17929
> URL: https://issues.apache.org/jira/browse/SPARK-17929
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.0.0
> Reporter: Weizhong
> Assignee: Apache Spark
> Priority: Minor
>
> The fix for SPARK-10582 added a reset method to CoarseGrainedSchedulerBackend.scala:
> {code}
>   protected def reset(): Unit = synchronized {
>     numPendingExecutors = 0
>     executorsPendingToRemove.clear()
>     // Remove all the lingering executors that should be removed but not yet. The reason might be
>     // because (1) disconnected event is not yet received; (2) executors die silently.
>     executorDataMap.toMap.foreach { case (eid, _) =>
>       driverEndpoint.askWithRetry[Boolean](
>         RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered.")))
>     }
>   }
> {code}
> However, removeExecutor also needs the lock
> "CoarseGrainedSchedulerBackend.this.synchronized". reset holds that lock while
> it waits for the RemoveExecutor reply, so the two deadlock: the RPC fails, and
> the reset fails with it.
> {code}
>   private def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = {
>     logDebug(s"Asked to remove executor $executorId with reason $reason")
>     executorDataMap.get(executorId) match {
>       case Some(executorInfo) =>
>         // This must be synchronized because variables mutated
>         // in this block are read when requesting executors
>         val killed = CoarseGrainedSchedulerBackend.this.synchronized {
>           addressToExecutorId -= executorInfo.executorAddress
>           executorDataMap -= executorId
>           executorsPendingLossReason -= executorId
>           executorsPendingToRemove.remove(executorId).getOrElse(false)
>         }
>         ...
> {code}
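
The locking pattern described in the report can be reproduced outside Spark. The following is a minimal sketch, not Spark code: ToyBackend, ask, resetHoldingLock and resetOutsideLock are hypothetical names, a single-threaded java.util.concurrent executor stands in for the driver endpoint's message loop, and a timeout is used so that the blocked call returns instead of hanging forever. The second reset variant shows one possible way to avoid the deadlock (snapshot under the lock, ask after releasing it); it is not necessarily the fix adopted for this issue.

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit, TimeoutException}

// Hypothetical stand-in for CoarseGrainedSchedulerBackend; not Spark code.
class ToyBackend {
  // One thread playing the role of the driver endpoint's message loop.
  private val endpoint = Executors.newSingleThreadExecutor()
  private val executors = scala.collection.mutable.HashSet("1", "2")

  // Runs on the endpoint thread and needs the backend's monitor,
  // like removeExecutor in the report.
  private def removeExecutor(id: String): Boolean = this.synchronized {
    executors.remove(id)
  }

  // Blocking "ask": hand the request to the endpoint thread and wait for the reply.
  private def ask(id: String, timeoutMs: Long): Boolean = {
    val task = new Callable[Boolean] { def call(): Boolean = removeExecutor(id) }
    endpoint.submit(task).get(timeoutMs, TimeUnit.MILLISECONDS)
  }

  // Same shape as the reported reset(): waits for the reply while holding the
  // monitor, so the endpoint thread can never enter removeExecutor in time.
  def resetHoldingLock(timeoutMs: Long): Boolean = this.synchronized {
    try { executors.toList.foreach(ask(_, timeoutMs)); true }
    catch { case _: TimeoutException => false }
  }

  // Alternative: take a snapshot under the lock, then ask after releasing it.
  def resetOutsideLock(timeoutMs: Long): Boolean = {
    val ids = this.synchronized { executors.toList }
    try { ids.foreach(ask(_, timeoutMs)); true }
    catch { case _: TimeoutException => false }
  }

  def remaining: Int = this.synchronized { executors.size }
  def shutdown(): Unit = endpoint.shutdownNow()
}
```

In this sketch resetHoldingLock returns false once the timeout expires, because the reply can only be produced after the caller leaves its synchronized block. resetOutsideLock completes because the endpoint thread is free to take the monitor while the caller waits.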



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17929) Deadlock when AM restart and send RemoveExecutor on reset

2016-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17929:


Assignee: (was: Apache Spark)

> Deadlock when AM restart and send RemoveExecutor on reset
> -
>
> Key: SPARK-17929
> URL: https://issues.apache.org/jira/browse/SPARK-17929
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.0.0
> Reporter: Weizhong
> Priority: Minor


