[ 
https://issues.apache.org/jira/browse/SPARK-17929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weizhong updated SPARK-17929:
-----------------------------
    Summary: Deadlock when AM restart and send RemoveExecutor on reset  (was: Deadlock when AM restart send RemoveExecutor)

> Deadlock when AM restart and send RemoveExecutor on reset
> ---------------------------------------------------------
>
>                 Key: SPARK-17929
>                 URL: https://issues.apache.org/jira/browse/SPARK-17929
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Weizhong
>            Priority: Minor
>
> The fix for SPARK-10582 added a reset() method to CoarseGrainedSchedulerBackend.scala:
> {code}
>   protected def reset(): Unit = synchronized {
>     numPendingExecutors = 0
>     executorsPendingToRemove.clear()
>     // Remove all the lingering executors that should be removed but not yet.
>     // The reason might be because (1) disconnected event is not yet received;
>     // (2) executors die silently.
>     executorDataMap.toMap.foreach { case (eid, _) =>
>       driverEndpoint.askWithRetry[Boolean](
>         RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered.")))
>     }
>   }
> {code}
> However, removeExecutor also needs the lock
> "CoarseGrainedSchedulerBackend.this.synchronized". reset() holds that lock
> while it waits for the RemoveExecutor reply, so the two block each other:
> the RPC fails, and reset() fails with it.
> {code}
>     private def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = {
>       logDebug(s"Asked to remove executor $executorId with reason $reason")
>       executorDataMap.get(executorId) match {
>         case Some(executorInfo) =>
>           // This must be synchronized because variables mutated
>           // in this block are read when requesting executors
>           val killed = CoarseGrainedSchedulerBackend.this.synchronized {
>             addressToExecutorId -= executorInfo.executorAddress
>             executorDataMap -= executorId
>             executorsPendingLossReason -= executorId
>             executorsPendingToRemove.remove(executorId).getOrElse(false)
>           }
>      ...
> {code}
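
The locking pattern described above can be reproduced outside Spark. The following is a minimal sketch, not Spark code: ToyBackend, ask, resetHoldingLock and resetOutsideLock are hypothetical names, a single-threaded executor stands in for the driver endpoint's message loop, and a bounded Await stands in for askWithRetry. The second reset variant is only one possible way to avoid the cycle and is not necessarily the change adopted for this issue.

```scala
import java.util.concurrent.{Executors, TimeoutException}
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical stand-in for CoarseGrainedSchedulerBackend (illustration only).
class ToyBackend {
  // One thread plays the role of the driver endpoint's message loop.
  private val pool = Executors.newSingleThreadExecutor()
  private implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
  private val executors = scala.collection.mutable.HashSet("1", "2")

  // Runs on the endpoint thread and needs the backend lock, like removeExecutor.
  private def removeExecutor(id: String): Boolean = this.synchronized {
    executors.remove(id)
  }

  // Blocking request with a timeout; returns false if no reply arrives in time.
  private def ask(id: String, timeout: FiniteDuration): Boolean =
    try Await.result(Future(removeExecutor(id)), timeout)
    catch { case _: TimeoutException => false }

  // Mirrors the reported pattern: a blocking ask issued while holding the lock.
  // The endpoint thread cannot enter removeExecutor until this method returns.
  def resetHoldingLock(timeout: FiniteDuration): Boolean = this.synchronized {
    executors.toList.map(id => ask(id, timeout)).forall(identity)
  }

  // Possible alternative: snapshot the ids under the lock, ask after releasing it.
  def resetOutsideLock(timeout: FiniteDuration): Boolean = {
    val ids = this.synchronized { executors.toList }
    ids.map(id => ask(id, timeout)).forall(identity)
  }

  def remaining: Int = this.synchronized(executors.size)

  def shutdown(): Unit = pool.shutdownNow()
}
```

In this sketch, resetHoldingLock(200.millis) returns false because every ask times out while the caller still holds the lock, whereas resetOutsideLock(2.seconds) on a fresh ToyBackend returns true and leaves remaining at 0.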


