[GitHub] spark pull request #14933: [SPARK-16533][CORE] - backport driver deadlock fi...
Github user angolon closed the pull request at: https://github.com/apache/spark/pull/14933 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14933: [SPARK-16533][CORE] - backport driver deadlock fix to 2....
Github user angolon commented on the issue: https://github.com/apache/spark/pull/14933 Done. Thanks again @vanzin :smile:
[GitHub] spark pull request #14933: [SPARK-16533][CORE] - backport driver deadlock fi...
GitHub user angolon opened a pull request: https://github.com/apache/spark/pull/14933

[SPARK-16533][CORE] - backport driver deadlock fix to 2.0

## What changes were proposed in this pull request?

Backport changes from #14710 and #14925 to 2.0

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/angolon/spark SPARK-16533-2.0

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14933.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14933

commit 45a3f220b5f1b08fbe4f8d390755041dd2738e67
Author: Angus Gerry <ango...@gmail.com>
Date: 2016-09-01T17:35:31Z

[SPARK-16533][CORE] resolve deadlocking in driver when executors die

This pull request reverts the changes made as a part of #14605, which simply side-steps the deadlock issue. Instead, I propose the following approach:

* Use `scheduleWithFixedDelay` when calling `ExecutorAllocationManager.schedule` for scheduling executor requests. The intent of this is that if invocations are delayed beyond the default schedule interval on account of lock contention, then we avoid a situation where calls to `schedule` are made back-to-back, potentially releasing and then immediately reacquiring these locks, further exacerbating contention.
* Replace a number of calls to `askWithRetry` with `ask` inside of message handling code in `CoarseGrainedSchedulerBackend` and its ilk. This allows us to queue messages with the relevant endpoints, release whatever locks we might be holding, and then block whilst awaiting the response. This change is made at the cost of being able to retry should sending the message fail, as retrying outside of the lock could easily cause race conditions if other conflicting messages have been sent whilst awaiting a response.
I believe this to be the lesser of two evils, as in many cases these RPC calls are to process-local components, and so failures are more likely to be deterministic, and timeouts are more likely to be caused by lock contention.

## How was this patch tested?

Existing tests, and manual tests under yarn-client mode.

Author: Angus Gerry <ango...@gmail.com>

Closes #14710 from angolon/SPARK-16533.

commit de488ce0a0025d3c9736a1df6e45d90e265a84d4
Author: Marcelo Vanzin <van...@cloudera.com>
Date: 2016-09-01T21:02:58Z

[SPARK-16533][HOTFIX] Fix compilation on Scala 2.10.

No idea why it was failing (the needed import was there), but this makes things work.

Author: Marcelo Vanzin <van...@cloudera.com>

Closes #14925 from vanzin/SPARK-16533.
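The fixed-delay scheduling behaviour that the first bullet of the patch relies on can be sketched with a plain JDK scheduled executor. This is a minimal illustration only; the sleeping counter task is a hypothetical stand-in for `ExecutorAllocationManager.schedule` blocking on contended locks:

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicLong

// Stand-in for ExecutorAllocationManager.schedule: a task whose run time
// (simulated lock contention) can approach or exceed the schedule interval.
object FixedDelaySketch {
  def runs(taskMillis: Long, delayMillis: Long, windowMillis: Long): Long = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val count = new AtomicLong(0)
    val task = new Runnable {
      def run(): Unit = {
        Thread.sleep(taskMillis) // simulate time spent blocked on contended locks
        count.incrementAndGet()
      }
    }
    // scheduleWithFixedDelay measures the interval from the *end* of one run
    // to the start of the next, so a slow run can never cause back-to-back
    // invocations the way scheduleAtFixedRate's catch-up behaviour can.
    scheduler.scheduleWithFixedDelay(task, 0, delayMillis, TimeUnit.MILLISECONDS)
    Thread.sleep(windowMillis)
    scheduler.shutdownNow()
    count.get()
  }
}
```

Unlike `scheduleAtFixedRate`, which queues missed runs and fires them immediately once a slow run finishes, `scheduleWithFixedDelay` always waits the full delay after each run completes, which is exactly the property the patch wants under contention.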
[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...
Github user angolon commented on the issue: https://github.com/apache/spark/pull/14710 retest this please
[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...
Github user angolon commented on the issue: https://github.com/apache/spark/pull/14710 ...*sigh*
[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...
Github user angolon commented on the issue: https://github.com/apache/spark/pull/14710 Thanks @vanzin. I'm on mobile at the moment - I'll take care of your nit when I get back to my desk in a couple of hours.
[GitHub] spark pull request #14710: [SPARK-16533][CORE] resolve deadlocking in driver...
Github user angolon commented on a diff in the pull request: https://github.com/apache/spark/pull/14710#discussion_r76899230

Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala

```diff
@@ -532,39 +547,53 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
   final def killExecutors(
       executorIds: Seq[String],
       replace: Boolean,
-      force: Boolean): Boolean = synchronized {
+      force: Boolean): Boolean = {
     logInfo(s"Requesting to kill executor(s) ${executorIds.mkString(", ")}")
-    val (knownExecutors, unknownExecutors) = executorIds.partition(executorDataMap.contains)
-    unknownExecutors.foreach { id =>
-      logWarning(s"Executor to kill $id does not exist!")
-    }
-    // If an executor is already pending to be removed, do not kill it again (SPARK-9795)
-    // If this executor is busy, do not kill it unless we are told to force kill it (SPARK-9552)
-    val executorsToKill = knownExecutors
-      .filter { id => !executorsPendingToRemove.contains(id) }
-      .filter { id => force || !scheduler.isExecutorBusy(id) }
-    executorsToKill.foreach { id => executorsPendingToRemove(id) = !replace }
-
-    // If we do not wish to replace the executors we kill, sync the target number of executors
-    // with the cluster manager to avoid allocating new ones. When computing the new target,
-    // take into account executors that are pending to be added or removed.
-    if (!replace) {
-      doRequestTotalExecutors(
-        numExistingExecutors + numPendingExecutors - executorsPendingToRemove.size)
-    } else {
-      numPendingExecutors += knownExecutors.size
+    val response = synchronized {
+      val (knownExecutors, unknownExecutors) = executorIds.partition(executorDataMap.contains)
+      unknownExecutors.foreach { id =>
+        logWarning(s"Executor to kill $id does not exist!")
+      }
+
+      // If an executor is already pending to be removed, do not kill it again (SPARK-9795)
+      // If this executor is busy, do not kill it unless we are told to force kill it (SPARK-9552)
+      val executorsToKill = knownExecutors
+        .filter { id => !executorsPendingToRemove.contains(id) }
+        .filter { id => force || !scheduler.isExecutorBusy(id) }
+      executorsToKill.foreach { id => executorsPendingToRemove(id) = !replace }
+
+      // If we do not wish to replace the executors we kill, sync the target number of executors
+      // with the cluster manager to avoid allocating new ones. When computing the new target,
+      // take into account executors that are pending to be added or removed.
+      val adjustTotalExecutors =
+        if (!replace) {
+          doRequestTotalExecutors(
+            numExistingExecutors + numPendingExecutors - executorsPendingToRemove.size)
+        } else {
+          numPendingExecutors += knownExecutors.size
+          Future.successful(true)
+        }
+
+      val killExecutors: Boolean => Future[Boolean] =
+        if (!executorsToKill.isEmpty) {
+          _ => doKillExecutors(executorsToKill)
+        } else {
+          _ => Future.successful(false)
+        }
+
+      adjustTotalExecutors.flatMap(killExecutors)(ThreadUtils.sameThread)
```

--- End diff ---

When I originally started working on this I thought I wouldn't be able to avoid blocking on that call within the synchronized block. However, my (admittedly novice) understanding of the code aligns with what @vanzin said: because all it does is send the kill message, there's no need to synchronize over it.
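The shape of the change discussed in that review thread, building the `Future` chain inside the lock while letting the RPCs complete outside it, can be sketched as follows. This is a rough, self-contained approximation, not Spark's actual code: `requestTotal` and `killRemote` are hypothetical stand-ins for `doRequestTotalExecutors` and `doKillExecutors`, and the global execution context stands in for `ThreadUtils.sameThread`:

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object KillPatternSketch {
  private val lock = new Object
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical stand-ins for doRequestTotalExecutors / doKillExecutors:
  // each just sends a message and immediately returns a Future of the reply.
  def requestTotal(n: Int): Future[Boolean] = Future { n >= 0 }
  def killRemote(ids: Seq[String]): Future[Boolean] = Future { ids.nonEmpty }

  def kill(ids: Seq[String], replace: Boolean): Future[Boolean] = {
    val response = lock.synchronized {
      // All bookkeeping happens under the lock, and the Futures are *created*
      // here, but nothing below blocks while the lock is held.
      val adjustTotal =
        if (!replace) requestTotal(ids.size) else Future.successful(true)
      val doKill: Boolean => Future[Boolean] =
        if (ids.nonEmpty) _ => killRemote(ids) else _ => Future.successful(false)
      adjustTotal.flatMap(doKill) // sameThread in Spark; global here
    }
    response // the caller awaits this outside the synchronized block
  }
}
```

Awaiting `response` inside the `synchronized` block would reintroduce the deadlock whenever completing the future requires taking the same lock.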
[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...
Github user angolon commented on the issue: https://github.com/apache/spark/pull/14710 Thanks for the feedback, @vanzin - all good points. I'll fix them up.
[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...
Github user angolon commented on the issue: https://github.com/apache/spark/pull/14710 Hrmm... SparkContextSuite passes all tests for me locally. Any idea what might be happening here?
[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...
Github user angolon commented on the issue: https://github.com/apache/spark/pull/14710 Done, sorry!
[GitHub] spark pull request #14710: [SPARK-16533][CORE]
GitHub user angolon opened a pull request: https://github.com/apache/spark/pull/14710

[SPARK-16533][CORE]

## What changes were proposed in this pull request?

This pull request reverts the changes made as a part of #14605, which simply side-steps the deadlock issue. Instead, I propose the following approach:

* Use `scheduleWithFixedDelay` when calling `ExecutorAllocationManager.schedule` for scheduling executor requests. The intent of this is that if invocations are delayed beyond the default schedule interval on account of lock contention, then we avoid a situation where calls to `schedule` are made back-to-back, potentially releasing and then immediately reacquiring these locks, further exacerbating contention.
* Replace a number of calls to `askWithRetry` with `ask` inside of message handling code in `CoarseGrainedSchedulerBackend` and its ilk. This allows us to queue messages with the relevant endpoints, release whatever locks we might be holding, and then block whilst awaiting the response. This change is made at the cost of being able to retry should sending the message fail, as retrying outside of the lock could easily cause race conditions if other conflicting messages have been sent whilst awaiting a response.

I believe this to be the lesser of two evils, as in many cases these RPC calls are to process-local components, and so failures are more likely to be deterministic, and timeouts are more likely to be caused by lock contention.

## How was this patch tested?

Existing tests, and manual tests under yarn-client mode.
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/angolon/spark SPARK-16533

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14710.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14710

commit cef69bf470199c63b6638933756b1d057dc890d1
Author: Angus Gerry <ango...@gmail.com>
Date: 2016-08-19T01:52:58Z

Revert "[SPARK-17022][YARN] Handle potential deadlock in driver handling messages"

This reverts commit ea0bf91b4a2ca3ef472906e50e31fd6268b6f53e.

commit 4970b3b0bcd834bbe5d5473a3065f04a48b12643
Author: Angus Gerry <ango...@gmail.com>
Date: 2016-08-09T04:45:29Z

[SPARK-16533][CORE] Use scheduleWithFixedDelay when calling ExecutorAllocatorManager.schedule to ease contention on locks.

commit 920274a3ed0b8278d38d721587a24c9441fa5ff3
Author: Angus Gerry <ango...@gmail.com>
Date: 2016-08-04T06:27:56Z

[SPARK-16533][CORE] Replace many calls to askWithRetry to plain old ask.
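The `askWithRetry`-to-`ask` change in the last commit boils down to: send while holding the lock, but block on the reply only after releasing it. A toy sketch follows (the names `ask` and `handle` here are illustrative, not Spark's RPC API); the replying thread deliberately needs the same lock, to show why blocking inside `synchronized` would deadlock:

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._

object AskSketch {
  private val lock = new Object

  // Toy ask: queues the message and returns immediately, in the spirit of
  // RpcEndpointRef.ask. The reply comes from another thread that itself
  // needs `lock` before it can respond.
  def ask(msg: String): Future[String] = {
    val p = Promise[String]()
    new Thread(new Runnable {
      def run(): Unit = lock.synchronized { p.success("ack:" + msg) }
    }).start()
    p.future
  }

  def handle(msg: String): String = {
    // Send while holding the lock, but do NOT block inside it...
    val reply = lock.synchronized { ask(msg) }
    // ...await only after the lock is released, so the replier can acquire
    // it. Awaiting inside the synchronized block above would deadlock.
    Await.result(reply, 30.seconds)
  }
}
```

The price, as the PR description notes, is losing the automatic retries of `askWithRetry`; the trade is acceptable here because the endpoints are mostly process-local.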