[GitHub] spark pull request #14933: [SPARK-16533][CORE] - backport driver deadlock fi...

2016-09-07 Thread angolon
Github user angolon closed the pull request at:

https://github.com/apache/spark/pull/14933


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14933: [SPARK-16533][CORE] - backport driver deadlock fix to 2....

2016-09-07 Thread angolon
Github user angolon commented on the issue:

https://github.com/apache/spark/pull/14933
  
Done. Thanks again @vanzin :smile:





[GitHub] spark pull request #14933: [SPARK-16533][CORE] - backport driver deadlock fi...

2016-09-01 Thread angolon
GitHub user angolon opened a pull request:

https://github.com/apache/spark/pull/14933

[SPARK-16533][CORE] - backport driver deadlock fix to 2.0

## What changes were proposed in this pull request?
Backport changes from #14710 and #14925 to 2.0

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/angolon/spark SPARK-16533-2.0

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14933.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14933


commit 45a3f220b5f1b08fbe4f8d390755041dd2738e67
Author: Angus Gerry <ango...@gmail.com>
Date:   2016-09-01T17:35:31Z

[SPARK-16533][CORE] resolve deadlocking in driver when executors die

This pull request reverts the changes made as a part of #14605, which
simply side-steps the deadlock issue. Instead, I propose the following approach:
* Use `scheduleWithFixedDelay` when calling
`ExecutorAllocationManager.schedule` to schedule executor requests. The
intent is that if invocations are delayed beyond the default schedule
interval on account of lock contention, we avoid a situation where calls
to `schedule` are made back-to-back, potentially releasing and then immediately
reacquiring these locks - further exacerbating contention.
* Replace a number of calls to `askWithRetry` with `ask` inside the message
handling code in `CoarseGrainedSchedulerBackend` and its ilk. This allows us
to queue messages with the relevant endpoints, release whatever locks we might
be holding, and then block whilst awaiting the response. This change comes at
the cost of no longer being able to retry should sending the message fail, as
retrying outside of the lock could easily cause race conditions if other
conflicting messages have been sent whilst awaiting a response. I believe this
to be the lesser of two evils: in many cases these RPC calls are to
process-local components, so failures are more likely to be deterministic, and
timeouts are more likely to be caused by lock contention.
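The first bullet can be illustrated with a small, self-contained sketch (this is
not Spark's `ExecutorAllocationManager` code; the helper name and counting logic
are hypothetical). With `scheduleAtFixedRate`, a run that stalls on a contended
lock causes the missed executions to fire back-to-back the moment it returns;
`scheduleWithFixedDelay` instead waits the full interval after each run
completes, so delayed runs never bunch up:

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}

object FixedDelayDemo {
  // Run `task` `times` times, waiting `delayMs` after each completion before
  // the next run starts, and return the number of completed runs.
  def runWithFixedDelay(times: Int, delayMs: Long)(task: () => Unit): Int = {
    val remaining = new CountDownLatch(times)
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    // Key point: the delay is measured from the *end* of the previous run,
    // so a run stalled on a lock pushes every later run out with it.
    scheduler.scheduleWithFixedDelay(new Runnable {
      def run(): Unit = { task(); remaining.countDown() }
    }, 0, delayMs, TimeUnit.MILLISECONDS)
    remaining.await()
    scheduler.shutdownNow()
    times - remaining.getCount.toInt
  }
}
```

With `scheduleAtFixedRate` the executor would instead schedule each run relative
to the original start time, which is exactly the back-to-back behaviour the PR
is trying to avoid.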

Existing tests, and manual tests under yarn-client mode.

Author: Angus Gerry <ango...@gmail.com>

Closes #14710 from angolon/SPARK-16533.

commit de488ce0a0025d3c9736a1df6e45d90e265a84d4
Author: Marcelo Vanzin <van...@cloudera.com>
Date:   2016-09-01T21:02:58Z

[SPARK-16533][HOTFIX] Fix compilation on Scala 2.10.

No idea why it was failing (the needed import was there), but
this makes things work.

Author: Marcelo Vanzin <van...@cloudera.com>

Closes #14925 from vanzin/SPARK-16533.







[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...

2016-08-31 Thread angolon
Github user angolon commented on the issue:

https://github.com/apache/spark/pull/14710
  
retest this please





[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...

2016-08-31 Thread angolon
Github user angolon commented on the issue:

https://github.com/apache/spark/pull/14710
  
...*sigh*





[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...

2016-08-30 Thread angolon
Github user angolon commented on the issue:

https://github.com/apache/spark/pull/14710
  
Thanks @vanzin. I'm on mobile at the moment - I'll take care of your nit 
when I get back to my desk in a couple of hours.





[GitHub] spark pull request #14710: [SPARK-16533][CORE] resolve deadlocking in driver...

2016-08-30 Thread angolon
Github user angolon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14710#discussion_r76899230
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala ---
@@ -532,39 +547,53 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
   final def killExecutors(
       executorIds: Seq[String],
       replace: Boolean,
-      force: Boolean): Boolean = synchronized {
+      force: Boolean): Boolean = {
     logInfo(s"Requesting to kill executor(s) ${executorIds.mkString(", ")}")
-    val (knownExecutors, unknownExecutors) = executorIds.partition(executorDataMap.contains)
-    unknownExecutors.foreach { id =>
-      logWarning(s"Executor to kill $id does not exist!")
-    }
 
-    // If an executor is already pending to be removed, do not kill it again (SPARK-9795)
-    // If this executor is busy, do not kill it unless we are told to force kill it (SPARK-9552)
-    val executorsToKill = knownExecutors
-      .filter { id => !executorsPendingToRemove.contains(id) }
-      .filter { id => force || !scheduler.isExecutorBusy(id) }
-    executorsToKill.foreach { id => executorsPendingToRemove(id) = !replace }
-
-    // If we do not wish to replace the executors we kill, sync the target number of executors
-    // with the cluster manager to avoid allocating new ones. When computing the new target,
-    // take into account executors that are pending to be added or removed.
-    if (!replace) {
-      doRequestTotalExecutors(
-        numExistingExecutors + numPendingExecutors - executorsPendingToRemove.size)
-    } else {
-      numPendingExecutors += knownExecutors.size
+    val response = synchronized {
+      val (knownExecutors, unknownExecutors) = executorIds.partition(executorDataMap.contains)
+      unknownExecutors.foreach { id =>
+        logWarning(s"Executor to kill $id does not exist!")
+      }
+
+      // If an executor is already pending to be removed, do not kill it again (SPARK-9795)
+      // If this executor is busy, do not kill it unless we are told to force kill it (SPARK-9552)
+      val executorsToKill = knownExecutors
+        .filter { id => !executorsPendingToRemove.contains(id) }
+        .filter { id => force || !scheduler.isExecutorBusy(id) }
+      executorsToKill.foreach { id => executorsPendingToRemove(id) = !replace }
+
+      // If we do not wish to replace the executors we kill, sync the target number of executors
+      // with the cluster manager to avoid allocating new ones. When computing the new target,
+      // take into account executors that are pending to be added or removed.
+      val adjustTotalExecutors =
+        if (!replace) {
+          doRequestTotalExecutors(
+            numExistingExecutors + numPendingExecutors - executorsPendingToRemove.size)
+        } else {
+          numPendingExecutors += knownExecutors.size
+          Future.successful(true)
+        }
+
+      val killExecutors: Boolean => Future[Boolean] =
+        if (!executorsToKill.isEmpty) {
+          _ => doKillExecutors(executorsToKill)
+        } else {
+          _ => Future.successful(false)
+        }
+
+      adjustTotalExecutors.flatMap(killExecutors)(ThreadUtils.sameThread)
--- End diff --

When I originally started working on this I thought I wouldn't be able to
avoid blocking on that call within the synchronized block. However my
(admittedly novice) understanding of the code aligns with what @vanzin said -
because all it does is send the kill message, there's no need to synchronize
over it.
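The chaining at the end of the diff can be sketched in isolation (this is a
stand-in, not Spark code: `SameThread` mimics `ThreadUtils.sameThread`, and
`adjustTotalExecutors`/`doKillExecutors` are stubbed out). The point is that
`flatMap` on a same-thread `ExecutionContext` sequences the two async requests
without a thread hop, and nothing in the chain needs the lock:

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Minimal same-thread ExecutionContext: runs each task on the calling thread,
// mirroring what ThreadUtils.sameThread provides in Spark.
object SameThread extends ExecutionContext {
  def execute(r: Runnable): Unit = r.run()
  def reportFailure(t: Throwable): Unit = t.printStackTrace()
}

object ChainDemo {
  // Stubs standing in for the real async RPC sends.
  def adjustTotalExecutors(): Future[Boolean] = Future.successful(true)
  def doKillExecutors(ids: Seq[String]): Future[Boolean] =
    Future.successful(ids.nonEmpty)

  // Chain the "adjust total" request into the "kill" request; neither send
  // blocks, so neither needs to run inside a synchronized block.
  def killChained(ids: Seq[String]): Future[Boolean] =
    adjustTotalExecutors().flatMap(_ => doKillExecutors(ids))(SameThread)
}
```

The caller can then block on the resulting future after releasing any locks,
which is exactly the shape the surrounding diff adopts.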





[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...

2016-08-25 Thread angolon
Github user angolon commented on the issue:

https://github.com/apache/spark/pull/14710
  
Thanks for the feedback, @vanzin - all good points. I'll fix them up.





[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...

2016-08-24 Thread angolon
Github user angolon commented on the issue:

https://github.com/apache/spark/pull/14710
  
Hrmm... SparkContextSuite passes all tests for me locally. Any idea what 
might be happening here?





[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...

2016-08-18 Thread angolon
Github user angolon commented on the issue:

https://github.com/apache/spark/pull/14710
  
Done, sorry!





[GitHub] spark pull request #14710: [SPARK-16533][CORE]

2016-08-18 Thread angolon
GitHub user angolon opened a pull request:

https://github.com/apache/spark/pull/14710

[SPARK-16533][CORE]

## What changes were proposed in this pull request?
This pull request reverts the changes made as a part of #14605, which
simply side-steps the deadlock issue. Instead, I propose the following approach:
* Use `scheduleWithFixedDelay` when calling
`ExecutorAllocationManager.schedule` to schedule executor requests. The
intent is that if invocations are delayed beyond the default schedule
interval on account of lock contention, we avoid a situation where calls
to `schedule` are made back-to-back, potentially releasing and then immediately
reacquiring these locks - further exacerbating contention.
* Replace a number of calls to `askWithRetry` with `ask` inside the message
handling code in `CoarseGrainedSchedulerBackend` and its ilk. This allows us
to queue messages with the relevant endpoints, release whatever locks we might
be holding, and then block whilst awaiting the response. This change comes at
the cost of no longer being able to retry should sending the message fail, as
retrying outside of the lock could easily cause race conditions if other
conflicting messages have been sent whilst awaiting a response. I believe this
to be the lesser of two evils: in many cases these RPC calls are to
process-local components, so failures are more likely to be deterministic, and
timeouts are more likely to be caused by lock contention.
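The second bullet's "send under the lock, await outside it" pattern can be
sketched as follows (a hypothetical stand-in, not Spark code: `ask` here is a
stub mimicking the fire-and-forget send of Spark's `RpcEndpointRef.ask`, and
`Backend`/`killExecutors` are illustrative names). The deadlock risk with
`askWithRetry` was blocking for the reply while still holding the lock that
the message handler itself may need:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

class Backend {
  // Stub for an async RPC send: returns immediately with a Future reply,
  // the way `ask` does (unlike `askWithRetry`, which blocks and retries).
  private def ask(msg: String): Future[Boolean] =
    Future.successful(true)

  def killExecutors(ids: Seq[String]): Boolean = {
    // Mutate shared state and enqueue the message while holding the lock...
    val response = synchronized {
      ask(s"KillExecutors(${ids.mkString(",")})")
    }
    // ...but only block for the reply after the lock has been released, so a
    // handler that needs this lock to produce the reply cannot deadlock us.
    Await.result(response, 10.seconds)
  }
}
```

The cost, as the description notes, is that a failed send is not retried; the
future simply fails and the caller sees the error.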

## How was this patch tested?
Existing tests, and manual tests under yarn-client mode.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/angolon/spark SPARK-16533

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14710.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14710


commit cef69bf470199c63b6638933756b1d057dc890d1
Author: Angus Gerry <ango...@gmail.com>
Date:   2016-08-19T01:52:58Z

Revert "[SPARK-17022][YARN] Handle potential deadlock in driver handling 
messages"

This reverts commit ea0bf91b4a2ca3ef472906e50e31fd6268b6f53e.

commit 4970b3b0bcd834bbe5d5473a3065f04a48b12643
Author: Angus Gerry <ango...@gmail.com>
Date:   2016-08-09T04:45:29Z

[SPARK-16533][CORE] Use scheduleWithFixedDelay when calling
ExecutorAllocationManager.schedule to ease contention on locks.

commit 920274a3ed0b8278d38d721587a24c9441fa5ff3
Author: Angus Gerry <ango...@gmail.com>
Date:   2016-08-04T06:27:56Z

[SPARK-16533][CORE] Replace many calls to askWithRetry with plain old ask.



