[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-25 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18651





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-24 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r129038408
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
              } catch {
                case NonFatal(e) =>
                  logError(s"Failed to launch executor $executorId on container $containerId", e)
-                  // Assigned container should be released immediately to avoid unnecessary resource
-                  // occupation.
+                  // Assigned container should be released immediately
+                  // to avoid unnecessary resource occupation.
                  amClient.releaseAssignedContainer(containerId)
+              } finally {
+                numExecutorsStarting.decrementAndGet()
--- End diff --

If it's a fatal error I don't really expect us to continue, so dynamic
allocation doesn't matter. If you know a case where we would recover from a fatal
exception, I'm fine with adding another catch there to decrement in the
fatal case as well.
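
A minimal sketch of that alternative (hypothetical, not what this PR currently does), assuming the
decrement lives in the catch clauses rather than in a finally:

    } catch {
      case NonFatal(e) =>
        logError(s"Failed to launch executor $executorId on container $containerId", e)
        amClient.releaseAssignedContainer(containerId)
        numExecutorsStarting.decrementAndGet()
      case t: Throwable =>
        // Fatal error: correct the counter anyway before propagating, in case the
        // allocator happens to survive and dynamic allocation keeps running.
        numExecutorsStarting.decrementAndGet()
        throw t
    }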





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-23 Thread djvulee
Github user djvulee commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128943316
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
              } catch {
                case NonFatal(e) =>
                  logError(s"Failed to launch executor $executorId on container $containerId", e)
-                  // Assigned container should be released immediately to avoid unnecessary resource
-                  // occupation.
+                  // Assigned container should be released immediately
+                  // to avoid unnecessary resource occupation.
                  amClient.releaseAssignedContainer(containerId)
+              } finally {
+                numExecutorsStarting.decrementAndGet()
--- End diff --

I agree that putting `numExecutorsStarting.decrementAndGet()` together with
`numExecutorsRunning.incrementAndGet()` in `updateInternalState` is better
if we can.

The reason I put `numExecutorsStarting.decrementAndGet()` in the `finally`
block is that if some exception is not `NonFatal` and is caught by the
surrounding code, we may not be able to allocate the resources we specified;
this is the same concern @vanzin raised.

We may double count in the current code, but that only slows down the
allocation rate for a short time.






[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-21 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128838209
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
              } catch {
                case NonFatal(e) =>
                  logError(s"Failed to launch executor $executorId on container $containerId", e)
-                  // Assigned container should be released immediately to avoid unnecessary resource
-                  // occupation.
+                  // Assigned container should be released immediately
+                  // to avoid unnecessary resource occupation.
                  amClient.releaseAssignedContainer(containerId)
+              } finally {
+                numExecutorsStarting.decrementAndGet()
--- End diff --

Yep, the fact that it's still marked as starting even though it failed will fix
itself on the next loop through. It's no different than if we didn't know it failed and
it was still in the ExecutorRunnable.run code.





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-21 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128834971
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
              } catch {
                case NonFatal(e) =>
                  logError(s"Failed to launch executor $executorId on container $containerId", e)
-                  // Assigned container should be released immediately to avoid unnecessary resource
-                  // occupation.
+                  // Assigned container should be released immediately
+                  // to avoid unnecessary resource occupation.
                  amClient.releaseAssignedContainer(containerId)
+              } finally {
+                numExecutorsStarting.decrementAndGet()
--- End diff --

OK, I see. The current code can double count the same executor as starting
and running, while the previous code could count it as starting even though it
failed to start (for a really small window); the latter is a self-healing
situation, while the former can have some adverse effects.





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-21 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128831747
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
              } catch {
                case NonFatal(e) =>
                  logError(s"Failed to launch executor $executorId on container $containerId", e)
-                  // Assigned container should be released immediately to avoid unnecessary resource
-                  // occupation.
+                  // Assigned container should be released immediately
+                  // to avoid unnecessary resource occupation.
                  amClient.releaseAssignedContainer(containerId)
+              } finally {
+                numExecutorsStarting.decrementAndGet()
--- End diff --

Yes, but it's a bug right now as the numbers can be wrong. Are you looking at
the synchronization?

Right now everything is called synchronized up to the point where the launcher
pool runs the ExecutorRunnable. At this point running is not incremented,
pending is decremented, and we now increment starting. That is fine.

But when the ExecutorRunnable finishes, the only place that is called
synchronized is updateInternalState. That currently increments running but
does not decrement starting. If updateResourceRequests gets called (which is
synchronized) right after updateInternalState (which leaves the synchronized block)
but before the finally block executes and decrements starting, the total number
can be more than it really is. That executor is counted as both running and
starting.
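
A rough sketch of what keeping them together would look like, assuming updateInternalState stays
synchronized and adjusts both counters in one step (counter names taken from the diff above):

    def updateInternalState(): Unit = synchronized {
      numExecutorsRunning.incrementAndGet()
      // Decrement in the same synchronized block, so updateResourceRequests can never
      // observe this executor as both running and starting.
      numExecutorsStarting.decrementAndGet()
      // ... update the per-executor bookkeeping maps as before ...
    }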





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-21 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128829650
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
              } catch {
                case NonFatal(e) =>
                  logError(s"Failed to launch executor $executorId on container $containerId", e)
-                  // Assigned container should be released immediately to avoid unnecessary resource
-                  // occupation.
+                  // Assigned container should be released immediately
+                  // to avoid unnecessary resource occupation.
                  amClient.releaseAssignedContainer(containerId)
+              } finally {
+                numExecutorsStarting.decrementAndGet()
--- End diff --

`numExecutorsStarting` is incremented before the task is submitted and 
needs to be decremented after the task finishes, successfully or not. That's 
exactly what the current code does.





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-21 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128829289
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
              } catch {
                case NonFatal(e) =>
                  logError(s"Failed to launch executor $executorId on container $containerId", e)
-                  // Assigned container should be released immediately to avoid unnecessary resource
-                  // occupation.
+                  // Assigned container should be released immediately
+                  // to avoid unnecessary resource occupation.
                  amClient.releaseAssignedContainer(containerId)
+              } finally {
+                numExecutorsStarting.decrementAndGet()
--- End diff --

Not sure I follow your reference between starting and running. I was just
saying that in the failure case it doesn't matter because you aren't going to
overcount.

If we don't decrement starting inside of updateInternalState, we have
the possibility of overcounting because running and starting aren't
incremented/decremented within the synchronized block. We don't want to do
that.





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-21 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128828683
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -503,8 +511,9 @@ private[yarn] class YarnAllocator(
        allocatedContainerToHostMap.put(containerId, executorHostname)
      }

-      if (numExecutorsRunning < targetNumExecutors) {
+      if (numExecutorsRunning.get < targetNumExecutors) {
        if (launchContainers) {
+          numExecutorsStarting.incrementAndGet()
--- End diff --

Note: if we change the code below back, this needs to move outside of
if (launchContainers).





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-21 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128828098
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
              } catch {
                case NonFatal(e) =>
                  logError(s"Failed to launch executor $executorId on container $containerId", e)
-                  // Assigned container should be released immediately to avoid unnecessary resource
-                  // occupation.
+                  // Assigned container should be released immediately
+                  // to avoid unnecessary resource occupation.
                  amClient.releaseAssignedContainer(containerId)
+              } finally {
+                numExecutorsStarting.decrementAndGet()
--- End diff --

Note: I thought the last version handled this properly, where he had the
decrement in the catch block.





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-21 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128828035
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
              } catch {
                case NonFatal(e) =>
                  logError(s"Failed to launch executor $executorId on container $containerId", e)
-                  // Assigned container should be released immediately to avoid unnecessary resource
-                  // occupation.
+                  // Assigned container should be released immediately
+                  // to avoid unnecessary resource occupation.
                  amClient.releaseAssignedContainer(containerId)
+              } finally {
+                numExecutorsStarting.decrementAndGet()
--- End diff --

this is "starting" not "running".





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-21 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128827627
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
              } catch {
                case NonFatal(e) =>
                  logError(s"Failed to launch executor $executorId on container $containerId", e)
-                  // Assigned container should be released immediately to avoid unnecessary resource
-                  // occupation.
+                  // Assigned container should be released immediately
+                  // to avoid unnecessary resource occupation.
                  amClient.releaseAssignedContainer(containerId)
+              } finally {
+                numExecutorsStarting.decrementAndGet()
--- End diff --

Sorry, I don't follow. updateInternalState is called right after the run if it
succeeds. You do still need to handle the failure case as well, but in the failure
case running is never incremented.





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-21 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128807888
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
              } catch {
                case NonFatal(e) =>
                  logError(s"Failed to launch executor $executorId on container $containerId", e)
-                  // Assigned container should be released immediately to avoid unnecessary resource
-                  // occupation.
+                  // Assigned container should be released immediately
+                  // to avoid unnecessary resource occupation.
                  amClient.releaseAssignedContainer(containerId)
+              } finally {
+                numExecutorsStarting.decrementAndGet()
--- End diff --

If you move this to `updateInternalState`, then if there's a problem in 
`ExecutorRunnable.run` this counter will not be decremented.





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-21 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128807459
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
              } catch {
                case NonFatal(e) =>
                  logError(s"Failed to launch executor $executorId on container $containerId", e)
-                  // Assigned container should be released immediately to avoid unnecessary resource
-                  // occupation.
+                  // Assigned container should be released immediately
+                  // to avoid unnecessary resource occupation.
                  amClient.releaseAssignedContainer(containerId)
+              } finally {
+                numExecutorsStarting.decrementAndGet()
--- End diff --

Why did you move this out of updateInternalState? I would rather see those
stay together, because updateInternalState is synchronized and if they aren't
changed at the same time you could end up with the wrong numbers:
numExecutorsStarting could still be, say, 1 when you have already added that
executor to the running count, and then updateResourceRequests gets an incorrect number.





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-20 Thread djvulee
Github user djvulee commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128435807
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,8 +535,9 @@ private[yarn] class YarnAllocator(
              } catch {
                case NonFatal(e) =>
                  logError(s"Failed to launch executor $executorId on container $containerId", e)
-                  // Assigned container should be released immediately to avoid unnecessary resource
-                  // occupation.
+                  // Assigned container should be released immediately
+                  // to avoid unnecessary resource occupation.
+                  numExecutorsStarting.decrementAndGet()
--- End diff --

Yes, it is more robust. I have updated the code.





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-20 Thread djvulee
Github user djvulee commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128432387
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -242,7 +244,7 @@ private[yarn] class YarnAllocator(
    if (executorIdToContainer.contains(executorId)) {
      val container = executorIdToContainer.get(executorId).get
      internalReleaseContainer(container)
-      numExecutorsRunning -= 1
--- End diff --

Yes, I just tried to keep consistency with `numExecutorsStarting`.





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-19 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128340324
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,8 +535,9 @@ private[yarn] class YarnAllocator(
              } catch {
                case NonFatal(e) =>
                  logError(s"Failed to launch executor $executorId on container $containerId", e)
-                  // Assigned container should be released immediately to avoid unnecessary resource
-                  // occupation.
+                  // Assigned container should be released immediately
+                  // to avoid unnecessary resource occupation.
+                  numExecutorsStarting.decrementAndGet()
--- End diff --

Safer to put this in a `finally`, no?





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-19 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128339942
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -242,7 +244,7 @@ private[yarn] class YarnAllocator(
    if (executorIdToContainer.contains(executorId)) {
      val container = executorIdToContainer.get(executorId).get
      internalReleaseContainer(container)
-      numExecutorsRunning -= 1
--- End diff --

This doesn't need to be an atomic integer because this method is 
synchronized already.





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-19 Thread djvulee
Github user djvulee commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128290766
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -294,7 +296,8 @@ private[yarn] class YarnAllocator(
  def updateResourceRequests(): Unit = {
    val pendingAllocate = getPendingAllocate
    val numPendingAllocate = pendingAllocate.size
-    val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning
+    val missing = targetNumExecutors - numPendingAllocate -
+      numExecutorsStarting.get - numExecutorsRunning.get

--- End diff --

Thanks for your advice! I just added the debug info.





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-19 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128249084
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -294,7 +296,8 @@ private[yarn] class YarnAllocator(
  def updateResourceRequests(): Unit = {
    val pendingAllocate = getPendingAllocate
    val numPendingAllocate = pendingAllocate.size
-    val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning
+    val missing = targetNumExecutors - numPendingAllocate -
+      numExecutorsStarting.get - numExecutorsRunning.get

--- End diff --

Can you also add a debug message here, something like the one below? I found
this very useful when debugging this issue and think it would be useful for
debugging other allocation issues in the future.

    logDebug(s"Updating resource requests, target: $targetNumExecutors, pending: " +
      s"$numPendingAllocate, running: $numExecutorsRunning, executorsStarting: $numExecutorsPendingStart")





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-19 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128247587
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -270,7 +272,7 @@ private[yarn] class YarnAllocator(
    logDebug("Allocated containers: %d. Current executor count: %d. Cluster resources: %s."
      .format(
        allocatedContainers.size,
-        numExecutorsRunning,
+        numExecutorsRunning.get,
--- End diff --

It would be nice to print numExecutorsStarting here as well.





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-18 Thread djvulee
Github user djvulee commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128144194
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -505,32 +508,37 @@ private[yarn] class YarnAllocator(

      if (numExecutorsRunning < targetNumExecutors) {
        if (launchContainers) {
-          launcherPool.execute(new Runnable {
-            override def run(): Unit = {
-              try {
-                new ExecutorRunnable(
-                  Some(container),
-                  conf,
-                  sparkConf,
-                  driverUrl,
-                  executorId,
-                  executorHostname,
-                  executorMemory,
-                  executorCores,
-                  appAttemptId.getApplicationId.toString,
-                  securityMgr,
-                  localResources
-                ).run()
-                updateInternalState()
-              } catch {
-                case NonFatal(e) =>
-                  logError(s"Failed to launch executor $executorId on container $containerId", e)
-                  // Assigned container should be released immediately to avoid unnecessary resource
-                  // occupation.
-                  amClient.releaseAssignedContainer(containerId)
+          try {
+            numExecutorToBeLaunched += 1
+            launcherPool.execute(new Runnable {
+              override def run(): Unit = {
+                try {
+                  new ExecutorRunnable(
+                    Some(container),
+                    conf,
+                    sparkConf,
+                    driverUrl,
+                    executorId,
+                    executorHostname,
+                    executorMemory,
+                    executorCores,
+                    appAttemptId.getApplicationId.toString,
+                    securityMgr,
+                    localResources
+                  ).run()
+                  updateInternalState()
+                } catch {
+                  case NonFatal(e) =>
+                    logError(s"Failed to launch executor $executorId on container $containerId", e)
+                    // Assigned container should be released immediately
+                    // to avoid unnecessary resource occupation.
+                    amClient.releaseAssignedContainer(containerId)
+                }
              }
-            }
-          })
+            })
+          } finally {
+            numExecutorToBeLaunched -= 1
--- End diff --

Yes, you're right. When I tested the code by experiment, I decremented
`numExecutorToBeLaunched` in the `updateInternalState` function, but I later
found this may impact the test.

I will fix this soon.





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-18 Thread djvulee
Github user djvulee commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128143898
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -82,6 +82,8 @@ private[yarn] class YarnAllocator(

  @volatile private var numExecutorsRunning = 0

+  @volatile private var numExecutorToBeLaunched = 0
--- End diff --

OK, I will change the name.





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-18 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128096434
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -505,32 +508,37 @@ private[yarn] class YarnAllocator(

      if (numExecutorsRunning < targetNumExecutors) {
        if (launchContainers) {
-          launcherPool.execute(new Runnable {
-            override def run(): Unit = {
-              try {
-                new ExecutorRunnable(
-                  Some(container),
-                  conf,
-                  sparkConf,
-                  driverUrl,
-                  executorId,
-                  executorHostname,
-                  executorMemory,
-                  executorCores,
-                  appAttemptId.getApplicationId.toString,
-                  securityMgr,
-                  localResources
-                ).run()
-                updateInternalState()
-              } catch {
-                case NonFatal(e) =>
-                  logError(s"Failed to launch executor $executorId on container $containerId", e)
-                  // Assigned container should be released immediately to avoid unnecessary resource
-                  // occupation.
-                  amClient.releaseAssignedContainer(containerId)
+          try {
+            numExecutorToBeLaunched += 1
+            launcherPool.execute(new Runnable {
+              override def run(): Unit = {
+                try {
+                  new ExecutorRunnable(
+                    Some(container),
+                    conf,
+                    sparkConf,
+                    driverUrl,
+                    executorId,
+                    executorHostname,
+                    executorMemory,
+                    executorCores,
+                    appAttemptId.getApplicationId.toString,
+                    securityMgr,
+                    localResources
+                  ).run()
+                  updateInternalState()
+                } catch {
+                  case NonFatal(e) =>
+                    logError(s"Failed to launch executor $executorId on container $containerId", e)
+                    // Assigned container should be released immediately
+                    // to avoid unnecessary resource occupation.
+                    amClient.releaseAssignedContainer(containerId)
+                }
              }
-            }
-          })
+            })
+          } finally {
+            numExecutorToBeLaunched -= 1
--- End diff --

Also, shouldn't this be done in the actual thread that is launching the
executor? Otherwise you're decrementing the counter as soon as the task is
submitted for execution, rather than waiting for the task to actually execute.
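
A small sketch of the distinction (counter name from this revision of the diff; the separate
thread-safety problem with += is ignored here): the decrement has to happen inside the Runnable,
for example in a finally, rather than right after launcherPool.execute, which returns as soon as
the task is queued.

    numExecutorToBeLaunched += 1             // before submitting the launch task
    launcherPool.execute(new Runnable {
      override def run(): Unit = {
        try {
          // ... new ExecutorRunnable(...).run() and updateInternalState() ...
        } finally {
          numExecutorToBeLaunched -= 1       // decremented by the thread that did the launch
        }
      }
    })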





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-18 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128096119
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -82,6 +82,8 @@ private[yarn] class YarnAllocator(

  @volatile private var numExecutorsRunning = 0

+  @volatile private var numExecutorToBeLaunched = 0
--- End diff --

A better name would be `numExecutorsStarting`.





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-18 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128095881
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -505,32 +508,37 @@ private[yarn] class YarnAllocator(

      if (numExecutorsRunning < targetNumExecutors) {
        if (launchContainers) {
-          launcherPool.execute(new Runnable {
-            override def run(): Unit = {
-              try {
-                new ExecutorRunnable(
-                  Some(container),
-                  conf,
-                  sparkConf,
-                  driverUrl,
-                  executorId,
-                  executorHostname,
-                  executorMemory,
-                  executorCores,
-                  appAttemptId.getApplicationId.toString,
-                  securityMgr,
-                  localResources
-                ).run()
-                updateInternalState()
-              } catch {
-                case NonFatal(e) =>
-                  logError(s"Failed to launch executor $executorId on container $containerId", e)
-                  // Assigned container should be released immediately to avoid unnecessary resource
-                  // occupation.
-                  amClient.releaseAssignedContainer(containerId)
+          try {
+            numExecutorToBeLaunched += 1
+            launcherPool.execute(new Runnable {
+              override def run(): Unit = {
+                try {
+                  new ExecutorRunnable(
+                    Some(container),
+                    conf,
+                    sparkConf,
+                    driverUrl,
+                    executorId,
+                    executorHostname,
+                    executorMemory,
+                    executorCores,
+                    appAttemptId.getApplicationId.toString,
+                    securityMgr,
+                    localResources
+                  ).run()
+                  updateInternalState()
+                } catch {
+                  case NonFatal(e) =>
+                    logError(s"Failed to launch executor $executorId on container $containerId", e)
+                    // Assigned container should be released immediately
+                    // to avoid unnecessary resource occupation.
+                    amClient.releaseAssignedContainer(containerId)
+                }
              }
-            }
-          })
+            })
+          } finally {
+            numExecutorToBeLaunched -= 1
--- End diff --

Same here.





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-18 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128095843
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -505,32 +508,37 @@ private[yarn] class YarnAllocator(

      if (numExecutorsRunning < targetNumExecutors) {
        if (launchContainers) {
-          launcherPool.execute(new Runnable {
-            override def run(): Unit = {
-              try {
-                new ExecutorRunnable(
-                  Some(container),
-                  conf,
-                  sparkConf,
-                  driverUrl,
-                  executorId,
-                  executorHostname,
-                  executorMemory,
-                  executorCores,
-                  appAttemptId.getApplicationId.toString,
-                  securityMgr,
-                  localResources
-                ).run()
-                updateInternalState()
-              } catch {
-                case NonFatal(e) =>
-                  logError(s"Failed to launch executor $executorId on container $containerId", e)
-                  // Assigned container should be released immediately to avoid unnecessary resource
-                  // occupation.
-                  amClient.releaseAssignedContainer(containerId)
+          try {
+            numExecutorToBeLaunched += 1
--- End diff --

This is not thread safe. `+=` is two operations.
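
A sketch of the usual fix, assuming the counter becomes a java.util.concurrent.atomic.AtomicInteger
(which is where later revisions of this diff end up):

    import java.util.concurrent.atomic.AtomicInteger

    // Instead of: @volatile private var numExecutorToBeLaunched = 0
    private val numExecutorsStarting = new AtomicInteger(0)

    numExecutorsStarting.incrementAndGet()   // one atomic read-modify-write
    // ...
    numExecutorsStarting.decrementAndGet()
    numExecutorsStarting.get                 // plain read, e.g. for the missing calculation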





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-17 Thread djvulee
GitHub user djvulee opened a pull request:

https://github.com/apache/spark/pull/18651

[SPARK-21383][Core] Fix the YarnAllocator allocates more Resource

## What changes were proposed in this pull request?
When NodeManagers are slow to launch executors, the `missing` value computed by
the allocator will exceed the real value, which can lead YARN to allocate more
resources than needed.

We add `numExecutorToBeLaunched` to the calculation of `missing` to avoid this,
as sketched below.
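
For reference, the corresponding change to the missing calculation (as it appears in the review
diff, after the counter was renamed to numExecutorsStarting and made atomic) looks roughly like:

    val missing = targetNumExecutors - numPendingAllocate -
      numExecutorsStarting.get - numExecutorsRunning.get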


## How was this patch tested?
Tested by experiment.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/djvulee/spark YarnAllocate

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18651.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18651


commit 818c9126959e8576861478e18389e6ed8fdbeac4
Author: DjvuLee 
Date:   2017-07-17T07:54:09Z

[Core] Fix the YarnAllocator allocate more Resource

When NodeManagers are launching the executors, the missing value will exceed the
real value; this can lead YARN to allocate more resources.



