[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18651

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r129038408

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
            } catch {
              case NonFatal(e) =>
                logError(s"Failed to launch executor $executorId on container $containerId", e)
-               // Assigned container should be released immediately to avoid unnecessary resource
-               // occupation.
+               // Assigned container should be released immediately
+               // to avoid unnecessary resource occupation.
                amClient.releaseAssignedContainer(containerId)
+           } finally {
+             numExecutorsStarting.decrementAndGet()
--- End diff --

If it's a fatal error, I don't really expect us to continue, so dynamic allocation doesn't matter. If you know a case where we would recover from a fatal exception, I'm fine with adding another catch there to decrement in the fatal case as well.
Github user djvulee commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128943316

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
            } catch {
              case NonFatal(e) =>
                logError(s"Failed to launch executor $executorId on container $containerId", e)
-               // Assigned container should be released immediately to avoid unnecessary resource
-               // occupation.
+               // Assigned container should be released immediately
+               // to avoid unnecessary resource occupation.
                amClient.releaseAssignedContainer(containerId)
+           } finally {
+             numExecutorsStarting.decrementAndGet()
--- End diff --

I agree that putting `numExecutorsStarting.decrementAndGet()` together with `numExecutorsRunning.incrementAndGet()` in `updateInternalState` is better if we can. The reason I put `numExecutorsStarting.decrementAndGet()` in the `finally` block is that if some exception is not `NonFatal` and is caught by the surrounding code, we may not be able to allocate resources as specified; this is the same thing @vanzin was worried about. We may double the count in the current code, but that only slows down the allocation rate for a short time.
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128838209

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
            } catch {
              case NonFatal(e) =>
                logError(s"Failed to launch executor $executorId on container $containerId", e)
-               // Assigned container should be released immediately to avoid unnecessary resource
-               // occupation.
+               // Assigned container should be released immediately
+               // to avoid unnecessary resource occupation.
                amClient.releaseAssignedContainer(containerId)
+           } finally {
+             numExecutorsStarting.decrementAndGet()
--- End diff --

Yep, the fact that it's still marked as starting even though it failed will fix itself on the next loop through. It's no different than if we didn't know it failed and it was still in the `ExecutorRunnable.run` code.
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128834971

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
            } catch {
              case NonFatal(e) =>
                logError(s"Failed to launch executor $executorId on container $containerId", e)
-               // Assigned container should be released immediately to avoid unnecessary resource
-               // occupation.
+               // Assigned container should be released immediately
+               // to avoid unnecessary resource occupation.
                amClient.releaseAssignedContainer(containerId)
+           } finally {
+             numExecutorsStarting.decrementAndGet()
--- End diff --

Ok, I see. The current code can double count the same executor as starting and running, while the previous code could count it as starting even though it failed to start (for a really small window); but the latter is a self-healing situation, while the former can have some adverse effects.
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128831747

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
            } catch {
              case NonFatal(e) =>
                logError(s"Failed to launch executor $executorId on container $containerId", e)
-               // Assigned container should be released immediately to avoid unnecessary resource
-               // occupation.
+               // Assigned container should be released immediately
+               // to avoid unnecessary resource occupation.
                amClient.releaseAssignedContainer(containerId)
+           } finally {
+             numExecutorsStarting.decrementAndGet()
--- End diff --

Yes, but it's a bug right now, as the numbers can be wrong. Are you looking at the synchronization? Right now everything is called synchronized up to the point where the launcher pool runs the `ExecutorRunnable`. At this point running is not incremented, pending is decremented, and we now increment starting. That is fine. But when the `ExecutorRunnable` finishes, the only place it's called synchronized is in `updateInternalState`. This currently increments running but does not decrement starting. If `updateResourceRequests` gets called (which is synchronized) right after `updateInternalState` (which leaves the synchronized block), but before the `finally` block executes and decrements starting, the total number can be more than it really is. That executor is counted as both running and starting.
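The window described above can be sketched as follows (a hypothetical, simplified model in Java for brevity; the names mirror the allocator's counters, but this is not the Spark code). Because `running` is incremented inside the synchronized update while `starting` is decremented later in a separate `finally`, a reader that runs between those two steps counts the same executor twice:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified model of the allocator's bookkeeping (not the Spark code).
class AllocatorModel {
    final AtomicInteger numExecutorsStarting = new AtomicInteger();
    final AtomicInteger numExecutorsRunning = new AtomicInteger();

    // Mirrors updateInternalState(): only "running" is touched under the lock.
    synchronized void updateInternalState() {
        numExecutorsRunning.incrementAndGet();
    }

    // Mirrors the counting done by updateResourceRequests().
    synchronized int countedExecutors() {
        return numExecutorsStarting.get() + numExecutorsRunning.get();
    }
}

public class RaceSketch {
    public static void main(String[] args) {
        AllocatorModel m = new AllocatorModel();

        m.numExecutorsStarting.incrementAndGet(); // container assigned, launch submitted
        m.updateInternalState();                  // launch succeeded: running = 1

        // Interleaving point: if the count is read here, before the launcher
        // thread's finally block runs, the executor is counted twice.
        System.out.println(m.countedExecutors()); // prints 2, though only 1 executor exists

        m.numExecutorsStarting.decrementAndGet(); // the finally block eventually runs
        System.out.println(m.countedExecutors()); // prints 1
    }
}
```

Moving the decrement inside the synchronized `updateInternalState` closes the window on the success path, which is exactly the trade-off the thread is weighing against the failure path.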
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128829650

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
            } catch {
              case NonFatal(e) =>
                logError(s"Failed to launch executor $executorId on container $containerId", e)
-               // Assigned container should be released immediately to avoid unnecessary resource
-               // occupation.
+               // Assigned container should be released immediately
+               // to avoid unnecessary resource occupation.
                amClient.releaseAssignedContainer(containerId)
+           } finally {
+             numExecutorsStarting.decrementAndGet()
--- End diff --

`numExecutorsStarting` is incremented before the task is submitted and needs to be decremented after the task finishes, successfully or not. That's exactly what the current code does.
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128829289

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
            } catch {
              case NonFatal(e) =>
                logError(s"Failed to launch executor $executorId on container $containerId", e)
-               // Assigned container should be released immediately to avoid unnecessary resource
-               // occupation.
+               // Assigned container should be released immediately
+               // to avoid unnecessary resource occupation.
                amClient.releaseAssignedContainer(containerId)
+           } finally {
+             numExecutorsStarting.decrementAndGet()
--- End diff --

Not sure I follow your reference between starting and running. I was just saying that in the failure case it doesn't matter, because you aren't going to overcount. If we don't decrement starting inside of `updateInternalState`, we have the possibility to overcount, because running and starting aren't incremented/decremented within the synchronized block. We don't want to do that.
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128828683

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -503,8 +511,9 @@ private[yarn] class YarnAllocator(
          allocatedContainerToHostMap.put(containerId, executorHostname)
        }
-       if (numExecutorsRunning < targetNumExecutors) {
+       if (numExecutorsRunning.get < targetNumExecutors) {
          if (launchContainers) {
+           numExecutorsStarting.incrementAndGet()
--- End diff --

Note: if we change the below back, this needs to move outside `if (launchContainers)`.
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128828098

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
            } catch {
              case NonFatal(e) =>
                logError(s"Failed to launch executor $executorId on container $containerId", e)
-               // Assigned container should be released immediately to avoid unnecessary resource
-               // occupation.
+               // Assigned container should be released immediately
+               // to avoid unnecessary resource occupation.
                amClient.releaseAssignedContainer(containerId)
+           } finally {
+             numExecutorsStarting.decrementAndGet()
--- End diff --

Note: I thought the last version handled this properly, where he had the decrement in the catch block.
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128828035

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
            } catch {
              case NonFatal(e) =>
                logError(s"Failed to launch executor $executorId on container $containerId", e)
-               // Assigned container should be released immediately to avoid unnecessary resource
-               // occupation.
+               // Assigned container should be released immediately
+               // to avoid unnecessary resource occupation.
                amClient.releaseAssignedContainer(containerId)
+           } finally {
+             numExecutorsStarting.decrementAndGet()
--- End diff --

This is "starting", not "running".
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128827627

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
            } catch {
              case NonFatal(e) =>
                logError(s"Failed to launch executor $executorId on container $containerId", e)
-               // Assigned container should be released immediately to avoid unnecessary resource
-               // occupation.
+               // Assigned container should be released immediately
+               // to avoid unnecessary resource occupation.
                amClient.releaseAssignedContainer(containerId)
+           } finally {
+             numExecutorsStarting.decrementAndGet()
--- End diff --

Sorry, I don't follow. `updateInternalState` is called right after the run if it succeeds. You do still need to handle the failure case as well, but on failure, running is never incremented.
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128807888

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
            } catch {
              case NonFatal(e) =>
                logError(s"Failed to launch executor $executorId on container $containerId", e)
-               // Assigned container should be released immediately to avoid unnecessary resource
-               // occupation.
+               // Assigned container should be released immediately
+               // to avoid unnecessary resource occupation.
                amClient.releaseAssignedContainer(containerId)
+           } finally {
+             numExecutorsStarting.decrementAndGet()
--- End diff --

If you move this to `updateInternalState`, then if there's a problem in `ExecutorRunnable.run` this counter will not be decremented.
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128807459

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
            } catch {
              case NonFatal(e) =>
                logError(s"Failed to launch executor $executorId on container $containerId", e)
-               // Assigned container should be released immediately to avoid unnecessary resource
-               // occupation.
+               // Assigned container should be released immediately
+               // to avoid unnecessary resource occupation.
                amClient.releaseAssignedContainer(containerId)
+           } finally {
+             numExecutorsStarting.decrementAndGet()
--- End diff --

Why did you move this out of `updateInternalState`? I would rather see those stay together, because `updateInternalState` is synchronized, and if they aren't changed at the same time you could end up with the wrong numbers. `numExecutorsStarting` could still be, say, 1 when you have already added that to the number running, and then `updateResourceRequests` gets an incorrect number.
Github user djvulee commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128435807

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,8 +535,9 @@ private[yarn] class YarnAllocator(
            } catch {
              case NonFatal(e) =>
                logError(s"Failed to launch executor $executorId on container $containerId", e)
-               // Assigned container should be released immediately to avoid unnecessary resource
-               // occupation.
+               // Assigned container should be released immediately
+               // to avoid unnecessary resource occupation.
+               numExecutorsStarting.decrementAndGet()
--- End diff --

Yes, it is more robust. I have updated the code.
Github user djvulee commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128432387

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -242,7 +244,7 @@ private[yarn] class YarnAllocator(
    if (executorIdToContainer.contains(executorId)) {
      val container = executorIdToContainer.get(executorId).get
      internalReleaseContainer(container)
-     numExecutorsRunning -= 1
--- End diff --

Yes, I was just trying to keep consistency with `numExecutorsStarting`.
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128340324

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,8 +535,9 @@ private[yarn] class YarnAllocator(
            } catch {
              case NonFatal(e) =>
                logError(s"Failed to launch executor $executorId on container $containerId", e)
-               // Assigned container should be released immediately to avoid unnecessary resource
-               // occupation.
+               // Assigned container should be released immediately
+               // to avoid unnecessary resource occupation.
+               numExecutorsStarting.decrementAndGet()
--- End diff --

Safer to put this in a `finally`, no?
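The suggestion above is the standard try/finally bookkeeping pattern: once the counter has been incremented, the matching decrement goes in a `finally` so it runs whether the launch succeeds, fails with a caught exception, or throws something the `catch` does not match (a hypothetical sketch in Java; `launchExecutor` stands in for `ExecutorRunnable.run` and is not the Spark code):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class FinallySketch {
    static final AtomicInteger numExecutorsStarting = new AtomicInteger();

    // Hypothetical stand-in for ExecutorRunnable.run(); may throw.
    static void launchExecutor(boolean fail) {
        if (fail) throw new RuntimeException("container launch failed");
    }

    static void tryLaunch(boolean fail) {
        numExecutorsStarting.incrementAndGet();
        try {
            launchExecutor(fail);
        } catch (RuntimeException e) {
            // release the assigned container, log the error, etc.
        } finally {
            // Runs on success and on exceptions alike, so the counter
            // can never leak an increment.
            numExecutorsStarting.decrementAndGet();
        }
    }

    public static void main(String[] args) {
        tryLaunch(false);
        tryLaunch(true);
        System.out.println(numExecutorsStarting.get()); // prints 0 either way
    }
}
```

With the decrement only in the `catch`, a fatal (uncaught) throwable would leave the counter permanently inflated, which is the robustness vanzin is pointing at.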
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128339942

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -242,7 +244,7 @@ private[yarn] class YarnAllocator(
    if (executorIdToContainer.contains(executorId)) {
      val container = executorIdToContainer.get(executorId).get
      internalReleaseContainer(container)
-     numExecutorsRunning -= 1
--- End diff --

This doesn't need to be an atomic integer because this method is synchronized already.
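For context, the distinction is that mutual exclusion from a `synchronized` method already makes a plain counter safe for callers of that method; an `AtomicInteger` only buys something when the counter is also touched from threads that hold no lock, as the launcher-pool threads do in this PR (a simplified illustration, not the Spark code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// A plain int is safe when every access goes through methods holding the same lock.
class SynchronizedCounter {
    private int count = 0;
    synchronized void add(int delta) { count += delta; }
    synchronized int get() { return count; }
}

public class CounterSketch {
    public static void main(String[] args) throws InterruptedException {
        // AtomicInteger: safe even when updated from threads that hold no lock,
        // which is what the launcher-pool threads need.
        AtomicInteger atomic = new AtomicInteger();
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) atomic.incrementAndGet();
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        System.out.println(atomic.get()); // 4000, with no lost updates
    }
}
```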
Github user djvulee commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128290766

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -294,7 +296,8 @@ private[yarn] class YarnAllocator(
  def updateResourceRequests(): Unit = {
    val pendingAllocate = getPendingAllocate
    val numPendingAllocate = pendingAllocate.size
-   val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning
+   val missing = targetNumExecutors - numPendingAllocate -
+     numExecutorsStarting.get - numExecutorsRunning.get
--- End diff --

Thanks for your advice! I just added the debug info.
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128249084

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -294,7 +296,8 @@ private[yarn] class YarnAllocator(
  def updateResourceRequests(): Unit = {
    val pendingAllocate = getPendingAllocate
    val numPendingAllocate = pendingAllocate.size
-   val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning
+   val missing = targetNumExecutors - numPendingAllocate -
+     numExecutorsStarting.get - numExecutorsRunning.get
--- End diff --

Can you also add a debug message here, something like below? I found this very useful when debugging this issue and think it would be useful for debugging other allocation issues in the future.

    logDebug(s"Updating resource requests, target: $targetNumExecutors, pending: " +
      s"$numPendingAllocate, running: $numExecutorsRunning, executorsStarting: $numExecutorsPendingStart")
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128247587

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -270,7 +272,7 @@ private[yarn] class YarnAllocator(
    logDebug("Allocated containers: %d. Current executor count: %d. Cluster resources: %s."
      .format(
        allocatedContainers.size,
-       numExecutorsRunning,
+       numExecutorsRunning.get,
--- End diff --

It would be nice to print `numExecutorsStarting` here as well.
Github user djvulee commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128144194

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -505,32 +508,37 @@ private[yarn] class YarnAllocator(
      if (numExecutorsRunning < targetNumExecutors) {
        if (launchContainers) {
-         launcherPool.execute(new Runnable {
-           override def run(): Unit = {
-             try {
-               new ExecutorRunnable(
-                 Some(container),
-                 conf,
-                 sparkConf,
-                 driverUrl,
-                 executorId,
-                 executorHostname,
-                 executorMemory,
-                 executorCores,
-                 appAttemptId.getApplicationId.toString,
-                 securityMgr,
-                 localResources
-               ).run()
-               updateInternalState()
-             } catch {
-               case NonFatal(e) =>
-                 logError(s"Failed to launch executor $executorId on container $containerId", e)
-                 // Assigned container should be released immediately to avoid unnecessary resource
-                 // occupation.
-                 amClient.releaseAssignedContainer(containerId)
+         try {
+           numExecutorToBeLaunched += 1
+           launcherPool.execute(new Runnable {
+             override def run(): Unit = {
+               try {
+                 new ExecutorRunnable(
+                   Some(container),
+                   conf,
+                   sparkConf,
+                   driverUrl,
+                   executorId,
+                   executorHostname,
+                   executorMemory,
+                   executorCores,
+                   appAttemptId.getApplicationId.toString,
+                   securityMgr,
+                   localResources
+                 ).run()
+                 updateInternalState()
+               } catch {
+                 case NonFatal(e) =>
+                   logError(s"Failed to launch executor $executorId on container $containerId", e)
+                   // Assigned container should be released immediately
+                   // to avoid unnecessary resource occupation.
+                   amClient.releaseAssignedContainer(containerId)
+               }
              }
-           }
-         })
+           })
+         } finally {
+           numExecutorToBeLaunched -= 1
--- End diff --

Yes, you're right. When I tested the code experimentally, I decremented `numExecutorToBeLaunched` in the `updateInternalState` function, but I later found this may impact the test. I will fix this soon.
Github user djvulee commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128143898

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -82,6 +82,8 @@ private[yarn] class YarnAllocator(
  @volatile private var numExecutorsRunning = 0
+ @volatile private var numExecutorToBeLaunched = 0
--- End diff --

OK, I will change the name.
[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128096434

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -505,32 +508,37 @@ private[yarn] class YarnAllocator(
[same hunk as quoted above; the comment is on the last added lines:]
+          } finally {
+            numExecutorToBeLaunched -= 1
--- End diff --

Also, shouldn't this be done in the actual thread that is launching the executor? Otherwise you're decrementing the counter as soon as the task is submitted for execution, but you're not really waiting for the task to execute.
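The point in the comment above is that the counter must track the launch task's lifetime, not its submission. A minimal Java sketch of that pattern (hypothetical standalone names; per the merged change quoted at the top of the thread, the actual fix uses a Scala `AtomicInteger` named `numExecutorsStarting` decremented in a `finally` inside the task):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class LaunchCounterSketch {
    // Executors whose launch has been requested but has not yet finished.
    static final AtomicInteger numExecutorsStarting = new AtomicInteger(0);

    public static void main(String[] args) throws InterruptedException {
        ExecutorService launcherPool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 10; i++) {
            // Increment before submitting, so the allocator sees the pending
            // launch immediately and does not over-request containers.
            numExecutorsStarting.incrementAndGet();
            launcherPool.execute(() -> {
                try {
                    Thread.sleep(50); // stand-in for a slow ExecutorRunnable.run()
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    // Decrement inside the task itself: only now has the launch
                    // actually completed (or failed), not merely been submitted.
                    numExecutorsStarting.decrementAndGet();
                }
            });
        }
        launcherPool.shutdown();
        launcherPool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(numExecutorsStarting.get()); // prints 0 once every task's finally has run
    }
}
```

Decrementing in the submitting thread's `finally`, as in the quoted diff, drops the count back to zero while the launches are still in flight, which defeats the purpose of the counter.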
[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128096119

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -82,6 +82,8 @@ private[yarn] class YarnAllocator(

   @volatile private var numExecutorsRunning = 0

+  @volatile private var numExecutorToBeLaunched = 0
--- End diff --

A better name would be `numExecutorsStarting`.
[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128095881

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -505,32 +508,37 @@ private[yarn] class YarnAllocator(
[same hunk as quoted above; the comment is on the last added lines:]
+          } finally {
+            numExecutorToBeLaunched -= 1
--- End diff --

Same here.
[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128095843

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -505,32 +508,37 @@ private[yarn] class YarnAllocator(
[same hunk as quoted above; the comment is on the first added lines:]
+          try {
+            numExecutorToBeLaunched += 1
--- End diff --

This is not thread safe. `+=` is two operations.
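The comment above is about a classic pitfall: `volatile` guarantees visibility, not atomicity, so `counter += 1` is a separate read and write, and two threads can interleave and lose an update. A small Java sketch of the atomic alternative (illustrative names; the merged change does move the counter to an atomic integer):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicCounterSketch {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger counter = new AtomicInteger(0);
        int threads = 8;
        int perThread = 100_000;

        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    // One atomic read-modify-write, unlike `volatile x += 1`,
                    // which can lose updates under contention.
                    counter.incrementAndGet();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }

        System.out.println(counter.get()); // prints 800000, never less
    }
}
```

With a plain `@volatile var` and `+= 1`, the same run would usually print less than 800000 because concurrent read-modify-write cycles overwrite each other.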
[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...
GitHub user djvulee opened a pull request: https://github.com/apache/spark/pull/18651

[SPARK-21383][Core] Fix the YarnAllocator allocates more Resource

## What changes were proposed in this pull request?

When NodeManagers are slow to launch executors, the `missing` value exceeds the real value, which can lead YARN to allocate more resources than needed. We account for `numExecutorToBeLaunched` when calculating `missing` to avoid this.

## How was this patch tested?

Tested by experiment.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/djvulee/spark YarnAllocate

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18651.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18651

commit 818c9126959e8576861478e18389e6ed8fdbeac4
Author: DjvuLee
Date: 2017-07-17T07:54:09Z

    [Core] Fix the YarnAllocator allocate more Resource

    When NodeManagers launch the executors slowly, the missing value will exceed the real value, and this can lead YARN to allocate more resources.
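The over-allocation the PR describes comes down to simple arithmetic: if in-flight launches are not counted, the allocator keeps requesting containers for slots that are already being filled. A hedged standalone sketch (hypothetical function and parameter names; the exact formula in `YarnAllocator` may involve additional terms such as containers pending allocation):

```java
public class MissingSketch {
    // Containers still needed = target, minus those already requested from YARN,
    // minus those whose executors are mid-launch, minus those already running.
    static int missing(int target, int pendingAllocate, int starting, int running) {
        return target - pendingAllocate - starting - running;
    }

    public static void main(String[] args) {
        // Without a "starting" count, 6 executors that are mid-launch are
        // invisible, so the allocator requests 6 more containers it won't need.
        System.out.println(missing(10, 0, 0, 4)); // prints 6: over-allocation

        // Counting them yields the true shortfall.
        System.out.println(missing(10, 0, 6, 4)); // prints 0: nothing more to request
    }
}
```

This is why the patch adds the counter before computing `missing` rather than only tracking running executors.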