[ 
https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990233#comment-13990233
 ] 

Wangda Tan commented on YARN-1368:
----------------------------------

Hi [~jianhe], thanks for this patch. I agree with the major strategies, but I 
have some comments and questions:

In AbstractYarnScheduler:recoverContainersOnNode
{code}
+      if (rmApp.getApplicationSubmissionContext().getUnmanagedAM()) {
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("Skip recovering container " + status
+              + " for unmanaged AM." + rmApp.getApplicationId());
+        }
+        continue;
+      }
{code}
Why don't we recover containers in the unmanaged AM case? In my understanding, 
the recovery process should be the same whether the AM is managed or 
unmanaged. Is there any difference between them?

Should this be included in schedulerAttempt.recoverContainer(...)?
{code}
+      // recover app scheduling info
+      schedulerAttempt.appSchedulingInfo.recoverContainer(rmContainer);
{code}

In AppSchedulingInfo.recoverContainer(...)
{code}
+    QueueMetrics metrics = queue.getMetrics();
+    if (pending) {
+      // If there was any running containers, the application was
+      // running from scheduler's POV.
+      pending = false;
+      metrics.runAppAttempt(applicationId, user);
+    }
+    if (rmContainer.getState().equals(RMContainerState.COMPLETED)) {
+      return;
+    }
+    metrics.allocateResources(user, 1, Resource.newInstance(1024, 1), false);
{code}
Should this be part of queue.recoverContainer(...)? Would it be better to 
create QueueMetrics.recoverContainer(...)?
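
To make the suggestion concrete, here is a rough sketch of what grouping these 
metric updates behind one method could look like. SketchQueueMetrics and its 
fields are hypothetical stand-ins, not the real QueueMetrics API, and the 
memoryMB parameter is an assumption used only for illustration.

```java
// Hypothetical stand-in for QueueMetrics; it only tracks what this sketch needs.
class SketchQueueMetrics {
    int appsRunning = 0;
    int allocatedContainers = 0;
    int allocatedMB = 0;

    // Single entry point grouping the metric updates shown in the snippet above.
    // Returns the new value of the caller's "pending" flag (always false,
    // because a recovered container implies the app has already run).
    boolean recoverContainer(boolean appPending, boolean containerCompleted,
                             int memoryMB) {
        if (appPending) {
            // Any recovered container means the application was running
            // from the scheduler's point of view.
            appsRunning++;
        }
        if (!containerCompleted) {
            // Only containers that are still live count as allocated.
            allocatedContainers++;
            allocatedMB += memoryMB;
        }
        return false;
    }
}
```

With something like this, AppSchedulingInfo would only keep the pending flag 
and delegate the rest, e.g. pending = metrics.recoverContainer(pending, 
completed, 1024).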

In CapacityScheduler,
{code}
-    Collection<FiCaSchedulerNode> nodes = cs.getAllNodes().values();
+    Collection<SchedulerNode> nodes = cs.getAllNodes().values();
{code}
Could you elaborate on why this change, and the series of related changes 
between SchedulerNode and FiCaSchedulerNode, is needed? I don't really 
understand it.

For recoverContainer in the queue, should we recover top-down (starting from 
the root queue) or bottom-up (starting from the leaf queue)? I found that the 
patch does it bottom-up. Should this be decided by the scheduler implementation?
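
To illustrate the two orders being asked about, here is a minimal sketch. 
SketchQueue is a hypothetical class, not the real queue hierarchy in the 
patch. It only records the order in which queues see a recovered container.

```java
import java.util.List;

// Hypothetical stand-in for a scheduler queue; not the real queue classes.
class SketchQueue {
    final String name;
    final SketchQueue parent;      // null for the root queue
    int recoveredContainers = 0;

    SketchQueue(String name, SketchQueue parent) {
        this.name = name;
        this.parent = parent;
    }

    // Bottom-up: the leaf queue records the container first, then
    // propagates the update to its parent until the root is reached.
    void recoverBottomUp(List<String> visitOrder) {
        recoveredContainers++;
        visitOrder.add(name);
        if (parent != null) {
            parent.recoverBottomUp(visitOrder);
        }
    }

    // Top-down: start at the root and walk a known path down to the leaf.
    static void recoverTopDown(List<SketchQueue> pathFromRoot,
                               List<String> visitOrder) {
        for (SketchQueue q : pathFromRoot) {
            q.recoveredContainers++;
            visitOrder.add(q.name);
        }
    }
}
```

For a hierarchy root -> a -> a1, the bottom-up call on a1 visits a1, a, root, 
while the top-down call visits root, a, a1. Both leave the same counts, so the 
difference only matters if a scheduler needs a parent to be updated before its 
children (or the reverse).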

> Common work to re-populate containers’ state into scheduler
> -----------------------------------------------------------
>
>                 Key: YARN-1368
>                 URL: https://issues.apache.org/jira/browse/YARN-1368
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Anubhav Dhoot
>         Attachments: YARN-1368.1.patch, YARN-1368.preliminary.patch
>
>
> YARN-1367 adds support for the NM to tell the RM about all currently running 
> containers upon registration. The RM needs to send this information to the 
> schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover 
> the current allocation state of the cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
