[ 
https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997386#comment-13997386
 ] 

Wangda Tan commented on YARN-1368:
----------------------------------

Hi Jian,
Thanks for updating the patch. I took a look at it; a few comments:
1) RMAppImpl.java
{code}
+      // Add the current attempt to the scheduler.It'll be removed from
+      // scheduler in RMAppAttempt#BaseFinalTransition
+      app.handler.handle(new AppAttemptAddedSchedulerEvent(app.currentAttempt
+        .getAppAttemptId(), false));
{code}
I don't quite understand this. In the current trunk code, how does RMAppAttempt 
notify the scheduler with an "AppAttemptAddedSchedulerEvent" when recovering? If 
this is a missing piece in current trunk, I would rather put the code that sends 
"AppAttemptAddedSchedulerEvent" into RMAppAttemptImpl.

2) RMAppAttemptImpl.java
{code}
+   new ContainerFinishedTransition(
+      new AMContainerCrashedAtRunningTransition(),
+      RMAppAttemptState.RUNNING))
{code}
And
{code}
+   new ContainerFinishedTransition(
+      new AMContainerCrashedBeforeRunningTransition(),
+      RMAppAttemptState.LAUNCHED))
{code}
I see that the "RUNNING" and "LAUNCHED" states are passed in as the 
targetedFinalState, and targetedFinalState is only used in 
FinalStateSavedTransition. I'm a bit confused; could you please elaborate on 
this? Why split AMContainerCrashedTransition into two transitions and set their 
target states to RUNNING and LAUNCHED differently?

3) In AbstractYarnScheduler.java
3.1 
{code}
+      if (rmApp == null) {
+        LOG.error("Skip recovering container " + status
+            + " for unknown application.");
+        continue;
+      }
{code}
And
{code}
+      if (rmApp.getApplicationSubmissionContext().getUnmanagedAM()) {
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("Skip recovering container " + status
+              + " for unmanaged AM." + rmApp.getApplicationId());
+        }
+        continue;
+      }
{code}
And
{code}
+      SchedulerApplication schedulerApp = applications.get(appId);
+      if (schedulerApp == null) {
+        LOG.info("Skip recovering container  " + status
+            + " for unknown SchedulerApplication. Application state is "
+            + rmApp.getState());
+        continue;
+      }
{code}
It would be better to make the log levels consistent: the three skip cases above 
log at ERROR, DEBUG and INFO respectively.
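To illustrate, a minimal sketch of the three skip cases sharing one level. The 
class, method and boolean parameter names below are stand-ins for illustration 
only, and it uses java.util.logging instead of the project's LOG; it is not the 
actual scheduler code:
{code}
import java.util.logging.Logger;

public class RecoverSkipLogDemo {
  private static final Logger LOG =
      Logger.getLogger(RecoverSkipLogDemo.class.getName());

  // Returns the skip reason, or null if the container should be recovered.
  // All three guards log at the same level (WARNING here) instead of the
  // ERROR / DEBUG / INFO mix in the patch.
  static String skipReason(boolean unknownApp, boolean unmanagedAm,
                           boolean noSchedulerApp) {
    if (unknownApp) {
      LOG.warning("Skip recovering container for unknown application.");
      return "unknown application";
    }
    if (unmanagedAm) {
      LOG.warning("Skip recovering container for unmanaged AM.");
      return "unmanaged AM";
    }
    if (noSchedulerApp) {
      LOG.warning("Skip recovering container for unknown SchedulerApplication.");
      return "unknown SchedulerApplication";
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(skipReason(true, false, false));  // unknown application
    System.out.println(skipReason(false, false, false)); // null -> recover
  }
}
{code}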

3.2
{code}
+  public RMContainer createContainer(ContainerStatus status, RMNode node) {
+    Container container =
+        Container.newInstance(status.getContainerId(), node.getNodeID(),
+          node.getHttpAddress(), Resource.newInstance(1024, 1),
+          Priority.newInstance(0), null);
{code}
Should we change Resource.newInstance(1024, 1) to the container's actual 
resource?

3.3 For recoverContainersOnNode, is it possible that NODE_ADDED happens before 
APP_ADDED? I ask because, if so, a container may be recovered before its 
application is added to the scheduler.

4) In FiCaSchedulerNode.java:
{code}
+  @Override
+  public void recoverContainer(RMContainer rmContainer) {
+    if (rmContainer.getState().equals(RMContainerState.COMPLETED)) {
+      return;
+    }
+    allocateContainer(null, rmContainer);
+  }
{code}
Since allocateContainer doesn't use the application-id parameter, I think it's 
better to remove it.

5) In TestWorkPreservingRMRestart.java
{code}
+    assertEquals(usedCapacity, queue.getAbsoluteUsedCapacity(), 0);
{code}
It may be better to use the two-parameter assertEquals, since the delta is 0.
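For reference, a self-contained sketch of why a zero delta degenerates to an 
exact comparison. The equalsWithDelta method below merely mimics the check done 
by JUnit's three-argument assertEquals; it is not JUnit itself:
{code}
public class DeltaDemo {
  // Mimics JUnit's assertEquals(expected, actual, delta) check:
  // the assertion passes if |expected - actual| <= delta.
  static boolean equalsWithDelta(double expected, double actual, double delta) {
    return Math.abs(expected - actual) <= delta;
  }

  public static void main(String[] args) {
    // With delta == 0 the check is exact equality, so the third
    // argument carries no information.
    System.out.println(equalsWithDelta(0.5, 0.5, 0));          // true
    System.out.println(equalsWithDelta(0.5, 0.5000001, 0));    // false
    System.out.println(equalsWithDelta(0.5, 0.5000001, 1e-3)); // true
  }
}
{code}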

> Common work to re-populate containers’ state into scheduler
> -----------------------------------------------------------
>
>                 Key: YARN-1368
>                 URL: https://issues.apache.org/jira/browse/YARN-1368
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Jian He
>         Attachments: YARN-1368.1.patch, YARN-1368.2.patch, 
> YARN-1368.combined.001.patch, YARN-1368.preliminary.patch
>
>
> YARN-1367 adds support for the NM to tell the RM about all currently running 
> containers upon registration. The RM needs to send this information to the 
> schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover 
> the current allocation state of the cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
