[jira] [Commented] (YARN-3194) After NM restart,completed containers are not released by RM which are sent during NM registration

2015-02-18 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325610#comment-14325610
 ] 

Rohith commented on YARN-3194:
--

Thanks [~jlowe], [~djp] and [~jianhe] for the detailed review :-)

bq. the container status processing code is almost a duplicate of the same code 
in StatusUpdateWhenHealthyTransition
Agreed, this has to be refactored. Most of the containerStatus processing code 
is the same.

bq. we don't remove containers that have completed from the launchedContainers 
map which seems wrong
I see, yes. Completed containers should be removed from launchedContainers.

bq. I don't see why we would process container status sent during a reconnect 
differently than a regular status update from the NM
IIUC the only difference is that the reconnect path deals with NMContainerStatus 
while a regular update deals with ContainerStatus. I am not sure why these two 
were created as separate types. From what I see, ContainerStatus is a subset of 
NMContainerStatus, so I think ContainerStatus could have been placed inside 
NMContainerStatus.
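
To illustrate the subset relationship mentioned above, here is a minimal sketch of deriving the smaller status record from the larger one. The types below are simplified stand-ins, not the real YARN records: the field sets, the String container id and the toContainerStatus helper are all assumptions made for illustration, and the patch under review uses its own createContainerStatus method.

```java
/** Simplified stand-ins for the YARN records; all fields here are assumptions. */
final class StatusConversionSketch {

  enum ContainerState { NEW, RUNNING, COMPLETE }

  /** Stand-in for the status sent with a regular NM heartbeat. */
  static final class ContainerStatus {
    final String containerId;
    final ContainerState state;
    final String diagnostics;
    final int exitStatus;

    ContainerStatus(String containerId, ContainerState state,
        String diagnostics, int exitStatus) {
      this.containerId = containerId;
      this.state = state;
      this.diagnostics = diagnostics;
      this.exitStatus = exitStatus;
    }
  }

  /** Stand-in for the status sent at NM registration: same fields plus extras. */
  static final class NMContainerStatus {
    final String containerId;
    final ContainerState state;
    final String diagnostics;
    final int exitStatus;
    final int allocatedMemoryMB; // extra field, not present in ContainerStatus

    NMContainerStatus(String containerId, ContainerState state,
        String diagnostics, int exitStatus, int allocatedMemoryMB) {
      this.containerId = containerId;
      this.state = state;
      this.diagnostics = diagnostics;
      this.exitStatus = exitStatus;
      this.allocatedMemoryMB = allocatedMemoryMB;
    }
  }

  /** Hypothetical helper: copy the shared fields and drop the extras. */
  static ContainerStatus toContainerStatus(NMContainerStatus remote) {
    return new ContainerStatus(remote.containerId, remote.state,
        remote.diagnostics, remote.exitStatus);
  }
}
```

Because every field of the smaller record can be read from the larger one, the conversion loses information but never needs any extra input.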

bq. Is below condition valid for the newly added code in 
ReconnectNodeTransition too ? 
Yes, it is applicable since we are keeping the old RMNode object.

bq. Add timeout to the test, testAppCleanupWhenNMRstarts -> 
testProcessingContainerStatusesOnNMRestart? And add more detailed comments 
about what the test is doing too?
Agreed.

bq. Could you add a validation that ApplicationMasterService#allocate indeed 
receives the completed container in this scenario?
Agreed, I will add it.

bq. Question: does the 3072 include 1024 for the AM container and 2048 for the 
allocated container ? 
The AM memory is 1024 MB and the additionally requested container memory is 
2048 MB. In the test, the number of requested containers is 1, so AllocatedMB 
should be AM + requested, i.e. 1024 + 2048 = 3072.
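
The arithmetic above can be written out as a tiny sketch. The class and method names are hypothetical and are not part of the test in the patch; only the three numbers come from this discussion.

```java
/** Hypothetical helper spelling out the expected AllocatedMB value. */
final class AllocatedMemorySketch {

  /** AM container memory plus memory for each additionally requested container. */
  static int expectedAllocatedMB(int amMemoryMB, int containerMemoryMB,
      int requestedContainers) {
    return amMemoryMB + containerMemoryMB * requestedContainers;
  }

  public static void main(String[] args) {
    // Values from the discussion: 1024 MB AM, one extra container of 2048 MB.
    System.out.println(expectedAllocatedMB(1024, 2048, 1)); // prints 3072
  }
}
```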

 After NM restart,completed containers are not released by RM which are sent 
 during NM registration
 --

 Key: YARN-3194
 URL: https://issues.apache.org/jira/browse/YARN-3194
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: NM restart is enabled
Reporter: Rohith
Assignee: Rohith
Priority: Blocker
 Attachments: 0001-yarn-3194-v1.patch


 On NM restart, the NM sends all outstanding NMContainerStatus reports to the 
 RM, but the RM processes only those in ContainerState.RUNNING. If a container 
 completed while the NM was down, its resources are not released, which causes 
 applications to hang.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3194) After NM restart,completed containers are not released by RM which are sent during NM registration

2015-02-17 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325212#comment-14325212
 ] 

Junping Du commented on YARN-3194:
--

bq. I didn't see this problem originally, but I suspect it was because there 
were two things that masked it. As mentioned above, this problem doesn't 
manifest before YARN-2997. In addition, I was testing it with MapReduce 
applications, and the MR AM will explicitly kill containers for tasks that have 
completed (as reported by the umbilical connection between the AM and tasks).
I see. I think that's why we didn't notice this issue before. However, this bug 
only occurs after YARN-2997, so we should mark the affected version as 2.7.

bq. I don't see why we would process container status sent during a reconnect 
differently than a regular status update from the NM.
I think we can do some refactoring here. However, I think two things could 
differ between a reconnect and a regular update: 1. the port number could 
change (an ephemeral port is used when NM work-preserving restart is disabled); 
2. the resource could be updated (assuming the NM's resource could have been 
updated beforehand). Isn't that so?

 After NM restart,completed containers are not released by RM which are sent 
 during NM registration
 --

 Key: YARN-3194
 URL: https://issues.apache.org/jira/browse/YARN-3194
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
 Environment: NM restart is enabled
Reporter: Rohith
Assignee: Rohith
 Attachments: 0001-yarn-3194-v1.patch


 On NM restart, the NM sends all outstanding NMContainerStatus reports to the 
 RM, but the RM processes only those in ContainerState.RUNNING. If a container 
 completed while the NM was down, its resources are not released, which causes 
 applications to hang.





[jira] [Commented] (YARN-3194) After NM restart,completed containers are not released by RM which are sent during NM registration

2015-02-17 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325264#comment-14325264
 ] 

Junping Du commented on YARN-3194:
--

This should be a blocker for 2.7, as it breaks the rolling upgrade feature, 
which works in 2.6.



[jira] [Commented] (YARN-3194) After NM restart,completed containers are not released by RM which are sent during NM registration

2015-02-17 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324955#comment-14324955
 ] 

Jason Lowe commented on YARN-3194:
--

bq. Jason Lowe, I remember we discussed this case in some JIRA under YARN-1336, 
did you see this problem before?

I didn't see this problem originally, but I suspect it was because there were 
two things that masked it.  As mentioned above, this problem doesn't manifest 
before YARN-2997.  In addition, I was testing it with MapReduce applications, 
and the MR AM will explicitly kill containers for tasks that have completed (as 
reported by the umbilical connection between the AM and tasks).

I agree that we should be processing the container report sent with the NM 
registration, and it appears that is being dropped in the reconnected event.

Comments on the patch:

I noticed that the container status processing code is _almost_ a duplicate of 
the same code in StatusUpdateWhenHealthyTransition.  One difference is that we 
don't remove containers that have completed from the launchedContainers map 
which seems wrong.  I don't see why we would process container status sent 
during a reconnect differently than a regular status update from the NM.  
Therefore I think we should refactor the code to reuse this logic, as it should 
apply here just as it does for StatusUpdateWhenHealthyTransition.
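
As a rough illustration of the refactoring suggested above, here is a minimal sketch of one handler that both the reconnect path and the regular status update path could call, and that also drops completed containers from the launched set. It is not the code from the patch: the class, the String container ids and the handleContainerStatuses method are simplified, hypothetical stand-ins for the RMNode internals.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for the node state kept by the RM. */
final class SharedStatusHandlingSketch {

  enum ContainerState { RUNNING, COMPLETE }

  static final class ContainerStatus {
    final String containerId;
    final ContainerState state;

    ContainerStatus(String containerId, ContainerState state) {
      this.containerId = containerId;
      this.state = state;
    }
  }

  static final class UpdatedContainerInfo {
    final List<ContainerStatus> newlyLaunched;
    final List<ContainerStatus> completed;

    UpdatedContainerInfo(List<ContainerStatus> newlyLaunched,
        List<ContainerStatus> completed) {
      this.newlyLaunched = newlyLaunched;
      this.completed = completed;
    }
  }

  /** Containers the RM currently believes are running on this node. */
  final Set<String> launchedContainers = new HashSet<String>();

  /**
   * Single entry point for container statuses, whatever their origin.
   * Returns null when there is nothing new to hand to the scheduler.
   */
  UpdatedContainerInfo handleContainerStatuses(List<ContainerStatus> statuses) {
    List<ContainerStatus> newlyLaunched = new ArrayList<ContainerStatus>();
    List<ContainerStatus> completed = new ArrayList<ContainerStatus>();
    for (ContainerStatus status : statuses) {
      if (status.state == ContainerState.RUNNING) {
        // Set.add returns true only the first time this container is seen.
        if (launchedContainers.add(status.containerId)) {
          newlyLaunched.add(status);
        }
      } else {
        // A completed container must no longer be tracked as launched.
        launchedContainers.remove(status.containerId);
        completed.add(status);
      }
    }
    if (newlyLaunched.isEmpty() && completed.isEmpty()) {
      return null;
    }
    return new UpdatedContainerInfo(newlyLaunched, completed);
  }
}
```

With a single method, the reconnect transition would only need to convert each NMContainerStatus into the common status type before calling it, so the two paths cannot drift apart again.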



[jira] [Commented] (YARN-3194) After NM restart,completed containers are not released by RM which are sent during NM registration

2015-02-17 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325010#comment-14325010
 ] 

Jian He commented on YARN-3194:
---

[~rohithsharma], thanks for your explanation. Could you edit the description to 
make the problem clearer?

- Is it possible to have a common method for the code below in 
ReconnectNodeTransition and StatusUpdateWhenHealthyTransition?
{code}
// Filter the map to only obtain just launched containers and finished
// containers.
List<ContainerStatus> newlyLaunchedContainers =
    new ArrayList<ContainerStatus>();
List<ContainerStatus> completedContainers =
    new ArrayList<ContainerStatus>();
for (NMContainerStatus remoteContainer : reconnectEvent
    .getNMContainerStatuses()) {
  ContainerId containerId = remoteContainer.getContainerId();

  // Process running containers
  if (remoteContainer.getContainerState() == ContainerState.RUNNING) {
    if (!rmNode.launchedContainers.contains(containerId)) {
      // Just launched container. RM knows about it the first time.
      rmNode.launchedContainers.add(containerId);
      ContainerStatus cStatus = createContainerStatus(remoteContainer);
      newlyLaunchedContainers.add(cStatus);
    }
  } else {
    ContainerStatus cStatus = createContainerStatus(remoteContainer);
    completedContainers.add(cStatus);
  }
}
if (newlyLaunchedContainers.size() != 0
    || completedContainers.size() != 0) {
  rmNode.nodeUpdateQueue.add(new UpdatedContainerInfo(
      newlyLaunchedContainers, completedContainers));
}
{code}
- Is the condition below also valid for the newly added code in 
ReconnectNodeTransition?
{code}
// Don't bother with containers already scheduled for cleanup, or for
// applications already killed. The scheduler doesn't need to know any
// more about this container
if (rmNode.containersToClean.contains(containerId)) {
  LOG.info("Container " + containerId + " already scheduled for " +
      "cleanup, no further processing");
  continue;
}
if (rmNode.finishedApplications.contains(containerId
    .getApplicationAttemptId().getApplicationId())) {
  LOG.info("Container " + containerId
      + " belongs to an application that is already killed,"
      + " no further processing");
  continue;
}
{code}
- Add timeout to the test, testAppCleanupWhenNMRstarts -> 
testProcessingContainerStatusesOnNMRestart? And add more detailed comments 
about what the test is doing too?
{code}
@Test
public void testAppCleanupWhenNMRstarts() throws Exception
{code}
- Question: does the 3072 include 1024 for the AM container and 2048 for the 
allocated container?
{code}
 Assert.assertEquals(3072, allocatedMB);
{code}
- Could you add a validation that ApplicationMasterService#allocate indeed 
receives the completed container in this scenario?
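
For the question about the skip conditions, here is a minimal sketch of how the two checks could be pulled into one predicate usable from both transitions. The class, the String ids and the shouldSkip method are hypothetical simplifications; the real code works with ContainerId and ApplicationId objects on the RMNode.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for the cleanup bookkeeping kept per node. */
final class SkipFilterSketch {

  /** Containers already scheduled for cleanup on this node. */
  final Set<String> containersToClean = new HashSet<String>();

  /** Applications that have already finished or been killed. */
  final Set<String> finishedApplications = new HashSet<String>();

  /**
   * Returns true when a reported container needs no further processing,
   * because it is already being cleaned up or its application is gone.
   */
  boolean shouldSkip(String containerId, String applicationId) {
    return containersToClean.contains(containerId)
        || finishedApplications.contains(applicationId);
  }
}
```

Calling the same predicate at the top of the per-container loop in both places would keep the reconnect path from reporting containers the scheduler no longer cares about.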



[jira] [Commented] (YARN-3194) After NM restart,completed containers are not released by RM which are sent during NM registration

2015-02-17 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325017#comment-14325017
 ] 

Jian He commented on YARN-3194:
---

I hadn't seen Jason's comments; I agree with them too.



[jira] [Commented] (YARN-3194) After NM restart,completed containers are not released by RM which are sent during NM registration

2015-02-17 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323989#comment-14323989
 ] 

Rohith commented on YARN-3194:
--

[~djp] After the NM restart, I see the same behaviour you described in the NM 
logs.
