[
https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325010#comment-14325010
]
Jian He commented on YARN-3194:
-------------------------------
[~rohithsharma], thanks for your explanation. Could you edit the description to
make the problem clearer?
- Is it possible to have a common method for the code below, which appears in
both ReconnectNodeTransition and StatusUpdateWhenHealthyTransition?
{code}
// Filter the map to only obtain just launched containers and finished
// containers.
List<ContainerStatus> newlyLaunchedContainers =
    new ArrayList<ContainerStatus>();
List<ContainerStatus> completedContainers =
    new ArrayList<ContainerStatus>();
for (NMContainerStatus remoteContainer : reconnectEvent
    .getNMContainerStatuses()) {
  ContainerId containerId = remoteContainer.getContainerId();
  // Process running containers
  if (remoteContainer.getContainerState() == ContainerState.RUNNING) {
    if (!rmNode.launchedContainers.contains(containerId)) {
      // Just launched container. RM knows about it the first time.
      rmNode.launchedContainers.add(containerId);
      ContainerStatus cStatus = createContainerStatus(remoteContainer);
      newlyLaunchedContainers.add(cStatus);
    }
  } else {
    ContainerStatus cStatus = createContainerStatus(remoteContainer);
    completedContainers.add(cStatus);
  }
}
if (newlyLaunchedContainers.size() != 0
    || completedContainers.size() != 0) {
  rmNode.nodeUpdateQueue.add(new UpdatedContainerInfo(
      newlyLaunchedContainers, completedContainers));
}
{code}
- Is the condition below also valid for the newly added code in
ReconnectNodeTransition?
{code}
// Don't bother with containers already scheduled for cleanup, or for
// applications already killed. The scheduler doesn't need to know any
// more about this container
if (rmNode.containersToClean.contains(containerId)) {
  LOG.info("Container " + containerId + " already scheduled for " +
      "cleanup, no further processing");
  continue;
}
if (rmNode.finishedApplications.contains(containerId
    .getApplicationAttemptId().getApplicationId())) {
  LOG.info("Container " + containerId
      + " belongs to an application that is already killed,"
      + " no further processing");
  continue;
}
{code}
- Could you add a timeout to the test, rename testAppCleanupWhenNMRstarts ->
testProcessingContainerStatusesOnNMRestart, and add more detailed comments
about what the test is doing?
{code}
@Test
public void testAppCleanupWhenNMRstarts() throws Exception
{code}
- Question: does the 3072 include 1024 for the AM container and 2048 for the
allocated container?
{code}
Assert.assertEquals(3072, allocatedMB);
{code}
- Could you add a validation that ApplicationMasterService#allocate indeed
receives the completed container in this scenario?
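On the first point above, a shared method might look roughly like the sketch
below. It is only an illustration, not code from the patch: Report, State and
Partition are hypothetical stand-ins for NMContainerStatus, ContainerState and
UpdatedContainerInfo, container ids are plain strings, and launchedContainers
is assumed to be a Set.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ContainerStatusPartitionSketch {

  // Hypothetical stand-in for ContainerState; only two values are needed here.
  enum State { RUNNING, COMPLETE }

  // Hypothetical stand-in for NMContainerStatus, reduced to an id and a state.
  static final class Report {
    final String containerId;
    final State state;

    Report(String containerId, State state) {
      this.containerId = containerId;
      this.state = state;
    }
  }

  // Hypothetical stand-in for UpdatedContainerInfo.
  static final class Partition {
    final List<String> newlyLaunched = new ArrayList<>();
    final List<String> completed = new ArrayList<>();

    boolean isEmpty() {
      return newlyLaunched.isEmpty() && completed.isEmpty();
    }
  }

  /**
   * Splits the reported containers into newly launched and completed ones.
   * As in the quoted loop, running containers seen for the first time are
   * also recorded in launchedContainers (a side effect on the caller's set).
   */
  static Partition partition(List<Report> reports,
      Set<String> launchedContainers) {
    Partition result = new Partition();
    for (Report report : reports) {
      if (report.state == State.RUNNING) {
        // Set.add returns true only if the id was not already present.
        if (launchedContainers.add(report.containerId)) {
          result.newlyLaunched.add(report.containerId);
        }
      } else {
        result.completed.add(report.containerId);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Set<String> launched = new HashSet<>();
    launched.add("c1"); // already known to the RM

    List<Report> reports = new ArrayList<>();
    reports.add(new Report("c1", State.RUNNING));
    reports.add(new Report("c2", State.RUNNING));
    reports.add(new Report("c3", State.COMPLETE));

    Partition p = partition(reports, launched);
    System.out.println("newly launched: " + p.newlyLaunched);
    System.out.println("completed: " + p.completed);
  }
}
```

Both transitions could then call the helper and add an UpdatedContainerInfo to
the nodeUpdateQueue only when the result is not empty.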
> After NM restart,completed containers are not released by RM which are sent
> during NM registration
> --------------------------------------------------------------------------------------------------
>
> Key: YARN-3194
> URL: https://issues.apache.org/jira/browse/YARN-3194
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.6.0
> Environment: NM restart is enabled
> Reporter: Rohith
> Assignee: Rohith
> Attachments: 0001-yarn-3194-v1.patch
>
>
> On NM restart, the NM sends all outstanding NMContainerStatus reports to the
> RM, but the RM processes only those in ContainerState.RUNNING. If a container
> completed while the NM was down, its resources won't be released, which
> causes applications to hang.