[ https://issues.apache.org/jira/browse/YARN-4597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593320#comment-15593320 ]
Jian He commented on YARN-4597: ------------------------------- [~asuresh], some more questions and comments on the patch: - why are these two transitions added? {code} .addTransition(ContainerState.DONE, ContainerState.DONE, ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP) .addTransition(ContainerState.DONE, ContainerState.DONE, ContainerEventType.CONTAINER_LAUNCHED) .addTransition(ContainerState.DONE, ContainerState.DONE, {code} - the storeContainerKilled will be called in ContainerLaunch#cleanupContainer later, we don't need to call it here? {code} public void sendKillEvent(int exitStatus, String description) { try { context.getNMStateStore().storeContainerKilled(containerId); } catch (IOException ioe) { LOG.error("Could not log container state change to state store..", ioe); } {code} - remove unused imports in ContainersMonitor.java - remove ContainersMonitorImpl#allocatedCpuUsage unused method - why do you need to add the additional check for SCHEDULED state ? {code} // Process running containers if (remoteContainer.getState() == ContainerState.RUNNING || remoteContainer.getState() == ContainerState.SCHEDULED) { {code} - why this test needs to be changed? {code} testGetContainerStatus(container, i, EnumSet.of(ContainerState.RUNNING, ContainerState.SCHEDULED), "", {code} - similarly here in TestNodeManagerShutdown, we still need to change the test to make sure the container reaches to running state? {code} Assert.assertTrue( EnumSet.of(ContainerState.RUNNING, ContainerState.SCHEDULED) .contains(containerStatus.getState())); {code} - why do we need to change the test to run for 10 min ? {code} @Test(timeout = 600000) public void testAMRMClient() throws YarnException, IOException { {code} - unreleated to this patch: should ResourceUtilization#pmem,vmem be changed to long type ? we had specifically changed it for Resource object - we don't need to synchronize on the currentUtilization object? I don't see any other place it's synchronized {code} synchronized (currentUtilization) { {code} - In case we exceed the max-queue length, we are killing the container directly instead of queueing the container, in this case, we should not store the container as queued? {code} try { this.context.getNMStateStore().storeContainerQueued( container.getContainerId()); } catch (IOException e) { LOG.warn("Could not store container state into store..", e); } {code} - The ResourceUtilizationManager looks like only incorporated some utility methods, not sure how we will make this pluggable later.. - I think there might be some behavior change or bug for scheduling guaranteed containers when the oppotunistic-queue is enabled. -- Previously, when launching container, NM will not check for current vmem usage, and cpu usage. It assumes what RM allocated can be launched. -- Now, NM will check these limits and won't launch the container if hits the limit. -- Suppose the guarateed container hits the limit, it will be queued into queuedGuaranteedContainers. And this container will never be launched until one other container finishes which triggers the code path, even if it's not hitting these limits any more. This is a problem especially when other containers are long running container and never finishes. - The logic to select opportunisitic container: we may kill more opportunistic containers than required. e.g. -- one guarateed container comes, we select one opportunistic container -- before the selected opportunistic container is killed, another guarateed comes, then we will select two opportunistic containers to kill -- The process repeats, we may end up killing more opportunistic containers than required. > Add SCHEDULE to NM container lifecycle > -------------------------------------- > > Key: YARN-4597 > URL: https://issues.apache.org/jira/browse/YARN-4597 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager > Reporter: Chris Douglas > Assignee: Arun Suresh > Attachments: YARN-4597.001.patch, YARN-4597.002.patch, > YARN-4597.003.patch > > > Currently, the NM immediately launches containers after resource > localization. Several features could be more cleanly implemented if the NM > included a separate stage for reserving resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org