[jira] [Commented] (YARN-4597) Add SCHEDULE to NM container lifecycle

Jian He (JIRA) Thu, 20 Oct 2016 16:13:31 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-4597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593320#comment-15593320
 ]


Jian He commented on YARN-4597:
-------------------------------

[~asuresh], some more questions and comments on the patch:

- why are these two transitions added?
{code}
    .addTransition(ContainerState.DONE, ContainerState.DONE,
        ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP)
    .addTransition(ContainerState.DONE, ContainerState.DONE,
        ContainerEventType.CONTAINER_LAUNCHED)
    .addTransition(ContainerState.DONE, ContainerState.DONE,
{code}
- the storeContainerKilled will be called in ContainerLaunch#cleanupContainer 
later, we don't need to call it here?
{code}
  public void sendKillEvent(int exitStatus, String description) {
    try {
      context.getNMStateStore().storeContainerKilled(containerId);
    } catch (IOException ioe) {
      LOG.error("Could not log container state change to state store..", ioe);
    }
{code}
- remove unused imports in ContainersMonitor.java
- remove ContainersMonitorImpl#allocatedCpuUsage unused method
- why do you need to add the additional check for SCHEDULED state ? 
{code}
      // Process running containers
      if (remoteContainer.getState() == ContainerState.RUNNING ||
          remoteContainer.getState() == ContainerState.SCHEDULED) {
          {code}
- why this test needs to be changed? 
{code}
        testGetContainerStatus(container, i,
            EnumSet.of(ContainerState.RUNNING, ContainerState.SCHEDULED), "",
{code}
- similarly here in TestNodeManagerShutdown, we still need to change the test 
to make sure the container reaches to running state?
{code}
    Assert.assertTrue(
        EnumSet.of(ContainerState.RUNNING, ContainerState.SCHEDULED)
            .contains(containerStatus.getState()));
{code}
- why do we need to change the test to run for 10 min ?
{code}
  @Test(timeout = 600000)
  public void testAMRMClient() throws YarnException, IOException {
{code}
- unreleated to this patch: should ResourceUtilization#pmem,vmem be changed to 
long type ? we had specifically changed it for Resource object
- we don't need to synchronize on the currentUtilization object? I don't see 
any other place it's synchronized 
{code}
synchronized (currentUtilization) {
{code}
-  In case we exceed the max-queue length, we are killing the container 
directly instead of queueing the container,  in this case, we should not store 
the container as queued? 
{code}
      try {
        this.context.getNMStateStore().storeContainerQueued(
            container.getContainerId());
      } catch (IOException e) {
        LOG.warn("Could not store container state into store..", e);
      }
{code}
- The ResourceUtilizationManager looks like only incorporated some utility 
methods, not sure how we will make this pluggable later..  
- I think there might be some behavior change or bug for scheduling guaranteed 
containers  when the oppotunistic-queue is enabled.
-- Previously, when launching container, NM will not check for current vmem 
usage, and cpu usage. It assumes what RM allocated can be launched.
-- Now, NM will check these limits and won't launch the container if hits the 
limit. 
-- Suppose the guarateed container hits the limit, it will be queued into 
queuedGuaranteedContainers. And this container will never be launched until one 
other container finishes which triggers the code path, even if it's not hitting 
these limits any more. This is a problem especially when other containers are 
long running container and never finishes.
- The logic to select opportunisitic container: we may kill more opportunistic 
containers than required. e.g.
-- one guarateed container comes, we select one opportunistic container 
-- before the selected opportunistic container is killed, another guarateed 
comes, then we will select two opportunistic containers to kill
-- The process repeats, we may end up killing more opportunistic containers 
than required. 

> Add SCHEDULE to NM container lifecycle
> --------------------------------------
>
>                 Key: YARN-4597
>                 URL: https://issues.apache.org/jira/browse/YARN-4597
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Chris Douglas
>            Assignee: Arun Suresh
>         Attachments: YARN-4597.001.patch, YARN-4597.002.patch, 
> YARN-4597.003.patch
>
>
> Currently, the NM immediately launches containers after resource 
> localization. Several features could be more cleanly implemented if the NM 
> included a separate stage for reserving resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Commented] (YARN-4597) Add SCHEDULE to NM container lifecycle

Reply via email to