[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

Jim Brennan (Jira) Fri, 19 Jun 2020 14:26:18 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140841#comment-17140841
 ]


Jim Brennan commented on YARN-9809:
-----------------------------------

Thanks for the patch [~ebadger]!  Overall I think the design and code look good.
Here are some comments:

MockNM
- line 196 - Isn't this loop that removes completedContainers a no-op?
{noformat}
    ArrayList<ContainerId> completedContainers = new ArrayList<ContainerId>();
    status.setContainersStatuses(
        new ArrayList<ContainerStatus>(containerStats.values()));
    for (ContainerId cid : completedContainers) {
      containerStats.remove(cid);
    }
{noformat}

MockRM
- This code is repeated in a lot of tests. Maybe we could add a function 
somewhere that does this so we can just pass getMockNodeStatus() instead?
TestAbstractYarnScheduler, TestCapacityScheduler, TestFairScheduler, 
TestFifoScheduler, TestNMExpiry, TestNMReconnect, TestResourceManager, 
TestRMAppLogAggregationStatus, TestRMNodeTransitions, TestRMWebServicesNodes, 
TestSchedulerHealth
{noformat}
    NodeStatus mockNodeStatus = mock(NodeStatus.class);
    NodeHealthStatus mockNodeHealthStatus = mock(NodeHealthStatus.class);
    when(mockNodeStatus.getNodeHealthStatus()).thenReturn(mockNodeHealthStatus);
    when(mockNodeHealthStatus.getIsNodeHealthy()).thenReturn(true);
{noformat}
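
A minimal sketch of the kind of shared helper I have in mind. The class name
MockNodeStatusSketch and the two nested interfaces are stand-ins I made up so
the snippet compiles on its own; in the real tests the types would be YARN's
NodeStatus and NodeHealthStatus, and the helper body would presumably just be
the four Mockito lines above.

```java
final class MockNodeStatusSketch {
  // Stand-ins for the YARN types, reduced to the one getter each test stubs.
  interface NodeHealthStatus {
    boolean getIsNodeHealthy();
  }

  interface NodeStatus {
    NodeHealthStatus getNodeHealthStatus();
  }

  private MockNodeStatusSketch() {
  }

  // Returns a NodeStatus whose health status always reports healthy,
  // mirroring what the repeated mock(...)/when(...) setup produces.
  static NodeStatus getMockNodeStatus() {
    NodeHealthStatus healthy = () -> true;
    return () -> healthy;
  }
}
```

Each test would then pass getMockNodeStatus() wherever it currently builds the
two mocks by hand.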

RMAppManager
- This looks like an accidental edit:
{noformat}
    // Escape YarnServerCommonServiceProtossequences
{noformat}

RMNodeImpl
- line 894 - Don't we have to deal with the possibility that nodeStatus is 
null here?  That seems possible, since some of the RegisterNodeManagerRequest 
constructors pass null.  I think a null nodeStatus should be treated as 
healthy.
{noformat}
      NodeStatus nodeStatus =
          startEvent.getNodeStatus();
      RMNodeStatusEvent rmNodeStatusEvent =
          new RMNodeStatusEvent(nodeId, nodeStatus);

      NodeHealthStatus nodeHealthStatus =
          updateRMNodeFromStatusEvents(rmNode, rmNodeStatusEvent);

      NodeState nodeState = null;
      if (nodeHealthStatus.getIsNodeHealthy()) {
{noformat}
- In the case where the node is unhealthy, can we just call 
reportNodeUnusable() instead of the following?
{noformat}
        rmNode.context.getDispatcher().getEventHandler().handle(
            new NodesListManagerEvent(
                NodesListManagerEventType.NODE_UNUSABLE, rmNode));
        //Update the metrics
        rmNode.updateMetricsForDeactivatedNode(NodeState.RUNNING,
            NodeState.UNHEALTHY);
{noformat}
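
For the null nodeStatus point above, a rough sketch of the check I am
suggesting. This is not code from the patch: NullSafeHealthSketch,
isHealthyAtRegistration() and the nested interfaces are hypothetical names
used only so the snippet stands alone, and treating a missing
NodeHealthStatus as healthy is my assumption as well.

```java
final class NullSafeHealthSketch {
  // Stand-ins for the YARN types, reduced to the getters used here.
  interface NodeHealthStatus {
    boolean getIsNodeHealthy();
  }

  interface NodeStatus {
    NodeHealthStatus getNodeHealthStatus();
  }

  private NullSafeHealthSketch() {
  }

  // A node that registers without any status is treated as healthy;
  // otherwise the reported health flag decides.
  static boolean isHealthyAtRegistration(NodeStatus nodeStatus) {
    if (nodeStatus == null) {
      return true;
    }
    NodeHealthStatus health = nodeStatus.getNodeHealthStatus();
    // Assumption: a status without a health report also counts as healthy.
    return health == null || health.getIsNodeHealthy();
  }
}
```

With a guard like this, the transition could pick the node state without
risking a NullPointerException on the registration path.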

TestRMNodeTransitions
- Maybe add a testAddUnhealthy here?

TimedHealthReporterService
- Do we need to be concerned about someone who might have their own 
implementation of TimedHealthReporterService?  Should we maintain a constructor 
that takes two args and passes null for runBeforeStartup?
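
Something along these lines is what I mean by keeping the two-arg
constructor. It is only a sketch: TimedHealthReporterSketch is a made-up
stand-in for the real class, the parameter names are assumed, and I am
assuming runBeforeStartup can be a nullable Boolean where null means "not
requested".

```java
class TimedHealthReporterSketch {
  private final String name;
  private final long intervalMs;
  // Assumed nullable; null is treated the same as false.
  private final Boolean runBeforeStartup;

  // Kept so that existing subclasses written against two args still compile.
  protected TimedHealthReporterSketch(String name, long intervalMs) {
    this(name, intervalMs, null);
  }

  // New constructor that also takes the startup flag.
  protected TimedHealthReporterSketch(String name, long intervalMs,
      Boolean runBeforeStartup) {
    this.name = name;
    this.intervalMs = intervalMs;
    this.runBeforeStartup = runBeforeStartup;
  }

  String getName() {
    return name;
  }

  long getIntervalMs() {
    return intervalMs;
  }

  // True only when the flag was explicitly set to true.
  boolean runsBeforeStartup() {
    return Boolean.TRUE.equals(runBeforeStartup);
  }
}
```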


> NMs should supply a health status when registering with RM
> ----------------------------------------------------------
>
>                 Key: YARN-9809
>                 URL: https://issues.apache.org/jira/browse/YARN-9809
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>            Priority: Major
>         Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
