[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140841#comment-17140841 ]
Jim Brennan commented on YARN-9809: ----------------------------------- Thanks for the patch [~ebadger]! Overall I think the design and code look good. Here are some comments: MockNM - line 196 - Isn't this loop that removes completedContainers a no-op? {noformat} ArrayList<ContainerId> completedContainers = new ArrayList<ContainerId>(); status.setContainersStatuses( new ArrayList<ContainerStatus>(containerStats.values())); for (ContainerId cid : completedContainers) { containerStats.remove(cid); } {noformat} MockRM - This code is repeated in a lot of tests. Maybe we could add a function somewhere that does this so we can just pass getMockNodeStatus() instead? TestAbstractYarnScheduler, TestCapacityScheduler, testFairScheduler, TestFifoScheduler, TestNMExpiry, TestNMReconnect, TestResourceManager, TestRMAppLogAggregationStatus, TestRMNodeTransitions, TestRMWebServicesNodes, TestSchedulerHealth, {noformat} NodeStatus mockNodeStatus = mock(NodeStatus.class); NodeHealthStatus mockNodeHealthStatus = mock(NodeHealthStatus.class); when(mockNodeStatus.getNodeHealthStatus()).thenReturn(mockNodeHealthStatus); when(mockNodeHealthStatus.getIsNodeHealthy()).thenReturn(true); {noformat} RMAppManager - This looks like an accidental edit: {noformat} // Escape YarnServerCommonServiceProtossequences {noformat} RMNodeImpl - line 894 Don't we have to deal with the possibility that nodeStatus is null here? Seems like that is a possibilty. I think null nodeStatus should be treated as healthy. The RegisterNodeManagerRequest constructors that pass null is what made me think this is necessary? {noformat} NodeStatus nodeStatus = startEvent.getNodeStatus(); RMNodeStatusEvent rmNodeStatusEvent = new RMNodeStatusEvent(nodeId, nodeStatus); NodeHealthStatus nodeHealthStatus = updateRMNodeFromStatusEvents(rmNode, rmNodeStatusEvent); NodeState nodeState = null; if (nodeHealthStatus.getIsNodeHealthy()) { {noformat} - In the case where the node is unhealthy, can we just call reportNodeUnusable() instead of {noformat} rmNode.context.getDispatcher().getEventHandler().handle( new NodesListManagerEvent( NodesListManagerEventType.NODE_UNUSABLE, rmNode)); //Update the metrics rmNode.updateMetricsForDeactivatedNode(NodeState.RUNNING, NodeState.UNHEALTHY); {noformat} TestRMNodeTransitions - Maybe add a testAddUnhealthy here? TimedHealthReporterService - Do we need to be concerned about someone who might have their own implementation of TimedHealthReporterService? Should we maintain a constructor that takes two args and passes null for runBeforeStartup? > NMs should supply a health status when registering with RM > ---------------------------------------------------------- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug > Reporter: Eric Badger > Assignee: Eric Badger > Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org