[
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140841#comment-17140841
]
Jim Brennan commented on YARN-9809:
-----------------------------------
Thanks for the patch [~ebadger]! Overall I think the design and code look good.
Here are some comments:
MockNM
- line 196 - Isn't this loop that removes completedContainers a no-op?
{noformat}
ArrayList<ContainerId> completedContainers = new ArrayList<ContainerId>();
status.setContainersStatuses(
new ArrayList<ContainerStatus>(containerStats.values()));
for (ContainerId cid : completedContainers) {
containerStats.remove(cid);
}
{noformat}
MockRM
- This code is repeated in a lot of tests. Maybe we could add a function
somewhere that does this so we can just pass getMockNodeStatus() instead?
TestAbstractYarnScheduler, TestCapacityScheduler, testFairScheduler,
TestFifoScheduler, TestNMExpiry, TestNMReconnect, TestResourceManager,
TestRMAppLogAggregationStatus, TestRMNodeTransitions, TestRMWebServicesNodes,
TestSchedulerHealth,
{noformat}
NodeStatus mockNodeStatus = mock(NodeStatus.class);
NodeHealthStatus mockNodeHealthStatus = mock(NodeHealthStatus.class);
when(mockNodeStatus.getNodeHealthStatus()).thenReturn(mockNodeHealthStatus);
when(mockNodeHealthStatus.getIsNodeHealthy()).thenReturn(true);
{noformat}
RMAppManager
- This looks like an accidental edit:
{noformat}
// Escape YarnServerCommonServiceProtossequences
{noformat}
RMNodeImpl
- line 894 Don't we have to deal with the possibility that nodeStatus is null
here? Seems like that is a possibilty. I think null nodeStatus should be
treated as healthy. The RegisterNodeManagerRequest constructors that pass
null is what made me think this is necessary?
{noformat}
NodeStatus nodeStatus =
startEvent.getNodeStatus();
RMNodeStatusEvent rmNodeStatusEvent =
new RMNodeStatusEvent(nodeId, nodeStatus);
NodeHealthStatus nodeHealthStatus =
updateRMNodeFromStatusEvents(rmNode, rmNodeStatusEvent);
NodeState nodeState = null;
if (nodeHealthStatus.getIsNodeHealthy()) {
{noformat}
- In the case where the node is unhealthy, can we just call
reportNodeUnusable()
instead of
{noformat}
rmNode.context.getDispatcher().getEventHandler().handle(
new NodesListManagerEvent(
NodesListManagerEventType.NODE_UNUSABLE, rmNode));
//Update the metrics
rmNode.updateMetricsForDeactivatedNode(NodeState.RUNNING,
NodeState.UNHEALTHY);
{noformat}
TestRMNodeTransitions
- Maybe add a testAddUnhealthy here?
TimedHealthReporterService
- Do we need to be concerned about someone who might have their own
implementation of TimedHealthReporterService? Should we maintain a constructor
that takes two args and passes null for runBeforeStartup?
> NMs should supply a health status when registering with RM
> ----------------------------------------------------------
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Eric Badger
> Assignee: Eric Badger
> Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch,
> YARN-9809.003.patch, YARN-9809.004.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be
> scheduled many containers before the first heartbeat. After the first
> heartbeat, the RM will mark the NM as unhealthy and kill all of the
> containers.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]