[
https://issues.apache.org/jira/browse/YARN-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217597#comment-16217597
]
Jason Lowe commented on YARN-7102:
----------------------------------
I'm wondering about the branch-2 patch test failures. The SLS failures are
either bind exceptions or cast exceptions for fair scheduler metrics which I
believe are unrelated. However twice now Jenkins has complained about
TestResourceTrackerService which is very much a related test to the code being
changed. If I run this shell script I can eventually get it to fail as well:
{noformat}
$ cd
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
$ while :;do mvn surefire:test
-Dtest=TestResourceTrackerService#testReconnectNode || break;done
{noformat}
It looks like the TestResourceTrackerService#testReconnectNode failure is
related to this test output:
{noformat}
2017-10-21 09:04:23,368 INFO [main] resourcemanager.ResourceTrackerService
(ResourceTrackerService.java:registerNodeManager(403)) - Reconnect from the
node at: host1
2017-10-21 09:04:23,368 INFO [main] resourcemanager.ResourceTrackerService
(ResourceTrackerService.java:registerNodeManager(458)) - NodeManager from node
host1(cmPort: 1234 httpPort: 3) registered with capability: <memory:5120,
vCores:5>, assigned nodeId host1:1234
2017-10-21 09:04:23,368 DEBUG [AsyncDispatcher event handler]
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(192)) - Dispatching the
event
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeReconnectEvent.EventType:
RECONNECTED
2017-10-21 09:04:23,368 DEBUG [AsyncDispatcher event handler] rmnode.RMNodeImpl
(RMNodeImpl.java:handle(659)) - Processing host1:1234 of type RECONNECTED
2017-10-21 09:04:23,368 INFO [main] resourcemanager.ResourceTrackerService
(ResourceTrackerService.java:nodeHeartbeat(522)) - Too far behind rm response
id:0 nm response id:1
2017-10-21 09:04:23,368 DEBUG [AsyncDispatcher event handler]
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(192)) - Dispatching the
event
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeStatusEvent.EventType:
STATUS_UPDATE
2017-10-21 09:04:23,368 DEBUG [AsyncDispatcher event handler] rmnode.RMNodeImpl
(RMNodeImpl.java:handle(659)) - Processing host1:1234 of type STATUS_UPDATE
2017-10-21 09:04:23,368 ERROR [AsyncDispatcher event handler] rmnode.RMNodeImpl
(RMNodeImpl.java:handle(666)) - Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event:
STATUS_UPDATE at NEW
at
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
at
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
at
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:664)
at
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:98)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:1010)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:994)
at
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
at
org.apache.hadoop.yarn.event.DrainDispatcher$1.run(DrainDispatcher.java:83)
at java.lang.Thread.run(Thread.java:748)
2017-10-21 09:04:23,369 ERROR [AsyncDispatcher event handler] rmnode.RMNodeImpl
(RMNodeImpl.java:handle(667)) - Invalid event STATUS_UPDATE on Node host1:1234
oldState NEW
{noformat}
I believe the test issue is related to the code change. I think something
along these lines is occurring, but I haven't had time to delve into it further.
# NM re-registers with a health status change and gets a response, so NM thinks
next response is 0. In the meantime, RM has asynchronously sent a reconnect
event to the RM.
# NM heartbeats with 0 and that matches the RMNodeImpl, so RM responds with
next response ID = 1
# RMNodeImpl finally gets the reconnected node event, and because it's a status
change it swaps out the old node for the new RMNodeImpl, resetting the last
response ID to 0.
# NM heartbeats again with response ID 1. 1 != 0 so the RM thinks the NM is
"too old," when in fact it is _ahead_ of the RM. This causes the RM to send a
REBOOT command and remove the RMNodeImpl.
# Test code blindly looks up the RMNodeImpl by hostname and finds nothing,
since the RMNodeImpl reboot logic removed it, hence the NPE on line 1286 of the
test.
It looks like the response ID may be mishandled during the reconnect logic in
cases where we swap out one node for another and the NM heartbeat races ahead
of the reconnect processing. Before this code change the heartbeat wouldn't be
rejected because it's ahead of the RMNodeImpl, but now with this change those
heartbeats are rejected just as if the NM was behind the RM.
> NM heartbeat stuck when responseId overflows MAX_INT
> ----------------------------------------------------
>
> Key: YARN-7102
> URL: https://issues.apache.org/jira/browse/YARN-7102
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Botong Huang
> Assignee: Botong Huang
> Priority: Critical
> Attachments: YARN-7102-branch-2.8.v10.patch,
> YARN-7102-branch-2.8.v11.patch, YARN-7102-branch-2.8.v9.patch,
> YARN-7102-branch-2.v9.patch, YARN-7102-branch-2.v9.patch,
> YARN-7102-branch-2.v9.patch, YARN-7102.v1.patch, YARN-7102.v2.patch,
> YARN-7102.v3.patch, YARN-7102.v4.patch, YARN-7102.v5.patch,
> YARN-7102.v6.patch, YARN-7102.v7.patch, YARN-7102.v8.patch, YARN-7102.v9.patch
>
>
> ResponseId overflow problem in NM-RM heartbeat. This is same as AM-RM
> heartbeat in YARN-6640, please refer to YARN-6640 for details.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]