[jira] [Commented] (YARN-7102) NM heartbeat stuck when responseId overflows MAX_INT

Jason Lowe (JIRA) Tue, 24 Oct 2017 13:10:29 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217597#comment-16217597
 ]


Jason Lowe commented on YARN-7102:
----------------------------------

I'm wondering about the branch-2 patch test failures.  The SLS failures are 
either bind exceptions or cast exceptions for fair scheduler metrics which I 
believe are unrelated.  However twice now Jenkins has complained about 
TestResourceTrackerService which is very much a related test to the code being 
changed.  If I run this shell script I can eventually get it to fail as well:
{noformat}
$ cd 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
$ while :;do mvn surefire:test 
-Dtest=TestResourceTrackerService#testReconnectNode || break;done
{noformat}

It looks like the TestResourceTrackerService#testReconnectNode failure is 
related to this test output:
{noformat}
2017-10-21 09:04:23,368 INFO  [main] resourcemanager.ResourceTrackerService 
(ResourceTrackerService.java:registerNodeManager(403)) - Reconnect from the 
node at: host1
2017-10-21 09:04:23,368 INFO  [main] resourcemanager.ResourceTrackerService 
(ResourceTrackerService.java:registerNodeManager(458)) - NodeManager from node 
host1(cmPort: 1234 httpPort: 3) registered with capability: <memory:5120, 
vCores:5>, assigned nodeId host1:1234
2017-10-21 09:04:23,368 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(192)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeReconnectEvent.EventType:
 RECONNECTED
2017-10-21 09:04:23,368 DEBUG [AsyncDispatcher event handler] rmnode.RMNodeImpl 
(RMNodeImpl.java:handle(659)) - Processing host1:1234 of type RECONNECTED
2017-10-21 09:04:23,368 INFO  [main] resourcemanager.ResourceTrackerService 
(ResourceTrackerService.java:nodeHeartbeat(522)) - Too far behind rm response 
id:0 nm response id:1
2017-10-21 09:04:23,368 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(192)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeStatusEvent.EventType:
 STATUS_UPDATE
2017-10-21 09:04:23,368 DEBUG [AsyncDispatcher event handler] rmnode.RMNodeImpl 
(RMNodeImpl.java:handle(659)) - Processing host1:1234 of type STATUS_UPDATE
2017-10-21 09:04:23,368 ERROR [AsyncDispatcher event handler] rmnode.RMNodeImpl 
(RMNodeImpl.java:handle(666)) - Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
STATUS_UPDATE at NEW
        at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
        at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
        at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:664)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:98)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:1010)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:994)
        at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
        at 
org.apache.hadoop.yarn.event.DrainDispatcher$1.run(DrainDispatcher.java:83)
        at java.lang.Thread.run(Thread.java:748)
2017-10-21 09:04:23,369 ERROR [AsyncDispatcher event handler] rmnode.RMNodeImpl 
(RMNodeImpl.java:handle(667)) - Invalid event STATUS_UPDATE on Node  host1:1234 
oldState NEW
{noformat}

I believe the test issue is related to the code change.  I think something 
along these lines is occurring, but I haven't had time to delve into it further.
# NM re-registers with a health status change and gets a response, so NM thinks 
next response is 0.  In the meantime, RM has asynchronously sent a reconnect 
event to the RM.
# NM heartbeats with 0 and that matches the RMNodeImpl, so RM responds with 
next response ID = 1
# RMNodeImpl finally gets the reconnected node event, and because it's a status 
change it swaps out the old node for the new RMNodeImpl, resetting the last 
response ID to 0.
# NM heartbeats again with response ID 1.  1 != 0 so the RM thinks the NM is 
"too old," when in fact it is _ahead_ of the RM.  This causes the RM to send a 
REBOOT command and remove the RMNodeImpl.
# Test code blindly looks up the RMNodeImpl by hostname and finds nothing, 
since the RMNodeImpl reboot logic removed it, hence the NPE on line 1286 of the 
test.

It looks like the response ID may be mishandled during the reconnect logic in 
cases where we swap out one node for another and the NM heartbeat races ahead 
of the reconnect processing.  Before this code change the heartbeat wouldn't be 
rejected because it's ahead of the RMNodeImpl, but now with this change those 
heartbeats are rejected just as if the NM was behind the RM.


> NM heartbeat stuck when responseId overflows MAX_INT
> ----------------------------------------------------
>
>                 Key: YARN-7102
>                 URL: https://issues.apache.org/jira/browse/YARN-7102
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Botong Huang
>            Assignee: Botong Huang
>            Priority: Critical
>         Attachments: YARN-7102-branch-2.8.v10.patch, 
> YARN-7102-branch-2.8.v11.patch, YARN-7102-branch-2.8.v9.patch, 
> YARN-7102-branch-2.v9.patch, YARN-7102-branch-2.v9.patch, 
> YARN-7102-branch-2.v9.patch, YARN-7102.v1.patch, YARN-7102.v2.patch, 
> YARN-7102.v3.patch, YARN-7102.v4.patch, YARN-7102.v5.patch, 
> YARN-7102.v6.patch, YARN-7102.v7.patch, YARN-7102.v8.patch, YARN-7102.v9.patch
>
>
> ResponseId overflow problem in NM-RM heartbeat. This is same as AM-RM 
> heartbeat in YARN-6640, please refer to YARN-6640 for details. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (YARN-7102) NM heartbeat stuck when responseId overflows MAX_INT

Reply via email to