[
https://issues.apache.org/jira/browse/YARN-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sangjin Lee updated YARN-4741:
------------------------------
Attachment: nm.log
> RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async
> dispatcher event queue
> -----------------------------------------------------------------------------------------------
>
> Key: YARN-4741
> URL: https://issues.apache.org/jira/browse/YARN-4741
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.6.0
> Reporter: Sangjin Lee
> Priority: Critical
> Attachments: nm.log
>
>
> We had a pretty major incident with the RM where it was continually flooded
> with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event
> queue.
> In our setup, both RM HA and stateful restart were *disabled*, while NM
> work-preserving restart was *enabled*. Due to other issues, we did a
> cluster-wide NM restart.
> At some point during the restart (which took multiple hours), we started
> seeing the async dispatcher event queue build up. Normally the log reports a
> queue size of 1,000. In this case, it climbed to tens of millions of events.
> When we looked at the RM log, it was full of the following messages:
> {noformat}
> 2016-02-18 01:47:29,530 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid
> event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041
> 2016-02-18 01:47:29,535 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle
> this event at current state
> 2016-02-18 01:47:29,535 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid
> event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041
> 2016-02-18 01:47:29,538 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle
> this event at current state
> 2016-02-18 01:47:29,538 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid
> event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041
> {noformat}
> The node in question had been restarted a few minutes earlier.
> When we inspected the RM heap, it was full of
> RMNodeFinishedContainersPulledByAMEvents.
> Suspecting the NM work-preserving restart, we disabled it and did another
> cluster-wide rolling restart. Initially that seemed to help reduce the
> queue size, but the queue built back up to several million events and stayed
> there for an extended period. We had to restart the RM to resolve the problem.
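
To check whether the invalid events in the RM log concentrate on recently restarted nodes, the messages quoted above could be tallied per node. The following is a minimal sketch, not part of the original report: the function name `count_invalid_events` is hypothetical, and it assumes each message sits on a single line in the actual RM log (the wrapping shown above is assumed to come from mail formatting).

```python
import re
from collections import Counter

# Matches the RMNodeImpl message quoted in the description, e.g.
# "... Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node host:8041".
# Assumption: the message is on one line in the real RM log.
INVALID_EVENT = re.compile(r"Invalid event (\S+) on Node (\S+)")


def count_invalid_events(lines):
    """Count 'Invalid event' messages per (node, event type) pair."""
    counts = Counter()
    for line in lines:
        match = INVALID_EVENT.search(line)
        if match:
            event, node = match.groups()
            counts[(node, event)] += 1
    return counts
```

Passing an open log file to `count_invalid_events` and printing `most_common(20)` on the result would list the nodes generating the most invalid events, which can then be compared against the NM restart times.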