[ https://issues.apache.org/jira/browse/YARN-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
fanshilun resolved YARN-4741.
-----------------------------
    Resolution: Duplicate

> RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue
> -----------------------------------------------------------------------------------------------
>
>                 Key: YARN-4741
>                 URL: https://issues.apache.org/jira/browse/YARN-4741
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Sangjin Lee
>            Priority: Critical
>         Attachments: nm.log
>
>
> We had a pretty major incident with the RM where it was continually flooded
> with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event
> queue.
> In our setup, we had the RM HA or stateful restart *disabled*, but NM
> work-preserving restart *enabled*. Due to other issues, we did a cluster-wide
> NM restart.
> Some time during the restart (which took multiple hours), we started seeing
> the async dispatcher event queue building. Normally it would log 1,000. In
> this case, it climbed all the way up to tens of millions of events.
> When we looked at the RM log, it was full of the following messages:
> {noformat}
> 2016-02-18 01:47:29,530 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041
> 2016-02-18 01:47:29,535 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state
> 2016-02-18 01:47:29,535 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041
> 2016-02-18 01:47:29,538 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state
> 2016-02-18 01:47:29,538 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041
> {noformat}
> And that node in question was restarted a few minutes earlier.
> When we inspected the RM heap, it was full of
> RMNodeFinishedContainersPulledByAMEvents.
> Suspecting the NM work-preserving restart, we disabled it and did another
> cluster-wide rolling restart. Initially that seemed to have helped reduce the
> queue size, but the queue built back up to several millions and continued for
> an extended period. We had to restart the RM to resolve the problem.
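
For anyone triaging a similar flood, a rough way to see which nodes the rejected events target is to tally the "Invalid event" lines per node. The following is only a sketch: it assumes an unwrapped RM log in which each entry sits on one line in the format quoted above, and the function name count_invalid_events is made up for illustration and is not part of Hadoop.

```python
import re
from collections import Counter

# Matches the RMNodeImpl "Invalid event <EVENT> on Node <host:port>" entries
# quoted in the report. The \s+ separators tolerate extra whitespace only;
# the layout of the line is otherwise assumed to be exactly as shown there.
INVALID_EVENT = re.compile(
    r"RMNodeImpl:\s+Invalid\s+event\s+(?P<event>\S+)\s+on\s+Node\s+(?P<node>\S+)"
)


def count_invalid_events(log_text):
    """Return a Counter keyed by (node, event) for every 'Invalid event' entry."""
    counts = Counter()
    for match in INVALID_EVENT.finditer(log_text):
        counts[(match.group("node"), match.group("event"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python count_invalid.py rm.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        totals = count_invalid_events(handle.read())
    # Print the noisiest (node, event) pairs first.
    for (node, event), n in totals.most_common(20):
        print(n, node, event)
```

A handful of nodes dominating the tally would fit the observation that the affected node had been restarted shortly before; the count alone does not identify the root cause.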