Sangjin Lee created YARN-4741:
---------------------------------

             Summary: RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue
                 Key: YARN-4741
                 URL: https://issues.apache.org/jira/browse/YARN-4741
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.6.0
            Reporter: Sangjin Lee


We had a pretty major incident with the RM where it was continually flooded 
with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event 
queue.

In our setup, RM HA and RM stateful restart were *disabled*, but NM 
work-preserving restart was *enabled*. Because of other issues, we did a 
cluster-wide NM restart.

At some point during the restart (which took multiple hours), we started seeing 
the async dispatcher event queue build up. Normally the queue size it logs is 
around 1,000. In this case, it climbed all the way up to tens of millions of 
events.
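
To illustrate how a backlog like this can form, here is a minimal sketch of a queue with a single consumer. It is not the actual AsyncDispatcher code: the class name QueueBacklogSketch, the simulate() helper, the per-tick rates and the log message text are all assumptions made for illustration. It only shows that once events arrive faster than the one consumer can drain them, the queue grows without bound.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical, simplified stand-in for a single-consumer event dispatcher. */
class QueueBacklogSketch {
    private final Queue<String> eventQueue = new ArrayDeque<>();

    /** Enqueue an event; report the size whenever it is a nonzero multiple of 1,000. */
    void handle(String event) {
        int qSize = eventQueue.size();
        if (qSize != 0 && qSize % 1000 == 0) {
            // Illustrative message only; the real log text is an assumption here.
            System.out.println("Size of event-queue is " + qSize);
        }
        eventQueue.add(event);
    }

    /** Remove up to max events, as the single dispatcher thread would in one tick. */
    void drain(int max) {
        for (int i = 0; i < max && !eventQueue.isEmpty(); i++) {
            eventQueue.poll();
        }
    }

    /** Run a deterministic simulation and return the backlog left at the end. */
    static int simulate(int ticks, int producedPerTick, int consumedPerTick) {
        QueueBacklogSketch dispatcher = new QueueBacklogSketch();
        for (int t = 0; t < ticks; t++) {
            for (int i = 0; i < producedPerTick; i++) {
                dispatcher.handle("event-" + t + "-" + i);
            }
            dispatcher.drain(consumedPerTick);
        }
        return dispatcher.eventQueue.size();
    }

    public static void main(String[] args) {
        System.out.println("backlog when the consumer keeps up: "
                + simulate(10, 1000, 1000));
        System.out.println("backlog when producers outpace it: "
                + simulate(10, 5000, 1000));
    }
}
```

In the sketch, the backlog grows by the difference between the two rates on every tick, so it never recovers unless the producers slow down or stop.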

When we looked at the RM log, it was full of the following messages:
{noformat}
2016-02-18 01:47:29,530 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node  worker-node-foo.bar.net:8041
2016-02-18 01:47:29,535 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state
2016-02-18 01:47:29,535 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node  worker-node-foo.bar.net:8041
2016-02-18 01:47:29,538 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state
2016-02-18 01:47:29,538 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node  worker-node-foo.bar.net:8041
{noformat}
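
For readers less familiar with this kind of error, the following is a rough sketch of a state machine rejecting an event that its current state does not accept. It is not the RMNodeImpl implementation: the class InvalidEventSketch, the two states and the transition table are hypothetical, and we do not claim to know which state the node was actually in when these messages were logged.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical node state machine; states and transitions are illustrative only. */
class InvalidEventSketch {
    enum State { NEW, RUNNING }
    enum EventType { STARTED, FINISHED_CONTAINERS_PULLED_BY_AM }

    // Assumed table of which event types each state accepts.
    private static final Map<State, Set<EventType>> ACCEPTED =
            new EnumMap<>(State.class);
    static {
        ACCEPTED.put(State.NEW, EnumSet.of(EventType.STARTED));
        ACCEPTED.put(State.RUNNING,
                EnumSet.of(EventType.FINISHED_CONTAINERS_PULLED_BY_AM));
    }

    private State state = State.NEW;
    private int rejected = 0;

    /** Returns true if the event was accepted, false if it was rejected and logged. */
    boolean handle(EventType type) {
        if (!ACCEPTED.get(state).contains(type)) {
            rejected++;
            System.err.println("Invalid event " + type + " in state " + state);
            return false;
        }
        if (type == EventType.STARTED) {
            state = State.RUNNING;
        }
        return true;
    }

    State state() { return state; }
    int rejected() { return rejected; }

    public static void main(String[] args) {
        InvalidEventSketch node = new InvalidEventSketch();
        // A freshly registered node receives an event meant for a running node.
        node.handle(EventType.FINISHED_CONTAINERS_PULLED_BY_AM);
        node.handle(EventType.STARTED);
        node.handle(EventType.FINISHED_CONTAINERS_PULLED_BY_AM);
        System.out.println("rejected events: " + node.rejected());
    }
}
```

In this sketch a rejected event changes nothing in the node, so if whatever produces the events keeps sending them, every one is rejected and logged in the same way.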

The node in question had been restarted a few minutes earlier.

When we inspected the RM heap, it was full of 
RMNodeFinishedContainersPulledByAMEvents.

Suspecting the NM work-preserving restart, we disabled it and did another 
cluster-wide rolling restart. Initially that seemed to help reduce the queue 
size, but the queue built back up to several million events and stayed that 
high for an extended period. We had to restart the RM to resolve the problem.


