[ 
https://issues.apache.org/jira/browse/YARN-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-4741:
------------------------------
    Attachment: nm.log

> RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async 
> dispatcher event queue
> -----------------------------------------------------------------------------------------------
>
>                 Key: YARN-4741
>                 URL: https://issues.apache.org/jira/browse/YARN-4741
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Sangjin Lee
>            Priority: Critical
>         Attachments: nm.log
>
>
> We had a pretty major incident with the RM where it was continually flooded 
> with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event 
> queue.
> In our setup, RM HA and RM stateful restart were both *disabled*, but NM 
> work-preserving restart was *enabled*. Due to other issues, we did a 
> cluster-wide NM restart.
> At some point during the restart (which took multiple hours), we started 
> seeing the async dispatcher event queue build up. Normally the queue size it 
> logs is 1,000. In this case, it climbed to tens of millions of events.
> When we looked at the RM log, it was full of the following messages:
> {noformat}
> 2016-02-18 01:47:29,530 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid 
> event FINISHED_CONTAINERS_PULLED_BY_AM on Node  worker-node-foo.bar.net:8041
> 2016-02-18 01:47:29,535 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle 
> this event at current state
> 2016-02-18 01:47:29,535 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid 
> event FINISHED_CONTAINERS_PULLED_BY_AM on Node  worker-node-foo.bar.net:8041
> 2016-02-18 01:47:29,538 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle 
> this event at current state
> 2016-02-18 01:47:29,538 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid 
> event FINISHED_CONTAINERS_PULLED_BY_AM on Node  worker-node-foo.bar.net:8041
> {noformat}
> The node in question had been restarted a few minutes earlier.
> When we inspected the RM heap, it was full of 
> RMNodeFinishedContainersPulledByAMEvents.
> Suspecting the NM work-preserving restart, we disabled it and did another 
> cluster-wide rolling restart. Initially that seemed to reduce the queue size, 
> but the queue then built back up to several million events and the backlog 
> persisted for an extended period. We had to restart the RM to resolve the 
> problem.
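
To see which nodes the rejected events came from, the "Invalid event" messages quoted above can be tallied per node. The following is only a minimal sketch and not part of YARN: the class name InvalidEventCounter and the method countByNode are hypothetical, and it assumes each log record sits on a single line in the actual RM log (the excerpt above is wrapped by the mail client).

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hadoop: tallies "Invalid event" messages per node.
public class InvalidEventCounter {
    // Matches the "Invalid event <TYPE> on Node  <host:port>" message from the RM log excerpt.
    private static final Pattern INVALID_EVENT =
        Pattern.compile("Invalid event (\\S+) on Node\\s+(\\S+)");

    /** Returns a count per "<host:port> <EVENT_TYPE>" key, in first-seen order. */
    public static Map<String, Integer> countByNode(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = INVALID_EVENT.matcher(line);
            if (m.find()) {
                counts.merge(m.group(2) + " " + m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: InvalidEventCounter <rm-log-file>");
            return;
        }
        // Assumption: the log is small enough to read into memory at once.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        countByNode(lines).forEach((key, n) -> System.out.println(n + "\t" + key));
    }
}
```

Run against the RM log, this prints one line per node and event type with the number of rejected events, which may help confirm whether the flood is tied to the recently restarted nodes.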



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
