[ https://issues.apache.org/jira/browse/YARN-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13534953#comment-13534953 ]

Robert Joseph Evans commented on YARN-270:
------------------------------------------

It cannot exert back pressure currently, but I don't see any reason to think 
that it could not be added in the future.  Something as simple as setting a 
high water mark on the number of pending events and throttling events from 
incoming connections until the congestion subsides would probably work.
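As a rough sketch of that idea (the class name and the HIGH_WATER_MARK /
LOW_WATER_MARK thresholds below are hypothetical, not the actual YARN
dispatcher code), a single handler thread drains a shared queue while the
intake side blocks once the backlog crosses the high water mark:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ThrottlingDispatcher {
  // Hypothetical thresholds: start throttling at the high water mark,
  // resume normal intake once the backlog drains below the low water mark.
  private static final int HIGH_WATER_MARK = 10_000;
  private static final int LOW_WATER_MARK  = 1_000;

  private final BlockingQueue<Runnable> pending = new LinkedBlockingQueue<>();
  private volatile boolean throttling = false;

  /** Called from incoming connections / client-facing threads. */
  public void dispatch(Runnable event) throws InterruptedException {
    if (pending.size() >= HIGH_WATER_MARK) {
      throttling = true;
    }
    if (throttling) {
      // Back pressure: block the caller instead of buffering without bound
      // and eventually running the process out of memory.
      synchronized (this) {
        while (pending.size() > LOW_WATER_MARK) {
          wait();
        }
        throttling = false;
      }
    }
    pending.put(event);
  }

  /** Single handler thread drains the queue. */
  public void handlerLoop() throws InterruptedException {
    while (true) {
      Runnable event = pending.take();
      event.run();
      if (throttling && pending.size() <= LOW_WATER_MARK) {
        synchronized (this) {
          notifyAll();
        }
      }
    }
  }
}

The point is just that callers stall until the handler thread catches up, so
the pending-event queue can never grow without bound.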

We have seen a similar issue in the IPC layer on the AM when too many reducers 
were trying to download the mapper locations.  Granted, this is not the same 
code, but it was caused by asynchronously handling events and buffering up the 
data, so when we got behind we eventually got OOMs.  I think we will continue 
to see more issues like this as we scale up until we solve it generally, or 
else every single client API call will eventually have to be updated to avoid 
overloading the system.
                
> RM scheduler event handler thread gets behind
> ---------------------------------------------
>
>                 Key: YARN-270
>                 URL: https://issues.apache.org/jira/browse/YARN-270
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 0.23.5
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>
> We had a couple of incidents on a 2800 node cluster where the RM scheduler 
> event handler thread got behind processing events and basically became 
> unusable.  It was still processing apps, but taking a long time (1 hr 45 
> minutes) to accept new apps.  This actually happened twice within 5 days.
> We are using the capacity scheduler and at the time had between 400 and 500 
> applications running.  There were another 250 apps in the SUBMITTED state in 
> the RM that the scheduler hadn't yet processed to put into the pending state.  
> We had about 15 queues, none of them hierarchical.  We also had plenty of 
> space left on the cluster.
