[ https://issues.apache.org/jira/browse/YARN-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545937#comment-13545937 ]
Robert Joseph Evans commented on YARN-270:
------------------------------------------
I agree that part of the fix needs to be making the scheduler parallel, but we
also need a general way to apply back pressure; otherwise there will always be
a way to accidentally bring down the system with a DoS. We recently saw what
appears to be a very similar issue show up in an MRAppMaster. We still don't
understand exactly what triggered it, but a job that would typically take 5 to
10 minutes to complete was still running 17 hours later: the event queue
filled up, which caused the JVM to start garbage collecting like crazy, which
in turn made it unable to process all of the incoming events, which made the
queue fill up even more. We plan to address this in the short term by making
the JVM OOM much sooner than the default, but that is just a band-aid on the
underlying problem: unless there is back pressure, there is always the
possibility for incoming requests to overwhelm the system.
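To make the back-pressure idea concrete, here is a minimal sketch of a bounded
event queue (the class name and capacity are illustrative, not the actual
dispatcher code): put() blocks producers once the handler falls behind, rather
than letting the backlog and GC pressure grow without limit the way an
unbounded queue does.

{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of a dispatcher with back pressure. A bounded queue
// plus a blocking put() stalls event producers when the single handler
// thread cannot keep up, instead of letting the queue grow unbounded.
public class BoundedDispatcher {
    // Capacity is an illustrative value, not a recommendation.
    private final BlockingQueue<Runnable> eventQueue =
        new ArrayBlockingQueue<>(10_000);

    private final Thread handlerThread = new Thread(() -> {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                eventQueue.take().run(); // process events one at a time
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }, "event-handler");

    public void start() {
        handlerThread.start();
    }

    // Blocks the caller when the queue is full, applying back pressure to
    // whoever is generating events faster than they can be handled.
    public void dispatch(Runnable event) throws InterruptedException {
        eventQueue.put(event);
    }
}
{code}

Blocking is only one policy; rejecting or shedding events past the high-water
mark would also bound the queue, but either way the producers see the
pressure instead of the JVM heap absorbing it.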
> RM scheduler event handler thread gets behind
> ---------------------------------------------
>
> Key: YARN-270
> URL: https://issues.apache.org/jira/browse/YARN-270
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 0.23.5
> Reporter: Thomas Graves
> Assignee: Thomas Graves
>
> We had a couple of incidents on a 2800 node cluster where the RM scheduler
> event handler thread got behind processing events and basically became
> unusable. It was still processing apps, but taking a long time (1 hr 45
> minutes) to accept new apps. This actually happened twice within 5 days.
> We are using the capacity scheduler and at the time had between 400 and 500
> applications running. There were another 250 apps in the SUBMITTED state in
> the RM that the scheduler hadn't yet processed to put into the pending
> state. We had about 15 queues, none of them hierarchical. We also had
> plenty of space left on the cluster.