[ 
https://issues.apache.org/jira/browse/YARN-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545937#comment-13545937
 ] 

Robert Joseph Evans commented on YARN-270:
------------------------------------------

I agree that part of the fix needs to be making the scheduler parallel, but we 
also need a general way to apply back pressure; otherwise there will always be 
a way to accidentally bring down the system with a DoS.  We recently saw what 
appears to be a very similar issue show up on an MRAppMaster.  We still don't 
understand exactly what triggered it, but a job that would typically take 5 to 
10 minutes to complete was still running 17 hours later.  The queue filled up, 
which caused the JVM to start garbage collecting like crazy, which in turn 
kept it from processing all of the events coming in, which made the queue fill 
up even more.  We plan to address this in the short term by making the JVM OOM 
much sooner than it does by default, but that is still just a band-aid on the 
underlying problem: unless there is back pressure, there is always the 
possibility for incoming requests to overwhelm the system.
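
As a rough illustration of what back pressure could look like (a sketch only, 
not the RM or MRAppMaster dispatcher code; BoundedEventQueue, tryEnqueue and 
poll are made-up names), a bounded queue refuses new events once it is full, 
so producers have to slow down or retry instead of growing the heap:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of a bounded event queue. It is not the YARN
// dispatcher; it only shows the back-pressure idea with java.util.concurrent.
class BoundedEventQueue<E> {
    private final BlockingQueue<E> queue;

    BoundedEventQueue(int capacity) {
        // A fixed capacity caps the memory the backlog can consume.
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Producer side: returns false immediately when the queue is full,
    // which tells the caller to back off, retry later, or reject the request.
    boolean tryEnqueue(E event) {
        return queue.offer(event);
    }

    // Consumer side: returns the next event, or null if none is pending.
    E poll() {
        return queue.poll();
    }

    int size() {
        return queue.size();
    }
}
```

With an unbounded queue the only limit is the heap, which is how a slow 
consumer turns into a GC storm; with a bound, the overload shows up at the 
producer as a rejected event instead.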
                
> RM scheduler event handler thread gets behind
> ---------------------------------------------
>
>                 Key: YARN-270
>                 URL: https://issues.apache.org/jira/browse/YARN-270
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 0.23.5
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>
> We had a couple of incidents on a 2800 node cluster where the RM scheduler 
> event handler thread got behind processing events and basically became 
> unusable.  It was still processing apps, but taking a long time (1 hr 45 
> minutes) to accept new apps.  This actually happened twice within 5 days.
> We are using the capacity scheduler and at the time had between 400 and 500 
> applications running.  There were another 250 apps that were in the SUBMITTED 
> state in the RM but the scheduler hadn't processed those to put in pending 
> state yet.  We had about 15 queues, none of them hierarchical.  We also had 
> plenty of space left on the cluster.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
