[ 
https://issues.apache.org/jira/browse/YARN-4618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110648#comment-15110648
 ] 

Jun Gong commented on YARN-4618:
--------------------------------

hi [~Naganarasimha], we have met same problem in our cluster. We use 
FairScheduler. In FairScheduler#update(), we find some app's demand resource is 
negative, these apps are most HIVE SQL jobs, they requests so many containers 
that their demand resources overflow, then their queue overflows, the queue 
never gets resource. We are not sure whether it is HIVE SQL's bug, but we 
handle this problem in fair scheduler, in *updateDemand* we check whether 
demand resource is negative, and set it to *maxRes* if it is negative.

> RM Stops allocating containers if large number of pending containers
> --------------------------------------------------------------------
>
>                 Key: YARN-4618
>                 URL: https://issues.apache.org/jira/browse/YARN-4618
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>            Priority: Critical
>
> In  one of the test found that when RM is having so many pending container 
> request to be served RM Stops assigning containers.
> Cluster simulated is with 100 TB 
> Root total = 600k containers = 
> Queue 1 = 300k containers = 1328800000 MB
> Queue 2 = 300k containers = 1428800000 MB
> Each container request is with 4GB. 
> {{ParentQueue#assignContainers}} is as below
> {noformat}
>     // Check if this queue need more resource, simply skip allocation if this
>     // queue doesn't need more resources.
>     if (!super.hasPendingResourceRequest(node.getPartition(),
>         clusterResource, schedulingMode)) {
>       if (LOG.isDebugEnabled()) {
>         LOG.debug("Skip this queue=" + getQueuePath()
>             + ", because it doesn't need more resource, schedulingMode="
>             + schedulingMode.name() + " node-partition=" + 
> node.getPartition());
>       }
>       return CSAssignment.NULL_ASSIGNMENT;
>     }
> {noformat}
> When the pending resource > MAX VALUE and become *negative*  {{- 167XXXXXXX 
> MB}} and always NULL_ASSIGNMENT is return.
> Tools used to test SLS.
> For checking pendingResource request we should first check any pending 
> containers (from getMetrics()) are there to be served. If pending containers 
> are available then return true else consider other check for increase request.
> Thoughts ??



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to