[ 
https://issues.apache.org/jira/browse/YARN-11784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17951337#comment-17951337
 ] 

KevenHe commented on YARN-11784:
--------------------------------

We encountered a similar issue. Our Hadoop version is 2.6, and although the 
cluster resources are sufficient, the ResourceManager cannot allocate resources 
quickly, so tasks get stuck in the ACCEPTED state and a significant backlog 
builds up in the pending queue.
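
For anyone debugging the same symptom, the backlog is easy to quantify from the
RM REST API. A minimal sketch in Python, assuming the RM web UI is reachable at
rm-host:8088 (a placeholder):

    import json, urllib.request

    # Count applications stuck in the ACCEPTED state, grouped by queue.
    url = "http://rm-host:8088/ws/v1/cluster/apps?states=ACCEPTED"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)

    apps = ((data.get("apps") or {}).get("app")) or []
    by_queue = {}
    for app in apps:
        by_queue.setdefault(app["queue"], []).append(app["id"])

    for queue, ids in sorted(by_queue.items()):
        print(queue, ":", len(ids), "app(s) stuck in ACCEPTED")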

> Counter inaccurate and resource negative
> ----------------------------------------
>
>                 Key: YARN-11784
>                 URL: https://issues.apache.org/jira/browse/YARN-11784
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.2
>         Environment: Centos7.9
>            Reporter: Guoliang
>            Priority: Major
>         Attachments: tic_stream_49-1, yarnweb1.png, yarnweb2.png
>
>
> We have encountered the following issues in our production environment; they 
> may be caused by the same underlying problem. Please help us analyze which 
> bug this is (YARN version 2.7.2.22).
> (1) After the cluster runs for a while, the resources that can actually be 
> scheduled shrink. The RM page still shows resources as available, but they 
> cannot actually be handed out. For example, some queues have pending jobs 
> even though the queue has sufficient resources, the scheduling policy and 
> thresholds are normal, and the total cluster resources are also sufficient. 
> Note: in the RM scheduler counter history, we have seen a single-container 
> count of -1, and the count somehow stays at -1; we suspect this is related.
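>
> Counters like these surface in the QueueMetrics MBeans on the RM's /jmx 
> servlet, so one way to catch the drift is to scan every queue's metrics for 
> negative values. A minimal sketch in Python, assuming the RM web address is 
> rm-host:8088 (a placeholder):
>
>     import json, urllib.request
>
>     # Fetch all QueueMetrics MBeans from the RM JMX servlet and flag
>     # any numeric metric that has drifted below zero.
>     url = ("http://rm-host:8088/jmx"
>            "?qry=Hadoop:service=ResourceManager,name=QueueMetrics,*")
>     with urllib.request.urlopen(url) as resp:
>         beans = json.load(resp).get("beans", [])
>
>     for bean in beans:
>         for key, value in bean.items():
>             if isinstance(value, (int, float)) and value < 0:
>                 print(bean.get("name"), key, value)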
> (2) Similarly, queue A and the cluster both have resources, but the job in 
> queue A cannot get its ApplicationMaster scheduled and is never allocated to 
> an NM, so it stays stuck. Analysis: we found a workaround: increasing the 
> maxAMShare value. After the change, the previously pending jobs are scheduled 
> immediately, even though queue A has had a large amount of idle resources 
> all along.
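>
> For context on why raising maxAMShare helps: the Fair Scheduler starts a new 
> ApplicationMaster only while the resources already held by AMs, plus the new 
> AM container, fit within maxAMShare times the queue's fair share. A 
> back-of-the-envelope illustration in Python (all numbers are hypothetical, 
> not taken from this cluster):
>
>     # Fair Scheduler AM-limit check, with made-up numbers.
>     queue_fair_share_mb = 100000  # queue A's fair share of memory
>     max_am_share = 0.1            # example maxAMShare setting
>     am_used_mb = 11000            # memory already held by running AMs
>     new_am_mb = 2048              # AM container size of the pending app
>
>     am_limit_mb = queue_fair_share_mb * max_am_share
>     can_start = am_used_mb + new_am_mb <= am_limit_mb
>     print("AM limit:", am_limit_mb, "MB; can start new AM:", can_start)
>     # Raising maxAMShare to 0.5 lifts the limit to 50000 MB, so the
>     # pending AM would start immediately.
>
> If the AM-usage counter feeding this check has drifted upward and is never 
> corrected, the limit could keep rejecting AMs even while the queue is mostly 
> idle, which would be consistent with what we observed.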
> We also noticed another phenomenon: the per-queue 
> resourcemanager_queueinfo_AppsPending count obtained from the RM web UI and 
> JMX is inaccurate, but the result of calling the RM 8088 API 
> ws/v1/cluster/apps is correct. 
> See the figures below and the attached tic_stream_49 file: in reality all 49 
> jobs are in the Running state, but the numbers displayed on the web page and 
> in JMX are wrong.
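>
> The mismatch can be confirmed by pulling the same number from both sources. 
> A minimal sketch in Python, assuming the RM is at rm-host:8088 and the queue 
> is root.A (both placeholders), and using the ACCEPTED state as a rough proxy 
> for "pending":
>
>     import json, urllib.request
>
>     def get_json(url):
>         with urllib.request.urlopen(url) as resp:
>             return json.load(resp)
>
>     rm = "http://rm-host:8088"
>
>     # Source 1: AppsPending from the QueueMetrics MBean of queue root.A.
>     beans = get_json(rm + "/jmx?qry=Hadoop:service=ResourceManager,"
>                           "name=QueueMetrics,q0=root,q1=A")["beans"]
>     jmx_pending = beans[0]["AppsPending"] if beans else None
>
>     # Source 2: apps the REST API reports as ACCEPTED in the same queue.
>     apps = get_json(rm + "/ws/v1/cluster/apps?states=ACCEPTED&queue=A")
>     rest_pending = len(((apps.get("apps") or {}).get("app")) or [])
>
>     print("JMX AppsPending:", jmx_pending, "| REST ACCEPTED:", rest_pending)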
> !yarnweb1.png!
> !yarnweb2.png!
> [^tic_stream_49]



