[
https://issues.apache.org/jira/browse/YARN-11784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17951337#comment-17951337
]
KevenHe edited comment on YARN-11784 at 5/14/25 6:44 AM:
---------------------------------------------------------
We encountered a similar issue. The Hadoop version is 2.6, and although the
cluster resources are sufficient, the ResourceManager cannot allocate
resources quickly, so tasks get stuck in the ACCEPTED state and a significant
backlog builds up in the pending queue.
!https://api3-eeft-drive.feishu.cn/space/api/box/stream/download/preview/FVLJbiK4IoWmfcxhgXictlY0n1e/?preview_type=16!
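A quick way to quantify such a backlog is to ask the RM REST API for the
applications sitting in the ACCEPTED state. A minimal sketch, assuming a
placeholder RM address (the states filter is available on recent 2.x
releases):
{code:python}
import json
import urllib.request

RM = "http://resourcemanager:8088"  # placeholder; point this at your RM

# List applications stuck in ACCEPTED via the RM REST API.
with urllib.request.urlopen(RM + "/ws/v1/cluster/apps?states=ACCEPTED") as resp:
    body = json.load(resp)

# The RM returns {"apps": null} when nothing matches, hence the guards.
apps = (body.get("apps") or {}).get("app") or []
print(len(apps), "application(s) stuck in ACCEPTED")
for app in apps:
    print(app["id"], app["queue"], app["elapsedTime"], "ms waiting")
{code}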
> Counter inaccurate and resource negative
> ----------------------------------------
>
> Key: YARN-11784
> URL: https://issues.apache.org/jira/browse/YARN-11784
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.2
> Environment: Centos7.9
> Reporter: Guoliang
> Priority: Major
> Attachments: tic_stream_49-1, yarnweb1.png, yarnweb2.png
>
>
> We have encountered the following issues in our production environment; they
> may all stem from the same root cause. Please help us analyze which bug this
> is (YARN version 2.7.2.22).
> (1) After the cluster runs for a while, the amount of resources that can
> actually be scheduled shrinks. The RM page still shows resources available,
> but they cannot actually be handed out. For example, some queues have
> pending jobs even though the queue has sufficient resources, the scheduling
> policy and thresholds are normal, and the total cluster resources are also
> sufficient. Note: historically, the RM scheduler counters have shown cases
> where a single container count is -1 and stays at -1; we suspect this is
> related.
> (2) Similarly, queue A and the cluster both have free resources, yet jobs in
> queue A cannot launch an AM and are never allocated to an NM; they just stay
> stuck. Analysis: we found a workaround, which is to increase the maxAMShare
> value. After the change, the previously pending jobs were scheduled
> immediately, even though queue A had always had a large amount of idle
> resources. We also noticed that the per-queue
> resourcemanager_queueinfo_AppsPending count reported by the RM web UI and
> JMX is inaccurate, while the RM 8088 API ws/v1/cluster/apps returns correct
> numbers.
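> For reference, maxAMShare is a per-queue FairScheduler setting in the
> allocation file (fair-scheduler.xml). A minimal sketch of the workaround
> described above, with the queue name and value chosen only for illustration:
> {code:xml}
> <!-- Raise the fraction of queue A's fair share that ApplicationMasters may
>      occupy. The FairScheduler default is 0.5; -1.0 disables the check. -->
> <allocations>
>   <queue name="A">
>     <maxAMShare>0.8</maxAMShare>
>   </queue>
> </allocations>
> {code}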
> See the figure below and the attached file tic_stream_49. In reality, all 49
> jobs are in the Running state, but the counts shown on the web page and in
> JMX are wrong.
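> The discrepancy can be checked by reading AppsPending from the RM JMX
> QueueMetrics beans and comparing it with the per-queue count of ACCEPTED
> applications from ws/v1/cluster/apps. A sketch, assuming a placeholder RM
> address and the usual Hadoop 2.x bean naming:
> {code:python}
> import json
> import urllib.parse
> import urllib.request
> from collections import Counter
>
> RM = "http://resourcemanager:8088"  # placeholder RM address
>
> def get_json(path):
>     with urllib.request.urlopen(RM + path) as resp:
>         return json.load(resp)
>
> # AppsPending as reported by the JMX QueueMetrics beans.
> qry = urllib.parse.quote("Hadoop:service=ResourceManager,name=QueueMetrics,*")
> for bean in get_json("/jmx?qry=" + qry)["beans"]:
>     # Bean names end in q0=root,q1=...; join the q* parts into a queue path.
>     kv = [p.split("=", 1) for p in bean["name"].split(",") if "=" in p]
>     queue = ".".join(v for k, v in kv if k.startswith("q"))
>     print("jmx ", queue, "AppsPending =", bean.get("AppsPending"))
>
> # The same counts derived from the REST API, which we found to be accurate.
> apps = (get_json("/ws/v1/cluster/apps?states=ACCEPTED").get("apps") or {}).get("app") or []
> for queue, n in Counter(app["queue"] for app in apps).items():
>     print("rest", queue, "ACCEPTED =", n)
> {code}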
> !yarnweb1.png!
> !yarnweb2.png!
> [^tic_stream_49]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]