[ https://issues.apache.org/jira/browse/YARN-11784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931942#comment-17931942 ]

Guoliang edited comment on YARN-11784 at 3/3/25 11:43 AM:
----------------------------------------------------------

Addendum: the queue's resources are sufficient (both CPU and memory), the
cluster's resources are also sufficient, and the NM nodes' status and resources
are normal.

 

Phenomenon: jobs in the queue cannot be scheduled and stay stuck; no AM is
assigned to any node. The YARN web UI also shows an incorrect Num Pending
Applications count under the job's queue, and the count stays wrong, so it is
not a caching issue. See the two attached images and the attached file: the 49
jobs in the file are actually all in RUNNING state, but the page keeps showing
47 Running and 2 Pending.

 

Phenomenon: some queues have had sufficient resources the whole time, yet their
jobs never run and remain in the ACCEPTED state, and no AM is allocated on any
NM. As soon as maxAMShare is raised, the jobs are scheduled quickly.

 

I tried a temporary workaround: simply increase the maxAMShare value of the
stuck queue in the Fair Scheduler; once the change takes effect, the stuck
job's AM is immediately assigned to an NM node. Lowering the value back breaks
things again, and maxAMShare is capped at 1.0, so this can only be considered
a temporary workaround.
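For reference, maxAMShare is a per-queue setting in the Fair Scheduler's
allocation file (fair-scheduler.xml). A minimal sketch of the workaround
described above; the queue name here is hypothetical, since the real queue
names are not given in this report:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- "stuck_queue" is a placeholder for the affected queue's real name. -->
  <queue name="stuck_queue">
    <!-- Fraction of the queue's fair share that ApplicationMasters may use.
         The default is 0.5; raising it lets pending AMs launch.
         Setting it to -1.0 disables the AM-share check entirely. -->
    <maxAMShare>0.8</maxAMShare>
  </queue>
</allocations>
```

The Fair Scheduler reloads this file periodically, so the change takes effect
without an RM restart, which matches the "scheduled immediately after the
change" behavior described above.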





was (Author: man2601010):
Addendum: the queue's resources are sufficient (both CPU and memory), the
cluster's resources are also sufficient, and the NM nodes' status and resources
are normal.
Phenomenon: jobs in the queue cannot be scheduled and stay stuck; no AM is
assigned to any node.
I tried a temporary workaround: simply increase the maxAMShare value of the
stuck queue in the Fair Scheduler; once the change takes effect, the stuck
job's AM is immediately assigned to an NM node. Lowering the value back breaks
things again, and maxAMShare is capped at 1.0, so this can only be considered
a temporary workaround.



> Counter inaccurate and resource negative
> ----------------------------------------
>
>                 Key: YARN-11784
>                 URL: https://issues.apache.org/jira/browse/YARN-11784
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Guoliang
>            Priority: Major
>         Attachments: tic_stream_49-1, yarnweb1.png, yarnweb2.png
>
>
> We have encountered the following issues in our production environment; they
> may all be caused by the same underlying problem. Please help analyze which
> bug this is (YARN version 2.7.2.22).
> (1) After running for a while, the resources that can actually be scheduled
> shrink. The RM page still shows available resources, but they cannot actually
> be scheduled out. For example: some queues have pending jobs even though the
> queue's resources are sufficient, the scheduling policy and thresholds are
> normal, and the cluster's total resources are also sufficient. Note: looking
> back at the RM scheduler counters, there have been cases where a single
> container count was -1, and it stays at -1; I suspect this is related.
> (2) Similarly, queue A and the cluster both have resources, but queue A's
> task cannot get its AM scheduled or allocated to an NM, so it stays stuck.
> Analysis: we found a workaround, which is to increase the maxAMShare value;
> after the change, the historically pending jobs are scheduled immediately,
> even though queue A has had a large amount of idle resources all along. We
> also noticed a phenomenon: the resourcemanager_queueinfo_AppsPending count
> for a queue, as obtained from the RM web UI and JMX, is inaccurate, while the
> RM 8088 API ws/v1/cluster/apps has no such problem.
> See the figures below and the attached file tic_stream_49: all 49 jobs are
> actually in RUNNING state, but the numbers displayed on the web page and by
> JMX are wrong.
> !yarnweb1.png!
> !yarnweb2.png!
> [^tic_stream_49]
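The per-state counts described above can be cross-checked against the RM REST
endpoint the reporter says is accurate (ws/v1/cluster/apps). A minimal sketch
in Python, assuming the standard response shape; the RM address is a
placeholder, and the tallying is factored out so it can also be run on a saved
JSON response:

```python
import json
from collections import Counter
from urllib.request import urlopen

def tally_states(apps_json):
    """Count applications per state from a ws/v1/cluster/apps response."""
    apps = (apps_json.get("apps") or {}).get("app") or []
    return Counter(app["state"] for app in apps)

def fetch_tally(rm_url):
    """Fetch the app list from the RM REST API and tally states.

    rm_url is a placeholder, e.g. "http://rm-host:8088" (hypothetical host).
    """
    with urlopen(rm_url + "/ws/v1/cluster/apps") as resp:
        return tally_states(json.load(resp))

# Usage (against a live RM): print(fetch_tally("http://rm-host:8088"))
# Compare the result with the Num Running/Pending numbers on the web UI
# and the QueueMetrics gauges from /jmx to reproduce the mismatch.
```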



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
