Guoliang created YARN-11784:
-------------------------------

             Summary: Counter inaccurate and resource negative
                 Key: YARN-11784
                 URL: https://issues.apache.org/jira/browse/YARN-11784
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Guoliang
         Attachments: tic_stream_49-1, yarnweb1.png, yarnweb2.png

We have run into the following issues in our production environment (YARN 
version 2.7.2.22). They may share a single root cause; please help identify 
which bug this is.

(1) After the cluster has been running for some time, the amount of resources 
that can actually be scheduled shrinks. The RM page still shows available 
resources, but nothing gets scheduled from them. For example, some queues have 
pending jobs even though the queue has enough resources, the scheduling policy 
and thresholds are normal, and the cluster as a whole has enough resources. 
Note: historically, the RM scheduler counters have shown a single Container 
counter at -1. How can a count be -1? We suspect this is related.

(2) Similarly, queue A and the cluster both have free resources, but the AM for 
job A cannot be started and nothing is allocated to an NM, so the job stays 
stuck. 
Analysis: we found a workaround, which is to increase the maxAMShare value. 
After the change, the jobs that had been pending were scheduled immediately. 
Queue A had a large amount of idle resources the whole time. 
We also noticed that the resourcemanager_queueinfo_AppsPending count for the 
queue, as reported by the RM web UI and JMX, is inaccurate, whereas the data 
returned by the RM REST API on port 8088 (ws/v1/cluster/apps) is correct. 
See the attached screenshots and the tic_stream_49 attachment: all 49 jobs are 
actually in the Running state, but the numbers shown on the web page and in JMX 
are wrong.
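
The maxAMShare workaround in (2) fits how the Fair Scheduler is documented to limit AMs: a leaf queue may spend at most maxAMShare (default 0.5, with -1.0 disabling the check) of its fair share on application masters. The sketch below is a simplified, memory-only model of that check, written as an assumption for illustration; the function and parameter names are hypothetical, and it has not been compared against the 2.7.2.22 build in this report.

```python
def can_run_app_am(fair_share_mb, am_usage_mb, am_request_mb, max_am_share):
    """Simplified model (assumption): may one more AM start in this leaf queue?

    fair_share_mb  -- the queue's current fair share of memory, in MB
    am_usage_mb    -- memory the scheduler believes running AMs already hold
    am_request_mb  -- memory requested by the new AM container
    max_am_share   -- the queue's maxAMShare setting; negative disables the check
    """
    if max_am_share < 0:
        # Documented behaviour: -1.0 means the AM share is not checked at all.
        return True
    max_am_mb = fair_share_mb * max_am_share
    return am_usage_mb + am_request_mb <= max_am_mb


# Illustrative numbers only (not taken from the report):
# a 100 GB fair share whose tracked AM usage already sits at the 0.5 limit.
blocked = can_run_app_am(102400, 51200, 2048, 0.5)
unblocked = can_run_app_am(102400, 51200, 2048, 0.8)
print(blocked, unblocked)
```

In this model the decision depends on the AM usage the scheduler has tracked, not on how much of the queue is idle, which is why raising maxAMShare can release pending jobs even when the queue looks mostly empty.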
!yarnweb1.png!
!yarnweb2.png!
[^tic_stream_49]
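
The mismatch between the JMX counters and ws/v1/cluster/apps can be checked with a small script. The following is a minimal sketch, not a verified tool: the RM address is a placeholder, it assumes the reported metric comes from the AppsPending attribute of the RM's QueueMetrics JMX beans, that those beans carry a tag.Queue attribute (and per-user beans a tag.User attribute), and that apps in the ACCEPTED state are the ones counted as pending.

```python
import json
from collections import Counter
from urllib.request import urlopen

RM = "http://rm.example.com:8088"  # placeholder RM address (assumption)
APPS_URL = RM + "/ws/v1/cluster/apps"
JMX_URL = RM + "/jmx?qry=Hadoop:service=ResourceManager,name=QueueMetrics,*"


def fetch_json(url):
    """Fetch a URL and parse the response body as JSON."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def rest_state_counts(apps_payload, queue):
    """Count applications per state for one queue from the REST payload."""
    # The "apps" value can be null when no applications are returned.
    apps = (apps_payload.get("apps") or {}).get("app") or []
    return Counter(a["state"] for a in apps if a.get("queue") == queue)


def jmx_queue_counts(jmx_payload, queue):
    """Return AppsPending/AppsRunning for one queue from the JMX payload."""
    for bean in jmx_payload.get("beans", []):
        if "tag.User" in bean:
            continue  # assumption: skip per-user QueueMetrics beans
        if bean.get("tag.Queue") == queue and "AppsPending" in bean:
            return {"AppsPending": bean["AppsPending"],
                    "AppsRunning": bean.get("AppsRunning")}
    return None


def compare(apps_payload, jmx_payload, queue):
    """Put both views side by side and flag any disagreement."""
    rest = rest_state_counts(apps_payload, queue)
    jmx = jmx_queue_counts(jmx_payload, queue) or {}
    result = {
        "rest_running": rest.get("RUNNING", 0),
        "rest_pending": rest.get("ACCEPTED", 0),  # assumption: pending == ACCEPTED
        "jmx_running": jmx.get("AppsRunning"),
        "jmx_pending": jmx.get("AppsPending"),
    }
    result["mismatch"] = (result["rest_running"] != result["jmx_running"]
                          or result["rest_pending"] != result["jmx_pending"])
    return result
```

Against a live RM this would be called as `compare(fetch_json(APPS_URL), fetch_json(JMX_URL), "root.A")`, where the queue name is again a placeholder. Because the two endpoints are read at slightly different times, a small transient difference is expected; a persistent one, such as 49 running jobs in REST against different numbers in JMX, matches the symptom described above.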



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
