Guoliang created YARN-11784:
-------------------------------

             Summary: Counter inaccurate and resource negative
                 Key: YARN-11784
                 URL: https://issues.apache.org/jira/browse/YARN-11784
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Guoliang
         Attachments: tic_stream_49-1, yarnweb1.png, yarnweb2.png

We have run into the following problems in our production environment. They may share a single root cause. Please help us work out which bug this is (YARN version 2.7.2.22).

(1) After the cluster has been running for a while, the amount of resources that can actually be scheduled shrinks. The RM page still shows free resources, but they cannot actually be allocated.

Example: some queues have pending jobs even though the queue has enough resources and the scheduling policy and thresholds are all normal. The cluster as a whole also has enough resources.

Note: looking back through the RM scheduler counters, there have been cases where a container count was -1. A count should never be -1, so I suspect this is related.

(2) Similarly, queue A and the cluster both have resources, yet job A cannot get its AM scheduled and is not assigned to any NM. It simply stays stuck.

Analysis: we found a workaround, which is to increase the maxAMShare value. After the change, the jobs that had been pending were scheduled immediately. Queue A had a large amount of idle resources the whole time.

We also noticed the following. The resourcemanager_queueinfo_AppsPending count for the queue, as reported by the RM web UI and JMX, is inaccurate, but the result of the RM REST API on port 8088 (ws/v1/cluster/apps) is correct.

See the screenshots below and the attached file tic_stream_49. All 49 jobs are actually in Running state, but the numbers shown on the web page and in JMX are wrong.

!yarnweb1.png! !yarnweb2.png!

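The mismatch between the JMX counters and the REST API could be checked with a small script along the following lines. This is only a sketch under stated assumptions, not part of the original report: it assumes the RM's `/jmx` servlet exposes one `QueueMetrics` bean per queue with `tag.Queue`, `AppsPending` and `AppsRunning` attributes, that `ws/v1/cluster/apps` returns `{"apps": {"app": [...]}}` with `queue` and `state` fields, and that "pending" corresponds to the not-yet-running states listed in `PENDING_STATES`. The helper names and the queue name `root.A` are hypothetical.

```python
import json
from collections import Counter
from urllib.request import urlopen

# Assumption: application states counted as "pending" (submitted, not yet running).
PENDING_STATES = {"NEW", "NEW_SAVING", "SUBMITTED", "ACCEPTED"}


def fetch_json(url):
    """Fetch and decode a JSON document from the RM (for example on port 8088)."""
    with urlopen(url) as resp:
        return json.load(resp)


def rest_counts(apps_response, queue):
    """Count pending and running apps for one queue from a ws/v1/cluster/apps response."""
    # "apps" can be null when the cluster has no applications.
    apps = (apps_response.get("apps") or {}).get("app") or []
    states = Counter(a.get("state") for a in apps if a.get("queue") == queue)
    pending = sum(n for s, n in states.items() if s in PENDING_STATES)
    return {"pending": pending, "running": states.get("RUNNING", 0)}


def jmx_counts(jmx_response, queue):
    """Read AppsPending/AppsRunning for one queue from a /jmx response, or None if absent."""
    for bean in jmx_response.get("beans", []):
        if "name=QueueMetrics" in bean.get("name", "") and bean.get("tag.Queue") == queue:
            return {"pending": bean.get("AppsPending"), "running": bean.get("AppsRunning")}
    return None


def find_mismatches(apps_response, jmx_response, queue):
    """Return {metric: (rest_value, jmx_value)} for every counter that disagrees."""
    jmx = jmx_counts(jmx_response, queue)
    if jmx is None:
        raise LookupError("no QueueMetrics bean found for queue " + queue)
    rest = rest_counts(apps_response, queue)
    return {k: (rest[k], jmx[k]) for k in rest if rest[k] != jmx[k]}


# Example use against a live RM (host name is a placeholder):
#   apps = fetch_json("http://rm-host:8088/ws/v1/cluster/apps")
#   jmx = fetch_json("http://rm-host:8088/jmx")
#   print(find_mismatches(apps, jmx, "root.A"))
```

An empty result means the two sources agree for that queue. Any entry shows the REST value first and the JMX value second, which would make it easier to attach concrete numbers to this issue.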
[^tic_stream_49]

--
This message was sent by Atlassian Jira
(v8.20.10#820010)