[ https://issues.apache.org/jira/browse/YARN-11784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17951337#comment-17951337 ]
KevenHe commented on YARN-11784:
--------------------------------

We encountered a similar issue. Our Hadoop version is 2.6; although the cluster has sufficient resources, the ResourceManager cannot allocate them quickly, so tasks stay stuck in the ACCEPTED state and a significant backlog builds up in the pending queue.

> Counter inaccurate and resource negative
> ----------------------------------------
>
>                 Key: YARN-11784
>                 URL: https://issues.apache.org/jira/browse/YARN-11784
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.2
>         Environment: CentOS 7.9
>            Reporter: Guoliang
>            Priority: Major
>         Attachments: tic_stream_49-1, yarnweb1.png, yarnweb2.png
>
>
> We have hit the following problems in our production environment; they may all stem from the same root cause. Please help us work out which bug this is (YARN version 2.7.2.22).
> (1) After the cluster has been running for a while, the amount of resource that can actually be scheduled shrinks. The RM web UI still shows resources as available, but the scheduler cannot hand them out. For example, some queues have pending jobs even though the queue has enough resources, the scheduling policy and thresholds are all normal, and the cluster's total resources are also sufficient. Note: in the RM scheduler counters we have historically seen a single container count go to -1 and stay there; we suspect this is related.
> (2) Similarly, queue A and the cluster both have free resources, yet jobs in queue A cannot get an ApplicationMaster scheduled and nothing is allocated on any NodeManager, so they stay stuck. Analysis: we found a workaround. Increasing the maxAMShare value is enough (a config sketch follows after this quoted report); once it is raised, the long-pending jobs are scheduled immediately, even though queue A had large amounts of idle resource all along. We also noticed another symptom: the per-queue AppsPending count (resourcemanager_queueinfo_AppsPending) reported by the RM web UI and by JMX is inaccurate, while the numbers returned by the RM REST API on port 8088 (ws/v1/cluster/apps) are correct (commands to reproduce the comparison also follow below).
> See the figures below and the attached file tic_stream_49: all 49 jobs are actually in the RUNNING state, but the counts shown on the web UI and via JMX are wrong.
> !yarnweb1.png!
> !yarnweb2.png!
> [^tic_stream_49]
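
A minimal sketch of the maxAMShare workaround mentioned in point (2) of the quoted report, assuming the FairScheduler allocations file (fair-scheduler.xml); the queue name "A" and the value 0.8 are illustrative, not taken from the report:

{code:xml}
<?xml version="1.0"?>
<!-- fair-scheduler.xml: raise the AM resource cap for the affected queue. -->
<allocations>
  <queue name="A">
    <!-- Fraction of the queue's fair share that ApplicationMasters may
         consume. The FairScheduler default is 0.5; raising it lets more
         AMs start when the queue otherwise sits on idle resources. -->
    <maxAMShare>0.8</maxAMShare>
  </queue>
</allocations>
{code}

The FairScheduler periodically reloads the allocations file, so a change like this should take effect without an RM restart; this matches the report's observation that pending jobs were scheduled immediately after the change.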
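
One way to reproduce the metric mismatch described in point (2) is to compare the per-queue QueueMetrics from the RM's JMX servlet against the application list from the REST API. A sketch, where <rm-host> is a placeholder and queue "A" is again illustrative:

{code}
# Per-queue scheduler metrics via the RM JMX servlet; the AppsPending /
# AppsRunning counters here are the ones the report says drift.
curl 'http://<rm-host>:8088/jmx?qry=Hadoop:service=ResourceManager,name=QueueMetrics,q0=root,q1=A'

# Ground truth via the RM REST API: list applications filtered by their
# actual state and queue, then count them.
curl 'http://<rm-host>:8088/ws/v1/cluster/apps?states=RUNNING&queue=A'
{code}

If the JMX counter disagrees with the number of applications the REST call returns (as with the 49 RUNNING jobs in the attached tic_stream_49), that points at the QueueMetrics bookkeeping rather than at the applications themselves.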