Jin Gary created YARN-10404:
-------------------------------

             Summary: ResourceManager in HA mode: both RMs in Standby
                 Key: YARN-10404
                 URL: https://issues.apache.org/jira/browse/YARN-10404
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 3.1.1
            Reporter: Jin Gary
         Attachments: image-2020-08-24-11-07-50-899.png

Our YARN cluster uses the Capacity Scheduler with the following configuration:
{code:java}
yarn.scheduler.capacity.maximum-am-resource-percent=0.2
yarn.scheduler.capacity.maximum-applications=10000
yarn.scheduler.capacity.node-locality-delay=20
yarn.scheduler.capacity.queue-mappings=u:username:queuename,...
yarn.scheduler.capacity.queue-mappings-override.enable=true
yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
yarn.scheduler.capacity.root.acl_administer_queue=*
yarn.scheduler.capacity.root.acl_submit_applications=*
yarn.scheduler.capacity.root.capacity=100
yarn.scheduler.capacity.root.default.acl_submit_applications=*
yarn.scheduler.capacity.root.default.capacity=1
yarn.scheduler.capacity.root.default.maximum-am-resource-percent=1
yarn.scheduler.capacity.root.default.maximum-applications=10
yarn.scheduler.capacity.root.default.maximum-capacity=30
yarn.scheduler.capacity.root.default.minimum-user-limit-percent=100
yarn.scheduler.capacity.root.default.ordering-policy=fair
yarn.scheduler.capacity.root.default.ordering-policy.fair.enable-size-based-weight=false
yarn.scheduler.capacity.root.default.priority=0
yarn.scheduler.capacity.root.default.state=RUNNING
yarn.scheduler.capacity.root.default.user-limit-factor=10
yarn.scheduler.capacity.root.developer.acl_administer_queue=*
yarn.scheduler.capacity.root.developer.acl_submit_applications=*
yarn.scheduler.capacity.root.developer.capacity=7
yarn.scheduler.capacity.root.developer.developer_data.acl_administer_queue=*
yarn.scheduler.capacity.root.developer.developer_data.acl_submit_applications=*
yarn.scheduler.capacity.root.developer.developer_data.capacity=50
yarn.scheduler.capacity.root.developer.developer_data.maximum-am-resource-percent=0.8
yarn.scheduler.capacity.root.developer.developer_data.maximum-capacity=70
yarn.scheduler.capacity.root.developer.developer_data.minimum-user-limit-percent=30
yarn.scheduler.capacity.root.developer.developer_data.ordering-policy=fair
yarn.scheduler.capacity.root.developer.developer_data.ordering-policy.fair.enable-size-based-weight=false
yarn.scheduler.capacity.root.developer.developer_data.priority=20
yarn.scheduler.capacity.root.developer.developer_data.state=RUNNING
yarn.scheduler.capacity.root.developer.developer_data.user-limit-factor=5
yarn.scheduler.capacity.root.developer.developer_recsys.acl_administer_queue=*
yarn.scheduler.capacity.root.developer.developer_recsys.acl_submit_applications=*
yarn.scheduler.capacity.root.developer.developer_recsys.capacity=50
yarn.scheduler.capacity.root.developer.developer_recsys.maximum-am-resource-percent=0.8
yarn.scheduler.capacity.root.developer.developer_recsys.maximum-capacity=70
yarn.scheduler.capacity.root.developer.developer_recsys.minimum-user-limit-percent=30
yarn.scheduler.capacity.root.developer.developer_recsys.ordering-policy=fair
yarn.scheduler.capacity.root.developer.developer_recsys.ordering-policy.fair.enable-size-based-weight=false
yarn.scheduler.capacity.root.developer.developer_recsys.priority=20
yarn.scheduler.capacity.root.developer.developer_recsys.state=RUNNING
yarn.scheduler.capacity.root.developer.developer_recsys.user-limit-factor=5
yarn.scheduler.capacity.root.developer.maximum-am-resource-percent=1
yarn.scheduler.capacity.root.developer.maximum-capacity=40
yarn.scheduler.capacity.root.developer.minimum-user-limit-percent=30
yarn.scheduler.capacity.root.developer.ordering-policy=priority-utilization
yarn.scheduler.capacity.root.developer.priority=20
yarn.scheduler.capacity.root.developer.queues=developer_data,developer_recsys
yarn.scheduler.capacity.root.developer.state=RUNNING
yarn.scheduler.capacity.root.developer.user-limit-factor=3
yarn.scheduler.capacity.root.olap.acl_administer_queue=*
yarn.scheduler.capacity.root.olap.acl_submit_applications=*
yarn.scheduler.capacity.root.olap.capacity=5
yarn.scheduler.capacity.root.olap.maximum-am-resource-percent=1
yarn.scheduler.capacity.root.olap.maximum-capacity=30
yarn.scheduler.capacity.root.olap.minimum-user-limit-percent=100
yarn.scheduler.capacity.root.olap.ordering-policy=fifo
yarn.scheduler.capacity.root.olap.priority=50
yarn.scheduler.capacity.root.olap.state=RUNNING
yarn.scheduler.capacity.root.olap.user-limit-factor=3
yarn.scheduler.capacity.root.ordering-policy=priority-utilization
yarn.scheduler.capacity.root.pipeline.acl_administer_queue=*
yarn.scheduler.capacity.root.pipeline.acl_submit_applications=*
yarn.scheduler.capacity.root.pipeline.capacity=30
yarn.scheduler.capacity.root.pipeline.maximum-allocation-mb=21504
yarn.scheduler.capacity.root.pipeline.maximum-capacity=30
yarn.scheduler.capacity.root.pipeline.minimum-user-limit-percent=100
yarn.scheduler.capacity.root.pipeline.ordering-policy=fifo
yarn.scheduler.capacity.root.pipeline.priority=60
yarn.scheduler.capacity.root.pipeline.state=RUNNING
yarn.scheduler.capacity.root.pipeline.user-limit-factor=1
yarn.scheduler.capacity.root.priority=0
yarn.scheduler.capacity.root.queues=default,developer,pipeline,task,olap,realtime,yarn-system
yarn.scheduler.capacity.root.realtime.acl_administer_queue=*
yarn.scheduler.capacity.root.realtime.acl_submit_applications=*
yarn.scheduler.capacity.root.realtime.capacity=6
yarn.scheduler.capacity.root.realtime.maximum-capacity=6
yarn.scheduler.capacity.root.realtime.minimum-user-limit-percent=100
yarn.scheduler.capacity.root.realtime.ordering-policy=fifo
yarn.scheduler.capacity.root.realtime.priority=60
yarn.scheduler.capacity.root.realtime.state=RUNNING
yarn.scheduler.capacity.root.realtime.user-limit-factor=1
yarn.scheduler.capacity.root.task.acl_administer_queue=*
yarn.scheduler.capacity.root.task.acl_submit_applications=*
yarn.scheduler.capacity.root.task.capacity=51
yarn.scheduler.capacity.root.task.maximum-am-resource-percent=1
yarn.scheduler.capacity.root.task.maximum-capacity=59
yarn.scheduler.capacity.root.task.minimum-user-limit-percent=60
yarn.scheduler.capacity.root.task.ordering-policy=priority-utilization
yarn.scheduler.capacity.root.task.priority=40
yarn.scheduler.capacity.root.task.queues=task_data,task_recsys
yarn.scheduler.capacity.root.task.state=RUNNING
yarn.scheduler.capacity.root.task.task_data.acl_administer_queue=*
yarn.scheduler.capacity.root.task.task_data.acl_submit_applications=*
yarn.scheduler.capacity.root.task.task_data.capacity=70
yarn.scheduler.capacity.root.task.task_data.maximum-am-resource-percent=0.9
yarn.scheduler.capacity.root.task.task_data.maximum-capacity=80
yarn.scheduler.capacity.root.task.task_data.minimum-user-limit-percent=90
yarn.scheduler.capacity.root.task.task_data.ordering-policy=fifo
yarn.scheduler.capacity.root.task.task_data.priority=40
yarn.scheduler.capacity.root.task.task_data.state=RUNNING
yarn.scheduler.capacity.root.task.task_data.user-limit-factor=3
yarn.scheduler.capacity.root.task.task_recsys.acl_administer_queue=*
yarn.scheduler.capacity.root.task.task_recsys.acl_submit_applications=*
yarn.scheduler.capacity.root.task.task_recsys.capacity=30
yarn.scheduler.capacity.root.task.task_recsys.maximum-am-resource-percent=0.9
yarn.scheduler.capacity.root.task.task_recsys.maximum-capacity=50
yarn.scheduler.capacity.root.task.task_recsys.minimum-user-limit-percent=100
yarn.scheduler.capacity.root.task.task_recsys.ordering-policy=fifo
yarn.scheduler.capacity.root.task.task_recsys.priority=40
yarn.scheduler.capacity.root.task.task_recsys.state=RUNNING
yarn.scheduler.capacity.root.task.task_recsys.user-limit-factor=3
yarn.scheduler.capacity.root.task.user-limit-factor=2
yarn.scheduler.capacity.root.yarn-system.acl_administer_queue=yarn-ats
yarn.scheduler.capacity.root.yarn-system.acl_submit_applications=yarn-ats
yarn.scheduler.capacity.root.yarn-system.capacity=0
yarn.scheduler.capacity.root.yarn-system.default-application-lifetime=-1
yarn.scheduler.capacity.root.yarn-system.disable_preemption=true
yarn.scheduler.capacity.root.yarn-system.maximum-am-resource-percent=0.5
yarn.scheduler.capacity.root.yarn-system.maximum-application-lifetime=-1
yarn.scheduler.capacity.root.yarn-system.maximum-capacity=100
yarn.scheduler.capacity.root.yarn-system.minimum-user-limit-percent=100
yarn.scheduler.capacity.root.yarn-system.ordering-policy=fifo
yarn.scheduler.capacity.root.yarn-system.priority=32768
yarn.scheduler.capacity.root.yarn-system.state=RUNNING
yarn.scheduler.capacity.root.yarn-system.user-limit-factor=1
{code}

As you can see, preemption is enabled between the queues. We are now seeing two faults:

# During operation the active RM enters standby mode, but the standby RM does not automatically switch to active, so both RMs end up in standby. Both RM logs frequently print messages like the following:

{panel}
2020-08-18 22:35:19,451 WARN rmapp.RMAppImpl (RMAppImpl.java:<init>(473)) - The specific max attempts: 0 for application: 2769 is invalid, because it is out of the range [1, 3]. Use the global max attempts instead.
2020-08-18 22:35:19,451 INFO collector.TimelineCollectorManager (TimelineCollectorManager.java:putIfAbsent(149)) - the collector for application_1597677167198_2769 already exists!
2020-08-18 22:35:19,451 INFO placement.UserGroupMappingPlacementRule (UserGroupMappingPlacementRule.java:getPlacementForApp(201)) - Application application_1597677167198_2770 user kylin mapping [olap] to [org.apache.hadoop.yarn.server.resourcemanager.placement.ApplicationPlacementContext@4ad0f252] override true
2020-08-18 22:35:19,451 WARN rmapp.RMAppImpl (RMAppImpl.java:<init>(473)) - The specific max attempts: 0 for application: 2770 is invalid, because it is out of the range [1, 3]. Use the global max attempts instead.
2020-08-18 22:35:19,451 INFO collector.TimelineCollectorManager (TimelineCollectorManager.java:putIfAbsent(149)) - the collector for application_1597677167198_2770 already exists!
2020-08-18 22:35:19,452 INFO placement.UserGroupMappingPlacementRule (UserGroupMappingPlacementRule.java:getPlacementForApp(201)) - Application application_1597677167198_2771 user kylin mapping [olap] to [org.apache.hadoop.yarn.server.resourcemanager.placement.ApplicationPlacementContext@2287c6d0] override true
{panel}

After a while, the following exception appeared:

ERROR curator.ConnectionState (ConnectionState.java:checkTimeouts(228)) - Connection timed out for connection string (

2. No tasks are running in the queue, but the queue's resource usage is never released, so newly submitted applications remain in the ACCEPTED state.

!image-2020-08-24-11-07-50-899.png!

PS: We occasionally run yarn application -movetoqueue application_1478676388082_963529 -queue task_data so that tasks get sufficient queue resources.
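
To help rule out a plain misconfiguration, here is a minimal sketch that checks whether the capacities of sibling queues add up to 100 under each parent queue. It is an illustration only, not part of YARN: the helper names (parse_properties, child_capacity_sums) and the SAMPLE string are hypothetical, and SAMPLE contains only the queues and capacity entries copied from the configuration in this report.

```python
PREFIX = "yarn.scheduler.capacity."


def parse_properties(text):
    """Parse key=value lines into a dict, skipping blank and comment lines."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "//")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def child_capacity_sums(props):
    """Map each parent queue path to the sum of its children's capacity values."""
    sums = {}
    for key, value in props.items():
        if key.startswith(PREFIX) and key.endswith(".queues"):
            parent = key[len(PREFIX):-len(".queues")]
            children = [c.strip() for c in value.split(",") if c.strip()]
            # A child without an explicit capacity entry is counted as 0 here.
            sums[parent] = sum(
                float(props.get(f"{PREFIX}{parent}.{child}.capacity", "0"))
                for child in children
            )
    return sums


# Subset of the reported configuration: only the queues and capacity keys.
SAMPLE = """
yarn.scheduler.capacity.root.queues=default,developer,pipeline,task,olap,realtime,yarn-system
yarn.scheduler.capacity.root.default.capacity=1
yarn.scheduler.capacity.root.developer.capacity=7
yarn.scheduler.capacity.root.pipeline.capacity=30
yarn.scheduler.capacity.root.task.capacity=51
yarn.scheduler.capacity.root.olap.capacity=5
yarn.scheduler.capacity.root.realtime.capacity=6
yarn.scheduler.capacity.root.yarn-system.capacity=0
yarn.scheduler.capacity.root.developer.queues=developer_data,developer_recsys
yarn.scheduler.capacity.root.developer.developer_data.capacity=50
yarn.scheduler.capacity.root.developer.developer_recsys.capacity=50
yarn.scheduler.capacity.root.task.queues=task_data,task_recsys
yarn.scheduler.capacity.root.task.task_data.capacity=70
yarn.scheduler.capacity.root.task.task_recsys.capacity=30
"""

if __name__ == "__main__":
    for parent, total in sorted(child_capacity_sums(parse_properties(SAMPLE)).items()):
        status = "ok" if abs(total - 100.0) < 1e-6 else "MISMATCH"
        print(f"{parent}: {total} ({status})")
```

For the values in this report, the children of root, root.developer, and root.task each sum to 100, so the capacity split itself looks consistent and is probably not what keeps applications in ACCEPTED.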