Andras Gyori created YARN-11067:
-----------------------------------

             Summary: Resource overcommitment due to incorrect resource normalisation logical order
                 Key: YARN-11067
                 URL: https://issues.apache.org/jira/browse/YARN-11067
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Andras Gyori
            Assignee: Andras Gyori

A rather serious overcommitment issue was discovered when using ABSOLUTE resources as capacities. A minimal way to reproduce the issue is the following:

# Start with a cluster that has 32 GB memory and 16 VCores. Create the following hierarchy with the corresponding capacities:
## root.capacity = [memory=54GiB, vcores=28]
## root.a.capacity = [memory=50GiB, vcores=20]
## root.a1.capacity = [memory=30GiB, vcores=15]
## root.a2.capacity = [memory=20GiB, vcores=5]
# Remove a Node from the cluster (this is not even an unusual event), e.g. a Node with resource [memory=8GiB, vcores=4].
# Because the normalised resource ratio is calculated BEFORE the effective resource of the queue is recalculated, the stale ratio cascades down the queue hierarchy and results in overcommitment (see [https://github.com/apache/hadoop/blob/5ef335da1ed49e06cc8973412952e09ed08bb9c0/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java#L1294]).
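
The ordering problem in the reproduction steps above can be illustrated with a toy model. This is only a sketch and not the CapacityScheduler code: the class `NormalisationOrderSketch`, its `Queue` type, and the methods `childRatio`, `update` and `simulate` are hypothetical names invented here. It tracks memory only, and it assumes that a1 and a2 are children of root.a (their capacities sum to that queue's 50GiB) and that the ratio is min(1, parent effective / sum of the children's configured capacity).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only, NOT the CapacityScheduler code. Memory in GiB, single resource type. */
class NormalisationOrderSketch {

    /** Hypothetical queue node with a configured (absolute) and an effective capacity. */
    static final class Queue {
        final String name;
        final double configuredGiB;
        double effectiveGiB;
        final List<Queue> children = new ArrayList<>();

        Queue(String name, double configuredGiB) {
            this.name = name;
            this.configuredGiB = configuredGiB;
        }

        Queue add(Queue child) {
            children.add(child);
            return this;
        }

        Queue child(int index) {
            return children.get(index);
        }
    }

    /** Assumed normalisation: scale children down when they ask for more than the parent has. */
    static double childRatio(Queue parent) {
        double configuredSum = 0;
        for (Queue c : parent.children) {
            configuredSum += c.configuredGiB;
        }
        return configuredSum == 0 ? 1.0 : Math.min(1.0, parent.effectiveGiB / configuredSum);
    }

    /** Recomputes effective capacities top-down; the flag selects the order under discussion. */
    static void update(Queue q, double availableGiB, boolean ratioBeforeEffective) {
        double ratio = 1.0;
        if (ratioBeforeEffective) {
            ratio = childRatio(q); // reads the stale effective value from the previous update
        }
        q.effectiveGiB = Math.min(q.configuredGiB, availableGiB);
        if (!ratioBeforeEffective) {
            ratio = childRatio(q); // reads the freshly recalculated effective value
        }
        for (Queue c : q.children) {
            update(c, c.configuredGiB * ratio, ratioBeforeEffective);
        }
    }

    /** Builds the hierarchy, settles it on 32 GiB, then removes an 8 GiB node. */
    static Queue simulate(boolean ratioBeforeEffective) {
        Queue a = new Queue("root.a", 50).add(new Queue("a1", 30)).add(new Queue("a2", 20));
        Queue root = new Queue("root", 54).add(a);
        update(root, 32, false);                // steady state before the node is removed
        update(root, 24, ratioBeforeEffective); // cluster shrinks from 32 GiB to 24 GiB
        return root;
    }

    public static void main(String[] args) {
        for (boolean ratioFirst : new boolean[] {true, false}) {
            Queue root = simulate(ratioFirst);
            Queue a = root.child(0);
            System.out.printf("ratioBeforeEffective=%b root=%.1f a=%.1f a1=%.1f a2=%.1f%n",
                    ratioFirst, root.effectiveGiB, a.effectiveGiB,
                    a.child(0).effectiveGiB, a.child(1).effectiveGiB);
        }
    }
}
```

In this model, computing the ratio first leaves root.a with roughly 32 GiB of effective memory under a root that only has 24 GiB, while recalculating the effective resource first keeps every child within its parent. The real scheduler handles several resource types and capacity modes, so the numbers here illustrate the ordering effect only.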