[jira] [Created] (YARN-11403) Decommission Node reduces the maximumAllocation and leads to Job Failure

Prabhu Joseph (Jira) Tue, 27 Dec 2022 07:42:06 -0800

Prabhu Joseph created YARN-11403:
------------------------------------

             Summary: Decommission Node reduces the maximumAllocation and leads 
to Job Failure
                 Key: YARN-11403
                 URL: https://issues.apache.org/jira/browse/YARN-11403
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.3.4
            Reporter: Prabhu Joseph
            Assignee: Prabhu Joseph



When a node is decommissioned, ClusterNodeTracker updates the 
maximumAllocation to the total resources in use on that node. A job can then 
fail (with the error message below) when it requests a container larger than 
the new maximumAllocation.

{code}
22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in a 
row.
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
resource request! Cannot allocate containers as requested resource is greater 
than maximum allowed allocation. Requested resource type=[vcores], Requested 
resource=<memory:896, max memory:2147483647, vCores:2, max vCores:2147483647>, 
maximum allowed allocation=<memory:896, vCores:1>, please note that maximum 
allowed allocation is calculated by scheduler based on maximum resource of 
registered NodeManagers, which might be less than configured maximum 
allocation=<memory:122880, vCores:128>
{code}

**Repro:**

1. Start a cluster with two worker nodes, node1 and node2, each with 10GB of 
YARN NodeManager resource memory, and a configured maxAllocation of 10GB.
2. Submit a Spark job (ApplicationMaster size: 2GB, executor size: 4GB). Say 
the ApplicationMaster (2GB) is launched on node1. (Add a wait condition in 
Spark before it requests executors.)
3. Decommission node1 and make node2 UNHEALTHY. This brings maxAllocation down 
to 2GB.
4. Now notify the Spark job. It requests a 4GB executor, but the new 
maxAllocation is 2GB, so the request fails.
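
The arithmetic behind the repro steps can be sketched with a small 
standalone model. This is not the actual ClusterNodeTracker code: the class 
MaxAllocationSketch and all of its methods are hypothetical, and it assumes 
that a decommissioning node contributes only its in-use memory, that an 
UNHEALTHY node is dropped from the calculation, and that the result is capped 
by the configured maximum.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model of the behaviour described in this report.
// It is NOT the real ClusterNodeTracker implementation.
class MaxAllocationSketch {
    private final long configuredMaxMb;
    // Node name -> memory (MB) that the node contributes to the maximum.
    private final Map<String, Long> contributionMb = new LinkedHashMap<>();

    MaxAllocationSketch(long configuredMaxMb) {
        this.configuredMaxMb = configuredMaxMb;
    }

    // A healthy, running node contributes its total memory.
    void addNode(String node, long totalMb) {
        contributionMb.put(node, totalMb);
    }

    // Per the report: a decommissioning node contributes only what is in use.
    void startDecommission(String node, long usedMb) {
        contributionMb.put(node, usedMb);
    }

    // Assumption: an UNHEALTHY node no longer takes part in the calculation.
    void removeNode(String node) {
        contributionMb.remove(node);
    }

    // Largest per-node contribution, capped by the configured maximum.
    // With no nodes left, this sketch falls back to the configured maximum.
    long maximumAllocationMb() {
        if (contributionMb.isEmpty()) {
            return configuredMaxMb;
        }
        long largest = 0;
        for (long mb : contributionMb.values()) {
            largest = Math.max(largest, mb);
        }
        return Math.min(configuredMaxMb, largest);
    }

    // A request is valid only if it does not exceed the current maximum.
    boolean fits(long requestMb) {
        return requestMb <= maximumAllocationMb();
    }

    public static void main(String[] args) {
        MaxAllocationSketch tracker = new MaxAllocationSketch(10240);
        tracker.addNode("node1", 10240);
        tracker.addNode("node2", 10240);
        System.out.println(tracker.maximumAllocationMb()); // 10240

        tracker.startDecommission("node1", 2048); // only the 2GB AM is in use
        tracker.removeNode("node2");              // node2 becomes UNHEALTHY
        System.out.println(tracker.maximumAllocationMb()); // 2048
        System.out.println(tracker.fits(4096));            // false
    }
}
```

Under these assumptions the maximum drops from 10GB to 2GB after step 3, so 
the 4GB executor request in step 4 exceeds it and is rejected.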








