[jira] [Updated] (YARN-11403) Decommission Node reduces the maximumAllocation and leads to Job Failure

2022-12-27 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-11403:
-
Description: 
When a node is put into Decommission, ClusterNodeTracker updates the 
maximumAllocation to the total resources currently in use on that node. This can 
lead to Job Failure (with the error message below) when the Job requests a 
container larger than the new maximumAllocation.
{code:java}
22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in a 
row.
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
resource request! Cannot allocate containers as requested resource is greater 
than maximum allowed allocation. Requested resource type=[vcores], Requested 
resource=, 
maximum allowed allocation=, please note that maximum 
allowed allocation is calculated by scheduler based on maximum resource of 
registered NodeManagers, which might be less than configured maximum 
allocation=
{code}
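The effect can be pictured with a small standalone sketch. This is illustrative only, not the actual ClusterNodeTracker code: the NodeInfo class and recomputeMaxAllocation method are invented for the example, and the sizes match the repro below. A decommissioning node contributes only the resources its running containers hold, so the effective maximumAllocation can collapse to whatever those containers happen to occupy.
{code:java}
// Illustrative sketch only -- NodeInfo and recomputeMaxAllocation are invented
// names, not the real ClusterNodeTracker API. It mimics the reported behaviour:
// a decommissioning node is counted at its in-use size, so the effective
// maximumAllocation can drop far below the configured maximum.
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.yarn.api.records.Resource;

public class MaxAllocationSketch {

  static class NodeInfo {
    final Resource total;          // capacity reported by the NodeManager
    final Resource used;           // resources held by running containers
    final boolean decommissioning; // graceful decommission in progress

    NodeInfo(Resource total, Resource used, boolean decommissioning) {
      this.total = total;
      this.used = used;
      this.decommissioning = decommissioning;
    }
  }

  // maximumAllocation is bounded by the largest schedulable node; a
  // decommissioning node only contributes the resources already in use on it.
  static Resource recomputeMaxAllocation(List<NodeInfo> nodes, Resource configuredMax) {
    long maxMem = 0;
    int maxVcores = 0;
    for (NodeInfo n : nodes) {
      Resource effective = n.decommissioning ? n.used : n.total;
      maxMem = Math.max(maxMem, effective.getMemorySize());
      maxVcores = Math.max(maxVcores, effective.getVirtualCores());
    }
    // Never exceed the configured maximum allocation.
    return Resource.newInstance(
        Math.min(maxMem, configuredMax.getMemorySize()),
        Math.min(maxVcores, configuredMax.getVirtualCores()));
  }

  public static void main(String[] args) {
    Resource configuredMax = Resource.newInstance(10240, 10);
    List<NodeInfo> nodes = new ArrayList<>();
    // node1 hosts the 2GB ApplicationMaster and is being decommissioned.
    nodes.add(new NodeInfo(Resource.newInstance(10240, 10),
        Resource.newInstance(2048, 1), true));
    // node2 runs nothing and is also being decommissioned.
    nodes.add(new NodeInfo(Resource.newInstance(10240, 10),
        Resource.newInstance(0, 0), true));
    // Prints <memory:2048, vCores:1> -- a 4GB executor request now exceeds it.
    System.out.println(recomputeMaxAllocation(nodes, configuredMax));
  }
}
{code}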
*Repro:*

1. Cluster with two worker nodes, node1 and node2, each with a YARN NodeManager 
resource of 10GB memory; the configured maxAllocation is also 10GB.
2. Submit a SparkPi Job (ApplicationMaster size: 2GB, Executor size: 4GB). Say 
the ApplicationMaster (2GB) is launched on node1.
3. Put both nodes into Decommission. This brings maxAllocation down to 2GB.
4. The SparkPi Job fails because it requests an Executor of 4GB while 
maxAllocation is only 2GB (the AM-side sketch below shows where this limit surfaces).
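The scheduler's current maximum allocation is handed to the ApplicationMaster at registration, and a later container ask above it is rejected with the InvalidResourceRequestException shown in the log. Below is a minimal AM-side sketch (not the Spark ApplicationMaster; host, port and container sizes are placeholders) that reads this value and checks the executor request against it.
{code:java}
// Illustrative AM-side sketch -- assumes it runs inside a launched AM container
// with valid AMRM credentials. Host, port and sizes are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class CheckMaxAllocation {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();
    AMRMClient<ContainerRequest> amrmClient = AMRMClient.createAMRMClient();
    amrmClient.init(conf);
    amrmClient.start();

    RegisterApplicationMasterResponse response =
        amrmClient.registerApplicationMaster("am-host", 0, "");

    // With both nodes decommissioning this comes back around
    // <memory:2048, vCores:1>, well below the configured 10GB maxAllocation.
    Resource max = response.getMaximumResourceCapability();
    System.out.println("Effective maximum allocation: " + max);

    Resource executor = Resource.newInstance(4096, 1); // the 4GB executor request
    if (executor.getMemorySize() > max.getMemorySize()) {
      // Asking for this via allocate() is what the scheduler rejects above.
      System.err.println("Executor request exceeds the current maximum allocation");
    } else {
      amrmClient.addContainerRequest(
          new ContainerRequest(executor, null, null, Priority.newInstance(1)));
    }

    amrmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
    amrmClient.stop();
  }
}
{code}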

  was:
When a node is put into Decommission, ClusterNodeTracker updates the 
maximumAllocation to the total resources currently in use on that node. This can 
lead to Job Failure (with the error message below) when the Job requests a 
container larger than the new maximumAllocation.

{code}
22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in a 
row.
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
resource request! Cannot allocate containers as requested resource is greater 
than maximum allowed allocation. Requested resource type=[vcores], Requested 
resource=, 
maximum allowed allocation=, please note that maximum 
allowed allocation is calculated by scheduler based on maximum resource of 
registered NodeManagers, which might be less than configured maximum 
allocation=
{code}

*Repro:*

1. Cluster with two worker nodes, node1 and node2, each with a YARN NodeManager 
resource of 10GB memory; the configured maxAllocation is also 10GB.
2. Submit a Spark Job (ApplicationMaster size: 2GB, Executor size: 4GB). Say the 
ApplicationMaster (2GB) is launched on node1. (Add a wait condition in Spark 
before it requests Executors.)
3. Put node1 into Decommission and make node2 UNHEALTHY. This brings 
maxAllocation down to 2GB.
4. Now notify the Spark Job. It requests a 4GB Executor, but the new 
maxAllocation is 2GB, so it fails.







> Decommission Node reduces the maximumAllocation and leads to Job Failure
> 
>
> Key: YARN-11403
> URL: https://issues.apache.org/jira/browse/YARN-11403
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.4
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> When a node is put into Decommission, ClusterNodeTracker updates the 
> maximumAllocation to the total resources currently in use on that node. This 
> can lead to Job Failure (with the error message below) when the Job requests 
> a container larger than the new maximumAllocation.
> {code:java}
> 22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in 
> a row.
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request! Cannot allocate containers as requested resource is greater 
> than maximum allowed allocation. Requested resource type=[vcores], Requested 
> resource= vCores:2147483647>, maximum allowed allocation=, please 
> note that maximum allowed allocation is calculated by scheduler based on 
> maximum resource of registered NodeManagers, which might be less than 
> configured maximum allocation=
> {code}
> *Repro:*
> 1. Cluster with two worker nodes, node1 and node2, each with a YARN NodeManager 
> resource of 10GB memory; the configured maxAllocation is also 10GB.
> 2. Submit a SparkPi Job (ApplicationMaster size: 2GB, Executor size: 4GB). Say 
> the ApplicationMaster (2GB) is launched on node1.
> 3. Put both nodes into Decommission. This brings maxAllocation down to 2GB.
> 4. The SparkPi Job fails because it requests an Executor of 4GB while 
> maxAllocation is only 2GB.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
