[ 
https://issues.apache.org/jira/browse/YARN-9205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746429#comment-16746429
 ] 

Zhankun Tang edited comment on YARN-9205 at 1/18/19 4:06 PM:
-------------------------------------------------------------

[~leftnoteasy], [~yuan_zac], the root cause of this issue seems clear now.

YARN-8720 changed DefaultAMSProcessor#allocate from:
{code:java}
Resource maximumCapacity = getScheduler().getMaximumResourceCapability();{code}
to
{code:java}
Resource maximumCapacity =
    getScheduler().getMaximumResourceCapability(app.getQueue());{code}
So when an application requests a custom resource such as "<memory=2G, vcore=1, A=1>", 
the request is compared with the value returned by 
"getScheduler().getMaximumResourceCapability(app.getQueue())". In 
CS#getMaximumResourceCapability(String queueName), that value comes from 
"Resource queueMaxAllocation = ((LeafQueue)queue).getMaximumAllocation()", 
which returns a resource without A, such as "<memory=8G, vcore=4>".

When the two Resource objects are compared, CS throws an exception 
complaining that the request is larger than the queue's "maximumAllocation".

That is the direct cause of the application failure.
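
To illustrate the comparison described above, here is a minimal, self-contained sketch in plain Java. It does not use the Hadoop classes: the class MaxAllocationCheck, the method firstExceeding and the map-based resources are hypothetical stand-ins, and treating a resource type that is missing from the maximum as 0 is an assumption about how the check behaves.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical model of the request-vs-maximum check; not Hadoop code.
class MaxAllocationCheck {

    // Returns the first resource name whose requested value exceeds the
    // maximum. A resource type absent from the maximum counts as 0
    // (an assumption made for this sketch).
    static Optional<String> firstExceeding(Map<String, Long> request,
                                           Map<String, Long> maximum) {
        for (Map.Entry<String, Long> entry : request.entrySet()) {
            long allowed = maximum.getOrDefault(entry.getKey(), 0L);
            if (entry.getValue() > allowed) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Request with a custom resource "A", as in the example above.
        Map<String, Long> request = new LinkedHashMap<>();
        request.put("memory-mb", 2048L);
        request.put("vcores", 1L);
        request.put("A", 1L);

        // Queue maximum that lacks "A" entirely.
        Map<String, Long> queueMax = new LinkedHashMap<>();
        queueMax.put("memory-mb", 8192L);
        queueMax.put("vcores", 4L);

        // "A" is reported as exceeding the maximum.
        System.out.println(firstExceeding(request, queueMax));

        // Once the maximum knows about "A", the request passes.
        queueMax.put("A", 10L);
        System.out.println(firstExceeding(request, queueMax));
    }
}
```

In this model the request only fails because "A" is absent from the maximum. Memory and vcores are well within the limits.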

But why doesn't the queue's "maximumAllocation" contain resource "A"?

The queue's "maximumAllocation" is assigned in AbstractCSQueue#setupQueueConfigs:
{code:java}
this.maximumAllocation =
    configuration.getMaximumAllocationPerQueue(getQueuePath());{code}
This calls into CapacitySchedulerConfiguration:
{code:java}
Resource clusterMax =
    ResourceUtils.fetchMaximumAllocationFromConfig(this);{code}
That in turn calls into ResourceUtils:
{code:java}
public static Resource fetchMaximumAllocationFromConfig(Configuration conf) {
  Map<String, ResourceInformation> resourceInformationMap =
      getResourceInformationMapFromConfig(conf);
  ...
private static Map<String, ResourceInformation>
    getResourceInformationMapFromConfig(Configuration conf) {
  Map<String, ResourceInformation> resourceInformationMap = new HashMap<>();
  String[] resourceNames = conf.getStrings(YarnConfiguration.RESOURCE_TYPES);
  LOG.debug("resourceNames from config: " + resourceNames);
  if (resourceNames != null && resourceNames.length != 0) {
    for (String resourceName : resourceNames) {
    ...
{code}
The root cause is here:
{code:java}
String[] resourceNames = conf.getStrings(YarnConfiguration.RESOURCE_TYPES);
LOG.debug("resourceNames from config: " + resourceNames);{code}
If you print "resourceNames", it is null, so any custom resource is 
ignored here.

The conf passed in is the CS configuration, which has no "resource-types.xml" 
contents in it.

To verify this, patch 02 just adds "yarn-site.xml" to the CS 
configuration, and with it the application runs successfully.

Now the root cause is clear: the CS configuration doesn't contain the contents of 
"resource-types.xml", which causes "getResourceInformationMapFromConfig" to ignore 
the custom resource, so the queue's "maximumAllocation" is set incorrectly.
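
The effect of the missing resource-types contents can be sketched in plain Java as follows. This is a simplified model rather than the ResourceUtils code: the class ResourceTypesLookup, the method knownResourceNames and the use of a Map in place of Hadoop's Configuration are hypothetical, and the comma-separated value format is an assumption made for the illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the resource-name lookup; not Hadoop code.
class ResourceTypesLookup {

    // Stand-in for the key behind YarnConfiguration.RESOURCE_TYPES.
    static final String RESOURCE_TYPES = "yarn.resource-types";

    // The mandatory resources are always known. Custom resources are
    // added only when the configuration actually contains the key.
    static List<String> knownResourceNames(Map<String, String> conf) {
        List<String> names = new ArrayList<>(List.of("memory-mb", "vcores"));
        String raw = conf.get(RESOURCE_TYPES);
        if (raw != null && !raw.trim().isEmpty()) {
            for (String name : raw.split(",")) {
                names.add(name.trim());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // A configuration that never loaded the resource-types contents.
        Map<String, String> csConf = new HashMap<>();
        System.out.println(knownResourceNames(csConf));

        // The same lookup once the custom resource type is present.
        csConf.put(RESOURCE_TYPES, "cmp.com/hdw");
        System.out.println(knownResourceNames(csConf));
    }
}
```

With the key absent, the lookup silently falls back to the mandatory resources only, which matches the symptom of a maximum allocation without the custom resource.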

The CS configuration is constructed from the configuration passed 
into the CS scheduler. I also checked the configuration that is first 
initialized in the RM's main function and passed to CS and then to the CS 
configuration. It doesn't load "resource-types.xml" either.

Not sure whether patch 02 is a fix or a workaround. Any input? [~tarunparimi]

 



> When using custom resource type, application will fail to run due to the 
> CapacityScheduler throws 
> InvalidResourceRequestException(GREATER_THEN_MAX_ALLOCATION) 
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-9205
>                 URL: https://issues.apache.org/jira/browse/YARN-9205
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.3.0
>            Reporter: Zhankun Tang
>            Assignee: Zhankun Tang
>            Priority: Critical
>         Attachments: YARN-9205-trunk.001.patch, YARN-9205-trunk.002.patch
>
>
> In a non-secure cluster, reproduce it as follows:
>  # Set capacity scheduler in yarn-site.xml
>  # Use default capacity-scheduler.xml
>  # Set custom resource type "cmp.com/hdw" in resource-types.xml
>  # Set a value say 10 in node-resources.xml
>  # Start cluster
>  # Submit a distributed shell application that requests some "cmp.com/hdw"
> The AM gets an exception from CapacityScheduler and then fails. This bug 
> doesn't exist in FairScheduler.
> {code:java}
> 2019-01-17 22:12:11,286 INFO distributedshell.ApplicationMaster: Requested 
> container ask: Capability[<memory:2048, vCores:2, cmp.com/hdw: 
> 2>]Priority[0]AllocationRequestId[0]ExecutionTypeRequest[{Execution Type: 
> GUARANTEED, Enforce Execution Type: false}]Resource Profile[]
> 2019-01-17 22:12:12,326 ERROR impl.AMRMClientAsyncImpl: Exception on heartbeat
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request! Cannot allocate containers as requested resource is greater 
> than maximum allowed allocation. Requested resource type=[cmp.com/hdw], 
> Requested resource=<memory:2048, vCores:2, cmp.com/hdw: 2>, maximum allowed 
> allocation=<memory:8192, vCores:4>, please note that maximum allowed 
> allocation is calculated by scheduler based on maximum resource of registered 
> NodeManagers, which might be less than configured maximum 
> allocation=<memory:8192, vCores:4, cmp.com/hdw: 9223372036854775807>
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.throwInvalidResourceException(SchedulerUtils.java:492)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.checkResourceRequestAgainstAvailableResource(SchedulerUtils.java:388)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:315)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:293)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:301)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:250)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:240)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> ...{code}
> Rough debugging shows that the method below returns the wrong maximum capacity.
> DefaultAMSProcessor.java, Line 234:
> {code:java}
> Resource maximumCapacity =
>  getScheduler().getMaximumResourceCapability(app.getQueue());{code}
> The above code should apparently return "<memory:8192, vCores:4, cmp.com/hdw:10>" 
> but returns "<memory:8192, vCores:4>".
> This incorrect value might be caused by the queue maximum allocation calculation 
> introduced in YARN-8720:
> AbstractCSQueue.java, Line 364:
> {code:java}
> this.maximumAllocation =
>  configuration.getMaximumAllocationPerQueue(
>  getQueuePath());{code}
> And this invokes CapacitySchedulerConfiguration.java Line 895:
> {code:java}
> Resource clusterMax = ResourceUtils.fetchMaximumAllocationFromConfig(this);
> {code}
> Passing a "this" that is not a YarnConfiguration instance causes the code below 
> to return null for the resource names, so the result contains only the mandatory 
> resources. This might be the root cause.
> {code:java}
> private static Map<String, ResourceInformation> 
> getResourceInformationMapFromConfig(
> ...
> // NULL value here!
> String[] resourceNames = conf.getStrings(YarnConfiguration.RESOURCE_TYPES);
> {code}


