[ 
https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250153#comment-17250153
 ] 

Siddharth Ahuja edited comment on YARN-10528 at 12/16/20, 7:51 AM:
-------------------------------------------------------------------

I have made the behaviour similar to the {{reservation}} element in code.
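
As a rough illustration of the rule being enforced (a minimal sketch, not the patch itself): a queue counts as a parent if it has the type='parent' attribute or contains child queue elements, and such a queue must not carry a maxAMShare element. The class and method names below are hypothetical. The real check lives in AllocationFileQueueParser.loadQueue() and throws AllocationConfigurationException, whereas this sketch simply returns the name of the offending queue.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical stand-alone sketch; not the actual AllocationFileQueueParser code.
public class MaxAMShareCheckSketch {

  /** Returns the full name of the first parent queue that sets maxAMShare, or null if none does. */
  static String findInvalidQueue(Element queue, String parentPath) {
    String name = queue.getAttribute("name");
    String path = parentPath.isEmpty() ? name : parentPath + "." + name;
    // Explicitly tagged parent queue (getAttribute returns "" when the attribute is absent).
    boolean isParent = "parent".equals(queue.getAttribute("type"));
    boolean hasMaxAMShare = false;
    String invalidChild = null;
    NodeList children = queue.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node node = children.item(i);
      if (!(node instanceof Element)) {
        continue; // skip whitespace text nodes and comments
      }
      Element child = (Element) node;
      if ("queue".equals(child.getTagName())) {
        isParent = true; // implicit parent: it contains a child queue element
        if (invalidChild == null) {
          invalidChild = findInvalidQueue(child, path);
        }
      } else if ("maxAMShare".equals(child.getTagName())) {
        hasMaxAMShare = true;
      }
    }
    return (isParent && hasMaxAMShare) ? path : invalidChild;
  }

  /** Parses an XML snippet and returns its root element. */
  static Element parse(String xml) throws Exception {
    return DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
        .getDocumentElement();
  }

  public static void main(String[] args) throws Exception {
    String xml = "<queue name=\"root\">"
        + "<queue name=\"users\" type=\"parent\"><maxAMShare>0.76</maxAMShare></queue>"
        + "<queue name=\"default\"><maxAMShare>0.5</maxAMShare></queue>"
        + "</queue>";
    System.out.println(findInvalidQueue(parse(xml), "")); // prints "root.users"
  }
}
```

Both cases exercised in the testing that follows map onto this sketch: root.users is rejected because of its type attribute, and root.blah because it contains a child queue.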

I performed the following testing on a single-node cluster.

Start with the following FS XML:

{code}
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<allocations>
    <queue name="root">
        <weight>1.0</weight>
        <schedulingPolicy>drf</schedulingPolicy>
        <aclSubmitApps>*</aclSubmitApps>
        <aclAdministerApps>*</aclAdministerApps>
        <queue name="default">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
        </queue>
        <queue name="users" type="parent">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
            <maxAMShare>0.76</maxAMShare>
            <!-- root.users is a parent queue with maxAMShare set. This should not be possible. -->
        </queue>
        <queue name="blah">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
            <queue name="child">
                <weight>1.0</weight>
                <schedulingPolicy>drf</schedulingPolicy>
            </queue>
        </queue>
        <queue name="blah2" type="parent">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
            <queue name="child2">
                <weight>1.0</weight>
                <schedulingPolicy>drf</schedulingPolicy>
            </queue>
        </queue>
    </queue>
    <defaultQueueSchedulingPolicy>fair</defaultQueueSchedulingPolicy>
    <queueMaxAMShareDefault>0.75</queueMaxAMShareDefault>
    <queuePlacementPolicy>
        <rule name="specified" create="true"/>
        <rule name="nestedUserQueue" create="true">
            <rule name="default" create="true" queue="users"/>
        </rule>
    </queuePlacementPolicy>
</allocations>
{code}

Refresh YARN queues and observe the RM logs:

{code}
% bin/yarn rmadmin -refreshQueues
{code}

{code}
2020-12-16 18:12:29,665 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Failed to reload fair scheduler config file - will use existing allocations.
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.users are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element.
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.lambda$serviceInit$0(AllocationFileLoaderService.java:128)
    at java.lang.Thread.run(Thread.java:748)

2020-12-16 18:15:04,056 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Failed to reload allocations file
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.users are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element.
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1571)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:438)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:409)
    at org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:120)
    at org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:293)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:537)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966)
{code}

Now, update FS XML such that {{maxAMShare}} is not set for root.users but set 
for a parent queue which is not explicitly tagged as one with "type=parent":

{code}
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<allocations>
    <queue name="root">
        <weight>1.0</weight>
        <schedulingPolicy>drf</schedulingPolicy>
        <aclSubmitApps>*</aclSubmitApps>
        <aclAdministerApps>*</aclAdministerApps>
        <queue name="default">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
        </queue>
        <queue name="users" type="parent">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
        </queue>
        <queue name="blah">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
            <maxAMShare>0.76</maxAMShare>
            <!-- maxAMShare is set for root.blah, which is the parent queue of root.blah.child. This should not be possible either. -->
            <queue name="child">
                <weight>1.0</weight>
                <schedulingPolicy>drf</schedulingPolicy>
            </queue>
        </queue>
        <queue name="blah2" type="parent">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
            <queue name="child2">
                <weight>1.0</weight>
                <schedulingPolicy>drf</schedulingPolicy>
            </queue>
        </queue>
    </queue>
    <defaultQueueSchedulingPolicy>fair</defaultQueueSchedulingPolicy>
    <queueMaxAMShareDefault>0.75</queueMaxAMShareDefault>
    <queuePlacementPolicy>
        <rule name="specified" create="true"/>
        <rule name="nestedUserQueue" create="true">
            <rule name="default" create="true" queue="users"/>
        </rule>
    </queuePlacementPolicy>
</allocations>
{code}

{code}
% bin/yarn rmadmin -refreshQueues
{code}

{code}
2020-12-16 18:20:49,345 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Failed to reload allocations file
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.blah are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element.
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1571)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:438)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:409)
    at org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:120)
    at org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:293)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:537)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966)

2020-12-16 18:20:49,937 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Failed to reload fair scheduler config file - will use existing allocations.
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.blah are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element.
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.lambda$serviceInit$0(AllocationFileLoaderService.java:128)
    at java.lang.Thread.run(Thread.java:748)
{code}

Now stop and restart the RM. The RM should fail to start:

{code}
2020-12-16 18:20:49,343 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file file:/Users/sidtheadmin/Cloudera/hadoop/hadoop-dist/target/hadoop-3.4.0-SNAPSHOT/etc/hadoop/fair-scheduler.xml
2020-12-16 18:20:49,345 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Failed to reload allocations file
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.blah are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element.
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1571)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:438)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:409)
    at org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:120)
    at org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:293)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:537)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966)
2020-12-16 18:20:49,934 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file file:/Users/sidtheadmin/Cloudera/hadoop/hadoop-dist/target/hadoop-3.4.0-SNAPSHOT/etc/hadoop/fair-scheduler.xml
2020-12-16 18:20:49,937 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Failed to reload fair scheduler config file - will use existing allocations.
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.blah are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element.
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.lambda$serviceInit$0(AllocationFileLoaderService.java:128)
    at java.lang.Thread.run(Thread.java:748)
{code}

In summary, if the RM is running and a bad config is applied through 
refreshQueues, the new (bad) config is rejected and the RM continues to 
function with the old settings.

However, if the RM is restarted with a bad setting, it fails fast. Again, 
this behaviour is the same as for the {{reservation}} element.
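
The difference between the two paths can be sketched as follows. This is a simplified illustration under stated assumptions, not the FairScheduler or AllocationFileLoaderService code: the class name, the string standing in for the parsed allocations, and the "INVALID" marker used to simulate a rejected file are all hypothetical.

```java
// Hypothetical sketch of "keep the old settings on refresh, fail fast on startup".
public class ReloadSketch {
  // Stands in for the parsed allocation configuration currently in use.
  private String active;

  /** Hypothetical loader: rejects any input containing the marker "INVALID". */
  static String load(String raw) {
    if (raw.contains("INVALID")) {
      throw new IllegalArgumentException("invalid allocations: " + raw);
    }
    return raw;
  }

  /** Startup: there is no earlier configuration to fall back to, so the failure propagates. */
  void start(String raw) {
    active = load(raw);
  }

  /** Refresh: on failure, log the error and keep the configuration already in use. */
  boolean refresh(String raw) {
    try {
      active = load(raw);
      return true;
    } catch (IllegalArgumentException e) {
      System.err.println("Failed to reload - will use existing allocations: " + e.getMessage());
      return false;
    }
  }

  String active() {
    return active;
  }

  public static void main(String[] args) {
    ReloadSketch rm = new ReloadSketch();
    rm.start("v1");
    rm.refresh("INVALID v2");        // rejected; an error is logged
    System.out.println(rm.active()); // prints "v1"
  }
}
```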

FWIW, I deleted an existing newline in the {{loadQueue()}} method. Although 
this is unrelated to the fix for this issue, it was done to avoid the 
checkstyle error for a method exceeding 150 lines. It was not worth 
refactoring existing code to avoid this error, so the easiest way out was to 
delete the redundant newline.

I have also implemented the JUnits and tested them thoroughly.


> maxAMShare should only be accepted for leaf queues, not parent queues
> ---------------------------------------------------------------------
>
>                 Key: YARN-10528
>                 URL: https://issues.apache.org/jira/browse/YARN-10528
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Siddharth Ahuja
>            Assignee: Siddharth Ahuja
>            Priority: Major
>         Attachments: YARN-10528.001.patch, maxAMShare for root.users (parent 
> queue) has no effect as child queue does not inherit it.png
>
>
> Based on [Hadoop 
> documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html],
>  it is clear that {{maxAMShare}} property can only be used for *leaf queues*. 
> This is similar to the {{reservation}} setting.
> However, existing code only ensures that the reservation setting is not 
> accepted for "parent" queues (see 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L226
>  and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L233)
>  but it is missing the checks for {{maxAMShare}}. Due to this, it is 
> currently possible to have an allocation similar to below:
> {code}
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <allocations>
>     <queue name="root">
>         <weight>1.0</weight>
>         <schedulingPolicy>drf</schedulingPolicy>
>         <aclSubmitApps>*</aclSubmitApps>
>         <aclAdministerApps>*</aclAdministerApps>
>         <queue name="default">
>             <weight>1.0</weight>
>             <schedulingPolicy>drf</schedulingPolicy>
>         </queue>
>         <queue name="users" type="parent">
>             <weight>1.0</weight>
>             <schedulingPolicy>drf</schedulingPolicy>
>             <maxAMShare>1.0</maxAMShare>
>         </queue>
>     </queue>
>     <defaultQueueSchedulingPolicy>fair</defaultQueueSchedulingPolicy>
>     <queuePlacementPolicy>
>         <rule name="specified" create="true"/>
>         <rule name="nestedUserQueue" create="true">
>             <rule name="default" create="true" queue="users"/>
>         </rule>
>         <rule name="default"/>
>     </queuePlacementPolicy>
> </allocations>
> {code}
> where {{maxAMShare}} is 1.0f, meaning that it is possible to allocate 100% of the 
> queue's resources to Application Masters. Notice above that root.users is a 
> parent queue, yet it still accepts {{maxAMShare}}. This is 
> contrary to the documentation and is misleading, because 
> child queues such as root.users.<user> do not inherit this setting at 
> all and still use the default of 0.5 instead of 1.0; see the 
> attached screenshot for an example.


