Szilard Nemeth created YARN-10787:
-------------------------------------

             Summary: Queue submit ACL check is wrong when CS queue is ambiguous
                 Key: YARN-10787
                 URL: https://issues.apache.org/jira/browse/YARN-10787
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Szilard Nemeth
            Assignee: Gergely Pollak


Let's suppose we have a Capacity Scheduler configuration with 2 or more leaf 
queues with the same name in the queue hierarchy. That's what we call an 
ambiguous queue name.
Let's also enable ACL checks and define acl_submit_applications / 
acl_administer_queue configs with the correct value, adding the username to the 
ACL value there.


Here's a minimalistic YARN + CS config:

1. YARN config snippet: 
{code}
<property><name>yarn.acl.enable</name><value>true</value></property>
{code}


2. CS config snippet:
{code}
<property>
        <name>yarn.scheduler.capacity.root.someparent1.queues</name>
        <value>anyotherqueue1,somequeue,anyotherqueue2</value>
</property>
<property>
        <name>yarn.scheduler.capacity.root.someparent2.queues</name>
        <value>anyotherqueue3,somequeue,anyotherqueue4</value>
</property>
<property>
        <name>yarn.scheduler.capacity.root.someparent1.somequeue.acl_submit_applications</name>
        <value>someuser1 </value>
</property>
<property>
        <name>yarn.scheduler.capacity.root.someparent2.somequeue.acl_submit_applications</name>
        <value>someuser1 </value>
</property>
<property>
        <name>yarn.scheduler.capacity.root.someparent1.somequeue.acl_administer_queue</name>
        <value>someuser1 </value>
</property>
<property>
        <name>yarn.scheduler.capacity.root.someparent2.somequeue.acl_administer_queue</name>
        <value>someuser1 </value>
</property>
{code}

So in this case, we have an ambiguous queue named "somequeue" under 2 different 
paths: 
- root.someparent1.somequeue
- root.someparent2.somequeue

When a user correctly submits an application with the full queue path, e.g. 
root.someparent1.somequeue, YARN still fails to place the application into 
that queue: the full path is replaced with the short name 'somequeue', which is 
ambiguous, and the submission is rejected by the ACL check.
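To make the notion of an ambiguous short name concrete, here is a minimal, self-contained sketch. It is not the actual Capacity Scheduler queue store; the class and method names (AmbiguousQueueLookupSketch, addQueue, resolve) are hypothetical. It only assumes that a name can be resolved when it matches exactly one full path, and that the lookup yields null otherwise.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: NOT the real CapacityScheduler queue store. */
class AmbiguousQueueLookupSketch {
  // Maps every full path and every short name to the full paths it may refer to.
  private final Map<String, Set<String>> index = new HashMap<>();

  void addQueue(String fullPath) {
    String shortName = fullPath.substring(fullPath.lastIndexOf('.') + 1);
    index.computeIfAbsent(fullPath, k -> new TreeSet<>()).add(fullPath);
    index.computeIfAbsent(shortName, k -> new TreeSet<>()).add(fullPath);
  }

  /** Returns the full path for a name, or null if it is unknown or ambiguous. */
  String resolve(String name) {
    Set<String> candidates = index.get(name);
    if (candidates == null || candidates.size() != 1) {
      return null;
    }
    return candidates.iterator().next();
  }

  public static void main(String[] args) {
    AmbiguousQueueLookupSketch queues = new AmbiguousQueueLookupSketch();
    queues.addQueue("root.someparent1.somequeue");
    queues.addQueue("root.someparent2.somequeue");
    queues.addQueue("root.someparent1.anyotherqueue1");

    // A full path always identifies a single queue.
    System.out.println(queues.resolve("root.someparent1.somequeue"));
    // A short name that is unique in the hierarchy resolves as well.
    System.out.println(queues.resolve("anyotherqueue1"));
    // 'somequeue' exists under two parents, so the lookup yields null.
    System.out.println(queues.resolve("somequeue"));
  }
}
```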



3. LOG SNIPPET
{code}
2021-05-20 22:04:32,031 DEBUG org.apache.hadoop.yarn.server.resourcemanager.placement.CSMappingPlacementRule: Placement final result 'root.someparent1.somequeue' for application 'application_1621540945412_0001'
2021-05-20 22:04:32,031 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Placed application with ID application_1621540945412_0001 in queue: somequeue, original submission queue was: root.someparent1.somequeue
2021-05-20 22:04:32,031 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Ambiguous queue reference: somequeue please use full queue path instead.
2021-05-20 22:04:32,031 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application 'application_1621540945412_0001' is submitted without priority hence considering default queue/cluster priority: 0
2021-05-20 22:04:32,032 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Priority '0' is acceptable in queue : somequeue for application: application_1621540945412_0001
2021-05-20 22:04:32,993 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Exception in submitting application_1621540945412_0001
org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User someuser1 does not have permission to submit application_1621540945412_0001 to queue somequeue
        at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
{code}

4. FULL STACKTRACE:
{code}
org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User someuser1 does not have permission to submit application_1621540945412_0001 to queue somequeue
        at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:433)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:330)
        at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:650)
        at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277)
        at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894)
Caused by: org.apache.hadoop.security.AccessControlException: User someuser1 does not have permission to submit application_1621540945412_0001 to queue somequeue
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:436)
        ... 12 more
2021-05-20 22:04:32,994 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=someuser1        IP=172.17.61.133        OPERATION=Submit Application Request    TARGET=ClientRMService  RESULT=FAILURE  DESCRIPTION=Exception in submitting application PERMISSIONS=org.apache.hadoop.security.AccessControlException: User someuser1 does not have permission to submit application_1621540945412_0001 to queue somequeue      APPID=application_1621540945412_0001    QUEUENAME=somequeue
{code}



DETAILS:


1. The whole thing happens in RMAppManager#createAndPopulateNewRMApp:
Class / method: 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager#createAndPopulateNewRMApp
Link: 
https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L407



2. RMAppManager#copyPlacementQueueToSubmissionContext is called for new 
applications, i.e. when an application is submitted in the normal way and is 
not being recovered:
Class / method: 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager#copyPlacementQueueToSubmissionContext

[Called at|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L420]

[Method link|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L991]

The problem is that copyPlacementQueueToSubmissionContext sets the queue of 
context (an ApplicationSubmissionContext object) from placementContext.getQueue 
(placementContext is an ApplicationPlacementContext object). If placementContext 
holds the queue name in the short form, this overrides the original submission 
queue value, even if that value was the full queue path.
An example of a log generated by this method: 
{code}
2021-05-20 22:04:32,031 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Placed application with ID application_1621540945412_0001 in queue: somequeue, original submission queue was: root.someparent1.somequeue
{code}
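The override can be illustrated with a small, self-contained sketch. The classes below are simplified stand-ins, not the real ApplicationSubmissionContext / ApplicationPlacementContext, and overrideSubmissionQueue is a hypothetical name; the sketch only assumes that the placement context stores the leaf name and the parent path separately and that the method replaces the submission queue whenever the two names differ.

```java
/** Simplified stand-ins for the Hadoop classes; all names here are hypothetical. */
class QueueOverrideSketch {
  static final class SubmissionContext {
    private String queue;
    SubmissionContext(String queue) { this.queue = queue; }
    String getQueue() { return queue; }
    void setQueue(String queue) { this.queue = queue; }
  }

  static final class PlacementContext {
    private final String queue;       // leaf name, possibly in short form
    private final String parentQueue; // parent path, may be null
    PlacementContext(String queue, String parentQueue) {
      this.queue = queue;
      this.parentQueue = parentQueue;
    }
    String getQueue() { return queue; }
    String getFullQueuePath() {
      return parentQueue == null ? queue : parentQueue + "." + queue;
    }
  }

  /** Overrides (does not copy) the submission queue with the placement queue. */
  static void overrideSubmissionQueue(PlacementContext placement,
                                      SubmissionContext context) {
    if (placement != null
        && !placement.getQueue().equalsIgnoreCase(context.getQueue())) {
      context.setQueue(placement.getQueue());
    }
  }

  public static void main(String[] args) {
    SubmissionContext context =
        new SubmissionContext("root.someparent1.somequeue");
    PlacementContext placement =
        new PlacementContext("somequeue", "root.someparent1");

    overrideSubmissionQueue(placement, context);

    // The full path the user submitted with is lost from the submission context.
    System.out.println(context.getQueue());
    // The placement context can still reconstruct it.
    System.out.println(placement.getFullQueuePath());
  }
}
```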



3. The problematic code block is here: 
[Code block|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L446-L475]

3.1 First, the short queue name is read from submissionContext, as it was 
overridden by 'copyPlacementQueueToSubmissionContext': 
[Link|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L448]
This is bad design, as the code relies on the queue name having been 
overridden in the submission context object.

3.2 Since the queue name is in the short form and is ambiguous, the call to 
[scheduler.getQueue()|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L450] 
returns null. This is by design: if the queue name is ambiguous, getQueue() 
returns null.

3.3 The condition that checks whether csqueue is null AND placementContext is 
not null evaluates to true 
[here|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L452].

3.4 The parent queue is then looked up in CS by the parent queue name taken 
from the placement context: 
[Link|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L456]

3.5 Finally, the ACL check fails because csqueue is now the queue object of the 
parent of 'root.someparent1.somequeue', i.e. the queue 'root.someparent1'.
In this case, the user doesn't have a submission ACL set for the parent queue, 
only for the leaf queue, so the ACL check fails.
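Steps 3.1 to 3.5 can be reproduced with a small, self-contained simulation. This is a sketch under stated assumptions, not the RMAppManager code: the ACL table, getQueue and queueUsedForAclCheck are hypothetical stand-ins, ACL inheritance from parent queues is deliberately ignored, and the parent queues are assumed to grant nothing to someuser1.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Illustrative simulation only; none of these names exist in Hadoop. */
class AclCheckFlowSketch {
  // Full queue path -> users allowed to submit (mirrors the config in this report).
  static final Map<String, Set<String>> SUBMIT_ACL = new HashMap<>();
  static {
    SUBMIT_ACL.put("root.someparent1", Collections.<String>emptySet());
    SUBMIT_ACL.put("root.someparent2", Collections.<String>emptySet());
    SUBMIT_ACL.put("root.someparent1.somequeue", Collections.singleton("someuser1"));
    SUBMIT_ACL.put("root.someparent2.somequeue", Collections.singleton("someuser1"));
  }

  /** Stand-in for scheduler.getQueue(): null for unknown or ambiguous names. */
  static String getQueue(String name) {
    if (SUBMIT_ACL.containsKey(name)) {
      return name; // exact full path
    }
    String match = null;
    for (String path : SUBMIT_ACL.keySet()) {
      if (path.endsWith("." + name)) {
        if (match != null) {
          return null; // second match: the short name is ambiguous
        }
        match = path;
      }
    }
    return match;
  }

  /** Returns the queue that ends up being ACL-checked in the flow described above. */
  static String queueUsedForAclCheck(String queueInSubmissionContext,
                                     String placementParentQueue) {
    String csQueue = getQueue(queueInSubmissionContext);     // 3.1 and 3.2
    if (csQueue == null && placementParentQueue != null) {   // 3.3
      csQueue = getQueue(placementParentQueue);              // 3.4
    }
    return csQueue;
  }

  static boolean canSubmit(String user, String queue) {
    return queue != null
        && SUBMIT_ACL.getOrDefault(queue, Collections.<String>emptySet()).contains(user);
  }

  public static void main(String[] args) {
    // The submission context holds the short name after the override.
    String checked = queueUsedForAclCheck("somequeue", "root.someparent1");
    System.out.println("ACL checked against: " + checked);
    System.out.println("Allowed: " + canSubmit("someuser1", checked)); // 3.5
  }
}
```

In this simulation the check runs against 'root.someparent1' rather than the leaf queue, so the user is rejected even though the leaf queue ACL lists them.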


LIST OF THINGS TO FIX AND DO:
- Add a unit testcase that replicates the above config and the issue.
- Rename copyPlacementQueueToSubmissionContext: This method does not really 
copy anything, it simply overrides the queue value.

- Add a debug log to print the csqueue object before the authorization code: 
[Auth code block|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L459-L475]

- Fix log messages: As 'copyPlacementQueueToSubmissionContext' overrides 
(rather than copies) the original queue name with the queue name from the 
PlacementContext, all calls to submissionContext.getQueue() return the short 
queue name. This results in very misleading log messages, including the 
exception message itself: 
{code}
org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User someuser1 does not have permission to submit application_1621540945412_0001 to queue somequeue
{code}
All log messages should print the original submission queue, if possible.

- Actual code fix for the issue: Use the full queue path to get the queue 
object. Again, this is the code block where the fix should happen: 
[LINK|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L447-L458]

'queueName' should have its value set from: 
org.apache.hadoop.yarn.server.resourcemanager.placement.ApplicationPlacementContext#getFullQueuePath.

The equivalent of this in the linked code block:
{code}
placementContext.getFullQueuePath()
{code}
This should happen only if placementContext is not null.
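A minimal sketch of that null-guarded selection follows. PlacementContext here is a simplified stand-in for ApplicationPlacementContext, and queueNameForLookup is a hypothetical helper name; the real change belongs in the linked RMAppManager block and may look different.

```java
/** Sketch of the proposed name selection; class and method names are hypothetical. */
class FullQueuePathFixSketch {
  static final class PlacementContext {
    private final String queue;       // leaf name, possibly in short form
    private final String parentQueue; // parent path, may be null
    PlacementContext(String queue, String parentQueue) {
      this.queue = queue;
      this.parentQueue = parentQueue;
    }
    String getFullQueuePath() {
      return parentQueue == null ? queue : parentQueue + "." + queue;
    }
  }

  /** Prefers the full path from the placement context when one is available. */
  static String queueNameForLookup(String submissionQueue,
                                   PlacementContext placementContext) {
    return placementContext != null
        ? placementContext.getFullQueuePath()
        : submissionQueue;
  }

  public static void main(String[] args) {
    PlacementContext placement =
        new PlacementContext("somequeue", "root.someparent1");
    // With a placement context, the unambiguous full path is used for the lookup.
    System.out.println(queueNameForLookup("somequeue", placement));
    // Without one, the queue name from the submission context is kept as-is.
    System.out.println(queueNameForLookup("somequeue", null));
  }
}
```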


LONG TERM FIX: 
Investigate if it's possible to eliminate copyPlacementQueueToSubmissionContext.
This could introduce nasty backward-incompatibility issues with recovery, so it 
should be thought through really carefully.



