[
https://issues.apache.org/jira/browse/YARN-11114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shilun Fan updated YARN-11114:
------------------------------
Target Version/s: 3.4.0
Affects Version/s: 3.4.0
> RMWebServices returns only apps matching exactly the submitted queue name
> -------------------------------------------------------------------------
>
> Key: YARN-11114
> URL: https://issues.apache.org/jira/browse/YARN-11114
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacity scheduler, webapp
> Affects Versions: 3.4.0
> Reporter: Szilard Nemeth
> Assignee: Szilard Nemeth
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> I've added 2 testcases that demonstrate the issue with [this
> commit|https://github.com/szilard-nemeth/hadoop/commit/88dcf40f4dab564477542b8efb82f4f20d132eee].
> 1. With 'testAppsQueryByQueueShortname', there's a finishedApp submitted to
> "root.default" and there's a runningApp that is submitted to "default".
> The testcase queries the apps by queue name "default" and the response only
> contains the runningApp, which is submitted to "default" so the other app
> that is submitted to "root.default" is not returned.
> 2. With 'testAppsQueryByQueueFullname', there's a finishedApp submitted to
> "root.default" and there's a runningApp that is submitted to "default" (same
> setup as above).
> The testcase queries the apps by queue name "root.default" (which is the full
> queue path) and the response only contains the finishedApp, which is
> submitted to "root.default" so the other app that is submitted to "default"
> is not returned.
> A trivial conclusion of this is that the response only includes applications
> whose stored queue name exactly matches the queried queue name, where the
> stored name is either specified explicitly at submission or resolved by the
> placement engine.
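
The behaviour described by the two testcases can be reproduced with a minimal sketch. This is not the ResourceManager code: the App class and the filterByQueue method are hypothetical stand-ins, and the only assumption taken from the description is that the filter compares the requested queue name with the single queue name stored for each application.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ExactQueueMatchSketch {
  /** Hypothetical stand-in for an application and the one queue name stored with it. */
  static final class App {
    final String id;
    final String storedQueue;

    App(String id, String storedQueue) {
      this.id = id;
      this.storedQueue = storedQueue;
    }
  }

  /** Keeps only apps whose stored queue name is literally one of the requested names. */
  static List<String> filterByQueue(List<App> apps, Set<String> requestedQueues) {
    List<String> ids = new ArrayList<>();
    for (App app : apps) {
      if (requestedQueues.contains(app.storedQueue)) {
        ids.add(app.id);
      }
    }
    return ids;
  }

  public static void main(String[] args) {
    // Same setup as both testcases: one app per spelling of the same queue.
    List<App> apps = Arrays.asList(
        new App("finishedApp", "root.default"),
        new App("runningApp", "default"));

    // Short name query: only the app stored under "default" matches.
    System.out.println(filterByQueue(apps, new HashSet<>(Arrays.asList("default"))));
    // Full path query: only the app stored under "root.default" matches.
    System.out.println(filterByQueue(apps, new HashSet<>(Arrays.asList("root.default"))));
  }
}
```

Both spellings refer to the same queue, yet each query returns only one of the two apps, because nothing in the comparison resolves a short name to a full path or the other way round.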
> Before YARN-9879 was implemented, Capacity Scheduler was only capable of
> defining a leaf queue with a specific name in the whole hierarchy once,
> meaning that leaf queue names were unique.
> For example root.a.testQueue and root.b.testQueue couldn't coexist, as the
> leaf queue name is the same.
> At this point, I supposed that YARN-9879 is causing this issue, but as the
> behaviour of CS before YARN-9879 was merged didn't allow two leaf queues with
> the same name, a query of "root.default" and "default" could easily work as
> it was guaranteed that there's not another "default" leaf queue in the
> hierarchy, just one. I dug a bit further.
> I also noticed that YARN-8659 ([commit
> link|https://github.com/apache/hadoop/commit/7c13872cbbb6f1b0b1c2dde894885b41186b3797])
> could have introduced this issue a long time ago, as it removed the iterator
> logic that queried the applications with method YarnScheduler#getAppsInQueue
> (see
> [this|https://github.com/apache/hadoop/commit/7c13872cbbb6f1b0b1c2dde894885b41186b3797#diff-5b432bf3a8eb3e039878300ffb9db1f728226b9e3f63c4eb53be5ed5a833390aL843]).
> Let's follow the implementation of YarnScheduler#getAppsInQueue for CS:
> 1. First of all,
> [here|https://github.com/apache/hadoop/blob/4c05d257ba3f3311b5bbc993f6e5e35637487d88/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L2501-L2509]
> is the method definition.
> [CapacityScheduler#getQueue|https://github.com/apache/hadoop/blob/4c05d257ba3f3311b5bbc993f6e5e35637487d88/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L824-L829]
> is called from here.
> 2.
> [CapacityScheduler#getQueue|https://github.com/apache/hadoop/blob/4c05d257ba3f3311b5bbc993f6e5e35637487d88/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L824-L829]
> is then calling
> [QueueManager#getQueue|https://github.com/apache/hadoop/blob/da09d68056d4e6a9490ddc6d9ae816b65217e117/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerQueueManager.java#L136-L138].
> 3.
> [QueueManager#getQueue|https://github.com/apache/hadoop/blob/da09d68056d4e6a9490ddc6d9ae816b65217e117/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerQueueManager.java#L136-L138]
> is then calling [CSQueueStore#get|#get].
> 4. [CSQueueStore#get|#get] calls the 'getMap' field's getOrDefault method
> [here|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueStore.java#L260].
> 4.1 CSQueueStore#getMap (field) stores the Queue objects mapped to their
> short and full names (e.g. 'default' and 'root.default').
> [CSQueueStore#add|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueStore.java#L122-L152]
> is the method that is responsible for adding the CSQueue objects.
> 4.2 The first getMap.put call is invoked
> [here|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueStore.java#L134]
> with the full queue name.
> 4.3 The second getMap.put call is invoked via
> [CSQueueStore#updateGetMapForShortName|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueStore.java#L102-L120]
>
> [here|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueStore.java#L113].
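
Steps 4 to 4.3 boil down to one map that holds each queue under two keys. The sketch below illustrates that idea only; it is not the CSQueueStore source. The class name is hypothetical, the map stores the full path as a String instead of a CSQueue object, and it assumes leaf queue names are unique, so whatever the real class does for duplicate short names is not modelled.

```java
import java.util.HashMap;
import java.util.Map;

class DualNameQueueStoreSketch {
  // Hypothetical simplification: maps a queue name (short or full) to the
  // full queue path, where the real store maps names to queue objects.
  private final Map<String, String> getMap = new HashMap<>();

  /** Registers a queue under its full path and under its short (leaf) name. */
  void add(String fullPath) {
    // First put: keyed by the full queue name.
    getMap.put(fullPath, fullPath);
    // Second put: keyed by the short name, i.e. the last path segment.
    String shortName = fullPath.substring(fullPath.lastIndexOf('.') + 1);
    getMap.put(shortName, fullPath);
  }

  /** Looks a queue up by either spelling; returns null when it is unknown. */
  String get(String name) {
    return getMap.getOrDefault(name, null);
  }

  public static void main(String[] args) {
    DualNameQueueStoreSketch store = new DualNameQueueStoreSketch();
    store.add("root.default");
    System.out.println(store.get("default"));
    System.out.println(store.get("root.default"));
  }
}
```

With a store of this shape, a lookup by "default" and a lookup by "root.default" resolve to the same queue, which is why a scheduler-side query can accept either spelling.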
> As a conclusion, in
> [ClientRMService#getApplications|https://github.com/apache/hadoop/blob/d2869940094d330434f3e82d16b1cad3c6023437/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClientRMService.java#L880-L993],
> the app filtering by queues seems wrong to me.
> The block that filters by queues is
> [here|https://github.com/apache/hadoop/blob/d2869940094d330434f3e82d16b1cad3c6023437/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClientRMService.java#L915-L918].
> This should be enhanced by querying the apps from
> YarnScheduler#getAppsInQueue, as it handles both the short and full queue
> names for CS in the end.
> It's crucial not to simply fall back to the logic that was replaced by
> YARN-8659 ([commit
> link|https://github.com/apache/hadoop/commit/7c13872cbbb6f1b0b1c2dde894885b41186b3797]),
> because the original issue there was that rmContext.getRMApps() returns both
> running and finished apps, while scheduler.getAppsInQueue only returns
> running apps.
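
One possible reading of this proposal is sketched below. It is not the committed fix. All names are hypothetical: storedQueueByApp stands in for the queue name kept with each application, and runningAppsInQueue stands in for a scheduler lookup that accepts short and full names but only knows running apps. The sketch assumes the lookup returns an empty list for an unknown queue.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class CombinedQueueFilterSketch {
  /**
   * Collects app ids for a queue from two sources: the scheduler-style lookup
   * (running apps, any spelling) and the stored queue name (all apps, exact match).
   */
  static Set<String> appsForQueue(
      String requestedQueue,
      Map<String, String> storedQueueByApp,
      Function<String, List<String>> runningAppsInQueue) {
    Set<String> result = new LinkedHashSet<>();
    // Running apps: the lookup resolves both short and full queue names.
    result.addAll(runningAppsInQueue.apply(requestedQueue));
    // All apps, finished ones included: only the stored name can be compared.
    for (Map.Entry<String, String> entry : storedQueueByApp.entrySet()) {
      if (entry.getValue().equals(requestedQueue)) {
        result.add(entry.getKey());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> stored = new LinkedHashMap<>();
    stored.put("finishedApp", "root.default");
    stored.put("runningApp", "default");

    // Hypothetical scheduler stand-in: one running app, reachable by both names.
    Function<String, List<String>> lookup = queue ->
        (queue.equals("default") || queue.equals("root.default"))
            ? Arrays.asList("runningApp")
            : Collections.<String>emptyList();

    System.out.println(appsForQueue("default", stored, lookup));
    System.out.println(appsForQueue("root.default", stored, lookup));
  }
}
```

In this sketch the running app is found under either spelling, while the finished app is still found only under the name it was stored with, so finished apps keep the exact-match limitation.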
> h2. NOTES
> *NOTE #1:*
> As there's no way to get both the short queue name and the full queue name
> from RmApp / RmAppImpl, it's currently not possible to compare the queue
> filter of the RM client request with both types of queue names of the
> application.
> *NOTE #2:*
> scheduler.getAppsInQueue(queue) will only return running apps, so for running
> apps, it's possible to retrieve the apps by queue name, and it will work with
> both short and full names. However, for non-running apps, only the submitted
> queue name would work for filtering.
> *NOTE #3 (plan for implementation):*
> It would be completely reasonable to consider both running and non-running
> apps while querying; however, I think it never worked that way.
> Before YARN-8659, only running apps were considered and before YARN-9879,
> both running + non-running apps were considered but only the stored queue
> name (in RmAppImpl) was compared to the app filter's queue name, which was
> either the short or the full queue name.
> All in all, I don't want to change this behavior, and I also think it would
> make the code more convoluted if RmAppImpl stored both the short and the full
> queue names.