[ https://issues.apache.org/jira/browse/YARN-11114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated YARN-11114:
----------------------------------
    Labels: pull-request-available  (was: )

> RMWebServices returns only apps matching exactly the submitted queue name
> -------------------------------------------------------------------------
>
> Key: YARN-11114
> URL: https://issues.apache.org/jira/browse/YARN-11114
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacity scheduler, webapp
> Reporter: Szilard Nemeth
> Assignee: Szilard Nemeth
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> I've added 2 testcases that demonstrate the issue with [this commit|https://github.com/szilard-nemeth/hadoop/commit/88dcf40f4dab564477542b8efb82f4f20d132eee].
> 1. With 'testAppsQueryByQueueShortname', there's a finishedApp submitted to "root.default" and a runningApp submitted to "default".
> The testcase queries the apps by queue name "default" and the response only contains the runningApp, which was submitted to "default"; the other app, submitted to "root.default", is not returned.
> 2. With 'testAppsQueryByQueueFullname', there's a finishedApp submitted to "root.default" and a runningApp submitted to "default" (same setup as above).
> The testcase queries the apps by queue name "root.default" (the full queue path) and the response only contains the finishedApp, which was submitted to "root.default"; the other app, submitted to "default", is not returned.
> The conclusion is that the response only includes applications whose queue name exactly matches the name the application was submitted with, either specified explicitly at submission or resolved by the placement engine.
> Before YARN-9879 was implemented, Capacity Scheduler could only define a leaf queue with a given name once in the whole hierarchy, meaning that leaf queue names were unique.
> For example, root.a.testQueue and root.b.testQueue couldn't coexist, as the leaf queue name is the same.
> At first, I suspected that YARN-9879 was causing this issue. However, before YARN-9879 was merged, CS didn't allow two leaf queues with the same name, so queries for both "root.default" and "default" could easily work, as it was guaranteed that there was only one "default" leaf queue in the hierarchy. I dug a bit further.
> I also noticed that YARN-8659 ([commit link|https://github.com/apache/hadoop/commit/7c13872cbbb6f1b0b1c2dde894885b41186b3797]) could have introduced this issue a long time ago, as it removed the iterator logic that queried the applications with the method YarnScheduler#getAppsInQueue (see [this|https://github.com/apache/hadoop/commit/7c13872cbbb6f1b0b1c2dde894885b41186b3797#diff-5b432bf3a8eb3e039878300ffb9db1f728226b9e3f63c4eb53be5ed5a833390aL843]).
> Let's follow the implementation of YarnScheduler#getAppsInQueue for CS:
> 1. First of all, [here|https://github.com/apache/hadoop/blob/4c05d257ba3f3311b5bbc993f6e5e35637487d88/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L2501-L2509] is the method definition. [CapacityScheduler#getQueue|https://github.com/apache/hadoop/blob/4c05d257ba3f3311b5bbc993f6e5e35637487d88/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L824-L829] is called from here.
> 2.
> [CapacityScheduler#getQueue|https://github.com/apache/hadoop/blob/4c05d257ba3f3311b5bbc993f6e5e35637487d88/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L824-L829] then calls [QueueManager#getQueue|https://github.com/apache/hadoop/blob/da09d68056d4e6a9490ddc6d9ae816b65217e117/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerQueueManager.java#L136-L138].
> 3. [QueueManager#getQueue|https://github.com/apache/hadoop/blob/da09d68056d4e6a9490ddc6d9ae816b65217e117/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerQueueManager.java#L136-L138] then calls [CSQueueStore#get|#get].
> 4. [CSQueueStore#get|#get] calls the getOrDefault method of the 'getMap' field [here|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueStore.java#L260].
> 4.1 CSQueueStore#getMap (field) stores the Queue objects mapped to their short and full names (e.g. 'default' and 'root.default'). [CSQueueStore#add|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueStore.java#L122-L152] is the method that is responsible for adding the CSQueue objects.
> 4.2 The first getMap.put call is invoked [here|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueStore.java#L134] with the full queue name.
> 4.3 The second getMap.put call is invoked via [CSQueueStore#updateGetMapForShortName|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueStore.java#L102-L120] [here|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueStore.java#L113].
> In conclusion, the app filtering by queues in [ClientRMService#getApplications|https://github.com/apache/hadoop/blob/d2869940094d330434f3e82d16b1cad3c6023437/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClientRMService.java#L880-L993] seems wrong to me.
> The block that filters by queues is [here|https://github.com/apache/hadoop/blob/d2869940094d330434f3e82d16b1cad3c6023437/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClientRMService.java#L915-L918].
> This should be enhanced by querying the apps from YarnScheduler#getAppsInQueue, as it ultimately handles both the short and the full queue names for CS.
> It's crucial to not just fall back to the logic that was replaced by YARN-8659 ([commit link|https://github.com/apache/hadoop/commit/7c13872cbbb6f1b0b1c2dde894885b41186b3797]).
> The original issue there was that rmContext.getRMApps() returns both running and finished apps, while scheduler.getAppsInQueue only returns running apps.
> h2. NOTES
> *NOTE #1:*
> As there's no way to get both the short queue name and the full queue name from RmApp / RmAppImpl, it's currently not possible to compare the queue filter of the RM client request with both types of queue names of the application.
> *NOTE #2:*
> scheduler.getAppsInQueue(queue) will only return running apps, so for running apps, it's possible to retrieve the apps by queue name, and it will work with both short and full names. However, for non-running apps, only the queue name the app was submitted with would work for filtering.
> *NOTE #3 (plan for implementation):*
> It would be completely reasonable to consider both running and non-running apps while querying; however, I think it never worked that way.
> Before YARN-8659, only running apps were considered, and before YARN-9879, both running and non-running apps were considered, but only the stored queue name (in RmAppImpl) was compared to the app filter's queue name, which was either the short or the full queue name.
> All in all, I don't want to change this behavior, and I also think it would make the code more convoluted if RmAppImpl stored both the short and the full queue names.
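
The behaviour shown by the two testcases, and the short/full name lookup described in steps 4.1 to 4.3, can be illustrated with a small standalone sketch. It is not Hadoop code and not the change proposed in this issue: `QueueNameFilterSketch`, `App`, `addQueue`, `resolve`, `filterExact` and `filterResolved` are hypothetical names made up for illustration. Dropping a short name once it becomes ambiguous is also an assumption of the sketch, not a statement about what CSQueueStore does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; none of these types are Hadoop classes. */
final class QueueNameFilterSketch {

  /** Minimal stand-in for an app and the queue name stored with it. */
  static final class App {
    final String id;
    final String storedQueue;

    App(String id, String storedQueue) {
      this.id = id;
      this.storedQueue = storedQueue;
    }
  }

  // Every registered full queue path, e.g. "root.default".
  private final Set<String> fullPaths = new HashSet<>();
  // Short name -> full path, kept only while the short name is unique.
  private final Map<String, String> fullPathByShortName = new HashMap<>();
  // Short names seen under more than one parent (assumption: unresolvable).
  private final Set<String> ambiguousShortNames = new HashSet<>();

  /** Registers a queue under its full path and, if unique, its short name. */
  void addQueue(String fullPath) {
    if (!fullPaths.add(fullPath)) {
      return; // already registered
    }
    String shortName = fullPath.substring(fullPath.lastIndexOf('.') + 1);
    if (ambiguousShortNames.contains(shortName)) {
      return;
    }
    String previous = fullPathByShortName.putIfAbsent(shortName, fullPath);
    if (previous != null) {
      // A second queue shares this short name, so stop resolving it.
      fullPathByShortName.remove(shortName);
      ambiguousShortNames.add(shortName);
    }
  }

  /** Returns the full path for a short or full name, or null if unknown. */
  String resolve(String name) {
    return fullPaths.contains(name) ? name : fullPathByShortName.get(name);
  }

  /** Exact string comparison, as observed in the two testcases. */
  static List<App> filterExact(List<App> apps, String requestedQueue) {
    List<App> result = new ArrayList<>();
    for (App app : apps) {
      if (app.storedQueue.equals(requestedQueue)) {
        result.add(app);
      }
    }
    return result;
  }

  /** Compares resolved full paths, so short and full names both match. */
  List<App> filterResolved(List<App> apps, String requestedQueue) {
    List<App> result = new ArrayList<>();
    String wanted = resolve(requestedQueue);
    if (wanted == null) {
      return result;
    }
    for (App app : apps) {
      if (wanted.equals(resolve(app.storedQueue))) {
        result.add(app);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    QueueNameFilterSketch queues = new QueueNameFilterSketch();
    queues.addQueue("root");
    queues.addQueue("root.default");
    List<App> apps = Arrays.asList(
        new App("finishedApp", "root.default"),
        new App("runningApp", "default"));
    System.out.println(filterExact(apps, "default").size());           // 1
    System.out.println(queues.filterResolved(apps, "default").size()); // 2
  }
}
```

With the setup from the testcases, the exact comparison returns one app per query, while the resolved comparison returns both apps for either "default" or "root.default". The sketch only makes the difference between the two comparisons concrete; it says nothing about running versus finished apps, which is the concern raised in the notes.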