[ https://issues.apache.org/jira/browse/YARN-11387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830971#comment-17830971 ]
ASF GitHub Bot commented on YARN-11387:
---------------------------------------

slfan1989 commented on code in PR #6660:
URL: https://github.com/apache/hadoop/pull/6660#discussion_r1539407981


##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-globalpolicygenerator/src/main/java/org/apache/hadoop/yarn/server/globalpolicygenerator/applicationcleaner/DefaultApplicationCleaner.java:
##########
@@ -46,47 +45,38 @@ public void run() {
     LOG.info("Application cleaner run at time {}", now);

     FederationStateStoreFacade facade = getGPGContext().getStateStoreFacade();

Review Comment:
   Step 1: Retrieve all applications stored in the StateStore, which represents all applications submitted to the Router.
   Step 2: Use the Router's REST API to fetch all running applications. This API queries the applications of all active SubClusters.
   Step 3: Compare the results of Step 1 and Step 2 to identify applications that exist in Step 1 but not in Step 2. Delete these applications.

   There is a potential issue with this approach. If a particular SubCluster is undergoing maintenance, such as an RM restart, Step 2 will not be able to fetch the complete list of running applications. As a result, the comparison in Step 3 risks mistakenly deleting applications that are still running.

   For example, suppose we have three SubClusters: subClusterA, subClusterB, and subClusterC, with an equal allocation ratio of 1:1:1. We submit six applications through routerA:
   app1 and app2 are allocated to subClusterA,
   app3 and app4 to subClusterB,
   app5 and app6 to subClusterC.

   Among these, app1, app3, and app5 have completed their execution, and we expect to retain app2, app4, and app6 in the StateStore.

   In the normal scenario, walking through the steps above:
   Step 1: We will retrieve six applications [app1, app2, app3, app4, app5, app6] from the StateStore.
   Step 2: We will fetch three applications [app2, app4, app6] from the Router's REST interface.
   Step 3: By comparing Step 1 and Step 2, we can identify that applications [app1, app3, app5] should be deleted.

   In the exceptional scenario, walking through the same steps:
   Step 1: We will retrieve six applications [app1, app2, app3, app4, app5, app6] from the StateStore.
   Step 2: We will fetch the list of running applications from the Router's REST interface. However, due to maintenance in subClusterB and subClusterC, we can only obtain the applications running in subClusterA [app2].
   Step 3: By comparing Step 1 and Step 2, we would conclude that applications [app1, app3, app4, app5, app6] should be deleted.

   In this case, app4 and app6 are deleted by mistake even though they are still running.


> [GPG] YARN GPG mistakenly deleted applicationid
> -----------------------------------------------
>
>                 Key: YARN-11387
>                 URL: https://issues.apache.org/jira/browse/YARN-11387
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: federation
>    Affects Versions: 3.2.1, 3.4.0
>            Reporter: zhangjunj
>            Assignee: Shilun Fan
>            Priority: Major
>              Labels: federation, gpg, pull-request-available
>         Attachments: YARN-11387-YARN-11387.v1.patch,
> yarn-gpg-mistakenly-deleted-applicationid.png
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In [YARN-7599|https://issues.apache.org/jira/browse/YARN-7599], the
> Federation can delete expired applicationids, but YARN GPG uses the getRouter()
> method to obtain application information for multiple clusters. If there are
> too many applicationids (more than 200,000), it will not be possible to
> pull all the applicationid information at one time, resulting in the
> possibility of accidental deletion. The following error is reported for the Spark
> component.
>
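
The comparison described in the review comment (Steps 1 to 3) amounts to a set difference, and the two scenarios can be reproduced with a minimal, self-contained Java sketch. This is only an illustration of the reasoning, not the code in DefaultApplicationCleaner: the class name `CleanerComparisonSketch` and the method `appsToDelete` are hypothetical, and plain strings stand in for application IDs.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not part of the Hadoop code base.
class CleanerComparisonSketch {

  // Step 3: applications recorded in the StateStore that the Router
  // did not report as running are treated as deletable.
  static Set<String> appsToDelete(Set<String> stateStoreApps,
      Set<String> routerRunningApps) {
    Set<String> candidates = new TreeSet<>(stateStoreApps);
    candidates.removeAll(routerRunningApps);
    return candidates;
  }

  public static void main(String[] args) {
    // Step 1: all applications recorded in the StateStore.
    Set<String> stateStore = new TreeSet<>(
        List.of("app1", "app2", "app3", "app4", "app5", "app6"));

    // Step 2, normal scenario: all three SubClusters respond.
    Set<String> allSubClustersUp = Set.of("app2", "app4", "app6");
    // prints "normal: [app1, app3, app5]"
    System.out.println("normal: " + appsToDelete(stateStore, allSubClustersUp));

    // Step 2, exceptional scenario: only subClusterA responds.
    Set<String> onlySubClusterA = Set.of("app2");
    // prints "exceptional: [app1, app3, app4, app5, app6]"
    System.out.println("exceptional: " + appsToDelete(stateStore, onlySubClusterA));
  }
}
```

In the second call the result contains app4 and app6, which are still running on the SubClusters that could not be reached. The set difference cannot tell "finished" apart from "not reported", which is the risk raised in the comment.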