[jira] [Commented] (YARN-11387) [GPG] YARN GPG mistakenly deleted applicationid

ASF GitHub Bot (Jira) Tue, 26 Mar 2024 07:51:13 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-11387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830971#comment-17830971
 ]


ASF GitHub Bot commented on YARN-11387:
---------------------------------------

slfan1989 commented on code in PR #6660:
URL: https://github.com/apache/hadoop/pull/6660#discussion_r1539407981


##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-globalpolicygenerator/src/main/java/org/apache/hadoop/yarn/server/globalpolicygenerator/applicationcleaner/DefaultApplicationCleaner.java:
##########
@@ -46,47 +45,38 @@ public void run() {
     LOG.info("Application cleaner run at time {}", now);
 
     FederationStateStoreFacade facade = getGPGContext().getStateStoreFacade();

Review Comment:
   Step 1: Retrieve all applications stored in the StateStore, which represents 
all applications submitted to the Router.
   Step 2: Use the Router's REST API to fetch all running applications. This 
API queries every active SubCluster for its applications.
   Step 3: Compare the results of Step 1 and Step 2 to identify applications 
that exist in Step 1 but not in Step 2. Delete these applications.
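
   The three steps above amount to a set difference. The following is only a 
minimal sketch of that comparison, not the code in DefaultApplicationCleaner: 
the class name CleanerStepsSketch, the method appsToDelete, and the use of 
plain strings for application ids are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class CleanerStepsSketch {

  // Step 3 (hypothetical helper): everything known to the StateStore
  // that the Router did not report is treated as a deletion candidate.
  public static Set<String> appsToDelete(Set<String> stateStoreApps,
                                         Set<String> routerApps) {
    Set<String> candidates = new LinkedHashSet<>(stateStoreApps);
    candidates.removeAll(routerApps);
    return candidates;
  }

  public static void main(String[] args) {
    // Step 1: application ids read from the StateStore (illustrative values).
    Set<String> stateStoreApps =
        new LinkedHashSet<>(Arrays.asList("appX", "appY", "appZ"));
    // Step 2: application ids returned by the Router's REST API.
    Set<String> routerApps = new LinkedHashSet<>(Arrays.asList("appY"));

    System.out.println(appsToDelete(stateStoreApps, routerApps));
  }
}
```

   Note that the result is only as trustworthy as the Step 2 input: any 
application missing from the Router's answer ends up in the deletion set.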
   
   There is a potential issue with this approach. If a particular SubCluster is 
undergoing maintenance, such as an RM restart, Step 2 will not be able to fetch 
the complete list of running applications. As a result, the comparison in 
Step 3 risks mistakenly deleting applications that are still running.
   
   For example, suppose we have three SubClusters: subClusterA, subClusterB, 
and subClusterC, with an equal allocation ratio of 1:1:1.
   
   We submit six applications through routerA.
   
   app1 and app2 are allocated to subClusterA
   app3 and app4 to subClusterB
   app5 and app6 to subClusterC.
   Among these, app1, app3, and app5 have completed their execution, and we 
expect to retain app2, app4, and app6 in the StateStore.
   
   In the normal scenario, applying the steps above:
   
   Step 1: We will retrieve six applications [app1, app2, app3, app4, app5, 
app6] from the StateStore.
   Step 2: We will fetch three applications [app2, app4, app6] from the 
Router's REST interface.
   Step 3: By comparing Step 1 and Step 2, we can identify that applications 
[app1, app3, app5] should be deleted.
   
   In the exceptional scenario, applying the steps above:
   
   Step 1: We will retrieve six applications [app1, app2, app3, app4, app5, 
app6] from the StateStore.
   Step 2: We will fetch the list of running applications from the Router's 
REST interface. However, due to maintenance in subClusterB and subClusterC, we 
can only obtain the applications running in subClusterA [app2].
   Step 3: By comparing Step 1 and Step 2, we would conclude that applications 
[app1, app3, app4, app5, app6] should be deleted.
   
   In this case, app4 and app6 are deleted by mistake even though they are 
still running.
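
   The two scenarios can be reproduced with a small simulation. This is a 
sketch under stated assumptions, not the GPG or Router implementation: the 
class MistakenDeletionSketch, both helper methods, and the idea of modelling 
maintenance as a set of "reachable" SubClusters are hypothetical and exist 
only to illustrate the example above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class MistakenDeletionSketch {

  // Hypothetical model of Step 2: the Router only sees running
  // applications from SubClusters that currently respond.
  public static Set<String> runningAppsSeenByRouter(
      Map<String, List<String>> runningBySubCluster, Set<String> reachable) {
    Set<String> seen = new TreeSet<>();
    for (Map.Entry<String, List<String>> e : runningBySubCluster.entrySet()) {
      if (reachable.contains(e.getKey())) {
        seen.addAll(e.getValue());
      }
    }
    return seen;
  }

  // Step 3: StateStore applications that the Router did not report.
  public static Set<String> appsToDelete(Set<String> stateStoreApps,
                                         Set<String> routerApps) {
    Set<String> candidates = new TreeSet<>(stateStoreApps);
    candidates.removeAll(routerApps);
    return candidates;
  }

  public static void main(String[] args) {
    // Step 1: all six applications are recorded in the StateStore.
    Set<String> stateStore = new TreeSet<>(
        Arrays.asList("app1", "app2", "app3", "app4", "app5", "app6"));

    // app1, app3 and app5 have finished; one app still runs per SubCluster.
    Map<String, List<String>> running = new HashMap<>();
    running.put("subClusterA", Arrays.asList("app2"));
    running.put("subClusterB", Arrays.asList("app4"));
    running.put("subClusterC", Arrays.asList("app6"));

    Set<String> allUp = new HashSet<>(
        Arrays.asList("subClusterA", "subClusterB", "subClusterC"));
    Set<String> onlyA = Collections.singleton("subClusterA");

    // Normal scenario: prints [app1, app3, app5]
    System.out.println(
        appsToDelete(stateStore, runningAppsSeenByRouter(running, allUp)));
    // Exceptional scenario: prints [app1, app3, app4, app5, app6]
    System.out.println(
        appsToDelete(stateStore, runningAppsSeenByRouter(running, onlyA)));
  }
}
```

   The second result contains app4 and app6, which are still running in 
subClusterB and subClusterC; they appear in the deletion set only because 
those SubClusters could not be queried.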





> [GPG] YARN GPG mistakenly deleted applicationid
> -----------------------------------------------
>
>                 Key: YARN-11387
>                 URL: https://issues.apache.org/jira/browse/YARN-11387
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: federation
>    Affects Versions: 3.2.1, 3.4.0
>            Reporter: zhangjunj
>            Assignee: Shilun Fan
>            Priority: Major
>              Labels: federation, gpg, pull-request-available
>         Attachments: YARN-11387-YARN-11387.v1.patch, 
> yarn-gpg-mistakenly-deleted-applicationid.png
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In [YARN-7599|https://issues.apache.org/jira/browse/YARN-7599], the 
> Federation can delete expired applicationids, but YARN GPG uses the 
> getRouter() method to obtain application information for multiple clusters. 
> If there are too many applicationids (more than 200,000), it is not possible 
> to pull all the applicationid information at one time, which can result in 
> accidental deletion. The following error is reported for the Spark 
> component.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
