wangzhihui created YARN-11626:
---------------------------------

             Summary: Optimization of the safeDelete operation in ZKRMStateStore
                 Key: YARN-11626
                 URL: https://issues.apache.org/jira/browse/YARN-11626
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: resourcemanager
    Affects Versions: 3.3.0, 3.1.1, 3.0.0-alpha4
            Reporter: wangzhihui


h1. Description 
 * The log below shows that removal of the app info started at 06:17:20, but 
the NoNodeException was received at 06:17:35. 
 * During that 15s interval, Curator was retrying the metadata operation. 
Because a ZooKeeper delete is not idempotent, one of the attempts succeeded on 
the server but its response never reached the client. The next retry then 
failed with a NoNodeException, which triggered the STATE_STORE_FENCED event 
and ultimately caused the current ResourceManager to switch to standby.

{code:java}
2023-10-28 06:17:20,359 INFO  recovery.RMStateStore 
(RMStateStore.java:transition(333)) - Removing info for app: 
application_1697410508608_140368
2023-10-28 06:17:20,359 INFO  resourcemanager.RMAppManager 
(RMAppManager.java:checkAppNumCompletedLimit(303)) - Application should be 
expired, max number of completed apps kept in memory met: 
maxCompletedAppsInMemory = 1000, removing app application_1697410508608_140368 
from memory:
2023-10-28 06:17:35,665 ERROR recovery.RMStateStore 
(RMStateStore.java:transition(337)) - Error removing app: 
application_1697410508608_140368
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
2023-10-28 06:17:35,666 INFO  recovery.RMStateStore 
(RMStateStore.java:handleStoreEvent(1147)) - RMStateStore state change from 
ACTIVE to FENCED
2023-10-28 06:17:35,666 ERROR resourcemanager.ResourceManager 
(ResourceManager.java:handle(898)) - Received RMFatalEvent of type 
STATE_STORE_FENCED, caused by 
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
2023-10-28 06:17:35,666 INFO  resourcemanager.ResourceManager 
(ResourceManager.java:transitionToStandby(1309)) - Transitioning to standby 
state
 {code}
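The sequence described above can be reproduced with a minimal sketch. It does not use the real ZooKeeper or Curator APIs: FakeServer, ConnectionLossException, NoNodeException, deleteWithRetry and the znode path are all hypothetical stand-ins, and the dropped response is simulated on the first delete only.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical in-memory model of the retry race; not the real client API. */
final class RetryRaceDemo {

    /** Stand-in for KeeperException.NoNodeException. */
    static final class NoNodeException extends Exception {
        NoNodeException(String path) { super("NoNode for " + path); }
    }

    /** Stand-in for a connection loss where the server's reply never arrives. */
    static final class ConnectionLossException extends Exception {
        ConnectionLossException(String path) { super("ConnectionLoss for " + path); }
    }

    /** Fake server: the first successful delete is applied, but its response is dropped. */
    static final class FakeServer {
        final Set<String> znodes = new HashSet<>();
        private boolean dropNextResponse = true;

        void delete(String path) throws NoNodeException, ConnectionLossException {
            if (!znodes.remove(path)) {
                throw new NoNodeException(path);
            }
            if (dropNextResponse) {
                dropNextResponse = false;
                // The znode is already gone on the server; only the reply is lost.
                throw new ConnectionLossException(path);
            }
        }
    }

    /** Naive retry loop: retries on connection loss, propagates everything else. */
    static void deleteWithRetry(FakeServer server, String path, int maxRetries)
            throws NoNodeException, ConnectionLossException {
        for (int attempt = 0; ; attempt++) {
            try {
                server.delete(path);
                return;
            } catch (ConnectionLossException e) {
                if (attempt >= maxRetries) {
                    throw e;
                }
            }
        }
    }

    /** Deletes an existing znode through the retry loop and reports the outcome. */
    static String run() {
        FakeServer server = new FakeServer();
        String path = "/rmstore/app_1"; // hypothetical path
        server.znodes.add(path);
        try {
            deleteWithRetry(server, path, 3);
            return "deleted";
        } catch (NoNodeException e) {
            return "NoNode";
        } catch (ConnectionLossException e) {
            return "ConnectionLoss";
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this model the caller sees a NoNode failure even though its own first attempt removed the znode, which matches the pattern in the log.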
h1. Solution

A NoNodeException on delete means that the znode no longer exists, which is 
exactly the state safeDelete is trying to reach. We can therefore safely 
ignore this exception and avoid the much larger cluster impact of a 
ResourceManager failover.
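
As a rough illustration of the idea, the sketch below treats "znode already gone" as a successful outcome of a delete. It is not the ZKRMStateStore code: DeleteOp, inMemory, this safeDelete signature and the local NoNodeException class are hypothetical stand-ins, and the store is a plain in-memory set.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of a delete that tolerates an already-missing znode. */
final class TolerantDeleteSketch {

    /** Stand-in for KeeperException.NoNodeException. */
    static final class NoNodeException extends Exception {
        NoNodeException(String path) { super("NoNode for " + path); }
    }

    /** Any delete operation that may report a missing znode. */
    @FunctionalInterface
    interface DeleteOp {
        void delete(String path) throws NoNodeException;
    }

    /**
     * Runs the delete and treats NoNode as "already deleted".
     * Returns true if this call removed the znode, false if it was already gone.
     */
    static boolean safeDelete(DeleteOp op, String path) {
        try {
            op.delete(path);
            return true;
        } catch (NoNodeException e) {
            // The desired end state (no znode) already holds, so do not escalate.
            return false;
        }
    }

    /** In-memory delete used only to exercise the sketch. */
    static DeleteOp inMemory(Set<String> znodes) {
        return path -> {
            if (!znodes.remove(path)) {
                throw new NoNodeException(path);
            }
        };
    }

    public static void main(String[] args) {
        Set<String> znodes = new HashSet<>();
        znodes.add("/rmstore/app_1"); // hypothetical path
        DeleteOp op = inMemory(znodes);
        System.out.println(safeDelete(op, "/rmstore/app_1")); // first call removes the znode
        System.out.println(safeDelete(op, "/rmstore/app_1")); // second call finds it gone
    }
}
```

Returning a boolean rather than swallowing the exception silently lets the caller log the "already gone" case. Whether the real safeDelete can ignore the exception this way depends on how it combines the deletion with its fencing check, which this sketch does not model.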
h1. Other

The same issue affects safeCreate, which we also need to discuss and optimize.


