[
https://issues.apache.org/jira/browse/YARN-11626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dinesh Chitlangia resolved YARN-11626.
--------------------------------------
Fix Version/s: 3.5.0
Resolution: Fixed
> Optimization of the safeDelete operation in ZKRMStateStore
> ----------------------------------------------------------
>
> Key: YARN-11626
> URL: https://issues.apache.org/jira/browse/YARN-11626
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
> Reporter: wangzhihui
> Priority: Minor
> Labels: pull-request-available
> Fix For: 3.5.0
>
>
> h1. Description
> * The log below shows that removal of the app info started at 06:17:20, but the
> NoNodeException was received at 06:17:35.
> * During the 15s interval, Curator was retrying the metadata delete operation.
> Because a ZooKeeper delete is not idempotent, one of the retries succeeded on
> the server but no response was received by the client. The next retry then
> failed with a NoNodeException, which triggered the STATE_STORE_FENCED event
> and ultimately caused the active ResourceManager to switch to standby.
> {code:java}
> 2023-10-28 06:17:20,359 INFO recovery.RMStateStore
> (RMStateStore.java:transition(333)) - Removing info for app:
> application_1697410508608_140368
> 2023-10-28 06:17:20,359 INFO resourcemanager.RMAppManager
> (RMAppManager.java:checkAppNumCompletedLimit(303)) - Application should be
> expired, max number of completed apps kept in memory met:
> maxCompletedAppsInMemory = 1000, removing app
> application_1697410508608_140368 from memory:
> 2023-10-28 06:17:35,665 ERROR recovery.RMStateStore
> (RMStateStore.java:transition(337)) - Error removing app:
> application_1697410508608_140368
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> 2023-10-28 06:17:35,666 INFO recovery.RMStateStore
> (RMStateStore.java:handleStoreEvent(1147)) - RMStateStore state change from
> ACTIVE to FENCED
> 2023-10-28 06:17:35,666 ERROR resourcemanager.ResourceManager
> (ResourceManager.java:handle(898)) - Received RMFatalEvent of type
> STATE_STORE_FENCED, caused by
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
> 2023-10-28 06:17:35,666 INFO resourcemanager.ResourceManager
> (ResourceManager.java:transitionToStandby(1309)) - Transitioning to standby
> state
> {code}
> h1. Solution
> A NoNodeException on delete means the znode no longer exists, which is the
> state the delete was meant to reach. safeDelete can therefore ignore this
> exception instead of letting it trigger a ResourceManager failover, which has
> a much larger impact on the cluster.
> h1. Other
> The same issue in safeCreate also needs to be discussed and optimized.
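
The Solution in the quoted issue can be illustrated with a minimal sketch. This is not the actual YARN-11626 patch. FakeZk and the local NoNodeException are hypothetical in-memory stand-ins for the ZooKeeper client and org.apache.zookeeper.KeeperException.NoNodeException, the znode path is made up, and this safeDelete only mirrors the idea of treating an already-deleted znode as success.

```java
import java.util.HashSet;
import java.util.Set;

final class SafeDeleteSketch {

    /** Stand-in for org.apache.zookeeper.KeeperException.NoNodeException (assumption). */
    static class NoNodeException extends Exception {
        NoNodeException(String path) {
            super("KeeperErrorCode = NoNode for " + path);
        }
    }

    /** Hypothetical in-memory replacement for the ZooKeeper client. */
    static class FakeZk {
        private final Set<String> znodes = new HashSet<>();

        void create(String path) {
            znodes.add(path);
        }

        boolean exists(String path) {
            return znodes.contains(path);
        }

        /** Like a real delete, this is not idempotent: a second call fails. */
        void delete(String path) throws NoNodeException {
            if (!znodes.remove(path)) {
                throw new NoNodeException(path);
            }
        }
    }

    /**
     * Deletes the znode and treats "already absent" as success.
     * Returns true if this call removed the znode, false if it was already gone.
     */
    static boolean safeDelete(FakeZk zk, String path) {
        try {
            zk.delete(path);
            return true;
        } catch (NoNodeException e) {
            // The znode is already gone, for example because an earlier retry
            // succeeded and its response was lost. Nothing is left to delete,
            // so the exception is swallowed instead of fencing the store.
            return false;
        }
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        String path = "/apps/application_1697410508608_140368"; // hypothetical path
        zk.create(path);

        // First attempt: the delete is applied (assume its response is lost).
        boolean first = safeDelete(zk, path);
        // Retry: the znode no longer exists, and NoNode is ignored.
        boolean retry = safeDelete(zk, path);

        System.out.println("first attempt removed znode: " + first);
        System.out.println("retry removed znode: " + retry);
        System.out.println("znode exists afterwards: " + zk.exists(path));
    }
}
```

In this sketch the retry returns normally rather than propagating the exception, so a caller would have no reason to raise a fencing event for it. Whether other KeeperException types should still fence the store is outside what the issue describes.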