[ https://issues.apache.org/jira/browse/YARN-9498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xianghao Lu updated YARN-9498:
------------------------------
    Attachment: YARN-9498.001.patch

> ZooKeeper data size limit make YARN cluster down
> ------------------------------------------------
>
>                 Key: YARN-9498
>                 URL: https://issues.apache.org/jira/browse/YARN-9498
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.2
>            Reporter: Xianghao Lu
>            Assignee: Xianghao Lu
>            Priority: Major
>         Attachments: YARN-9498.001.patch
>
>
> As far as I know, we can't write data larger than 1M into ZooKeeper and can't
> read data larger than 4M from ZooKeeper.
> Recently, I ran into this issue twice; both times we were hitting the default
> ZK server message size configs.
> *For the first time,* a long-running app created 16557 attempts, and we ran
> into this issue when the app finished and the RM deleted its attempts from ZK.
> The exception information is as follows:
> {code:java}
> 2019-04-04 15:17:09,451 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1552966650039_558667_000001
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
>         at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:975)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:972)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1151)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1184)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:972)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:986)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1027)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:701)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:314)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:296)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:914)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:995)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:990)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:190)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:116)
>         at java.lang.Thread.run(Thread.java:724)
> {code}
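>
> To make the failure mode concrete, here is a minimal sketch that reproduces
> the ConnectionLoss above by writing a single oversized znode. It assumes a
> local ZK server at localhost:2181 running with the default jute.maxbuffer
> (~1M) and is for illustration only; the same limit applies to the whole
> serialized multi() request, which is how deleting 16557 attempts in one
> operation hits it.
> {code:java}
> import org.apache.zookeeper.CreateMode;
> import org.apache.zookeeper.KeeperException;
> import org.apache.zookeeper.ZooDefs;
> import org.apache.zookeeper.ZooKeeper;
>
> // Illustrative only: reproduces the ZK size-limit failure outside the RM.
> public class ZkSizeLimitDemo {
>   public static void main(String[] args) throws Exception {
>     // Connect string and timeout are assumptions for this demo.
>     ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
>     byte[] oversized = new byte[2 * 1024 * 1024]; // 2M > default 1M limit
>     try {
>       zk.create("/size-limit-demo", oversized,
>           ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
>     } catch (KeeperException.ConnectionLossException e) {
>       // The server drops the connection instead of returning a clean error,
>       // which is why the RM logs show ConnectionLoss rather than a size error.
>       System.out.println("Hit the ZK packet size limit: " + e);
>     } finally {
>       zk.close();
>     }
>   }
> }
> {code}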
> *For the second time,* an app's attempt failed with a very long diagnostic
> message (5M), and we ran into this issue when the RM stored the diagnostic
> message in ZK. (YARN-6125 and YARN-6967 have fixed this case.)
> The exception information is as follows:
> {code:java}
> 2019-03-28 11:54:26,179 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x169c231283802be, likely server has closed socket, closing socket connection and attempting reconnect
> 2019-03-28 11:54:26,279 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
>         at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$6.run(ZKRMStateStore.java:1003)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$6.run(ZKRMStateStore.java:999)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1151)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1184)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doDeleteMultiWithRetries(ZKRMStateStore.java:999)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:729)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:256)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:238)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:914)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:995)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:990)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:190)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:116)
>         at java.lang.Thread.run(Thread.java:724)
> {code}
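>
> For this second case, the YARN-6125/YARN-6967 fixes cap the diagnostics
> before they reach the statestore. A simplified sketch of that kind of
> truncation follows; the 64K cap and the class/method names here are
> illustrative assumptions, not the exact Hadoop implementation:
> {code:java}
> // Illustrative sketch of diagnostics truncation in the spirit of
> // YARN-6125/YARN-6967; the cap and the names are assumptions.
> public final class DiagnosticsTruncator {
>   // Hypothetical cap, loosely modeled on a "diagnostics limit in KB" setting.
>   private static final int LIMIT_CHARS = 64 * 1024;
>
>   static String truncate(String diagnostics) {
>     if (diagnostics == null || diagnostics.length() <= LIMIT_CHARS) {
>       return diagnostics;
>     }
>     // Keep the tail: the end of a failure message is usually the useful part.
>     return "...(truncated to last " + LIMIT_CHARS + " chars)..."
>         + diagnostics.substring(diagnostics.length() - LIMIT_CHARS);
>   }
>
>   public static void main(String[] args) {
>     // A 5M diagnostic message like the one in the second case above.
>     String huge = new String(new char[5 * 1024 * 1024]).replace('\0', 'x');
>     System.out.println(truncate(huge).length()); // well under the ZK limit
>   }
> }
> {code}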
>
> Besides, YARN-5006 (limit application_*** data size), YARN-2962 (limit the
> number of znodes under RMAppRoot), YARN-7262 (limit the number of znodes
> under RMDelegationTokensRoot), and YARN-8865 (clear expired
> RMDelegationTokens) are also related to this issue.
>
> The ZooKeeper statestore layout looks like this:
> {code:java}
> * |--- AMRMTokenSecretManagerRoot
> * |--- RMAppRoot
> * |     |----- application_***
> * |     |        |----- appattempt_***
> * |--- EpochNode
> * |--- RMVersionNode
> * |--- RMDTSecretManagerRoot
> *       |----- RMDTSequentialNumber
> *       |----- RMDTMasterKeysRoot
> *       |----- RMDelegationTokensRoot
> *                |----- Token_***
> {code}
>
> For my first case, I have two possible solutions:
> 1) add a hierarchy for appattempts, like YARN-2962 and YARN-7262
> 2) limit the number of appattempts kept in the statestore
> I think there is no need to save too many appattempts in the statestore, so
> I made a patch for the second solution; a sketch of the idea follows below.
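>
> A sketch of the idea behind the second solution; the cap value, the names,
> and the pruning policy are illustrative assumptions, not necessarily what
> YARN-9498.001.patch actually does:
> {code:java}
> import java.util.Set;
> import java.util.TreeSet;
>
> // Illustrative sketch of "limit the number of appattempts" only; the cap
> // and the pruning policy here are assumptions, not the actual patch.
> public class AttemptCapSketch {
>   // Hypothetical cap on attempts kept per application in the statestore.
>   private static final int MAX_STORED_ATTEMPTS = 100;
>
>   // Attempt znode names like appattempt_..._000001 have zero-padded attempt
>   // numbers, so for one application a TreeSet keeps the oldest attempt first.
>   private final Set<String> storedAttempts = new TreeSet<>();
>
>   void storeAttempt(String attemptZnode) {
>     storedAttempts.add(attemptZnode);
>     // Without a cap, an app with 16557 attempts pushed the delete-all
>     // multi() request past jute.maxbuffer; pruning bounds the znode count.
>     while (storedAttempts.size() > MAX_STORED_ATTEMPTS) {
>       String oldest = storedAttempts.iterator().next();
>       storedAttempts.remove(oldest);
>       // In the real statestore this would also delete the attempt znode,
>       // e.g. something like: zk.delete(attemptPath(oldest), -1);
>     }
>   }
> }
> {code}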