[ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Leitao Guo updated YARN-2368:
-----------------------------
    Attachment: YARN-2368.patch

> ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
> -------------------------------------------------------------------------------------
>
>                 Key: YARN-2368
>                 URL: https://issues.apache.org/jira/browse/YARN-2368
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.4.1
>            Reporter: Leitao Guo
>            Priority: Critical
>         Attachments: YARN-2368.patch
>
>
> Both ResourceManagers threw STATE_STORE_OP_FAILED events and eventually
> failed. The ZooKeeper log shows that ZKRMStateStore tried to update a znode
> with more than 1MB of data, which exceeds the default 'jute.maxbuffer' limit
> configured for the ZooKeeper server and client.
>
> The ResourceManager log shows the following:
> ------------------------------------------------------------
> 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected
> 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored
> 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_000001
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>         at java.lang.Thread.run(Thread.java:745)
>
> Meanwhile, the ZooKeeper log shows the following:
> ------------------------------------------------------------
> 2014-07-25 22:10:09,742 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747
> ... ...
> 2014-07-25 22:33:10,966 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747

--
This message was sent by Atlassian JIRA
(v6.2#6252)
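
For readers comparing the "Len error 1530747" value against the limit mentioned in the description, here is a minimal arithmetic sketch. It assumes the ZooKeeper default for 'jute.maxbuffer' is 0xfffff bytes (just under 1MB) and models the server-side length check in simplified form; the constant and function names are hypothetical and are not taken from the ZooKeeper or YARN code.

```python
# Assumed ZooKeeper default for jute.maxbuffer: 0xfffff bytes (just under 1MB).
DEFAULT_JUTE_MAXBUFFER = 0xFFFFF


def exceeds_jute_maxbuffer(packet_len: int,
                           max_buffer: int = DEFAULT_JUTE_MAXBUFFER) -> bool:
    """Simplified model of the length check: a negative length or a length
    above the configured buffer limit is rejected."""
    return packet_len < 0 or packet_len > max_buffer


# Length reported in the "Len error 1530747" lines of the ZooKeeper log.
reported_len = 1530747

# How far the rejected request is over the assumed default limit, in bytes.
overshoot = reported_len - DEFAULT_JUTE_MAXBUFFER

print(exceeds_jute_maxbuffer(reported_len))  # True
print(overshoot)                             # 482172
```

Under that assumption the request is roughly 471 KB over the limit, which is consistent with the server closing the session and the ResourceManager seeing ConnectionLoss on each retry.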