[jira] [Commented] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB

2019-09-29 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16940667#comment-16940667
 ] 

Bibin Chundatt commented on YARN-2368:
--

[~zhuqi]

In case you want to set "jute.maxbuffer", you could probably make use of 
*YARN_RESOURCEMANAGER_OPTS*.
At application-submission time the znode size is limited by YARN-5006.
IIUC, YARN-2962 helps in limiting the number of znodes under one level of the znode hierarchy.
For attempt-level updates, some discussion is already happening in YARN-9847.
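
A minimal sketch of that approach, assuming a stock yarn-env.sh based setup; the 
4 MB value is purely illustrative and should be sized to the actual state-store needs:
{code}
# etc/hadoop/yarn-env.sh (path differs per distribution)
# jute.maxbuffer is read as a JVM system property by the ZooKeeper client code,
# so appending it to the RM JVM options is enough on the RM side.
export YARN_RESOURCEMANAGER_OPTS="$YARN_RESOURCEMANAGER_OPTS -Djute.maxbuffer=4194304"
{code}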



> ResourceManager failed when ZKRMStateStore tries to update znode data larger 
> than 1MB
> -
>
> Key: YARN-2368
> URL: https://issues.apache.org/jira/browse/YARN-2368
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.1
>Reporter: Leitao Guo
>Assignee: zhuqi
>Priority: Critical
> Attachments: YARN-2368.patch
>
>
> Both ResourceManagers throw out STATE_STORE_OP_FAILED events and eventually 
> fail. The ZooKeeper log shows that ZKRMStateStore tries to update a znode 
> larger than 1MB, which is the default 'jute.maxbuffer' limit of both the 
> ZooKeeper server and client.
> The ResourceManager (IP addr: 10.153.80.8) log shows the following:
> {code}
> 2014-07-25 22:33:11,078 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session connected
> 2014-07-25 22:33:11,078 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session restored
> 2014-07-25 22:33:11,214 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
> STATE_STORE_OP_FAILED. Cause:
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for 
> /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Meanwhile, the ZooKeeper log shows the following:
> {code}
> 2014-07-25 22:10:09,728 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - 
> Accepted socket connection from /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client 
> attempting to renew session 0x247684586e70006 at /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating 
> client: 0x247684586e70006
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@595] - Established session 
> 0x247684586e70006 with negotiated timeout 1 for client /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@863] - got auth 
> packet /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@897] - auth 
> success /10.153.80.8:58890
> 2014-07-25 22:10:09,742 [myid:1] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception 
> causing close of session 0x247684586e70006 due to java.io.IOException: Len 
> error 1530747
> {code}

[jira] [Commented] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB

2015-05-06 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14530819#comment-14530819
 ] 

Xuan Gong commented on YARN-2368:
-

This is a duplicate of YARN-2962. Closing this as a duplicate; we can discuss the 
issue there.


[jira] [Commented] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB

2015-03-25 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380135#comment-14380135
 ] 

Karthik Kambatla commented on YARN-2368:


Thanks for reporting this, [~guoleitao]. The required jute.maxbuffer size depends 
on the number of applications users want to store in the state-store. I believe 
YARN-2962 is a better fix.


[jira] [Commented] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB

2015-03-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14355748#comment-14355748
 ] 

Hadoop QA commented on YARN-2368:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12658387/YARN-2368.patch
  against trunk revision aa92b76.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 5 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6904//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/6904//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6904//console

This message is automatically generated.


[jira] [Commented] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB

2015-03-10 Thread David Morel (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14355471#comment-14355471
 ] 

David Morel commented on YARN-2368:
---

Passing "-Djute.maxbuffer=" via the startup scripts' environment (in 
/etc/hadoop/conf/yarn-env.sh or /etc/default/hadoop-yarn-resourcemanager) through 
the YARN_RESOURCEMANAGER_OPTS variable does the trick. It's picked up by the RM 
process and takes effect.
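
Since the description notes that 'jute.maxbuffer' is the default of both the 
ZooKeeper server and client, the ZooKeeper servers may need the same override as 
well. A hedged sketch, assuming an install where conf/zookeeper-env.sh is sourced 
by zkEnv.sh and SERVER_JVMFLAGS is honored; the value is illustrative and should 
match the RM-side setting:
{code}
# conf/zookeeper-env.sh on each ZooKeeper server (assumption: sourced by zkEnv.sh)
# Keep this value in sync with the -Djute.maxbuffer passed via YARN_RESOURCEMANAGER_OPTS.
export SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Djute.maxbuffer=4194304"
{code}
A rolling restart of the ZooKeeper ensemble and the ResourceManagers would be 
needed for either side to pick the setting up.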


[jira] [Commented] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB

2014-07-29 Thread Leitao Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078954#comment-14078954
 ] 

Leitao Guo commented on YARN-2368:
--

Thanks [~ozawa] for your comments. 

I deployed hadoop-2.3.0-cdh5.1.0 with a 22-queue fair scheduler on my 20-node 
cluster. Two ResourceManagers are deployed on dedicated hosts, 10.153.80.8 and 
10.153.80.18. 

Jobs are submitted from gridmix:
{code}
sudo -u mapred hadoop jar /usr/lib/hadoop-mapreduce/hadoop-gridmix.jar 
-Dgridmix.min.file.size=10485760 
-Dgridmix.job-submission.use-queue-in-trace=true 
-Dgridmix.distributed-cache-emulation.enable=false  -generate 34816m 
hdfs:///user/mapred/foo/ hdfs:///tmp/job-trace.json
{code}
job-trace.json is generated by Rumen and contains 6,000 jobs, with an average of 
320 map tasks and 25 reduce tasks per job.
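
For reference, such a trace would typically be built from the MapReduce 
JobHistory files with Rumen's TraceBuilder; a rough sketch only, with placeholder 
paths and the exact invocation depending on the Hadoop version and how the Rumen 
jar is put on the classpath:
{code}
# Assumed invocation of Rumen's TraceBuilder; the outputs are the job trace and
# the cluster topology, the last argument is the JobHistory directory to analyze.
hadoop org.apache.hadoop.tools.rumen.TraceBuilder \
  hdfs:///tmp/job-trace.json hdfs:///tmp/topology.json \
  hdfs:///user/history/done
{code}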

I found 3 runs (GridMix was run more than 3 times) in which the ResourceManager 
failed while handling a STATE_STORE_OP_FAILED event. At the same time, ZooKeeper 
threw a 'Len error' IOException:
{code}
... ...
2014-07-24 21:00:51,170 [myid:3] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted 
socket connection from /10.153.80.8:47135
2014-07-24 21:00:51,171 [myid:3] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client 
attempting to renew session 0x247678daa88001a at /10.153.80.8:47135
2014-07-24 21:00:51,171 [myid:3] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating client: 
0x247678daa88001a
2014-07-24 21:00:51,171 [myid:3] - INFO  
[QuorumPeer[myid=3]/0.0.0.0:2181:ZooKeeperServer@595] - Established session 
0x247678daa88001a with negotiated timeout 1 for client /10.153.80.8:47135
2014-07-24 21:00:51,171 [myid:3] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@863] - got auth 
packet /10.153.80.8:47135
2014-07-24 21:00:51,172 [myid:3] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@897] - auth success 
/10.153.80.8:47135
2014-07-24 21:00:51,186 [myid:3] - WARN  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception 
causing close of session 0x247678daa88001a due to java.io.IOException: Len 
error 1813411
2014-07-24 21:00:51,186 [myid:3] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket 
connection for client /10.153.80.8:47135 which had sessionid 0x247678daa88001a

... ...

2014-07-25 22:10:08,919 [myid:3] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted 
socket connection from /10.153.80.8:50480
2014-07-25 22:10:08,921 [myid:3] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client 
attempting to renew session 0x247684586e70006 at /10.153.80.8:50480
2014-07-25 22:10:08,922 [myid:3] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@595] - Established 
session 0x247684586e70006 with negotiated timeout 1 for client 
/10.153.80.8:50480
2014-07-25 22:10:08,922 [myid:3] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@863] - got auth 
packet /10.153.80.8:50480
2014-07-25 22:10:08,923 [myid:3] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@897] - auth success 
/10.153.80.8:50480
2014-07-25 22:10:08,934 [myid:3] - WARN  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception 
causing close of session 0x247684586e70006 due to java.io.IOException: Len 
error 1530747
2014-07-25 22:10:08,934 [myid:3] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket 
connection for client /10.153.80.8:50480 which had sessionid 0x247684586e70006

... ...

2014-07-26 02:22:59,627 [myid:3] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted 
socket connection from /10.153.80.18:60588
2014-07-26 02:22:59,629 [myid:3] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client 
attempting to renew session 0x2476de7c1af0002 at /10.153.80.18:60588
2014-07-26 02:22:59,629 [myid:3] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@595] - Established 
session 0x2476de7c1af0002 with negotiated timeout 1 for client 
/10.153.80.18:60588
2014-07-26 02:22:59,630 [myid:3] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@863] - got auth 
packet /10.153.80.18:60588
2014-07-26 02:22:59,630 [myid:3] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@897] - auth success 
/10.153.80.18:60588
2014-07-26 02:22:59,648 [myid:3] - WARN  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception 
causing close of session 0x2476de7c1af0002 due to java.io.IOException: Len 
error 1649043
2014-07-26 02:22:59,648 [myid:3] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket 
connection for client /10.153.80.18:60588 which had sessionid 0x2476de7c1af0002
{code}

[jira] [Commented] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB

2014-07-29 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078805#comment-14078805
 ] 

Tsuyoshi OZAWA commented on YARN-2368:
--

[~breno.leitao], oops, sorry for pinging the wrong person. I meant to mention 
[~guoleitao].



[jira] [Commented] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB

2014-07-29 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078804#comment-14078804
 ] 

Tsuyoshi OZAWA commented on YARN-2368:
--

Thanks for your contribution, [~breno.leitao]! Could you explain the conditions 
under which you faced this problem? If we face this problem very often, it's a 
critical problem for ZKRMStateStore. However, the data stored in ZKRMStateStore 
is basically small, so I think it's strange that this kind of problem appears. 
Additionally, if the maximum data size is fixed, we should make the default value 
large enough that we don't face this problem.


[jira] [Commented] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB

2014-07-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077616#comment-14077616
 ] 

Hadoop QA commented on YARN-2368:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12658387/YARN-2368.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4471//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4471//console

This message is automatically generated.
