[ https://issues.apache.org/jira/browse/YARN-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292937#comment-14292937 ]
Chun Chen commented on YARN-2992: --------------------------------- [~kasha] [~rohithsharma] [~jianhe], we are constantly facing the following error RM log {code} 2015-01-27 00:13:19,379 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.196.128.13/10.196.128.13:2181. Will not attempt to authenticate using SASL (unknown erro r) 2015-01-27 00:13:19,383 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.196.128.13/10.196.128.13:2181, initiating session 2015-01-27 00:13:19,404 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.196.128.13/10.196.128.13:2181, sessionid = 0x24ab193421e4812, negotiated timeout = 10000 2015-01-27 00:13:19,417 WARN org.apache.zookeeper.ClientCnxn: Session 0x24ab193421e4812 for server 10.196.128.13/10.196.128.13:2181, unexpected error, closing socket connection and attempti ng reconnect java.io.IOException: Broken pipe at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) at sun.nio.ch.IOUtil.write(IOUtil.java:65) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470) at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068) 2015-01-27 00:13:19,517 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:895) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:892) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1031) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1050) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:898) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.access$600(ZKRMStateStore.java:82) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread.run(ZKRMStateStore.java:1003) 2015-01-27 00:13:19,518 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 934 {code} ZK log {code} 2015-01-27 00:13:19,300 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.240.92.100:46464 2015-01-27 00:13:19,302 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew session 0x24ab193421e4812 at /10.240.92.100:46464 2015-01-27 00:13:19,302 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108] - Revalidating client: 0x24ab193421e4812 2015-01-27 00:13:19,303 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@617] - Established session 0x24ab193421e4812 with negotiated timeout 10000 for client /10.240.92.100:46464 2015-01-27 00:13:19,303 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@892] - got auth packet /10.240.92.100:46464 2015-01-27 00:13:19,303 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@926] - auth success /10.240.92.100:46464 2015-01-27 00:13:19,320 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x24ab193421e4812 due to java.io.IOException: Len error 1425415 2015-01-27 00:13:19,321 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.240.92.100:46464 which had sessionid 0x24ab193421e4812 2015-01-27 00:13:23,093 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.240.92.100:46477 2015-01-27 00:13:23,159 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew session 0x24ab193421e4812 at /10.240.92.100:46477 2015-01-27 00:13:23,159 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108] - Revalidating client: 0x24ab193421e4812 2015-01-27 00:13:23,160 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@617] - Established session 0x24ab193421e4812 with negotiated timeout 10000 for client /10.240.92.100:46477 2015-01-27 00:13:23,160 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@892] - got auth packet /10.240.92.100:46477 2015-01-27 00:13:23,160 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@926] - auth success /10.240.92.100:46477 2015-01-27 00:13:23,170 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x24ab193421e4812 due to java.io.IOException: Len error 1425415 2015-01-27 00:13:23,171 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.240.92.100:46477 which had sessionid 0x24ab193421e4812 {code} It seems when zk receives a large packet ( Len error 1425415 > 1M) and then initiatively closed the socket connection due to the large packet error. Haven't tried the patch here which renews a zk connection each time it fails to execute a zk operation, but I can't figure out why this happens, since VerifyActiveStatusThread only creates and deletes the fencing node. The packet shouldn't be larger than 1M. Do you have any clues why this happens? Thanks. > ZKRMStateStore crashes due to session expiry > -------------------------------------------- > > Key: YARN-2992 > URL: https://issues.apache.org/jira/browse/YARN-2992 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager > Affects Versions: 2.6.0 > Reporter: Karthik Kambatla > Assignee: Karthik Kambatla > Priority: Blocker > Fix For: 2.7.0 > > Attachments: yarn-2992-1.patch > > > We recently saw the RM crash with the following stacktrace. On session > expiry, we should gracefully transition to standby. > {noformat} > 2014-12-18 06:28:42,689 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a > org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type > STATE_STORE_OP_FAILED. Cause: > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode > = Session expired > at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) > at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931) > at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:930) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:927) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:927) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:941) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:958) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:687) > > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)