[ 
https://issues.apache.org/jira/browse/YARN-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292937#comment-14292937
 ] 

Chun Chen commented on YARN-2992:
---------------------------------

[~kasha] [~rohithsharma] [~jianhe], we keep hitting the following error.
RM log
{code}
2015-01-27 00:13:19,379 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.196.128.13/10.196.128.13:2181. Will not attempt to authenticate using SASL (unknown error)
2015-01-27 00:13:19,383 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.196.128.13/10.196.128.13:2181, initiating session
2015-01-27 00:13:19,404 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.196.128.13/10.196.128.13:2181, sessionid = 0x24ab193421e4812, negotiated timeout = 10000
2015-01-27 00:13:19,417 WARN org.apache.zookeeper.ClientCnxn: Session 0x24ab193421e4812 for server 10.196.128.13/10.196.128.13:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
2015-01-27 00:13:19,517 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
        at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:895)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:892)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1031)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1050)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:898)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.access$600(ZKRMStateStore.java:82)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread.run(ZKRMStateStore.java:1003)
2015-01-27 00:13:19,518 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 934
{code}

ZK log
{code}
2015-01-27 00:13:19,300 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.240.92.100:46464
2015-01-27 00:13:19,302 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew session 0x24ab193421e4812 at /10.240.92.100:46464
2015-01-27 00:13:19,302 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108] - Revalidating client: 0x24ab193421e4812
2015-01-27 00:13:19,303 [myid:1] - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@617] - Established session 0x24ab193421e4812 with negotiated timeout 10000 for client /10.240.92.100:46464
2015-01-27 00:13:19,303 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@892] - got auth packet /10.240.92.100:46464
2015-01-27 00:13:19,303 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@926] - auth success /10.240.92.100:46464
2015-01-27 00:13:19,320 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x24ab193421e4812 due to java.io.IOException: Len error 1425415
2015-01-27 00:13:19,321 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.240.92.100:46464 which had sessionid 0x24ab193421e4812
2015-01-27 00:13:23,093 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.240.92.100:46477
2015-01-27 00:13:23,159 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew session 0x24ab193421e4812 at /10.240.92.100:46477
2015-01-27 00:13:23,159 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108] - Revalidating client: 0x24ab193421e4812
2015-01-27 00:13:23,160 [myid:1] - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@617] - Established session 0x24ab193421e4812 with negotiated timeout 10000 for client /10.240.92.100:46477
2015-01-27 00:13:23,160 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@892] - got auth packet /10.240.92.100:46477
2015-01-27 00:13:23,160 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@926] - auth success /10.240.92.100:46477
2015-01-27 00:13:23,170 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x24ab193421e4812 due to java.io.IOException: Len error 1425415
2015-01-27 00:13:23,171 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.240.92.100:46477 which had sessionid 0x24ab193421e4812
{code}

It seems that ZK receives a packet larger than its 1 MB limit (Len error
1425415) and then closes the socket connection itself because of that error.
I haven't tried the patch here, which renews the ZK connection each time a ZK
operation fails, but I can't figure out why this happens, since
VerifyActiveStatusThread only creates and deletes the fencing node. The packet
shouldn't be larger than 1 MB. Do you have any clues why this happens? Thanks.
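
For reference, a minimal sketch of the size check that appears to be rejecting
the request. It assumes the ZooKeeper server is running with the default
jute.maxbuffer of 0xfffff bytes (we have not confirmed the setting on our
quorum), and the class and method names are hypothetical, not ZooKeeper's own.

```java
// Sketch only: reproduces the comparison behind "Len error <n>" under the
// assumption that the server uses the default jute.maxbuffer value.
// LenCheck and exceedsLimit are hypothetical names used for illustration.
class LenCheck {
    // ZooKeeper's default for the jute.maxbuffer system property:
    // 0xfffff = 1048575 bytes, one byte short of 1 MiB.
    static final int DEFAULT_MAX_BUFFER = 0xfffff;

    // Returns true when a request of the given length would be rejected.
    static boolean exceedsLimit(int len, int maxBuffer) {
        return len < 0 || len > maxBuffer;
    }

    public static void main(String[] args) {
        // Honour an override passed as -Djute.maxbuffer=<bytes>, if any.
        int maxBuffer = Integer.getInteger("jute.maxbuffer", DEFAULT_MAX_BUFFER);
        int reported = 1425415; // length from "Len error 1425415" in the ZK log
        System.out.println("limit=" + maxBuffer
                + " reported=" + reported
                + " rejected=" + exceedsLimit(reported, maxBuffer));
    }
}
```

If the limit really is the default, the reported length is over it by roughly
370 KB, which would explain the server closing the connection on every retry.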

> ZKRMStateStore crashes due to session expiry
> --------------------------------------------
>
>                 Key: YARN-2992
>                 URL: https://issues.apache.org/jira/browse/YARN-2992
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Blocker
>             Fix For: 2.7.0
>
>         Attachments: yarn-2992-1.patch
>
>
> We recently saw the RM crash with the following stacktrace. On session 
> expiry, we should gracefully transition to standby. 
> {noformat}
> 2014-12-18 06:28:42,689 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
> at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:930)
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:927)
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069)
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088)
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:927)
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:941)
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:958)
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:687)
> {noformat}



