I’ve checked the logs of the zookeeper quorum nodes, and it seems that problem is not here.
I found that client is closing connection due to timeout, only after this zookeeper server has EOF exception. Here is the logs from hadoop zkfc process 2015-10-01 07:12:17,820 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3334ms for sessionid 0x25020ff36c40002, closing socket connection and attempting reconnect 2015-10-01 07:12:17,938 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode… As you can see client timeouted after 3334ms, but the timeout configured in hdfs-site.xml dfs.ha.fencing.ssh.connect-timeout is 10.000ms so why does the client timed out? Thanks. > On 24 Sep 2015, at 15:36, Flavio Junqueira <[email protected]> wrote: > > I can see that the client is disconnecting from the server, and there is also > a new round of leader election for the zookeeper servers. If this is > happening, then yeah, your ensemble is unstable. If the ensemble leader > election is being triggered frequently, then I'd start by looking there. Try > to determine why the ensemble is failing to continue with the same leader. If > ensemble elections aren't happening frequently, then another possibility is > that GC pauses are causing the session to expire. > > -Flavio > >> On 24 Sep 2015, at 05:24, Akmal Abbasov <[email protected]> wrote: >> >> Hi, >> I am using zookeeper 3.4.6 >> I have a spark cluster configured with HA. Once per 1-2 days, the active >> spark master is shutting down with a message >> 15/09/23 18:58:18 INFO zookeeper.ClientCnxn: Unable to read additional data >> from server sessionid 0x34ffa68dbd10021, likely server has closed socket, >> closing socket connection and attempting reconnect >> 15/09/23 18:58:18 INFO state.ConnectionStateManager: State change: SUSPENDED >> 15/09/23 18:58:18 INFO master.ZooKeeperLeaderElectionAgent: We have lost >> leadership >> 15/09/23 18:58:18 ERROR master.Master: Leadership has been revoked -- master >> shutting down. >> 15/09/23 18:58:18 INFO util.Utils: Shutdown hook called >> >> I don’t have the zookeeper logs from the same period, but the logs are full >> of the these messages >> 2015-09-24 05:07:42,228 [myid:1] - INFO >> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - >> Accepted socket connection from /10.0.8.4:34705 >> 2015-09-24 05:07:42,229 [myid:1] - WARN >> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@822] - Connection >> request from old client /10.0.8.4:34705; will be dropped if server is in r-o >> mode >> 2015-09-24 05:07:42,229 [myid:1] - INFO >> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client >> attempting to establish new session at /10.0.8.4:34705 >> 2015-09-24 05:07:42,292 [myid:1] - INFO >> [CommitProcessor:1:ZooKeeperServer@617] - Established session >> 0x14ffd3670130030 with negotiated timeout 20001 for client /10.0.8.4:34705 >> 2015-09-24 05:07:42,302 [myid:1] - WARN >> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end >> of stream exception >> EndOfStreamException: Unable to read additional data from client sessionid >> 0x14ffd3670130030, likely client has closed socket >> at >> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228) >> at >> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208) >> at java.lang.Thread.run(Thread.java:745) >> 2015-09-24 05:07:42,303 [myid:1] - INFO >> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed >> socket connection for client /10.0.8.4:34705 which had sessionid >> 0x14ffd3670130030 >> 2015-09-24 05:07:42,314 [myid:1] - ERROR >> [CommitProcessor:1:NIOServerCnxn@178] - Unexpected Exception: >> java.nio.channels.CancelledKeyException >> at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73) >> at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77) >> at >> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151) >> at >> org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081) >> at >> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404) >> at >> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74) >> 2015-09-24 05:07:42,334 [myid:1] - INFO >> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - >> Accepted socket connection from /10.0.8.4:34707 >> 2015-09-24 05:07:42,334 [myid:1] - WARN >> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@822] - Connection >> request from old client /10.0.8.4:34707; will be dropped if server is in r-o >> mode >> 2015-09-24 05:07:42,335 [myid:1] - INFO >> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client >> attempting to establish new session at /10.0.8.4:34707 >> 2015-09-24 05:07:42,357 [myid:1] - INFO >> [CommitProcessor:1:ZooKeeperServer@617] - Established session >> 0x14ffd3670130031 with negotiated timeout 20001 for client /10.0.8.4:34707 >> 2015-09-24 05:07:42,364 [myid:1] - WARN >> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end >> of stream exception >> EndOfStreamException: Unable to read additional data from client sessionid >> 0x14ffd3670130031, likely client has closed socket >> at >> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228) >> at >> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208) >> at java.lang.Thread.run(Thread.java:745) >> 2015-09-24 05:07:42,365 [myid:1] - INFO >> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed >> socket connection for client /10.0.8.4:34707 which had sessionid >> 0x14ffd3670130031 >> 2015-09-24 05:07:42,376 [myid:1] - ERROR >> [CommitProcessor:1:NIOServerCnxn@178] - Unexpected Exception: >> java.nio.channels.CancelledKeyException >> at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73) >> at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77) >> at >> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151) >> at >> org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081) >> at >> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404) >> at >> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74) >> >> Also there are >> 2015-09-24 06:29:54,459 [myid:1] - INFO >> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@139] - >> Shutting down >> 2015-09-24 06:29:54,459 [myid:1] - INFO >> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@441] - shutting down >> 2015-09-24 06:29:54,459 [myid:1] - INFO >> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@105] - >> Shutting down >> 2015-09-24 06:29:54,459 [myid:1] - INFO >> [FollowerRequestProcessor:1:FollowerRequestProcessor@95] - >> FollowerRequestProcessor exited loop! >> 2015-09-24 06:29:54,460 [myid:1] - INFO >> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@181] - Shutting down >> 2015-09-24 06:29:54,464 [myid:1] - INFO >> [CommitProcessor:1:CommitProcessor@150] - CommitProcessor exited loop! >> 2015-09-24 06:29:54,465 [myid:1] - INFO >> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@415] - >> shutdown of request processor complete >> 2015-09-24 06:29:54,466 [myid:1] - INFO >> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@209] - >> Shutting down >> 2015-09-24 06:29:54,466 [myid:1] - INFO >> [SyncThread:1:SyncRequestProcessor@187] - SyncRequestProcessor exited! >> 2015-09-24 06:29:54,466 [myid:1] - INFO >> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer@714] - LOOKING >> 2015-09-24 06:29:54,584 [myid:1] - INFO >> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - >> Accepted socket connection from /10.0.8.58:36137 >> 2015-09-24 06:29:54,584 [myid:1] - WARN >> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception >> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not >> running >> 2015-09-24 06:29:54,584 [myid:1] - INFO >> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed >> socket connection for client /10.0.8.58:36137 (no session established for >> client) >> 2015-09-24 06:29:54,679 [myid:1] - INFO >> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - >> Accepted socket connection from /10.0.8.57:57410 >> 2015-09-24 06:29:54,680 [myid:1] - WARN >> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception >> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not >> running >> 2015-09-24 06:29:54,680 [myid:1] - INFO >> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed >> socket connection for client /10.0.8.57:57410 (no session established for >> client) >> >> I also observed that hadoop-zkfc restarts very frequently. >> Any ideas what could be wrong? >> >> Thanks. >
