Hi I have recently come across something I have not seen before: all 5 nodes of 
a ZK cluster shutting themselves down at around the same time.  All servers 
worked fine for a long time. Ranging from 2016-01-17 19:31 to 19:33:xx, the 
servers complained about not being able to follow the leader and the zk 
processes shut themselves down. One of the servers (actual host as in 
bare-metal) was somehow powered off at 19:33.  The ZK cluster was down for 
approximately two minutes. Leader election took 90 seconds. The leader then 
expired more than 100 client connections. A snippet from the log file of one of 
the nodes:2016-01-17 19:32:17,081 [myid:1] - WARN  
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when 
following the leaderjava.net.SocketTimeoutException: Read timed out    at 
java.net.SocketInputStream.socketRead0(Native Method)    at 
java.net.SocketInputStream.socketRead(SocketInputStream.java:116)    at 
java.net.SocketInputStream.read(SocketInputStream.java:170)    at 
java.net.SocketInputStream.read(SocketInputStream.java:141)    at 
java.io.BufferedInputStream.fill(BufferedInputStream.java:246)    at 
java.io.BufferedInputStream.read(BufferedInputStream.java:265)    at 
java.io.DataInputStream.readInt(DataInputStream.java:387)    at 
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)    at 
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
    at 
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)    
at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)    
at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)   
 at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)2016-01-17
 19:32:17,081 [myid:1] - INFO  
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown 
calledjava.lang.Exception: shutdown Follower    at 
org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)    at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:790)Followed 
by closing client connections. I have analysed garbage collection behaviour on 
these five nodes and found that on that day two of the nodes underwent one GC 
collection of 80 ms each.  What does the read time out indicate? Does this mean 
that there was a network issue or was there maybe too much data to sync on in a 
timely manner? Why would all nodes shut down at pretty much the same time?   
All the best,  -- Bruno                                        

Reply via email to