Andor,

Thanks for your time. I am waiting for the 3.5 stable version to upgrade.
The log says read timeout, right? What kind of packet or data is it
reading from the leader?
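For reference, the blocked call in the stack trace below is
Learner.readPacket(), which sits in the observer's main loop waiting for
the next jute-serialized QuorumPacket from the leader - a PING heartbeat,
an INFORM carrying a committed transaction, UPTODATE, and so on. A
hypothetical, standalone sketch of what that read is waiting on
(placeholder host, port, and timeout values; not the real ZooKeeper
classes):

    import java.io.BufferedInputStream;
    import java.io.DataInputStream;
    import java.io.IOException;
    import java.net.Socket;

    // Hypothetical illustration - not the actual ZooKeeper code. The
    // learner channel carries raw jute-serialized QuorumPackets: an int
    // type (PING, INFORM, UPTODATE, ...), a long zxid, a length-prefixed
    // byte[] payload, plus an auth list (skipped in this sketch).
    public class QuorumPacketReadSketch {
        public static void main(String[] args) throws IOException {
            // Placeholder host/port; learners talk to the leader's quorum port.
            try (Socket sock = new Socket("leader.example.com", 2888)) {
                // The real code sets this to tickTime * syncLimit after sync.
                sock.setSoTimeout(10_000);
                DataInputStream in = new DataInputStream(
                        new BufferedInputStream(sock.getInputStream()));

                // This readInt() is the bottom of the stack trace: it blocks
                // waiting for the type field of the next QuorumPacket. If the
                // leader stays silent past the timeout - not even a ping - it
                // throws java.net.SocketTimeoutException: Read timed out.
                int type = in.readInt();
                long zxid = in.readLong();
                int len = in.readInt();  // jute uses -1 for a null buffer
                byte[] payload = len > 0 ? new byte[len] : new byte[0];
                if (len > 0) in.readFully(payload);
                System.out.println("type=" + type
                        + " zxid=0x" + Long.toHexString(zxid)
                        + " payload=" + payload.length + " bytes");
            }
        }
    }

In quiet periods the "data" is just the leader's periodic ping, so the
timeout means even pings stopped arriving over the cross-DC link for the
whole window.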
Ram

On Wed, Jul 4, 2018, 12:24 AM Andor Molnar <an...@cloudera.com.invalid> wrote:

> Unfortunately I cannot imagine anything other than what Norbert already
> mentioned. If the followers were stable, a problem in the DC-to-DC link
> could explain why all the observers went down at the same moment. If the
> problem had been leader overload, the followers would have gone down
> along with the observers.
>
> If neither of these is the case, I'm afraid I cannot help further. I'm
> not aware of a similar existing issue. Maybe more senior devs can
> comment.
>
> However, your version is quite old. Most production clusters are running
> 3.4.6 or 3.4.9 as far as I know. You might want to upgrade to the latest
> stable version, which is 3.4.12 at the moment. 3.4.13 will be out soon
> as well.
>
> Regards,
> Andor
>
>
> On Tue, Jul 3, 2018 at 8:13 PM, rammohan ganapavarapu <
> rammohanga...@gmail.com> wrote:
>
> > Andor,
> >
> > The ZK version I use is zk_version 3.4.5-1392090, built on 09/30/2012
> > 17:52 GMT.
> > No auth or encryption is configured.
> > None of my network graphs show any dip or unusual pattern, which is
> > why I think there may not be any network issue. Those nodes are in the
> > cloud, so I am checking with the provider to see if there was any
> > network issue between the regions.
> >
> > Thanks,
> > Ram
> >
> >
> > On Tue, Jul 3, 2018 at 6:29 AM Andor Molnar <an...@cloudera.com.invalid>
> > wrote:
> >
> > > Hi Rammohan,
> > >
> > > Would you please elaborate on the details of your cluster setup?
> > > Which ZooKeeper version do you use?
> > > Do you use authentication / encryption?
> > > Would you please attach config files and log files of the other
> > > nodes, like the leader and followers?
> > >
> > > How did you make sure that there was no network problem at the time
> > > the issue happened?
> > > Would you please attach graphs / diagrams of the network traffic,
> > > including latency and bandwidth usage between the affected data
> > > centers?
> > >
> > > Regards,
> > > Andor
> > >
> > >
> > > On Tue, Jul 3, 2018 at 2:56 PM, rammohan ganapavarapu <
> > > rammohanga...@gmail.com> wrote:
> > >
> > > > Yes, I am sure there are no network issues. If the leader were
> > > > busy in GC, the followers in the same DC would have been shut down
> > > > as well, right? But that wasn't the case.
> > > >
> > > > On Tue, Jul 3, 2018, 1:56 AM Norbert Kalmar
> > > > <nkal...@cloudera.com.invalid> wrote:
> > > >
> > > > > Hi Ram,
> > > > >
> > > > > Are you sure there was no network error? To me, this looks like
> > > > > it could be due to failed heartbeats (as shutdown was called
> > > > > after the timeout).
> > > > >
> > > > > It is also possible the leader was busy (maybe a garbage
> > > > > collection pause?) - especially if you store big(ish) chunks of
> > > > > data in ZooKeeper. (There is actually a plan to integrate
> > > > > JVMPauseMonitor into ZooKeeper for this reason.)
> > > > >
> > > > > Regards,
> > > > > Norbert
> > > > >
> > > > > On Mon, Jul 2, 2018 at 9:13 PM rammohan ganapavarapu <
> > > > > rammohanga...@gmail.com> wrote:
> > > > >
> > > > > > All,
> > > > > >
> > > > > > I have a multi data-center ldap cluster setup, with the other
> > > > > > data center running all observers. All of a sudden, all the
> > > > > > observer threads went down with the following message. Any
> > > > > > idea why they went down? We don't see any network-related
> > > > > > issues between the data centers.
> > > > > >
> > > > > > 2018-06-29 05:32:59,036 [myid:222] - WARN
> > > > > > [QuorumPeer[myid=222]/0:0:0:0:0:0:0:0:2181:Observer@79] -
> > > > > > Exception when observing the leader
> > > > > > java.net.SocketTimeoutException: Read timed out
> > > > > >   at java.net.SocketInputStream.socketRead0(Native Method)
> > > > > >   at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> > > > > >   at java.net.SocketInputStream.read(SocketInputStream.java:170)
> > > > > >   at java.net.SocketInputStream.read(SocketInputStream.java:141)
> > > > > >   at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> > > > > >   at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> > > > > >   at java.io.DataInputStream.readInt(DataInputStream.java:387)
> > > > > >   at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> > > > > >   at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> > > > > >   at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
> > > > > >   at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
> > > > > >   at org.apache.zookeeper.server.quorum.Observer.observeLeader(Observer.java:75)
> > > > > >   at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:727)
> > > > > > 2018-06-29 05:32:59,244 [myid:222] - INFO
> > > > > > [QuorumPeer[myid=222]/0:0:0:0:0:0:0:0:2181:Observer@137] -
> > > > > > shutdown called
> > > > > > java.lang.Exception: shutdown Observer
> > > > > >   at org.apache.zookeeper.server.quorum.Observer.shutdown(Observer.java:137)
> > > > > >   at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:731)
> > > > > >
> > > > > > Thanks,
> > > > > > Ram
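On Norbert's heartbeat point above: in the 3.4 branch, once a learner has
finished syncing, the read timeout on its leader socket is
tickTime * syncLimit, so the exception in the log means the observer heard
nothing at all from the leader - not even a ping - for that entire window.
A minimal sketch of the arithmetic, assuming illustrative zoo.cfg values
rather than the real config:

    public class ObserverTimeoutMath {
        public static void main(String[] args) {
            // Illustrative zoo.cfg values (assumptions, not the real config):
            int tickTime = 2000; // ms per tick
            int syncLimit = 5;   // ticks a learner may fall behind the leader

            // In 3.4, Learner sets SO_TIMEOUT on the leader socket to
            // tickTime * syncLimit once syncing is done; a link (or leader)
            // that stays quiet longer than this trips the
            // SocketTimeoutException seen in the log above.
            int readTimeoutMs = tickTime * syncLimit;
            System.out.println("observer read timeout = " + readTimeoutMs + " ms");
        }
    }

A leader-side GC pause longer than that product would normally take out
the followers too, which supports Ram's point that a cross-DC link stall
is the more likely culprit; correlating the leader's GC log with the
05:32:59 timestamps above should settle it.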