Re: Follower never recovers and keeps saying ZooKeeperServer not running

2018-07-03 Thread Steph van Schalkwyk
I've seen this bur can't remember what caused it.
What's your zNode size?
steph

On Tue, Jul 3, 2018, 9:17 PM Benjamin Jaton 
wrote:

> Hello,
>
> I'm wondering what can cause a ZK follower to check out like this:
>
> 2018-07-03T13:43:28,814 [myid:] - ERROR [LearnerHandler-/10.0.0.248:40282
> :LearnerHandler@620] - Unexpected exception causing shutdown while sock
> still open
> java.io.EOFException: null
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> ~[?:1.8.0_171]
> at
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> ~[zookeeper-3.5.0-alpha.jar:3.5.2-alpha-SNAPSHOT-1702518]
> at
>
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> ~[zookeeper-3.5.0-alpha.jar:?]
> at
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
> ~[zookeeper-3.5.0-alpha.jar:3.5.2-alpha-SNAPSHOT-1702518]
> at
>
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:514)
> [zookeeper-3.5.0-alpha.jar:?]
> 2018-07-03T13:43:28,814 [myid:] - WARN  [LearnerHandler-/10.0.0.248:40282
> :LearnerHandler@635] - *** GOODBYE /10.0.0.248:40282 
> 2018-07-03T13:43:28,814 [myid:] - ERROR [LearnerHandler-/10.0.0.237:54408
> :LearnerHandler@620] - Unexpected exception causing shutdown while sock
> still open
> java.io.EOFException: null
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> ~[?:1.8.0_171]
> at
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> ~[zookeeper-3.5.0-alpha.jar:3.5.2-alpha-SNAPSHOT-1702518]
> at
>
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> ~[zookeeper-3.5.0-alpha.jar:?]
> at
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
> ~[zookeeper-3.5.0-alpha.jar:3.5.2-alpha-SNAPSHOT-1702518]
> at
>
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:514)
> [zookeeper-3.5.0-alpha.jar:?]
> 2018-07-03T13:43:28,814 [myid:] - INFO
> [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181
> :NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from
> /
> 10.0.0.237:46754
> 2018-07-03T13:43:28,814 [myid:] - WARN  [LearnerHandler-/10.0.0.237:54408
> :LearnerHandler@635] - *** GOODBYE /10.0.0.237:54408 
> 2018-07-03T13:43:28,814 [myid:] - INFO
> [ConnnectionExpirer:MBeanRegistry@128] - Unregister MBean
>
> [org.apache.ZooKeeperService:name0=ReplicatedServer_id2,name1=replica.2,name2=Leader,name3=Connections,name4=10.0.0.111,name5=0x25f3766]
> 2018-07-03T13:43:28,815 [myid:] - INFO
> [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181
> :NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from
> /
> 10.0.0.248:33198
> 2018-07-03T13:43:28,815 [myid:] - INFO
> [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181
> :NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from
> /
> 10.0.0.237:46780
> 2018-07-03T13:43:28,815 [myid:] - INFO
> [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181
> :NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from
> /
> 10.0.0.248:33268
> 2018-07-03T13:43:28,815 [myid:] - INFO
> [ConnnectionExpirer:NIOServerCnxn@606] - Closed socket connection for
> client /10.0.0.111:39162 which had sessionid 0x25f3766
> 2018-07-03T13:43:28,815 [myid:] - INFO
> [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181
> :NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from
> /
> 10.0.0.237:46810
> 2018-07-03T13:43:28,815 [myid:] - INFO
> [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181
> :NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from
> /
> 10.0.0.237:46824
> 2018-07-03T13:43:28,815 [myid:] - INFO
> [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181
> :NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from
> /
> 10.0.0.237:46826
> 2018-07-03T13:43:28,815 [myid:] - INFO
>
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@613
> ]
> - Shutting down
> 2018-07-03T13:43:28,815 [myid:] - INFO
> [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181
> :NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from
> /
> 10.0.0.248:33316
> 2018-07-03T13:43:28,815 [myid:] - INFO
> [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181
> :NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from
> /
> 10.0.0.248:33548
> 2018-07-03T13:43:28,815 [myid:] - INFO
>
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@619
> ]
> - Shutdown called
> java.lang.Exception: shutdown Leader! reason: Not sufficient followers
> synced, only synced with sids: [ [2] ]
> at
> org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:619)
> [zookeeper-3.5.0-alpha.jar:?]
> at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:590)
> [zookeeper-3.5.0-alpha.jar:?]
> at
> 

Follower never recovers and keeps saying ZooKeeperServer not running

2018-07-03 Thread Benjamin Jaton
Hello,

I'm wondering what can cause a ZK follower to check out like this:

2018-07-03T13:43:28,814 [myid:] - ERROR [LearnerHandler-/10.0.0.248:40282
:LearnerHandler@620] - Unexpected exception causing shutdown while sock
still open
java.io.EOFException: null
at java.io.DataInputStream.readInt(DataInputStream.java:392)
~[?:1.8.0_171]
at
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
~[zookeeper-3.5.0-alpha.jar:3.5.2-alpha-SNAPSHOT-1702518]
at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
~[zookeeper-3.5.0-alpha.jar:?]
at
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
~[zookeeper-3.5.0-alpha.jar:3.5.2-alpha-SNAPSHOT-1702518]
at
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:514)
[zookeeper-3.5.0-alpha.jar:?]
2018-07-03T13:43:28,814 [myid:] - WARN  [LearnerHandler-/10.0.0.248:40282
:LearnerHandler@635] - *** GOODBYE /10.0.0.248:40282 
2018-07-03T13:43:28,814 [myid:] - ERROR [LearnerHandler-/10.0.0.237:54408
:LearnerHandler@620] - Unexpected exception causing shutdown while sock
still open
java.io.EOFException: null
at java.io.DataInputStream.readInt(DataInputStream.java:392)
~[?:1.8.0_171]
at
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
~[zookeeper-3.5.0-alpha.jar:3.5.2-alpha-SNAPSHOT-1702518]
at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
~[zookeeper-3.5.0-alpha.jar:?]
at
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
~[zookeeper-3.5.0-alpha.jar:3.5.2-alpha-SNAPSHOT-1702518]
at
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:514)
[zookeeper-3.5.0-alpha.jar:?]
2018-07-03T13:43:28,814 [myid:] - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181
:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /
10.0.0.237:46754
2018-07-03T13:43:28,814 [myid:] - WARN  [LearnerHandler-/10.0.0.237:54408
:LearnerHandler@635] - *** GOODBYE /10.0.0.237:54408 
2018-07-03T13:43:28,814 [myid:] - INFO
[ConnnectionExpirer:MBeanRegistry@128] - Unregister MBean
[org.apache.ZooKeeperService:name0=ReplicatedServer_id2,name1=replica.2,name2=Leader,name3=Connections,name4=10.0.0.111,name5=0x25f3766]
2018-07-03T13:43:28,815 [myid:] - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181
:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /
10.0.0.248:33198
2018-07-03T13:43:28,815 [myid:] - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181
:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /
10.0.0.237:46780
2018-07-03T13:43:28,815 [myid:] - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181
:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /
10.0.0.248:33268
2018-07-03T13:43:28,815 [myid:] - INFO
[ConnnectionExpirer:NIOServerCnxn@606] - Closed socket connection for
client /10.0.0.111:39162 which had sessionid 0x25f3766
2018-07-03T13:43:28,815 [myid:] - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181
:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /
10.0.0.237:46810
2018-07-03T13:43:28,815 [myid:] - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181
:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /
10.0.0.237:46824
2018-07-03T13:43:28,815 [myid:] - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181
:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /
10.0.0.237:46826
2018-07-03T13:43:28,815 [myid:] - INFO
[QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@613]
- Shutting down
2018-07-03T13:43:28,815 [myid:] - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181
:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /
10.0.0.248:33316
2018-07-03T13:43:28,815 [myid:] - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181
:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /
10.0.0.248:33548
2018-07-03T13:43:28,815 [myid:] - INFO
[QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@619]
- Shutdown called
java.lang.Exception: shutdown Leader! reason: Not sufficient followers
synced, only synced with sids: [ [2] ]
at
org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:619)
[zookeeper-3.5.0-alpha.jar:?]
at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:590)
[zookeeper-3.5.0-alpha.jar:?]
at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1077)
[zookeeper-3.5.0-alpha.jar:?]
2018-07-03T13:43:28,815 [myid:] - INFO
[QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):NIOServerCnxn@606]
- Closed socket connection for client /10.0.0.248:33198 (no session
established for client)
2018-07-03T13:43:28,816 [myid:] - INFO

Re: Observer went down with Read timed out exception

2018-07-03 Thread rammohan ganapavarapu
Andor,

Zk  version that i use is zk_version 3.4.5-1392090, built on 09/30/2012
17:52 GMT
No Auth or encryption config
None my of network graphs showing any dip or unusual pattern thats why i am
thinking there may not be any n/w issue. I have those nodes in cloud so
checking with them to see if any n/w issue between regions.

Thanks,
Ram


On Tue, Jul 3, 2018 at 6:29 AM Andor Molnar 
wrote:

> Hi Rammohan,
>
> Would you please elaborate on the details of your cluster setup?
> Which ZooKeeper version do you use?
> Do you use authentication / encryption?
> Would you please attach config files and log files of other nodes like
> leader and followers?
>
> How did you make sure that there was no network problem at the time when
> issue happened?
> Would you please attach graphs / diagrams on the network traffic including
> latency and bandwidth usage between the affected data centers?
>
> Regards,
> Andor
>
>
>
>
> On Tue, Jul 3, 2018 at 2:56 PM, rammohan ganapavarapu <
> rammohanga...@gmail.com> wrote:
>
> > Yes I am sure there is no network issues, if leader is busy in GC
> followers
> > on the same DC would have been shutdown as we right but it wasn't the
> case.
> >
> > On Tue, Jul 3, 2018, 1:56 AM Norbert Kalmar  >
> > wrote:
> >
> > > Hi Ram,
> > >
> > > Are you sure there were no network error? For me, this looks like it
> > could
> > > be due to failed heartbeats (as shutdown was called after the timeout).
> > >
> > > It is also possible the leader was busy (maybe garbage collection
> caused
> > > pause?) - especially if you store big(ish) chunks of data in ZooKeeper.
> > > (There is plan to integrate JVMPauseMonitor to ZooKeeper for this
> reason
> > > actually).
> > >
> > > Regards,
> > > Norbert
> > >
> > > On Mon, Jul 2, 2018 at 9:13 PM rammohan ganapavarapu <
> > > rammohanga...@gmail.com> wrote:
> > >
> > > > All,
> > > >
> > > > I have multi data-center ldap cluster setup with other data-center
> with
> > > all
> > > > observers all of sudden all the observer threads went down with the
> > > > following message, any idea why they went down? We don't see any
> > network
> > > > related issues between data-centers.
> > > >
> > > >
> > > > 2018-06-29 05:32:59,036 [myid:222] - WARN
> > > > [QuorumPeer[myid=222]/0:0:0:0:0:0:0:0:2181:Observer@79] - Exception
> > when
> > > > observing the leader
> > > > java.net.SocketTimeoutException: Read timed out
> > > > at java.net.SocketInputStream.socketRead0(Native Method)
> > > > at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> > > > at java.net.SocketInputStream.read(SocketInputStream.java:170)
> > > > at java.net.SocketInputStream.read(SocketInputStream.java:141)
> > > > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> > > > at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> > > > at java.io.DataInputStream.readInt(DataInputStream.java:387)
> > > > at org.apache.jute.BinaryInputArchive.readInt(
> > BinaryInputArchive.java:63)
> > > > at
> > > >
> > > >
> > > org.apache.zookeeper.server.quorum.QuorumPacket.
> > deserialize(QuorumPacket.java:83)
> > > > at
> > > >
> > > org.apache.jute.BinaryInputArchive.readRecord(
> > BinaryInputArchive.java:108)
> > > > at
> > > org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
> > > > at
> > > >
> > > org.apache.zookeeper.server.quorum.Observer.observeLeader(
> > Observer.java:75)
> > > > at org.apache.zookeeper.server.quorum.QuorumPeer.run(
> > QuorumPeer.java:727)
> > > > 2018-06-29 05:32:59,244 [myid:222] - INFO
> > > > [QuorumPeer[myid=222]/0:0:0:0:0:0:0:0:2181:Observer@137] - shutdown
> > > called
> > > > java.lang.Exception: shutdown Observer
> > > > at
> > > org.apache.zookeeper.server.quorum.Observer.shutdown(Observer.java:137)
> > > > at org.apache.zookeeper.server.quorum.QuorumPeer.run(
> > QuorumPeer.java:731)
> > > >
> > > >
> > > > Thanks,
> > > > Ram
> > > >
> > >
> >
>


Re: Observer went down with Read timed out exception

2018-07-03 Thread rammohan ganapavarapu
Yes I am sure there is no network issues, if leader is busy in GC followers
on the same DC would have been shutdown as we right but it wasn't the case.

On Tue, Jul 3, 2018, 1:56 AM Norbert Kalmar 
wrote:

> Hi Ram,
>
> Are you sure there were no network error? For me, this looks like it could
> be due to failed heartbeats (as shutdown was called after the timeout).
>
> It is also possible the leader was busy (maybe garbage collection caused
> pause?) - especially if you store big(ish) chunks of data in ZooKeeper.
> (There is plan to integrate JVMPauseMonitor to ZooKeeper for this reason
> actually).
>
> Regards,
> Norbert
>
> On Mon, Jul 2, 2018 at 9:13 PM rammohan ganapavarapu <
> rammohanga...@gmail.com> wrote:
>
> > All,
> >
> > I have multi data-center ldap cluster setup with other data-center with
> all
> > observers all of sudden all the observer threads went down with the
> > following message, any idea why they went down? We don't see any network
> > related issues between data-centers.
> >
> >
> > 2018-06-29 05:32:59,036 [myid:222] - WARN
> > [QuorumPeer[myid=222]/0:0:0:0:0:0:0:0:2181:Observer@79] - Exception when
> > observing the leader
> > java.net.SocketTimeoutException: Read timed out
> > at java.net.SocketInputStream.socketRead0(Native Method)
> > at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> > at java.net.SocketInputStream.read(SocketInputStream.java:170)
> > at java.net.SocketInputStream.read(SocketInputStream.java:141)
> > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> > at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> > at java.io.DataInputStream.readInt(DataInputStream.java:387)
> > at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> > at
> >
> >
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> > at
> >
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
> > at
> org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
> > at
> >
> org.apache.zookeeper.server.quorum.Observer.observeLeader(Observer.java:75)
> > at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:727)
> > 2018-06-29 05:32:59,244 [myid:222] - INFO
> > [QuorumPeer[myid=222]/0:0:0:0:0:0:0:0:2181:Observer@137] - shutdown
> called
> > java.lang.Exception: shutdown Observer
> > at
> org.apache.zookeeper.server.quorum.Observer.shutdown(Observer.java:137)
> > at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:731)
> >
> >
> > Thanks,
> > Ram
> >
>


Re: Observer went down with Read timed out exception

2018-07-03 Thread Norbert Kalmar
Hi Ram,

Are you sure there were no network error? For me, this looks like it could
be due to failed heartbeats (as shutdown was called after the timeout).

It is also possible the leader was busy (maybe garbage collection caused
pause?) - especially if you store big(ish) chunks of data in ZooKeeper.
(There is plan to integrate JVMPauseMonitor to ZooKeeper for this reason
actually).

Regards,
Norbert

On Mon, Jul 2, 2018 at 9:13 PM rammohan ganapavarapu <
rammohanga...@gmail.com> wrote:

> All,
>
> I have multi data-center ldap cluster setup with other data-center with all
> observers all of sudden all the observer threads went down with the
> following message, any idea why they went down? We don't see any network
> related issues between data-centers.
>
>
> 2018-06-29 05:32:59,036 [myid:222] - WARN
> [QuorumPeer[myid=222]/0:0:0:0:0:0:0:0:2181:Observer@79] - Exception when
> observing the leader
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:170)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> at java.io.DataInputStream.readInt(DataInputStream.java:387)
> at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> at
>
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> at
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
> at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
> at
> org.apache.zookeeper.server.quorum.Observer.observeLeader(Observer.java:75)
> at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:727)
> 2018-06-29 05:32:59,244 [myid:222] - INFO
> [QuorumPeer[myid=222]/0:0:0:0:0:0:0:0:2181:Observer@137] - shutdown called
> java.lang.Exception: shutdown Observer
> at org.apache.zookeeper.server.quorum.Observer.shutdown(Observer.java:137)
> at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:731)
>
>
> Thanks,
> Ram
>