Re: zk server falling apart from quorum due to connection loss and couldn't connect back

Deepak Jagtap Mon, 27 Jan 2014 11:34:27 -0800

Hi German,

Thanks for the followup!
I have log files for all the servers, but they are quite big (greater than
25MB), so I could not send them through mail.
Is it OK if I file a bug on this and upload the logs there?


Thanks & Regards,
Deepak
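
Since the full logs are too large to mail, the window requested in the quoted message below (a couple of minutes before and after the connection loss) could be cut out of each log first. The following is only a minimal sketch: it assumes every log entry starts with a log4j ISO8601 timestamp such as "2014-01-22 10:15:30,123" (a made-up value, not one taken from these logs), and the names parse_ts and extract_window are hypothetical helpers, not part of ZooKeeper.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout at the start of each log entry (log4j ISO8601).
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LEN = 23  # length of "YYYY-MM-DD HH:MM:SS,mmm"


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:TS_LEN], TS_FORMAT)
    except ValueError:
        return None


def extract_window(lines, center, minutes=2):
    """Yield the lines logged within +/- `minutes` of `center`.

    Lines without a timestamp (for example stack-trace continuations)
    are kept or dropped together with the last timestamped line.
    """
    lo = center - timedelta(minutes=minutes)
    hi = center + timedelta(minutes=minutes)
    keep = False
    for line in lines:
        ts = parse_ts(line)
        if ts is not None:
            keep = lo <= ts <= hi
        if keep:
            yield line
```

To use it, open a server log, pass the file object as `lines` together with the approximate time of the connection loss as `center`, and write the yielded lines to a smaller file for upload.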



On Sun, Jan 26, 2014 at 1:53 AM, German Blanco <
[email protected]> wrote:

> Hello Deepak,
>
> sorry for the slow response.
> I can't figure out what might be going on here without the log files.
> The traces you see in S2 do not indicate any problem, as far as I see. It
> seems that you have a client running in S2 that tries to connect to that
> server. Since S2 hasn't been able to join a quorum, the server attending
> clients hasn't been started and the connection is rejected.
> Maybe, to start with, you could upload the traces around the
> connection loss between S2 and S3 (say a couple of minutes before and
> after).
>
> Regards,
>
> German.
>
>
> On Thu, Jan 23, 2014 at 8:42 PM, Deepak Jagtap <[email protected]> wrote:
>
> > Hi,
> >
> > zoo.cfg is :
> >
> > maxClientCnxns=50
> > # The number of milliseconds of each tick
> > tickTime=2000
> > # The number of ticks that the initial
> > # synchronization phase can take
> > initLimit=10
> > # The number of ticks that can pass between
> > # sending a request and getting an acknowledgement
> > syncLimit=5
> > # the directory where the snapshot is stored.
> > dataDir=/var/lib/zookeeper
> > # the port at which the clients will connect
> > clientPort=2181
> >
> > autopurge.snapRetainCount=3
> > autopurge.purgeInterval=1
> > dynamicConfigFile=/etc/maxta/zookeeper/conf/zoo.cfg.dynamic
> >
> >
> >
> > zoo.cfg.dynamic is:
> >
> > server.1=169.254.1.1:2888:3888:participant;0.0.0.0:2181
> > server.2=169.254.1.2:2888:3888:participant;0.0.0.0:2181
> > server.3=169.254.1.3:2888:3888:participant;0.0.0.0:2181
> > version=1
> >
> >
> > Thanks & Regards,
> > Deepak
> >
> >
> > On Thu, Jan 23, 2014 at 11:30 AM, German Blanco <
> > [email protected]> wrote:
> >
> > > Sorry but the attachment didn't make it through.
> > > It might be safer to put the files somewhere on the web and send a link.
> > >
> > >
> > > On Thu, Jan 23, 2014 at 8:00 PM, Deepak Jagtap <[email protected]> wrote:
> > >
> > > > Hi German,
> > > >
> > > > Please find zookeeper config files attached.
> > > >
> > > > Thanks & Regards,
> > > > Deepak
> > > >
> > > >
> > > > On Thu, Jan 23, 2014 at 12:59 AM, German Blanco <
> > > > [email protected]> wrote:
> > > >
> > > >> Hello!
> > > >>
> > > >> Could you please post your configuration files?
> > > >>
> > > >> Regards,
> > > >>
> > > >> German.
> > > >>
> > > >>
> > > > >> On Thu, Jan 23, 2014 at 2:28 AM, Deepak Jagtap <[email protected]> wrote:
> > > >>
> > > >> > Hi All,
> > > >> >
> > > > >> > We have deployed zookeeper version 3.5.0.1515976, with 3 zk servers in
> > > > >> > the quorum.
> > > > >> > The problem we are facing is that one zookeeper server falls out of the
> > > > >> > quorum and never rejoins the cluster until we restart the zookeeper
> > > > >> > server on that node.
> > > >> >
> > > > >> > Our interpretation from zookeeper logs on all nodes is as follows:
> > > > >> > (For simplicity, assume S1 => zk server 1, S2 => zk server 2, S3 => zk
> > > > >> > server 3)
> > > > >> > Initially S3 is the leader while S1 and S2 are followers.
> > > >> >
> > > > >> > S2 hits 46 sec latency while fsyncing the write-ahead log, which results
> > > > >> > in loss of connection with S3.
> > > > >> > S3 in turn prints the following error message:
> > > >> >
> > > >> > Unexpected exception causing shutdown while sock still open
> > > >> > java.net.SocketTimeoutException: Read timed out
> > > >> > Stack trace
> > > >> > ******* GOODBYE /169.254.1.2:47647(S2) ********
> > > >> >
> > > > >> > S2 in this case closes the connection with S3 (leader) and shuts down
> > > > >> > the follower with the following log messages:
> > > >> > Closing connection to leader, exception during packet send
> > > >> > java.net.SocketException: Socket close
> > > >> > Follower@194] - shutdown called
> > > >> > java.lang.Exception: shutdown Follower
> > > >> >
> > > > >> > After this point S3 could never reestablish the connection with S2, and
> > > > >> > the leader election mechanism keeps failing. S3 now keeps printing the
> > > > >> > following message repeatedly:
> > > >> > Cannot open channel to 2 at election address /169.254.1.2:3888
> > > >> > java.net.ConnectException: Connection refused.
> > > >> >
> > > > >> > While S3 is in this state, S2 keeps printing the following messages
> > > > >> > repeatedly:
> > > > >> > INFO [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /127.0.0.1:60667
> > > > >> > Exception causing close of session 0x0: ZooKeeperServer not running
> > > > >> > Closed socket connection for client /127.0.0.1:60667 (no session established for client)
> > > >> >
> > > > >> > Leader election never completes successfully, causing S2 to fall out of
> > > > >> > the quorum.
> > > > >> > S2 was out of quorum for almost 1 week.
> > > >> >
> > > > >> > While debugging this issue, we found that neither the election port nor
> > > > >> > the peer connection port on S2 could be reached via telnet from any of
> > > > >> > the nodes (S1, S2, S3). Network connectivity is not the issue. Later, we
> > > > >> > restarted the ZK server on S2 (service zookeeper-server restart); after
> > > > >> > that we could telnet to both ports, and S2 joined the ensemble after a
> > > > >> > leader election attempt.
> > > > >> > Any idea what might be forcing S2 into a state where it won't accept any
> > > > >> > connections on the leader election and peer connection ports?
> > > >> >
> > > > >> > Should I file a jira on this and upload all the log files when
> > > > >> > submitting it, as the log files are close to 250MB each?
> > > >> >
> > > >> > Thanks & Regards,
> > > >> > Deepak
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>
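
As a side note on the numbers in this thread, the timeouts implied by the quoted zoo.cfg can be worked out directly. The sketch below assumes that the leader's read timeout for an already synced follower is tickTime * syncLimit and that the initial sync is allowed tickTime * initLimit; that is my reading of the comments in the config, not something the thread confirms. The helper parse_cfg is hypothetical. The only inputs are the three values from zoo.cfg and the 46 sec fsync latency reported for S2.

```python
def parse_cfg(text):
    """Parse key=value lines, skipping blanks and '#' comments."""
    cfg = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        cfg[key.strip()] = value.strip()
    return cfg


# Values copied from the zoo.cfg quoted in this thread.
ZOO_CFG = """
tickTime=2000
initLimit=10
syncLimit=5
"""

cfg = parse_cfg(ZOO_CFG)
tick_ms = int(cfg["tickTime"])

# Assumption: both limits are expressed in ticks and multiply tickTime.
init_timeout_ms = tick_ms * int(cfg["initLimit"])
sync_timeout_ms = tick_ms * int(cfg["syncLimit"])

# The fsync latency reported for S2 in the original post.
fsync_stall_ms = 46 * 1000

print("init timeout (ms):", init_timeout_ms)
print("sync timeout (ms):", sync_timeout_ms)
print("stall exceeds sync timeout:", fsync_stall_ms > sync_timeout_ms)
```

Under these assumptions the sync timeout is 10 seconds, so a 46 second stall on S2 would be enough on its own for S3 to see "Read timed out" and drop the follower. It does not explain why S2 stopped accepting connections on ports 2888 and 3888 afterwards.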
