OK, that might be. I added a comment in the JIRA case that you created (ZOOKEEPER-1869, for others' reference) stating that at some point the logs say "leaving the listener" for the election in server 2, and it is not clear whether the server restarts the listener after that. I think it is better to continue the discussion in the JIRA case and leave this thread here.
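On the question of whether the listener comes back: whether server 2 is still accepting connections on the quorum and election ports can be checked directly from any of the nodes. Below is a minimal sketch of that check, not something from this thread; the addresses and the 2888/3888 ports are taken from the zoo.cfg.dynamic quoted further down, the class name is just illustrative, and it does the same thing as the telnet check Deepak describes below.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class QuorumPortProbe {
    public static void main(String[] args) {
        // Addresses reused from the zoo.cfg.dynamic quoted below.
        String[] hosts = {"169.254.1.1", "169.254.1.2", "169.254.1.3"};
        int[] ports = {2888, 3888}; // peer (quorum) port and election port
        for (String host : hosts) {
            for (int port : ports) {
                try (Socket s = new Socket()) {
                    // Succeeds only if something is listening, i.e. the equivalent of telnet.
                    s.connect(new InetSocketAddress(host, port), 2000);
                    System.out.println(host + ":" + port + " -> listening");
                } catch (IOException e) {
                    System.out.println(host + ":" + port + " -> " + e.getMessage());
                }
            }
        }
    }
}

If the connect attempt to 169.254.1.2:3888 keeps failing while the process on server 2 is still alive, that would match the listener having exited without being restarted.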
On Tue, Jan 28, 2014 at 9:44 PM, Deepak Jagtap <[email protected]> wrote:

> Hi German,
>
> I went through the zookeeper logs again and it looks like a zookeeper bug
> to me. Leader election was initiated and it never completed, as one
> zookeeper server went into a zombie (hung) state.
> Please note that zookeeper was running on all the nodes when this happened.
>
> Thanks & Regards,
> Deepak
>
> On Mon, Jan 27, 2014 at 1:12 PM, Deepak Jagtap <[email protected]> wrote:
>
>> Dropbox link for log files:
>> https://dl.dropboxusercontent.com/u/36429721/zklog.tgz
>>
>> On Mon, Jan 27, 2014 at 1:12 PM, Deepak Jagtap <[email protected]> wrote:
>>
>>> Jira has an attachment limit of 10MB, hence I uploaded the log files on
>>> Dropbox.
>>>
>>> Please refer to events close to the "2014-01-07 10:34:01" timestamp on
>>> all nodes.
>>>
>>> Thanks & Regards,
>>> Deepak
>>>
>>> On Mon, Jan 27, 2014 at 12:34 PM, German Blanco <[email protected]> wrote:
>>>
>>>> I don't see why it would be a problem for anybody.
>>>> If this happens not to be a problem in ZooKeeper, we can always close
>>>> the bug case.
>>>>
>>>> On Mon, Jan 27, 2014 at 8:33 PM, Deepak Jagtap <[email protected]> wrote:
>>>>
>>>>> Hi German,
>>>>>
>>>>> Thanks for the followup!
>>>>> I have log files for all the servers and they are quite big (greater
>>>>> than 25MB), hence I could not send the log files through mail.
>>>>> Is it ok if I file a bug on this and upload the logs there?
>>>>>
>>>>> Thanks & Regards,
>>>>> Deepak
>>>>>
>>>>> On Sun, Jan 26, 2014 at 1:53 AM, German Blanco <[email protected]> wrote:
>>>>>
>>>>>> Hello Deepak,
>>>>>>
>>>>>> sorry for the slow response.
>>>>>> I can't figure out what might be going on here without the log files.
>>>>>> The traces you see in S2 do not indicate any problem, as far as I can
>>>>>> see. It seems that you have a client running in S2 that tries to
>>>>>> connect to that server. Since S2 hasn't been able to join a quorum,
>>>>>> the server attending clients hasn't been started and the connection
>>>>>> is rejected.
>>>>>> Maybe, to start with, you could upload the traces around the
>>>>>> connection loss between S2 and S3 (say a couple of minutes before
>>>>>> and after).
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> German.
>>>>>>
>>>>>> On Thu, Jan 23, 2014 at 8:42 PM, Deepak Jagtap <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> zoo.cfg is:
>>>>>>>
>>>>>>> maxClientCnxns=50
>>>>>>> # The number of milliseconds of each tick
>>>>>>> tickTime=2000
>>>>>>> # The number of ticks that the initial
>>>>>>> # synchronization phase can take
>>>>>>> initLimit=10
>>>>>>> # The number of ticks that can pass between
>>>>>>> # sending a request and getting an acknowledgement
>>>>>>> syncLimit=5
>>>>>>> # the directory where the snapshot is stored.
>>>>>>> dataDir=/var/lib/zookeeper
>>>>>>> # the port at which the clients will connect
>>>>>>> clientPort=2181
>>>>>>>
>>>>>>> autopurge.snapRetainCount=3
>>>>>>> autopurge.purgeInterval=1
>>>>>>> dynamicConfigFile=/etc/maxta/zookeeper/conf/zoo.cfg.dynamic
>>>>>>>
>>>>>>> zoo.cfg.dynamic is:
>>>>>>>
>>>>>>> server.1=169.254.1.1:2888:3888:participant;0.0.0.0:2181
>>>>>>> server.2=169.254.1.2:2888:3888:participant;0.0.0.0:2181
>>>>>>> server.3=169.254.1.3:2888:3888:participant;0.0.0.0:2181
>>>>>>> version=1
>>>>>>>
>>>>>>> Thanks & Regards,
>>>>>>> Deepak
>>>>>>>
>>>>>>> On Thu, Jan 23, 2014 at 11:30 AM, German Blanco <[email protected]> wrote:
>>>>>>>
>>>>>>>> Sorry but the attachment didn't make it through.
>>>>>>>> It might be safer to put the files somewhere on the web and send a
>>>>>>>> link.
>>>>>>>>
>>>>>>>> On Thu, Jan 23, 2014 at 8:00 PM, Deepak Jagtap <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi German,
>>>>>>>>>
>>>>>>>>> Please find the zookeeper config files attached.
>>>>>>>>>
>>>>>>>>> Thanks & Regards,
>>>>>>>>> Deepak
>>>>>>>>>
>>>>>>>>> On Thu, Jan 23, 2014 at 12:59 AM, German Blanco <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hello!
>>>>>>>>>>
>>>>>>>>>> Could you please post your configuration files?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> German.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 23, 2014 at 2:28 AM, Deepak Jagtap <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> We have deployed zookeeper version 3.5.0.1515976, with 3 zk
>>>>>>>>>>> servers in the quorum.
>>>>>>>>>>> The problem we are facing is that one zookeeper server falls out
>>>>>>>>>>> of the quorum and never becomes part of the cluster again until
>>>>>>>>>>> we restart the zookeeper server on that node.
>>>>>>>>>>>
>>>>>>>>>>> Our interpretation from the zookeeper logs on all nodes is as
>>>>>>>>>>> follows:
>>>>>>>>>>> (For simplicity assume S1 => zk server 1, S2 => zk server 2,
>>>>>>>>>>> S3 => zk server 3)
>>>>>>>>>>> Initially S3 is the leader while S1 and S2 are followers.
>>>>>>>>>>>
>>>>>>>>>>> S2 hits a 46 sec latency while fsyncing the write-ahead log,
>>>>>>>>>>> which results in loss of connection with S3.
>>>>>>>>>>> S3 in turn prints the following error message:
>>>>>>>>>>>
>>>>>>>>>>> Unexpected exception causing shutdown while sock still open
>>>>>>>>>>> java.net.SocketTimeoutException: Read timed out
>>>>>>>>>>> Stack trace
>>>>>>>>>>> ******* GOODBYE /169.254.1.2:47647(S2) ********
>>>>>>>>>>>
>>>>>>>>>>> S2 in this case closes the connection with S3 (the leader) and
>>>>>>>>>>> shuts down the follower with the following log messages:
>>>>>>>>>>>
>>>>>>>>>>> Closing connection to leader, exception during packet send
>>>>>>>>>>> java.net.SocketException: Socket close
>>>>>>>>>>> Follower@194] - shutdown called
>>>>>>>>>>> java.lang.Exception: shutdown Follower
>>>>>>>>>>>
>>>>>>>>>>> After this point S3 could never re-establish the connection with
>>>>>>>>>>> S2, and the leader election mechanism keeps failing. S3 now
>>>>>>>>>>> keeps printing the following message repeatedly:
>>>>>>>>>>>
>>>>>>>>>>> Cannot open channel to 2 at election address /169.254.1.2:3888
>>>>>>>>>>> java.net.ConnectException: Connection refused.
>>>>>>>>>>>
>>>>>>>>>>> While S3 is in this state, S2 repeatedly keeps printing the
>>>>>>>>>>> following messages:
>>>>>>>>>>>
>>>>>>>>>>> INFO [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /127.0.0.1:60667
>>>>>>>>>>> Exception causing close of session 0x0: ZooKeeperServer not running
>>>>>>>>>>> Closed socket connection for client /127.0.0.1:60667 (no session established for client)
>>>>>>>>>>>
>>>>>>>>>>> Leader election never completes successfully, causing S2 to fall
>>>>>>>>>>> out of the quorum.
>>>>>>>>>>> S2 was out of the quorum for almost 1 week.
>>>>>>>>>>>
>>>>>>>>>>> While debugging this issue, we found out that neither the
>>>>>>>>>>> election port nor the peer connection port on S2 could be
>>>>>>>>>>> reached via telnet from any of the nodes (S1, S2, S3). Network
>>>>>>>>>>> connectivity is not the issue. Later, we restarted the ZK server
>>>>>>>>>>> on S2 (service zookeeper-server restart) -- now we could telnet
>>>>>>>>>>> to both ports and S2 joined the ensemble after a leader election
>>>>>>>>>>> attempt.
>>>>>>>>>>> Any idea what might be forcing S2 into a situation where it
>>>>>>>>>>> won't accept any connections on the leader election and peer
>>>>>>>>>>> connection ports?
>>>>>>>>>>>
>>>>>>>>>>> Should I file a jira on this and upload all the log files while
>>>>>>>>>>> submitting it, given that the log files are close to 250MB each?
>>>>>>>>>>>
>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>> Deepak
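
As an aside on the "ZooKeeperServer not running" rejections quoted above: the state each server reports on its client port can be observed without opening a session, using ZooKeeper's four-letter-word commands. Below is a minimal sketch along those lines, assuming the four-letter-word commands such as "srvr" are enabled on port 2181 (they predate the later whitelist option); the class name is just illustrative.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class SrvrCheck {
    public static void main(String[] args) {
        // Client addresses reused from the configuration quoted above.
        String[] hosts = {"169.254.1.1", "169.254.1.2", "169.254.1.3"};
        for (String host : hosts) {
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress(host, 2181), 2000);
                OutputStream out = s.getOutputStream();
                // "srvr" asks the server for its current status and mode.
                out.write("srvr".getBytes(StandardCharsets.US_ASCII));
                out.flush();
                InputStream in = s.getInputStream();
                byte[] buf = new byte[4096];
                StringBuilder reply = new StringBuilder();
                int n;
                // The server closes the connection after replying, ending the loop.
                while ((n = in.read(buf)) > 0) {
                    reply.append(new String(buf, 0, n, StandardCharsets.US_ASCII));
                }
                System.out.println("=== " + host + " ===");
                System.out.println(reply.toString().trim());
            } catch (Exception e) {
                System.out.println(host + ": " + e.getMessage());
            }
        }
    }
}

A healthy follower or leader reports its mode and zxid, while a server stuck outside the quorum, as S2 was here, answers along the lines of "This ZooKeeper instance is not currently serving requests".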
