Hi,

zoo.cfg is:
maxClientCnxns=50
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/var/lib/zookeeper
# the port at which the clients will connect
clientPort=2181
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
dynamicConfigFile=/etc/maxta/zookeeper/conf/zoo.cfg.dynamic

zoo.cfg.dynamic is:

server.1=169.254.1.1:2888:3888:participant;0.0.0.0:2181
server.2=169.254.1.2:2888:3888:participant;0.0.0.0:2181
server.3=169.254.1.3:2888:3888:participant;0.0.0.0:2181
version=1
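
For reference, what the running servers actually loaded can be cross-checked
against these files. A quick sketch (the nc invocation and the zkCli.sh path
are assumptions about the environment; "conf" is a standard four-letter-word
command, and the "config" CLI command was added in 3.5):

    # dump the live configuration of one server via the four-letter word
    echo conf | nc 169.254.1.1 2181

    # or ask through the 3.5 CLI, which also prints the config version
    zkCli.sh -server 169.254.1.1:2181 config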
Thanks & Regards,
Deepak

On Thu, Jan 23, 2014 at 11:30 AM, German Blanco <
[email protected]> wrote:

> Sorry, but the attachment didn't make it through.
> It might be safer to put the files somewhere on the web and send a link.
>
>
> On Thu, Jan 23, 2014 at 8:00 PM, Deepak Jagtap <[email protected]> wrote:
>
> > Hi German,
> >
> > Please find the ZooKeeper config files attached.
> >
> > Thanks & Regards,
> > Deepak
> >
> >
> > On Thu, Jan 23, 2014 at 12:59 AM, German Blanco <
> > [email protected]> wrote:
> >
> >> Hello!
> >>
> >> Could you please post your configuration files?
> >>
> >> Regards,
> >>
> >> German.
> >>
> >>
> >> On Thu, Jan 23, 2014 at 2:28 AM, Deepak Jagtap <[email protected]> wrote:
> >>
> >> > Hi All,
> >> >
> >> > We have deployed ZooKeeper version 3.5.0.1515976, with 3 ZK servers
> >> > in the quorum.
> >> > The problem we are facing is that one ZooKeeper server drops out of
> >> > the quorum and never becomes part of the cluster again until we
> >> > restart the ZooKeeper server on that node.
> >> >
> >> > Our interpretation of the ZooKeeper logs on all nodes is as follows
> >> > (for simplicity, assume S1 => ZK server 1, S2 => ZK server 2,
> >> > S3 => ZK server 3):
> >> > Initially S3 is the leader, while S1 and S2 are followers.
> >> >
> >> > S2 hits a 46-second latency while fsyncing the write-ahead log, which
> >> > results in loss of its connection with S3.
> >> > S3 in turn prints the following error message:
> >> >
> >> > Unexpected exception causing shutdown while sock still open
> >> > java.net.SocketTimeoutException: Read timed out
> >> > Stack trace
> >> > ******* GOODBYE /169.254.1.2:47647(S2) ********
> >> >
> >> > S2 in this case closes its connection with S3 (the leader) and shuts
> >> > down its follower with the following log messages:
> >> >
> >> > Closing connection to leader, exception during packet send
> >> > java.net.SocketException: Socket close
> >> > Follower@194] - shutdown called
> >> > java.lang.Exception: shutdown Follower
> >> >
> >> > After this point S3 can never reestablish the connection with S2, and
> >> > the leader election mechanism keeps failing. S3 now keeps printing
> >> > the following message repeatedly:
> >> >
> >> > Cannot open channel to 2 at election address /169.254.1.2:3888
> >> > java.net.ConnectException: Connection refused
> >> >
> >> > While S3 is in this state, S2 repeatedly keeps printing the following
> >> > messages:
> >> >
> >> > INFO [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
> >> > - Accepted socket connection from /127.0.0.1:60667
> >> > Exception causing close of session 0x0: ZooKeeperServer not running
> >> > Closed socket connection for client /127.0.0.1:60667 (no session
> >> > established for client)
> >> >
> >> > Leader election never completes successfully, causing S2 to fall out
> >> > of the quorum.
> >> > S2 was out of the quorum for almost a week.
> >> >
> >> > While debugging this issue, we found that neither the election port
> >> > nor the peer connection port on S2 could be reached via telnet from
> >> > any of the nodes (S1, S2, S3). Network connectivity is not the issue.
> >> > Later, we restarted the ZK server on S2 (service zookeeper-server
> >> > restart) -- after that we could telnet to both ports, and S2 joined
> >> > the ensemble after a leader election attempt.
> >> > Any idea what might be forcing S2 into a state where it won't accept
> >> > any connections on the leader election and peer connection ports?
> >> >
> >> > Should I file a JIRA on this, and should I upload all the log files
> >> > when submitting it, given that they are close to 250 MB each?
> >> >
> >> > Thanks & Regards,
> >> > Deepak
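
P.S. For the archives, the port check described above, as an nc-based
equivalent of our telnet probes (nc and netstat availability are assumed;
the address and ports are the ones from zoo.cfg.dynamic):

    # run from each of S1/S2/S3 against the stuck node S2 (169.254.1.2);
    # 2888/3888/2181 are the peer, election and client ports
    for port in 2888 3888 2181; do
        nc -z -v -w 5 169.254.1.2 $port
    done

    # on S2 itself: check whether the JVM still owns listening sockets
    netstat -tlnp | grep -E ':(2888|3888|2181)'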

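P.P.S. On the 46-second fsync that started all of this: ZooKeeper logs a
warning when a transaction-log sync exceeds a configurable threshold
(default 1000 ms), so lowering it can surface a slow disk before it costs
you a follower. A minimal zoo.cfg addition, assuming the standard
fsync.warningthresholdms property (also settable as the
zookeeper.fsync.warningthresholdms Java system property):

    # WARN whenever an fsync of the write-ahead log takes longer than 500 ms
    fsync.warningthresholdms=500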