Re: zk server falling apart from quorum due to connection loss and couldn't connect back

German Blanco Thu, 23 Jan 2014 11:31:13 -0800

Sorry, but the attachment didn't make it through.
It might be safer to put the files somewhere on the web and send a link.



On Thu, Jan 23, 2014 at 8:00 PM, Deepak Jagtap <[email protected]> wrote:

> Hi German,
>
> Please find zookeeper config files attached.
>
> Thanks & Regards,
> Deepak
>
>
> On Thu, Jan 23, 2014 at 12:59 AM, German Blanco <[email protected]> wrote:
>
>> Hello!
>>
>> Could you please post your configuration files?
>>
>> Regards,
>>
>> German.
>>
>>
>> On Thu, Jan 23, 2014 at 2:28 AM, Deepak Jagtap <[email protected]> wrote:
>>
>> > Hi All,
>> >
>> > We have deployed ZooKeeper version 3.5.0.1515976 with 3 ZK servers in
>> > the quorum.
>> > The problem we are facing is that one ZooKeeper server drops out of the
>> > quorum and never rejoins the cluster until we restart the ZooKeeper
>> > server on that node.
>> >
>> > Our interpretation of the ZooKeeper logs on all nodes is as follows.
>> > (For simplicity, assume S1 => zk server 1, S2 => zk server 2,
>> > S3 => zk server 3.)
>> > Initially S3 is the leader while S1 and S2 are followers.
>> >
>> > S2 hits a 46-second latency while fsyncing the write-ahead log, which
>> > results in loss of its connection with S3.
>> > S3 in turn prints the following error message:
>> >
>> > Unexpected exception causing shutdown while sock still open
>> > java.net.SocketTimeoutException: Read timed out
>> > Stack trace
>> > ******* GOODBYE /169.254.1.2:47647(S2) ********
>> >
>> > S2 in this case closes its connection with S3 (the leader) and shuts
>> > down the follower with the following log messages:
>> > Closing connection to leader, exception during packet send
>> > java.net.SocketException: Socket close
>> > Follower@194] - shutdown called
>> > java.lang.Exception: shutdown Follower
>> >
>> > After this point S3 can never re-establish a connection with S2, and
>> > the leader election mechanism keeps failing. S3 now prints the
>> > following message repeatedly:
>> > Cannot open channel to 2 at election address /169.254.1.2:3888
>> > java.net.ConnectException: Connection refused.
>> >
>> > While S3 is in this state, S2 keeps printing the following messages:
>> > INFO [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
>> > - Accepted socket connection from /127.0.0.1:60667
>> > Exception causing close of session 0x0: ZooKeeperServer not running
>> > Closed socket connection for client /127.0.0.1:60667 (no session
>> > established for client)
>> >
>> > Leader election never completes successfully, which keeps S2 out of
>> > the quorum.
>> > S2 was out of the quorum for almost 1 week.
>> >
>> > While debugging this issue, we found that neither the election port
>> > nor the peer connection port on S2 could be reached via telnet from
>> > any of the nodes (S1, S2, S3). Network connectivity is not the issue.
>> > Later, we restarted ZK server S2 (service zookeeper-server restart);
>> > after that we could telnet to both ports, and S2 joined the ensemble
>> > after a leader election attempt.
>> > Any idea what might be forcing S2 into a state where it won't accept
>> > any connections on the leader election and peer connection ports?
>> >
>> > Should I file a JIRA for this and upload all the log files when
>> > submitting it, given that the log files are close to 250MB each?
>> >
>> > Thanks & Regards,
>> > Deepak
>> >
>>
>
>

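The telnet check described in the quoted report can also be scripted so that it is easy to repeat from every node. The following is only a minimal sketch and not something posted in the thread: check_port and report are hypothetical helper names, the election port 3888 is taken from the quoted log, and the peer port 2888 is an assumption based on the conventional ZooKeeper default, since the actual configuration files never reached the list.

```python
import socket
from typing import Dict


def check_port(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection performs the same TCP handshake telnet would.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def report(host: str, ports: Dict[str, int]) -> Dict[str, bool]:
    """Check each named port on host and return a name -> reachable map."""
    return {name: check_port(host, port) for name, port in ports.items()}


# Hypothetical values: 3888 appears in the quoted log; 2888 is assumed.
S2_PORTS = {"peer": 2888, "election": 3888}
```

Calling report("169.254.1.2", S2_PORTS) from each of S1, S2, and S3 would show whether the listeners on S2 accept connections. A result of False for both ports while the host is otherwise reachable would match the symptom described in the report.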