What action should I perform to get the most usable logs in this case?

Log level to debug and kill -3 when it's failing?
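
(For context: kill -3 sends SIGQUIT, which makes the JVM print a full
thread dump to stdout, so for ZooKeeper it usually ends up in
zookeeper.out. Purely as an illustration of what that dump contains,
here is a minimal standalone Java sketch that produces the same kind of
per-thread stack listing for its own JVM via ThreadMXBean; the output
path is an arbitrary example.)

    import java.io.PrintWriter;
    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Sketch only: dumps the stack traces of the current JVM's threads,
    // roughly what kill -3 (SIGQUIT) makes a running ZooKeeper server print.
    public class ThreadDumpSketch {
        public static void main(String[] args) throws Exception {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            // include locked monitors/synchronizers where the JVM supports it
            ThreadInfo[] infos = threads.dumpAllThreads(true, true);
            try (PrintWriter out = new PrintWriter("/tmp/thread-dump.txt")) {
                for (ThreadInfo info : infos) {
                    // ThreadInfo.toString() includes a (truncated) stack trace
                    out.print(info);
                }
            }
        }
    }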


On 11 September 2018 9:17:45 pm Andor Molnár <an...@apache.org> wrote:

Erm.

Thanks for carrying out these tests Chris.

Have you by any chance - as Camille suggested - collected debug logs
from these tests?


Andor



On 09/11/2018 11:08 AM, Cee Tee wrote:
Concluded a test with a 3.4.13 cluster, it shows the same behaviour.

On Mon, Sep 3, 2018 at 4:56 PM Andor Molnar <an...@cloudera.com.invalid>
wrote:

Thanks for testing Chris.

So, if I understand you correctly, you're running the latest version from
branch-3.5. Could we say that this is a 3.5-only problem?
Have you ever tested the same cluster with 3.4?

Regards,
Andor



On Tue, Aug 21, 2018 at 11:29 AM, Cee Tee <c.turks...@gmail.com> wrote:

I've tested the patch and let it run for 6 days. It did not help; the
result is still the same (remaining ZKs form islands based on the
datacenter they are in).

I have mitigated it by doing a daily rolling restart.

Regards,
Chris

On Mon, Aug 13, 2018 at 2:06 PM Andor Molnar <an...@cloudera.com.invalid> wrote:

Hi Chris,

Would you mind testing the following patch on your test clusters?
I'm not entirely sure, but the issue might be related.

https://issues.apache.org/jira/browse/ZOOKEEPER-2930

Regards,
Andor



On Wed, Aug 8, 2018 at 6:51 PM, Camille Fournier <cami...@apache.org>
wrote:

If you have the time and inclination, next time you see this problem in
your test clusters get stack traces and any other diagnostics possible
before restarting. I'm not an expert at network debugging, but if you have
someone who is you might want them to take a look at the connections and
settings of any switches/firewalls/etc involved, see if there's any
unusual configurations or evidence of other long-lived connections failing
(even if their services handle the failures more gracefully). Send us the
stack traces also; it would be interesting to take a look.

C
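
(If it helps while gathering those diagnostics: a quick reachability
check of the leader-election port between the two datacenters can rule
out the simplest network explanations. A minimal Java sketch below; the
hostnames are placeholders and 3888 is only the usual default election
port. Note this only exercises fresh connections, so it would not catch
a firewall that silently drops long-lived idle connections.)

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.util.Arrays;
    import java.util.List;

    // Sketch only: try a plain TCP connect to each peer's election port.
    // Every server listens on its election port (3888 by default); the quorum
    // port (2888) is only open on the current leader, so it is skipped here.
    // Hostnames are placeholders.
    public class ElectionPortProbe {
        public static void main(String[] args) {
            List<String> hosts = Arrays.asList(
                    "zk1.dc-a.example", "zk2.dc-a.example", "zk5.dc-a.example",
                    "zk3.dc-b.example", "zk4.dc-b.example");
            for (String host : hosts) {
                try (Socket s = new Socket()) {
                    s.connect(new InetSocketAddress(host, 3888), 3000); // 3s timeout
                    System.out.println(host + ":3888 reachable");
                } catch (IOException e) {
                    System.out.println(host + ":3888 FAILED: " + e.getMessage());
                }
            }
        }
    }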


On Wed, Aug 8, 2018, 11:09 AM Chris <c.turks...@gmail.com> wrote:

Running 3.5.5

I managed to recreate it on acc and test cluster today, failing on
shutdown of the leader. Both had been running for over a week. After
restarting all zookeepers it runs fine no matter how many leader
shutdowns I throw at it.
On 8 August 2018 5:05:34 pm Andor Molnar <an...@cloudera.com.INVALID> wrote:

Some kind of a network split?

It looks like 1-2 and 3-4 were able to communicate with each other, but
connections timed out between these 2 splits. When 5 came back online it
started with supporters of (1,2) and later 3 and 4 also joined.

There was no such issue the day after.

Which version of ZooKeeper is this? 3.5.something?

Regards,
Andor



On Wed, Aug 8, 2018 at 4:52 PM, Chris <c.turks...@gmail.com>
wrote:
Actually I have similar issues on my test and acceptance clusters, where
leader election fails if the cluster has been running for a couple of
days. If you stop/start the Zookeepers once, they will work fine on
further disruptions that day. Not sure yet what the threshold is.


On 8 August 2018 4:32:56 pm Camille Fournier <cami...@apache.org> wrote:
Hard to say. It looks like about 15 minutes after your first incident
where 5 goes down and then comes back up, servers 1 and 2 get socket
errors to their connections with 3, 4, and 6. It's possible that if you
had waited those 15 minutes, once those errors cleared the quorum would've
formed with the other servers. But as for why there were those errors in
the first place, it's not clear. Could be a network glitch, or an obscure
bug in the connection logic. Has anyone else ever seen this?

If you see it again, getting a stack trace of the servers when they can't
form quorum might be helpful.

On Wed, Aug 8, 2018 at 11:52 AM Cee Tee <c.turks...@gmail.com>
wrote:
I have a cluster of 5 participants (id 1-5) and 1 observer (id 6).
1,2,5 are in datacenter A. 3,4,6 are in datacenter B.
Yesterday one of the participants (id 5, which by chance was the leader)
was rebooted. Although all other servers were online and not suffering
from networking issues, the leader election failed and the cluster
remained "looking" until the old leader came back online, after which it
was promptly elected as leader again.

Today we tried the same exercise on the exact same servers, 5 was still
leader and was rebooted, and leader election worked fine with 4 as the
new leader.

I have included the logs. From the logs I see that yesterday 1,2 never
received new leader proposals from 3,4 and vice versa. Today all
proposals came through. This is not the first time we've seen this type
of behavior, where some zookeepers can't seem to find each other after
the leader goes down.
All servers use dynamic configuration and have the same config node.
How could this be explained? These servers also host a replicated
database cluster and have no history of db replication issues.

Thanks,
Chris
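
(On the "same config node" point above: one way to double-check it is to
read /zookeeper/config from each server individually and compare the
output. Below is a minimal sketch using the 3.5 Java client's getConfig
call; the hostnames and ports are placeholders, not the actual servers.)

    import java.util.Arrays;
    import java.util.List;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    // Sketch only: connect to each server separately and print the contents
    // of the /zookeeper/config znode so the dynamic configs can be compared.
    // Hostnames and ports are placeholders; a real check would also wait for
    // the connection event instead of calling getConfig straight away.
    public class CompareDynamicConfig {
        public static void main(String[] args) throws Exception {
            List<String> servers = Arrays.asList(
                    "zk1.dc-a.example:2181", "zk2.dc-a.example:2181",
                    "zk5.dc-a.example:2181", "zk3.dc-b.example:2181",
                    "zk4.dc-b.example:2181", "zk6.dc-b.example:2181");
            for (String server : servers) {
                // one-host connect string so we know which server answered
                ZooKeeper zk = new ZooKeeper(server, 10000, event -> { });
                try {
                    Stat stat = new Stat();
                    byte[] config = zk.getConfig(false, stat);
                    System.out.println("--- " + server + " ---");
                    System.out.println(new String(config));
                } finally {
                    zk.close();
                }
            }
        }
    }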


