What action should I perform to get the most usable logs in this case?

Log level to debug and kill -3 when it's failing?
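
(For context: kill -3 sends SIGQUIT, which makes the JVM print a full
thread dump to stdout, so for ZooKeeper it usually ends up in
zookeeper.out. Purely as an illustration of what that dump contains,
here is a minimal standalone Java sketch that produces the same kind of
per-thread stack listing for its own JVM via ThreadMXBean; the output
path is an arbitrary example.)

    import java.io.PrintWriter;
    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Sketch only: dumps the stack traces of the current JVM's threads,
    // roughly what kill -3 (SIGQUIT) makes a running ZooKeeper server print.
    public class ThreadDumpSketch {
        public static void main(String[] args) throws Exception {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            // include locked monitors/synchronizers where the JVM supports it
            ThreadInfo[] infos = threads.dumpAllThreads(true, true);
            try (PrintWriter out = new PrintWriter("/tmp/thread-dump.txt")) {
                for (ThreadInfo info : infos) {
                    // ThreadInfo.toString() includes a (truncated) stack trace
                    out.print(info);
                }
            }
        }
    }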


On 11 September 2018 9:17:45 pm Andor Molnár <an...@apache.org> wrote:

Erm.

Thanks for carrying out these tests Chris.

Have you by any chance - as Camille suggested - collected debug logs
from these tests?


Andor



On 09/11/2018 11:08 AM, Cee Tee wrote:
Concluded a test with a 3.4.13 cluster, it shows the same behaviour.

On Mon, Sep 3, 2018 at 4:56 PM Andor Molnar <an...@cloudera.com.invalid>
wrote:

Thanks for testing Chris.

So, if I understand you correctly, you're running the latest version from
branch-3.5. Could we say that this is a 3.5-only problem?
Have you ever tested the same cluster with 3.4?

Regards,
Andor



On Tue, Aug 21, 2018 at 11:29 AM, Cee Tee <c.turks...@gmail.com> wrote:

I've tested the patch and let it run for 6 days. It did not help; the
result is still the same (remaining ZKs form islands based on the
datacenter they are in).

I have mitigated it by doing a daily rolling restart.

Regards,
Chris

On Mon, Aug 13, 2018 at 2:06 PM Andor Molnar <an...@cloudera.com.invalid> wrote:

Hi Chris,

Would you mind testing the following patch on your test clusters?
I'm not entirely sure, but the issue might be related.

https://issues.apache.org/jira/browse/ZOOKEEPER-2930

Regards,
Andor



On Wed, Aug 8, 2018 at 6:51 PM, Camille Fournier <cami...@apache.org>
wrote:

If you have the time and inclination, next time you see this problem in
your test clusters get stack traces and any other diagnostics possible
before restarting. I'm not an expert at network debugging, but if you have
someone who is you might want them to take a look at the connections and
settings of any switches/firewalls/etc involved, see if there's any
unusual configurations or evidence of other long-lived connections failing
(even if their services handle the failures more gracefully). Send us the
stack traces also; it would be interesting to take a look.

C
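
(If it helps while gathering those diagnostics: a quick reachability
check of the leader-election port between the two datacenters can rule
out the simplest network explanations. A minimal Java sketch below; the
hostnames are placeholders and 3888 is only the usual default election
port. Note this only exercises fresh connections, so it would not catch
a firewall that silently drops long-lived idle connections.)

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.util.Arrays;
    import java.util.List;

    // Sketch only: try a plain TCP connect to each peer's election port.
    // Every server listens on its election port (3888 by default); the quorum
    // port (2888) is only open on the current leader, so it is skipped here.
    // Hostnames are placeholders.
    public class ElectionPortProbe {
        public static void main(String[] args) {
            List<String> hosts = Arrays.asList(
                    "zk1.dc-a.example", "zk2.dc-a.example", "zk5.dc-a.example",
                    "zk3.dc-b.example", "zk4.dc-b.example");
            for (String host : hosts) {
                try (Socket s = new Socket()) {
                    s.connect(new InetSocketAddress(host, 3888), 3000); // 3s timeout
                    System.out.println(host + ":3888 reachable");
                } catch (IOException e) {
                    System.out.println(host + ":3888 FAILED: " + e.getMessage());
                }
            }
        }
    }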


On Wed, Aug 8, 2018, 11:09 AM Chris <c.turks...@gmail.com> wrote:

Running 3.5.5

I managed to recreate it on acc and test cluster today, failing on
shutdown of the leader. Both had been running for over a week. After
restarting all zookeepers it runs fine no matter how many leader
shutdowns I throw at it.
On 8 August 2018 5:05:34 pm Andor Molnar <an...@cloudera.com.INVALID> wrote:

Some kind of a network split?

It looks like 1-2 and 3-4 were able to communicate with each other, but
connections timed out between these 2 splits. When 5 came back online it
started with supporters of (1,2) and later 3 and 4 also joined.

There was no such issue the day after.

Which version of ZooKeeper is this? 3.5.something?

Regards,
Andor



On Wed, Aug 8, 2018 at 4:52 PM, Chris <c.turks...@gmail.com>
wrote:
Actually I have similar issues on my test and acceptance clusters, where
leader election fails if the cluster has been running for a couple of
days. If you stop/start the Zookeepers once, they will work fine on
further disruptions that day. Not sure yet what the threshold is.


On 8 August 2018 4:32:56 pm Camille Fournier <cami...@apache.org> wrote:
Hard to say. It looks like about 15 minutes after your first incident
where 5 goes down and then comes back up, servers 1 and 2 get socket
errors to their connections with 3, 4, and 6. It's possible that if you
had waited those 15 minutes, once those errors cleared the quorum would've
formed with the other servers. But as for why there were those errors in
the first place, it's not clear. Could be a network glitch, or an obscure
bug in the connection logic. Has anyone else ever seen this?

If you see it again, getting a stack trace of the servers when they can't
form quorum might be helpful.

On Wed, Aug 8, 2018 at 11:52 AM Cee Tee <c.turks...@gmail.com>
wrote:
I have a cluster of 5 participants (id 1-5) and 1 observer (id 6).
1,2,5 are in datacenter A. 3,4,6 are in datacenter B.
Yesterday one of the participants (id 5, which by chance was the leader)
was rebooted. Although all other servers were online and not suffering
from networking issues, the leader election failed and the cluster
remained "looking" until the old leader came back online, after which it
was promptly elected as leader again.

Today we tried the same exercise on the exact same servers, 5 was still
leader and was rebooted, and leader election worked fine with 4 as the
new leader.

I have included the logs. From the logs I see that yesterday 1,2 never
received new leader proposals from 3,4 and vice versa. Today all
proposals came through. This is not the first time we've seen this type
of behavior, where some zookeepers can't seem to find each other after
the leader goes down.
All servers use dynamic configuration and have the same config node.
How could this be explained? These servers also host a replicated
database cluster and have no history of db replication issues.

Thanks,
Chris
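
(On the "same config node" point above: one way to double-check it is to
read /zookeeper/config from each server individually and compare the
output. Below is a minimal sketch using the 3.5 Java client's getConfig
call; the hostnames and ports are placeholders, not the actual servers.)

    import java.util.Arrays;
    import java.util.List;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    // Sketch only: connect to each server separately and print the contents
    // of the /zookeeper/config znode so the dynamic configs can be compared.
    // Hostnames and ports are placeholders; a real check would also wait for
    // the connection event instead of calling getConfig straight away.
    public class CompareDynamicConfig {
        public static void main(String[] args) throws Exception {
            List<String> servers = Arrays.asList(
                    "zk1.dc-a.example:2181", "zk2.dc-a.example:2181",
                    "zk5.dc-a.example:2181", "zk3.dc-b.example:2181",
                    "zk4.dc-b.example:2181", "zk6.dc-b.example:2181");
            for (String server : servers) {
                // one-host connect string so we know which server answered
                ZooKeeper zk = new ZooKeeper(server, 10000, event -> { });
                try {
                    Stat stat = new Stat();
                    byte[] config = zk.getConfig(false, stat);
                    System.out.println("--- " + server + " ---");
                    System.out.println(new String(config));
                } finally {
                    zk.close();
                }
            }
        }
    }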


