Erm. Thanks for carrying out these tests, Chris.

Have you by any chance - as Camille suggested - collected debug logs from these tests?
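By debug logs I just mean a temporary DEBUG override for the quorum/election packages in conf/log4j.properties - a minimal sketch, assuming the stock log4j setup shipped with the distribution (the servers need a restart to pick it up):

  # conf/log4j.properties (sketch) - raise verbosity for leader election / quorum only
  log4j.logger.org.apache.zookeeper.server.quorum=DEBUG
  # or, if log volume is not a concern, for everything:
  # log4j.logger.org.apache.zookeeper=DEBUG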
Andor

On 09/11/2018 11:08 AM, Cee Tee wrote:
> Concluded a test with a 3.4.13 cluster, it shows the same behaviour.
>
> On Mon, Sep 3, 2018 at 4:56 PM Andor Molnar <an...@cloudera.com.invalid> wrote:
>
>> Thanks for testing, Chris.
>>
>> So, if I understand you correctly, you're running the latest version from branch-3.5. Could we say that this is a 3.5-only problem? Have you ever tested the same cluster with 3.4?
>>
>> Regards,
>> Andor
>>
>> On Tue, Aug 21, 2018 at 11:29 AM, Cee Tee <c.turks...@gmail.com> wrote:
>>
>>> I've tested the patch and let it run 6 days. It did not help; the result is still the same (the remaining ZKs form islands based on the datacenter they are in).
>>>
>>> I have mitigated it by doing a daily rolling restart.
>>>
>>> Regards,
>>> Chris
>>>
>>> On Mon, Aug 13, 2018 at 2:06 PM Andor Molnar <an...@cloudera.com.invalid> wrote:
>>>
>>>> Hi Chris,
>>>>
>>>> Would you mind testing the following patch on your test clusters? I'm not entirely sure, but the issue might be related.
>>>>
>>>> https://issues.apache.org/jira/browse/ZOOKEEPER-2930
>>>>
>>>> Regards,
>>>> Andor
>>>>
>>>> On Wed, Aug 8, 2018 at 6:51 PM, Camille Fournier <cami...@apache.org> wrote:
>>>>
>>>>> If you have the time and inclination, next time you see this problem in your test clusters, get stack traces and any other diagnostics possible before restarting. I'm not an expert at network debugging, but if you have someone who is, you might want them to take a look at the connections and settings of any switches/firewalls/etc. involved, and see if there are any unusual configurations or evidence of other long-lived connections failing (even if their services handle the failures more gracefully). Send us the stack traces also; it would be interesting to take a look.
>>>>>
>>>>> C
>>>>>
>>>>> On Wed, Aug 8, 2018, 11:09 AM Chris <c.turks...@gmail.com> wrote:
>>>>>
>>>>>> Running 3.5.5.
>>>>>>
>>>>>> I managed to recreate it on the acceptance and test clusters today, failing on shutdown of the leader. Both had been running for over a week. After restarting all zookeepers it runs fine, no matter how many leader shutdowns I throw at it.
>>>>>>
>>>>>> On 8 August 2018 5:05:34 pm Andor Molnar <an...@cloudera.com.INVALID> wrote:
>>>>>>
>>>>>>> Some kind of a network split?
>>>>>>>
>>>>>>> It looks like 1-2 and 3-4 were able to communicate with each other, but the connection timed out between these two splits. When 5 came back online, it started with supporters of (1,2), and later 3 and 4 also joined.
>>>>>>>
>>>>>>> There was no such issue the day after.
>>>>>>>
>>>>>>> Which version of ZooKeeper is this? 3.5.something?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Andor
>>>>>>>
>>>>>>> On Wed, Aug 8, 2018 at 4:52 PM, Chris <c.turks...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Actually, I have similar issues on my test and acceptance clusters, where leader election fails if the cluster has been running for a couple of days. If you stop/start the Zookeepers once, they will work fine on further disruptions that day. Not sure yet what the threshold is.
>>>>>>>>
>>>>>>>> On 8 August 2018 4:32:56 pm Camille Fournier <cami...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Hard to say.
>>>>>>>>> It looks like about 15 minutes after your first incident, where 5 goes down and then comes back up, servers 1 and 2 get socket errors on their connections with 3, 4, and 6. It's possible that if you had waited those 15 minutes, the quorum would've formed with the other servers once those errors cleared. But as for why there were those errors in the first place, it's not clear. Could be a network glitch, or an obscure bug in the connection logic. Has anyone else ever seen this?
>>>>>>>>>
>>>>>>>>> If you see it again, getting a stack trace of the servers when they can't form quorum might be helpful.
>>>>>>>>>
>>>>>>>>> On Wed, Aug 8, 2018 at 11:52 AM Cee Tee <c.turks...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I have a cluster of 5 participants (id 1-5) and 1 observer (id 6). 1, 2, 5 are in datacenter A; 3, 4, 6 are in datacenter B.
>>>>>>>>>>
>>>>>>>>>> Yesterday one of the participants (id 5, which by chance was the leader) was rebooted. Although all other servers were online and not suffering from networking issues, the leader election failed and the cluster remained "looking" until the old leader came back online, after which it was promptly elected as leader again.
>>>>>>>>>>
>>>>>>>>>> Today we tried the same exercise on the exact same servers (5 was still leader and was rebooted), and leader election worked fine with 4 as the new leader.
>>>>>>>>>>
>>>>>>>>>> I have included the logs. From the logs I see that yesterday 1, 2 never received new leader proposals from 3, 4, and vice versa. Today all proposals came through. This is not the first time we've seen this type of behavior, where some zookeepers can't seem to find each other after the leader goes down.
>>>>>>>>>>
>>>>>>>>>> All servers use dynamic configuration and have the same config node.
>>>>>>>>>>
>>>>>>>>>> How could this be explained? These servers also host a replicated database cluster and have no history of db replication issues.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Chris
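For the next occurrence: the stack traces Camille asked for above can be captured on each stuck server with something along the lines of the commands below. This is only a sketch - it assumes the JDK tools are on the path and that the server was started via the usual QuorumPeerMain entry point.

  # While the ensemble is unable to form quorum, on each affected server:
  ZK_PID=$(pgrep -f QuorumPeerMain)
  jstack -l "$ZK_PID" > /tmp/zk-threads-$(hostname)-$(date +%s).txt

  # Fallback if jstack is not installed: SIGQUIT makes the JVM print all
  # thread stacks to its stdout / console log without killing the process.
  kill -3 "$ZK_PID"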
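And just to make sure we're all picturing the same setup: the topology Chris describes (participants 1, 2, 5 in datacenter A; participants 3, 4 and observer 6 in datacenter B) would correspond to a dynamic configuration file roughly like the one below. The hostnames and ports are made up; only the ids and roles come from the thread.

  # zoo.cfg.dynamic (sketch; hypothetical hostnames)
  # datacenter A
  server.1=zk1.dc-a.example.com:2888:3888:participant;2181
  server.2=zk2.dc-a.example.com:2888:3888:participant;2181
  server.5=zk5.dc-a.example.com:2888:3888:participant;2181
  # datacenter B
  server.3=zk3.dc-b.example.com:2888:3888:participant;2181
  server.4=zk4.dc-b.example.com:2888:3888:participant;2181
  server.6=zk6.dc-b.example.com:2888:3888:observer;2181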