Re: [ClusterLabs] Is corosync supposed to be restarted if it dies?

Jan Pokorný Wed, 29 Nov 2017 13:04:06 -0800

On 28/11/17 22:35 +0300, Andrei Borzenkov wrote:
> 28.11.2017 13:01, Jan Pokorný wrote:
>> On 27/11/17 17:43 +0300, Andrei Borzenkov wrote:
>>> Sent from iPhone
>>> 
>>>> On 27 Nov 2017, at 14:36, Ferenc Wágner <wf...@niif.hu> wrote:
>>>> 
>>>> Andrei Borzenkov <arvidj...@gmail.com> writes:
>>>> 
>>>>> 25.11.2017 10:05, Andrei Borzenkov wrote:
>>>>> 
>>>>>> In one of the guides, the suggested procedure to simulate split brain
>>>>>> was to kill the corosync process. It actually worked on one cluster,
>>>>>> but on another the corosync process was restarted after being killed
>>>>>> without the cluster noticing anything. Except that after several
>>>>>> attempts pacemaker died with stopping resources ... :)
>>>>>> 
>>>>>> This is SLES12 SP2; I do not see any Restart in the service definition,
>>>>>> so it is probably not systemd.
>>>>>> 
>>>>> FTR - it was not corosync, but pacemaker; its unit file specifies
>>>>> Restart=on-failure, so killing corosync caused pacemaker to fail and be
>>>>> restarted by systemd.
>>>> 
>>>> And starting corosync via a Requires dependency?
>>> 
>>> Exactly.
>> 
>> From my testing it looks like we should change
>> "Requires=corosync.service" to "BindsTo=corosync.service"
>> in pacemaker.service.
>> 
>> Could you give it a try?
>> 
> 
> I'm not sure what the expected outcome is, but pacemaker.service is still
> restarted (due to Restart=on-failure).


The expected outcome is that pacemaker.service becomes
"inactive (dead)" after corosync is killed (as a result of pacemaker
being "bound" to corosync).  Have you indeed issued "systemctl
daemon-reload" after updating the pacemaker unit file?
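
A minimal sketch of the change being tested, assuming it is applied as a
drop-in rather than by editing the packaged unit file (the drop-in path and
file name are illustrative, not taken from any distribution):

```ini
# /etc/systemd/system/pacemaker.service.d/10-bindsto.conf  (hypothetical path)
[Unit]
# Unlike Requires=, BindsTo= also stops this unit when corosync.service
# is stopped or suddenly goes away.
BindsTo=corosync.service

# After saving the file, make systemd re-read unit definitions:
#   systemctl daemon-reload
# Then check the state after killing corosync:
#   systemctl status pacemaker.service
```

A drop-in only adds to the dependency list, so the original
Requires=corosync.service stays in place next to the new BindsTo=; that
should be harmless, since BindsTo= covers everything Requires= does.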

(FTR, I tried with systemd 235).

> If the intention is to unconditionally stop it when corosync dies,
> pacemaker should probably exit with a unique code and the unit file
> should have RestartPreventExitStatus set to it.

That would be an elaborate way to reach the same result.
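
For comparison, the exit-status approach suggested above might look roughly
like this. The exit code 100 is a placeholder chosen for illustration;
pacemaker would first have to exit with such a dedicated code when it loses
corosync:

```ini
[Service]
Restart=on-failure
# Hypothetical: 100 stands for "lost connection to corosync".
# systemd does not auto-restart the service when it exits with this status.
RestartPreventExitStatus=100
```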

But it is a good point to question what the "best intention" is in these
scenarios.  Normally, fencing would happen, but as you note, the node
actually survived by being fast enough to bring corosync back to
life.  From there, the question is whether it adds any value to have
pacemaker restarted on non-clean terminations at all.  I don't know.

Would it make more sense to have FailureAction=reboot-immediate to
at least in part emulate the fencing instead?
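
A sketch of what that could look like in the unit file.  Which section
accepts the setting depends on the systemd version ([Service] in older
releases, [Unit] in newer ones), so treat the placement as an assumption to
verify against systemd.unit(5) and systemd.service(5) on the target system:

```ini
[Unit]
# Once the unit enters the failed state, reboot right away via reboot(2),
# skipping the normal shutdown of services (risk of data loss).
FailureAction=reboot-immediate
```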

-- 
Jan (Poki)


_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
