Re: [ClusterLabs] pacemaker with sbd fails to start if node reboots too fast.

2017-11-26 Thread Andrei Borzenkov
On 22.11.2017 22:45, Klaus Wenninger wrote:
>>
>> Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly
>> just fenced by sapprod01p for sapprod01p
>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]:  warning: The crmd
>> process (3151) can no longer be respawned,
>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]:   notice: Shutting down 
>> Pacemaker
>>
>> SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
>> stonith with SBD always takes the full msgwait (at least, visually the
>> host is not declared OFFLINE until 120s have passed). But the VM
>> reboots lightning-fast and is up and running long before the timeout
>> expires.
>>
>> I think I have seen similar report already. Is it something that can
>> be fixed by SBD/pacemaker tuning?
> Don't know it from sbd, but I have seen cases where fencing
> with the cycle method on machines that boot quickly leads to
> strange behavior.
> If you configure sbd not to clear its disk slot on startup
> (SBD_START_MODE=clean), clearing is left to the other side,
> which should prevent the node from coming up while the one
> fencing is still waiting. You might also set the method from
> cycle to off/on to make the fencing side clean the slot.
> 
>>
>> I can provide full logs tomorrow if needed.
> Yes would be interesting to see more ...
> 
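As a sketch of the suggestion above (the device path and resource name are placeholder examples, assuming the fence_sbd agent and crmsh are in use):

```shell
# /etc/sysconfig/sbd -- do not let a booting node clear its own slot;
# with "clean", sbd refuses to start while a fencing message is pending:
SBD_START_MODE=clean

# Switch the fencing method from the default "cycle" to "onoff" so the
# fencing side writes the clearing message itself (names are examples):
crm configure primitive stonith-sbd stonith:fence_sbd \
    params devices="/dev/disk/by-id/example-sbd-disk" method=onoff
```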

crm_report attached (it's from a different trivial test cluster). Actually
I can reliably reproduce it whenever the node is rebooted and pacemaker is
started before the stonith agent has confirmed the node kill.

Unfortunately, in the case of SBD I cannot set the stonith timeout too
low, as we need to account for possible storage path failover.
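For reference, the watchdog and msgwait timeouts mentioned above are set when the SBD device is initialized; a sketch, with a placeholder device path:

```shell
# Initialize an SBD device with a 60s watchdog and 120s msgwait timeout
# (-1 = watchdog, -4 = msgwait; device path is an example). msgwait is
# how long a fencing request may take, so the cluster's stonith timeout
# must be at least this large:
sbd -d /dev/disk/by-id/example-sbd-disk -1 60 -4 120 create

# Verify the timeouts recorded on the device:
sbd -d /dev/disk/by-id/example-sbd-disk dump
```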


hb_report-Sun-26-Nov-2017.tar.bz2
Description: application/bzip
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Is corosync supposed to be restarted if it dies?

2017-11-26 Thread Andrei Borzenkov
On 25.11.2017 10:05, Andrei Borzenkov wrote:
> One of the guides suggested killing the corosync process as a
> procedure to simulate split brain. It actually worked on one cluster,
> but on another the corosync process was restarted after being killed
> without the cluster noticing anything. Except that after several
> attempts pacemaker died while stopping resources ... :)
> 
> This is SLES12 SP2; I do not see any Restart in the service
> definition, so it is probably not systemd.
> 
FTR - it was not corosync but pacemaker; its unit file specifies
Restart=on-failure, so killing corosync caused pacemaker to fail and be
restarted by systemd.

I wish systemd could dynamically "unmanage" services ...
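For what it's worth, the restart policy can be inspected and overridden per-unit with a drop-in, without editing the shipped unit file (a sketch; this disables the automatic respawn entirely):

```shell
# Show the current restart policy of the pacemaker unit:
systemctl show pacemaker -p Restart

# Override it with a drop-in rather than editing the packaged unit:
mkdir -p /etc/systemd/system/pacemaker.service.d
cat > /etc/systemd/system/pacemaker.service.d/no-restart.conf <<'EOF'
[Service]
Restart=no
EOF
systemctl daemon-reload
```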
