On Wed, Jun 16, 2021 at 11:26 AM Klaus Wenninger <kwenn...@redhat.com> wrote:
>
> On Wed, Jun 16, 2021 at 10:47 AM Roger Zhou <zz...@suse.com> wrote:
>
>>
>> On 6/16/21 3:03 PM, Andrei Borzenkov wrote:
>>
>> >>>
>> >>> We thought that access to storage was restored, but one step was
>> >>> missing so devices appeared empty.
>> >>>
>> >>> At this point I tried to restart the pacemaker. But as soon as I
>> >>> stopped pacemaker SBD rebooted nodes - which is logical, as quorum was
>> >>> now lost.
>> >>>
>> >>> How to cleanly stop pacemaker in this case and keep nodes up?
>> >>
>> >> Unconfigure sbd devices I guess.
>> >>
>> >
>> > Do you have *practical* suggestions on how to do it online in a
>> > running pacemaker cluster? Can you explain how it is going to help
>> > given that lack of sbd device was not the problem in the first place?
>>
>> I would translate this issue as "how to gracefully shutdown sbd to
>> deregister sbd from pacemaker for the whole cluster". Seems no way to
>> do that except `systemctl stop corosync`.
>>
>> With that, to calm down sbd suicide, I'm thinking some tricky steps as
>> below might help. Well, not sure it fits your situation as the whole.
>>
>> crm cluster run "systemctl stop pacemaker"
>> crm cluster run "systemctl stop corosync"
>>
> I guess this shouldn't be helpful in this situation.
> As I've already tried to explain before, shutting down
> pacemaker on one of the nodes - if sbd-device can't
> be reached - should already be enough for the other
> one to suicide.
>
> One - not less ugly than other suggestions here I'm afraid -
> thing coming to my mind is to dummy-register at the cpg-protocol
> right after stopping pacemaker. If after that you want
> to bring down corosync & sbd as well it should be possible
> to do that quickly enough - as pcs is otherwise doing with
> 3+ node clusters.
>
Something else that comes to mind, which might be more helpful and less
ugly (I have to think it over a bit, though):
With the new startup/shutdown syncing, pacemaker should stay connected to
the cpg-protocol until a final handshake with sbd on shutdown.
If we could bring all nodes to the state right before that handshake,
e.g. with pcs, we would have plenty of time for that. The final step,
incl. corosync/sbd shutdown, is then quick enough to happen on all nodes
within watchdog-timeout.

Klaus

>
>> BR,
>> Roger
>>
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>>
>
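P.S. To make the timing argument concrete, here is a rough sketch of what "the final step happens on all nodes within watchdog-timeout" could look like when driven by hand. It is only an illustration under assumptions, not what pcs does: `run_on_all_within` and the `RUNNER` variable are hypothetical names, passwordless ssh between nodes is assumed, and the deadline has to be filled in from your own watchdog timeout. It only measures after the fact whether the parallel stop finished in time; it cannot prevent a node from being reset.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run one command on every node in parallel and report
# whether all of them finished before a deadline (e.g. the watchdog timeout).
# RUNNER is an assumption: "ssh" by default, overridable for dry runs.
RUNNER="${RUNNER:-ssh}"

run_on_all_within() {
    # $1 = deadline in seconds, $2 = command to run, remaining args = node names
    local deadline="$1" cmd="$2"
    shift 2
    local start=$SECONDS pids=() node pid rc=0

    # Start the command on all nodes at (roughly) the same time.
    for node in "$@"; do
        "$RUNNER" "$node" "$cmd" &
        pids+=("$!")
    done

    # Collect the exit status of every background job.
    for pid in "${pids[@]}"; do
        wait "$pid" || rc=1
    done

    local elapsed=$(( SECONDS - start ))
    if (( rc != 0 )); then
        echo "FAILED on at least one node"
        return 1
    fi
    if (( elapsed >= deadline )); then
        echo "TOO SLOW: ${elapsed}s >= ${deadline}s"
        return 2
    fi
    echo "OK: ${elapsed}s < ${deadline}s"
}
```

Usage would be something like `run_on_all_within 5 "systemctl stop corosync" node1 node2`, where 5 is just a placeholder for the watchdog timeout and node1/node2 are placeholder host names. Whether stopping corosync also takes sbd down depends on how the units are tied together on your distribution, so check that before relying on it.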