>>> Gabriele Bulfon <gbul...@sonicle.com> wrote on 14.12.2020 at 12:40 in
message <16685368.7249.1607946038308@www>:
> I isolated the log when everything happens (when I disable the ha
> interface), attached here.
What looks odd in my eyes is "A new membership (127.0.0.1:352) was formed"
using the localhost address.

Dec 14 12:34:42 [677] stonith-ng: info: call_remote_stonith: Requesting that 'xstha1' perform op 'xstha2 poweroff' for crmd.681 (72s, 0s)
Dec 14 12:34:44 [677] stonith-ng: notice: log_operation: Operation 'poweroff' [2235] (call 2 from crmd.681) for host 'xstha2' with device 'xstha2-stonith' returned: 0 (OK)

xstha2 should be off now...

Dec 14 12:34:44 [681] crmd: info: cib_fencing_updated: Fencing update 43 for xstha2: complete

This looks odd:

Dec 14 12:34:44 [681] crmd: warning: match_down_event: No reason to expect node 2 to be down

I could not see fencing of xstha1 from xstha2.

> Gabriele
>
> ----------------------------------------------------------------------------
>
> From: Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de>
> To: users@clusterlabs.org
> Date: 14 December 2020 11.53.22 CET
> Subject: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure
>
> >>> Gabriele Bulfon <gbul...@sonicle.com> wrote on 14.12.2020 at 11:48 in
> message <1065144646.7212.1607942889206@www>:
>> Thanks!
>>
>> I tried the first option, by adding pcmk_delay_base to the two stonith
>> primitives.
>> The first has 1 second, the second has 5 seconds.
>> It didn't work :( they still killed each other :(
>> Anything wrong with the way I did it?
>
> Hard to say without seeing the logs...
>
>> Here's the config:
>>
>> node 1: xstha1 \
>>         attributes standby=off maintenance=off
>> node 2: xstha2 \
>>         attributes standby=off maintenance=off
>> primitive xstha1-stonith stonith:external/ipmi \
>>         params hostname=xstha1 ipaddr=192.168.221.18 userid=ADMIN \
>>           passwd="***" interface=lanplus pcmk_delay_base=1 \
>>         op monitor interval=25 timeout=25 start-delay=25 \
>>         meta target-role=Started
>> primitive xstha1_san0_IP IPaddr \
>>         params ip=10.10.10.1 cidr_netmask=255.255.255.0 nic=san0
>> primitive xstha2-stonith stonith:external/ipmi \
>>         params hostname=xstha2 ipaddr=192.168.221.19 userid=ADMIN \
>>           passwd="***" interface=lanplus pcmk_delay_base=5 \
>>         op monitor interval=25 timeout=25 start-delay=25 \
>>         meta target-role=Started
>> primitive xstha2_san0_IP IPaddr \
>>         params ip=10.10.10.2 cidr_netmask=255.255.255.0 nic=san0
>> primitive zpool_data ZFS \
>>         params pool=test \
>>         op start timeout=90 interval=0 \
>>         op stop timeout=90 interval=0 \
>>         meta target-role=Started
>> location xstha1-stonith-pref xstha1-stonith -inf: xstha1
>> location xstha1_san0_IP_pref xstha1_san0_IP 100: xstha1
>> location xstha2-stonith-pref xstha2-stonith -inf: xstha2
>> location xstha2_san0_IP_pref xstha2_san0_IP 100: xstha2
>> order zpool_data_order inf: zpool_data ( xstha1_san0_IP )
>> location zpool_data_pref zpool_data 100: xstha1
>> colocation zpool_data_with_IPs inf: zpool_data xstha1_san0_IP
>> property cib-bootstrap-options: \
>>         have-watchdog=false \
>>         dc-version=1.1.15-e174ec8 \
>>         cluster-infrastructure=corosync \
>>         stonith-action=poweroff \
>>         no-quorum-policy=stop
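Two remarks on the quoted log and config. The membership formed on
127.0.0.1 suggests corosync fell back to the loopback address once the HA
interface went down; "corosync-cfgtool -s" shows which address each ring
is currently bound to. And pcmk_delay_base is the right knob for the
fencing race, but the 4-second gap between the two values only helps if
it exceeds the time a node needs to detect the failure and fire its own
fencing request, which it often does not. The usual pattern is to delay
only the device that fences the preferred survivor, by a comfortable
margin. A minimal sketch in crm shell, assuming the same primitives as
above (the 15-second value is illustrative, not a tested recommendation):

  # Delay fencing *of* xstha1 (the node meant to hold the resources), so
  # xstha1 wins a mutual-fencing race; fencing of xstha2 fires at once.
  primitive xstha1-stonith stonith:external/ipmi \
          params hostname=xstha1 ipaddr=192.168.221.18 userid=ADMIN \
            passwd="***" interface=lanplus pcmk_delay_base=15 \
          op monitor interval=25 timeout=25 start-delay=25
  primitive xstha2-stonith stonith:external/ipmi \
          params hostname=xstha2 ipaddr=192.168.221.19 userid=ADMIN \
            passwd="***" interface=lanplus \
          op monitor interval=25 timeout=25 start-delay=25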
>> ----------------------------------------------------------------------------
>>
>> From: Andrei Borzenkov <arvidj...@gmail.com>
>> To: users@clusterlabs.org
>> Date: 13 December 2020 7.50.57 CET
>> Subject: Re: [ClusterLabs] Antw: [EXT] Recoveing from node failure
>>
>> 12.12.2020 20:30, Gabriele Bulfon wrote:
>>> Thanks, I will experiment with this.
>>>
>>> Now, I have a last issue about stonith.
>>> I tried to reproduce a stonith situation by disabling the network
>>> interface used for HA on node 1.
>>> Stonith is configured with ipmi poweroff.
>>> What happens is that once the interface is down, both nodes try to
>>> stonith the other node, causing both to power off...
>>
>> Yes, this is expected. The options are basically:
>>
>> 1. Have a separate stonith resource for each node and configure a static
>> (pcmk_delay_base) or random dynamic (pcmk_delay_max) delay to avoid
>> both nodes starting stonith at the same time. This does not take
>> resources into account.
>>
>> 2. Use fencing topology and create a pseudo-stonith agent that does not
>> attempt to do anything but just delays for some time before continuing
>> with the actual fencing agent. The delay can be based on anything,
>> including the resources running on the node.
>>
>> 3. If you are using pacemaker 2.0.3+, you could use the new
>> priority-fencing-delay feature that implements resource-based priority
>> fencing:
>>
>>   + controller/fencing/scheduler: add new feature 'priority-fencing-delay'
>>     Optionally derive the priority of a node from the resource-priorities
>>     of the resources it is running.
>>     In a fencing race the node with the highest priority has a certain
>>     advantage over the others, as fencing requests for that node are
>>     executed with an additional delay.
>>     Controlled via the cluster option priority-fencing-delay (default = 0)
>>
>> See also https://www.mail-archive.com/users@clusterlabs.org/msg10328.html
>>
>>> I would like the node running all resources (zpool and nfs ip) to be
>>> the first trying to stonith the other node.
>>> Or is there anything else better?
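For reference, option 3 above comes down to two pieces: give the
resources that matter a priority, and set the cluster-wide delay. A
minimal sketch in crm shell, assuming a Pacemaker version that supports
priority-fencing-delay (the values are illustrative):

  # Nodes inherit priority from the resources they run; fencing aimed at
  # the higher-priority node is delayed, so it tends to survive the race.
  crm configure rsc_defaults priority=1
  crm configure property priority-fencing-delay=15s

With that in place, whichever node holds zpool_data and the IPs gets the
head start, which is the behaviour asked for above.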
>>> Here is the current crm config show:
>>
>> It is unreadable
>>
>>> node 1: xstha1 \
>>>         attributes standby=off maintenance=off
>>> node 2: xstha2 \
>>>         attributes standby=off maintenance=off
>>> primitive xstha1-stonith stonith:external/ipmi \
>>>         params hostname=xstha1 ipaddr=192.168.221.18 userid=ADMIN \
>>>           passwd="******" interface=lanplus \
>>>         op monitor interval=25 timeout=25 start-delay=25 \
>>>         meta target-role=Started
>>> primitive xstha1_san0_IP IPaddr \
>>>         params ip=10.10.10.1 cidr_netmask=255.255.255.0 nic=san0
>>> primitive xstha2-stonith stonith:external/ipmi \
>>>         params hostname=xstha2 ipaddr=192.168.221.19 userid=ADMIN \
>>>           passwd="******" interface=lanplus \
>>>         op monitor interval=25 timeout=25 start-delay=25 \
>>>         meta target-role=Started
>>> primitive xstha2_san0_IP IPaddr \
>>>         params ip=10.10.10.2 cidr_netmask=255.255.255.0 nic=san0
>>> primitive zpool_data ZFS \
>>>         params pool=test \
>>>         op start timeout=90 interval=0 \
>>>         op stop timeout=90 interval=0 \
>>>         meta target-role=Started
>>> location xstha1-stonith-pref xstha1-stonith -inf: xstha1
>>> location xstha1_san0_IP_pref xstha1_san0_IP 100: xstha1
>>> location xstha2-stonith-pref xstha2-stonith -inf: xstha2
>>> location xstha2_san0_IP_pref xstha2_san0_IP 100: xstha2
>>> order zpool_data_order inf: zpool_data ( xstha1_san0_IP )
>>> location zpool_data_pref zpool_data 100: xstha1
>>> colocation zpool_data_with_IPs inf: zpool_data xstha1_san0_IP
>>> property cib-bootstrap-options: \
>>>         have-watchdog=false \
>>>         dc-version=1.1.15-e174ec8 \
>>>         cluster-infrastructure=corosync \
>>>         stonith-action=poweroff \
>>>         no-quorum-policy=stop
>>>
>>> Thanks!
>>> Gabriele
>>>
>> ----------------------------------------------------------------------------
>>>
>>> From: Andrei Borzenkov <arvidj...@gmail.com>
>>> To: users@clusterlabs.org
>>> Date: 11 December 2020 18.30.29 CET
>>> Subject: Re: [ClusterLabs] Antw: [EXT] Recoveing from node failure
>>>
>>> 11.12.2020 18:37, Gabriele Bulfon wrote:
>>>> I found I can do this temporarily:
>>>>
>>>> crm config property cib-bootstrap-options: no-quorum-policy=ignore
>>>
>>> All two-node clusters I remember run with this setting forever :)
>>>
>>>> then once node 2 is up again:
>>>>
>>>> crm config property cib-bootstrap-options: no-quorum-policy=stop
>>>>
>>>> so that I make sure nodes will not mount in another strange situation.
>>>>
>>>> Is there any better way?
>>>
>>> "better" is subjective, but ...
>>>
>>>> (such as ignore until everything is back to normal, then consider stop
>>>> again)
>>>
>>> That is what stonith does. Because quorum is pretty much useless in a
>>> two-node cluster, as I already said, all clusters I have seen used
>>> no-quorum-policy=ignore and stonith-enabled=true. It means that when a
>>> node boots and the other node is not available, stonith is attempted;
>>> if stonith succeeds, pacemaker continues with starting resources; if
>>> stonith fails, the node is stuck.
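In crm shell, the setup Andrei describes is just two properties:

  crm configure property stonith-enabled=true
  crm configure property no-quorum-policy=ignore

On corosync 2.x or later (this cluster may be running something older),
the common complement is votequorum's two_node mode in corosync.conf,
which grants quorum with a single node up but waits for both nodes on
first start (two_node implies wait_for_all):

  quorum {
      provider: corosync_votequorum
      two_node: 1
  }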