> On 21 Aug 2015, at 11:01 pm, Kostiantyn Ponomarenko <[email protected]> wrote:
>
> Hi Andrew,
>
> >> Recent versions allow this depending on what the configured fencing
> >> devices report.
>
> So the device should be configured to report that it can STONITH the node on which it is running?

It should just report who it can fence. The cluster will decide if/when to use it.
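If it helps while checking a configuration: stonith_admin should be able to show which devices currently claim to be able to fence a given node, e.g.

# stonith_admin --list node-0

(Only an illustration -- the node name is taken from the rest of this thread; check the man page of your version for the exact option name.)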
>
> >> You left out "but the device reports that it did". Your fencing agent
> >> needs to report the truth.
>
> Yes, that was a hole in the configuration - I didn't specify
> pcmk_host_list="node-1" and pcmk_host_check="static-list".
> But the "safety check" that you mentioned before worked perfectly, so I
> didn't notice my mistake in the configuration.
>
> Now I see that I shouldn't rely on the "safety check" and should always
> have a proper configuration for STONITH :)
> The thing is - I am trying to understand how I should modify my config.
> In a two-node cluster it is possible to have only one node running, and in
> that case my current STONITH agents, with "pcmk_host_list" and
> "pcmk_host_check", won't work.
> Even worse, it leads to a situation where a node that cannot find a device
> to reboot itself (say, after a failed "stop") keeps running when it must be
> STONITHed.
>
> Could you help me find the best way to do "self-stonithing"?
> Will it be sufficient to create another stonith agent which will issue
> "reboot -f"?

If that satisfies the level of risk you are prepared to tolerate. Just make sure it only reports the current host.
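Roughly, such an agent only has to answer the metadata/list/monitor queries and then do an unconditional reset of the local host. A minimal, untested sketch -- the agent name is made up, and the key=value-on-stdin convention plus the action names follow the usual fence-agent API, so verify them against the fence_* agents shipped with your version:

#!/bin/sh
#
# fence_local_reset: "last resort" agent that can only reset the node it
# is running on.

action=""

# Some callers pass the action as "-o <action>" on the command line ...
while [ $# -gt 0 ]; do
    case "$1" in
        -o) action="$2"; shift 2 ;;
        *)  shift ;;
    esac
done

# ... but stonithd normally passes everything as key=value pairs on stdin.
if [ -z "$action" ]; then
    while read line; do
        case "$line" in
            action=*) action="${line#action=}" ;;
            option=*) action="${line#option=}" ;;   # older spelling of "action"
        esac
    done
fi

case "$action" in
    metadata)
        cat <<'EOF'
<?xml version="1.0" ?>
<resource-agent name="fence_local_reset" shortdesc="Forcibly reset the local node">
  <parameters/>
  <actions>
    <action name="reboot"/>
    <action name="off"/>
    <action name="monitor"/>
    <action name="list"/>
    <action name="metadata"/>
  </actions>
</resource-agent>
EOF
        ;;
    list)
        # Only ever advertise the host we are running on (the point above).
        uname -n
        ;;
    monitor|status)
        # Nothing to probe; the "device" is the local kernel.
        ;;
    reboot|off)
        # The node goes down before stonithd hears back from us -- that is
        # exactly the level of risk being accepted here.
        reboot -f
        ;;
    *)
        exit 1
        ;;
esac
exit 0

Installed alongside your other fence agents, it would then be configured per node, claiming only its own host, along the lines of (resource name is a placeholder):

# crm configure primitive STONITH_self_node-0 stonith:fence_local_reset \
      params pcmk_host_list="node-0" pcmk_host_check="static-list"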
>
>
> Thank you,
> Kostya
>
> On Mon, Aug 17, 2015 at 1:15 AM, Andrew Beekhof <[email protected]> wrote:
>
> > On 13 Aug 2015, at 9:39 pm, Kostiantyn Ponomarenko <[email protected]> wrote:
> >
> > Hi,
> >
> > Brief description of the STONITH problem:
> >
> > I see two different behaviors with two different STONITH configurations.
> > If Pacemaker cannot find a device that can STONITH a problematic node,
> > the node remains up and running, which is bad, because it must be
> > STONITHed.
> > In contrast, if Pacemaker finds a device that it thinks can STONITH a
> > problematic node, even if the device actually cannot,
>
> You left out "but the device reports that it did". Your fencing agent needs
> to report the truth.
>
> > Pacemaker goes down after STONITH returns a false positive. Pacemaker
> > shuts itself down right after STONITH.
> > Is this the expected behavior?
>
> Yes, it's a safety check:
>
> Aug 11 16:09:53 [9009] A6-4U24-402-T crmd: crit: tengine_stonith_notify: We were alegedly just fenced by node-0 for node-0!
>
>
> > Do I need to configure two more STONITH agents just for rebooting the
> > nodes on which they are running (e.g. with # reboot -f)?
> >
> >
> > +-------------------------
> > + Set-up:
> > +-------------------------
> > - two-node cluster (node-0 and node-1);
> > - two fencing (STONITH) agents are configured (STONITH_node-0 and STONITH_node-1).
> > - "STONITH_node-0" runs only on "node-1"  // this fencing agent can only fence node-0
> > - "STONITH_node-1" runs only on "node-0"  // this fencing agent can only fence node-1
> >
> > +-------------------------
> > + Environment:
> > +-------------------------
> > - one node - "node-0" - is up and running;
> > - one STONITH agent - "STONITH_node-1" - is up and running
> >
> > +-------------------------
> > + Test case:
> > +-------------------------
> > Simulate an error while stopping a resource.
> > 1. start the cluster
> > 2. change an RA's script to return "$OCF_ERR_GENERIC" from its "stop" function
> > 3. stop the resource with "# crm resource stop <resource>"
> >
> > +-------------------------
> > + Actual behavior:
> > +-------------------------
> >
> > CASE 1:
> > STONITH is configured with:
> > # crm configure primitive STONITH_node-1 stonith:fence_sbb_hw \
> >       params pcmk_host_list="node-1" pcmk_host_check="static-list"
> >
> > After issuing a "stop" command:
> > - the resource changes its state to "FAILED"
> > - Pacemaker remains working
> >
> > See below LOG_snippet_1 section.
> >
> >
> > CASE 2:
> > STONITH is configured with:
> > # crm configure primitive STONITH_node-1 stonith:fence_sbb_hw
> >
> > After issuing a "stop" command:
> > - the resource changes its state to "FAILED"
> > - Pacemaker stops working
> >
> > See below LOG_snippet_2 section.
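As an aside, the CASE 1 style of configuration extended to both devices, and pinned the way the Set-up section describes, would look roughly like the following (the constraint names are only placeholders, and banning each device from the node it fences is just one way of expressing "runs only on the other node"):

# crm configure primitive STONITH_node-0 stonith:fence_sbb_hw \
      params pcmk_host_list="node-0" pcmk_host_check="static-list"
# crm configure primitive STONITH_node-1 stonith:fence_sbb_hw \
      params pcmk_host_list="node-1" pcmk_host_check="static-list"
# crm configure location loc_STONITH_node-0 STONITH_node-0 -inf: node-0
# crm configure location loc_STONITH_node-1 STONITH_node-1 -inf: node-1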
> > +-------------------------
> > + LOG_snippet_1:
> > +-------------------------
> > Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: notice: handle_request: Client crmd.39210.fa40430f wants to fence (reboot) 'node-0' with device '(any)'
> > Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: notice: initiate_remote_stonith_op: Initiating remote operation reboot for node-0: 18cc29db-b7e4-4994-85f1-df891f091a0d (0)
> > ....
> > Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: notice: can_fence_host_with_device: STONITH_node-1 can not fence (reboot) node-0: static-list
> > ....
> > Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: notice: stonith_choose_peer: Couldn't find anyone to fence node-0 with <any>
> > Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: info: call_remote_stonith: Total remote op timeout set to 60 for fencing of node node-0 for crmd.39210.18cc29db
> > Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: info: call_remote_stonith: None of the 1 peers have devices capable of terminating node-0 for crmd.39210 (0)
> > ....
> > Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: warning: get_xpath_object: No match for //@st_delegate in /st-reply
> > Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: error: remote_op_done: Operation reboot of node-0 by node-0 for [email protected]: No such device
> > ....
> > Aug 12 16:42:47 [39210] A6-4U24-402-T crmd: notice: tengine_stonith_callback: Stonith operation 3/23:16:0:0856a484-6b69-4280-b93f-1af9a6a542ee: No such device (-19)
> > Aug 12 16:42:47 [39210] A6-4U24-402-T crmd: notice: tengine_stonith_callback: Stonith operation 3 for node-0 failed (No such device): aborting transition.
> > Aug 12 16:42:47 [39210] A6-4U24-402-T crmd: info: abort_transition_graph: Transition aborted: Stonith failed (source=tengine_stonith_callback:697, 0)
> > Aug 12 16:42:47 [39210] A6-4U24-402-T crmd: notice: tengine_stonith_notify: Peer node-0 was not terminated (reboot) by node-0 for node-0: No such device
> >
> >
> > +-------------------------
> > + LOG_snippet_2:
> > +-------------------------
> > Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: notice: handle_request: Client crmd.9009.cabd2154 wants to fence (reboot) 'node-0' with device '(any)'
> > Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: notice: initiate_remote_stonith_op: Initiating remote operation reboot for node-0: 3b06d3ce-b100-46d7-874e-96f10348d9e4 (0)
> > ....
> > Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: notice: can_fence_host_with_device: STONITH_node-1 can fence (reboot) node-0: none
> > ....
> > Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: info: call_remote_stonith: Total remote op timeout set to 60 for fencing of node node-0 for crmd.9009.3b06d3ce
> > Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: info: call_remote_stonith: Requesting that node-0 perform op reboot node-0 for crmd.9009 (72s)
> > ....
> > Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: notice: can_fence_host_with_device: STONITH_node-1 can fence (reboot) node-0: none
> > Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: info: stonith_fence_get_devices_cb: Found 1 matching devices for 'node-0'
> > ....
> > Aug 11 16:09:53 [9005] A6-4U24-402-T stonithd: notice: log_operation: Operation 'reboot' [25511] (call 3 from crmd.9009) for host 'node-0' with device 'STONITH_node-1' returned: 0 (OK)
> > Aug 11 16:09:53 [9005] A6-4U24-402-T stonithd: warning: get_xpath_object: No match for //@st_delegate in /st-reply
> > Aug 11 16:09:53 [9005] A6-4U24-402-T stonithd: notice: remote_op_done: Operation reboot of node-0 by node-0 for [email protected]: OK
> > ....
> > Aug 11 16:09:53 [9009] A6-4U24-402-T crmd: notice: tengine_stonith_callback: Stonith operation 3/23:115:0:70ac834e-5b67-4ca6-9080-c98d2b59e2ee: OK (0)
> > Aug 11 16:09:53 [9009] A6-4U24-402-T crmd: info: crm_update_peer_join: crmd_peer_down: Node node-0[1] - join-2 phase 4 -> 0
> > Aug 11 16:09:53 [9009] A6-4U24-402-T crmd: info: crm_update_peer_expected: crmd_peer_down: Node node-0[1] - expected state is now down (was member)
> > ....
> > Aug 11 16:09:53 [9009] A6-4U24-402-T crmd: crit: tengine_stonith_notify: We were alegedly just fenced by node-0 for node-0!
> > ....
> > Aug 11 16:09:53 [9002] A6-4U24-402-T pacemakerd: error: pcmk_child_exit: Child process crmd (9009) exited: Network is down (100)
> > ....
> > Aug 11 16:09:53 [9002] A6-4U24-402-T pacemakerd: warning: pcmk_child_exit: Pacemaker child process crmd no longer wishes to be respawned. Shutting ourselves down.
> > ....
> > Aug 11 16:09:53 [9002] A6-4U24-402-T pacemakerd: notice: pcmk_shutdown_worker: Shuting down Pacemaker
> >
> >
> > Thank you,
> > Kostya

_______________________________________________
Users mailing list: [email protected]
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
