Hi Andrew,

>> Recent versions allow this depending on what the configured fencing devices report.

So the device should be configured so that it reports it can STONITH the node on which it is running?
>> You left out “but the devices reports that it did”. Your fencing agent needs to report the truth.

Yes, that was a hole in the configuration - I didn't specify pcmk_host_list="node-1" and pcmk_host_check="static-list". But the "safety check" that you mentioned before worked perfectly, so I didn't notice my mistake in the configuration. Now I see that I shouldn't rely on the "safety check" and should have a proper STONITH configuration.

The thing is - I am trying to understand how I should modify my config. With a two-node cluster it is possible to have only one node running, and in that situation my current STONITH agents, configured with "pcmk_host_list" and "pcmk_host_check", won't work. Worse, it leads to a situation where a cluster node that cannot find a device to reboot itself (say, after a failed "stop") keeps running even though it must be STONITHed.

Could you help me find the best way to do "self-stonithing"? Would it be sufficient to create another STONITH agent which issues "reboot -f"?

Thank you,
Kostya

On Mon, Aug 17, 2015 at 1:15 AM, Andrew Beekhof <[email protected]> wrote:
>
> > On 13 Aug 2015, at 9:39 pm, Kostiantyn Ponomarenko <[email protected]> wrote:
> >
> > Hi,
> >
> > Brief description of the STONITH problem:
> >
> > I see two different behaviors with two different STONITH configurations. If Pacemaker cannot find a device that can STONITH a problematic node, the node remains up and running. Which is bad, because it must be STONITHed.
> > As opposite to it, if Pacemaker finds a device that, it thinks, can STONITH a problematic node, even if the device actually cannot,
>
> You left out “but the devices reports that it did”. Your fencing agent needs to report the truth.
>
> > Pacemaker goes down after STONITH returns false positive. The Pacemaker shutdowns itself right after STONITH.
> > Is it the expected behavior?
>
> Yes, its a safety check:
>
>   Aug 11 16:09:53 [9009] A6-4U24-402-T crmd: crit: tengine_stonith_notify: We were alegedly just fenced by node-0 for node-0!
>
> > Do I need to configure a two more STONITH agents for just rebooting nodes on which they are running (e.g. with # reboot -f)?
> >
> > +-------------------------
> > + Set-up:
> > +-------------------------
> > - two node cluster (node-0 and node-1);
> > - two fencing (STONITH) agents are configured (STONITH_node-0 and STONITH_node-1).
> > - "STONITH_node-0" runs only on "node-1" // this fencing agent can only fence node-0
> > - "STONITH_node-1" runs only on "node-0" // this fencing agent can only fence node-1
> >
> > +-------------------------
> > + Environment:
> > +-------------------------
> > - one node - "node-0" - is up and running;
> > - one STONITH agent - "STONITH_node-1" - is up and running
> >
> > +-------------------------
> > + Test case:
> > +-------------------------
> > Simulate error of stopping a resource.
> > 1. start cluster
> > 2. change a RA's script to return "$OCF_ERR_GENERIC" from "Stop" function.
> > 3. stop the resource by "# crm resource stop <resource>"
> >
> > +-------------------------
> > + Actual behavior:
> > +-------------------------
> >
> > CASE 1:
> > STONITH is configured with:
> > # crm configure primitive STONITH_node-1 stonith:fence_sbb_hw \
> >     params pcmk_host_list="node-1" pcmk_host_check="static-list"
> >
> > After issuing a "stop" command:
> > - the resource changes its state to "FAILED"
> > - Pacemaker remains working
> >
> > See below LOG_snippet_1 section.
> >
> > CASE 2:
> > STONITH is configured with:
> > # crm configure primitive STONITH_node-1 stonith:fence_sbb_hw
> >
> > After issuing a "stop" command:
> > - the resource changes its state to "FAILED"
> > - Pacemaker stops working
> >
> > See below LOG_snippet_2 section.
> >
> > +-------------------------
> > + LOG_snippet_1:
> > +-------------------------
> > Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: notice: handle_request: Client crmd.39210.fa40430f wants to fence (reboot) 'node-0' with device '(any)'
> > Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: notice: initiate_remote_stonith_op: Initiating remote operation reboot for node-0: 18cc29db-b7e4-4994-85f1-df891f091a0d (0)
> > ....
> > Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: notice: can_fence_host_with_device: STONITH_node-1 can not fence (reboot) node-0: static-list
> > ....
> > Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: notice: stonith_choose_peer: Couldn't find anyone to fence node-0 with <any>
> > Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: info: call_remote_stonith: Total remote op timeout set to 60 for fencing of node node-0 for crmd.39210.18cc29db
> > Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: info: call_remote_stonith: None of the 1 peers have devices capable of terminating node-0 for crmd.39210 (0)
> > ....
> > Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: warning: get_xpath_object: No match for //@st_delegate in /st-reply
> > Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: error: remote_op_done: Operation reboot of node-0 by node-0 for [email protected]: No such device
> > ....
> > Aug 12 16:42:47 [39210] A6-4U24-402-T crmd: notice: tengine_stonith_callback: Stonith operation 3/23:16:0:0856a484-6b69-4280-b93f-1af9a6a542ee: No such device (-19)
> > Aug 12 16:42:47 [39210] A6-4U24-402-T crmd: notice: tengine_stonith_callback: Stonith operation 3 for node-0 failed (No such device): aborting transition.
> > Aug 12 16:42:47 [39210] A6-4U24-402-T crmd: info: abort_transition_graph: Transition aborted: Stonith failed (source=tengine_stonith_callback:697, 0)
> > Aug 12 16:42:47 [39210] A6-4U24-402-T crmd: notice: tengine_stonith_notify: Peer node-0 was not terminated (reboot) by node-0 for node-0: No such device
> >
> > +-------------------------
> > + LOG_snippet_2:
> > +-------------------------
> > Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: notice: handle_request: Client crmd.9009.cabd2154 wants to fence (reboot) 'node-0' with device '(any)'
> > Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: notice: initiate_remote_stonith_op: Initiating remote operation reboot for node-0: 3b06d3ce-b100-46d7-874e-96f10348d9e4 (0)
> > ....
> > Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: notice: can_fence_host_with_device: STONITH_node-1 can fence (reboot) node-0: none
> > ....
> > Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: info: call_remote_stonith: Total remote op timeout set to 60 for fencing of node node-0 for crmd.9009.3b06d3ce
> > Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: info: call_remote_stonith: Requesting that node-0 perform op reboot node-0 for crmd.9009 (72s)
> > ....
> > Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: notice: can_fence_host_with_device: STONITH_node-1 can fence (reboot) node-0: none
> > Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: info: stonith_fence_get_devices_cb: Found 1 matching devices for 'node-0'
> > ....
> > Aug 11 16:09:53 [9005] A6-4U24-402-T stonithd: notice: log_operation: Operation 'reboot' [25511] (call 3 from crmd.9009) for host 'node-0' with device 'STONITH_node-1' returned: 0 (OK)
> > Aug 11 16:09:53 [9005] A6-4U24-402-T stonithd: warning: get_xpath_object: No match for //@st_delegate in /st-reply
> > Aug 11 16:09:53 [9005] A6-4U24-402-T stonithd: notice: remote_op_done: Operation reboot of node-0 by node-0 for [email protected]: OK
> > ....
> > Aug 11 16:09:53 [9009] A6-4U24-402-T crmd: notice: tengine_stonith_callback: Stonith operation 3/23:115:0:70ac834e-5b67-4ca6-9080-c98d2b59e2ee: OK (0)
> > Aug 11 16:09:53 [9009] A6-4U24-402-T crmd: info: crm_update_peer_join: crmd_peer_down: Node node-0[1] - join-2 phase 4 -> 0
> > Aug 11 16:09:53 [9009] A6-4U24-402-T crmd: info: crm_update_peer_expected: crmd_peer_down: Node node-0[1] - expected state is now down (was member)
> > ....
> > Aug 11 16:09:53 [9009] A6-4U24-402-T crmd: crit: tengine_stonith_notify: We were alegedly just fenced by node-0 for node-0!
> > ....
> > Aug 11 16:09:53 [9002] A6-4U24-402-T pacemakerd: error: pcmk_child_exit: Child process crmd (9009) exited: Network is down (100)
> > ....
> > Aug 11 16:09:53 [9002] A6-4U24-402-T pacemakerd: warning: pcmk_child_exit: Pacemaker child process crmd no longer wishes to be respawned. Shutting ourselves down.
> > ....
> > Aug 11 16:09:53 [9002] A6-4U24-402-T pacemakerd: notice: pcmk_shutdown_worker: Shuting down Pacemaker
> >
> > Thank you,
> > Kostya
> > _______________________________________________
> > Users mailing list: [email protected]
> > http://clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
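For reference, the static-list setup I am describing would look roughly like this (a sketch only; the constraint names are made up here, and any extra fence_sbb_hw parameters are omitted):

```shell
# Each node hosts the agent that fences the *other* node.
crm configure primitive STONITH_node-0 stonith:fence_sbb_hw \
    params pcmk_host_list="node-0" pcmk_host_check="static-list"
crm configure primitive STONITH_node-1 stonith:fence_sbb_hw \
    params pcmk_host_list="node-1" pcmk_host_check="static-list"
# Keep each agent off the node it is supposed to fence.
crm configure location loc-STONITH_node-0 STONITH_node-0 -inf: node-0
crm configure location loc-STONITH_node-1 STONITH_node-1 -inf: node-1
```

With only one node up, the agent that could fence the survivor has nowhere to run - which is exactly the gap I am asking about.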
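As for the "reboot -f" idea, the core of the agent I have in mind would be something like the sketch below (hypothetical; a real RHCS-style fence agent would parse "action=..." key/value pairs from stdin before dispatching, and would need fuller metadata than this):

```shell
#!/bin/sh
# Sketch of a "self-reset" fence agent's dispatch, wrapped in a
# function so the harmless actions are easy to exercise by hand.

fence_self() {
    action="${1:-monitor}"   # default to a harmless action in this sketch
    case "$action" in
        metadata)
            # minimal metadata so stonithd can register the device
            printf '<resource-agent name="fence_self" shortdesc="self reset via reboot -f"/>\n'
            ;;
        monitor|status)
            return 0         # the device is "present" if we can run at all
            ;;
        reboot|off)
            reboot -f        # hard reset of the local node
            ;;
        *)
            return 1         # unknown action
            ;;
    esac
}
```

But I am not sure whether Pacemaker would even schedule such an agent on the node it is meant to kill - hence the question.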
