> On 13 Aug 2015, at 9:39 pm, Kostiantyn Ponomarenko 
> <[email protected]> wrote:
> 
> Hi,
> 
> Brief description of the STONITH problem: 
> 
> I see two different behaviors with two different STONITH configurations. If 
> Pacemaker cannot find a device that can STONITH a problematic node, the node 
> remains up and running, which is bad, because it must be STONITHed.
> Conversely, if Pacemaker finds a device that it thinks can STONITH a 
> problematic node, even if the device actually cannot,

You left out “but the device reports that it did”.  Your fencing agent needs 
to report the truth. 
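
As a rough illustration of what "report the truth" means, here is a minimal sketch of the exit-code logic a fence agent should follow. Everything in it is hypothetical (the function name, the `reachable` list, the `power_cycle` callback) and none of it is taken from fence_sbb_hw; the only point is that the agent returns 0 when, and only when, it really reset the requested target.

```python
from typing import Callable, Iterable


def run_fence_action(target: str,
                     reachable: Iterable[str],
                     power_cycle: Callable[[str], bool]) -> int:
    """Return 0 only if `target` was really reset, non-zero otherwise.

    `reachable` and `power_cycle` are illustrative stand-ins for whatever
    the real hardware interface provides.
    """
    if target not in set(reachable):
        # This device has no way to reset `target`: fail loudly
        # instead of claiming success.
        return 1
    # power_cycle is assumed to return True only on a confirmed reset.
    return 0 if power_cycle(target) else 1


if __name__ == "__main__":
    def confirmed(host: str) -> bool:
        return True

    # A device that can only reset node-1, as in your set-up.
    print(run_fence_action("node-1", ["node-1"], confirmed))
    print(run_fence_action("node-0", ["node-1"], confirmed))
```

With an agent behaving like this, a request to fence node-0 through a device that can only reach node-1 fails, rather than coming back as OK.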

> Pacemaker goes down after STONITH returns a false positive. Pacemaker 
> shuts itself down right after STONITH.
> Is this the expected behavior?

Yes, it's a safety check:

    Aug 11 16:09:53 [9009] A6-4U24-402-T       crmd:     crit: 
tengine_stonith_notify:  We were alegedly just fenced by node-0 for node-0!
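
Roughly, the decision looks like the sketch below. This is not Pacemaker's code: the function name and the returned strings are made up, and it only models what is visible in the two log snippets. A successful fencing result whose target is the local node means that node should already be dead, so the daemon exits rather than carry on.

```python
def on_fence_result(local_node: str, target: str, rc: int) -> str:
    """Model the reaction to a fencing result (hypothetical names throughout)."""
    if rc != 0:
        # Fencing failed (e.g. -19, "No such device"): the transition is
        # aborted and the cluster keeps running.
        return "abort-transition"
    if target == local_node:
        # We were reported as fenced, yet we are still running: stop now.
        return "exit"
    # A peer was fenced successfully: treat it as down and continue.
    return "peer-down"


if __name__ == "__main__":
    print(on_fence_result("node-0", "node-0", -19))  # your CASE 1
    print(on_fence_result("node-0", "node-0", 0))    # your CASE 2
```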


> Do I need to configure two more STONITH agents just for rebooting the nodes on 
> which they are running (e.g. with # reboot -f)?
> 
> 
> 
> +-------------------------
> + Set-up:
> +-------------------------
> - two node cluster (node-0 and node-1);
> - two fencing (STONITH) agents are configured (STONITH_node-0 and 
> STONITH_node-1).
> - "STONITH_node-0" runs only on "node-1" // this fencing agent can only fence 
> node-0
> - "STONITH_node-1" runs only on "node-0" // this fencing agent can only fence 
> node-1
> 
> +-------------------------
> + Environment:
> +-------------------------
> - one node - "node-0" - is up and running;
> - one STONITH agent - "STONITH_node-1" - is up and running
> 
> +-------------------------
> + Test case:
> +-------------------------
> Simulate a failure to stop a resource.
> 1. start the cluster
> 2. change an RA's script to return "$OCF_ERR_GENERIC" from the "Stop" function.
> 3. stop the resource with "# crm resource stop <resource>"
> 
> +-------------------------
> + Actual behavior:
> +-------------------------
> 
>     CASE 1:
> STONITH is configured with:
> # crm configure primitive STONITH_node-1 stonith:fence_sbb_hw \
>         params pcmk_host_list="node-1" pcmk_host_check="static-list"
> 
> After issuing a "stop" command:
>     - the resource changes its state to "FAILED"
>     - Pacemaker remains working
> 
> See the LOG_snippet_1 section below.
> 
> 
>     CASE 2:
> STONITH is configured with:
> # crm configure primitive STONITH_node-1 stonith:fence_sbb_hw
> 
> After issuing a "stop" command:
>     - the resource changes its state to "FAILED"
>     - Pacemaker stops working
> 
> See the LOG_snippet_2 section below.
> 
> 
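
The difference between the two cases comes down to the eligibility check that can_fence_host_with_device logs. A minimal sketch, modelling only the two pcmk_host_check values that appear in your logs ("static-list" and "none"); the function itself is hypothetical and is not Pacemaker's implementation:

```python
from typing import Sequence


def can_fence(host_check: str, host_list: Sequence[str], target: str) -> bool:
    """Decide whether a device is considered able to fence `target`."""
    if host_check == "static-list":
        # Only hosts named in pcmk_host_list are eligible.
        return target in host_list
    if host_check == "none":
        # No check at all: the device is assumed able to fence any host.
        return True
    raise ValueError(f"host check {host_check!r} is not modelled in this sketch")


if __name__ == "__main__":
    # CASE 1: static list containing only node-1, so node-0 is rejected.
    print(can_fence("static-list", ["node-1"], "node-0"))
    # CASE 2: no check, so the device is offered for node-0 as well.
    print(can_fence("none", [], "node-0"))
```

In CASE 2 the device is therefore selected for node-0, and whatever the agent then returns is taken at face value.
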
> +-------------------------
> + LOG_snippet_1:
> +-------------------------
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:   notice: handle_request:   
>   Client crmd.39210.fa40430f wants to fence (reboot) 'node-0' with device 
> '(any)'
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:   notice: 
> initiate_remote_stonith_op:     Initiating remote operation reboot for 
> node-0: 18cc29db-b7e4-4994-85f1-df891f091a0d (0)
> ....
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:   notice: 
> can_fence_host_with_device:     STONITH_node-1 can not fence (reboot) node-0: 
> static-list
> ....
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:   notice: 
> stonith_choose_peer:    Couldn't find anyone to fence node-0 with <any>
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:     info: 
> call_remote_stonith:    Total remote op timeout set to 60 for fencing of node 
> node-0 for crmd.39210.18cc29db
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:     info: 
> call_remote_stonith:    None of the 1 peers have devices capable of 
> terminating node-0 for crmd.39210 (0)
> ....
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:  warning: get_xpath_object: 
>   No match for //@st_delegate in /st-reply
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:    error: remote_op_done:   
>   Operation reboot of node-0 by node-0 for [email protected]: No 
> such device
> ....
> Aug 12 16:42:47 [39210] A6-4U24-402-T       crmd:   notice: 
> tengine_stonith_callback:   Stonith operation 
> 3/23:16:0:0856a484-6b69-4280-b93f-1af9a6a542ee: No such device (-19)
> Aug 12 16:42:47 [39210] A6-4U24-402-T       crmd:   notice: 
> tengine_stonith_callback:   Stonith operation 3 for node-0 failed (No such 
> device): aborting transition.
> Aug 12 16:42:47 [39210] A6-4U24-402-T       crmd:     info: 
> abort_transition_graph:     Transition aborted: Stonith failed 
> (source=tengine_stonith_callback:697, 0)
> Aug 12 16:42:47 [39210] A6-4U24-402-T       crmd:   notice: 
> tengine_stonith_notify:     Peer node-0 was not terminated (reboot) by node-0 
> for node-0: No such device
> 
> 
> +-------------------------
> + LOG_snippet_2:
> +-------------------------
> Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:   notice: handle_request:  
> Client crmd.9009.cabd2154 wants to fence (reboot) 'node-0' with device '(any)'
> Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:   notice: 
> initiate_remote_stonith_op:  Initiating remote operation reboot for node-0: 
> 3b06d3ce-b100-46d7-874e-96f10348d9e4 (0)
> ....
> Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:   notice: 
> can_fence_host_with_device:  STONITH_node-1 can fence (reboot) node-0: none
> ....
> Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:     info: 
> call_remote_stonith:     Total remote op timeout set to 60 for fencing of 
> node node-0 for crmd.9009.3b06d3ce
> Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:     info: 
> call_remote_stonith:     Requesting that node-0 perform op reboot node-0 for 
> crmd.9009 (72s)
> ....
> Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:   notice: 
> can_fence_host_with_device:  STONITH_node-1 can fence (reboot) node-0: none
> Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:     info: 
> stonith_fence_get_devices_cb:    Found 1 matching devices for 'node-0'
> ....
> Aug 11 16:09:53 [9005] A6-4U24-402-T   stonithd:   notice: log_operation:   
> Operation 'reboot' [25511] (call 3 from crmd.9009) for host 'node-0' with 
> device 'STONITH_node-1' returned: 0 (OK)
> Aug 11 16:09:53 [9005] A6-4U24-402-T   stonithd:  warning: get_xpath_object:  
>   No match for //@st_delegate in /st-reply
> Aug 11 16:09:53 [9005] A6-4U24-402-T   stonithd:   notice: remote_op_done:  
> Operation reboot of node-0 by node-0 for [email protected]: OK
> ....
> Aug 11 16:09:53 [9009] A6-4U24-402-T       crmd:   notice: 
> tengine_stonith_callback:    Stonith operation 
> 3/23:115:0:70ac834e-5b67-4ca6-9080-c98d2b59e2ee: OK (0)
> Aug 11 16:09:53 [9009] A6-4U24-402-T       crmd:     info: 
> crm_update_peer_join:    crmd_peer_down: Node node-0[1] - join-2 phase 4 -> 0
> Aug 11 16:09:53 [9009] A6-4U24-402-T       crmd:     info: 
> crm_update_peer_expected:    crmd_peer_down: Node node-0[1] - expected state 
> is now down (was member)
> ....
> Aug 11 16:09:53 [9009] A6-4U24-402-T       crmd:     crit: 
> tengine_stonith_notify:  We were alegedly just fenced by node-0 for node-0!
> ....
> Aug 11 16:09:53 [9002] A6-4U24-402-T pacemakerd:    error: pcmk_child_exit:   
>   Child process crmd (9009) exited: Network is down (100)
> ....
> Aug 11 16:09:53 [9002] A6-4U24-402-T pacemakerd:  warning: pcmk_child_exit:   
>   Pacemaker child process crmd no longer wishes to be respawned. Shutting 
> ourselves down.
> ....
> Aug 11 16:09:53 [9002] A6-4U24-402-T pacemakerd:   notice: 
> pcmk_shutdown_worker:    Shuting down Pacemaker
> 
> 
> Thank you,
> Kostya
> _______________________________________________
> Users mailing list: [email protected]
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

