Hi, On Wed, Oct 14, 2015 at 10:31:40PM +0530, Shyam wrote: > Hi all, > > We were previously using Pacemaker 1.1.6 on Ubuntu Precise. We are shifting > to Ubuntu Trusty & it has Pacemaker 1.1.10. > > We have our own STONITH resource & with the pacemaker upgrade this is > broken. The problem that I am hitting is not something new & is exactly > that was reported in > http://clusterlabs.org/pipermail/users/2015-June/000686.html > > Basically Hearbeat + Pacemaker stack suffers this problem (even with the > github mainline code) & since a lot of people are using corosync I think > the problem is not felt with. > > The basic problem is this. Between Pacemaker 1.1.6 to 1.1.10, the following > change was done > > fencing/remote.c > > static void remote_op_done(remote_fencing_op_t * op, xmlNode * data, int > rc, int dup) > > if (dup == FALSE && safe_str_neq(subt, "broadcast")) { > /* Defer notification until the bcast message arrives */ > bcast_result_to_peers(op, rc); > goto remote_op_done_cleanup; > > remote_op_done() previously used to send a direct reply. Now it was changed > that it does a broadcast of the result & this broadcast is supposed to come > back to the same fencing agent upon which finishes the operation. However > this broadcast that is being sent by stonithd doesnt reach back itself. > This causes eventually crmd to timeout on stonith_async_timeout_handler() > to be triggered & keep retrying STONITH. so essentially STONITH is broken.
I can vaguelly recall some stonith/heartbeat issue which got fixed by Lars: commit cc34288a7b2276aa238546f4aa92fa79b8cbcf88 Author: Lars Ellenberg <[email protected]> Date: Tue Dec 9 10:48:33 2014 +0100 Sounds like the same problem. Thanks, Dejan > I tried setting up bcast in /etc/ha.d/ha.cf, but it doesnt help. To confirm > the issue, I removed the above flow for bcast_result_to_peers() & then > STONITH works perfectly fine. Any pointers/help appreciated on how to > resolve this issue in the right way? > > > From logs > > CRMD issues STONITH reboot to stonith-ng > > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG: Dumping message with > 19 fields > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[0] : > [__name__=stonith_command] > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[1] : > [__name__=stonith_command] > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[2] : [t=stonith-ng] > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[3] : > [st_async_id=617dce02-057f-42b8-9df2-33bc4e64780d] > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[4] : [st_op=st_fence] > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[5] : [st_callid=2] > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[6] : [st_callopt=0] > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[7] : > [st_remote_op=617dce02-057f-42b8-9df2-33bc4e64780d] > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[8] : [st_target=node1] > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[9] : > [st_device_action=reboot] > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[10] : [st_origin=node0] > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[11] : > [st_clientid=346fea6b-a55d-4873-b630-14287d5bc71e] > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[12] : > [st_clientname=crmd.14483] > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[13] : [st_timeout=90] > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[14] : [st_mode=smart] > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[15] : [dest=node0] > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[16] : [oseq=2] > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[17] : > [from_id=stonith-ng] > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[18] : > [to_id=stonith-ng] > > > stonith-ng successfully finishes the STONITH > Oct 14 14:57:03 node0 stonith-ng[14481]: notice: log_operation: Operation > 'reboot' [14657] (call 2 from crmd.14483) for host 'node1' with device > 'Z-FENCE:0' returned: 0 (OK) > Oct 14 14:57:03 node0 stonith-ng[14481]: debug: > stonith_send_async_reply: Directed reply to node0 > Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: > outbound <st-reply st_origin="node0" t="stonith-ng" st_op="st_fence" > st_device_id="Z-FENCE:0" > st_remote_op="617dce02-057f-42b8-9df2-33bc4e64780d" > st_clientid="346fea6b-a55d-4873-b630-14287d5bc71e" > st_clientname="crmd.14483" st_target="node1" st_device_action="reboot" > st_callid="2" st_callopt="0" st_rc="0" st_output="Performing: stonith -t > external/zstonith -T reset node1\nsuccess: node1 0\n"/> > > > & stonith-ng sends a self message for st-reply > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG: Dumping message with > 18 fields > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[0] : > [__name__=st-reply] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[1] : [st_origin=node0] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[2] : [t=stonith-ng] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[3] : [st_op=st_fence] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[4] : > [st_device_id=Z-FENCE:0] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[5] : > [st_remote_op=617dce02-057f-42b8-9df2-33bc4e64780d] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[6] : > [st_clientid=346fea6b-a55d-4873-b630-14287d5bc71e] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[7] : > [st_clientname=crmd.14483] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[8] : [st_target=node1] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[9] : > [st_device_action=reboot] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[10] : [st_callid=2] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[11] : [st_callopt=0] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[12] : [st_rc=0] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[13] : > [st_output=Performing: stonith -t external/zstonith -T reset > node1#012success: node1 0#012] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[14] : [dest=node0] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[15] : [oseq=3] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[16] : > [from_id=stonith-ng] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[17] : > [to_id=stonith-ng] > > > As part of this stonith-ng does a broadcast > Oct 14 14:57:03 node0 stonith-ng[14481]: debug: stonith_command: > Processing st_fence reply 0 from node0 ( 0) > Oct 14 14:57:03 node0 stonith-ng[14481]: debug: remote_op_done: > Broadcasting result > Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: > outbound <st-reply t="st_notify" subt="broadcast" st_op="st_notify" > count="1"> > Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: > outbound <st_calldata> > Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: > outbound <st_notify_fence state="2" st_rc="0" st_target="node1" > st_device_action="reboot" st_delegate="node0" > st_remote_op="617dce02-057f-42b8-9df2-33bc4e64780d" st_origin="node0" > st_clientid="346fea6b-a55d-4873-b630-14287d5bc71e" > st_clientname="crmd.14483"/> > Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: > outbound </st_calldata> > Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: > outbound </st-reply> > > > that is realized at hearbeat > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG: Dumping message with > 8 fields > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[0] : > [__name__=st-reply] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[1] : [t=st_notify] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[2] : [subt=broadcast] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[3] : [st_op=st_notify] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[4] : [count=1] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[5] : > [st_calldata=<st_calldata><st_notify_fence state="2" st_rc="0" > st_target="node1" st_device_action="reboot" st_delegate="node0" > st_remote_op="617dce02-057f-42b8-9df2-33bc4e64780d" st_origin="node0" > st_clientid="346fea6b-a55d-4873-b630-14287d5bc71e" > st_clientname="crmd.14483"/></st_calldata>] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[6] : > [from_id=stonith-ng] > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[7] : [to_id=stonith-ng] > > but then on nothing really happens. > > eventually crmd gives up on timeout > Oct 14 14:59:48 node0 crmd[14483]: error: stonith_async_timeout_handler: > Async call 2 timed out after 168000ms > > Thanks. > > --Shyam > _______________________________________________ > Users mailing list: [email protected] > http://clusterlabs.org/mailman/listinfo/users > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org _______________________________________________ Users mailing list: [email protected] http://clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
