Re: [Pacemaker] will a stonith resource be moved from an AWOL node?
On 01/05/2013, at 1:28 AM, Brian J. Murrell wrote:
> On 13-04-30 11:13 AM, Lars Marowsky-Bree wrote:
>>
>> Pacemaker 1.1.8's stonith/fencing subsystem directly ties into the CIB,
>> and will complete the fencing request even if the fencing/stonith
>> resource is not instantiated on the node yet.
>
> But clearly that's not happening here.

Can you file a bug and attach the logs from both machines?

Unless... are you still using cman or the pacemaker plugin (as shipped or
the patched one from https://bugzilla.redhat.com/show_bug.cgi?id=951340)?

>
>> (There's a bug in 1.1.8 as
>> released that causes an annoying delay here, but that's fixed since.)
>
> Do you know which bug specifically so that I can see if the fix has been
> applied here?
>
>>> Node node1: UNCLEAN (pending)
>>> Online: [ node2 ]
>>
>>> node1 is very clearly completely off.  The cluster has been in this state,
>>> with node1 being off for several 10s of minutes now and still the stonith
>>> resource is running on it.
>>
>> It shouldn't take so long.
>
> Indeed.  And FWIW, it's still in that state.
>
>> I think your easiest path is to update.
>
> Update to what?  I'm already using pacemaker-1.1.8-7 on EL6 and a yum
> update is not providing anything newer.
>
> Cheers,
> b.
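A rough sketch of gathering the requested logs from both nodes with crm_report (the exact option names vary between Pacemaker releases, and the time window and report name below are only illustrative; check crm_report --help on the installed build):

    # collect logs, CIB history and stonith state from both nodes around the failure
    crm_report -f "2013-04-30 14:00:00" -t "2013-04-30 15:00:00" \
        -n "node1 node2" awol-stonith-report
    # attach the resulting awol-stonith-report tarball to the bug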
Re: [Pacemaker] will a stonith resource be moved from an AWOL node?
On 13-04-30 11:13 AM, Lars Marowsky-Bree wrote:
>
> Pacemaker 1.1.8's stonith/fencing subsystem directly ties into the CIB,
> and will complete the fencing request even if the fencing/stonith
> resource is not instantiated on the node yet.

But clearly that's not happening here.

> (There's a bug in 1.1.8 as
> released that causes an annoying delay here, but that's fixed since.)

Do you know which bug specifically so that I can see if the fix has been
applied here?

>> Node node1: UNCLEAN (pending)
>> Online: [ node2 ]
>
>> node1 is very clearly completely off.  The cluster has been in this state,
>> with node1 being off for several 10s of minutes now and still the stonith
>> resource is running on it.
>
> It shouldn't take so long.

Indeed.  And FWIW, it's still in that state.

> I think your easiest path is to update.

Update to what?  I'm already using pacemaker-1.1.8-7 on EL6 and a yum
update is not providing anything newer.

Cheers,
b.
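Once the specific bug or upstream commit is known, one way to check whether the fix landed in the installed build is to inspect the package changelog and confirm exactly which stack components are in play (package names here match the thread; the grep target is whatever bug number turns out to apply):

    # show the RPM changelog of the installed pacemaker package and scan it
    # for the bug/commit reference once that is known
    rpm -q --changelog pacemaker | less
    # confirm which cluster stack pieces are actually installed
    rpm -q pacemaker corosync cman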
Re: [Pacemaker] will a stonith resource be moved from an AWOL node?
On 2013-04-30T10:55:41, "Brian J. Murrell" wrote:

> From what I think I know of pacemaker, pacemaker wants to be able to
> stonith that AWOL node before moving any resources away from it since
> starting a resource on a new node while the state of the AWOL node is
> unknown is unsafe, right?

Right.

> But of course, if the resource that pacemaker wants to move is the
> stonith resource there's a bit of a catch-22.  It can't move the
> stonith resource until it can stonith the node, and it cannot stonith
> the node because the node running the resource is AWOL.
>
> So, is pacemaker supposed to resolve this on its own or am I supposed
> to create a cluster configuration that ensures that enough stonith
> resources exist to mitigate this situation?

Pacemaker 1.1.8's stonith/fencing subsystem directly ties into the CIB,
and will complete the fencing request even if the fencing/stonith
resource is not instantiated on the node yet.  (There's a bug in 1.1.8 as
released that causes an annoying delay here, but that's fixed since.)

That can appear to be a bit confusing if you were used to the previous
behaviour.  (And I'm not sure it's a real win for the complexity of the
project/code, but Andrew and David are.)

> Node node1: UNCLEAN (pending)
> Online: [ node2 ]

> node1 is very clearly completely off.  The cluster has been in this state,
> with node1 being off for several 10s of minutes now and still the stonith
> resource is running on it.

It shouldn't take so long.

I think your easiest path is to update.

Regards,
    Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
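A quick way to sanity-check the fencing path described above from the surviving node, independent of where the stonith resource is shown as running (the flags are from memory and may differ slightly on older builds; see stonith_admin --help):

    # list the devices that claim to be able to fence node1
    stonith_admin --list node1
    # ask the fencing subsystem to power-cycle node1 directly
    stonith_admin --reboot node1

If the reboot request completes even while crm_mon still shows the stonith resource "Started" on the dead node, that matches the CIB-integrated behaviour Lars describes.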
[Pacemaker] will a stonith resource be moved from an AWOL node?
I'm using pacemaker 1.1.8 and I don't see stonith resources moving away
from AWOL hosts like I thought I did with 1.1.7.  So I guess the first
thing to do is clear up what is supposed to happen.  If I have a single
stonith resource for a cluster and it's running on node A and then node A
goes AWOL, what happens to that stonith resource?

From what I think I know of pacemaker, pacemaker wants to be able to
stonith that AWOL node before moving any resources away from it since
starting a resource on a new node while the state of the AWOL node is
unknown is unsafe, right?

But of course, if the resource that pacemaker wants to move is the
stonith resource there's a bit of a catch-22.  It can't move the
stonith resource until it can stonith the node, and it cannot stonith
the node because the node running the resource is AWOL.

So, is pacemaker supposed to resolve this on its own or am I supposed
to create a cluster configuration that ensures that enough stonith
resources exist to mitigate this situation?

The case I have in hand is this:

# pcs config
Corosync Nodes:

Pacemaker Nodes:
 node1 node2

Resources:
 Resource: stonith (type=fence_xvm class=stonith)

Location Constraints:
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
 dc-version: 1.1.8-7.wc1.el6-394e906
 expected-quorum-votes: 2
 no-quorum-policy: ignore
 symmetric-cluster: true
 cluster-infrastructure: classic openais (with plugin)
 stonith-enabled: true
 last-lrm-refresh: 1367331233

# pcs status
Last updated: Tue Apr 30 14:48:06 2013
Last change: Tue Apr 30 14:13:53 2013 via crmd on node2
Stack: classic openais (with plugin)
Current DC: node2 - partition WITHOUT quorum
Version: 1.1.8-7.wc1.el6-394e906
2 Nodes configured, 2 expected votes
1 Resources configured.

Node node1: UNCLEAN (pending)
Online: [ node2 ]

Full list of resources:

 stonith (stonith:fence_xvm): Started node1

node1 is very clearly completely off.  The cluster has been in this state,
with node1 being off for several 10s of minutes now and still the stonith
resource is running on it.

The log, since corosync noticed node1 going AWOL:

Apr 30 14:14:56 node2 corosync[1364]: [TOTEM ] A processor failed, forming new configuration.
Apr 30 14:14:57 node2 corosync[1364]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 52: memb=1, new=0, lost=1
Apr 30 14:14:57 node2 corosync[1364]: [pcmk ] info: pcmk_peer_update: memb: node2 2608507072
Apr 30 14:14:57 node2 corosync[1364]: [pcmk ] info: pcmk_peer_update: lost: node1 4252674240
Apr 30 14:14:57 node2 corosync[1364]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 52: memb=1, new=0, lost=0
Apr 30 14:14:57 node2 corosync[1364]: [pcmk ] info: pcmk_peer_update: MEMB: node2 2608507072
Apr 30 14:14:57 node2 corosync[1364]: [pcmk ] info: ais_mark_unseen_peer_dead: Node node1 was not seen in the previous transition
Apr 30 14:14:57 node2 corosync[1364]: [pcmk ] info: update_member: Node 4252674240/node1 is now: lost
Apr 30 14:14:57 node2 corosync[1364]: [pcmk ] info: send_member_notification: Sending membership update 52 to 2 children
Apr 30 14:14:57 node2 corosync[1364]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 30 14:14:57 node2 corosync[1364]: [CPG ] chosen downlist: sender r(0) ip(192.168.122.155) ; members(old:2 left:1)
Apr 30 14:14:57 node2 corosync[1364]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 30 14:14:57 node2 crmd[1666]: notice: ais_dispatch_message: Membership 52: quorum lost
Apr 30 14:14:57 node2 crmd[1666]: notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 30 14:14:57 node2 crmd[1666]: warning: match_down_event: No match for shutdown action on node1
Apr 30 14:14:57 node2 crmd[1666]: notice: peer_update_callback: Stonith/shutdown of node1 not matched
Apr 30 14:14:57 node2 cib[1661]: notice: ais_dispatch_message: Membership 52: quorum lost
Apr 30 14:14:57 node2 cib[1661]: notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 30 14:14:57 node2 crmd[1666]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
Apr 30 14:14:57 node2 attrd[1664]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Apr 30 14:14:57 node2 attrd[1664]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Apr 30 14:14:58 node2 pengine[1665]: notice: unpack_config: On loss of CCM Quorum: Ignore
Apr 30 14:14:58 node2 pengine[1665]: crit: get_timet_now: Defaulting to 'now'
Apr 30 14:14:58 node2 pengine[1665]: crit: get_timet_now: Defaulting to 'now'
Apr 30 14:14:58 node2 pengine[1665]: crit: get_timet_now: Defaulting to 'now'
Apr 30 14:14:58 node2 pengine[1665]: crit: get_timet_now: Defaulting to
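On the "enough stonith resources" half of the question, a hedged sketch of one common layout is to define one fence_xvm resource per node and keep each away from the node it fences with a location constraint. The resource names and the port/pcmk_host_list parameters below are illustrative, and the pcs syntax should be checked against the installed pcs version:

    # one fencing resource per victim, each discouraged from running on its victim
    pcs stonith create fence-node1 fence_xvm port="node1" pcmk_host_list="node1"
    pcs stonith create fence-node2 fence_xvm port="node2" pcmk_host_list="node2"
    pcs constraint location fence-node1 avoids node1
    pcs constraint location fence-node2 avoids node2

Whether this is strictly needed is what the replies above address: with 1.1.8's CIB-integrated fencing, a fencing request can reportedly complete even if the stonith resource is not instantiated anywhere, so the per-node layout is mainly a belt-and-braces configuration choice.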