Re: [Pacemaker] will a stonith resource be moved from an AWOL node?

2013-04-30, Andrew Beekhof

On 01/05/2013, at 1:28 AM, Brian J. Murrell  wrote:

> On 13-04-30 11:13 AM, Lars Marowsky-Bree wrote:
>> 
>> Pacemaker 1.1.8's stonith/fencing subsystem directly ties into the CIB,
>> and will complete the fencing request even if the fencing/stonith
>> resource is not instantiated on the node yet.
> 
> But clearly that's not happening here.

Can you file a bug and attach the logs from both machines?

Unless... are you still using cman or the pacemaker plugin (as shipped or the 
patched one from https://bugzilla.redhat.com/show_bug.cgi?id=951340)?
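
(If you're not sure which stack you're on, the "Stack:" line in crm_mon
output and the corosync config should tell you; something along these
lines, though paths can vary by distro:

# crm_mon -1 | grep -i stack    # e.g. "Stack: classic openais (with plugin)"
# grep -r -A1 "name: pacemaker" /etc/corosync/corosync.conf /etc/corosync/service.d/ 2>/dev/null

A "service { name: pacemaker ... }" block means the plugin; with cman
you'd have /etc/cluster/cluster.conf instead.)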


> 
>> (There's a bug in 1.1.8 as
>> released that causes an annoying delay here, but that's fixed since.)
> 
> Do you know which bug specifically so that I can see if the fix has been
> applied here?
> 
>>> Node node1: UNCLEAN (pending)
>>> Online: [ node2 ]
>> 
>>> node1 is very clearly completely off.  The cluster has been in this state, 
>>> with node1 being off for several 10s of minutes now and still the stonith 
>>> resource is running on it.
>> 
>> It shouldn't take so long. 
> 
> Indeed.  And FWIW, it's still in that state.
> 
>> I think your easiest path is to update.
> 
> Update to what?  I'm already using pacemaker-1.1.8-7 on EL6 and a yum
> update is not providing anything newer.
> 
> Cheers,
> b.


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] will a stonith resource be moved from an AWOL node?

2013-04-30, Brian J. Murrell
On 13-04-30 11:13 AM, Lars Marowsky-Bree wrote:
> 
> Pacemaker 1.1.8's stonith/fencing subsystem directly ties into the CIB,
> and will complete the fencing request even if the fencing/stonith
> resource is not instantiated on the node yet.

But clearly that's not happening here.

> (There's a bug in 1.1.8 as
> released that causes an annoying delay here, but that's fixed since.)

Do you know which bug specifically so that I can see if the fix has been
applied here?
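
(If you can point me at a bug or commit ID, I can grep the package
changelog for it, i.e. something like:

# rpm -q pacemaker
# rpm -q --changelog pacemaker | less

assuming the packager recorded the fix there.)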

>> Node node1: UNCLEAN (pending)
>> Online: [ node2 ]
> 
>> node1 is very clearly completely off.  The cluster has been in this state, 
>> with node1 being off for several 10s of minutes now and still the stonith 
>> resource is running on it.
> 
> It shouldn't take so long. 

Indeed.  And FWIW, it's still in that state.

> I think your easiest path is to update.

Update to what?  I'm already using pacemaker-1.1.8-7 on EL6 and a yum
update is not providing anything newer.

Cheers,
b.







Re: [Pacemaker] will a stonith resource be moved from an AWOL node?

2013-04-30, Lars Marowsky-Bree
On 2013-04-30T10:55:41, "Brian J. Murrell"  wrote:

> From what I think I know of pacemaker, it wants to be able to stonith
> that AWOL node before moving any resources away from it, since starting
> a resource on a new node while the state of the AWOL node is unknown is
> unsafe, right?

Right.

> But of course, if the resource that pacemaker wants to move is the
> stonith resource, there's a bit of a catch-22.  It can't move the
> stonith resource until it can stonith the node, and it can't stonith
> the node because the node running the resource is AWOL.
> 
> So, is pacemaker supposed to resolve this on its own, or am I supposed
> to create a cluster configuration that ensures enough stonith
> resources exist to mitigate this situation?

Pacemaker 1.1.8's stonith/fencing subsystem directly ties into the CIB,
and will complete the fencing request even if the fencing/stonith
resource is not instantiated on the node yet. (There's a bug in 1.1.8 as
released that causes an annoying delay here, but that's fixed since.)

That can appear a bit confusing if you were used to the previous
behaviour.

(And I'm not sure it's a real win given the added complexity in the
project/code, but Andrew and David are convinced it is.)
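
If you want to sanity-check the new code path by hand, a manual fencing
request should be routed via stonithd regardless of where the stonith
resource currently appears to be running; something like the following
(double-check the exact option names with stonith_admin --help on your
build):

# stonith_admin --reboot node1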

> Node node1: UNCLEAN (pending)
> Online: [ node2 ]

> node1 is very clearly completely off.  The cluster has been in this state, 
> with node1 being off for several 10s of minutes now and still the stonith 
> resource is running on it.

It shouldn't take so long. 

I think your easiest path is to update.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde




[Pacemaker] will a stonith resource be moved from an AWOL node?

2013-04-30, Brian J. Murrell
I'm using pacemaker 1.1.8 and I don't see stonith resources moving away
from AWOL hosts like I thought I did with 1.1.7.  So I guess the first
thing to do is clear up what is supposed to happen.

If I have a single stonith resource for a cluster and it's running on
node A and then node A goes AWOL, what happens to that stonith resource?

From what I think I know of pacemaker, it wants to be able to stonith
that AWOL node before moving any resources away from it, since starting
a resource on a new node while the state of the AWOL node is unknown is
unsafe, right?

But of course, if the resource that pacemaker wants to move is the
stonith resource, there's a bit of a catch-22.  It can't move the
stonith resource until it can stonith the node, and it can't stonith
the node because the node running the resource is AWOL.

So, is pacemaker supposed to resolve this on its own, or am I supposed
to create a cluster configuration that ensures enough stonith
resources exist to mitigate this situation?
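
By "enough stonith resources" I mean, for example, one fence device per
node, with each device kept off the node it is supposed to kill.  The
names and fence_xvm options below are just placeholders, and I haven't
verified the exact pcs syntax on this version:

# pcs stonith create fence-node1 fence_xvm port=node1 pcmk_host_list=node1
# pcs stonith create fence-node2 fence_xvm port=node2 pcmk_host_list=node2
# pcs constraint location fence-node1 avoids node1
# pcs constraint location fence-node2 avoids node2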

The case I have in hand is this:

# pcs config
Corosync Nodes:
 
Pacemaker Nodes:
 node1 node2 

Resources: 
 Resource: stonith (type=fence_xvm class=stonith)

Location Constraints:
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
 dc-version: 1.1.8-7.wc1.el6-394e906
 expected-quorum-votes: 2
 no-quorum-policy: ignore
 symmetric-cluster: true
 cluster-infrastructure: classic openais (with plugin)
 stonith-enabled: true
 last-lrm-refresh: 1367331233

# pcs status
Last updated: Tue Apr 30 14:48:06 2013
Last change: Tue Apr 30 14:13:53 2013 via crmd on node2
Stack: classic openais (with plugin)
Current DC: node2 - partition WITHOUT quorum
Version: 1.1.8-7.wc1.el6-394e906
2 Nodes configured, 2 expected votes
1 Resources configured.


Node node1: UNCLEAN (pending)
Online: [ node2 ]

Full list of resources:

 stonith (stonith:fence_xvm): Started node1

node1 is very clearly completely off.  The cluster has been in this state, with 
node1 being off for several 10s of minutes now and still the stonith resource 
is running on it.
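
If there is a way to ask stonithd what (if anything) it has attempted,
I'm happy to run it; I gather recent builds have something along the
lines of:

# stonith_admin --history node1

but I haven't checked whether 1.1.8-7 supports that option.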

The log, since corosync noticed node1 going AWOL:

Apr 30 14:14:56 node2 corosync[1364]:   [TOTEM ] A processor failed, forming new configuration.
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 52: memb=1, new=0, lost=1
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] info: pcmk_peer_update: memb: node2 2608507072
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] info: pcmk_peer_update: lost: node1 4252674240
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 52: memb=1, new=0, lost=0
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] info: pcmk_peer_update: MEMB: node2 2608507072
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] info: ais_mark_unseen_peer_dead: Node node1 was not seen in the previous transition
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] info: update_member: Node 4252674240/node1 is now: lost
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] info: send_member_notification: Sending membership update 52 to 2 children
Apr 30 14:14:57 node2 corosync[1364]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 30 14:14:57 node2 corosync[1364]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.122.155) ; members(old:2 left:1)
Apr 30 14:14:57 node2 corosync[1364]:   [MAIN  ] Completed service synchronization, ready to provide service.
Apr 30 14:14:57 node2 crmd[1666]:   notice: ais_dispatch_message: Membership 52: quorum lost
Apr 30 14:14:57 node2 crmd[1666]:   notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 30 14:14:57 node2 crmd[1666]:  warning: match_down_event: No match for shutdown action on node1
Apr 30 14:14:57 node2 crmd[1666]:   notice: peer_update_callback: Stonith/shutdown of node1 not matched
Apr 30 14:14:57 node2 cib[1661]:   notice: ais_dispatch_message: Membership 52: quorum lost
Apr 30 14:14:57 node2 cib[1661]:   notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 30 14:14:57 node2 crmd[1666]:   notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
Apr 30 14:14:57 node2 attrd[1664]:   notice: attrd_local_callback: Sending full refresh (origin=crmd)
Apr 30 14:14:57 node2 attrd[1664]:   notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Apr 30 14:14:58 node2 pengine[1665]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Apr 30 14:14:58 node2 pengine[1665]: crit: get_timet_now: Defaulting to 'now'
Apr 30 14:14:58 node2 pengine[1665]: crit: get_timet_now: Defaulting to 'now'
Apr 30 14:14:58 node2 pengine[1665]: crit: get_timet_now: Defaulting to 'now'
Apr 30 14:14:58 node2 pengine[1665]: crit: get_timet_now: Defaulting to