Re: [Pacemaker] [Problem][crmsh]The designation of the 'ordered' attribute becomes the error.

2013-04-30 Thread Keisuke MORI
Hi Dejan, Andreas, Yamauchi-san


2013/4/18 renayama19661...@ybb.ne.jp

 Hi Dejan,
 Hi Andreas,

  The shell in pacemaker v1.0.x is in maintenance mode and shipped
  along with the pacemaker code. The v1.1.x doesn't have the
  ordered and collocated meta attributes.

 I sent the pull request of the patch which Mr. Dejan donated.
  * https://github.com/ClusterLabs/pacemaker-1.0/pull/14


The patch for crmsh is now included in the 1.0.x repository:

https://github.com/ClusterLabs/pacemaker-1.0/commit/9227e89fb748cd52d330f5fca80d56fbd9d3efbf

It will appear in the 1.0.14 maintenance release, which is not yet scheduled,
though.

Thanks,

Keisuke MORI



 Many Thanks!
 Hideo Yamauchi.
 --- On Tue, 2013/4/2, Dejan Muhamedagic deja...@fastmail.fm wrote:

  Hi,
 
  On Mon, Apr 01, 2013 at 09:19:51PM +0200, Andreas Kurz wrote:
   Hi Dejan,
  
   On 2013-03-06 11:59, Dejan Muhamedagic wrote:
Hi Hideo-san,
   
On Wed, Mar 06, 2013 at 10:37:44AM +0900, 
 renayama19661...@ybb.ne.jp wrote:
Hi Dejan,
Hi Andrew,
   
As for the crm shell, the check of the meta attribute was revised
 with the next patch.
   
 * http://hg.savannah.gnu.org/hgweb/crmsh/rev/d1174f42f4b3
   
This patch was backported in Pacemaker1.0.13.
   
 *
 https://github.com/ClusterLabs/pacemaker-1.0/commit/fa1a99ab36e0ed015f1bcbbb28f7db962a9d1abc#shell/modules/cibconfig.py
   
However, the ordered/colocated attributes of a group resource are
 treated as an error when I use a crm shell that includes this patch.
   
--
(snip)
### Group Configuration ###
group master-group \
vip-master \
vip-rep \
meta \
ordered=false
(snip)
   
[root@rh63-heartbeat1 ~]# crm configure load update test2339.crm
INFO: building help index
crm_verify[20028]: 2013/03/06_17:57:18 WARN: unpack_nodes: Blind
 faith: not fencing unseen nodes
WARNING: vip-master: specified timeout 60s for start is smaller
 than the advised 90
WARNING: vip-master: specified timeout 60s for stop is smaller than
 the advised 100
WARNING: vip-rep: specified timeout 60s for start is smaller than
 the advised 90
WARNING: vip-rep: specified timeout 60s for stop is smaller than
 the advised 100
ERROR: master-group: attribute ordered does not exist  - WHY?
Do you still want to commit? y
--
   
If I choose `yes` at the confirmation prompt, the change is applied, but
 the problem is that an error message is displayed at all.
 * The same error occurs when I specify the colocated attribute.
And I noticed that there is no explanation of ordered/colocated
 for group resources in the online help of Pacemaker.
   
I think that specifying the ordered/colocated attributes
 should not be treated as an error for a group resource.
In addition, I think that ordered/colocated should be added to the
 online help.
   
These attributes are not listed in crmsh. Does the attached patch
help?
  
   Dejan, will this patch for the missing ordered and collocated group
   meta-attribute be included in the next crmsh release? ... can't see the
   patch in the current tip.
 
  The shell in pacemaker v1.0.x is in maintenance mode and shipped
  along with the pacemaker code. The v1.1.x doesn't have the
  ordered and collocated meta attributes.
 
  Thanks,
 
  Dejan
 
 
   Thanks & Regards,
   Andreas
  
   
Thanks,
   
Dejan
   
Best Regards,
Hideo Yamauchi.
   
   
  
  
  
 
 
 
 
 
 

 

[Pacemaker] will a stonith resource be moved from an AWOL node?

2013-04-30 Thread Brian J. Murrell
I'm using pacemaker 1.1.8 and I don't see stonith resources moving away
from AWOL hosts like I thought I did with 1.1.7.  So I guess the first
thing to do is clear up what is supposed to happen.

If I have a single stonith resource for a cluster and it's running on
node A and then node A goes AWOL, what happens to that stonith resource?

From what I think I know of pacemaker, pacemaker wants to be able to
stonith that AWOL node before moving any resources away from it since
starting a resource on a new node while the state of the AWOL node is
unknown is unsafe, right?

But of course, if the resource that pacemaker wants to move is the
stonith resource, there's a bit of a catch-22.  It can't move the
stonith resource until it can stonith the node, and it cannot stonith
the node because the node running the stonith resource is AWOL.

So, is pacemaker supposed to resolve this on its own, or am I supposed
to create a cluster configuration that ensures enough stonith
resources exist to mitigate this situation?
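
(For what it's worth, one common way to avoid this catch-22 is to define one
fencing resource per node and keep each one off the node it is meant to fence.
A rough sketch in pcs syntax -- resource names and fence_xvm options are purely
illustrative, not taken from this cluster:

  pcs stonith create fence-node1 fence_xvm port=node1 pcmk_host_list=node1
  pcs stonith create fence-node2 fence_xvm port=node2 pcmk_host_list=node2
  pcs constraint location fence-node1 avoids node1
  pcs constraint location fence-node2 avoids node2

That way the surviving node always has a fencing device it can run itself.)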

The case I have in hand is this:

# pcs config
Corosync Nodes:
 
Pacemaker Nodes:
 node1 node2 

Resources: 
 Resource: stonith (type=fence_xvm class=stonith)

Location Constraints:
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
 dc-version: 1.1.8-7.wc1.el6-394e906
 expected-quorum-votes: 2
 no-quorum-policy: ignore
 symmetric-cluster: true
 cluster-infrastructure: classic openais (with plugin)
 stonith-enabled: true
 last-lrm-refresh: 1367331233

# pcs status
Last updated: Tue Apr 30 14:48:06 2013
Last change: Tue Apr 30 14:13:53 2013 via crmd on node2
Stack: classic openais (with plugin)
Current DC: node2 - partition WITHOUT quorum
Version: 1.1.8-7.wc1.el6-394e906
2 Nodes configured, 2 expected votes
1 Resources configured.


Node node1: UNCLEAN (pending)
Online: [ node2 ]

Full list of resources:

 stonith(stonith:fence_xvm):Started node1

node1 is very clearly completely off.  The cluster has been in this state, with
node1 off for several tens of minutes now, and the stonith resource is still
reported as running on it.

The log, since corosync noticed node1 going AWOL:

Apr 30 14:14:56 node2 corosync[1364]:   [TOTEM ] A processor failed, forming 
new configuration.
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] notice: pcmk_peer_update: 
Transitional membership event on ring 52: memb=1, new=0, lost=1
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] info: pcmk_peer_update: memb: 
node2 2608507072
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] info: pcmk_peer_update: lost: 
node1 4252674240
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] notice: pcmk_peer_update: 
Stable membership event on ring 52: memb=1, new=0, lost=0
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] info: pcmk_peer_update: MEMB: 
node2 2608507072
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] info: 
ais_mark_unseen_peer_dead: Node node1 was not seen in the previous transition
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] info: update_member: Node 
4252674240/node1 is now: lost
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] info: 
send_member_notification: Sending membership update 52 to 2 children
Apr 30 14:14:57 node2 corosync[1364]:   [TOTEM ] A processor joined or left the 
membership and a new membership was formed.
Apr 30 14:14:57 node2 corosync[1364]:   [CPG   ] chosen downlist: sender r(0) 
ip(192.168.122.155) ; members(old:2 left:1)
Apr 30 14:14:57 node2 corosync[1364]:   [MAIN  ] Completed service 
synchronization, ready to provide service.
Apr 30 14:14:57 node2 crmd[1666]:   notice: ais_dispatch_message: Membership 
52: quorum lost
Apr 30 14:14:57 node2 crmd[1666]:   notice: crm_update_peer_state: 
crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 30 14:14:57 node2 crmd[1666]:  warning: match_down_event: No match for 
shutdown action on node1
Apr 30 14:14:57 node2 crmd[1666]:   notice: peer_update_callback: 
Stonith/shutdown of node1 not matched
Apr 30 14:14:57 node2 cib[1661]:   notice: ais_dispatch_message: Membership 52: 
quorum lost
Apr 30 14:14:57 node2 cib[1661]:   notice: crm_update_peer_state: 
crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 30 14:14:57 node2 crmd[1666]:   notice: do_state_transition: State 
transition S_IDLE - S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL 
origin=check_join_state ]
Apr 30 14:14:57 node2 attrd[1664]:   notice: attrd_local_callback: Sending full 
refresh (origin=crmd)
Apr 30 14:14:57 node2 attrd[1664]:   notice: attrd_trigger_update: Sending 
flush op to all hosts for: probe_complete (true)
Apr 30 14:14:58 node2 pengine[1665]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
Apr 30 14:14:58 node2 pengine[1665]: crit: get_timet_now: Defaulting to 
'now'
Apr 30 14:14:58 node2 pengine[1665]: crit: get_timet_now: Defaulting to 
'now'
Apr 30 14:14:58 node2 pengine[1665]: crit: get_timet_now: Defaulting to 
'now'
Apr 30 14:14:58 node2 pengine[1665]: crit: get_timet_now: Defaulting to 

Re: [Pacemaker] will a stonith resource be moved from an AWOL node?

2013-04-30 Thread Lars Marowsky-Bree
On 2013-04-30T10:55:41, Brian J. Murrell br...@interlinx.bc.ca wrote:

 From what I think I know of pacemaker, pacemaker wants to be able to
 stonith that AWOL node before moving any resources away from it since
 starting a resource on a new node while the state of the AWOL node is
 unknown is unsafe, right?

Right.

 But of course, if the resource that pacemaker wants to move is the
 stonith resource there's a bit of a catch-22.  It can't move the
 stonith resource until it can stonith the node, which it cannot stonith
 the node because the node running the resource is AWOL.
 
 So, is pacemaker supposed to resolve this on it's own or am I supposed
 to create a cluster configuration that ensures that enough stonith
 resources exist to mitigate this situation?

Pacemaker 1.1.8's stonith/fencing subsystem directly ties into the CIB,
and will complete the fencing request even if the fencing/stonith
resource is not instantiated on the node yet. (There's a bug in 1.1.8 as
released that causes an annoying delay here, but that's fixed since.)

That can appear to be a bit confusing if you were used to the previous
behaviour.

(And I'm not sure it's a real win for the complexity of the
project/code, but Andrew and David are.)

 Node node1: UNCLEAN (pending)
 Online: [ node2 ]

 node1 is very clearly completely off.  The cluster has been in this state, 
 with node1 being off for several 10s of minutes now and still the stonith 
 resource is running on it.

It shouldn't take so long. 

I think your easiest path is to update.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] 1.1.8 not compatible with 1.1.7?

2013-04-30 Thread Lars Marowsky-Bree
On 2013-04-24T11:44:57, Rainer Brestan rainer.bres...@gmx.net wrote:

 Current DC: int2node2 - partition WITHOUT quorum
 Version: 1.1.8-7.el6-394e906

This may not be the answer you want, since it is fairly unspecific. But
I think we noticed something similar when we pulled in 1.1.8; I don't
recall the bug number, but I *think* it was resolved with a later git
version.

Can you try a newer build than 1.1.8?


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] will a stonith resource be moved from an AWOL node?

2013-04-30 Thread Brian J. Murrell
On 13-04-30 11:13 AM, Lars Marowsky-Bree wrote:
 
 Pacemaker 1.1.8's stonith/fencing subsystem directly ties into the CIB,
 and will complete the fencing request even if the fencing/stonith
 resource is not instantiated on the node yet.

But clearly that's not happening here.

 (There's a bug in 1.1.8 as
 released that causes an annoying delay here, but that's fixed since.)

Do you know which bug specifically so that I can see if the fix has been
applied here?

 Node node1: UNCLEAN (pending)
 Online: [ node2 ]
 
 node1 is very clearly completely off.  The cluster has been in this state, 
 with node1 being off for several 10s of minutes now and still the stonith 
 resource is running on it.
 
 It shouldn't take so long. 

Indeed.  And FWIW, it's still in that state.

 I think your easiest path is to update.

Update to what?  I'm already using pacemaker-1.1.8-7 on EL6 and a yum
update is not providing anything newer.

Cheers,
b.





___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] warning: unpack_rsc_op: Processing failed op monitor for my_resource on node1: unknown error (1)

2013-04-30 Thread Brian J. Murrell
Using 1.1.8 on EL6.4, I am seeing this sort of thing:

pengine[1590]:  warning: unpack_rsc_op: Processing failed op monitor for 
my_resource on node1: unknown error (1)
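
(A failed monitor recorded in the CIB status section keeps producing this
warning on every policy engine run until it is cleaned up or a failure-timeout
expires; once the underlying problem is fixed, something along the lines of
the following -- using the resource and node names from the log, shown only
as an illustration -- clears the entry:

  crm_resource --cleanup --resource my_resource --node node1
)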

The full log from the point of adding the resource until the errors:

Apr 30 11:46:30 node1 cibadmin[3380]:   notice: crm_log_args: Invoked: cibadmin 
-o resources -C -x /tmp/tmpHrgNZv 
Apr 30 11:46:30 node1 crmd[1591]:   notice: do_state_transition: State 
transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: Diff: --- 0.24.5
Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: Diff: +++ 0.25.1 
8a4aac3dcddc2689e4b336e1bf2078ff
Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: -- cib admin_epoch=0 
epoch=24 num_updates=5 /
Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++   primitive 
class=ocf provider=my_provider type=my_RA id=my_resource 
Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++ 
meta_attributes id=my_resource-meta_attributes 
Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++   nvpair 
name=my_RA-role id=my_resource-meta_attributes-my_RA-role value=Stopped /
Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++ 
/meta_attributes
Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++ operations 
id=my_resource-operations 
Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++   op 
id=my_resource-monitor-5 interval=5 name=monitor timeout=60 /
Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++   op 
id=my_resource-start-0 interval=0 name=start timeout=300 /
Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++   op 
id=my_resource-stop-0 interval=0 name=stop timeout=300 /
Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++ /operations
Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++ 
instance_attributes id=my_resource-instance_attributes 
Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++   nvpair 
id=my_resource-instance_attributes-my_RA name=my_RA 
value=33bb17d2-350b-495f-bd8d-8427baabeed9 /
Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++ 
/instance_attributes
Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++   /primitive
Apr 30 11:46:30 node1 pengine[1590]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to 
'now'
Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to 
'now'
Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to 
'now'
Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to 
'now'
Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to 
'now'
Apr 30 11:46:30 node1 pengine[1590]:   notice: process_pe_message: Calculated 
Transition 5: /var/lib/pacemaker/pengine/pe-input-10.bz2
Apr 30 11:46:30 node1 cibadmin[3386]:   notice: crm_log_args: Invoked: cibadmin 
-o constraints -C -X rsc_location id=my_resource-primary node=node1 
rsc=my_resource score=20/ 
Apr 30 11:46:30 node1 cib[1586]:   notice: log_cib_diff: cib:diff: Local-only 
Change: 0.26.1
Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: -- cib admin_epoch=0 
epoch=25 num_updates=1 /
Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++   rsc_location 
id=my_resource-primary node=node1 rsc=my_resource score=20 /
Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: Diff: --- 0.26.3
Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: Diff: +++ 0.27.1 
8305c8fe19d06a6204bd04f437eb923a
Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: -- nvpair 
value=1367322378 id=cib-bootstrap-options-last-lrm-refresh /
Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: ++ nvpair 
id=cib-bootstrap-options-last-lrm-refresh name=last-lrm-refresh 
value=1367322393 /
Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: Diff: --- 0.27.2
Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: Diff: +++ 0.28.1 
0dbddb3084f7cd76bffe21916538be94
Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: --   nvpair 
value=Stopped id=my_resource-meta_attributes-my_RA-role /
Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: ++   nvpair 
name=my_RA-role id=my_resource-meta_attributes-my_RA-role value=Started /
Apr 30 11:46:33 node1 crmd[1591]:  warning: do_update_resource: Resource 
my_resource no longer exists in the lrmd
Apr 30 11:46:33 node1 crmd[1591]:   notice: process_lrm_event: LRM operation 
my_resource_monitor_0 (call=31, rc=7, cib-update=0, confirmed=true) not running
Apr 30 11:46:33 node1 crmd[1591]:  warning: decode_transition_key: Bad UUID 
(crm_resource.c) in sscanf result (4) for 3397:0:0:crm_resource.c
Apr 30 11:46:33 node1 crmd[1591]:error: send_msg_via_ipc: Unknown 
Sub-system (3397_crm_resource)... discarding message.
Apr 30 11:47:50 node1 crmd[1591]:  warning: action_timer_callback: Timer popped 

Re: [Pacemaker] [Problem][crmsh]The designation of the 'ordered' attribute becomes the error.

2013-04-30 Thread renayama19661014
Hi Mori san,

 The patch for crmsh is now included in the 1.0.x repository:
 
   
 https://github.com/ClusterLabs/pacemaker-1.0/commit/9227e89fb748cd52d330f5fca80d56fbd9d3efbf
 
 
 It will appear in the 1.0.14 maintenance release, which is not yet scheduled, 
 though.

All right.

Many Thanks!
Hideo Yamauchi.

--- On Tue, 2013/4/30, Keisuke MORI keisuke.mori...@gmail.com wrote:

 
 Hi Dejan, Andreas, Yamauchi-san
 
 
 
 
 
 2013/4/18  renayama19661...@ybb.ne.jp
 Hi Dejan,
 Hi Andreas,
 
 
  The shell in pacemaker v1.0.x is in maintenance mode and shipped
  along with the pacemaker code. The v1.1.x doesn't have the
  ordered and collocated meta attributes.
 
 I sent the pull request of the patch which Mr. Dejan donated.
  * https://github.com/ClusterLabs/pacemaker-1.0/pull/14
 
 
 
 The patch for crmsh is now included in the 1.0.x repository:
 
   
 https://github.com/ClusterLabs/pacemaker-1.0/commit/9227e89fb748cd52d330f5fca80d56fbd9d3efbf
 
 
 It will appear in the 1.0.14 maintenance release, which is not yet scheduled, 
 though.
 
 
 Thanks,
 
 
 Keisuke MORI
 
  Many Thanks!
 Hideo Yamauchi.
 
 
 --- On Tue, 2013/4/2, Dejan Muhamedagic deja...@fastmail.fm wrote:
 
  Hi,
 
  On Mon, Apr 01, 2013 at 09:19:51PM +0200, Andreas Kurz wrote:
   Hi Dejan,
  
   On 2013-03-06 11:59, Dejan Muhamedagic wrote:
Hi Hideo-san,
   
On Wed, Mar 06, 2013 at 10:37:44AM +0900, renayama19661...@ybb.ne.jp 
wrote:
Hi Dejan,
Hi Andrew,
   
As for the crm shell, the check of the meta attribute was revised with 
the next patch.
   
     * http://hg.savannah.gnu.org/hgweb/crmsh/rev/d1174f42f4b3
   
This patch was backported in Pacemaker1.0.13.
   
     * 
   https://github.com/ClusterLabs/pacemaker-1.0/commit/fa1a99ab36e0ed015f1bcbbb28f7db962a9d1abc#shell/modules/cibconfig.py
   
 However, the ordered/colocated attributes of a group resource are 
 treated as an error when I use a crm shell that includes this patch.
   
--
(snip)
### Group Configuration ###
group master-group \
            vip-master \
            vip-rep \
            meta \
                    ordered=false
(snip)
   
[root@rh63-heartbeat1 ~]# crm configure load update test2339.crm
INFO: building help index
crm_verify[20028]: 2013/03/06_17:57:18 WARN: unpack_nodes: Blind 
faith: not fencing unseen nodes
WARNING: vip-master: specified timeout 60s for start is smaller than 
the advised 90
WARNING: vip-master: specified timeout 60s for stop is smaller than 
the advised 100
WARNING: vip-rep: specified timeout 60s for start is smaller than the 
advised 90
WARNING: vip-rep: specified timeout 60s for stop is smaller than the 
advised 100
ERROR: master-group: attribute ordered does not exist  - WHY?
Do you still want to commit? y
--
   
If I choose `yes` at the confirmation prompt, the change is applied, but 
the problem is that an error message is displayed at all.
     * The same error occurs when I specify the colocated attribute.
And I noticed that there is no explanation of ordered/colocated 
for group resources in the online help of Pacemaker.
   
I think that specifying the ordered/colocated attributes should 
not be treated as an error for a group resource.
In addition, I think that ordered/colocated should be added to the online 
help.
   
These attributes are not listed in crmsh. Does the attached patch
help?
  
   Dejan, will this patch for the missing ordered and collocated group
   meta-attribute be included in the next crmsh release? ... can't see the
   patch in the current tip.
 
  The shell in pacemaker v1.0.x is in maintenance mode and shipped
  along with the pacemaker code. The v1.1.x doesn't have the
  ordered and collocated meta attributes.
 
  Thanks,
 
  Dejan
 
 
   Thanks & Regards,
   Andreas
  
   
Thanks,
   
Dejan
   
Best Regards,
Hideo Yamauchi.
   
   
  
  
  
 
 
 

Re: [Pacemaker] corosync restarts service when slave node joins the cluster

2013-04-30 Thread Andrew Beekhof
Please ask questions on the mailing lists.

On 01/05/2013, at 12:30 AM, Babu Challa babu.cha...@ipaccess.com wrote:

 Hi Andrew,
  
 Greetings,
  
 We are using corosync/pacemaker for  high availability
  
 This is a 4 node HA cluster where each pair of nodes are configured for DB 
 and file system replication
 We have a very tricky situation. We have configured two clusters with exactly 
 the same configuration on each. But on one cluster, corosync restarts the 
 services when the slave node is rebooted and re-joins the cluster.
  
 We have tried to reproduce the issue on the other cluster with multiple HA 
 scenarios, but no luck.
  
 Few questions:
  
 1.   If the rebooted slave is the DC (Designated Controller), is there any 
 possibility of this issue?
 2.   Is there any known issue in the pacemaker version we are currently 
 using (1.1.5) which would be resolved if we upgrade to the latest (1.8)?

I believe there was one, check the ChangeLog

 3.   Is there any chance that pacemaker/corosync behaves differently even 
 though the configuration is the same on each cluster?

Timing issues do occur, how identical is the hardware?

 4.   Can you please let us know if there is any possible reason for this 
 issue? That would really help us reproduce this issue and fix it.

More than likely it has been fixed in a later version.

  
 Versions we are using;
  
 Pacemaker version - pacemaker-1.1.5
 Corosync version - corosync-1.2.7
 heartbeat-3.0.3-2.3
  
 R
 Babu Challa
 T: +44 (0) 1954 717972 | M: +44 (0) 7912 859958| E: babu.cha...@ipaccess.com 
 | W: www.ipaccess.com
 ip.access Ltd, Building 2020, Cambourne Business Park, Cambourne, Cambridge, 
 CB23 6DW
  
 The desire to excel is exclusive of the fact whether someone else appreciates 
 it or not. Excellence is a drive from inside, not outside. Excellence is 
 not for someone else to notice but for your own satisfaction and efficiency...
  
 
 
 
 
 This message contains confidential information and may be privileged. If you 
 are not the intended recipient, please notify the sender and delete the 
 message immediately.
 
 ip.access ltd, registration number 3400157, Building 2020, 
 Cambourne Business Park, Cambourne, Cambridge CB23 6DW, United Kingdom
 
 
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] cannot register service of pacemaker_remote

2013-04-30 Thread Andrew Beekhof
Done. Thanks!

On 30/04/2013, at 3:34 PM, nozawat noza...@gmail.com wrote:

 Hi
 
  Because there was a typo in pacemaker.spec.in, I was not able to register 
 the pacemaker_remote service.
 
 -
 diff --git a/pacemaker.spec.in b/pacemaker.spec.in
 index 10296a5..1e1fd6d 100644
 --- a/pacemaker.spec.in
 +++ b/pacemaker.spec.in
 @@ -404,7 +404,7 @@ exit 0
  %if %{defined _unitdir}
   /bin/systemctl daemon-reload >/dev/null 2>&1 || :
  %endif
 -/sbin/chkconfig --add pacemaker-remote || :
 +/sbin/chkconfig --add pacemaker_remote || :
  
  %preun -n %{name}-remote
  if [ $1 -eq 0 ]; then
 -
 
 Regards,
 Tomo


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] lrm monitor failure status lost during DC election

2013-04-30 Thread Andrew Beekhof

On 19/04/2013, at 6:36 AM, David Adair david_ad...@xyratex.com wrote:

 Hello.
 
 I have an issue with pacemaker 1.1.6.1 but believe this may still be
 present in the
 latest git versions and would like to know if the fix makes sense.
 
 
 What I see is the following:
 Setup:
 - 2 node cluster
 - ocf:heartbeat:Dumy resource on non-DC node.
 - Force DC reboot or stonith and fail resource while there is no DC.
 
 Result:
 - node with failed monitor becomes DC (good)
 
 - lrmd reports resource as failed during every monitor interval but
 since these failures are not rc status changes they are not sent to crmd.
 (good -- it is failing, but ..)
 
 - crm_mon / cibadmin --query report resource as running OK. (not good)
 
 
 The resource has failed but is never restarted   I believe the failing
 resource and any group it belongs to should be recovered during/after
 the DC election.
 
 I think  this is due to the operation of build_active_RAs on the surviving 
 node:
 
     build_operation_update(xml_rsc, &(entry->rsc), entry->last, __FUNCTION__);
     build_operation_update(xml_rsc, &(entry->rsc), entry->failed, __FUNCTION__);
     for (gIter = entry->recurring_op_list; gIter != NULL; gIter = gIter->next) {
         build_operation_update(xml_rsc, &(entry->rsc), gIter->data, __FUNCTION__);
     }
 
 What this produces is (in order: last, failed, list[0], list[1]):
 start_0: rc=0; monitor_1000: rc=7; monitor_1000: rc=7; monitor_1000: rc=0

list[] should only have one element as both are for monitor_1000

I have a vague recollection of an old bug in this area and strongly suspect 
that something more recent won't have the same problem.

 
 The final result in the cib appears to be the last entry which is from
 the initial
 transition of the monitor from rc=-1 to rc=0.
 
 To fix this I swapped the order of recurring_op_list so that the last 
 transition
 is at the end of the list rather than the beginning.  With this change I
 see what I believe is the desired behavior -- the resource is stopped and
 re-started when the DC election is finalized.
 
 The memcpy is a backport of a corresponding change in lrmd_copy_event
 to simplify debugging by maintaining the rcchanged time.
 
 -
 This patch swaps the order of recurring operations (monitors) in the
 lrm history cache.  By placing the most recent change at the end of the
 list it is properly detected by pengine after a DC election.
 
 With the new events placed at the start of the list the last thing
 in the list is the initial startup with rc=0.  This makes pengine
 believe the resource is working properly even though lrmd is reporting
 constant failure.
 
 It is fairly easy to get into this situation when a shared resource
 (storage enclosure) fails and causes the DC to be stonithed.
 
 diff --git a/crmd/lrm.c b/crmd/lrm.c
 index 187db76..f8974f6 100644
 --- a/crmd/lrm.c
 +++ b/crmd/lrm.c
 @@ -217,7 +217,7 @@ update_history_cache(lrm_rsc_t * rsc, lrm_op_t * op)
 
     if (op->interval > 0) {
         crm_trace("Adding recurring op: %s_%s_%d", op->rsc_id, op->op_type, op->interval);
 -        entry->recurring_op_list = g_list_prepend(entry->recurring_op_list, copy_lrm_op(op));
 +        entry->recurring_op_list = g_list_append(entry->recurring_op_list, copy_lrm_op(op));
 
     } else if (entry->recurring_op_list && safe_str_eq(op->op_type, RSC_STATUS) == FALSE) {
         GList *gIter = entry->recurring_op_list;
 @@ -1756,6 +1756,9 @@ copy_lrm_op(const lrm_op_t * op)
 
     crm_malloc0(op_copy, sizeof(lrm_op_t));
 
 +    /* Copy all int values, pointers fixed below */
 +    memcpy(op_copy, op, sizeof(lrm_op_t));
 +
     op_copy->op_type = crm_strdup(op->op_type);
     /* input fields */
     op_copy->params = g_hash_table_new_full(crm_str_hash, g_str_equal,
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Behavior when crm_mon is a daemon

2013-04-30 Thread Andrew Beekhof

On 19/04/2013, at 11:05 AM, Yuichi SEINO seino.clust...@gmail.com wrote:

 Hi,
 
 2013/4/16 Andrew Beekhof and...@beekhof.net:
 
 On 15/04/2013, at 7:42 PM, Yuichi SEINO seino.clust...@gmail.com wrote:
 
 Hi All,
 
 I look at the daemon of tools to make a new daemon. So, I have a question.
 
 When an old pid file exists and crm_mon is started as a daemon,
 crm_mon doesn't update the old pid file. And crm_mon doesn't stop.
 I would like to know if this behavior is correct.
 
 Some of it is, but the part about crm_mon not updating the pid file (which 
 is probably also preventing it from stopping) is bad.
 Understood, that is an undesirable behavior.
 If we can identify the problem, I think we want to fix it.

Done:

   https://github.com/beekhof/pacemaker/commit/e549770

Plus an extra bonus:

   https://github.com/beekhof/pacemaker/commit/479c5cc


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] will a stonith resource be moved from an AWOL node?

2013-04-30 Thread Andrew Beekhof

On 01/05/2013, at 1:28 AM, Brian J. Murrell br...@interlinx.bc.ca wrote:

 On 13-04-30 11:13 AM, Lars Marowsky-Bree wrote:
 
 Pacemaker 1.1.8's stonith/fencing subsystem directly ties into the CIB,
 and will complete the fencing request even if the fencing/stonith
 resource is not instantiated on the node yet.
 
 But clearly that's not happening here.

Can you file a bug and attach the logs from both machines?

Unless... are you still using cman or the pacemaker plugin (as shipped or the 
patched one from https://bugzilla.redhat.com/show_bug.cgi?id=951340)?


 
 (There's a bug in 1.1.8 as
 released that causes an annoying delay here, but that's fixed since.)
 
 Do you know which bug specifically so that I can see if the fix has been
 applied here?
 
 Node node1: UNCLEAN (pending)
 Online: [ node2 ]
 
 node1 is very clearly completely off.  The cluster has been in this state, 
 with node1 being off for several 10s of minutes now and still the stonith 
 resource is running on it.
 
 It shouldn't take so long. 
 
 Indeed.  And FWIW, it's still in that state.
 
 I think your easiest path is to update.
 
 Update to what?  I'm already using pacemaker-1.1.8-7 on EL6 and a yum
 update is not providing anything newer.
 
 Cheers,
 b.
 
 
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] warning: unpack_rsc_op: Processing failed op monitor for my_resource on node1: unknown error (1)

2013-04-30 Thread Andrew Beekhof

On 01/05/2013, at 2:51 AM, Brian J. Murrell br...@interlinx.bc.ca wrote:

 Using 1.1.8 on EL6.4, I am seeing this sort of thing:
 
 pengine[1590]:  warning: unpack_rsc_op: Processing failed op monitor for 
 my_resource on node1: unknown error (1)
 
 The full log from the point of adding the resource until the errors:
 
 Apr 30 11:46:30 node1 cibadmin[3380]:   notice: crm_log_args: Invoked: 
 cibadmin -o resources -C -x /tmp/tmpHrgNZv 
 Apr 30 11:46:30 node1 crmd[1591]:   notice: do_state_transition: State 
 transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
 origin=abort_transition_graph ]
 Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: Diff: --- 0.24.5
 Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: Diff: +++ 0.25.1 
 8a4aac3dcddc2689e4b336e1bf2078ff
 Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: -- cib admin_epoch=0 
 epoch=24 num_updates=5 /
 Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++   primitive 
 class=ocf provider=my_provider type=my_RA id=my_resource 
 Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++ 
 meta_attributes id=my_resource-meta_attributes 
 Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++   nvpair 
 name=my_RA-role id=my_resource-meta_attributes-my_RA-role value=Stopped 
 /
 Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++ 
 /meta_attributes
 Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++ operations 
 id=my_resource-operations 
 Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++   op 
 id=my_resource-monitor-5 interval=5 name=monitor timeout=60 /
 Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++   op 
 id=my_resource-start-0 interval=0 name=start timeout=300 /
 Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++   op 
 id=my_resource-stop-0 interval=0 name=stop timeout=300 /
 Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++ /operations
 Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++ 
 instance_attributes id=my_resource-instance_attributes 
 Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++   nvpair 
 id=my_resource-instance_attributes-my_RA name=my_RA 
 value=33bb17d2-350b-495f-bd8d-8427baabeed9 /
 Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++ 
 /instance_attributes
 Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++   /primitive
 Apr 30 11:46:30 node1 pengine[1590]:   notice: unpack_config: On loss of CCM 
 Quorum: Ignore
 Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to 
 'now'
 Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to 
 'now'
 Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to 
 'now'
 Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to 
 'now'
 Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to 
 'now'
 Apr 30 11:46:30 node1 pengine[1590]:   notice: process_pe_message: Calculated 
 Transition 5: /var/lib/pacemaker/pengine/pe-input-10.bz2
 Apr 30 11:46:30 node1 cibadmin[3386]:   notice: crm_log_args: Invoked: 
 cibadmin -o constraints -C -X rsc_location id=my_resource-primary 
 node=node1 rsc=my_resource score=20/ 
 Apr 30 11:46:30 node1 cib[1586]:   notice: log_cib_diff: cib:diff: Local-only 
 Change: 0.26.1
 Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: -- cib admin_epoch=0 
 epoch=25 num_updates=1 /
 Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++   rsc_location 
 id=my_resource-primary node=node1 rsc=my_resource score=20 /
 Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: Diff: --- 0.26.3
 Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: Diff: +++ 0.27.1 
 8305c8fe19d06a6204bd04f437eb923a
 Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: -- nvpair 
 value=1367322378 id=cib-bootstrap-options-last-lrm-refresh /
 Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: ++ nvpair 
 id=cib-bootstrap-options-last-lrm-refresh name=last-lrm-refresh 
 value=1367322393 /
 Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: Diff: --- 0.27.2
 Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: Diff: +++ 0.28.1 
 0dbddb3084f7cd76bffe21916538be94
 Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: --   nvpair 
 value=Stopped id=my_resource-meta_attributes-my_RA-role /
 Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: ++   nvpair 
 name=my_RA-role id=my_resource-meta_attributes-my_RA-role value=Started 
 /
 Apr 30 11:46:33 node1 crmd[1591]:  warning: do_update_resource: Resource 
 my_resource no longer exists in the lrmd
 Apr 30 11:46:33 node1 crmd[1591]:   notice: process_lrm_event: LRM operation 
 my_resource_monitor_0 (call=31, rc=7, cib-update=0, confirmed=true) not 
 running
 Apr 30 11:46:33 node1 crmd[1591]:  warning: decode_transition_key: Bad UUID 
 (crm_resource.c) in sscanf result (4) for 3397:0:0:crm_resource.c
 Apr 30 11:46:33 node1 crmd[1591]:

Re: [Pacemaker] Kernel WARN unpack_status in syslog

2013-04-30 Thread Andrew Beekhof

On 20/04/2013, at 3:07 AM, Ivor Prebeg ivor.pre...@gmail.com wrote:

 Guys,
 
 I can't get rid of following warnings:
 
 Apr 19 19:00:37 node2 crmd: [32230]: WARN: start_subsystem: Client pengine 
 already running as pid 32240
 Apr 19 19:00:44 node2 pengine: [32240]: WARN: unpack_status: Node node1 in 
 status section no longer exists
 Apr 19 19:00:44 node2 pengine: [32240]: WARN: unpack_status: Node node2 in 
 status section no longer exists
 Apr 19 19:00:44 node2 pengine: [32240]: notice: process_pe_message: 
 Configuration WARNINGs found during PE processing.  Please run crm_verify 
 -L to identify issues.
 
 root@node2:~# crm_verify -LV
 crm_verify[13317]: 2013/04/19_19:03:04 WARN: unpack_status: Node node1 in 
 status section no longer exists
 crm_verify[13317]: 2013/04/19_19:03:04 WARN: unpack_status: Node node2 in 
 status section no longer exists
 Warnings found during check: config may not be valid
 
 Since I have nagios running through syslog emailing warnings and errors, this 
 is pretty annoying. And disabling warn checks isn't an option. Any clues? 

Can you run cibadmin -Ql | grep node and paste the result here?

 
 I do have /etc/hosts entries. 
 
 Ivor Prebeg
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker configuration with different dependencies

2013-04-30 Thread Andrew Beekhof

On 17/04/2013, at 6:15 PM, Ivor Prebeg ivor.pre...@gmail.com wrote:

 Hi Andreas, thank you for your answer.
 
 Maybe my description was a little fuzzy, sorry for that.
 
 What I want is following:
 
 * if l3_ping fails on a particular node, all services should go to standby on 
 that node (which probably works fine with on-fail=standby) 

correct

 
 * if sip service (active/active) fails on particular node, only floating ip 
 assigned should be migrated to other node

colocate ip with sip clone

 
 * if any of services (active/active), be it database or java container fails, 
 both database and java container should be stopped and floating ip migrated 
 to another node

put all three in a group?
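
(A rough sketch of those two suggestions in crm shell syntax -- resource names
are purely illustrative, not taken from your configuration:

  # the SIP floating IP follows the sip clone
  colocation ip_with_sip inf: ip_sip cl_sip
  # database, java container and their floating IP move as one unit
  group g_app p_rdbms p_java ip_app
)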

 
 * failure of sip service should not affect database or java container and 
 vice versa.
 
 Hope this makes it more clear, not sure that I understood how to achieve 
 dependency tree. 
 
 Thanks,
 Ivor Prebeg
 
 On Apr 16, 2013, at 2:50 PM, Andreas Mock andreas.m...@web.de wrote:
 
 Hi Ivor,
  
 I don't know whether I understand you completely right:
 If you want independence of resources don't put them into a group.
  
 Look at
 http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Pacemaker_Explained/ch10.html
  
 A group is made to tie together several resources without
 declaring all necessary colocations and orderings to get
 a desired behaviour.
  
 Otherwise, name your resources and describe how they should be spread across
 your cluster. (Show the technical dependency.)
  
 Best regards
 Andreas
  
  
 Von: Ivor Prebeg [mailto:ivor.pre...@gmail.com] 
 Gesendet: Dienstag, 16. April 2013 13:53
 An: pacemaker@oss.clusterlabs.org
 Betreff: [Pacemaker] Pacemaker configuration with different dependencies
  
 Hi guys,
 
 I need some help with pacemaker configuration, it is all new to me and can't 
 find solution...
 
 I have two-node HA environment with services that I want to be partially 
 independent, in pacemaker/heartbeat configuration.
 
 There is active/active sip service with two floating IPs, it should all just 
 migrate floating ip when one sip dies.
 
 There are also two active/active master/slave services, a java container 
 and an rdbms with replication between them; they should also fail over when one dies.
 
 What I can't figure out how to configure those two to be independent (put 
 on-fail directive on group). What I want is to, e.g., in case my sip service 
 fails, java container stays active on that node, but floating ip to be moved 
 to other node.
 
 Another thing is, in case one of rdbms fails, I want to put whole service 
 group on that node to standby, but leave sip service intact.
 
 Whole node should go to standby (all services down) only when L3_ping to 
 gateway dies.
  
 All suggestions and configuration examples are welcome.
 
 Thanks in advance.
 
  
 Ivor Prebeg
  
  
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] failure handling on a cloned resource

2013-04-30 Thread Andrew Beekhof

On 17/04/2013, at 9:54 PM, Johan Huysmans johan.huysm...@inuits.be wrote:

 Hi All,
 
 I'm trying to setup a specific configuration in our cluster, however I'm 
 struggling with my configuration.
 
 This is what I'm trying to achieve:
 On both nodes of the cluster a daemon must be running (tomcat).
 Some failover addresses are configured and must be running on the node with a 
 correctly running tomcat.
 
 I have this achieved with a cloned tomcat resource and an collocation between 
 the cloned tomcat and the failover addresses.
 When I cause a failure in the tomcat on the node running the failover 
 addresses, the failover addresses will failover to the other node as expected.
 crm_mon shows that this tomcat has a failure.
 When I configure the tomcat resource with failure-timeout=0, the failure 
 alarm in crm_mon isn't cleared whenever the tomcat failure is fixed.

All sounds right so far.

 When I configure the tomcat resource with failure-timeout=30, the failure 
 alarm in crm_mon is cleared after 30seconds however the tomcat is still 
 having a failure.

Can you define "still having a failure"?
You mean it still shows up in crm_mon?
Have you read this link?
   
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-rules-recheck.html
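
(As that section explains, an expired failure-timeout is only acted on when the
cluster re-checks its state, which by default happens every 15 minutes; if a
faster reaction is wanted, the interval can be lowered, for example with the
crm shell -- the value below is only an illustration:

  crm configure property cluster-recheck-interval=60s
)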

 
 What I expect is that pacemaker reports the failure for as long as 
 the failure exists, and that pacemaker reports that everything is ok once 
 everything is back to ok.
 
 Do I do something wrong with my configuration?
 Or how can I achieve my wanted setup?
 
 Here is my configuration:
 
 node CSE-1
 node CSE-2
 primitive d_tomcat ocf:custom:tomcat \
op monitor interval=15s timeout=510s on-fail=block \
op start interval=0 timeout=510s \
params instance_name=NMS monitor_use_ssl=no monitor_urls=/cse/health 
 monitor_timeout=120 \
meta migration-threshold=1 failure-timeout=0
 primitive ip_1 ocf:heartbeat:IPaddr2 \
op monitor interval=10s \
params nic=bond0 broadcast=10.1.1.1 iflabel=ha ip=10.1.1.1
 primitive ip_2 ocf:heartbeat:IPaddr2 \
op monitor interval=10s \
params nic=bond0 broadcast=10.1.1.2 iflabel=ha ip=10.1.1.2
 group svc-cse ip_1 ip_2
 clone cl_tomcat d_tomcat
 colocation colo_tomcat inf: svc-cse cl_tomcat
 order order_tomcat inf: cl_tomcat svc-cse
 property $id=cib-bootstrap-options \
dc-version=1.1.8-7.el6-394e906 \
cluster-infrastructure=cman \
no-quorum-policy=ignore \
stonith-enabled=false
 
 Thanks!
 
 Greetings,
 Johan Huysmans
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Two node KVM cluster

2013-04-30 Thread Andrew Beekhof

On 28/04/2013, at 9:19 PM, Oriol Mula-Valls oriol.mula-va...@ic3.cat wrote:

 Hi,
 
 I have modified the previous configuration to use sbd fencing. I have also 
 fixed several other issues with the configuration, and now when the node 
 reboots it seems unable to rejoin the cluster.
 
 I attach the debug log I have just generated. Node was rebooted around 
 11:51:41 and came back at 12:52:47.
 
 The boot order of the services is:
 1. sbd
 2. corosync
 3. pacemaker

It doesn't look like pacemaker was restarted on node1, just corosync.
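
(If pacemaker isn't being started automatically after corosync at boot, checking
and enabling its init script is worth a try -- the commands below are only an
illustration and depend on the distribution:

  chkconfig --list pacemaker
  chkconfig pacemaker on
  service pacemaker start
)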

 
 Could someone help me, please?
 
 Thanks,
 Oriol
 
 On 16/04/13 06:10, Andrew Beekhof wrote:
 
 On 10/04/2013, at 3:20 PM, Oriol Mula-Valls oriol.mula-va...@ic3.cat wrote:
 
 On 10/04/13 02:10, Andrew Beekhof wrote:
 
 On 09/04/2013, at 7:31 PM, Oriol Mula-Valls oriol.mula-va...@ic3.cat
  wrote:
 
  Thanks Andrew, I've managed to set up the system and currently I have it 
  working, but it is still in testing.
 
  I have configured external/ipmi as the fencing device and then I force a 
  reboot doing an echo b > /proc/sysrq-trigger. The fencing is working 
  properly as the node is shut off and the VM migrated. However, as soon as 
  I turn on the fenced now and the OS has started, the surviving node is shut 
  down. Is it normal or am I doing something wrong?
 
  Can you clarify "turn on the fenced"?
 
 
 To restart the fenced node I do either a power on with ipmitool or I power 
 it on using the iRMC web interface.
 
  Oh, "fenced now" was meant to be "fenced node".  That makes more sense now :)
 
 To answer your question, I would not expect the surviving node to be fenced 
 when the previous node returns.
 The network between the two is still functional?
 
 
 
 -- 
 Oriol Mula Valls
 Institut Català de Ciències del Clima (IC3)
 Doctor Trueta 203 - 08005 Barcelona
 Tel:+34 93 567 99 77


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org