Re: [Pacemaker] [Problem][crmsh]The designation of the 'ordered' attribute becomes the error.
Hi Dejan, Andreas, Yamauchi-san,

2013/4/18 renayama19661...@ybb.ne.jp:

> > The shell in pacemaker v1.0.x is in maintenance mode and shipped along with the pacemaker code. The v1.1.x doesn't have the ordered and collocated meta attributes.
>
> I sent a pull request for the patch that Dejan contributed:
> * https://github.com/ClusterLabs/pacemaker-1.0/pull/14

The patch for crmsh is now included in the 1.0.x repository:
https://github.com/ClusterLabs/pacemaker-1.0/commit/9227e89fb748cd52d330f5fca80d56fbd9d3efbf

It will appear in the 1.0.14 maintenance release, which has not been scheduled yet.

Thanks,
Keisuke MORI

Many thanks!
Hideo Yamauchi.

--- On Tue, 2013/4/2, Dejan Muhamedagic deja...@fastmail.fm wrote:

Hi,

On Mon, Apr 01, 2013 at 09:19:51PM +0200, Andreas Kurz wrote:

Hi Dejan,

On 2013-03-06 11:59, Dejan Muhamedagic wrote:

Hi Hideo-san,

On Wed, Mar 06, 2013 at 10:37:44AM +0900, renayama19661...@ybb.ne.jp wrote:

Hi Dejan, Hi Andrew,

As for the crm shell, the check of the meta attribute was revised with the following patch:
* http://hg.savannah.gnu.org/hgweb/crmsh/rev/d1174f42f4b3

This patch was backported into Pacemaker 1.0.13:
* https://github.com/ClusterLabs/pacemaker-1.0/commit/fa1a99ab36e0ed015f1bcbbb28f7db962a9d1abc#shell/modules/cibconfig.py

However, the ordered/colocated attributes of a group resource are treated as an error when I use a crm shell that includes this patch.
--
(snip)
### Group Configuration ###
group master-group \
        vip-master \
        vip-rep \
        meta \
                ordered=false
(snip)

[root@rh63-heartbeat1 ~]# crm configure load update test2339.crm
INFO: building help index
crm_verify[20028]: 2013/03/06_17:57:18 WARN: unpack_nodes: Blind faith: not fencing unseen nodes
WARNING: vip-master: specified timeout 60s for start is smaller than the advised 90
WARNING: vip-master: specified timeout 60s for stop is smaller than the advised 100
WARNING: vip-rep: specified timeout 60s for start is smaller than the advised 90
WARNING: vip-rep: specified timeout 60s for stop is smaller than the advised 100
ERROR: master-group: attribute ordered does not exist   <- WHY?
Do you still want to commit? y
--

If "yes" is chosen at the confirmation prompt, the change is applied, but it is a problem that the error message is displayed at all. The same error occurs when the colocated attribute is specified. I also noticed that there is no explanation of ordered/colocated for group resources in the Pacemaker online help.

I think that specifying the ordered/colocated attributes of a group resource should not be an error, and that ordered/colocated should be documented in the online help.

These attributes are not listed in crmsh. Does the attached patch help?

Dejan, will this patch for the missing ordered and collocated group meta-attributes be included in the next crmsh release? ... I can't see the patch in the current tip.

The shell in pacemaker v1.0.x is in maintenance mode and shipped along with the pacemaker code. The v1.1.x doesn't have the ordered and collocated meta attributes.

Thanks,
Dejan

Regards,
Andreas

Thanks,
Dejan

Best Regards,
Hideo Yamauchi.
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
[Pacemaker] will a stonith resource be moved from an AWOL node?
I'm using pacemaker 1.1.8 and I don't see stonith resources moving away from AWOL hosts as I thought I did with 1.1.7. So I guess the first thing to do is clear up what is supposed to happen.

If I have a single stonith resource for a cluster and it's running on node A, and then node A goes AWOL, what happens to that stonith resource? From what I think I know of pacemaker, pacemaker wants to be able to stonith the AWOL node before moving any resources away from it, since starting a resource on a new node while the state of the AWOL node is unknown is unsafe, right?

But of course, if the resource that pacemaker wants to move is the stonith resource, there's a bit of a catch-22. It can't move the stonith resource until it can stonith the node, and it cannot stonith the node because the node running the stonith resource is AWOL. So, is pacemaker supposed to resolve this on its own, or am I supposed to create a cluster configuration that ensures that enough stonith resources exist to mitigate this situation?

The case I have in hand is this:

# pcs config
Corosync Nodes:
Pacemaker Nodes:
 node1 node2
Resources:
 Resource: stonith (type=fence_xvm class=stonith)
Location Constraints:
Ordering Constraints:
Colocation Constraints:
Cluster Properties:
 dc-version: 1.1.8-7.wc1.el6-394e906
 expected-quorum-votes: 2
 no-quorum-policy: ignore
 symmetric-cluster: true
 cluster-infrastructure: classic openais (with plugin)
 stonith-enabled: true
 last-lrm-refresh: 1367331233

# pcs status
Last updated: Tue Apr 30 14:48:06 2013
Last change: Tue Apr 30 14:13:53 2013 via crmd on node2
Stack: classic openais (with plugin)
Current DC: node2 - partition WITHOUT quorum
Version: 1.1.8-7.wc1.el6-394e906
2 Nodes configured, 2 expected votes
1 Resources configured.

Node node1: UNCLEAN (pending)
Online: [ node2 ]

Full list of resources:
 stonith (stonith:fence_xvm): Started node1

node1 is very clearly completely off.
The cluster has been in this state, with node1 being off for several 10s of minutes now, and still the stonith resource is running on it. The log, since corosync noticed node1 going AWOL:

Apr 30 14:14:56 node2 corosync[1364]: [TOTEM ] A processor failed, forming new configuration.
Apr 30 14:14:57 node2 corosync[1364]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 52: memb=1, new=0, lost=1
Apr 30 14:14:57 node2 corosync[1364]: [pcmk ] info: pcmk_peer_update: memb: node2 2608507072
Apr 30 14:14:57 node2 corosync[1364]: [pcmk ] info: pcmk_peer_update: lost: node1 4252674240
Apr 30 14:14:57 node2 corosync[1364]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 52: memb=1, new=0, lost=0
Apr 30 14:14:57 node2 corosync[1364]: [pcmk ] info: pcmk_peer_update: MEMB: node2 2608507072
Apr 30 14:14:57 node2 corosync[1364]: [pcmk ] info: ais_mark_unseen_peer_dead: Node node1 was not seen in the previous transition
Apr 30 14:14:57 node2 corosync[1364]: [pcmk ] info: update_member: Node 4252674240/node1 is now: lost
Apr 30 14:14:57 node2 corosync[1364]: [pcmk ] info: send_member_notification: Sending membership update 52 to 2 children
Apr 30 14:14:57 node2 corosync[1364]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 30 14:14:57 node2 corosync[1364]: [CPG ] chosen downlist: sender r(0) ip(192.168.122.155) ; members(old:2 left:1)
Apr 30 14:14:57 node2 corosync[1364]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 30 14:14:57 node2 crmd[1666]: notice: ais_dispatch_message: Membership 52: quorum lost
Apr 30 14:14:57 node2 crmd[1666]: notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 30 14:14:57 node2 crmd[1666]: warning: match_down_event: No match for shutdown action on node1
Apr 30 14:14:57 node2 crmd[1666]: notice: peer_update_callback: Stonith/shutdown of node1 not matched
Apr 30 14:14:57 node2 cib[1661]: notice: ais_dispatch_message: Membership 52: quorum lost
Apr 30 14:14:57 node2 cib[1661]: notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 30 14:14:57 node2 crmd[1666]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
Apr 30 14:14:57 node2 attrd[1664]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Apr 30 14:14:57 node2 attrd[1664]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Apr 30 14:14:58 node2 pengine[1665]: notice: unpack_config: On loss of CCM Quorum: Ignore
Apr 30 14:14:58 node2 pengine[1665]: crit: get_timet_now: Defaulting to 'now'
Apr 30 14:14:58 node2 pengine[1665]: crit: get_timet_now: Defaulting to 'now'
Apr 30 14:14:58 node2 pengine[1665]: crit: get_timet_now: Defaulting to 'now'
Apr 30 14:14:58 node2 pengine[1665]: crit: get_timet_now: Defaulting to
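One common mitigation for the catch-22 described above is exactly the configuration the question anticipates: one fencing resource per node, with each resource constrained away from the node it is responsible for fencing, so a surviving node always has a local fence device. A sketch in pcs syntax of that era; the resource names and fence_xvm ports below are hypothetical, not taken from the original cluster:

```
# One fencing resource per node to be fenced (names are illustrative).
pcs stonith create fence-node1 fence_xvm port=node1
pcs stonith create fence-node2 fence_xvm port=node2

# Keep each fencing resource off the node it fences, so a surviving
# peer can still execute the fencing request when that node goes AWOL.
pcs constraint location fence-node1 avoids node1
pcs constraint location fence-node2 avoids node2
```

These are cluster-administration commands, so treat this as a configuration sketch rather than something to paste verbatim.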
Re: [Pacemaker] will a stonith resource be moved from an AWOL node?
On 2013-04-30T10:55:41, Brian J. Murrell br...@interlinx.bc.ca wrote:

> From what I think I know of pacemaker, pacemaker wants to be able to stonith that AWOL node before moving any resources away from it since starting a resource on a new node while the state of the AWOL node is unknown is unsafe, right?

Right.

> But of course, if the resource that pacemaker wants to move is the stonith resource there's a bit of a catch-22. It can't move the stonith resource until it can stonith the node, which it cannot stonith the node because the node running the resource is AWOL. So, is pacemaker supposed to resolve this on its own or am I supposed to create a cluster configuration that ensures that enough stonith resources exist to mitigate this situation?

Pacemaker 1.1.8's stonith/fencing subsystem ties directly into the CIB, and will complete the fencing request even if the fencing/stonith resource is not instantiated on the node yet. (There's a bug in 1.1.8 as released that causes an annoying delay here, but that has been fixed since.) That can appear to be a bit confusing if you were used to the previous behaviour. (And I'm not sure it's a real win for the complexity of the project/code, but Andrew and David are.)

> Node node1: UNCLEAN (pending)
> Online: [ node2 ]
>
> node1 is very clearly completely off. The cluster has been in this state, with node1 being off for several 10s of minutes now and still the stonith resource is running on it.

It shouldn't take so long. I think your easiest path is to update.

Regards,
    Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
Re: [Pacemaker] 1.1.8 not compatible with 1.1.7?
On 2013-04-24T11:44:57, Rainer Brestan rainer.bres...@gmx.net wrote:

> Current DC: int2node2 - partition WITHOUT quorum
> Version: 1.1.8-7.el6-394e906

This may not be the answer you want, since it is fairly unspecific. But I think we noticed something similar when we pulled in 1.1.8. I don't recall the bug number, but I *think* it worked out with a later git version. Can you try a newer build than 1.1.8?

Regards,
    Lars
Re: [Pacemaker] will a stonith resource be moved from an AWOL node?
On 13-04-30 11:13 AM, Lars Marowsky-Bree wrote:

> Pacemaker 1.1.8's stonith/fencing subsystem directly ties into the CIB, and will complete the fencing request even if the fencing/stonith resource is not instantiated on the node yet.

But clearly that's not happening here.

> (There's a bug in 1.1.8 as released that causes an annoying delay here, but that's fixed since.)

Do you know which bug specifically, so that I can see if the fix has been applied here?

> > Node node1: UNCLEAN (pending)
> > Online: [ node2 ]
> >
> > node1 is very clearly completely off. The cluster has been in this state, with node1 being off for several 10s of minutes now and still the stonith resource is running on it.
>
> It shouldn't take so long.

Indeed. And FWIW, it's still in that state.

> I think your easiest path is to update.

Update to what? I'm already using pacemaker-1.1.8-7 on EL6 and a yum update is not providing anything newer.

Cheers,
b.
[Pacemaker] warning: unpack_rsc_op: Processing failed op monitor for my_resource on node1: unknown error (1)
Using 1.1.8 on EL6.4, I am seeing this sort of thing:

pengine[1590]: warning: unpack_rsc_op: Processing failed op monitor for my_resource on node1: unknown error (1)

The full log from the point of adding the resource until the errors:

Apr 30 11:46:30 node1 cibadmin[3380]: notice: crm_log_args: Invoked: cibadmin -o resources -C -x /tmp/tmpHrgNZv
Apr 30 11:46:30 node1 crmd[1591]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: Diff: --- 0.24.5
Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: Diff: +++ 0.25.1 8a4aac3dcddc2689e4b336e1bf2078ff
Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: -- <cib admin_epoch="0" epoch="24" num_updates="5" />
Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++ <primitive class="ocf" provider="my_provider" type="my_RA" id="my_resource">
Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++   <meta_attributes id="my_resource-meta_attributes">
Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++     <nvpair name="my_RA-role" id="my_resource-meta_attributes-my_RA-role" value="Stopped" />
Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++   </meta_attributes>
Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++   <operations id="my_resource-operations">
Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++     <op id="my_resource-monitor-5" interval="5" name="monitor" timeout="60" />
Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++     <op id="my_resource-start-0" interval="0" name="start" timeout="300" />
Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++     <op id="my_resource-stop-0" interval="0" name="stop" timeout="300" />
Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++   </operations>
Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++   <instance_attributes id="my_resource-instance_attributes">
Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++     <nvpair id="my_resource-instance_attributes-my_RA" name="my_RA" value="33bb17d2-350b-495f-bd8d-8427baabeed9" />
Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++   </instance_attributes>
Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++ </primitive>
Apr 30 11:46:30 node1 pengine[1590]: notice: unpack_config: On loss of CCM Quorum: Ignore
Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to 'now'
Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to 'now'
Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to 'now'
Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to 'now'
Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to 'now'
Apr 30 11:46:30 node1 pengine[1590]: notice: process_pe_message: Calculated Transition 5: /var/lib/pacemaker/pengine/pe-input-10.bz2
Apr 30 11:46:30 node1 cibadmin[3386]: notice: crm_log_args: Invoked: cibadmin -o constraints -C -X '<rsc_location id="my_resource-primary" node="node1" rsc="my_resource" score="20"/>'
Apr 30 11:46:30 node1 cib[1586]: notice: log_cib_diff: cib:diff: Local-only Change: 0.26.1
Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: -- <cib admin_epoch="0" epoch="25" num_updates="1" />
Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++ <rsc_location id="my_resource-primary" node="node1" rsc="my_resource" score="20" />
Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: Diff: --- 0.26.3
Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: Diff: +++ 0.27.1 8305c8fe19d06a6204bd04f437eb923a
Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: -- <nvpair value="1367322378" id="cib-bootstrap-options-last-lrm-refresh" />
Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: ++ <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1367322393" />
Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: Diff: --- 0.27.2
Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: Diff: +++ 0.28.1 0dbddb3084f7cd76bffe21916538be94
Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: -- <nvpair value="Stopped" id="my_resource-meta_attributes-my_RA-role" />
Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: ++ <nvpair name="my_RA-role" id="my_resource-meta_attributes-my_RA-role" value="Started" />
Apr 30 11:46:33 node1 crmd[1591]: warning: do_update_resource: Resource my_resource no longer exists in the lrmd
Apr 30 11:46:33 node1 crmd[1591]: notice: process_lrm_event: LRM operation my_resource_monitor_0 (call=31, rc=7, cib-update=0, confirmed=true) not running
Apr 30 11:46:33 node1 crmd[1591]: warning: decode_transition_key: Bad UUID (crm_resource.c) in sscanf result (4) for 3397:0:0:crm_resource.c
Apr 30 11:46:33 node1 crmd[1591]: error: send_msg_via_ipc: Unknown Sub-system (3397_crm_resource)... discarding message.
Apr 30 11:47:50 node1 crmd[1591]: warning: action_timer_callback: Timer popped
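The "Bad UUID" warning in the log above can be illustrated in miniature. A transition key in this era of the code is a colon-separated 4-tuple, roughly "<action-id>:<transition-id>:<target-rc>:<uuid>", whose last field is expected to be the UUID of the transition that scheduled the operation; the operation injected by crm_resource carried its source file name there instead. The following sketch paraphrases that check and is not Pacemaker's actual decode_transition_key():

```python
import re

# Shape of a well-formed transition UUID (an assumption for this sketch).
UUID_RE = re.compile(r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$")

def decode_transition_key(key):
    """Paraphrased check: split "<action>:<transition>:<rc>:<uuid>" and
    reject keys whose fourth field is not a UUID (the logged warning)."""
    parts = key.split(":", 3)
    if len(parts) != 4:
        return None
    action, transition, rc, uuid = parts
    if not UUID_RE.match(uuid):
        return None  # corresponds to "Bad UUID (...) in sscanf result (4)"
    return int(action), int(transition), int(rc), uuid

# The key from the log fails the check, hence the warning:
print(decode_transition_key("3397:0:0:crm_resource.c"))  # -> None
```

The point is only that the fourth field of the injected key is not a UUID at all, so the stricter 1.1.8-era parsing complains where older code may have been silent.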
Re: [Pacemaker] [Problem][crmsh]The designation of the 'ordered' attribute becomes the error.
Hi Mori-san,

> The patch for crmsh is now included in the 1.0.x repository:
> https://github.com/ClusterLabs/pacemaker-1.0/commit/9227e89fb748cd52d330f5fca80d56fbd9d3efbf
>
> It will appear in the 1.0.14 maintenance release, which has not been scheduled yet.

All right. Many thanks!

Best Regards,
Hideo Yamauchi.
Re: [Pacemaker] corosync restarts service when slave node joins the cluster
Please ask questions on the mailing lists.

On 01/05/2013, at 12:30 AM, Babu Challa babu.cha...@ipaccess.com wrote:

> Hi Andrew,
>
> We are using corosync/pacemaker for high availability. This is a 4-node HA cluster where each pair of nodes is configured for DB and file system replication. We have a very tricky situation: we have configured two clusters with exactly the same configuration on each, but on one cluster corosync restarts the services when the slave node is rebooted and rejoins the cluster. We have tried to reproduce the issue on the other cluster with multiple HA scenarios, but no luck.
>
> A few questions:
>
> 1. If the rebooted slave is the DC (Designated Controller), is there any possibility of this issue?
> 2. Is there any known issue in the pacemaker version we are currently using (1.1.5) which would be resolved if we upgrade to the latest (1.1.8)?

I believe there was one, check the ChangeLog.

> 3. Is there any chance that pacemaker/corosync behaves differently even though the configuration is the same on each cluster?

Timing issues do occur; how identical is the hardware?

> 4. Can you please let us know if there is any possible reason for this issue? That's really helpful to reproduce this issue and fix it.

More than likely it has been fixed in a later version.

> Versions we are using:
> Pacemaker version - pacemaker-1.1.5
> Corosync version - corosync-1.2.7
> heartbeat-3.0.3-2.3
>
> Babu Challa
Re: [Pacemaker] cannot register service of pacemaker_remote
Done. Thanks!

On 30/04/2013, at 3:34 PM, nozawat noza...@gmail.com wrote:

> Hi,
>
> Because there was a typo in pacemaker.spec.in, I was not able to register the pacemaker_remote service.
>
> diff --git a/pacemaker.spec.in b/pacemaker.spec.in
> index 10296a5..1e1fd6d 100644
> --- a/pacemaker.spec.in
> +++ b/pacemaker.spec.in
> @@ -404,7 +404,7 @@ exit 0
>  %if %{defined _unitdir}
>  /bin/systemctl daemon-reload >/dev/null 2>&1 || :
>  %endif
> -/sbin/chkconfig --add pacemaker-remote || :
> +/sbin/chkconfig --add pacemaker_remote || :
>
>  %preun -n %{name}-remote
>  if [ $1 -eq 0 ]; then
>
> Regards,
> Tomo
Re: [Pacemaker] lrm monitor failure status lost during DC election
On 19/04/2013, at 6:36 AM, David Adair david_ad...@xyratex.com wrote:

> Hello. I have an issue with pacemaker 1.1.6.1 but believe this may still be present in the latest git versions, and would like to know if the fix makes sense.
>
> What I see is the following:
>
> Setup:
> - 2-node cluster
> - ocf:heartbeat:Dummy resource on non-DC node.
> - Force DC reboot or stonith, and fail the resource while there is no DC.
>
> Result:
> - node with failed monitor becomes DC (good)
> - lrmd reports resource as failed during every monitor interval, but since these failures are not rc status changes they are not sent to crmd. (good -- it is failing, but ..)
> - crm_mon / cibadmin --query report resource as running OK. (not good)
>
> The resource has failed but is never restarted. I believe the failing resource and any group it belongs to should be recovered during/after the DC election.
>
> I think this is due to the operation of build_active_RAs on the surviving node:
>
>     build_operation_update(xml_rsc, &(entry->rsc), entry->last, __FUNCTION__);
>     build_operation_update(xml_rsc, &(entry->rsc), entry->failed, __FUNCTION__);
>     for (gIter = entry->recurring_op_list; gIter != NULL; gIter = gIter->next) {
>         build_operation_update(xml_rsc, &(entry->rsc), gIter->data, __FUNCTION__);
>     }
>
> What this produces is:
>
>     last     start_0:      rc=0
>     failed   monitor_1000: rc=7
>     list[0]  monitor_1000: rc=7
>     list[1]  monitor_1000: rc=0

list[] should only have one element, as both are for monitor_1000.

I have a vague recollection of an old bug in this area and strongly suspect that something more recent won't have the same problem.

> The final result in the CIB appears to be the last entry, which is from the initial transition of the monitor from rc=-1 to rc=0.
>
> To fix this I swapped the order of recurring_op_list so that the last transition is at the end of the list rather than the beginning. With this change I see what I believe is the desired behavior -- the resource is stopped and restarted when the DC election is finalized.
> The memcpy is a backport of a corresponding change in lrmd_copy_event to simplify debugging by maintaining the rcchanged time.
>
> ---
> This patch swaps the order of recurring operations (monitors) in the lrm history cache. By placing the most recent change at the end of the list, it is properly detected by pengine after a DC election.
>
> With new events placed at the start of the list, the last thing in the list is the initial startup with rc=0. This makes pengine believe the resource is working properly even though lrmd is reporting constant failure. It is fairly easy to get into this situation when a shared resource (storage enclosure) fails and causes the DC to be stonithed.
>
> diff --git a/crmd/lrm.c b/crmd/lrm.c
> index 187db76..f8974f6 100644
> --- a/crmd/lrm.c
> +++ b/crmd/lrm.c
> @@ -217,7 +217,7 @@ update_history_cache(lrm_rsc_t * rsc, lrm_op_t * op)
>      if (op->interval > 0) {
>          crm_trace("Adding recurring op: %s_%s_%d", op->rsc_id, op->op_type, op->interval);
> -        entry->recurring_op_list = g_list_prepend(entry->recurring_op_list, copy_lrm_op(op));
> +        entry->recurring_op_list = g_list_append(entry->recurring_op_list, copy_lrm_op(op));
>      } else if (entry->recurring_op_list && safe_str_eq(op->op_type, RSC_STATUS) == FALSE) {
>          GList *gIter = entry->recurring_op_list;
> @@ -1756,6 +1756,9 @@ copy_lrm_op(const lrm_op_t * op)
>      crm_malloc0(op_copy, sizeof(lrm_op_t));
>
> +    /* Copy all int values, pointers fixed below */
> +    memcpy(op_copy, op, sizeof(lrm_op_t));
> +
>      op_copy->op_type = crm_strdup(op->op_type);
>      /* input fields */
>      op_copy->params = g_hash_table_new_full(crm_str_hash, g_str_equal,
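The effect of the one-line g_list_prepend/g_list_append change in the patch above can be sketched with plain Python lists standing in for GLib's GList; the event tuples are illustrative, mirroring the monitor history from the message:

```python
# Recurring-op history as events arrive, oldest first: the initial
# rc=0 monitor result, then rc=7 ("not running") failures.
events = [("monitor_1000", 0), ("monitor_1000", 7), ("monitor_1000", 7)]

prepended, appended = [], []
for ev in events:
    prepended.insert(0, ev)  # g_list_prepend: newest event at the head
    appended.append(ev)      # g_list_append: newest event at the tail

# build_active_RAs walks the list in order, so the last element is what
# the status section effectively reports as the current state.
print(prepended[-1])  # with prepend: the stale rc=0 entry -> no recovery
print(appended[-1])   # with append: the latest rc=7 failure -> recovery
```

This is why, before the fix, pengine saw the initial rc=0 transition last and concluded the resource was healthy.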
Re: [Pacemaker] Behavior when crm_mon is a daemon
On 19/04/2013, at 11:05 AM, Yuichi SEINO seino.clust...@gmail.com wrote:

> Hi,
>
> 2013/4/16 Andrew Beekhof and...@beekhof.net:
> > On 15/04/2013, at 7:42 PM, Yuichi SEINO seino.clust...@gmail.com wrote:
> > > Hi All,
> > > I am looking at the tools daemons in order to write a new daemon, so I have a question. When an old pid file exists, crm_mon still starts as a daemon, but it doesn't update the old pid file, and then crm_mon doesn't stop. I would like to know if this behavior is correct.
> >
> > Some of it is, but the part about crm_mon not updating the pid file (which is probably also preventing it from stopping) is bad.
>
> I understand that this is incorrect behavior. Once we figure out the problem, we would like to fix it.

Done: https://github.com/beekhof/pacemaker/commit/e549770

Plus an extra bonus: https://github.com/beekhof/pacemaker/commit/479c5cc
Re: [Pacemaker] will a stonith resource be moved from an AWOL node?
On 01/05/2013, at 1:28 AM, Brian J. Murrell br...@interlinx.bc.ca wrote:

> On 13-04-30 11:13 AM, Lars Marowsky-Bree wrote:
> > Pacemaker 1.1.8's stonith/fencing subsystem directly ties into the CIB, and will complete the fencing request even if the fencing/stonith resource is not instantiated on the node yet.
>
> But clearly that's not happening here.

Can you file a bug and attach the logs from both machines? Unless... are you still using cman or the pacemaker plugin (as shipped, or the patched one from https://bugzilla.redhat.com/show_bug.cgi?id=951340)?

> > (There's a bug in 1.1.8 as released that causes an annoying delay here, but that's fixed since.)
>
> Do you know which bug specifically so that I can see if the fix has been applied here?
>
> > > Node node1: UNCLEAN (pending)
> > > Online: [ node2 ]
> > >
> > > node1 is very clearly completely off. The cluster has been in this state, with node1 being off for several 10s of minutes now and still the stonith resource is running on it.
> >
> > It shouldn't take so long.
>
> Indeed. And FWIW, it's still in that state.
>
> > I think your easiest path is to update.
>
> Update to what? I'm already using pacemaker-1.1.8-7 on EL6 and a yum update is not providing anything newer.
>
> Cheers,
> b.
Re: [Pacemaker] warning: unpack_rsc_op: Processing failed op monitor for my_resource on node1: unknown error (1)
On 01/05/2013, at 2:51 AM, Brian J. Murrell br...@interlinx.bc.ca wrote:

> Using 1.1.8 on EL6.4, I am seeing this sort of thing:
>
> pengine[1590]: warning: unpack_rsc_op: Processing failed op monitor for my_resource on node1: unknown error (1)
>
> The full log from the point of adding the resource until the errors:
>
> Apr 30 11:46:30 node1 cibadmin[3380]: notice: crm_log_args: Invoked: cibadmin -o resources -C -x /tmp/tmpHrgNZv
> Apr 30 11:46:30 node1 crmd[1591]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: Diff: --- 0.24.5
> Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: Diff: +++ 0.25.1 8a4aac3dcddc2689e4b336e1bf2078ff
> Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: -- <cib admin_epoch="0" epoch="24" num_updates="5" />
> Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++ <primitive class="ocf" provider="my_provider" type="my_RA" id="my_resource">
> Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++   <meta_attributes id="my_resource-meta_attributes">
> Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++     <nvpair name="my_RA-role" id="my_resource-meta_attributes-my_RA-role" value="Stopped" />
> Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++   </meta_attributes>
> Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++   <operations id="my_resource-operations">
> Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++     <op id="my_resource-monitor-5" interval="5" name="monitor" timeout="60" />
> Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++     <op id="my_resource-start-0" interval="0" name="start" timeout="300" />
> Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++     <op id="my_resource-stop-0" interval="0" name="stop" timeout="300" />
> Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++   </operations>
> Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++   <instance_attributes id="my_resource-instance_attributes">
> Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++     <nvpair id="my_resource-instance_attributes-my_RA" name="my_RA" value="33bb17d2-350b-495f-bd8d-8427baabeed9" />
> Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++   </instance_attributes>
> Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++ </primitive>
> Apr 30 11:46:30 node1 pengine[1590]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to 'now'
> Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to 'now'
> Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to 'now'
> Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to 'now'
> Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to 'now'
> Apr 30 11:46:30 node1 pengine[1590]: notice: process_pe_message: Calculated Transition 5: /var/lib/pacemaker/pengine/pe-input-10.bz2
> Apr 30 11:46:30 node1 cibadmin[3386]: notice: crm_log_args: Invoked: cibadmin -o constraints -C -X <rsc_location id="my_resource-primary" node="node1" rsc="my_resource" score="20"/>
> Apr 30 11:46:30 node1 cib[1586]: notice: log_cib_diff: cib:diff: Local-only Change: 0.26.1
> Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: -- <cib admin_epoch="0" epoch="25" num_updates="1" />
> Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++ <rsc_location id="my_resource-primary" node="node1" rsc="my_resource" score="20" />
> Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: Diff: --- 0.26.3
> Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: Diff: +++ 0.27.1 8305c8fe19d06a6204bd04f437eb923a
> Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: -- <nvpair value="1367322378" id="cib-bootstrap-options-last-lrm-refresh" />
> Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: ++ <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1367322393" />
> Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: Diff: --- 0.27.2
> Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: Diff: +++ 0.28.1 0dbddb3084f7cd76bffe21916538be94
> Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: -- <nvpair value="Stopped" id="my_resource-meta_attributes-my_RA-role" />
> Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: ++ <nvpair name="my_RA-role" id="my_resource-meta_attributes-my_RA-role" value="Started" />
> Apr 30 11:46:33 node1 crmd[1591]: warning: do_update_resource: Resource my_resource no longer exists in the lrmd
> Apr 30 11:46:33 node1 crmd[1591]: notice: process_lrm_event: LRM operation my_resource_monitor_0 (call=31, rc=7, cib-update=0, confirmed=true) not running
> Apr 30 11:46:33 node1 crmd[1591]: warning: decode_transition_key: Bad UUID (crm_resource.c) in sscanf result (4) for 3397:0:0:crm_resource.c
> Apr 30 11:46:33 node1 crmd[1591]:
Re: [Pacemaker] Kernel WARN unpack_status in syslog
On 20/04/2013, at 3:07 AM, Ivor Prebeg ivor.pre...@gmail.com wrote:

> Guys,
>
> I can't get rid of the following warnings:
>
> Apr 19 19:00:37 node2 crmd: [32230]: WARN: start_subsystem: Client pengine already running as pid 32240
> Apr 19 19:00:44 node2 pengine: [32240]: WARN: unpack_status: Node node1 in status section no longer exists
> Apr 19 19:00:44 node2 pengine: [32240]: WARN: unpack_status: Node node2 in status section no longer exists
> Apr 19 19:00:44 node2 pengine: [32240]: notice: process_pe_message: Configuration WARNINGs found during PE processing. Please run crm_verify -L to identify issues.
>
> root@node2:~# crm_verify -LV
> crm_verify[13317]: 2013/04/19_19:03:04 WARN: unpack_status: Node node1 in status section no longer exists
> crm_verify[13317]: 2013/04/19_19:03:04 WARN: unpack_status: Node node2 in status section no longer exists
> Warnings found during check: config may not be valid
>
> Since I have nagios emailing warnings and errors from syslog, this is pretty annoying, and disabling warning checks isn't an option. Any clues?

Can you run "cibadmin -Ql | grep node" and paste the result here?

> I do have /etc/hosts entries.
>
> Ivor Prebeg
Re: [Pacemaker] Pacemaker configuration with different dependencies
On 17/04/2013, at 6:15 PM, Ivor Prebeg ivor.pre...@gmail.com wrote:

> Hi Andreas,
>
> thank you for your answer. Maybe my description was a little fuzzy, sorry for that. What I want is the following:
>
> * If l3_ping fails on a particular node, all services should go to standby on that node (which probably works fine with on-fail=standby).

correct

> * If the sip service (active/active) fails on a particular node, only the floating IP assigned to it should be migrated to the other node.

colocate the ip with the sip clone

> * If any of the other services fails, be it the database or the java container, both the database and the java container should be stopped and the floating IP migrated to another node.

put all three in a group?

> * A failure of the sip service should not affect the database or java container, and vice versa.
>
> Hope this makes it clearer; I'm not sure I understood how to achieve the dependency tree.
>
> Thanks,
> Ivor Prebeg
>
> On Apr 16, 2013, at 2:50 PM, Andreas Mock andreas.m...@web.de wrote:
>
>> Hi Ivor,
>>
>> I don't know whether I understand you completely: if you want resources to be independent, don't put them into a group. Have a look at http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Pacemaker_Explained/ch10.html
>>
>> A group is made to tie several resources together without having to declare all the colocations and orderings needed to get the desired behaviour. Otherwise, name your resources and describe how they should be spread across your cluster (i.e. show the technical dependencies).
>>
>> Best regards,
>> Andreas
>>
>> From: Ivor Prebeg [mailto:ivor.pre...@gmail.com]
>> Sent: Tuesday, 16 April 2013 13:53
>> To: pacemaker@oss.clusterlabs.org
>> Subject: [Pacemaker] Pacemaker configuration with different dependencies
>>
>> Hi guys,
>>
>> I need some help with my pacemaker configuration; it is all new to me and I can't find a solution. I have a two-node HA environment, on pacemaker/heartbeat, with services that I want to be partially independent. There is an active/active sip service with two floating IPs; when one sip instance dies, only its floating IP should migrate. There are also two active/active services, a java container and an rdbms with replication between them, which should also fail over when one dies.
>>
>> What I can't figure out is how to configure those two parts to be independent (put an on-fail directive on a group). What I want is, e.g., if my sip service fails, the java container stays active on that node but the floating IP moves to the other node. And if one of the rdbms instances fails, I want to put the whole service group on that node into standby but leave the sip service intact. The whole node should go to standby (all services down) only when the L3 ping to the gateway dies.
>>
>> All suggestions and configuration examples are welcome. Thanks in advance.
>>
>> Ivor Prebeg
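The inline suggestions in the thread above could be sketched in crmsh roughly as follows. Every resource name, agent, and parameter here is invented for illustration (the thread never names the actual agents for sip, the rdbms, or the java container), so treat this as the shape of a solution rather than a working configuration:

```
# Hypothetical names and agents throughout.
# Node goes to standby when connectivity monitoring fails:
primitive p_ping ocf:pacemaker:ping \
    params host_list=10.0.0.1 \
    op monitor interval=10s on-fail=standby
clone cl_ping p_ping
# Active/active sip, with a floating IP that follows a healthy sip instance:
primitive p_sip lsb:sip-proxy op monitor interval=15s
clone cl_sip p_sip
primitive ip_sip ocf:heartbeat:IPaddr2 params ip=192.168.0.10
colocation colo_ip_sip inf: ip_sip cl_sip
# Database, java container and their floating IP stop and move together.
# (p_rdbms, p_java and ip_app stand for primitives defined elsewhere.)
# No constraint ties this group to sip, so the two parts stay independent:
group g_app p_rdbms p_java ip_app
```

The independence asked for in the last bullet falls out of the structure: the sip clone and the application group share no colocation or ordering constraints, so a failure in one never moves the other.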
Re: [Pacemaker] failure handling on a cloned resource
On 17/04/2013, at 9:54 PM, Johan Huysmans johan.huysm...@inuits.be wrote:

> Hi All,
>
> I'm trying to set up a specific configuration in our cluster, but I'm struggling with it. This is what I'm trying to achieve: a daemon (tomcat) must be running on both nodes of the cluster, and some configured failover addresses must run on a node with a correctly running tomcat. I achieved this with a cloned tomcat resource and a colocation between the cloned tomcat and the failover addresses.
>
> When I cause a failure in the tomcat on the node running the failover addresses, the failover addresses fail over to the other node as expected, and crm_mon shows that this tomcat has a failure.

All sounds right so far.

> When I configure the tomcat resource with failure-timeout=0, the failure alarm in crm_mon isn't cleared once the tomcat failure is fixed. When I configure the tomcat resource with failure-timeout=30, the failure alarm in crm_mon is cleared after 30 seconds, even though the tomcat still has a failure.

Can you define "still having a failure"? You mean it still shows up in crm_mon? Have you read this link? http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-rules-recheck.html

> What I expect is that pacemaker reports the failure for as long as it exists, and reports that everything is OK once everything is back to normal. Am I doing something wrong in my configuration? Or how can I achieve the setup I want?
> Here is my configuration:
>
> node CSE-1
> node CSE-2
> primitive d_tomcat ocf:custom:tomcat \
>     op monitor interval=15s timeout=510s on-fail=block \
>     op start interval=0 timeout=510s \
>     params instance_name=NMS monitor_use_ssl=no monitor_urls=/cse/health monitor_timeout=120 \
>     meta migration-threshold=1 failure-timeout=0
> primitive ip_1 ocf:heartbeat:IPaddr2 \
>     op monitor interval=10s \
>     params nic=bond0 broadcast=10.1.1.1 iflabel=ha ip=10.1.1.1
> primitive ip_2 ocf:heartbeat:IPaddr2 \
>     op monitor interval=10s \
>     params nic=bond0 broadcast=10.1.1.2 iflabel=ha ip=10.1.1.2
> group svc-cse ip_1 ip_2
> clone cl_tomcat d_tomcat
> colocation colo_tomcat inf: svc-cse cl_tomcat
> order order_tomcat inf: cl_tomcat svc-cse
> property $id="cib-bootstrap-options" \
>     dc-version=1.1.8-7.el6-394e906 \
>     cluster-infrastructure=cman \
>     no-quorum-policy=ignore \
>     stonith-enabled=false
>
> Thanks!
>
> Greetings,
> Johan Huysmans
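The recheck link in the reply above is the key to the failure-timeout behaviour: an expired failure-timeout is only acted on when the policy engine next recalculates, which by default happens on cluster events rather than on a timer. A hedged sketch of the relevant pieces (the property name is real; the specific values are only examples, not a recommendation):

```
# failure-timeout expiry is evaluated when the cluster recalculates;
# cluster-recheck-interval forces a periodic recalculation so that
# expired failures are actually cleared without waiting for an event.
primitive d_tomcat ocf:custom:tomcat \
    meta migration-threshold=1 failure-timeout=30s
property cluster-recheck-interval=2min
```

Note that failure-timeout only expires the *record* of the failure; it does not verify that the underlying service is healthy again, which is why crm_mon can show the alarm cleared while tomcat is still broken.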
Re: [Pacemaker] Two node KVM cluster
On 28/04/2013, at 9:19 PM, Oriol Mula-Valls oriol.mula-va...@ic3.cat wrote:

> Hi,
>
> I have modified the previous configuration to use sbd fencing. I have also fixed several other issues with the configuration, and now when the node reboots it seems unable to rejoin the cluster. I attach the debug log I have just generated. The node was rebooted around 11:51:41 and came back at 12:52:47. The boot order of the services is:
>
> 1. sbd
> 2. corosync
> 3. pacemaker

It doesn't look like pacemaker was restarted on node1, just corosync.

> Could someone help me, please?
>
> Thanks,
> Oriol
>
> On 16/04/13 06:10, Andrew Beekhof wrote:
>> On 10/04/2013, at 3:20 PM, Oriol Mula-Valls oriol.mula-va...@ic3.cat wrote:
>>> On 10/04/13 02:10, Andrew Beekhof wrote:
>>>> On 09/04/2013, at 7:31 PM, Oriol Mula-Valls oriol.mula-va...@ic3.cat wrote:
>>>>> Thanks Andrew. I've managed to set up the system and currently have it working, though still in testing. I have configured external/ipmi as the fencing device and then force a reboot with "echo b > /proc/sysrq-trigger". The fencing works properly: the node is shut off and the VM migrated. However, as soon as I turn the fenced node back on and its OS has started, the surviving node is shut down. Is this normal, or am I doing something wrong?
>>>> Can you clarify "turn on the fenced"?
>>> To restart the fenced node I either power it on with ipmitool or power it on via the iRMC web interface.
>> Oh, "fenced now" was meant to be "fenced node". That makes more sense now :)
>> To answer your question, I would not expect the surviving node to be fenced when the previous node returns. Is the network between the two still functional?

> --
> Oriol Mula Valls
> Institut Català de Ciències del Clima (IC3)
> Doctor Trueta 203 - 08005 Barcelona
> Tel: +34 93 567 99 77
>
> [attachment: corosync.log.gz]