Re: [Pacemaker] rename all nodes of a cluster
Just bring up the cluster with the new names and use "crm configure node delete" to remove the old names.

On Mon, Oct 4, 2010 at 2:12 PM, Karl Rößmann <k.roessm...@fkf.mpg.de> wrote:
> Hi Clusterlabs mailing list,
>
> I have a running cluster with three nodes. For some reason I had to
> change all host names and their IP addresses for the interface eth0.
> The communication channel is not affected: bindnetaddr, mcastaddr and
> mcastport will stay the same. Is there an easy way to rename the nodes?
>
> We have SuSE SLES11 SP1, including:
>   corosync-1.2.1-0.5.1
>   openais-1.1.2-0.5.19
>
> Karl Roessmann
> --
> Karl Rößmann                Tel. +49-711-689-1657
> Max-Planck-Institut FKF     Fax. +49-711-689-1198
> Postfach 800 665
> 70506 Stuttgart             email k.roessm...@fkf.mpg.de
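A minimal sketch of that procedure, assuming the crm shell is installed and using placeholder node names (the init script name varies by distribution; SLES11 starts the stack via openais):

    # On every node: stop the cluster stack before renaming
    /etc/init.d/openais stop

    # Change uname -n and the eth0 address on each node, then restart
    # the stack; the nodes join the membership under their new names
    /etc/init.d/openais start

    # The old names linger in the CIB as offline nodes; remove each one
    # (exact syntax can differ between crm shell versions)
    crm node delete <old-node-name>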
Re: [Pacemaker] Fail over algorithm used by Pacemaker
On Sun, Oct 3, 2010 at 4:01 PM, hudan studiawan <studia...@gmail.com> wrote:
> Hi,
>
> I want to start contributing to the Pacemaker project. I have started
> to read the documentation and to try some basic configurations. I have
> a question: what kind of algorithm does Pacemaker use to choose another
> node when a node dies in a cluster? Is there any manual or
> documentation I can read?

We figure out the next best node based on the location and colocation
constraints you specified in the configuration.
See the Pacemaker Explained and Cluster from Scratch docs at
http://www.clusterlabs.org/doc/

> Thank you,
> Hudan
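As a small illustration of what the policy engine weighs (made-up resource and node names, crm shell syntax): location scores rank the candidate nodes, and colocation constraints tie resources together:

    # node-01 is preferred (score 200); if it dies, node-02 (score 100)
    # is the next best node for the virtual IP
    location prefer-node-01 web_ip 200: node-01
    location fallback-node-02 web_ip 100: node-02

    # the webserver must run wherever web_ip runs, so it follows the
    # IP when the cluster recovers it on the surviving node
    colocation web-with-ip inf: webserver web_ip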
Re: [Pacemaker] cib
On Fri, Oct 1, 2010 at 3:45 PM, Shravan Mishra <shravan.mis...@gmail.com> wrote:
> Hi,
> Just a quick question: who generates the very first cib.xml when the
> pacemaker processes are initialized?

The cib.

> Thanks
> Shravan
>
> On Thu, Sep 30, 2010 at 4:22 AM, Andrew Beekhof <and...@beekhof.net> wrote:
>> On Tue, Sep 28, 2010 at 11:47 AM, Andrew Beekhof <and...@beekhof.net> wrote:
>>> On Mon, Sep 27, 2010 at 6:26 AM, Shravan Mishra <shravan.mis...@gmail.com> wrote:
>>>> Thanks Raoul for the response. Changing the permission to
>>>> hacluster:haclient did stop that error.
>>>> Now I'm hitting another problem whereby cib is failing to start.
>>>
>>> Very strange logs. Which distribution is this?
>>> What does your corosync.conf look like?
>>>
>>>> =
>>>> Sep 27 00:16:29 corosync [pcmk ] info: update_member: Node ha2.itactics.com now has process list: 00110012 (1114130)
>>>> Sep 27 00:16:29 corosync [pcmk ] info: update_member: Node ha2.itactics.com now has 1 quorum votes (was 0)
>>>> Sep 27 00:16:29 corosync [pcmk ] info: send_member_notification: Sending membership update 100 to 0 children
>>>> Sep 27 00:16:29 corosync [MAIN ] Completed service synchronization, ready to provide service.
>>>> Sep 27 00:16:30 corosync [pcmk ] ERROR: pcmk_wait_dispatch: Child process cib exited (pid=14889, rc=127)
>>>> Sep 27 00:16:30 corosync [pcmk ] notice: pcmk_wait_dispatch: Respawning failed child process: cib
>>>> Sep 27 00:16:30 corosync [pcmk ] info: spawn_child: Forked child 14896 for process cib
>>>> crmd[14893]: 2010/09/27_00:16:30 WARN: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
>>>> Sep 27 00:16:31 corosync [pcmk ] ERROR: pcmk_wait_dispatch: Child process cib exited (pid=14896, rc=127)
>>>> Sep 27 00:16:31 corosync [pcmk ] notice: pcmk_wait_dispatch: Respawning failed child process: cib
>>>> Sep 27 00:16:31 corosync [pcmk ] info: spawn_child: Forked child 14901 for process cib
>>>> Sep 27 00:16:32 corosync [pcmk ] ERROR: pcmk_wait_dispatch: Child process cib exited (pid=14901, rc=1
>>>> ==
>>>>
>>>> I have attached the full logs.
>>>> We are using corosync 1.2.8 and pacemaker 1.1.3.
>>>>
>>>> Thanks.
>>>> Shravan
>>>>
>>>> On Sat, Sep 25, 2010 at 4:36 AM, Raoul Bhatia [IPAX] <r.bha...@ipax.at> wrote:
>>>>> On 24.09.2010 21:41, Shravan Mishra wrote:
>>>>>> crmd[20612]: 2010/09/24_15:29:57 ERROR: crm_log_init_worker: Cannot change active directory to /var/lib/heartbeat/cores/hacluster: Permission denied (13)
>>>>>
>>>>> ls -ald /var/lib/heartbeat/cores/hacluster /var/lib/heartbeat/cores/ /var/lib/heartbeat/ /var/lib/ /var/
>>>>>
>>>>> is haclient allowed to cd all the way into
>>>>> /var/lib/heartbeat/cores/hacluster ?
>>>>> cheers,
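To answer Raoul's traversal question concretely, a quick check — a sketch, assuming the hacluster account can be given a shell via su -s:

    # Each directory on the path needs at least x for the haclient group
    ls -ald /var /var/lib /var/lib/heartbeat /var/lib/heartbeat/cores /var/lib/heartbeat/cores/hacluster

    # Or simply try to cd there as the cluster user
    su -s /bin/sh hacluster -c 'cd /var/lib/heartbeat/cores/hacluster && echo ok'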
Re: [Pacemaker] Fail-count and failure timeout
On Fri, Oct 1, 2010 at 3:40 PM, <holger.teut...@fresenius-netcare.com> wrote:
> Hi,
> I observed the following in pacemaker versions 1.1.3 and tip up to
> patch 10258.
>
> In a small test environment to study fail-count behavior I have one
> resource "anything" doing "sleep 600" with a monitoring interval of
> 10 secs. The failure-timeout is 300.
> I would expect to never see a fail-count higher than 1.

Why? The fail-count is only reset when the PE runs... which is on a
failure and/or after the cluster-recheck-interval.
So I'd expect a maximum of two.

cluster-recheck-interval = time [15min]
    Polling interval for time based changes to options, resource
    parameters and constraints. The Cluster is primarily event driven,
    however the configuration can have elements that change based on
    time. To ensure these changes take effect, we can optionally poll
    the cluster's status for changes. Allowed values: Zero disables
    polling. Positive values are an interval in seconds (unless other
    SI units are specified, e.g. 5min).

> I observed some sporadic clears but mostly the count is increasing by
> 1 every 10 minutes. Am I mistaken or is this a bug?

Hard to say without logs. What value did it reach?

> Regards
> Holger
>
> --- complete cib for reference ---
>
> <cib epoch="32" num_updates="0" admin_epoch="0"
>      validate-with="pacemaker-1.2" crm_feature_set="3.0.4"
>      have-quorum="0" cib-last-written="Fri Oct 1 14:17:31 2010"
>      dc-uuid="hotlx">
>   <configuration>
>     <crm_config>
>       <cluster_property_set id="cib-bootstrap-options">
>         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.3-09640bd6069e677d5eed65203a6056d9bf562e67"/>
>         <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="openais"/>
>         <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="2"/>
>         <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
>         <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
>         <nvpair id="cib-bootstrap-options-start-failure-is-fatal" name="start-failure-is-fatal" value="false"/>
>         <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1285926879"/>
>       </cluster_property_set>
>     </crm_config>
>     <nodes>
>       <node id="hotlx" uname="hotlx" type="normal"/>
>     </nodes>
>     <resources>
>       <primitive class="ocf" id="test" provider="heartbeat" type="anything">
>         <meta_attributes id="test-meta_attributes">
>           <nvpair id="test-meta_attributes-target-role" name="target-role" value="started"/>
>           <nvpair id="test-meta_attributes-failure-timeout" name="failure-timeout" value="300"/>
>         </meta_attributes>
>         <operations id="test-operations">
>           <op id="test-op-monitor-10" interval="10" name="monitor" on-fail="restart" timeout="20s"/>
>           <op id="test-op-start-0" interval="0" name="start" on-fail="restart" timeout="20s"/>
>         </operations>
>         <instance_attributes id="test-instance_attributes">
>           <nvpair id="test-instance_attributes-binfile" name="binfile" value="sleep 600"/>
>         </instance_attributes>
>       </primitive>
>     </resources>
>     <constraints/>
>   </configuration>
> </cib>
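If the goal is to have failure-timeout honoured promptly in an otherwise quiet cluster, one workaround is to lower the recheck interval so the PE runs more often — a sketch using the crm shell (the 2min value is arbitrary):

    # Run the policy engine every 2 minutes instead of the default 15,
    # so expired failure-timeouts are noticed even without new events
    crm configure property cluster-recheck-interval="2min"

    # Or clear the accumulated fail count for the resource by hand
    crm resource cleanup test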
Re: [Pacemaker] resources are restarted without obvious reasons
On Fri, Oct 1, 2010 at 9:53 AM, Pavlos Parissis <pavlos.paris...@gmail.com> wrote:
> Hi,
> It seems that it happens every time the PE wants to check the conf:
>
> 09:23:55 crmd: [3473]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped!
>
> and then check_rsc_parameters() wants to reset my resources:
>
> 09:23:55 pengine: [3979]: notice: check_rsc_parameters: Forcing restart of pbx_02 on node-02, provider changed: heartbeat -> <null>
> 09:23:55 pengine: [3979]: notice: DeleteRsc: Removing pbx_02 from node-02
> 09:23:55 pengine: [3979]: notice: check_rsc_parameters: Forcing restart of pbx_01 on node-01, provider changed: heartbeat -> <null>

Could be a bug in the code that detects changes to the resource
definition. Could you file a bug please?
http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

> Looking at the code I can't conclude where the issue could be — in the
> actual conf, or whether I am hitting a bug:
>
> static gboolean
> check_rsc_parameters(resource_t *rsc, node_t *node, xmlNode *rsc_entry,
>                      pe_working_set_t *data_set)
> {
>     int attr_lpc = 0;
>     gboolean force_restart = FALSE;
>     gboolean delete_resource = FALSE;
>
>     const char *value = NULL;
>     const char *old_value = NULL;
>     const char *attr_list[] = {
>         XML_ATTR_TYPE, XML_AGENT_ATTR_CLASS, XML_AGENT_ATTR_PROVIDER
>     };
>
>     for (; attr_lpc < DIMOF(attr_list); attr_lpc++) {
>         value = crm_element_value(rsc->xml, attr_list[attr_lpc]);
>         old_value = crm_element_value(rsc_entry, attr_list[attr_lpc]);
>         if (value == old_value /* ie. NULL */
>             || crm_str_eq(value, old_value, TRUE)) {
>             continue;
>         }
>
>         force_restart = TRUE;
>         crm_notice("Forcing restart of %s on %s, %s changed: %s -> %s",
>                    rsc->id, node->details->uname, attr_list[attr_lpc],
>                    crm_str(old_value), crm_str(value));
>     }
>     if (force_restart) {
>         /* make sure the restart happens */
>         stop_action(rsc, node, FALSE);
>         set_bit(rsc->flags, pe_rsc_start_pending);
>         delete_resource = TRUE;
>     }
>     return delete_resource;
> }

On 1 October 2010 09:13, Pavlos Parissis <pavlos.paris...@gmail.com> wrote:
> Hi
> Could it be related to a possible bug mentioned here [1]?
BTW here is the conf of pacemaker:

node $id="b8ad13a6-8a6e-4304-a4a1-8f69fa735100" node-02
node $id="d5557037-cf8f-49b7-95f5-c264927a0c76" node-01
node $id="e5195d6b-ed14-4bb3-92d3-9105543f9251" node-03
primitive drbd_01 ocf:linbit:drbd \
        params drbd_resource="drbd_pbx_service_1" \
        op monitor interval="30s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="120s"
primitive drbd_02 ocf:linbit:drbd \
        params drbd_resource="drbd_pbx_service_2" \
        op monitor interval="30s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="120s"
primitive fs_01 ocf:heartbeat:Filesystem \
        params device="/dev/drbd1" directory="/pbx_service_01" fstype="ext3" \
        meta migration-threshold="3" failure-timeout="60" \
        op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive fs_02 ocf:heartbeat:Filesystem \
        params device="/dev/drbd2" directory="/pbx_service_02" fstype="ext3" \
        meta migration-threshold="3" failure-timeout="60" \
        op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive ip_01 ocf:heartbeat:IPaddr2 \
        params ip="192.168.78.10" cidr_netmask="24" broadcast="192.168.78.255" \
        meta failure-timeout="120" migration-threshold="3" \
        op monitor interval="5s"
primitive ip_02 ocf:heartbeat:IPaddr2 \
        params ip="192.168.78.20" cidr_netmask="24" broadcast="192.168.78.255" \
        op monitor interval="5s"
primitive pbx_01 lsb:test-01 \
        meta failure-timeout="60" migration-threshold="3" target-role="Started" \
        op monitor interval="20s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive pbx_02 lsb:test-02 \
        meta failure-timeout="60" migration-threshold="3" target-role="Started" \
        op monitor interval="20s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
group pbx_service_01 ip_01 fs_01 pbx_01 \
        meta target-role="Started"
group pbx_service_02 ip_02 fs_02 pbx_02 \
        meta target-role="Started"
ms ms-drbd_01 drbd_01 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
ms ms-drbd_02 drbd_02 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
location PrimaryNode-drbd_01 ms-drbd_01 100: node-01
location PrimaryNode-drbd_02 ms-drbd_02 100: node-02
location PrimaryNode-pbx_service_01 pbx_service_01 200: node-01
location
Re: [Pacemaker] [Problem or Enhancement]When attrd reboots, a fail count is initialized.
Hi Andrew,

I registered these contents with Bugzilla as an enhancement of the functions.
* http://developerbugs.linux-foundation.org/show_bug.cgi?id=2501

Thanks,
Hideo Yamauchi.

--- renayama19661...@ybb.ne.jp wrote:
> Hi Andrew,
>
> Thank you for the comment.
>
>>> Is the change of this attrd and crmd difficult?
>>
>> I don't think so. But it's not a huge priority because I've never
>> heard of attrd actually crashing. So while I agree that it's
>> theoretically a problem, in practice no-one is going to hit this in
>> production. Even if they were unlucky enough to see it, at worst the
>> resource is able to run on the node again - which doesn't seem that
>> bad for a HA cluster :-)
>
> All right. I will register this problem with Bugzilla as a request
> first of all, and wait a little for opinions from other users.
>
> Thanks,
> Hideo Yamauchi.
>
> --- Andrew Beekhof <and...@beekhof.net> wrote:
>> On Fri, Oct 1, 2010 at 4:00 AM, <renayama19661...@ybb.ne.jp> wrote:
>>> Hi Andrew,
>>>
>>> Thank you for the comment.
>>>
>>>> During crmd startup, one could read all the values from attrd into
>>>> the hashtable. So the hashtable would only do something if only
>>>> attrd went down.
>>>
>>> If attrd communicates with crmd at the time of start and reads the
>>> data of the hash table, the problem seems to be able to be settled.
>>> Is the change of this attrd and crmd difficult?
>>
>> I don't think so. But it's not a huge priority because I've never
>> heard of attrd actually crashing. So while I agree that it's
>> theoretically a problem, in practice no-one is going to hit this in
>> production. Even if they were unlucky enough to see it, at worst the
>> resource is able to run on the node again - which doesn't seem that
>> bad for a HA cluster :-)
>>
>>>> I mean: did you see this behavior in a production system, or only
>>>> during testing when you manually killed attrd?
>>>
>>> We carry out the kill command by manual operation as one of the
>>> tests of process failure. Our users mind the behavior on process
>>> failures very much.
>>>
>>> Best Regards,
>>> Hideo Yamauchi.
>>>
>>> --- Andrew Beekhof <and...@beekhof.net> wrote:
>>>> On Wed, Sep 29, 2010 at 3:59 AM, <renayama19661...@ybb.ne.jp> wrote:
>>>>> Hi Andrew,
>>>>>
>>>>> Thank you for the comment.
>>>>>
>>>>>> The problem here is that attrd is supposed to be the
>>>>>> authoritative source for this sort of data.
>>>>>
>>>>> Yes. I understand.
>>>>>
>>>>>> Additionally, you don't always want attrd reading from the
>>>>>> status section - like after the cluster restarts.
>>>>>
>>>>> The problem seems to be solvable even so, if attrd retrieves the
>>>>> status section from the cib after attrd rebooted. Method 2 which
>>>>> I suggested has such a meaning.
>>>>>
>>>>> method 2) When attrd starts, attrd communicates with the cib and
>>>>> receives the fail-count.
>>>>>
>>>>>> For failcount, the crmd could keep a hashtable of the current
>>>>>> values which it could re-send to attrd if it detects a
>>>>>> disconnection. But that might not be a generic-enough solution.
>>>>>
>>>>> If a hash table of crmd can maintain it, it may be a good
>>>>> thought. However, I have a feeling that the same problem happens
>>>>> when crmd causes trouble and reboots.
>>>>
>>>> During crmd startup, one could read all the values from attrd into
>>>> the hashtable. So the hashtable would only do something if only
>>>> attrd went down.
>>>>
>>>> The chance that attrd dies _and_ there were relevant values for
>>>> fail-count is pretty remote though... is this a real problem
>>>> you've experienced or a theoretical one?
>>>>
>>>>> I did not understand the meaning well. Does this mean that there
>>>>> is a fail-count of attrd on the other node?
>>>>
>>>> I mean: did you see this behavior in a production system, or only
>>>> during testing when you manually killed attrd?
>>>>
>>>>> Best Regards,
>>>>> Hideo Yamauchi.
--- Andrew Beekhof <and...@beekhof.net> wrote:
> On Mon, Sep 27, 2010 at 7:26 AM, <renayama19661...@ybb.ne.jp> wrote:
>> Hi,
>>
>> When I investigated another problem, I discovered this phenomenon.
>> If attrd causes process trouble and does not restart, the problem
>> does not occur.
>>
>> Step 1) After start, cause a monitor error in UmIPaddr twice.
>>
>> Online: [ srv01 srv02 ]
>>  Resource Group: UMgroup01
>>      UmVIPcheck (ocf::heartbeat:Dummy):  Started srv01
>>      UmIPaddr   (ocf::heartbeat:Dummy2): Started srv01
>>
>> Migration summary:
>> * Node srv02:
>> * Node srv01:
>>    UmIPaddr: migration-threshold=10 fail-count=2
>>
>> Step 2) Kill attrd, and attrd reboots.
>>
>> Online: [ srv01 srv02 ]
>>  Resource Group: UMgroup01
>>      UmVIPcheck (ocf::heartbeat:Dummy):  Started srv01
>>      UmIPaddr   (ocf::heartbeat:Dummy2): Started srv01
>>
>> Migration summary:
>> * Node srv02:
>> * Node srv01:
Re: [Pacemaker] Missing lrm_opstatus
Dejan: looks like something in the lrm library.
Any idea why the message doesn't contain lrm_opstatus?
lrm_targetrc also looks strange.

On Thu, Sep 30, 2010 at 9:41 PM, Ron Kerry <rke...@sgi.com> wrote:
> Folks -
>
> I am seeing the following message sequence that results in a bogus
> declaration of monitor failures for two resources very quickly after
> a failover completes (from hendrix to genesis) with all resources
> coming up. The scenario is the same for both resources. The CXFS
> resource monitor is invoked after a successful start, but the
> response is faked, likely due to the start-delay defined for
> monitoring.
>
> Sep 30 10:23:33 genesis crmd: [12176]: info: te_rsc_command: Initiating action 8: monitor CXFS_monitor_3 on genesis (local)
> Sep 30 10:23:33 genesis crmd: [12176]: info: do_lrm_rsc_op: Performing key=8:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c op=CXFS_monitor_3 )
> Sep 30 10:23:33 genesis lrmd: [12173]: debug: on_msg_perform_op: an operation operation monitor[15] on ocf::cxfs::CXFS for client 12176, its parameters: CRM_meta_name=[monitor] CRM_meta_start_delay=[60] crm_feature_set=[3.0.2] CRM_meta_on_fail=[restar..
> Sep 30 10:23:33 genesis crmd: [12176]: info: do_lrm_rsc_op: Faking confirmation of CXFS_monitor_3: execution postponed for over 5 minutes
> Sep 30 10:23:33 genesis crmd: [12176]: info: send_direct_ack: ACK'ing resource op CXFS_monitor_3 from 8:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c: lrm_invoke-lrmd-1285860213-14
> Sep 30 10:23:33 genesis crmd: [12176]: info: process_te_message: Processing (N)ACK lrm_invoke-lrmd-1285860213-14 from genesis
> Sep 30 10:23:33 genesis crmd: [12176]: info: match_graph_event: Action CXFS_monitor_3 (8) confirmed on genesis (rc=0)
>
> A similar sequence for the TMF resource ...
>
> Sep 30 10:23:44 genesis crmd: [12176]: info: te_rsc_command: Initiating action 12: monitor TMF_monitor_6 on genesis (local)
> Sep 30 10:23:44 genesis crmd: [12176]: info: do_lrm_rsc_op: Performing key=12:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c op=TMF_monitor_6 )
> Sep 30 10:23:44 genesis lrmd: [12173]: debug: on_msg_perform_op: an operation operation monitor[19] on ocf::tmf::TMF for client 12176, its parameters: admin_emails=[rke...@sgi.com] loader_hosts=[ibm3494cps] devgrpnames=[ibm3592] loader_names=[ibm3494] loa...
> Sep 30 10:23:44 genesis crmd: [12176]: info: do_lrm_rsc_op: Faking confirmation of TMF_monitor_6: execution postponed for over 5 minutes
> Sep 30 10:23:44 genesis crmd: [12176]: info: send_direct_ack: ACK'ing resource op TMF_monitor_6 from 12:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c: lrm_invoke-lrmd-1285860224-19
> Sep 30 10:23:44 genesis crmd: [12176]: info: process_te_message: Processing (N)ACK lrm_invoke-lrmd-1285860224-19 from genesis
> Sep 30 10:23:44 genesis crmd: [12176]: info: match_graph_event: Action TMF_monitor_6 (12) confirmed on genesis (rc=0)
>
> The TMF monitor operation state then gets an error. Note that the
> operation id matches the above invocation.
> Sep 30 10:26:12 genesis lrmd: [12173]: debug: on_msg_get_state: state of rsc TMF is LRM_RSC_IDLE
> Sep 30 10:26:12 genesis crmd: [12176]: WARN: msg_to_op(1326): failed to get the value of field lrm_opstatus from a ha_msg
> Sep 30 10:26:12 genesis crmd: [12176]: info: msg_to_op: Message follows:
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG: Dumping message with 16 fields
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[0] : [lrm_t=op]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[1] : [lrm_rid=TMF]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[2] : [lrm_op=monitor]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[3] : [lrm_timeout=63]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[4] : [lrm_interval=6]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[5] : [lrm_delay=60]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[6] : [lrm_copyparams=0]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[7] : [lrm_t_run=0]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[8] : [lrm_t_rcchange=0]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[9] : [lrm_exec_time=0]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[10] : [lrm_queue_time=0]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[11] : [lrm_targetrc=-2]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[12] : [lrm_app=crmd]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[13] : [lrm_userdata=12:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[14] : [(2)lrm_param=0x60081260(331 413)]
>
> Same for the CXFS monitor operation state ...
>
> Sep 30 10:26:12 genesis lrmd: [12173]: debug: on_msg_get_state: state of rsc CXFS is LRM_RSC_IDLE
> Sep 30 10:26:12 genesis crmd: [12176]: WARN: msg_to_op(1326): failed to get the value of field lrm_opstatus from a ha_msg
> Sep 30 10:26:12 genesis crmd: [12176]: info: msg_to_op: Message follows:
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG: Dumping message with 16 fields
> Sep 30 10:26:12 genesis crmd: [12176]:
Re: [Pacemaker] crm_mon SNMP function
On Monday 04 October 2010 15:00:25 mathias.enzensber...@knapp.com wrote:
> Hi all,
>
> I use openais/pacemaker v1.1.2 on SLES 11.1 and would like to use the
> SNMP function of crm_mon. But this part is documented really scantily
> (e.g. the part for configuring SNMP notifications is blank). I found
> out that there is a special MIB named linux-ha-mib, but I don't know
> how to use this MIB in connection with the crm_mon command and its
> SNMP function.
>
> Does anyone of you have experience with that, or can someone document
> it shortly for me? Thank you in advance.
>
> Mit freundlichen Grüßen / Best Regards
> Mathias Enzensberger
> Systemadministration

Hi,

they say there is a good German book about Linux clusters published by
O'Reilly. ;-) The book deals with the SNMP part of the cluster software
in a separate chapter.

As far as I know there are two SNMP agents implemented in the cluster
software at the moment:

1) One within the crm_mon program. This one is only able to send out
traps, which it does in case of an unexpected event. Configure it with
the -s (?) option of the crm_mon program. Please check, because I'm not
quite sure about the option.

2) A full-blown SNMP subagent program called hbagent. It uses the
AgentX socket of the net-snmp agent. It provides a complete information
MIB including useful things like fail counters.

Please mail again if you are stuck using the agent and want to know more.

Greetings,

--
Dr. Michael Schwartzkopff
Guardinistr. 63
81375 München
Tel: (0163) 172 50 98
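For reference, the trap sender in crm_mon is normally run as a daemon. A sketch only — the SNMP options exist solely in builds compiled with SNMP support, the exact spelling should be verified with crm_mon --help on your version, and the management host name below is made up:

    # Run crm_mon in the background and send SNMP traps on status
    # changes to the given management host
    crm_mon --daemonize --snmp-traps snmp-mgmt.example.com

    # The hbagent subagent instead plugs into a running net-snmp
    # master agent via AgentX; enable AgentX in snmpd.conf first:
    #   master agentx
    hbagent &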
Re: [Pacemaker] resources are restarted without obvious reasons
On 5 October 2010 11:15, Andrew Beekhof <and...@beekhof.net> wrote:
> On Fri, Oct 1, 2010 at 9:53 AM, Pavlos Parissis <pavlos.paris...@gmail.com> wrote:
>> Hi,
>> It seems that it happens every time the PE wants to check the conf:
>>
>> 09:23:55 crmd: [3473]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped!
>>
>> and then check_rsc_parameters() wants to reset my resources:
>>
>> 09:23:55 pengine: [3979]: notice: check_rsc_parameters: Forcing restart of pbx_02 on node-02, provider changed: heartbeat -> <null>
>> 09:23:55 pengine: [3979]: notice: DeleteRsc: Removing pbx_02 from node-02
>> 09:23:55 pengine: [3979]: notice: check_rsc_parameters: Forcing restart of pbx_01 on node-01, provider changed: heartbeat -> <null>
>
> Could be a bug in the code that detects changes to the resource
> definition. Could you file a bug please?
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

here it is
http://developerbugs.linux-foundation.org/show_bug.cgi?id=2504
[Pacemaker] init Script fails in 1 of LSB Compatible test
Hi,

I am thinking of putting sshd under cluster control, and I am checking
whether the /etc/init.d/sshd supplied by RedHat 5.4 is compatible with
LSB. So I ran the tests mentioned here [1], and it fails at test 6: it
returns 1 and a "failed" message. Could this create problems within
pacemaker?

Regards,
Pavlos

[1] http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-lsb.html
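From memory, the appendix at [1] runs the script through a sequence along these lines (see the URL for the authoritative list and the exact test numbering); every command is expected to exit 0 except status on a stopped service, which must exit 3:

    # Start the service; must exit 0 and actually start sshd
    /etc/init.d/sshd start; echo "result: $?"

    # Status on a running service; must exit 0
    /etc/init.d/sshd status; echo "result: $?"

    # Starting an already-running service must still exit 0
    /etc/init.d/sshd start; echo "result: $?"

    # Stop the service; must exit 0
    /etc/init.d/sshd stop; echo "result: $?"

    # Status on a stopped service must exit 3 (not 0, not 1)
    /etc/init.d/sshd status; echo "result: $?"

    # Stopping an already-stopped service must still exit 0
    /etc/init.d/sshd stop; echo "result: $?"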
Re: [Pacemaker] Fail-count and failure timeout
The resource failed when the sleep expired, i.e. every 600 secs.

Now I changed the resource to "sleep 7200" with failure-timeout 3600,
i.e. values far beyond the recheck interval of 15m. Now everything
behaves as expected.

Mit freundlichen Grüßen / Kind regards
Holger Teutsch

From: Andrew Beekhof <and...@beekhof.net>
To: The Pacemaker cluster resource manager <pacemaker@oss.clusterlabs.org>
Date: 05.10.2010 11:09
Subject: Re: [Pacemaker] Fail-count and failure timeout

On Tue, Oct 5, 2010 at 11:07 AM, Andrew Beekhof <and...@beekhof.net> wrote:
> On Fri, Oct 1, 2010 at 3:40 PM, <holger.teut...@fresenius-netcare.com> wrote:
>> Hi,
>> I observed the following in pacemaker versions 1.1.3 and tip up to
>> patch 10258. In a small test environment to study fail-count behavior
>> I have one resource "anything" doing "sleep 600" with a monitoring
>> interval of 10 secs. The failure-timeout is 300. I would expect to
>> never see a fail-count higher than 1.
>
> Why? The fail-count is only reset when the PE runs... which is on a
> failure and/or after the cluster-recheck-interval.
> So I'd expect a maximum of two.

Actually this is wrong. There is no maximum, because there needs to
have been 300s since the last failure when the PE runs. And since it
only runs when the resource fails, it is never reset.

> cluster-recheck-interval = time [15min]
>     Polling interval for time based changes to options, resource
>     parameters and constraints. The Cluster is primarily event driven,
>     however the configuration can have elements that change based on
>     time. To ensure these changes take effect, we can optionally poll
>     the cluster's status for changes. Allowed values: Zero disables
>     polling. Positive values are an interval in seconds (unless other
>     SI units are specified, e.g. 5min).
>
>> I observed some sporadic clears but mostly the count is increasing by
>> 1 every 10 minutes. Am I mistaken or is this a bug?
>
> Hard to say without logs. What value did it reach?
>> Regards
>> Holger
>>
>> --- complete cib for reference ---
>> [snip: the complete cib is quoted in full in the original report earlier in this thread]
Re: [Pacemaker] init Script fails in 1 of LSB Compatible test
On Tue, Oct 5, 2010 at 12:51 PM, Pavlos Parissis <pavlos.paris...@gmail.com> wrote:
> Hi,
> I am thinking of putting sshd under cluster control, and I am checking
> whether the /etc/init.d/sshd supplied by RedHat 5.4 is compatible with
> LSB. So I ran the tests mentioned here [1], and it fails at test 6: it
> returns 1 and a "failed" message. Could this create problems within
> pacemaker?

yes

> Regards,
> Pavlos
>
> [1] http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-lsb.html
Re: [Pacemaker] init Script fails in 1 of LSB Compatible test
On 5 October 2010 13:19, Andrew Beekhof <and...@beekhof.net> wrote:
> On Tue, Oct 5, 2010 at 12:51 PM, Pavlos Parissis <pavlos.paris...@gmail.com> wrote:
>> Hi,
>> I am thinking of putting sshd under cluster control, and I am
>> checking whether the /etc/init.d/sshd supplied by RedHat 5.4 is
>> compatible with LSB. So I ran the tests mentioned here [1], and it
>> fails at test 6: it returns 1 and a "failed" message. Could this
>> create problems within pacemaker?
>
> yes

What kind of problems, and why?

Regards,
Pavlos
Re: [Pacemaker] Dependency on either of two resources
05.10.2010 12:12, Andrew Beekhof wrote:
> On Mon, Oct 4, 2010 at 6:31 AM, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:
>> Hi all,
>>
>> just wondering, is there a way to make a resource depend on (be
>> colocated with) either of two other resources?
>
> Not yet. It's something we want to support eventually though.

That would be a killer feature. Hope you will find time for it before 1.2.

So for now I should probably exploit the monitor op to try to repair
failed portals. That is a hack, and it doesn't let me see in the cib
whether there are actual problems, but it is better than nothing.

>> Use case is an iSCSI initiator connection to an iSCSI target with two
>> portals. The idea is to have e.g. a device-mapper multipath resource
>> depend on both iSCSI connection resources, but in a soft way, so that
>> failure of any single iSCSI connection will not cause the multipath
>> resource to stop, but failure of both connections will.
>>
>> I may be missing something, but I cannot find an answer whether this
>> is possible with current pacemaker. Can someone bring some light?
>>
>> Best,
>> Vladislav
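To make the "repair from monitor" workaround concrete — a sketch only, with made-up portal addresses and no claim that this is the poster's actual agent — the monitor action of a custom OCF agent could try to re-login failed portals and only report failure when every path is down (OCF_SUCCESS and OCF_NOT_RUNNING come from ocf-shellfuncs):

    # Fragment of a hypothetical OCF resource agent's monitor action
    monitor_iscsi_paths() {
        ok=0
        for portal in 10.0.0.1:3260 10.0.0.2:3260; do
            if iscsiadm -m session 2>/dev/null | grep -q "$portal"; then
                ok=1
            else
                # soft repair: try to re-establish the failed portal,
                # but don't fail the resource while another path works
                iscsiadm -m node -p "$portal" --login || true
            fi
        done
        [ "$ok" -eq 1 ] && return "$OCF_SUCCESS"
        return "$OCF_NOT_RUNNING"
    }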
[Pacemaker] pacemaker version
Hi,

I was interested in knowing: if I have to choose between pacemaker 1.0
and 1.1, which one should I use?

Thanks
Shravan
[Pacemaker] Online and Offline status when doing crm_mon
We are set up in a two-node active/passive cluster using
pacemaker/corosync. We shut down pacemaker/corosync on both nodes and
changed the uname -n on our nodes to show the short name instead of the
FQDN. We started pacemaker/corosync back up, and ever since we did
that, when we run the crm_mon command we see this below.

Last updated: Tue Oct 5 13:28:16 2010
Stack: openais
Current DC: e-magdb2 - partition with quorum
Version: 1.0.9-89bd754939df5150de7cd76835f98fe90851b677
4 Nodes configured, 2 expected votes
2 Resources configured.

Online: [ e-magdb2 e-magdb1 ]
OFFLINE: [ e-magdb1.testingpcmk.com e-magdb2.testingpcmkr.com ]

We did edit the crm configuration file to use short names for both
nodes. We can ping both the short name and the FQDN on our internal
network, and both come back with the right IP address. We are running
on RHEL 5.

Does anybody have any ideas why the FQDNs show as offline since this
change, given that we configured pacemaker/corosync to use short names?
Is it grabbing them from internal DNS via the IP address we have in the
/etc/corosync.conf file? Everything seems to be working and failing
over correctly. Should this be something to worry about, or is it maybe
a display bug? Below is the corosync.conf file.

# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
        version: 2
        secauth: off
        threads: 0
        interface {
                ringnumber: 0
                bindnetaddr: 172.26.5.167
                mcastaddr: 226.94.5.1
                mcastport: 5405
        }
}

logging {
        fileline: off
        to_stderr: no
        to_logfile: yes
        to_syslog: yes
        logfile: /var/log/cluster/corosync.log
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}

amf {
        mode: disabled
}

Thanks,
Mike
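If the FQDN entries are indeed just stale node objects left in the CIB from before the rename, they can be removed once they show as OFFLINE — a sketch, assuming the crm shell (syntax can vary slightly between versions; node names copied from the output above):

    # The cluster is healthy under the short names; drop the leftovers
    crm node delete e-magdb1.testingpcmk.com
    crm node delete e-magdb2.testingpcmkr.com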
[Pacemaker] how to test network access and fail over accordingly?
Hello,

I have a 2-node cluster, running DRBD, heartbeat and pacemaker in
active/passive mode. On both nodes, eth0 is connected to the main
network, and eth1 is used to connect the nodes directly to each other.
The nodes share a virtual IP address on eth0. Pacemaker is also
controlling a custom service with an LSB-compliant script in
/etc/init.d/. All of this is working fine and I'm happy with it.

I'd like to configure the nodes so that they fail over if eth0 goes
down (or if they cannot access a particular gateway), so I tried adding
the following (as per
http://www.clusterlabs.org/wiki/Example_configurations#Set_up_pingd):

primitive p_pingd ocf:pacemaker:pingd \
        params host_list="172.20.0.254" \
        op monitor interval="15s" timeout="5s"
clone c_pingd p_pingd \
        meta globally-unique="false"
location loc_pingd g_cluster_services \
        rule -inf: not_defined p_pingd or p_pingd lte 0

... but when I do add that, all resources are stopped and they don't
come back up on either node. Am I making a basic mistake, or do you
need more info from me?

All help is appreciated,
Craig.

pacemaker Version: 1.0.8+hg15494-2ubuntu2
heartbeat Version: 1:3.0.3-1ubuntu1
drbd8-utils Version: 2:8.3.7-1ubuntu2.1

r...@rpalpha:~$ sudo crm configure show
node $id="32482293-7b0f-466e-b405-c64bcfa2747d" rpalpha
node $id="3f2aac12-05aa-4ac7-b91f-c47fa28efb44" rpbravo
primitive p_drbd_data ocf:linbit:drbd \
        params drbd_resource="data" \
        op monitor interval="30s"
primitive p_fs_data ocf:heartbeat:Filesystem \
        params device="/dev/drbd/by-res/data" directory="/mnt/data" fstype="ext4"
primitive p_ip ocf:heartbeat:IPaddr2 \
        params ip="172.20.50.3" cidr_netmask="255.255.0.0" nic="eth0" \
        op monitor interval="30s"
primitive p_rp lsb:rp \
        op monitor interval="30s" \
        meta target-role="Started"
group g_cluster_services p_ip p_fs_data p_rp
ms ms_drbd p_drbd_data \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
location loc_preferred_master g_cluster_services inf: rpalpha
colocation colo_mnt_on_master inf: g_cluster_services ms_drbd:Master
order ord_mount_after_drbd inf: ms_drbd:promote g_cluster_services:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
        cluster-infrastructure="Heartbeat" \
        no-quorum-policy="ignore" \
        stonith-enabled="false" \
        expected-quorum-votes="2"

r...@rpalpha:~$ sudo cat /etc/ha.d/ha.cf
node rpalpha
node rpbravo
keepalive 2
warntime 5
deadtime 15
initdead 60
mcast eth0 239.0.0.43 694 1 0
bcast eth1
use_logd yes
autojoin none
crm respawn

r...@rpalpha:~$ sudo cat /etc/drbd.conf
global {
        usage-count no;
}
common {
        protocol C;
        handlers {}
        startup {}
        disk {}
        net {
                cram-hmac-alg sha1;
                shared-secret "foobar";
        }
        syncer {
                verify-alg sha1;
                rate 100M;
        }
}
resource data {
        device /dev/drbd0;
        meta-disk internal;
        on rpalpha {
                disk /dev/mapper/rpalpha-data;
                address 192.168.1.1:7789;
        }
        on rpbravo {
                disk /dev/mapper/rpbravo-data;
                address 192.168.1.2:7789;
        }
}
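One thing worth checking (an observation on the rule above, not a confirmed diagnosis): ocf:pacemaker:pingd publishes a node attribute whose name comes from its "name" parameter and defaults to "pingd", not to the resource id, so a rule testing "p_pingd" would never find the attribute defined and the -inf rule would match on every node. A sketch with the attribute names aligned:

    primitive p_pingd ocf:pacemaker:pingd \
            params host_list="172.20.0.254" name="pingd" \
            op monitor interval="15s" timeout="5s"
    clone c_pingd p_pingd \
            meta globally-unique="false"
    # test the attribute the agent actually sets ("pingd"),
    # not the resource id ("p_pingd")
    location loc_pingd g_cluster_services \
            rule -inf: not_defined pingd or pingd lte 0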