Re: [Linux-ha-dev] [Linux-HA] Announcing the Heartbeat 3.0.6 Release
On 11 Feb 2015, at 8:24 am, Lars Ellenberg lars.ellenb...@linbit.com wrote:

> TL;DR: If you intend to set up a new High Availability cluster using the Pacemaker cluster manager, you typically should not care about Heartbeat, but use recent releases (2.3.x) of Corosync. If you don't care about Heartbeat, don't read further. Unless you are beekhof... there's a question below ;-)
>
> After 3½ years since the last officially tagged release of Heartbeat, I have seen the need to do a new maintenance release.
>
> The Heartbeat 3.0.6 release tag: 3d59540cf28d
> and the change set it points to: cceeb47a7d8f
>
> The main reason for this was that Pacemaker more recent than somewhere between 1.1.6 and 1.1.7 would no longer work properly on the Heartbeat cluster stack, because some of the daemons have moved from glue to Pacemaker proper and changed their paths. This has been fixed in Heartbeat.
>
> Also, during that time stonith-ng was refactored: it would still reliably fence, but no longer understood its own confirmation message, so it was effectively broken. This I fixed in Pacemaker.
>
> If you choose to run a new Pacemaker with the Heartbeat communication stack, it should be at least 1.1.12 with a few patches; see my December 2014 commits at the top of https://github.com/lge/pacemaker/commits/linbit-cluster-stack-pcmk-1.1.12
>
> I'm not sure if they got into Pacemaker upstream yet. beekhof? Do I need to rebase? Or did I miss you merging these?

Merged now :-) We're about to start the 1.1.13 release cycle, so it won't be far away.

---

> If you have those patches, consider setting this new ha.cf configuration parameter:
>
> # If pacemaker crmd spawns the pengine itself,
> # it sometimes forgets to kill the pengine on shutdown,
> # which later may confuse the system after cluster restart.
> # Tell the system that Heartbeat is supposed to
> # control the pengine directly.
> crmd_spawns_pengine off
>
> Here is the shortened Heartbeat changelog; the longer version is available in Mercurial: http://hg.linux-ha.org/heartbeat-STABLE_3_0/shortlog
>
> - fix emergency shutdown due to broken update_ackseq
> - fix node dead detection problems
> - fix converging of membership (ccm)
> - fix init script startup glitch (caused by changes in glue/resource-agents)
> - heartbeat.service file for systemd platforms
> - new ucast6 UDP IPv6 communication plugin
> - package ha_api.py in standard package
> - update some man pages, specifically the example ha.cf
> - also report ccm membership status for cl_status hbstatus -v
> - updated some log messages, or their log levels
> - reduce max_delay in broadcast client_status query to one second
> - apply various (mostly cosmetic) patches from Debian
> - drop HBcompress compression plugins: they are part of cluster glue
> - drop openais HBcomm plugin
> - better support for current pacemaker versions
> - try to not miss a SIGTERM (fix problem with very fast respawn/stop cycle)
> - dopd: ignore dead ping nodes
> - cl_status improvements
> - api internals: reduce IPC round-trips to get at status information
> - uid=root is sufficient to use heartbeat api (gid=haclient remains sufficient)
> - fix /dev/null as log- or debugfile setting
> - move daemon binaries into libexecdir
> - document movement of compression plugins into cluster-glue
> - fix usage of SO_REUSEPORT in ucast sockets
> - fix compile issues with recent gcc and -Werror
>
> Note that a number of the mentioned fixes were created two years ago already, and may have been released in packages for a long time, where vendors have chosen to package them.
>
> As to future plans for Heartbeat: Heartbeat is still useful for non-pacemaker, haresources-mode clusters. We (Linbit) will maintain Heartbeat for the foreseeable future.
> That should not be too much of a burden, as it is stable, and due to long years of field exposure, all bugs are known ;-)
>
> The most notable shortcoming when using Heartbeat with Pacemaker clusters is the limited message size. There are currently no plans to remove that limitation.
>
> With its wide choice of communication paths, even exotic communication plugins, and the ability to run arbitrarily many paths, some deployments may still favor it over Corosync. But typically, for new deployments involving Pacemaker, in most cases you should choose Corosync 2.3.x as your membership and communication layer.
>
> For existing deployments using Heartbeat, upgrading to this Heartbeat version is strongly recommended.
>
> Thanks,
> Lars Ellenberg

___
Linux-HA mailing list
linux...@lists.linux-ha.org
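For readers putting these pieces together, a minimal ha.cf might look like the following sketch. The node names, interface, and address are assumptions for illustration only (and the ucast6 plugin is assumed here to follow the existing ucast "device address" syntax), so adjust everything to your site:

```
# /etc/ha.d/ha.cf -- hypothetical example, not from the announcement
node alice bob
# ucast6 is the new UDP IPv6 unicast plugin in 3.0.6
# (assumed syntax, mirroring ucast: device then address)
ucast6 eth0 fe80::10
# run Pacemaker on top of Heartbeat
crm respawn
# with the patched Pacemaker >= 1.1.12: let Heartbeat control the pengine
crmd_spawns_pengine off
```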
Re: [Linux-ha-dev] quorum status
Check corosync.conf, I'm guessing pcs enabled two node mode.

On 29 Jan 2015, at 5:21 pm, Yan, Xiaoping (NSN - CN/Hangzhou) xiaoping@nsn.com wrote:

Hi, I used this command: pcs cluster stop rhel1
So I think both were shut down.
Br, Rip

-----Original Message-----
From: ext Andrew Beekhof [mailto:and...@beekhof.net]
Sent: Thursday, January 29, 2015 1:51 PM
To: Yan, Xiaoping (NSN - CN/Hangzhou)
Cc: Linux-HA-Dev@lists.linux-ha.org
Subject: Re: quorum status

Did you shut down pacemaker or corosync or both?

On 29 Jan 2015, at 4:18 pm, Yan, Xiaoping (NSN - CN/Hangzhou) xiaoping@nsn.com wrote:

Hi, any suggestion please?
Br, Rip

_____
From: Yan, Xiaoping (NSN - CN/Hangzhou)
Sent: Wednesday, January 28, 2015 4:04 PM
To: Linux-HA-Dev@lists.linux-ha.org
Subject: quorum status

Hi experts:

I'm setting up a 2-node (rhel1 and rhel2) Linux cluster following Pacemaker-1.1-Clusters_from_Scratch-en-US. After I shut down one of the nodes (pcs cluster stop rhel1), the other partition still has quorum, while according to the document, chapter 5.3.1, it should be a partition without quorum. (A cluster is said to have quorum when total_nodes < 2 * active_nodes.) What could be the problem? Thank you.

[root@rhel2 ~]# pcs status
Cluster name: mycluster
Last updated: Wed Jan 28 10:49:46 2015
Last change: Wed Jan 28 10:07:10 2015 via cibadmin on rhel1
Stack: corosync
Current DC: rhel2 (2) - partition with quorum
Version: 1.1.10-29.el7-368c726
2 Nodes configured
1 Resources configured

Online: [ rhel2 ]
OFFLINE: [ rhel1 ]

Full list of resources:
 ClusterIP (ocf::heartbeat:IPaddr2): Started rhel2

PCSD Status:
  rhel1: Unable to authenticate
  rhel2: Unable to authenticate

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

[root@rhel2 ~]# pcs status corosync

Membership information
----------------------
    Nodeid      Votes Name
         2          1 rhel2 (local)

Br, Rip

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
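Beekhof's guess refers to corosync's two-node mode. On RHEL 7, pcs typically generates a votequorum section along these lines (a sketch of what to look for in corosync.conf, not taken from the poster's system):

```
quorum {
    provider: corosync_votequorum
    # with two_node set, a two-node cluster keeps quorum when one node
    # stops cleanly -- which would explain the "partition with quorum"
    # output above despite the total_nodes < 2 * active_nodes rule
    two_node: 1
}
```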
Re: [Linux-ha-dev] quorum status
Did you shut down pacemaker or corosync or both?

On 29 Jan 2015, at 4:18 pm, Yan, Xiaoping (NSN - CN/Hangzhou) xiaoping@nsn.com wrote:

Hi, any suggestion please?
Br, Rip

_____
From: Yan, Xiaoping (NSN - CN/Hangzhou)
Sent: Wednesday, January 28, 2015 4:04 PM
To: Linux-HA-Dev@lists.linux-ha.org
Subject: quorum status

Hi experts:

I'm setting up a 2-node (rhel1 and rhel2) Linux cluster following Pacemaker-1.1-Clusters_from_Scratch-en-US. After I shut down one of the nodes (pcs cluster stop rhel1), the other partition still has quorum, while according to the document, chapter 5.3.1, it should be a partition without quorum. (A cluster is said to have quorum when total_nodes < 2 * active_nodes.) What could be the problem? Thank you.

[root@rhel2 ~]# pcs status
Cluster name: mycluster
Last updated: Wed Jan 28 10:49:46 2015
Last change: Wed Jan 28 10:07:10 2015 via cibadmin on rhel1
Stack: corosync
Current DC: rhel2 (2) - partition with quorum
Version: 1.1.10-29.el7-368c726
2 Nodes configured
1 Resources configured

Online: [ rhel2 ]
OFFLINE: [ rhel1 ]

Full list of resources:
 ClusterIP (ocf::heartbeat:IPaddr2): Started rhel2

PCSD Status:
  rhel1: Unable to authenticate
  rhel2: Unable to authenticate

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

[root@rhel2 ~]# pcs status corosync

Membership information
----------------------
    Nodeid      Votes Name
         2          1 rhel2 (local)

Br, Rip
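The quorum rule quoted from Clusters from Scratch (a partition has quorum when total_nodes < 2 * active_nodes) can be sketched as a small shell function; the node counts below mirror the poster's two-node example:

```shell
#!/bin/sh
# Sketch of the quorum rule: a partition has quorum iff
# total_nodes < 2 * active_nodes (i.e. more than half the nodes are active).
has_quorum() {
    total=$1
    active=$2
    [ "$total" -lt $((2 * active)) ]
}

if has_quorum 2 1; then
    echo "2 nodes, 1 active: quorum"
else
    echo "2 nodes, 1 active: no quorum"
fi
if has_quorum 3 2; then
    echo "3 nodes, 2 active: quorum"
fi
```

By this rule a two-node cluster that loses one node loses quorum, so the "partition with quorum" output only makes sense if something (such as corosync's two-node mode) overrides the plain count.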
Re: [Linux-ha-dev] April 1st joke? (pengine: [20168]: ERROR: crm_abort: gregorian_to_ordinal: Triggered assert at iso8601.c:635 : a_date->days > 0)
It's an underflow error that has since been fixed. Sorry for the noise.

On 09/04/2013, at 11:06 PM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

Hi, I found these error messages in syslog on April 1st:

Apr 1 00:04:30 h06 pengine: [20168]: ERROR: crm_abort: gregorian_to_ordinal: Triggered assert at iso8601.c:635 : a_date->days > 0
[...]
Apr 1 01:49:30 h06 pengine: [20168]: ERROR: crm_abort: convert_from_gregorian: Triggered assert at iso8601.c:622 : gregorian_to_ordinal(a_date)

Before and after those times I could not see any of these. (pacemaker-1.1.7-0.13.9 of SLES11 SP2)

Regards,
Ulrich
Re: [Linux-ha-dev] [Linux-HA] Mysql RA issue: Heartbeat/Pacemaker stops switching Master/Slave after killing mysql processes of Master many times (3 times)
On Fri, Jan 18, 2013 at 9:06 PM, Thai Nguyen nqt...@tma.com.vn wrote:

Hello all, I am running Heartbeat/Pacemaker with MySQL master/slave replication on my servers, and I am facing an issue involving the MySQL RA, as follows:

Steps to reproduce:
Step 1: Kill the mysql processes on the Master.
Step 2: Wait until Heartbeat/Pacemaker has switched Master/Slave.
Step 3: Repeat steps 1 and 2 two more times.
Step 4: Observe the Master/Slave status.

Expected result: Heartbeat/Pacemaker switches Master/Slave successfully.
Actual result: Heartbeat/Pacemaker stops switching Master/Slave.

After killing the Master the 2nd time, I checked the new Master's log (ha-log); the message "MySQL monitor succeeded (master)" didn't show up in the log. Then I killed the mysql processes of the new Master (3rd time), and the result is that Heartbeat/Pacemaker stops switching Master/Slave. To work around this issue, I need to restart Heartbeat.

You could have also just run "crm resource cleanup ms_MySQL" to clear out the failures. If that doesn't work, some logs would make it easier to comment.
And this is my pacemaker config:

node $id="fabe2f8e-9ba2-4f85-a644-fa16fe492830" ares \
	attributes apollo-log-file-p_mysql="mysql-bin.67" apollo-log-pos-p_mysql="107"
node $id="fd5a954a-aadc-450e-9dda-ca2c18e980c2" apollo
primitive MailTo ocf:heartbeat:MailTo \
	params email="nqt...@gmail.com"
primitive p_mysql ocf:heartbeat:mysql \
	params config="/etc/mysql/my.cnf" pid="/var/run/mysqld/mysqld.pid" socket="/var/run/mysqld/mysqld.sock" binary="/usr/bin/mysqld_safe" replication_user="root" replication_passwd="nec" test_user="root" test_passwd="nec" max_slave_lag="10" evict_outdated_slaves="false" \
	op monitor interval="1s" role="Master" timeout="120s" \
	op monitor interval="3s" timeout="120s" \
	op start interval="0" role="Stopped" timeout="120s" on-fail="restart" \
	op stop interval="0" timeout="120s" \
	meta is-managed="true"
primitive virtualIP ocf:heartbeat:IPaddr \
	params ip="192.168.103.223" cidr_netmask="255.255.255.0" \
	op monitor interval="1s" \
	meta is-managed="true"
ms ms_MySQL p_mysql \
	meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" globally-unique="false" target-role="Master" is-managed="true"
colocation mysql_co_ip inf: virtualIP ms_MySQL:Master
order my_MySQL_promote_before_vip inf: ms_MySQL:promote virtualIP:start
property $id="cib-bootstrap-options" \
	dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
	cluster-infrastructure="Heartbeat" \
	stonith-enabled="false" \
	default-action-timeout="30" \
	cluster-recheck-interval="30s" \
	no-quorum-policy="ignore"
property $id="mysql_replication" \
	p_mysql_REPL_INFO="ares|mysql-bin.34|107"
rsc_defaults $id="rsc-options" \
	resource-stickiness="1" \
	migration-threshold="1" \
	failure-timeout="15s"

Best regards,
Thai Nguyen
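A plausible reading of why switching stops (an illustration, not Pacemaker internals): with migration-threshold=1, each node is ruled out after a single failure, and once every node carries a fail-count the master has nowhere left to go until the counts are cleared, e.g. by "crm resource cleanup ms_MySQL". A toy model:

```shell
#!/bin/sh
# Illustrative model only: a node may run the resource while its fail-count
# stays below migration-threshold. Entries are "node=failcount" pairs.
eligible_nodes() {
    threshold=$1; shift
    for entry in "$@"; do
        node=${entry%%=*}
        count=${entry##*=}
        if [ "$count" -lt "$threshold" ]; then
            echo "$node"
        fi
    done
}

eligible_nodes 1 ares=1 apollo=0   # only apollo remains eligible
eligible_nodes 1 ares=1 apollo=1   # prints nothing: nowhere left to run
```

This also suggests why restarting Heartbeat "fixes" it: the restart discards the accumulated fail-counts, just as cleanup would.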
Re: [Linux-ha-dev] [resource-agents] Low: pgsql: check existence of instance number in replication mode (#159)
On Mon, Oct 29, 2012 at 9:51 PM, Dejan Muhamedagic de...@suse.de wrote:
On Fri, Oct 26, 2012 at 11:36:53AM +1100, Andrew Beekhof wrote:
On Fri, Oct 26, 2012 at 12:52 AM, Dejan Muhamedagic de...@suse.de wrote:
On Thu, Oct 25, 2012 at 06:09:38AM -0700, Lars Ellenberg wrote:
On Thu, Oct 25, 2012 at 03:38:47AM -0700, Takatoshi MATSUO wrote:

Usually, we use the crm_master command instead of crm_attribute to change the master score in an RA. But PostgreSQL's slave can't get its own replication status, so the Master changes the Slave's master-score using the instance number on Pacemaker 1.0.x. This is probably not ordinary usage.

Would the existing resource agent work with globally-unique=true?

I don't know whether it works with true. I use it with false and it doesn't need true.

I suggested that you actually should use globally-unique clones, as in that case you still get those instance numbers...

Does using different clones make sense in pgsql? What is to be different between them? Or would it be just for the sake of getting instance numbers? If so, then it somehow looks wrong to me :)

But thinking about it once more, I'm not so sure anymore. Correct me where I'm wrong. This is about the master score. In case the Master instance fails, we preferably want to promote the slave instance that is as close as possible to the Master. We only know which *node* was best at the last monitoring interval, which may be good enough. We need to then change the master score for *all possible instances*, for all nodes, accordingly. Which is what that loop did. (I think skipping the current instance is actually a bug; if Pacemaker relabels things in a bad way, you may hit it.)

Now, with Pacemaker 1.1.8, all instances become equal (for anonymous clones, aka globally-unique=false), and we only need to set the score on the resource-id, not for all resource-id:instance combinations.

OK. Which is great.
After all, the master score in this case is attached to the node (or the data set accessible from that node), and not to the (arbitrary, potentially relabeled at any time) instance number Pacemaker assigned to the clone instance running on that node.

And that is exactly what your patch does:
* detect whether a version of Pacemaker is in use that attaches the instance number to the resource id
* if so, do the loop over all possible instance numbers as before
* if not, only set the master score on the resource-id

Is my understanding correct? Then I think your patch is good.

Yes, the patch seems good then. Though there is quite a bit of code repetition. The set-attribute part should be moved to an extra function.

Still, other resource agents that use master scores (or any other attributes that reference instance numbers of anonymous clones) need to be reviewed.

Though this "I'll set scores for other instances, not only myself" logic is unique to pgsql, so most other resource agents should just work with whatever is present in the environment; they typically treat $OCF_RESOURCE_INSTANCE as opaque.

Seems like no other RA uses instance numbers. However, quite a few use OCF_RESOURCE_INSTANCE which, in case of clone/ms resources, may potentially lead to unpredictable results on upgrade to 1.1.8.

No. Otherwise all the regression tests would fail. The PE is smart enough to find promotion scores and failcounts in either case.

Cool.

Also, OCF_RESOURCE_INSTANCE contains whatever the local lrmd knows the resource as, not what we call it internally to the PE.

What I meant was that some RAs use OCF_RESOURCE_INSTANCE to name local files which keep some kind of state. If OCF_RESOURCE_INSTANCE changes on upgrade... Well, I guess the worst that can happen is for the probe to fail.

Right. But only for attach/reattach. And people should have maintenance-mode enabled at the point the probe is run, so there is time to fix things up before the cluster does anything about it.
But I didn't take a closer look.

Thanks,
Dejan

Thanks,
Lars

Cheers,
Dejan
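The version detection the patch is described as doing could be sketched like this (hypothetical helper name; the real pgsql RA may implement it differently):

```shell
#!/bin/sh
# Sketch: decide whether this Pacemaker still appends ":N" instance numbers
# to anonymous clones (pre-1.1.8 behaviour), based on the name the cluster
# hands the agent in OCF_RESOURCE_INSTANCE.
has_instance_suffix() {
    case "$1" in
        *:[0-9]*) return 0 ;;  # e.g. "p_pgsql:1" -> loop over all instances
        *)        return 1 ;;  # e.g. "p_pgsql"   -> set score on resource id
    esac
}

if has_instance_suffix "p_pgsql:1"; then
    echo "old style: set master score for every resource-id:N"
fi
if ! has_instance_suffix "p_pgsql"; then
    echo "new style: set master score on the resource id only"
fi
```

This matches the review's summary: loop over all instance numbers on pre-1.1.8 clusters, otherwise set the single score on the resource id.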
Re: [Linux-ha-dev] [resource-agents] Low: pgsql: check existence of instance number in replication mode (#159)
On Thu, Oct 25, 2012 at 10:01 PM, Takatoshi MATSUO matsuo@gmail.com wrote:

Usually, we use the crm_master command instead of crm_attribute to change our own master score in an RA. But PostgreSQL's Slave can't get its own replication status, so the Master changes the Slave's master-score using the instance number on Pacemaker 1.0.x. This is probably not ordinary usage.

Ouch! No, not ordinary (or recommended) at all :-)

What does the crm_attribute command line look like? Maybe the --node option could help?

So if pgsql thinks it needs these instance numbers, maybe it is not so anonymous a clone after all? Would the existing resource agent work with globally-unique=true?

No, I use it with false and it doesn't need true.

--
Takatoshi MATSUO

2012/10/25 Lars Ellenberg lars.ellenb...@linbit.com:

On Thu, Oct 25, 2012 at 01:24:40AM -0700, Takatoshi MATSUO wrote:

check existence of instance number in replication mode because Pacemaker 1.1.8 or higher does not append instance numbers.

I think this is wrong. It seems this became necessary because of

commit 427c7fe6ea94a566aaa714daf8d214290632f837
Author: Andrew Beekhof and...@beekhof.net
Date:   Fri Jul 13 13:37:42 2012 +1000

    High: PE: Do not append instance numbers to anonymous clones

    Benefits:
    - they shouldnt have been exposed in the first place, but I didnt know how not to back then
    - if admins don't know what they are, they can't be misunderstood or misused
    - more reliable failcount and promotion scores (since you dont have to check for all possible permutations)
    - smaller status section since there cant be entries for each possible :N suffix
    - the name in the config corresponds to the resource in the logs

So if pgsql thinks it needs these instance numbers, maybe it is not so anonymous a clone, after all? Would the existing resource agent work with globally-unique=true?
Lars

You can merge this Pull Request by running:

  git pull https://github.com/t-matsuo/resource-agents check-instance-number

Or you can view, comment on it, or merge it online at:

  https://github.com/ClusterLabs/resource-agents/pull/159

-- Commit Summary --

* Low: pgsql: check existence of instance number in replication mode

-- File Changes --

M heartbeat/pgsql (44)

-- Patch Links --

https://github.com/ClusterLabs/resource-agents/pull/159.patch
https://github.com/ClusterLabs/resource-agents/pull/159.diff

---
Reply to this email directly or view it on GitHub:
https://github.com/ClusterLabs/resource-agents/pull/159

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
Re: [Linux-ha-dev] [resource-agents] Low: pgsql: check existence of instance number in replication mode (#159)
On Fri, Oct 26, 2012 at 12:52 AM, Dejan Muhamedagic de...@suse.de wrote:
On Thu, Oct 25, 2012 at 06:09:38AM -0700, Lars Ellenberg wrote:
On Thu, Oct 25, 2012 at 03:38:47AM -0700, Takatoshi MATSUO wrote:

Usually, we use the crm_master command instead of crm_attribute to change the master score in an RA. But PostgreSQL's slave can't get its own replication status, so the Master changes the Slave's master-score using the instance number on Pacemaker 1.0.x. This is probably not ordinary usage.

Would the existing resource agent work with globally-unique=true?

I don't know whether it works with true. I use it with false and it doesn't need true.

I suggested that you actually should use globally-unique clones, as in that case you still get those instance numbers...

Does using different clones make sense in pgsql? What is to be different between them? Or would it be just for the sake of getting instance numbers? If so, then it somehow looks wrong to me :)

But thinking about it once more, I'm not so sure anymore. Correct me where I'm wrong. This is about the master score. In case the Master instance fails, we preferably want to promote the slave instance that is as close as possible to the Master. We only know which *node* was best at the last monitoring interval, which may be good enough. We need to then change the master score for *all possible instances*, for all nodes, accordingly. Which is what that loop did. (I think skipping the current instance is actually a bug; if Pacemaker relabels things in a bad way, you may hit it.)

Now, with Pacemaker 1.1.8, all instances become equal (for anonymous clones, aka globally-unique=false), and we only need to set the score on the resource-id, not for all resource-id:instance combinations.

OK. Which is great.

After all, the master score in this case is attached to the node (or the data set accessible from that node), and not to the (arbitrary, potentially relabeled at any time) instance number Pacemaker assigned to the clone instance running on that node.
And that is exactly what your patch does:
* detect whether a version of Pacemaker is in use that attaches the instance number to the resource id
* if so, do the loop over all possible instance numbers as before
* if not, only set the master score on the resource-id

Is my understanding correct? Then I think your patch is good.

Yes, the patch seems good then. Though there is quite a bit of code repetition. The set-attribute part should be moved to an extra function.

Still, other resource agents that use master scores (or any other attributes that reference instance numbers of anonymous clones) need to be reviewed.

Though this "I'll set scores for other instances, not only myself" logic is unique to pgsql, so most other resource agents should just work with whatever is present in the environment; they typically treat $OCF_RESOURCE_INSTANCE as opaque.

Seems like no other RA uses instance numbers. However, quite a few use OCF_RESOURCE_INSTANCE which, in case of clone/ms resources, may potentially lead to unpredictable results on upgrade to 1.1.8.

No. Otherwise all the regression tests would fail. The PE is smart enough to find promotion scores and failcounts in either case.

Also, OCF_RESOURCE_INSTANCE contains whatever the local lrmd knows the resource as, not what we call it internally to the PE.

Thanks,
Lars

Cheers,
Dejan
Re: [Linux-ha-dev] [resource-agents] Low: pgsql: check existence of instance number in replication mode (#159)
On Fri, Oct 26, 2012 at 12:49 PM, Takatoshi MATSUO matsuo@gmail.com wrote:
2012/10/26 Andrew Beekhof and...@beekhof.net:
On Thu, Oct 25, 2012 at 10:01 PM, Takatoshi MATSUO matsuo@gmail.com wrote:

Usually, we use the crm_master command instead of crm_attribute to change our own master score in an RA. But PostgreSQL's Slave can't get its own replication status, so the Master changes the Slave's master-score using the instance number on Pacemaker 1.0.x. This is probably not ordinary usage.

Ouch! No, not ordinary (or recommended) at all :-)

What does the crm_attribute command line look like? Maybe the --node option could help?

# crm_attribute -l reboot -N pm02 -n master-pgsql:1 -v 1000

That looks fine, just drop the :1 (or use whatever is in OCF_RESOURCE_INSTANCE).

This line uses crm_master as a reference. I would like crm_master to have a parameter which can set the hostname.

Probably not going to happen. crm_master is a convenience function for the common use case. It's fine to switch to crm_attribute for advanced usage.

But crm_master gets the hostname using the crm_node -n command these days, so I think that I should fix the method of getting the hostname for the next version. It also needs compatibility code for Pacemaker 1.0.x :(

So if pgsql thinks it needs these instance numbers, maybe it is not so anonymous a clone after all? Would the existing resource agent work with globally-unique=true?

No, I use it with false and it doesn't need true.

--
Takatoshi MATSUO

2012/10/25 Lars Ellenberg lars.ellenb...@linbit.com:

On Thu, Oct 25, 2012 at 01:24:40AM -0700, Takatoshi MATSUO wrote:

check existence of instance number in replication mode because Pacemaker 1.1.8 or higher does not append instance numbers.

I think this is wrong.
It seems this became necessary because of

commit 427c7fe6ea94a566aaa714daf8d214290632f837
Author: Andrew Beekhof and...@beekhof.net
Date:   Fri Jul 13 13:37:42 2012 +1000

    High: PE: Do not append instance numbers to anonymous clones

    Benefits:
    - they shouldnt have been exposed in the first place, but I didnt know how not to back then
    - if admins don't know what they are, they can't be misunderstood or misused
    - more reliable failcount and promotion scores (since you dont have to check for all possible permutations)
    - smaller status section since there cant be entries for each possible :N suffix
    - the name in the config corresponds to the resource in the logs

So if pgsql thinks it needs these instance numbers, maybe it is not so anonymous a clone, after all? Would the existing resource agent work with globally-unique=true?

Lars

You can merge this Pull Request by running:

  git pull https://github.com/t-matsuo/resource-agents check-instance-number

Or you can view, comment on it, or merge it online at:

  https://github.com/ClusterLabs/resource-agents/pull/159

-- Commit Summary --

* Low: pgsql: check existence of instance number in replication mode

-- File Changes --

M heartbeat/pgsql (44)

-- Patch Links --

https://github.com/ClusterLabs/resource-agents/pull/159.patch
https://github.com/ClusterLabs/resource-agents/pull/159.diff

---
Reply to this email directly or view it on GitHub:
https://github.com/ClusterLabs/resource-agents/pull/159

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

--
Thanks,
Takatoshi MATSUO
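Following Beekhof's advice in this thread, the same score would be set without the instance suffix; a sketch reusing the node name and score from the example above:

```
# old form, tied to an instance number that Pacemaker 1.1.8+ no longer uses:
crm_attribute -l reboot -N pm02 -n master-pgsql:1 -v 1000

# advised form -- drop the ":1", or reuse the name the cluster provided
# in OCF_RESOURCE_INSTANCE:
crm_attribute -l reboot -N pm02 -n master-pgsql -v 1000
crm_attribute -l reboot -N pm02 -n "master-${OCF_RESOURCE_INSTANCE}" -v 1000
```

The -N option is what lets the Master set the score on the peer node, which is why plain crm_master (which always targets the local node) does not fit this RA's unusual pattern.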
Re: [Linux-ha-dev] [Patch] The problem that the digest code of crmd becomes mismatched.
On Wed, Oct 10, 2012 at 11:21 PM, Dejan Muhamedagic de...@suse.de wrote:

Hi Hideo-san,

On Wed, Oct 10, 2012 at 03:22:08PM +0900, renayama19661...@ybb.ne.jp wrote:

Hi All,

We found that Pacemaker could not judge the result of an lrmd operation correctly. When we run the following crm configuration, a parameter of the start operation is given back to crmd as a result of the monitor operation:

(snip)
primitive prmDiskd ocf:pacemaker:Dummy \
	params name="diskcheck_status_internal" device="/dev/vda" interval="30" \
	op start interval="0" timeout="60s" on-fail="restart" prereq="fencing" \
	op monitor interval="30s" timeout="60s" on-fail="restart" \
	op stop interval="0s" timeout="60s" on-fail="block"
(snip)

This is because lrmd gives back the prereq parameter of start as part of the result of the monitor operation. As a result, crmd judges the parameters of the monitor operation it asked lrmd to run as mismatched against the parameters lrmd reports for that monitor operation.

We can confirm this problem with the following commands in Pacemaker 1.0.12.

Command 1) The crm_verify command outputs the difference in the digest code:

[root@rh63-heartbeat1 ~]# crm_verify -L
crm_verify[19988]: 2012/10/10_20:29:58 CRIT: check_action_definition: Parameters to prmDiskd:0_monitor_3 on rh63-heartbeat1 changed: recorded 7d7c9f601095389fc7cc0c6b29c61a7a vs. d38c85388dea5e8e2568c3d699eb9cce (reload:3.0.1) 0:0;6:1:0:ca6a5ad2-0340-4769-bab7-289a00862ba6

Command 2) The ptest command outputs the difference in the digest code, too:

[root@rh63-heartbeat1 ~]# ptest -L -VV
ptest[19992]: 2012/10/10_20:30:19 WARN: unpack_nodes: Blind faith: not fencing unseen nodes
ptest[19992]: 2012/10/10_20:30:19 CRIT: check_action_definition: Parameters to prmDiskd:0_monitor_3 on rh63-heartbeat1 changed: recorded 7d7c9f601095389fc7cc0c6b29c61a7a vs. d38c85388dea5e8e2568c3d699eb9cce (reload:3.0.1) 0:0;6:1:0:ca6a5ad2-0340-4769-bab7-289a00862ba6
[root@rh63-heartbeat1 ~]#

Command 3) After a cibadmin -B command, pengine restarts the monitor of a resource unnecessarily:
Oct 10 20:31:00 rh63-heartbeat1 pengine: [19842]: CRIT: check_action_definition: Parameters to prmDiskd:0_monitor_3 on rh63-heartbeat1 changed: recorded 7d7c9f601095389fc7cc0c6b29c61a7a vs. d38c85388dea5e8e2568c3d699eb9cce (reload:3.0.1) 0:0;6:1:0:ca6a5ad2-0340-4769-bab7-289a00862ba6
Oct 10 20:31:00 rh63-heartbeat1 pengine: [19842]: notice: RecurringOp: Start recurring monitor (30s) for prmDiskd:0 on rh63-heartbeat1
Oct 10 20:31:00 rh63-heartbeat1 pengine: [19842]: notice: LogActions: Leave resource prmDiskd:0#011(Started rh63-heartbeat1)
Oct 10 20:31:00 rh63-heartbeat1 crmd: [19839]: info: do_state_transition: State transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Oct 10 20:31:00 rh63-heartbeat1 crmd: [19839]: info: unpack_graph: Unpacked transition 2: 1 actions in 1 synapses
Oct 10 20:31:00 rh63-heartbeat1 crmd: [19839]: info: do_te_invoke: Processing graph 2 (ref=pe_calc-dc-1349868660-20) derived from /var/lib/pengine/pe-input-2.bz2
Oct 10 20:31:00 rh63-heartbeat1 crmd: [19839]: info: te_rsc_command: Initiating action 1: monitor prmDiskd:0_monitor_3 on rh63-heartbeat1 (local)
Oct 10 20:31:00 rh63-heartbeat1 crmd: [19839]: info: do_lrm_rsc_op: Performing key=1:2:0:ca6a5ad2-0340-4769-bab7-289a00862ba6 op=prmDiskd:0_monitor_3 )
Oct 10 20:31:00 rh63-heartbeat1 lrmd: [19836]: info: cancel_op: operation monitor[4] on prmDiskd:0 for client 19839, its parameters: CRM_meta_clone=[0] CRM_meta_prereq=[fencing] device=[/dev/vda] name=[diskcheck_status_internal] CRM_meta_clone_node_max=[1] CRM_meta_clone_max=[1] CRM_meta_notify=[false] CRM_meta_globally_unique=[false] crm_feature_set=[3.0.1] interval=[30] prereq=[fencing] CRM_meta_on_fail=[restart] CRM_meta_name=[monitor] CRM_meta_interval=[3] CRM_meta_timeout=[6] cancelled
Oct 10 20:31:00 rh63-heartbeat1 lrmd: [19836]: info: rsc:prmDiskd:0 monitor[5] (pid 20009)
Oct 10 20:31:00 rh63-heartbeat1 crmd: [19839]: info: process_lrm_event: LRM operation prmDiskd:0_monitor_3 (call=4, status=1, cib-update=0, confirmed=true) Cancelled
Oct 10 20:31:00 rh63-heartbeat1 lrmd: [19836]: info: operation monitor[5] on prmDiskd:0 for client 19839: pid 20009 exited with return code 0
Oct 10 20:31:00 rh63-heartbeat1 crmd: [19839]: info: append_digest: yamauchi Calculated digest 7d7c9f601095389fc7cc0c6b29c61a7a for prmDiskd:0_monitor_3 (0:0;1:2:0:ca6a5ad2-0340-4769-bab7-289a00862ba6). Source: parameters device=/dev/vda name=diskcheck_status_internal interval=30 prereq=fencing CRM_meta_timeout=6/
Oct 10 20:31:00 rh63-heartbeat1 crmd: [19839]: info: process_lrm_event: LRM operation prmDiskd:0_monitor_3 (call=5, rc=0, cib-update=53, confirmed=false) ok
Oct 10 20:31:00 rh63-heartbeat1 crmd: [19839]: info: match_graph_event: Action prmDiskd:0_monitor_3 (1)
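To illustrate how one stray parameter produces mismatched digests (md5sum here is purely illustrative; Pacemaker's actual digest input and the values in the logs differ from these):

```shell
#!/bin/sh
# Hash the monitor parameter set with and without the leaked start-only
# "prereq=fencing" parameter: any extra key changes the digest, which is
# what makes crmd/pengine report "Parameters ... changed".
without=$(printf 'device=/dev/vda name=diskcheck_status_internal interval=30' | md5sum | cut -d' ' -f1)
with=$(printf 'device=/dev/vda name=diskcheck_status_internal interval=30 prereq=fencing' | md5sum | cut -d' ' -f1)
echo "without prereq: $without"
echo "with prereq:    $with"
if [ "$without" != "$with" ]; then
    echo "digests differ -> spurious 'parameters changed', monitor restarted"
fi
```

This mirrors the report: the recorded digest was computed without the leaked parameter, the recalculated one with it, so every comparison fails even though nothing was really reconfigured.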
Re: [Linux-ha-dev] Slight bending of OCF specs: Re: Issues found in Apache resource agent
On 06/09/2012, at 12:30 AM, Lars Marowsky-Bree l...@suse.com wrote: On 2012-09-05T15:25:44, Dejan Muhamedagic de...@suse.de wrote: How about a new element? Something like

primitive vm1 ocf:heartbeat:VirtualDomain
require vm1 web-test dns-test

How we map this into Pacemaker's dependency scheme is obviously open to discussion. The require would imply that the resource vm1 requires the monitors of web-test and dns-test to succeed, in addition to its own monitor (if defined). Perhaps. But an as-a-whole attribute for the group's restart handling might already be enough, since we would want the system to eventually stabilize at the same state it reaches today (that is, with the group brought up to the last non-failing resource; otherwise, admins couldn't log in to the VM to fix the problem). Those two requirements seem at odds with each other. I doubt it would end well. I suspect you really want the restart everything trigger to be attached to the monitor-only resource (at the end). Monitor ops of web-test and dns-test will run only on the node where vm1 is started. They could also get the environment (parameters) of vm1. That's implicit in the group. Internally, this could indeed map to a symmetric or whatever aspect of the order dependency, yes, that could be set for the whole group. monocf may be just like ocf, sans start and stop operations. That would make all ocf RAs eligible for this use. None of the current resource agents would be able to cope with the use case I suggested, because they expect to run in the OS image where the service is provided - the idea of using the icinga/nagios plugins is exactly that they don't have this requirement, and thus can monitor the VM externally. For OCF agents, this sort-of already exists: meta is-managed=false. Regards, Lars -- Architect Storage/HA SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg) Experience is the name everyone gives to their mistakes. 
-- Oscar Wilde ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] fence_legacy patch
Thanks Piotr, I've applied your patch and it will be in 1.1.8 Sorry for the delay. On Thu, Aug 23, 2012 at 12:04 AM, Chmylkowski, Piotr piotr.chmylkow...@atos.net wrote: Dear HA-dev I have implemented vcenter fencing, but the following patch was required to get it working. The problem was: the variable HOSTLIST was not passed correctly from the crm configuration to the vcenter stonith script. In the crm configuration, HOSTLIST was something like HOSTLIST=hostname1=vm1;hostname2=vm2 but HOSTLIST was passed to the vcenter script as HOSTLIST=hostname1

--- 8< ---
diff -uN /usr/sbin/fence_legacy /usr/sbin/fence_legacy-patched
--- /usr/sbin/fence_legacy 2011-08-25 18:09:42.0 +0200
+++ /usr/sbin/fence_legacy-patched 2012-08-22 11:55:59.0 +0200
@@ -83,7 +83,7 @@
 $opt=$_;
 next unless $opt;
-($name,$val)=split /\s*=\s*/, $opt;
+($name,$val)=split /\s*=\s*/, $opt, 2;
 if ( $name eq ) {
--- 8< ---

Regards, Piotr
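The effect of the patch - limiting Perl's split to at most two fields - can be illustrated outside Perl. A rough Python equivalent of the before/after behaviour (the sample HOSTLIST value is taken from the report above):

```python
import re

opt = "HOSTLIST=hostname1=vm1;hostname2=vm2"

# Unlimited split breaks the value at every '=', so taking only the
# first two fields loses everything after the first embedded '='.
name, val = re.split(r"\s*=\s*", opt)[:2]
print(val)  # -> "hostname1"

# Splitting at most once (Perl's third argument to split; Python's
# maxsplit) keeps the embedded '=' characters in the value intact.
name, val = re.split(r"\s*=\s*", opt, maxsplit=1)
print(val)  # -> "hostname1=vm1;hostname2=vm2"
```

This reproduces exactly the symptom reported: the script received "hostname1" instead of the full host-to-VM mapping.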
Re: [Linux-ha-dev] apply_xml_diff: Digest mis-match
On Mon, Aug 13, 2012 at 11:39 PM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: Hi! In pacemaker-1.1.6-1.29.1 (SLES11 SP2 x86_64) I see this for an idle cluster with just one stonith resource running, when doing some unrelated change: Aug 13 15:33:19 h3 cib: [31938]: info: apply_xml_diff: Digest mis-match: expected 466ee3cf78eec0772f78b6fc965e9601, calculated e8d85a1134f84f8b6eb8dff8ff598f71 Aug 13 15:33:19 h3 cib: [31938]: notice: cib_process_diff: Diff 0.14.77 -> 0.15.1 not applied to 0.14.77: Failed application of an update diff Aug 13 15:33:19 h3 cib: [31938]: info: cib_server_process_diff: Requesting re-sync from peer I have the feeling there's something wrong in either generating the diff, or applying the diff. I would only expect this if either: - you're changing the order of resources in a group (which you aren't), or - a node first comes online and sees the other node. In the latter case, the diff applies cleanly, but the digest alerts us to the fact that we are missing other (unrelated) changes. Regards, Ulrich
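The point about the digest catching missing unrelated changes can be sketched with a toy model (this is illustrative only, not Pacemaker's CIB code; the dict keys and digest function are invented for the example):

```python
import hashlib

def digest(cib):
    """Toy stand-in for the CIB digest: hash the whole document."""
    return hashlib.md5(repr(sorted(cib.items())).encode()).hexdigest()

# The sender's CIB carries an extra change the receiver never saw.
sender = {"version": "0.15.1", "stonith-enabled": "true", "extra": "x"}
receiver = {"version": "0.14.77", "stonith-enabled": "true"}

diff = {"version": "0.15.1"}   # the only change this diff carries
expected = digest(sender)      # digest of the sender's full CIB

receiver.update(diff)          # the diff itself applies cleanly...
# ...but the digest mismatch reveals the missing "extra" change and
# forces the receiver to request a full re-sync from its peer.
assert digest(receiver) != expected
```

A plain textual diff application would have succeeded silently here; the whole-document digest is what makes the divergence detectable.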
Re: [Linux-ha-dev] STONITH agent for SoftLayer API
On Fri, Jun 8, 2012 at 1:19 PM, Alan Robertson al...@unix.sh wrote: Red Hat invented their own API, then disabled the working API in their version of the code. Of course, they don't have as many agents, and they're not as well tested. Red Hat has had their own API for a very long time - certainly long before Pacemaker was added to RHEL (the LHA agents never appeared there, so your timeline is way off). By my count there are at least 45 agents (more than double the number of non-external agents shipping in cluster-glue), and since RH doesn't ship things they don't test, "not as well tested" is doubtful at best.
Re: [Linux-ha-dev] Pacemaker and conntrackd RA not obeying colocation constraint
On Thu, Jun 7, 2012 at 5:37 PM, aldo sarmiento sarmi...@gmail.com wrote: Hello, I'm having a problem getting the conntrackd ms to work with a colocation constraint. I want to have conntrackd Master only on the node that has an IPaddr2 primitive running on it. Here are my specs:

Ubuntu: 12.04
corosync: 1.4.2
crm: 1.1.6
conntrackd: 1.0.0

Here is my configuration (notice I weighted fw02 higher than fw01 to test my failover):

node fw01 \
        attributes firewall=100
node fw02 \
        attributes firewall=150
primitive lip ocf:heartbeat:IPaddr2 \
        meta target-role=Started \
        params ip=192.168.100.2 cidr_netmask=24 nic=eth0
primitive updated-conntrackd ocf:updates:conntrackd
ms conntrackd updated-conntrackd \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Master
location conntrackd-run conntrackd \
        rule $id=conntrackd-run-rule-0 -inf: not_defined firewall or firewall number:lte 0 \
        rule $id=conntrackd-run-rule-1 firewall: defined firewall
location lip-loc lip \
        rule $id=lip-loc-rule-0 -inf: not_defined firewall or firewall number:lte 0 \
        rule $id=lip-loc-rule-1 firewall: defined firewall
colocation conntrackd-loc inf: conntrackd:Master lip:Started
property $id=cib-bootstrap-options \
        dc-version=1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c \
        cluster-infrastructure=openais \
        expected-quorum-votes=2 \
        symmetric-cluster=false \
        no-quorum-policy=ignore \
        stonith-enabled=false \
        last-lrm-refresh=1339040513

So based on my configuration above, I would expect conntrackd to be Master on fw02, but it's not:

root@fw01:~# crm status
Last updated: Wed Jun 6 20:46:55 2012
Last change: Wed Jun 6 20:41:53 2012 via crmd on fw01
Stack: openais
Current DC: fw01 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
3 Resources configured.
Online: [ fw01 fw02 ]
 Master/Slave Set: conntrackd [updated-conntrackd]
     Masters: [ fw01 ]
     Slaves: [ fw02 ]
 lip (ocf::heartbeat:IPaddr2): Started fw02

Your config looks right to me. Can you attach the output of cibadmin -Ql when the cluster is in this state?

Interesting thing is... if I add lip to a group with another primitive and run the same logic, failover works just fine:

root@fw01:~# crm configure show
node fw01 \
        attributes firewall=100
node fw02 \
        attributes firewall=150
primitive lip ocf:heartbeat:IPaddr2 \
        params ip=192.168.100.2 cidr_netmask=24 nic=eth0
primitive lip2 ocf:heartbeat:IPaddr2 \
        params ip=192.168.100.101 cidr_netmask=24 nic=eth0
primitive updated-conntrackd ocf:updates:conntrackd
group gateway lip lip2 \
        meta target-role=Started
ms conntrackd updated-conntrackd \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Master
location conntrackd-run conntrackd \
        rule $id=conntrackd-run-rule-0 -inf: not_defined firewall or firewall number:lte 0 \
        rule $id=conntrackd-run-rule-1 firewall: defined firewall
location gateway-loc gateway \
        rule $id=lip-loc-rule-0 -inf: not_defined firewall or firewall number:lte 0 \
        rule $id=lip-loc-rule-1 firewall: defined firewall
colocation conntrackd-loc inf: conntrackd:Master gateway:Started
property $id=cib-bootstrap-options \
        dc-version=1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c \
        cluster-infrastructure=openais \
        expected-quorum-votes=2 \
        symmetric-cluster=false \
        no-quorum-policy=ignore \
        stonith-enabled=false \
        last-lrm-refresh=1339041080

root@fw01:~# crm status
Last updated: Wed Jun 6 20:52:17 2012
Last change: Wed Jun 6 20:52:03 2012 via cibadmin on fw01
Stack: openais
Current DC: fw01 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
4 Resources configured.
Online: [ fw01 fw02 ]
 Master/Slave Set: conntrackd [updated-conntrackd]
     Masters: [ fw02 ]
     Slaves: [ fw01 ]
 Resource Group: gateway
     lip (ocf::heartbeat:IPaddr2): Started fw02
     lip2 (ocf::heartbeat:IPaddr2): Started fw02

Thanks, Aldo
Re: [Linux-ha-dev] [rfc] SBD with Pacemaker/Quorum integration
On Sat, May 26, 2012 at 5:56 AM, Lars Marowsky-Bree l...@suse.com wrote: On 2012-05-25T21:44:25, Florian Haas flor...@hastexo.com wrote: If so, the master thread will not self-fence even if the majority of devices is currently unavailable. That's it, nothing more. Does that help? It does. One naive question: what's the rationale of tying in with Pacemaker's view of things? Couldn't you just consume the quorum and membership information from Corosync alone? Yes and no. On SLE HA 11 (which, alas, is still the prime motivator for this), corosync actually gets that state from Pacemaker. And, ultimately, it is Pacemaker's belief (from the CIB) that pengine bases its fencing decisions on, so that's where we need to look. Further, quorum isn't enough. If we have quorum, the local node could still be dirty (as in: stop failures, unclean, ...) that imply that it should self-fence, pronto. Since this overrides the decision to self-fence if the devices are gone, and thus a real poison pill may no longer be delivered, we must take steps to minimize that risk. But yes, what it does now is to sign in both with corosync/ais and the CIB, querying quorum state from both. Fun anecdote, I originally thought being notification-driven might be good enough - until the testers started SIGSTOPping corosync/cib and complaining that the pacemaker watcher didn't pick up on that ;-) I know this is bound to have some holes. It can't perform a comprehensive health check of pacemaker's stack; yet, this only matters for as long as the loss of devices persists. During that degraded phase, the system is a bit more fragile. I'm a bit weary of this, because I'm *sure* these will all get reported one after another and further contribute to the code obfuscation, but such is reality ... (I have opinions on particularly the last failure mode. 
This seems to arise specifically when customers have built setups with two HBAs, two SANs, two storages, but then cross-linked the SANs, connected the HBAs to each, and the storages too. That seems to frequently lead to hiccups where the *entire* fabric is affected. I'm thinking this cross-linking is a case of sham redundancy; it *looks* as if it makes things more redundant, but in reality it reduces redundancy, since faults are no longer independent. Alas, they've not wanted to change that.) Henceforth, I'm going to dangle this thread in front of everyone who believes their SAN can never fail. Thanks. :) Heh. Please dangle it in front of them and explain the benefits of separation/isolation to them. ;-) If they followed our recommendation - 2 independent SANs, and a third iSCSI device over the network (okok, effectively that makes 3 SANs) - they'd never experience this. (Since that's how my lab is actually set up, I had some trouble following the problems they reported initially. Oh, and *don't* get me started on async IO handling in Linux.) Are there any SUSEisms in SBD or would you expect it to be packageable on any platform? Should be packageable on every platform, though I admit that I've not tried building the pacemaker module against anything but the corosync+pacemaker+openais stuff we ship on SLE HA 11 so far. I assume that this may need further work; at least the places I stole code from had special treatment. And the source code to crm_node (ccm_epoche.c) ... I *think* this may indicate opportunities for improving the client libraries in pacemaker to hide all that stuff better. Yep, suggestions are welcome. In theory it shouldn't be required, but in practice there are so many membership/quorum combinations that sadly the compatibility code has become worthy of a real API.
Re: [Linux-ha-dev] heartbeat gmain source priority inversion with rexmit and dead node detection
On Sat, Apr 28, 2012 at 12:11 AM, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661...@ybb.ne.jp wrote: Hi All, We ran a test that assumed a remote cluster environment, and we tested packet loss. You may be interested in this patch I have had lying around for ages. It may be incomplete for one corner case: on a seriously misconfigured and overloaded system, I have seen reports of a single send_local_status() (that is, basically one single send_cluster_msg()) which took longer to execute than deadtime (without even returning to the mainloop!). This corner case should be handled with a watchdog. But without a watchdog, and without stonith, the CCM was confused, because one node saw a leave then re-join after the partition event, while the other node did not even notice it had left and rejoined the membership... and pacemaker ended up being DC on both :-/ A side effect of the ccm being really confused, I assume? So I guess send_local_status() could do with an explicit call to check_for_timeouts(), but that may need recursion protection. I should really polish and push my queue some day soon... Cheers,

diff --git a/heartbeat/hb_rexmit.c b/heartbeat/hb_rexmit.c
--- a/heartbeat/hb_rexmit.c
+++ b/heartbeat/hb_rexmit.c
@@ -168,6 +168,7 @@ send_rexmit_request( gpointer data)
 	if (STRNCMP_CONST(node->status, UPSTATUS) != 0
 	&& STRNCMP_CONST(node->status, ACTIVESTATUS) != 0) {
 		/* no point requesting rexmit from a dead node. */
+		g_hash_table_remove(rexmit_hash_table, ri);
 		return FALSE;
 	}
@@ -243,7 +244,7 @@ schedule_rexmit_request(struct node_info
 	ri->seq = seq;
 	ri->node = node;
-	sourceid = Gmain_timeout_add_full(G_PRIORITY_HIGH - 1, delay,
+	sourceid = Gmain_timeout_add_full(PRI_REXMIT, delay,
 		send_rexmit_request, ri, NULL);
 	G_main_setall_id(sourceid, "retransmit request", config->heartbeat_ms/2, 10);
diff --git a/heartbeat/heartbeat.c b/heartbeat/heartbeat.c
--- a/heartbeat/heartbeat.c
+++ b/heartbeat/heartbeat.c
@@ -1585,7 +1585,7 @@ master_control_process(void)
 	send_local_status();
-	if (G_main_add_input(G_PRIORITY_HIGH, FALSE,
+	if (G_main_add_input(PRI_POLL, FALSE,
 		&polled_input_SourceFuncs) == NULL) {
 		cl_log(LOG_ERR, "master_control_process: G_main_add_input failed");
 	}
diff --git a/include/hb_api_core.h b/include/hb_api_core.h
--- a/include/hb_api_core.h
+++ b/include/hb_api_core.h
@@ -40,6 +40,12 @@
 #define PRI_READPKT (PRI_SENDPKT+1)
 #define PRI_FIFOMSG (PRI_READPKT+1)
+/* PRI_POLL is where the timeout checks on deadtime happen.
+ * Better be sure rexmit requests for lost packets
+ * from a now dead node do not preempt detecting it as being dead. */
+#define PRI_POLL (G_PRIORITY_HIGH)
+#define PRI_REXMIT PRI_POLL
+
 #define PRI_CHECKSIGS (G_PRIORITY_DEFAULT)
 #define PRI_FREEMSG (PRI_CHECKSIGS+1)
 #define PRI_CLIENTMSG (PRI_FREEMSG+1)
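The priority inversion the patch fixes can be shown with a toy event loop. This is a sketch, not GLib or heartbeat code: it only models GLib's rule that, per iteration, only sources at the best ready priority (lowest number) are dispatched, so lower-priority sources can be starved indefinitely. PRI_HIGH is a stand-in value:

```python
def one_iteration(sources):
    """Dispatch every source at the best ready priority, as GLib does;
    lower-priority sources are skipped entirely this iteration."""
    best = min(s["prio"] for s in sources)
    return [s["name"] for s in sources if s["prio"] == best]

PRI_HIGH = 100  # stand-in for G_PRIORITY_HIGH

# Before the patch: rexmit timers ran at G_PRIORITY_HIGH - 1 and so
# outranked the deadtime poll; as long as retransmit requests for lost
# packets kept firing, the poll that would declare the peer dead never ran.
before = [{"name": "rexmit", "prio": PRI_HIGH - 1},
          {"name": "deadtime-poll", "prio": PRI_HIGH}]
assert one_iteration(before) == ["rexmit"]

# After the patch: PRI_REXMIT == PRI_POLL, so both classes of source
# dispatch in the same iteration and dead-node detection is not starved.
after = [{"name": "rexmit", "prio": PRI_HIGH},
         {"name": "deadtime-poll", "prio": PRI_HIGH}]
assert one_iteration(after) == ["rexmit", "deadtime-poll"]
```

This is exactly the comment added to hb_api_core.h: rexmit requests for packets from a now-dead node must not preempt detecting that node as dead.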
Re: [Linux-ha-dev] [PATCH] Medium: Use the resource timeout as an override to the default dbus timeout for upstart RA
On Sat, Feb 18, 2012 at 12:00 AM, Ante Karamatic iv...@ubuntu.com wrote: On 17.02.2012 11:20, Andrew Beekhof wrote: Tangential question... but does upstart also implement the service binary? As in "service pacemaker start"? It does, but the exit status is always '0', which makes the 'service' binary unusable for monitoring the status of the service without parsing the command output.

10 head
20 desk
30 add
40 goto 10
Re: [Linux-ha-dev] [PATCH] Medium: Use the resource timeout as an override to the default dbus timeout for upstart RA
Tangential question... but does upstart also implement the service binary? As in "service pacemaker start"? On Fri, Feb 17, 2012 at 6:52 PM, Ante Karamatic iv...@ubuntu.com wrote:

# HG changeset patch
# User Ante Karamatić ante.karama...@canonical.com
# Date 1329463546 -3600
# Node ID 097ca775d3740a94591fbe0dd50124a51f140fff
# Parent d8c154589a16cb99ab16f36a27756ba94eefdbee
Medium: Use the resource timeout as an override to the default dbus timeout for upstart RA

diff --git a/lib/plugins/lrm/raexecupstart.c b/lib/plugins/lrm/raexecupstart.c
--- a/lib/plugins/lrm/raexecupstart.c
+++ b/lib/plugins/lrm/raexecupstart.c
@@ -169,7 +169,7 @@
 	/* It'd be better if it returned GError, so we can distinguish
 	 * between failure modes (can't contact upstart, no such job,
 	 * or failure to do action. */
-	if (upstart_job_do(rsc_type, cmd)) {
+	if (upstart_job_do(rsc_type, cmd, timeout)) {
 		exit(EXECRA_OK);
 	} else {
 		exit(EXECRA_NO_RA);
diff --git a/lib/plugins/lrm/upstart-dbus.c b/lib/plugins/lrm/upstart-dbus.c
--- a/lib/plugins/lrm/upstart-dbus.c
+++ b/lib/plugins/lrm/upstart-dbus.c
@@ -319,7 +319,7 @@
 }

 gboolean
-upstart_job_do(const gchar *name, UpstartJobCommand cmd)
+upstart_job_do(const gchar *name, UpstartJobCommand cmd, const int timeout)
 {
 	DBusGConnection *conn;
 	DBusGProxy *manager;
@@ -342,7 +342,8 @@
 	switch (cmd) {
 	case UPSTART_JOB_START:
 		cmd_name = "Start";
-		dbus_g_proxy_call (job, cmd_name, &error,
+		dbus_g_proxy_call_with_timeout (job, cmd_name,
+			timeout, &error,
 			G_TYPE_STRV, no_args,
 			G_TYPE_BOOLEAN, TRUE,
 			G_TYPE_INVALID,
@@ -352,7 +353,8 @@
 		break;
 	case UPSTART_JOB_STOP:
 		cmd_name = "Stop";
-		dbus_g_proxy_call(job, cmd_name, &error,
+		dbus_g_proxy_call_with_timeout(job, cmd_name,
+			timeout, &error,
 			G_TYPE_STRV, no_args,
 			G_TYPE_BOOLEAN, TRUE,
 			G_TYPE_INVALID,
@@ -360,7 +362,8 @@
 	case UPSTART_JOB_RESTART:
 		cmd_name = "Restart";
-		dbus_g_proxy_call (job, cmd_name, &error,
+		dbus_g_proxy_call_with_timeout (job, cmd_name,
+			timeout, &error,
 			G_TYPE_STRV, no_args,
 			G_TYPE_BOOLEAN, TRUE,
 			G_TYPE_INVALID,
diff --git a/lib/plugins/lrm/upstart-dbus.h b/lib/plugins/lrm/upstart-dbus.h
--- a/lib/plugins/lrm/upstart-dbus.h
+++ b/lib/plugins/lrm/upstart-dbus.h
@@ -29,7 +29,7 @@
 } UpstartJobCommand;

 G_GNUC_INTERNAL gchar **upstart_get_all_jobs(void);
-G_GNUC_INTERNAL gboolean upstart_job_do(const gchar *name, UpstartJobCommand cmd);
+G_GNUC_INTERNAL gboolean upstart_job_do(const gchar *name, UpstartJobCommand cmd, const int timeout);
 G_GNUC_INTERNAL gboolean upstart_job_is_running (const gchar *name);

 #endif /* _UPSTART_DBUS_H_ */
Re: [Linux-ha-dev] [Patch] OCF_RESKEY_CRM_meta_timeout not matching monitor timeout meta-data.
On Thu, Dec 15, 2011 at 8:45 PM, renayama19661...@ybb.ne.jp wrote: Hi Dejan, Thank you for comment. It looks like a wrong place for a fix. Shouldn't crmd send all environment? It is only by chance that we have the timeout value available in this function. In the case of stop, crmd does not ask lrmd for the substitution of the parameter.

(snip)
	/* reset the resource's parameters? */
	if (op->interval == 0) {
		if (safe_str_eq(CRMD_ACTION_START, operation)
		    || safe_str_eq(CRMD_ACTION_STATUS, operation)) {
			op->copyparams = 1;
		}
	}
(snip)

When a resource parameter is changed, this affects how lrmd stops the resource; the changed parameter must not be copied. When stopping, you always want to use the old parameters (think of someone changing 'ip' for an IPaddr resource). Options that are interpreted by the crmd or lrmd are a different matter, which resulted in: https://github.com/ClusterLabs/pacemaker/commit/fcfe6fe522138343e4138248829926700fac213e My patch is an example of handling it in lrmd. Is there a better patch? * For example, it may be good to give copyparams a different value. Best Regards, Hideo Yamauchi. --- On Thu, 2011/12/15, Dejan Muhamedagic de...@suse.de wrote: Hi Hideo-san, On Thu, Dec 15, 2011 at 06:21:00PM +0900, renayama19661...@ybb.ne.jp wrote: Hi All, I made a patch which fixes the old problem below. * http://www.gossamer-threads.com/lists/linuxha/users/70262 In consideration of the influence when a parameter is changed, I replace only the value of timeout. Please confirm my patch. And please commit the patch. It looks like a wrong place for a fix. Shouldn't crmd send all environment? It is only by chance that we have the timeout value available in this function. Cheers, Dejan Best Regards, Hideo Yamauchi.
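The "stop must use the old parameters" point can be made concrete with a toy model. This is hypothetical illustration, not lrmd code; the addresses and the stop logic are invented for the example:

```python
# After an admin edits 'ip' in the configuration, the address actually
# configured on the node is still the OLD one. A stop that copies the
# freshly edited parameters would try to remove an address that was
# never brought up, and fail.
running = {"ip": "192.168.100.2"}      # what the node actually has up
new_params = {"ip": "192.168.100.9"}   # freshly edited configuration

def stop(params, running):
    """Remove the address named in params; fail if it isn't present."""
    return "ok" if params["ip"] == running["ip"] else "stop failure"

assert stop(new_params, running) == "stop failure"      # copying new params breaks stop
assert stop({"ip": "192.168.100.2"}, running) == "ok"   # cached old params work
```

A stop failure is serious in Pacemaker: the node typically ends up fenced, which is why resource parameters are deliberately not re-copied for stop operations.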
Re: [Linux-ha-dev] [Patch] OCF_RESKEY_CRM_meta_timeout not matching monitor timeout meta-data.
On Fri, Dec 16, 2011 at 1:21 PM, renayama19661...@ybb.ne.jp wrote: Hi Andrew, Thank you for comment. When stopping, you always want to use the old parameters (think of someone changing 'ip' for an IPaddr resource). Options that are interpreted by the crmd or lrmd are a different matter, which resulted in: https://github.com/ClusterLabs/pacemaker/commit/fcfe6fe522138343e4138248829926700fac213e All right. Will you apply this correction to 1.0 of Pacemaker? Sure. We'll pick it up for .13 Best Regards, Hideo Yamauchi. --- On Fri, 2011/12/16, Andrew Beekhof and...@beekhof.net wrote: On Thu, Dec 15, 2011 at 8:45 PM, renayama19661...@ybb.ne.jp wrote: Hi Dejan, Thank you for comment. It looks like a wrong place for a fix. Shouldn't crmd send all environment? It is only by chance that we have the timeout value available in this function. In the case of stop, crmd does not ask lrmd for the substitution of the parameter.

(snip)
	/* reset the resource's parameters? */
	if (op->interval == 0) {
		if (safe_str_eq(CRMD_ACTION_START, operation)
		    || safe_str_eq(CRMD_ACTION_STATUS, operation)) {
			op->copyparams = 1;
		}
	}
(snip)

When a resource parameter is changed, this affects how lrmd stops the resource; the changed parameter must not be copied. When stopping, you always want to use the old parameters (think of someone changing 'ip' for an IPaddr resource). Options that are interpreted by the crmd or lrmd are a different matter, which resulted in: https://github.com/ClusterLabs/pacemaker/commit/fcfe6fe522138343e4138248829926700fac213e My patch is an example of handling it in lrmd. Is there a better patch? * For example, it may be good to give copyparams a different value. Best Regards, Hideo Yamauchi. --- On Thu, 2011/12/15, Dejan Muhamedagic de...@suse.de wrote: Hi Hideo-san, On Thu, Dec 15, 2011 at 06:21:00PM +0900, renayama19661...@ybb.ne.jp wrote: Hi All, I made a patch which fixes the old problem below.
* http://www.gossamer-threads.com/lists/linuxha/users/70262 In consideration of the influence when a parameter is changed, I replace only the value of timeout. Please confirm my patch. And please commit the patch. It looks like a wrong place for a fix. Shouldn't crmd send all environment? It is only by chance that we have the timeout value available in this function. Cheers, Dejan Best Regards, Hideo Yamauchi.
Re: [Linux-ha-dev] In RHEL5 and RHEL6 about different HA_RSCTMP
On Tue, Dec 6, 2011 at 8:00 PM, nozawat noza...@gmail.com wrote: Hi A maintenance mode in the heartbeat stack does not work in RHEL6 because of this difference. The reason is that /var/run/heartbeat/rsctmp is deleted at initialization of Heartbeat. Right, but the location and its deletion date back over 8 years IIRC. Some RAs create a status file there, which makes the resource seem stopped? At the point heartbeat starts, all bets are off, and the RA needs to be able to correctly rediscover its own state. The resource-agents.spec files are as follows.
--
151 %if 0%{?fedora} >= 11 || 0%{?centos_version} > 5 || 0%{?rhel} > 5
152 CFLAGS="$(echo '%{optflags}')"
153 %global conf_opt_rsctmpdir --with-rsctmpdir=%{_var}/run/heartbeat/rsctmp
154 %global conf_opt_fatal --enable-fatal-warnings=no
155 %else
156 CFLAGS="${CFLAGS} ${RPM_OPT_FLAGS}"
157 %global conf_opt_fatal --enable-fatal-warnings=yes
158 %endif
--
Why is rsctmp used in RHEL6? Should the 153 line be deleted if possible? I can contribute a patch if that would be good. Regards, Tomo
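The pitfall being discussed - an RA trusting a status file under a tmpdir that heartbeat wipes at startup - can be sketched as follows. This is hypothetical RA logic, not any real agent; the file name and directory are stand-ins:

```python
import os
import tempfile

RSCTMP = tempfile.mkdtemp()  # stand-in for /var/run/heartbeat/rsctmp
state_file = os.path.join(RSCTMP, "myresource.state")

def naive_monitor():
    """An RA that infers resource state purely from its status file."""
    return "running" if os.path.exists(state_file) else "stopped"

open(state_file, "w").close()   # resource started; status file written
assert naive_monitor() == "running"

os.remove(state_file)           # heartbeat init wipes the tmpdir on (re)start
# The service process may still be alive, yet the RA now reports
# "stopped" - which is why the RA must rediscover real state instead.
assert naive_monitor() == "stopped"
```

This illustrates the reply above: after heartbeat starts, "all bets are off" for cached state, and a monitor must probe the actual service, not a marker file.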
Re: [Linux-ha-dev] VirtualDomain issue
On Mon, Nov 14, 2011 at 9:58 PM, Dejan Muhamedagic de...@suse.de wrote: Hi, On Thu, Jun 23, 2011 at 07:51:48AM +0200, Dominik Klein wrote: Hi code snippet from http://hg.linux-ha.org/agents/raw-file/7a11934b142d/heartbeat/VirtualDomain (which I believe is the current version)

VirtualDomain_Validate_All() {
	snip
	if [ ! -r $OCF_RESKEY_config ]; then
		if ocf_is_probe; then
			ocf_log info "Configuration file $OCF_RESKEY_config not readable during probe."
		else
			ocf_log error "Configuration file $OCF_RESKEY_config does not exist or is not readable."
			return $OCF_ERR_INSTALLED
		fi
	fi
}
snip
VirtualDomain_Validate_All || exit $?
snip
if ocf_is_probe && [ ! -r $OCF_RESKEY_config ]; then
	exit $OCF_NOT_RUNNING
fi

So, say one node does not have the config, but the cluster decides to run the vm on that node. The probe returns NOT_RUNNING, so the cluster tries to start the vm; that start returns ERR_INSTALLED; the cluster has to try to recover from the start failure, so it stops it, but that stop op returns ERR_INSTALLED as well, so we need to be stonith'd. I think this is wrong behaviour. On stop, it should return OCF_SUCCESS. I wonder if it would be safe for the CRM to interpret ERR_INSTALLED on stop as resource stopped. Opinions? Feels dangerous. Even if the binaries are missing, the RA should arguably look for and kill any relevant processes before returning OK. Cheers, Dejan P.S. Very sorry for such a delay! I read the comments about configurations being on shared storage which might not be available at certain points in time and I see the point. But the way this is implemented clearly does not work for everybody. I vote for making this configurable. Unfortunately, due to several reasons, I am not able to contribute this patch myself at the moment.
Regards Dominik
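The failure chain Dominik describes can be modelled in a few lines. This is a toy model of the recovery logic, not real crmd code; only the OCF return-code values are taken from the OCF spec:

```python
# Standard OCF exit codes (per the OCF resource agent API).
OCF_SUCCESS = 0
OCF_ERR_INSTALLED = 5
OCF_NOT_RUNNING = 7

def recover(stop_rc):
    """A failed start is recovered by stopping the resource; a stop
    that itself fails leaves the cluster no option but fencing."""
    if stop_rc in (OCF_SUCCESS, OCF_NOT_RUNNING):
        return "stopped"
    return "fence node"

# Current agent behaviour on the node without the config file:
# stop re-runs validate-all and returns ERR_INSTALLED -> fencing.
assert recover(OCF_ERR_INSTALLED) == "fence node"

# Proposed behaviour: stop reports SUCCESS when nothing is running,
# so the start failure is recovered without stonith.
assert recover(OCF_SUCCESS) == "stopped"
```

This is why stop semantics are so sensitive: any non-success exit from a stop operation escalates straight to node-level fencing.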
Re: [Linux-ha-dev] attrd and repeated changes
On Sat, Oct 22, 2011 at 7:14 AM, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Thu, Oct 20, 2011 at 08:48:36AM -0600, Alan Robertson wrote: On 10/20/2011 03:41 AM, Philipp Marek wrote: Hello, when constantly sending new data via attrd, the changes are never used. Example:

while sleep 1
do
	attrd_updater -l reboot -d 5 -n rep_chg -U try$SECONDS
	cibadmin -Ql | grep rep_chg
done

This always returns the same value - the one that was given with more than 5 seconds delay afterwards, so that the dampen interval wasn't broken by the next change. I've attached two draft patches; one for allowing the _first_ value in a dampen interval to be used (effectively ignoring changes until this value is written), and one for using the _last_ value in the dampen interval (by not changing the dampen timer). [1] *** Note: they are for discussion only! *** I didn't test them, not even for compilation. What is the correct way to handle multiple updates within the dampen interval? Personally, I'd vote for the last value. I agree with you about this being a bug. If the attribute is used to check connectivity changes (ping resource agent), or similar, and we have flaky, flapping connectivity, it would be useful to have a max or min consolidation function for incoming values during a dampen interval. Otherwise, I get + + - + + -|+ + + + and if the dampen interval just happened to expire where I put the | above, it would have pushed a - to the cib, where I'd rather have kept it at +. That's why dampen should typically be a multiple of the monitor interval. We likely want to add an option to attrd_updater (and to the ipc messages it sends to attrd, and to the rest of the chain involved), which can specify the consolidation function to be used. The initial set I suggest would be generic: oldest, latest (default?),
for values assumed to be numeric: max (also a candidate for default behaviour) min avg (with a printf like template for rounding, %.2f or similar, so we could even average boolean values) For avg you'd need to specify how many values to remember. I suggest this behaviour: * If different updates request a different consolidation function, the last one (within the respective dampen interval) wins. * update with the _same_ value: Do not start or modify any timer. If a timer is pending, still add the value to the list of values to be processed by the consolidation function (relevant for avg, possibly not yet listed others). * update with a different value: Start a new timer, unless one is pending already. Do not restart/modify an already pending timer. Add to the list of values for the consolidation function. * Flush message received: expire timer. See below. * Timer expires: Apply consolidation function to list of values. If list is empty (probably flush message without pending timer), use current value. Send that result to the cib. Sounds reasonable, there's no way I'm going to be able to get to implement it any time soon though. If someone else wants to implement it, I think it would be useful to have it be part of a larger rework that ensured atomicity of the updates. I.e. have all nodes send their values to a designated instance which did all the updates. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
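A toy model of the proposal above (illustrative only, not attrd code): values collected during one dampen interval are reduced by a single consolidation function on timer expiry.

```shell
# Sketch of the proposed consolidation step. The function names mirror the
# set suggested in the thread; everything here is hypothetical shell, not
# the real attrd implementation.
consolidate() {  # usage: consolidate <func> <values...>
    func=$1; shift
    case $func in
        latest) for v in "$@"; do last=$v; done; echo "$last" ;;
        oldest) echo "$1" ;;
        max)    printf '%s\n' "$@" | sort -n | tail -n 1 ;;
        min)    printf '%s\n' "$@" | sort -n | head -n 1 ;;
        avg)    printf '%s\n' "$@" | awk '{ s += $1 } END { printf "%.2f\n", s / NR }' ;;
    esac
}

# The flapping +/- series from the example (1=up, 0=down): with "max" the
# interval still reports up, instead of whatever value the timer landed on.
consolidate max 1 0 1 1
```

With "latest" this reduces to Philipp's second draft patch; "oldest" reduces to the first.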
Re: [Linux-ha-dev] How to use reload action of RA agent?
On Fri, Oct 7, 2011 at 2:42 PM, Serge Dubrouski serge...@gmail.com wrote: Hello - How is one supposed to use the reload action of an RA if it's supported by the RA? When I try to set up an order like this: order Reload_After_Start +inf: res1:start res2:reload neither crm nor cibadmin allows me to define such an order. Right, it's not something you specify like that. It happens automatically when the resource definition changes: http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-reload.html -- Serge Dubrouski. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
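For illustration, reload is triggered by changing a resource parameter rather than by an ordering constraint; the resource and parameter names below are made up, and the behaviour assumes the agent advertises a reload action:

```
# Hypothetical resource/parameter; crm_resource is the stock Pacemaker CLI.
crm_resource --resource res2 --set-parameter loglevel --parameter-value debug
# If res2's agent advertises a reload action and 'loglevel' is declared
# unique="0" in its metadata, the next transition runs reload instead of
# a full stop/start cycle.
```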
Re: [Linux-ha-dev] [Linux-HA] [ha-wg] CFP: HA Mini-Conference in Prague on Oct 25th
On Sat, Oct 8, 2011 at 6:03 AM, Digimer li...@alteeve.com wrote: On 10/07/2011 02:58 PM, Florian Haas wrote: Vienna before the early afternoon of Saturday the 29th, so if anyone has plans to do something interesting that Saturday morning I'd be more than happy to join. Cheers, Florian I'm going to be in the city all day Saturday as well. Knowing there will be at least a few who will have trouble making the unofficial meeting on the 26th, The 26th is just the meeting start. perhaps we could have an even more informal meeting/talk/debrief/waffles on Saturday morning? I'll be flying to Munich late on Friday afternoon. -- Digimer E-Mail: digi...@alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org At what point did we forget that the Space Shuttle was, essentially, a program that strapped human beings to an explosion and tried to stab through the sky with fire and math? ___ Linux-HA mailing list linux...@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] [Linux-HA] [ha-wg] CFP: HA Mini-Conference in Prague on Oct 25th
On Thu, Oct 6, 2011 at 1:53 AM, Lars Marowsky-Bree l...@suse.com wrote: On 2011-10-03T11:10:13, Andrew Beekhof and...@beekhof.net wrote: Based on Boston last year, I imagine the conversations will last right up until Lars starts presenting his talk on Friday afternoon. People came and went at random, and if someone essential was missing for a conversation we deferred it until later. Oh, then we're going to not stop, ever - because I don't have a talk at the main conference this time ;-) The schedule has you in a friday afternoon slot iirc. Very informal, but it seemed to work ok. yes, and given that the ha mailing lists are still down, probably the best we can hope for ... indeed ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] [ha-wg] CFP: HA Mini-Conference in Prague on Oct 25th
On Sat, Oct 1, 2011 at 12:55 AM, Digimer li...@alteeve.com wrote: On 09/27/2011 07:58 AM, Lars Marowsky-Bree wrote: Hi all, it turns out that there was zero feedback about people wanting to present, only some about travel budget being too tight to come. So we had some discussions about whether to cancel this completely, as this made planning rather difficult. But just in the last few days, I got a fair share of e-mails asking if this still takes place, and who is going to be there. ;-) So: we have the room. I will be there, and it seems so will at least a few other people, including Andrew. I suggest we do it in an unconference style and draw up the agenda as we go along; you're welcome to stop by and discuss HA/clustering topics that are important to you. It is going to be as successful as we all make it out to be. We share the venue with LinuxCon Europe: Clarion Congress Hotel · Prague, Czech Republic, on Oct 25th. I suggest we start at 9:30 in the morning and go from there. Regards, Lars Is it possible, if this isn't set in stone, to push back to later in the day? I don't fly in until the 25th, and I think there is one other person who wants to attend in the same boat. Based on Boston last year, I imagine the conversations will last right up until Lars starts presenting his talk on Friday afternoon. People came and went at random, and if someone essential was missing for a conversation we deferred it until later. Very informal, but it seemed to work ok. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] [Pacemaker] ping RA question
Dan - any objections if I incorporate the fping parts into the ping RA? On Fri, Jul 29, 2011 at 12:47 AM, Dan Urist dur...@ucar.edu wrote: Here's my fping RA, for anyone who's interested. Note that some of the parameters are different than ping/pingd, since fping works differently. The major advantages of fping over the system ping are that multiple hosts can be pinged with a single fping command, and fping will return as soon as all hosts succeed (the linux system ping will not return until it has exhausted either its count or the timeout, regardless of success). -- Dan Urist dur...@ucar.edu 303-497-2459 ___ Pacemaker mailing list: pacema...@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
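The difference Dan describes can be seen from the command line; hosts and timeout below are illustrative, consult fping(8) for your version's options:

```
# One fping process probes the whole list and returns once every host has
# answered (or its retries are exhausted):
fping -q -c 1 -t 500 192.168.1.1 192.168.1.2 192.168.1.254

# versus one ping per host, each running for its full count/timeout even
# after a success elsewhere:
for h in 192.168.1.1 192.168.1.2 192.168.1.254; do ping -c 1 -W 1 $h; done
```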
Re: [Linux-ha-dev] [PATCH] pacemaker-1.1.5 : fix autotools build system
applied. thanks! On Tue, Jul 12, 2011 at 6:54 PM, Ultrabug ultra...@gentoo.org wrote: Hello mates, I would like you to consider having the attached patch committed in order to fix and improve the build system of pacemaker. We Gentoo compilation lovers have to apply this patch in order to have a clean and working build of pacemaker, mostly because we use LDFLAGS=--as-needed in our default setup. As you may already know, this requires the library linking to be strictly ordered and organized, and this patch is meant to do that. If you remember, we of the Gentoo cluster team already submitted a similar patch (see attached message) but it seems some features/files have slipped since then. Thanks in advance for considering. Ultrabug, Gentoo cluster herd. PS: sorry, posted this message earlier with the wrong mail account ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] pacemaker - migrate RA, based on the state of other RA, w/o clone?
On Thu, Jul 14, 2011 at 1:51 AM, RNZ renoi...@gmail.com wrote: I made the following resource agent - https://github.com/rnz/resource-agents/blob/master/heartbeat/couchdb At the end of the file there is this example configuration:

node vub001
node vub002
primitive couchdb-1 ocf:heartbeat:couchdb \
    params dbuser=user dbpass=pass replcmds="http://admin:pass@192.168.1.2:5984/testdb,http://admin:pass@127.0.0.1:5984/testdb,true;http://admin:pass@192.168.1.2:5984/testdb1,http://admin:pass@127.0.0.1:5984/testdb1,true;;" \
    op start interval=0 timeout=20s \
    op stop interval=0 timeout=20s \
    op monitor interval=10s \
    meta target-role=Started
primitive couchdb-2 ocf:heartbeat:couchdb \
    params dbuser=user dbpass=pass replcmds="http://admin:pass@192.168.1.1:5984/testdb,http://admin:pass@127.0.0.1:5984/testdb,true;http://admin:pass@192.168.1.1:5984/testdb1,http://admin:pass@127.0.0.1:5984/testdb1,true;;" \
    op start interval=0 timeout=20s \
    op stop interval=0 timeout=20s \
    op monitor interval=10s \
    meta target-role=Started
primitive vIP ocf:heartbeat:IPaddr2 \
    params ip=192.168.1.10 nic=eth1 \
    op start interval=0 timeout=20s \
    op stop interval=0 timeout=20s \
    op monitor interval=5s timeout=20s depth=0 \
    meta target-role=Started
location cdb-1-c couchdb-1 inf: vub001
location cdb-1-p couchdb-1 -inf: vub002
location cdb-2-c couchdb-2 inf: vub002
location cdb-2-p couchdb-2 -inf: vub001
location vIP_c vIP 100: vub001
location vIP_p vIP 10: vub002
property $id=cib-bootstrap-options \
    cluster-infrastructure=openais \
    expected-quorum-votes=2 \
    no-quorum-policy=ignore \
    stonith-enabled=false \
    symmetric-cluster=false
rsc_defaults $id=rsc-options \
    resource-stickiness=110

-- Each CouchDB resource stays pinned to its node, because they use master-master/multi-master replication. And I want to make couchdb replication easy to set up, without using external files for the replication configuration. I need the vIP resource to migrate to the other node when the couchdb resource on the current node fails or stops. How can I express that in the Pacemaker configuration?
I think you want to be using the Master/Slave construct of pacemaker. That would let you colocate the vIP with instances of couchdb that have the master role. Maybe use a location rule with #uname, or is an additional RA needed for controlling the state (as pingd does)? One could use the following method:

#!/bin/bash
curl -s http://127.0.0.1:5984 | grep -q 'couchdb'
if [ $? != 0 ]; then
    # ... add control by crm_mon of started couchdb on current node
    crm resource migrate vIP
    crm configure delete cli-standby-vIP
fi

But I think this is not good. P.S. Sorry for my bad English... ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
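A sketch of the Master/Slave arrangement Andrew suggests, assuming the couchdb RA were extended with promote/demote support (the version linked above does not have it); names follow the configuration in the question:

```
primitive couchdb ocf:heartbeat:couchdb \
    params dbuser=user dbpass=pass \
    op monitor interval=10s role=Master \
    op monitor interval=11s role=Slave
ms ms-couchdb couchdb \
    meta master-max=2 clone-max=2 notify=true
# vIP may only run where a couchdb instance currently holds the Master role:
colocation vIP-with-couchdb inf: vIP ms-couchdb:Master
```

With master-max=2 both nodes can hold the master role (matching the multi-master replication), and the colocation moves vIP away from a node whose instance loses it.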
Re: [Linux-ha-dev] Filesystem ocf file
On Fri, May 6, 2011 at 9:37 AM, Florian Haas florian.h...@linbit.com wrote: On 2011-05-06 09:26, Darren Thompson wrote: Team I was reviewing some errors on a cluster mounted file-system that caused me to review the Filesystem ocf file. I notice that it uses an undeclared parameter, OCF_CHECK_LEVEL, to determine what degree of testing of the filesystem is required in monitor. I have now updated it to more formally work with a check_level value with the more obvious values of mounted, read, write (my updated version attached). Could someone (Florian, is this something you can do?) please review this with a view to patching the upstream Filesystem ocf file. NACK, sorry. OCF_CHECK_LEVEL is specific to the monitor action and described as such in the OCF spec; this will not be changed without a change to the spec. To use it, set op monitor interval=X OCF_CHECK_LEVEL=Y Yes, it's poorly designed, it makes no sense that this is pretty much the only sensible place to set a parameter specifically for an operation (as opposed to on a resource), it's inexplicable why it's all caps, etc., but that's the way it is. Honest. It was broken when we got here. Maybe it was the neighbor's dog? ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
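In crm shell terms, Florian's "op monitor interval=X OCF_CHECK_LEVEL=Y" looks roughly like this; device and mount point are illustrative, and (to the best of my knowledge) the shipped Filesystem agent treats level 10 as a read test and 20 as a write test:

```
primitive fs ocf:heartbeat:Filesystem \
    params device=/dev/drbd0 directory=/mnt/data fstype=ext3 \
    op monitor interval=20s OCF_CHECK_LEVEL=10 \
    op monitor interval=60s OCF_CHECK_LEVEL=20
```

Note the two monitors need distinct intervals, since (name, interval) pairs must be unique per resource.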
Re: [Linux-ha-dev] [ha-wg] Cluster Stack - Ubuntu Developer Summit
On Thu, May 5, 2011 at 10:25 AM, Florian Haas florian.h...@linbit.com wrote: On 2011-04-26 19:33, Andres Rodriguez wrote: UDS' are open-to-public events, and I believe it would be great if upstream could participate and maybe even further the discussion about the Cluster Stack. For more information about UDS, please visit [1]. The specific date/time for the Cluster Stack session is not yet available. If you require any further information please don't hesitate to contact me. Andres already knows this, but FWIW I'll repost here that I'll be at UDS in time for the cluster stack session at 12 noon on 5/12. I'll stay in Budapest that evening and will probably join the Budapest sightseeing tour that the Hungarian Ubuntu team is organizing, so if anyone wants to link up with Andres and me for a few beverages please let us know. Andrew, interested in making a day trip to Budapest while you're still on this continent? With under 4 weeks to go - not a chance :-) ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] ACLs and privilege escalation (was Re: New OCF RA: symlink)
On Thu, May 5, 2011 at 9:09 AM, Florian Haas florian.h...@linbit.com wrote: Rather than going into ACLs in more detail, I wanted to highlight that however we limit access to the CIB, the resource agents still _execute_ as root, so we will always have what would normally be considered a privilege escalation issue. Now, we could agree on security guidelines for RAs, and some of those would certainly be no-brainers to define (such as, don't ever eval unsanitized user input), but I refuse to even suggest to tackle any such guidelines before the OCF spec update has gotten off the ground. One such thing that could be added to the spec would be optional meta variables named user and group, directing the LRM (or any successor) to execute the RA as that user rather than root. Just an idea. Seems plausible. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] New OCF RA: symlink
On Wed, May 4, 2011 at 4:36 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote: Services running under Pacemaker control are probably critical, so a malicious person with even only stop access on the CIB can do a DoS. I guess we have to assume people with any write access at all to the CIB are trusted, and not malicious. Exactly. If the cluster (or access to it) has been compromised, you're in for so much pain that a symlink RA is the least of your problems. A generic cluster manager is, by design, a way to run arbitrary scripts as root - there's no coming back from there. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] Translate crm_cli.txt to Japanese
On Wed, Apr 27, 2011 at 12:54 PM, Dejan Muhamedagic de...@suse.de wrote: Hi Junko-san, On Wed, Apr 27, 2011 at 06:42:52PM +0900, Junko IKEDA wrote: Hi, May I suggest that you go with the devel version, because crm_cli.txt was converted to crm.8.txt. There are not many textual changes, just some obsolete parts removed. OK, I got crm.8.txt from devel. The directory structure for each of Pacemaker 1.0, 1.1 and devel is just a bit different. Does 1.0 keep its doc dir structure for now? Until the next release I guess. If so, it seems that just creating the html file is not so difficult when asciidoc is available. No, not difficult. It just depends on the build environment. If asciidoc is found by configure, then it is going to be used to produce the html files. Do any distros _not_ ship asciidoc? ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] Translate crm_cli.txt to Japanese
On Wed, Apr 27, 2011 at 3:47 PM, Dejan Muhamedagic de...@suse.de wrote: On Wed, Apr 27, 2011 at 02:01:40PM +0200, Andrew Beekhof wrote: On Wed, Apr 27, 2011 at 12:54 PM, Dejan Muhamedagic de...@suse.de wrote: Hi Junko-san, On Wed, Apr 27, 2011 at 06:42:52PM +0900, Junko IKEDA wrote: Hi, May I suggest that you go with the devel version, because crm_cli.txt was converted to crm.8.txt. There are not many textual changes, just some obsolete parts removed. OK, I got crm.8.txt from devel. Each directory structure for Pacemaker 1.0,1.1 and devel is just a bit different. Does 1.0 keep its doc dir structure for now? Until the next release I guess. If so, it seems that just create html file is not so difficult when asciidoc is available. No, not difficult. It just depends on the build environment. If asciidoc is found by configure, then it is going to be used to produce the html files. Do any distros _not_ ship asciidoc? AFAIK none of contemporary distributions. And going back three years or so, it's the other way around. How quickly we forget. Anyway, I advocate that the project makes decisions based on it being around (but fails gracefully when its not) and leaves it up to older distros to ship a pre-generated copy if they so desire. I can't imagine lack of HTML versions being a deal breaker. And by fail gracefully, I mean the current behavior of just not building those versions of the doc. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] Bug in crm shell or pengine
On Mon, Apr 18, 2011 at 11:38 PM, Serge Dubrouski serge...@gmail.com wrote: Ok, I've read the documentation. It's not a bug, it's a feature :-) Might be nice if the shell could somehow prevent such configs, but it would be non-trivial to implement. On Mon, Apr 18, 2011 at 3:01 PM, Serge Dubrouski serge...@gmail.com wrote: Hello - Looks like there is a bug in the crm shell (Pacemaker version 1.1.5) or in the pengine.

primitive pg_drbd ocf:linbit:drbd \
    params drbd_resource=drbd0 \
    op monitor interval=60s role=Master timeout=10s \
    op monitor interval=60s role=Slave timeout=10s

Log file:

Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Operation pg_drbd-monitor-60s-0 is a duplicate of pg_drbd-monitor-60s
Apr 17 04:05:29 cs51 crmd: [5535]: info: do_state_transition: Starting PEngine Recheck Timer
Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Do not use the same (name, interval) combination more than once per resource
Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Operation pg_drbd-monitor-60s-0 is a duplicate of pg_drbd-monitor-60s
Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Do not use the same (name, interval) combination more than once per resource
Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Operation pg_drbd-monitor-60s-0 is a duplicate of pg_drbd-monitor-60s

Plus strange behavior of the cluster, like an inability to move resources from one node to another. -- Serge Dubrouski. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
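The feature in question is that (name, interval) pairs must be unique per resource, regardless of role. The customary workaround for role-specific monitors is to give them slightly different intervals:

```
primitive pg_drbd ocf:linbit:drbd \
    params drbd_resource=drbd0 \
    op monitor interval=59s role=Master timeout=10s \
    op monitor interval=60s role=Slave timeout=10s
```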
Re: [Linux-ha-dev] Dovecot OCF Resource Agent
On Fri, Apr 15, 2011 at 12:53 PM, Raoul Bhatia [IPAX] r.bha...@ipax.at wrote: On 04/15/2011 11:10 AM, jer...@intuxicated.org wrote: Yes, it does the same thing but contains some additional features, like logging into a mailbox. first of all, i do not know how the others think about an ocf ra implemented in c. i'll suggest waiting for comments from dejan or fghass. the ipv6addr agent was written in C too the OCF standard does not dictate the language to be used - it's really a matter of whether C is the best tool for this job you could then create a fork on github and make sure it integrates well with the current build environment. second, what do you think about extending this ra to be able to handle multiple email MDAs? deep probing routines would also be needed for other MDAs. i'm thinking about giving this ra a shot but would like to hear some comments on my first remark before doing so. thanks for your work! raoul -- DI (FH) Raoul Bhatia M.Sc. email. r.bha...@ipax.at Technischer Leiter IPAX - Aloy Bhatia Hava OG web. http://www.ipax.at Barawitzkagasse 10/2/2/11 email. off...@ipax.at 1190 Wien tel. +43 1 3670030 FN 277995t HG Wien fax. +43 1 3670030 15 ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] Resource agent implementing SPC-3 Persistent Reservations (contribution from Evgeny Nifontov)
Awesome. I was wondering if someone would ever write one of these :) On Tue, Apr 12, 2011 at 10:29 AM, Florian Haas florian.h...@linbit.com wrote: Hi everyone, Evgeny Nifontov has started to implement sg_persist, a resource agent managing SPC-3 Persistent Reservations (PRs) using the sg_persist binary. He's put up a personal repo on Github and the initial commit is here: https://github.com/nif/ClusterLabs__resource-agents/commit/d0c46fb35338d28de3e2c20c11d0ad01dded13fd I've added some comments for an initial review. Everyone interested please pitch in. Thanks to Evgeny for the contribution! Cheers, Florian ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] new resource agents repository commit policy
On Mon, Mar 14, 2011 at 6:07 PM, Dejan Muhamedagic de...@suse.de wrote: Hello everybody, It's time to figure out how to maintain the new Resource Agents repository. Fabio and I already discussed this a bit in IRC. There are two options: a) everybody gets an account at github.com and commit rights, where "everybody" means all people who had commit rights to the linux-ha.org and rgmanager agents repositories. b) several maintainers have commit rights and everybody else sends patches to a ML; then one of the maintainers does a review and commits the patch (or pulls it from the author's repository). I suspect you want b) with maybe 6 people for redundancy. The pull request workflow should be well suited to a project like this and impose minimal overhead. The ability to comment on patches in-line before merging them should be pretty handy. You're also welcome to put a copy at http://www.clusterlabs.org/git/ It's pretty easy to keep the two repos in sync; for example, I have this in .git/config for matahari:

[remote "origin"]
    fetch = +refs/heads/*:refs/remotes/origin/*
    url = g...@github.com:matahari/matahari.git
    pushurl = g...@github.com:matahari/matahari.git
    pushurl = ssh://beek...@git.fedorahosted.org/git/matahari.git

git push then sends to both locations. Option a) incurs a bit less overhead and that's how our old repositories worked. Option b) gives, at least nominally, more control to the select group of maintainers, but also places even more burden on them. We are open to either of these. Cheers, Fabio and Dejan ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
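The same dual-push arrangement can also be built with the git CLI instead of editing .git/config by hand; the repository URLs below are made up for illustration:

```shell
# Work in a throwaway directory so nothing real is touched.
cd "$(mktemp -d)"
git init -q repo && cd repo
git remote add origin git@github.com:example/project.git

# The first set-url --push replaces the implicit push URL; --add appends
# a second one, so every "git push" goes to both.
git remote set-url --push origin git@github.com:example/project.git
git remote set-url --add --push origin ssh://user@git.example.org/git/project.git

# Lists both push URLs, one per line:
git remote get-url --push --all origin
```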
Re: [Linux-ha-dev] new resource agents repository
On Thu, Feb 24, 2011 at 4:10 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: On Thu, Feb 24, 2011 at 03:56:27PM +0100, Andrew Beekhof wrote: On Thu, Feb 24, 2011 at 2:59 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: Hello, There is a new repository for Resource Agents which contains RA sets from both the Linux HA and Red Hat projects: git://github.com/ClusterLabs/resource-agents.git The purpose of the common repository is to share maintenance load and try to consolidate resource agents. There were no conflicts with the rgmanager RA set and both source layouts remain the same. It is only the autoconf bits that were merged. The only difference is that if you want to get the Linux HA set of resource agents installed, configure should be run like this: configure --with-ras-set=linux-ha ... The new repository is git but the existing history is preserved. People used to Mercurial shouldn't have a hard time working with git. We need to retire the existing repository hg.linux-ha.org. Are there any objections or concerns that still need to be addressed? Might not hurt to leave it around - there might be various URLs that point there. Yes, it will definitely remain there. What I meant by retire is that the developers then start using the git repository exclusively. Yes, and making it read-only on the server is probably a good idea (to avoid pushes). ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] New ocft config file for IBM db2 resource agent
On Tue, Feb 15, 2011 at 10:50 AM, Dejan Muhamedagic deja...@fastmail.fm wrote: Hi Holger, On Tue, Feb 15, 2011 at 09:49:07AM +0100, Holger Teutsch wrote: Hi, please find enclosed an ocft config for db2, for review and inclusion into the project if appropriate. Wonderful! This is the first time somebody contributed an ocft testcase. Looks like lmb owes somebody lunch :-) The current 1.0.4 agent passes the tests 8-). I've never doubted that either. Cheers, Dejan Regards Holger

# db2
#
# This test assumes a db2 ESE instance with two partitions and a database.
# Default is instance=db2inst1, database=ocft
# adapt this in set_testenv below
#
# Simple steps to generate a test environment (if you don't have one):
#
# A virtual machine with 1200MB RAM is sufficient
#
# - download an eval version of DB2 server from IBM
# - create an user db2inst1 in group db2inst1
#
# As root
# - install DB2 software in some location
# - create instance
#   cd this_location/instance
#   ./db2icrt -s ese -u db2inst1 db2inst1
# - adapt profile of db2inst1 as instructed by db2icrt
#
# As db2inst1
#
# allow to run with small memory footprint
# db2set DB2_FCM_SETTINGS=FCM_MAXIMIZE_SET_SIZE:FALSE
# db2start
# db2start dbpartitionnum 1 add dbpartitionnum hostname $(uname -n) port 1 without tablespaces
# db2stop
# db2start
# db2 create database ocft
# Done
# In order to install a real cluster refer to http://www.linux-ha.org/wiki/db2_(resource_agent)

CONFIG
    HangTimeout 40

SETUP-AGENT
    # nothing

CASE-BLOCK set_testenv
    Var OCFT_instance=db2inst1
    Var OCFT_db=ocft

CASE-BLOCK crm_setting
    Var OCF_RESKEY_instance=$OCFT_instance
    Var OCF_RESKEY_CRM_meta_timeout=3

CASE-BLOCK default_status
    AgentRun stop

CASE-BLOCK prepare
    Include set_testenv
    Include crm_setting
    Include default_status

CASE "check base env"
    Include prepare
    AgentRun start OCF_SUCCESS

CASE "check base env: invalid 'OCF_RESKEY_instance'"
    Include prepare
    Var OCF_RESKEY_instance=no_such
    AgentRun start OCF_ERR_INSTALLED

CASE "invalid instance config"
    Include prepare
    Bash eval mv ~$OCFT_instance/sqllib ~$OCFT_instance/sqllib-
    BashAtExit eval mv ~$OCFT_instance/sqllib- ~$OCFT_instance/sqllib
    AgentRun start OCF_ERR_INSTALLED

CASE "unimplemented command"
    Include prepare
    AgentRun no_cmd OCF_ERR_UNIMPLEMENTED

CASE "normal start"
    Include prepare
    AgentRun start OCF_SUCCESS

CASE "normal stop"
    Include prepare
    AgentRun start
    AgentRun stop OCF_SUCCESS

CASE "double start"
    Include prepare
    AgentRun start
    AgentRun start OCF_SUCCESS

CASE "double stop"
    Include prepare
    AgentRun stop OCF_SUCCESS

CASE "started: monitor"
    Include prepare
    AgentRun start
    AgentRun monitor OCF_SUCCESS

CASE "not started: monitor"
    Include prepare
    AgentRun monitor OCF_NOT_RUNNING

CASE "killed instance: monitor"
    Include prepare
    AgentRun start OCF_SUCCESS
    AgentRun monitor OCF_SUCCESS
    BashAtExit rm /tmp/ocft-helper1
    Bash echo "su $OCFT_instance -c '. ~$OCFT_instance/sqllib/db2profile; db2nkill 0 >/dev/null 2>&1'" >/tmp/ocft-helper1
    Bash sh -x /tmp/ocft-helper1
    AgentRun monitor OCF_NOT_RUNNING

CASE "overload param instance by admin"
    Include prepare
    Var OCF_RESKEY_instance=no_such
    Var OCF_RESKEY_admin=$OCFT_instance
    AgentRun start OCF_SUCCESS

CASE "check start really activates db"
    Include prepare
    AgentRun start OCF_SUCCESS
    BashAtExit rm /tmp/ocft-helper2
    Bash echo "su $OCFT_instance -c '. ~$OCFT_instance/sqllib/db2profile; db2 get snapshot for database on $OCFT_db >/dev/null'" >/tmp/ocft-helper2
    Bash sh -x /tmp/ocft-helper2

CASE "multipartition test"
    Include prepare
    AgentRun start OCF_SUCCESS
    AgentRun monitor OCF_SUCCESS
    # start does not start partition 1
    Var OCF_RESKEY_dbpartitionnum=1
    AgentRun monitor OCF_NOT_RUNNING
    # now start 1
    AgentRun start OCF_SUCCESS
    AgentRun monitor OCF_SUCCESS
    # now stop 1
    AgentRun stop OCF_SUCCESS
    AgentRun monitor OCF_NOT_RUNNING
    # does not affect 0
    Var OCF_RESKEY_dbpartitionnum=0
    AgentRun monitor OCF_SUCCESS

# fault injection does not work on the 1.0.4 client due to a hardcoded path
CASE "simulate hanging db2stop (not meaningful for 1.0.4 agent)"
    Include prepare
    AgentRun start OCF_SUCCESS
    Bash [ ! -f /usr/local/bin/db2stop ]
    BashAtExit rm /usr/local/bin/db2stop
    Bash echo -e "#!/bin/sh\necho fake db2stop\nsleep 1" >/usr/local/bin/db2stop
    Bash chmod +x /usr/local/bin/db2stop
    AgentRun stop OCF_SUCCESS
#
Re: [Linux-ha-dev] [PATCH] manage PostgreSQL 9.0 streaming replication using Master/Slave
On Mon, Feb 14, 2011 at 8:46 PM, Serge Dubrouski serge...@gmail.com wrote: On Mon, Feb 14, 2011 at 1:28 AM, Takatoshi MATSUO matsuo@gmail.com wrote: Ideally the demote operation should stop a master node and then restart it in hot-standby mode. It's up to the administrator to make sure that no node with outdated data gets promoted to the master role. One should follow standard procedures: cluster software shouldn't be configured for autostart at boot time, and the administrator has to make sure that data was refreshed if the node was down for some prolonged time. Hmm.. Do you mean that the RA puts recovery.conf in place automatically at demote, to start hot standby? Please give me some time to think it over. Sorry, I got the wrong idea about restoring data. Starting as hot-standby needs a restore every time, because the Time-line ID of PostgreSQL is incremented. In addition, shutting down PostgreSQL with the immediate option causes inconsistent WAL between primary and hot-standby. So I think it's difficult to start the slave automatically at demote. Still, do you think it's better to implement restoring? I'm afraid it's not just better, but a must. We have to play by Pacemaker's rules and that means that we have to properly implement the demote operation, and that means switching from Master to Slave, not just stopping the Master. I do appreciate your efforts, but the implementation has to conform to Pacemaker standards, i.e. the Master has to start where it's configured in Pacemaker, not just where a recovery.conf file exists. That's the ideal at least. Most of the time it should be possible to self-promote and let pacemaker figure out the result. But I can easily imagine there would also be situations where this is going to blow up in your face. The administrator has to be able to easily switch between node roles and so on. I still need some more time to learn PostgreSQL data replication and do some tests. Let's think about whether it's possible to implement a real Master/Slave in the Pacemaker sense of things.
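For context: a 9.0-era PostgreSQL starts as a hot standby when a recovery.conf is present in its data directory, which is what a demote-to-slave implementation would have to write. A minimal sketch with made-up connection values:

```
# $PGDATA/recovery.conf -- its presence makes PostgreSQL 9.0 start in
# standby mode; PostgreSQL renames it to recovery.done on promotion.
standby_mode = 'on'
primary_conninfo = 'host=192.168.1.1 port=5432 user=replicator'
trigger_file = '/tmp/promote_trigger'
```

Writing this file alone does not address the timeline/WAL-consistency problems Takatoshi describes; a fresh base backup from the new master may still be required before the standby can start.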
___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/ -- Serge Dubrouski.
Re: [Linux-ha-dev] ocft: status vs. monitor
On Sun, Feb 13, 2011 at 11:01 AM, Holger Teutsch holger.teut...@web.de wrote:

Hi, to my knowledge OCF *requires* a monitor method, while status is optional (or what is it really for? heritage, compatibility, ...). Shouldn't the ocft configs check monitor instead of status?

Yes, unless it's trying to talk to an LSB resource.

-holger

diff -r 722c8a7a03e9 tools/ocft/apache
--- a/tools/ocft/apache	Fri Feb 11 18:49:09 2011 +0100
+++ b/tools/ocft/apache	Sun Feb 13 10:57:50 2011 +0100
@@ -52,14 +52,14 @@
 	Include prepare
 	AgentRun stop OCF_SUCCESS

-CASE running status
+CASE running monitor
 	Include prepare
 	AgentRun start
-	AgentRun status OCF_SUCCESS
+	AgentRun monitor OCF_SUCCESS

-CASE not running status
+CASE not running monitor
 	Include prepare
-	AgentRun status OCF_NOT_RUNNING
+	AgentRun monitor OCF_NOT_RUNNING

 CASE unimplemented command
 	Include prepare

diff -r 722c8a7a03e9 tools/ocft/mysql
--- a/tools/ocft/mysql	Fri Feb 11 18:49:09 2011 +0100
+++ b/tools/ocft/mysql	Sun Feb 13 10:57:50 2011 +0100
@@ -46,14 +46,14 @@
 	Include prepare
 	AgentRun stop OCF_SUCCESS

-CASE running status
+CASE running monitor
 	Include prepare
 	AgentRun start
-	AgentRun status OCF_SUCCESS
+	AgentRun monitor OCF_SUCCESS

-CASE not running status
+CASE not running monitor
 	Include prepare
-	AgentRun status OCF_NOT_RUNNING
+	AgentRun monitor OCF_NOT_RUNNING

 CASE check lib file
 	Include prepare
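The convention the patch enforces can be sketched as follows. This is not an excerpt from any shipped agent: the agent name, pidfile path, and process check are illustrative assumptions; only the monitor/status semantics and the OCF return codes follow the spec.

```shell
#!/bin/sh
# Sketch of the monitor-vs-status convention: OCF mandates "monitor";
# "status" survives in many agents only as an LSB-compatibility alias.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

PIDFILE="${PIDFILE:-/tmp/myagent.pid}"

myagent_monitor() {
    # Running only if the pidfile exists and the process is still alive
    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

rm -f "$PIDFILE"          # demonstrate the "not running" path
myagent_monitor
echo "monitor rc=$?"      # 7 == OCF_NOT_RUNNING
```

In a full agent, the `case "$1"` dispatcher would route both `monitor` and `status` to the same function, which is exactly why ocft test cases should exercise `monitor`.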
Re: [Linux-ha-dev] New master/slave resource agent for DB2 databases in HADR (High Availability Disaster Recovery) mode
On Wed, Feb 9, 2011 at 12:15 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:
On Wed, Feb 09, 2011 at 12:06:04PM +0100, Florian Haas wrote:
On 2011-02-09 11:56, Dejan Muhamedagic wrote:

It is plugin-compatible with the old version of the agent.

Great! Unfortunately, we can't replace the old db2 now; the number of changes is very large:

db2 | 1076 +++-
1 file changed, 687 insertions(+), 389 deletions(-)

And the code is completely new (though I have no doubt that it is of excellent quality). So, I'd suggest adding this as another db2 RA. Once it gets some field testing we can mark the old one as deprecated. What name would you suggest? db2db2?

Just making sure: Is that a joke?

A bit of a joke, yes. But the alternatives such as db22 or db2new looked a bit boring.

I think boring is the least of our problems with those names. Are you going to change the name of every agent that gets a rewrite? IPaddr2-ng-ng-again-and-one-more-plus-one

Solicit feedback, like was done for kliend's new agent, and replace the existing one if/when people respond positively. That's the ideal at least. It's not like the old one disappears from the face of the earth after you merge the new one:

wget -O /usr/lib/ocf/resource.d/heartbeat/db2 http://hg.linux-ha.org/agents/file/agents-1.0.3/heartbeat/db2

HADR is a very different beast from a non-HADR db, right? Why not then add an hadr boolean parameter and use that instead of checking whether the resource has been configured as multi-state?

I'll take responsibility for suggesting the use of ocf_is_ms(), and I'd be curious to find out what you think is wrong with that approach.

There's nothing wrong in the sense of whether it is going to work. But someday db2 may sport, say, HADR2 or VHA or whatever else that may run as an ms resource. I just think it's better to make it obvious in the configuration that the user runs HADR. Does that make sense? Because if anything is, then the mysql RA needs fixing too.

No idea what's up with mysql.
Cheers, Dejan
Re: [Linux-ha-dev] New master/slave resource agent for DB2 databases in HADR (High Availability Disaster Recovery) mode
On Wed, Feb 9, 2011 at 2:17 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:

Hi Andrew,

On Wed, Feb 09, 2011 at 01:33:03PM +0100, Andrew Beekhof wrote:
On Wed, Feb 9, 2011 at 12:15 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:
On Wed, Feb 09, 2011 at 12:06:04PM +0100, Florian Haas wrote:
On 2011-02-09 11:56, Dejan Muhamedagic wrote:

It is plugin-compatible with the old version of the agent.

Great! Unfortunately, we can't replace the old db2 now; the number of changes is very large:

db2 | 1076 +++-
1 file changed, 687 insertions(+), 389 deletions(-)

And the code is completely new (though I have no doubt that it is of excellent quality). So, I'd suggest adding this as another db2 RA. Once it gets some field testing we can mark the old one as deprecated. What name would you suggest? db2db2?

Just making sure: Is that a joke?

A bit of a joke, yes. But the alternatives such as db22 or db2new looked a bit boring.

I think boring is the least of our problems with those names. Are you going to change the name of every agent that gets a rewrite? IPaddr2-ng-ng-again-and-one-more-plus-one

I don't think it is going to happen that often.

It happens often enough - it's just normally by a core developer. And realistically, almost every RA is going to get similar treatment (over time) as they're merged with the Red Hat ones.

Solicit feedback, like was done for kliend's new agent, and replace the existing one if/when people respond positively.

That would be for the best, but it takes time. We may opt for it, but I wanted to add this agent to the new release.

Understood - but I think the long-term pain that is created outweighs any perceived benefit in the short term.

Also, it is very seldom that people test anything which is not contained in the release. Unless there's no alternative, as was the case with conntrac.

It's not like the old one disappears from the face of the earth after you merge the new one.
wget -O /usr/lib/ocf/resource.d/heartbeat/db2 http://hg.linux-ha.org/agents/file/agents-1.0.3/heartbeat/db2

What do you suggest? That we add to the release announcement: "The db2 RA has been rewritten and hasn't yet had a lot of field testing. Please help test it."?

So don't do that :-) Put up a wiki page with instructions for how to download+use the new agent and give feedback. If the new version is significantly better, you're going to hear people pleading for its inclusion pretty soon.

But, if you want to keep the old agent, download the old one from the repository and use it instead of the new one. And don't forget to do the same when installing the next resource-agents release. At any rate, I wouldn't want to take responsibility for replacing the existing (and working) RA with completely new and not yet tested code. Call me a coward :)

I wouldn't either - which is why I keep saying test then replace :-) Another alternative: create a "testing" provider... not sure if it's a good idea or not, just putting it out there.

Finally, I expected that the new functionality was going to be added without many changes to the existing code. But it turned out to be a rewrite.

Cheers, Dejan

HADR is a very different beast from a non-HADR db, right? Why not then add an hadr boolean parameter and use that instead of checking whether the resource has been configured as multi-state?

I'll take responsibility for suggesting the use of ocf_is_ms(), and I'd be curious to find out what you think is wrong with that approach.

There's nothing wrong in the sense of whether it is going to work. But someday db2 may sport, say, HADR2 or VHA or whatever else that may run as an ms resource. I just think it's better to make it obvious in the configuration that the user runs HADR. Does that make sense? Because if anything is, then the mysql RA needs fixing too.

No idea what's up with mysql.
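The two approaches being debated can be sketched side by side. The first function approximates what ocf_is_ms() in ocf-shellfuncs does (infer master/slave mode from the clone meta attributes Pacemaker exports); the second is Dejan's proposed explicit parameter, which does not exist in the shipped agent — both bodies here are illustrative:

```shell
#!/bin/sh
# Approximation of ocf_is_ms(): a resource is running as master/slave when
# Pacemaker exports a positive master-max meta attribute to the agent.
is_ms() {
    [ -n "$OCF_RESKEY_CRM_meta_master_max" ] &&
    [ "$OCF_RESKEY_CRM_meta_master_max" -gt 0 ]
}

# Explicit-parameter alternative: the user declares HADR in the resource
# configuration instead of the agent inferring it from clone metadata.
hadr_enabled() {
    [ "${OCF_RESKEY_hadr:-false}" = "true" ]
}

# Simulate the environment Pacemaker would set for an ms resource
OCF_RESKEY_CRM_meta_master_max=1
if is_ms; then echo "running as a master/slave resource"; fi
```

The trade-off in the thread is exactly this: inference keeps the configuration short, while the explicit flag makes the operating mode visible in the CIB.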
Cheers, Dejan
Re: [Linux-ha-dev] New master/slave resource agent for DB2 databases in HADR (High Availability Disaster Recovery) mode
On Wed, Feb 9, 2011 at 3:35 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote:
On Wed, Feb 09, 2011 at 02:43:17PM +0100, Andrew Beekhof wrote:

Are you going to change the name of every agent that gets a rewrite? IPaddr2-ng-ng-again-and-one-more-plus-one

I don't think it is going to happen that often.

It happens often enough - it's just normally by a core developer. And realistically, almost every RA is going to get similar treatment (over time) as they're merged with the Red Hat ones.

Solicit feedback, like was done for kliend's new agent, and replace the existing one if/when people respond positively. It's not like the old one disappears from the face of the earth after you merge the new one:

wget -O /usr/lib/ocf/resource.d/heartbeat/db2 http://hg.linux-ha.org/agents/file/agents-1.0.3/heartbeat/db2

What do you suggest? That we add to the release announcement: "The db2 RA has been rewritten and hasn't yet had a lot of field testing. Please help test it."?

So don't do that :-) Put up a wiki page with instructions for how to download+use the new agent and give feedback.

How about a staging area? /usr/lib/ocf/resource.d/staging/

I was thinking along the same lines when I said "testing". Either name works for me :-)

We can also add a /usr/lib/ocf/resource.d/deprecated/. The thing in .../heartbeat/ can become a symlink, and be given config-file status by the package manager?

Something like that.

So we have it bundled with the release; it is readily available without much "go to that web page and download and save to there and make executable and then blah". It would simply pop up in the crm shell and DRBD-MC and so on. We can add a "please give feedback" note to the description, and a "this will replace the current RA with release + 2 unless we get veto-ing feedback" note to the release notes.
Once settled, we copy the staging one over to the real directory, replacing the original one, and add a "please fix your config" note to the thing that remains in staging/, so we will be able to start a further rewrite with the next merge window.

* does not break existing setups
* new RAs and rewrites are readily available

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
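The staging-plus-symlink scheme Lars proposes can be exercised in a scratch directory; the layout below stands in for /usr/lib/ocf/resource.d and the toy "agent" is a placeholder, since none of this was ever shipped:

```shell
#!/bin/sh
# Demonstrates the proposed layout: the rewrite ships in staging/, while
# the name tools look up in heartbeat/ becomes a symlink to it.
root=$(mktemp -d)
mkdir -p "$root/staging" "$root/heartbeat" "$root/deprecated"

printf '#!/bin/sh\necho new-db2\n' > "$root/staging/db2"
chmod +x "$root/staging/db2"

# heartbeat/db2 stays a valid agent path but resolves to the staged copy
ln -s ../staging/db2 "$root/heartbeat/db2"

out=$("$root/heartbeat/db2")
echo "$out"               # prints: new-db2
rm -rf "$root"
```

Flipping the symlink (or replacing it with the real file, as the package manager's "config file status" would allow) is what makes the eventual promotion from staging/ non-disruptive.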
Re: [Linux-ha-dev] Patch: fix wrong variable names in Dummy agent
Applied to Pacemaker too. Thanks!

On Wed, Jan 19, 2011 at 6:45 PM, Holger Teutsch holger.teut...@web.de wrote:

Hi, small fix.

-holger

# HG changeset patch
# User Holger Teutsch holger.teut...@web.de
# Date 1295458942 -3600
# Node ID f9bb7dc26c80aaae2711a1b66e1af7a92d33bbc6
# Parent 2b5603283560ca1c895d610a85155ddde198019e
Low: Dummy: migrate_from/to: correct OCF_RESKEY_CRM_meta_migrate_xxx variable names

diff -r 2b5603283560 -r f9bb7dc26c80 heartbeat/Dummy
--- a/heartbeat/Dummy	Tue Jan 18 18:01:33 2011 +0100
+++ b/heartbeat/Dummy	Wed Jan 19 18:42:22 2011 +0100
@@ -143,10 +143,10 @@
 start)		dummy_start;;
 stop)		dummy_stop;;
 monitor)	dummy_monitor;;
-migrate_to)	ocf_log info "Migrating ${OCF_RESOURCE_INSTANCE} to ${OCF_RESKEY_CRM_meta_migrate_to}."
+migrate_to)	ocf_log info "Migrating ${OCF_RESOURCE_INSTANCE} to ${OCF_RESKEY_CRM_meta_migrate_target}."
 	dummy_stop
 	;;
-migrate_from)	ocf_log info "Migrating ${OCF_RESOURCE_INSTANCE} to ${OCF_RESKEY_CRM_meta_migrated_from}."
+migrate_from)	ocf_log info "Migrating ${OCF_RESOURCE_INSTANCE} from ${OCF_RESKEY_CRM_meta_migrate_source}."
 	dummy_start
 	;;
 reload)	ocf_log err "Reloading..."
Re: [Linux-ha-dev] Antwort: Re: Antwort: Re: OCF RA dev guide: final heads up
On Mon, Dec 13, 2010 at 4:32 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:

Hi,

On Fri, Dec 10, 2010 at 01:48:26PM +0100, Florian Haas wrote:
On 2010-12-10 13:42, alexander.kra...@basf.com wrote:

So, the best thing would be, as you already said: remove it from the environment. I could just save your time answering stupid questions.

Seconded.

@Florian: Isn't OCF_CHECK_LEVEL also missing from the guide? And thank you very much for section 9.4 (fits my questions from yesterday) :-)

OCF_CHECK_LEVEL is such a terrible abomination that I refuse to write about it. Not until lmb has written his updated OCF spec, we've discussed and approved of it, and it's _still_ in there (which I doubt).

While we're at it... Andrew, could you pass the OCF_RESKEY_CRM_meta_depth variable? Then we can update the resource agents and the documentation.

You mean create one and pass it? No such thing currently exists.
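For readers who have not met the abomination in question: OCF_CHECK_LEVEL (the "depth" concept behind the OCF_RESKEY_CRM_meta_depth request above) selects how deep a monitor probe goes. A hedged sketch of the usual branching — the level values follow common RA practice such as the Filesystem agent, but the check bodies here are made-up placeholders:

```shell
#!/bin/sh
# Illustrative only: deeper levels cost more but catch more failure modes.
monitor_depth() {
    case "${OCF_CHECK_LEVEL:-0}" in
        0)  echo "process check" ;;      # cheapest: is the service running?
        10) echo "read-only probe" ;;    # e.g. read a status file
        20) echo "read-write probe" ;;   # e.g. write a token and re-read it
        *)  echo "unknown level" ;;
    esac
}

OCF_CHECK_LEVEL=10
monitor_depth      # prints: read-only probe
```

The debate in the thread is precisely that Pacemaker would need to export a depth meta attribute for agents to receive anything other than the default level.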
Re: [Linux-ha-dev] OCF RA dev guide: final heads up
On Fri, Dec 10, 2010 at 12:06 PM, Florian Haas florian.h...@linbit.com wrote:
On 2010-12-08 18:15, alexander.kra...@basf.com wrote:

Hi Florian,

Section 5.10: The variables are missing a "notify". It is OCF_RESKEY_CRM_meta_notify_start_uname, not OCF_RESKEY_CRM_meta_start_uname.

Thanks! Fixed. http://people.linbit.com/~florian/ra-dev-guide/_literal_notify_literal_action.html

There is also the same set of variables that end in _resource. I'll leave those out for now, as those have never been of any practical relevance to me. If you're actually using them in an agent, please do let me know.

Section 6.2: I think this statement: "should never be changed by a resource agent" is in conflict with Section 4.3.

No it's not. 4.3 says you can override it _from the command line_; 6.2 says the resource agent should not modify it.

Section 8.4: Statement: "Stateful (master/slave) resources may influence their own master preference". IMHO they _must_ influence their own master preference. If not, they will never be promoted.

Correct. Fixed. http://people.linbit.com/~florian/ra-dev-guide/_specifying_a_master_preference.html

A note that 'crm_mon -A' shows the current values might also be very helpful.

Nope. I'll try not to talk about Pacemaker-specific binaries too much.

No section: Is there a reason why the environment variable 'OCF_RESKEY_CRM_meta_role', which is set in the monitor action, isn't mentioned anywhere?

Make a good case for it to be explained, and convince me that it won't just serve to confuse everybody, and I'll include it. My best guess, however, is that once I do include it, we'll see a lot of

monitor() {
    if [ $OCF_RESKEY_CRM_meta_role = Master ]; then
        return $OCF_RUNNING_MASTER
    fi
    ...
}

... and that's clearly nonsense.

And a good way to ensure I strip it from the environment :-)
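The crm_master command asked about above is Pacemaker's wrapper for setting a node's promotion score. A hedged sketch of the usual pattern in an agent — the commands are echoed rather than executed so the shape is visible without a running cluster, and the score 100 is purely illustrative:

```shell
#!/bin/sh
# Sketch only: a real agent would invoke crm_master directly.
set_master_score() {
    # a higher score makes this node preferred for promotion
    echo crm_master -l reboot -v "$1"
}
clear_master_score() {
    # drop the preference, e.g. when replication to this node is broken
    echo crm_master -l reboot -D
}

set_master_score 100
clear_master_score
```

Agents typically call the setter from monitor or notify, so the score tracks the node's actual fitness to be promoted.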
Re: [Linux-ha-dev] LF #1943: alternative implementation of an apache resource agent
On Thu, Dec 9, 2010 at 9:12 AM, Florian Haas florian.h...@linbit.com wrote:
On 2010-12-07 22:08, Andrew Beekhof wrote:
On Tue, Dec 7, 2010 at 4:17 PM, Florian Haas florian.h...@linbit.com wrote:

Folks, shamefully we've had a resource agent contribution sitting around in the LF bugzilla for almost 2 1/2 years without acting on it. It's an alternative implementation of an Apache httpd resource agent: http://developerbugs.linux-foundation.org/show_bug.cgi?id=1943 (I just renamed the bug; previously it was something totally misleading about Debian lenny.) Any objections to getting this RA into decent late-2010 shape and adding it to the repo as ocf:heartbeat:apache2?

Please, not a repeat of ipaddr. Can we not replace the old one and include some compatibility code (that we can remove eventually)?

I believe you sat across the table from me when we discussed resource agent deprecation? Both apache and IPaddr are storybook examples for that.

That's fine, but you didn't actually say that you planned for the existing one to be deprecated. Here's a thought: why don't we hold off a fraction longer and put this one in the new combined namespace, and thus avoid needing the "2" suffix forevermore.

I really have no intention of sanitizing ocf:heartbeat:apache and then leaving things like two parameters named testregex and testregex10 around for compatibility reasons. No sir, I do not.

Florian
Re: [Linux-ha-dev] LF #1943: alternative implementation of an apache resource agent
On Thu, Dec 9, 2010 at 7:28 PM, Florian Haas florian.h...@linbit.com wrote:
On 12/09/2010 10:45 AM, Andrew Beekhof wrote:
On Thu, Dec 9, 2010 at 9:12 AM, Florian Haas florian.h...@linbit.com wrote:
On 2010-12-07 22:08, Andrew Beekhof wrote:
On Tue, Dec 7, 2010 at 4:17 PM, Florian Haas florian.h...@linbit.com wrote:

Folks, shamefully we've had a resource agent contribution sitting around in the LF bugzilla for almost 2 1/2 years without acting on it. It's an alternative implementation of an Apache httpd resource agent: http://developerbugs.linux-foundation.org/show_bug.cgi?id=1943 (I just renamed the bug; previously it was something totally misleading about Debian lenny.) Any objections to getting this RA into decent late-2010 shape and adding it to the repo as ocf:heartbeat:apache2?

Please, not a repeat of ipaddr. Can we not replace the old one and include some compatibility code (that we can remove eventually)?

I believe you sat across the table from me when we discussed resource agent deprecation? Both apache and IPaddr are storybook examples for that.

That's fine, but you didn't actually say that you planned for the existing one to be deprecated.

Oh sorry, I thought the implication was obvious. :)

Here's a thought: why don't we hold off a fraction longer and put this one in the new combined namespace, and thus avoid needing the "2" suffix forevermore.

Hm. Well, I think I'll grudgingly agree. Since we've already let this sit on the shelf for two years, I guess two more months or so doesn't make much difference...

Yeah, might be something we want to think about creating a strategy for, though. Not every day will we be opening up a new namespace.
Re: [Linux-ha-dev] LF #1943: alternative implementation of an apache resource agent
On Tue, Dec 7, 2010 at 4:17 PM, Florian Haas florian.h...@linbit.com wrote:

Folks, shamefully we've had a resource agent contribution sitting around in the LF bugzilla for almost 2 1/2 years without acting on it. It's an alternative implementation of an Apache httpd resource agent: http://developerbugs.linux-foundation.org/show_bug.cgi?id=1943 (I just renamed the bug; previously it was something totally misleading about Debian lenny.) Any objections to getting this RA into decent late-2010 shape and adding it to the repo as ocf:heartbeat:apache2?

Please, not a repeat of ipaddr. Can we not replace the old one and include some compatibility code (that we can remove eventually)?

Cheers, Florian
Re: [Linux-ha-dev] OCF Resource Agent Developer's Guide (Draft)
On Mon, Nov 22, 2010 at 2:40 PM, Florian Haas florian.h...@linbit.com wrote:
On 2010-11-22 11:02, alexander.kra...@basf.com wrote:

Hi Florian, I read through your guide briefly. A very good aggregation and definitely a good starting point for beginning one's own development. Probably you could go into more detail about Master/Slave resources? Concretely, I am missing something about the usage of the crm_master command.

Thanks for the tip. Does this work for you? http://people.linbit.com/~florian/ra-dev-guide/_special_considerations.html#_specifying_a_master_preference

And also something about the monitor function in an M/S case. E.g. should an M/S resource return OCF_NOT_RUNNING or OCF_FAILED_MASTER if there are no more processes of the resource on the node?

Well, to be honest, I don't know how $OCF_NOT_RUNNING and $OCF_FAILED_MASTER are _expected_ to be handled differently, as master/slave resources never made it back into the OCF spec. Maybe Andrew can shed some light on this: is $OCF_FAILED_MASTER expected to mean "some failure occurred while the resource was running in master mode", or "some problem occurred so that the resource is no longer running in master mode while it is expected to, but it is otherwise running normally"?

If only it were documented somewhere like: http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-ocf-return-codes.html

In the former case, I'd expect demote-stop-start-promote as the recovery action; in the latter, just demote-promote. Or am I completely off the mark?

How should the agent remember its last status?

It really shouldn't. That's the cluster manager's job. All the agent has to deliver is the _current_ status, by means of the monitor action.
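The return-code values being debated are fixed by Pacemaker's documentation (OCF_SUCCESS=0, OCF_NOT_RUNNING=7, OCF_RUNNING_MASTER=8, OCF_FAILED_MASTER=9); the little classifier around them below is purely illustrative, not code from any agent:

```shell
#!/bin/sh
# Constants per the Pacemaker OCF return-code table; helper is a sketch.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7
OCF_RUNNING_MASTER=8
OCF_FAILED_MASTER=9

# Hypothetical classification: distinguish "no process at all" from
# "running as master but unhealthy" from "healthy master/slave".
classify_monitor() {
    running=$1; is_master=$2; healthy=$3
    if [ "$running" = no ]; then
        echo $OCF_NOT_RUNNING
    elif [ "$is_master" = yes ] && [ "$healthy" = no ]; then
        echo $OCF_FAILED_MASTER
    elif [ "$is_master" = yes ]; then
        echo $OCF_RUNNING_MASTER
    else
        echo $OCF_SUCCESS
    fi
}

classify_monitor no  no  no    # prints 7
classify_monitor yes yes yes   # prints 8
```

Under this reading, Florian's question boils down to which of these branches a dead master should land in, and hence which recovery sequence Pacemaker schedules.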
Cheers, Florian
Re: [Linux-ha-dev] a scalable membership and LRM proxy proposal
On Tue, Nov 16, 2010 at 4:23 PM, Alan Robertson al...@unix.sh wrote:

Thanks for this information. This is _SOOO_ much better than trying to dig it all out of the web site.

On 11/16/2010 03:04 AM, Andrew Beekhof wrote:
Alan Robertson wrote:

I was hoping for something a little more lightweight - although I clearly understand the benefits of "it already exists" and having some credible claims to security as a goal (since nothing is ever secure).

I wonder if you really want that kind of very strongly guaranteed message delivery

Not always, possibly not ever. But happily this is configurable, so we'd only ask for those guarantees if we needed them.

That's good to know. Is this an Apache extension, or is this part of the standard?

Not sure.

Does Qpid support IPv6?

RH requires everything we ship to support v6, so I'd be highly surprised if it didn't.

- since messages sent to a node that crashes before receiving them are delivered after it comes back up. But, of course, there's always a way to work around things that don't do what you need them to. Presumably you'd also need to clean those messages out of the queues of all senders if the node is going away permanently - at least once you figure that out... Messages to clients seem to better match the semantics of RDS. Messages back to overlords could use AMQP without obvious corresponding issues.

I wonder about latency - particularly when federated - and taking garbage collection into account... I see that Qpid claims to be extremely fast. It probably is pretty fast for a large and complex Java program. Here are the numbers from their website:

"Red Hat MRG product built on Qpid has shown 760,000 msg/sec ingress on an 8 way box or 6,000,000 msg/sec"

Is there something missing from this sentence, or am I just dense? I'm guessing that this is intended to imply that it can process 760K msgs/sec per CPU, giving a projected 6M msgs/sec for an 8-way...
"Latencies have been recorded as low as 180-250us (.18ms-.3ms) for TCP round trip and 60-80us for RDMA round trip using the C++ broker"

For latencies, something more like 99th-percentile guarantees are a better measure than best-case latencies. And if it uses TCP, then the overhead of holding 10K TCP connections open at once seems a bit high - just to do nothing most of the time...

This model is different from the design point for this protocol. I expect that most of the time these connections would sit idle. You'd have to talk to the qpid/qmf teams about how the numbers were obtained; I just copied them verbatim off the website. Current indications from them are that they can easily handle the workloads we're planning for.

One of the cool things about the proposal I made is that the overlords incur near-zero ongoing overhead to monitor a very, very big network, and no network congestion. The work to do this monitoring is spread pretty evenly among all the nodes in the system, such that no node has to keep track of more than a handful of peers (most only have two peers - it looks like it could be bounded to 4 peers worst case). Ring-structured heartbeat communication looks like it should work out very well.

It's not that I'm against your proposal, I just don't know of enough resources to build, test and stabilize a new communication protocol. In that context, an off-the-shelf component that gives us a couple of orders of magnitude of additional scaling looks pretty attractive - and should provide some valuable feedback for how to take it to the next level.

Nevertheless, I see the attraction. Not sure it's what I want, but since I don't know yet quite what I want, that would be hard to say :-).

Yep, nothing forcing everyone down the same path.

Got that. I see advantages to having at least some common APIs/libraries/interfaces/something. Cross-pollination of ideas is good.
Sharing code and having alternatives is better - if not too expensive in code, organizational overhead and emotional energy.

Thanks for taking the time to share ideas and educate me.

NP.
Re: [Linux-ha-dev] [PATCH 06 of 10] cl_log: Always print the?common log entity to syslog messages
On Wed, Nov 17, 2010 at 9:03 AM, Lars Ellenberg lars.ellenb...@linbit.com wrote:
On Wed, Nov 17, 2010 at 08:21:03AM +0100, Andrew Beekhof wrote:
On Tue, Nov 16, 2010 at 2:27 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:

Hi,

On Mon, Nov 15, 2010 at 06:18:29PM +0100, Bernd Schubert wrote:
On Monday, November 15, 2010, Dejan Muhamedagic wrote:

This may truncate the entity, and of course breaks existing filtering setups that trigger on it.

Right. So, this needs to be optional.

Ok, any favourite option keyword in logd.cf? If you ask me, the present logd.cf *suggests* that not adding the common log entity is a bug:

quote
# Entity to be shown at beginning of a message
# for logging daemon
# Default: logd
entity logd
/quote

Was that option only for its own messages?

Yes, and the default in case the client didn't supply its entity.

Isn't that a bit over-due?

I guess you meant overkill? Perhaps. So maybe "entity none" should suppress it in the future?

We can't change the semantics. So, we need a new option name. extra_entity? common_entity?

FWIW, syslog_facility works the way Bernd proposed.

It does not, at least not if you don't have it exclusively.

Pretty sure it does; I recall adding the code, since there was no other way to turn syslog logging off from ha.cf. Has it changed since?

That has been his starting point.
Re: [Linux-ha-dev] a scalable membership and LRM proxy proposal
On Tue, Nov 16, 2010 at 7:06 AM, Alan Robertson al...@unix.sh wrote:

Hi, I missed the "through federation" part. Sorry... As a point of comparison, the proposal as described on my blog does not require federation. Probably at least as scalable, and it's very probable that it's lower latency - and it's pretty much dead certain that it's lower traffic on the network.

https://cwiki.apache.org/qpid/faq.html#FAQ-Heartbeats "Heartbeats are sent by the broker at a client specified, per-connection frequency"

I doubt that the granularity would be 1s for most use-cases.

Certainly there is scope to tune it to match your network.

I assume that QMF is the Qpid Management Framework found here? https://cwiki.apache.org/qpid/qpid-management-framework.html

Correct.

I was hoping for something a little more lightweight - although I clearly understand the benefits of "it already exists" and having some credible claims to security as a goal (since nothing is ever secure).

I wonder if you really want that kind of very strongly guaranteed message delivery

Not always, possibly not ever. But happily this is configurable, so we'd only ask for those guarantees if we needed them.

- since messages sent to a node that crashes before receiving them are delivered after it comes back up. But, of course, there's always a way to work around things that don't do what you need them to. Presumably you'd also need to clean those messages out of the queues of all senders if the node is going away permanently - at least once you figure that out... Messages to clients seem to better match the semantics of RDS. Messages back to overlords could use AMQP without obvious corresponding issues. I wonder about latency - particularly when federated - and taking garbage collection into account... I see that Qpid claims to be extremely fast. It probably is pretty fast for a large and complex Java program.
Here are the numbers from their website:

"Red Hat MRG product built on Qpid has shown 760,000 msg/sec ingress on an 8 way box or 6,000,000 msg/sec"

"Latencies have been recorded as low as 180-250us (.18ms-.3ms) for TCP round trip and 60-80us for RDMA round trip using the C++ broker"

Nevertheless, I see the attraction. Not sure it's what I want, but since I don't know yet quite what I want, that would be hard to say :-).

Yep, nothing forcing everyone down the same path.
Re: [Linux-ha-dev] [PATCH 06 of 10] cl_log: Always print the?common log entity to syslog messages
On Tue, Nov 16, 2010 at 2:27 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:

Hi,

On Mon, Nov 15, 2010 at 06:18:29PM +0100, Bernd Schubert wrote:
On Monday, November 15, 2010, Dejan Muhamedagic wrote:

This may truncate the entity, and of course breaks existing filtering setups that trigger on it.

Right. So, this needs to be optional.

Ok, any favourite option keyword in logd.cf? If you ask me, the present logd.cf *suggests* that not adding the common log entity is a bug:

quote
# Entity to be shown at beginning of a message
# for logging daemon
# Default: logd
entity logd
/quote

Was that option only for its own messages?

Yes, and the default in case the client didn't supply its entity.

Isn't that a bit over-due?

I guess you meant overkill? Perhaps. So maybe "entity none" should suppress it in the future?

We can't change the semantics. So, we need a new option name. extra_entity? common_entity?

FWIW, syslog_facility works the way Bernd proposed.
Re: [Linux-ha-dev] a scalable membership and LRM proxy proposal
Some of your thinking mirrors our own. What we're moving towards is indeed two tiers of membership. One being a small but fully meshed set of, to use your terminology, Overlords running a traditional cluster stack. The other being a much larger set of independent nodes or VMs running only an lrm-like proxy. Members of the second tier have no knowledge of each other's existence, nor even of the cluster itself. The transport layer we plan on using to talk to these nodes is QMF (which implements AMQP). QMF has the nice properties of being cross-platform (i.e. Windows), standards-based and something that already exists. We also know that it is secure, fast, and scales well through federation. Happily it also gives us node up/down information for free. As Lars mentioned, a Matahari agent (essentially the lrm with a QMF interface on top) is intended to act as the proxy. He also mentioned container resources, but this was a red herring. Whether the entities running Matahari are also guests being managed by Pacemaker is irrelevant. They can equally be physical machines or cloud instances. The Matahari and QMF pieces are both generically useful components with no ties to Pacemaker. There will still need to be integration work done to hook up the node liveness information and add the ability to send resource commands via the QMF bus. What form this work takes will depend on which parts of Pacemaker are being used in the overall architecture. On Thu, Nov 4, 2010 at 3:48 PM, Alan Robertson al...@unix.sh wrote: I've been thinking about the idea of very highly scalable membership, and also about the LRM proxy function which is currently being performed by Pacemaker. Towards this end I wrote up a high-level design (or architecture, or design philosophy, or something) for such a scalable membership/LRM proxy service. The design is not specific to working with Pacemaker - it could work with Pacemaker, or a number of other kinds of management entities. 
The kind of membership outlined here would be (in Pacemaker terms) sort of a second-class membership - which has advantages and disadvantages. The blog post can be found here: http://techthoughts.typepad.com/managing_computers/2010/10/big-clusters-scalable-membership-proposal.html Please feel free to comment on it on the blog, or on the mailing list. I've reproduced the blog posting below: Really Big Clusters: A Scalable membership proposal This blog entry is a bit different from previous entries - I'm proposing some enhanced capabilities to go with the LRM and friends from the Linux-HA project. I will update this entry on an ongoing basis to match my current thinking about this proposal. This post outlines a proposed server liveness (membership) design which is intended to scale up to tens of thousands of servers managed as a single entity. Scalability depends on a lot of factors - processor overhead, network bandwidth, and network load. A highly scalable system will take all of these factors into account. From the perspective of the server software author (like, for example, me), one of the easiest to overlook is network load. Network load depends on a number of factors - the number of packets, the size of the packets, how many switches or routers they have to go through, and how many endpoints will receive each packet. To best accomplish this task, it is desirable that the majority of normal traffic be network topology aware. To scale up to very large collections of computers, it is also necessary that as much as possible be monitored as locally as possible. In addition, since switching gear is not optimized for multicast packets, and multicast packets consume significant resources when compared to unicast packets, it is desirable to avoid using multicast packets during normal operation. 
The Basic Concept - network aware liveness Although the LRM in Linux-HA is not network-enabled, it tries to minimize monitoring overhead by distributing the task of monitoring resources to the machine providing the resource, and only reporting failures upstream. To extend this idea of local monitoring into system liveness monitoring, one might imagine a standard 48-port network switch with 48 servers attached to it. If one were to choose a server to act as proxy for monitoring the servers on that switch, then the other 47 nodes on the switch would send unicast heartbeats to that node. That node in turn would report failure-to-receive-heartbeat events from the other 47 nodes on the switch. To ensure detection of failures of the monitoring node itself, that node could send its heartbeat upstream to a process monitoring it. In order to ensure continual service, it may be desirable to have two monitoring nodes per switch, with each one also monitoring the other. This results in a 24-to-one reduction in traffic going off the switch, and a corresponding decrease in the workload to monitor these 48 servers. If one were to implement this in the context
Re: [Linux-ha-dev] incorrect diff of sysinfo.txt
On Wed, Sep 22, 2010 at 4:07 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: Hi, On Wed, Sep 22, 2010 at 12:50:53PM +0200, Raoul Bhatia [IPAX] wrote: running debian 5.0.6 and cluster-glue 1.0.6-1~bpo50+1, hb_report's analysis.txt shows: Diff sysinfo.txt... --- /root/move_ftp_group/wc01/sysinfo.txt 2010-09-22 12:17:21.0 +0200 +++ /root/move_ftp_group/wc02/sysinfo.txt 2010-09-22 12:17:21.0 +0200 @@ -2,7 +2,7 @@ cluster-glue: 1.0.6 (1c87a0c58c59fc384b93ec11476cefdbb6ddc1e1) resource-agents: # Build version: 5ae70412eec8099b25e352110596dd279d267a8a CRM Version: 1.0.9 (74392a28b7f31d7ddc86689598bd23114f58978b) - 1.0.9.1+hg15626-1~bpo50+1 1.2.1-1~bpo50+1 1.0.6-1~bpo50+1 1:3.0.3-2~bpo50+1 1:3.0.3-2~bpo50+1 2.02.39-8Platform: Linux + 1.0.9.1+hg15626-1~bpo50+1 1.2.1-1~bpo50+1 1.0.6-1~bpo50+1 1:3.0.3-2~bpo50+1 1:3.0.3-2~bpo50+1 2.02.39-8 Platform: Linux Kernel release: 2.6.27.54+ipax Architecture: x86_64 Distribution: Description: Debian GNU/Linux 5.0.6 (lenny) There is one space missing before Platform: Linux. I do not know where this difference comes from, but what about using diff -wu instead of diff -u in txtdiff()? We could try that, but it would be better to fix the actual output. It seems like part of the output of crmd version (perhaps others too) goes to stderr, which is then mixed randomly with the stdout material. Andrew: fprintf(stderr, CRM Version: ); fprintf(stdout, %s (%s)\n, VERSION, BUILD_VERSION); Is that intentional? I probably had some notion of allowing just the version to be captured in a shell variable. Happy if you want to change it to stdout instead.
Re: [Linux-ha-dev] About movement of the Quorum control.
On Fri, Aug 27, 2010 at 8:12 AM, renayama19661...@ybb.ne.jp wrote: Hi Andrew, Thank you for the comment. I doubt it, I think development on heartbeat is at an end and maintenance is limited to regressions. We understand that well enough. The biggest problem is that none of us actually understand the CCM. I looked at it once a long time ago and I've never been so frightened in my life. But there's nothing stopping someone at NTT becoming a CCM expert ;-) However, it is very difficult for us to wait for corosync to be stable. It's pretty close these days. It does lack ucast and bcast support, but there are plans to address that outside of corosync. But maybe lge would like to comment further. Well. Let's wait for more comment. Best Regards, --- Andrew Beekhof and...@beekhof.net wrote: On Wed, Aug 25, 2010 at 3:20 AM, renayama19661...@ybb.ne.jp wrote: Hi Developers of Heartbeat, When we combine Pacemaker with Heartbeat, we find that Quorum control does not work well. For example, it occurs in a cluster of multiple nodes whenever we set anything other than no-quorum-policy=ignore. We know that this is a long-standing, already-known problem. It occurs because of differences in the timing of detection when the nodes are partitioned. We think the problem lies in Heartbeat. * There may be a problem in CCM. * The reason is that the problem does not occur when Pacemaker is combined with corosync, which reports node partitions reliably. Many of our users are going to use the Quorum control in Heartbeat. Heartbeat has to notify Pacemaker of changes in node membership, as corosync does. Is there a plan for when Quorum control in Heartbeat will be made correct? I doubt it, I think development on heartbeat is at an end and maintenance is limited to regressions. But maybe lge would like to comment further. 
Re: [Linux-ha-dev] About movement of the Quorum control.
On Fri, Aug 27, 2010 at 4:04 PM, Lars Marowsky-Bree l...@novell.com wrote: On 2010-08-27T09:53:32, Andrew Beekhof and...@beekhof.net wrote: However, it is very difficult for us to wait for corosync to be stable. It's pretty close these days. It does lack ucast and bcast support, but there are plans to address that outside of corosync. The current corosync is really quite stable, certainly as stable as CCM. If it isn't, there's at least someone we can ask for active help. bcast is handled by corosync, by the way; and unicast support may also be coming. I'm curious what you mean by outside of corosync, though - how would the network protocol be addressed outside of corosync? http://www.linuxplumbersconf.org/2010/ocw/proposals/1065 Basically it's intended to be a generically useful layer sitting underneath corosync. Steve was skeptical at first but is apparently on board with the idea now - to the point where it's the preferred approach for providing ucast capabilities.
Re: [Linux-ha-dev] About movement of the Quorum control.
On Fri, Aug 27, 2010 at 4:04 PM, Lars Marowsky-Bree l...@novell.com wrote: bcast is handled by corosync, by the way; and unicast support may also be coming. Yeah, but it's not supported on RHEL, which means he's not actively testing it or fixing bugs. Your mileage may vary :-)
Re: [Linux-ha-dev] About movement of the Quorum control.
On Wed, Aug 25, 2010 at 3:20 AM, renayama19661...@ybb.ne.jp wrote: Hi Developers of Heartbeat, When we combine Pacemaker with Heartbeat, we find that Quorum control does not work well. For example, it occurs in a cluster of multiple nodes whenever we set anything other than no-quorum-policy=ignore. We know that this is a long-standing, already-known problem. It occurs because of differences in the timing of detection when the nodes are partitioned. We think the problem lies in Heartbeat. * There may be a problem in CCM. * The reason is that the problem does not occur when Pacemaker is combined with corosync, which reports node partitions reliably. Many of our users are going to use the Quorum control in Heartbeat. Heartbeat has to notify Pacemaker of changes in node membership, as corosync does. Is there a plan for when Quorum control in Heartbeat will be made correct? I doubt it, I think development on heartbeat is at an end and maintenance is limited to regressions. But maybe lge would like to comment further.
Re: [Linux-ha-dev] [PATCH] IPv6addr: removing libnet dependency
On Fri, Jul 30, 2010 at 11:42 AM, Keisuke MORI keisuke.mori...@gmail.com wrote: 2010/7/27 Andrew Beekhof and...@beekhof.net: On Tue, Jul 27, 2010 at 8:44 AM, Keisuke MORI keisuke.mori...@gmail.com wrote: For heartbeat, I personally like pacemaker on in ha.cf :) One thing that's coming in 1.1.3 is an mcp (master control process) and associated init script for pacemaker. This means that Pacemaker is started/stopped independently of the messaging layer. Currently this is only written for corosync[1], but I've been toying with the idea of extending it to Heartbeat. In which case, if you're already changing the option, you might want to make it: legacy on/off Where off would be the equivalent of starting with -M (no resource management) but wouldn't spawn any daemons. Thoughts? I have several concerns with that change: 1) Is it possible to recover or cause a fail-over correctly when any of the Pacemaker/Heartbeat processes fails? (In particular, for a failure of the new mcp process of pacemaker, and for a failure of the current heartbeat MCP process) If the MCP dies, so will the crmd and cib (and by extension, everything else except the PE and LRMd). These types of failures are well tested. Failure of heartbeat also results in the same types of secondary failures and recovery as we see now. 2) Would the daemons used with the respawn directive, such as hbagent (SNMP daemon) or pingd, keep working compatibly? Pingd will no longer exist in 1.1, if I've not removed it already. It is completely replaced by ocf:pacemaker:ping hbagent might have to be a bit smarter about what to do when the cib isn't around, but otherwise it shouldn't be a problem. 3) After all, what would be the benefit for end users with the change? For corosync-based clusters it's a clear win - we get a much more reliable startup/shutdown sequence. It's also far more obvious what is happening at each stage, and one can also stop all resources on a node without taking the node offline. 
If we changed it for heartbeat, it would be mostly for consistency. I feel like it's only adding some complexity to the operations and the diagnostics by the end users. I guess that I would only use legacy on on the heartbeat stack... Correct.
Re: [Linux-ha-dev] [PATCH] IPv6addr: removing libnet dependency
On Tue, Jul 27, 2010 at 8:44 AM, Keisuke MORI keisuke.mori...@gmail.com wrote: 2010/7/26 Lars Ellenberg lars.ellenb...@linbit.com: On Mon, Jul 26, 2010 at 06:39:50PM +0900, Keisuke MORI wrote: By the way, do we have any plan to release the next agents/glue/heartbeat packages from the Linux-HA project? I think it's a good time to consider them for the best use of pacemaker-1.0.9. I think glue was released by dejan just before he went on vacation, though the release announcement is missing (1.0.6). Heartbeat does not have many changes (apart from some cleanup in the build dependencies), so there is no urge to release a 3.0.4, but we could do so any time. Agents has a few fixes, but also has some big changes. I have to take another close look, but yes, I think we should release an agents 1.0.4 within the next few weeks. Great! Then let's go for the next release for agents/heartbeat along with glue. My main concern about agents is LF#2378: http://developerbugs.linux-foundation.org/show_bug.cgi?id=2378 It is a change, but it's a necessary change to make the maintenance mode work fine. For heartbeat, I personally like pacemaker on in ha.cf :) One thing that's coming in 1.1.3 is an mcp (master control process) and associated init script for pacemaker. This means that Pacemaker is started/stopped independently of the messaging layer. Currently this is only written for corosync[1], but I've been toying with the idea of extending it to Heartbeat. In which case, if you're already changing the option, you might want to make it: legacy on/off Where off would be the equivalent of starting with -M (no resource management) but wouldn't spawn any daemons. Thoughts? [1] Which has an API call which allows Pacemaker to prevent Corosync from shutting down if it's still running. 
So service corosync stop will fail if you didn't already run service pacemaker stop
Re: [Linux-ha-dev] [PATCH] IPv6addr: removing libnet dependency
On Fri, Jul 23, 2010 at 5:09 AM, Simon Horman ho...@verge.net.au wrote: On Fri, Jul 23, 2010 at 09:19:44AM +0900, Keisuke MORI wrote: The attached patch removes the libnet dependency from the IPv6addr RA by implementing the same functionality using the standard socket API. Currently there are the following problems with the resource-agents package: - The IPv6addr RA requires an extra libnet package in the run-time environment. That is pretty inconvenient, particularly for RHEL users, because it's not included in the standard distribution. - The pre-built RPMs from ClusterLabs do not include the IPv6addr RA. This was once reported on the pacemaker list: http://www.gossamer-threads.com/lists/linuxha/pacemaker/64295#64295 The patch will resolve those issues. I believe that none of the Pacemaker/Heartbeat related packages would depend on the libnet library any more once this is applied. Hi Mori-san, I will add that libnet seems to be more or less unmaintained. Someone recently picked it up again, but I'm in favor of the patch for the reasons Mori-san already stated. You seem to make using libnet optional, is there a reason not to just remove it? portability? Agreed, let's just drop it.
Re: [Linux-ha-dev] Upstart RA
2010/5/16 Ante Karamatić iv...@ubuntu.com: Hi all I'm working on an OCF RA for upstart. In its current state, upstart doesn't return exit codes for 'start', 'stop' or 'status'. Or, to be precise, the exit code is always 0. Exit codes weren't implemented since upstart knows a few more states than just 'running' or 'not running', i.e. it knows the distinction between 'running, but stopping' and 'running'. Which is still no excuse for them not doing exit codes properly. They should have just added a few more, not thrown them out and made automation that much harder. I'm pretty sure that internally they're not using regexes to parse the state of services :-/ Nevertheless, it has exit statuses which are machine readable with grep/awk/whatever. Exit codes will be implemented in the future (probably in a couple of months). So, to create a resource agent that would utilize upstart, we could rely on grepping the output of initctl commands or we could rely on dbus. The approach I've taken was to utilize the python interface to dbus. The reason for this is that upstream prefers communication over dbus, as explained at http://upstart.ubuntu.com/wiki/DBusInterface. I should have this RA done very soon and my plan was to name it upstart-dbus, since it would depend on dbus. dbus isn't installed by default on ubuntu server and probably it isn't installed on other server distributions (correct me if I'm wrong). Would depending on dbus be a problem? I think I'd not make it a strict dependency, and instead make sure the RA checked for dbus and produced OCF_NOT_INSTALLED if it wasn't available. and if not, would a python based RA be acceptable at all? I think we're supposed to be language agnostic so I'd not imagine that to be a problem. Being a plugin is probably a better solution in the long term though, since then we might be able to take advantage of the upstart events. 
It also uses 0.1% fewer characters to configure too I guess :-) I'm aware that it is a bit slower than just grepping, but measured with time(1) the worst result for 'status' was 0.05 seconds.
Re: [Linux-ha-dev] Monitor Operation on MySQL Master/Slave Group
Try removing all the operations in p_mysql. They should only be defined in ms_mysql On Fri, May 14, 2010 at 9:20 PM, John Ratz john.mark.r...@gmail.com wrote: Hi, I did have different interval times, but the problem seems to be that a separate monitor function (monitor_Master_0) in the mysql script is needed in order to support a separate monitor op. Here's my config before and after the change: crm(live)configure# show node $id=58397566-b452-43f3-ac8e-45549776d863 node2 node $id=611d50bd-62c1-4629-bb16-c29ec6210807 node1 primitive p_clusterip ocf:heartbeat:IPaddr2 \ params ip=10.176.1.167 cidr_netmask=32 \ op monitor interval=3s \ meta target-role=Started primitive p_mysql ocf:heartbeat:mysql \ params binary=/usr/bin/mysqld_safe config=/etc/my.cnf socket=/var/run/mysqld/mysql.sock datadir=/var/lib/mysql replication_user=root replication_passwd= pid=/var/run/mysqld/mysqld.pid test_user=root test_passwd= \ op start interval=0 timeout=120s \ op stop interval=0 timeout=120s \ op monitor interval=30 timeout=30 OCF_LEVEL_CHECK=1 ms ms_mysql p_mysql \ meta notify=true master-max=1 is-managed=true target-role=Started location l_master ms_mysql \ rule $id=l_master-rule $role=Master 50: #uname eq node1 colocation mysql_master_on_ip inf: p_clusterip ms_mysql:Master order mysql_before_ip inf: ms_mysql:promote p_clusterip:start property $id=cib-bootstrap-options \ dc-version=1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7 \ cluster-infrastructure=Heartbeat \ stonith-enabled=false \ no-quorum-policy=ignore \ last-lrm-refresh=1273611238 rsc_defaults $id=rsc-options \ resource-stickiness=100 crm(live)configure# edit p_mysql WARNING: p_mysql: action monitor_Master_0 not advertised in meta-data, it may not be supported by the RA crm(live)configure# show node $id=58397566-b452-43f3-ac8e-45549776d863 node2 node $id=611d50bd-62c1-4629-bb16-c29ec6210807 node1 primitive p_clusterip ocf:heartbeat:IPaddr2 \ params ip=10.176.1.167 cidr_netmask=32 \ op monitor interval=3s \ meta 
target-role=Started primitive p_mysql ocf:heartbeat:mysql \ params binary=/usr/bin/mysqld_safe config=/etc/my.cnf socket=/var/run/mysqld/mysql.sock datadir=/var/lib/mysql replication_user=root replication_passwd= pid=/var/run/mysqld/mysqld.pid test_user=root test_passwd= \ op start interval=0 timeout=120s \ op stop interval=0 timeout=120s \ op monitor interval=20 role=Slave timeout=20 OCF_LEVEL_CHECK=1 \ op monitor interval=10 role=Master timeout=20 OCF_LEVEL_CHECK=1 ms ms_mysql p_mysql \ meta notify=true master-max=1 is-managed=true target-role=Started location l_master ms_mysql \ rule $id=l_master-rule $role=Master 50: #uname eq node1 colocation mysql_master_on_ip inf: p_clusterip ms_mysql:Master order mysql_before_ip inf: ms_mysql:promote p_clusterip:start property $id=cib-bootstrap-options \ dc-version=1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7 \ cluster-infrastructure=Heartbeat \ stonith-enabled=false \ no-quorum-policy=ignore \ last-lrm-refresh=1273611238 rsc_defaults $id=rsc-options \ resource-stickiness=100 crm(live)configure# commit WARNING: p_mysql: action monitor_Master_0 not advertised in meta-data, it may not be supported by the RA crm(live)configure# I know that the drbd script file supports a separate monitor op for the Master. I tried changing action name=monitor depth=0 timeout=30 interval=10 / in the mysql file to action name=monitor depth=0 timeout=20 interval=20 role=Slave / action name=monitor depth=0 timeout=20 interval=10 role=Master / so that it matched what's in the drbd script file. Doing this prevents the above warning, but this still fails and crashes the Master/Slave set. I think that it's attempting to run the same monitor op with the same interval on both nodes despite having specified different intervals in the config. Perhaps a fix is needed somewhere in Pacemaker itself? 
It seems that it is not conforming to http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ch10s03s03s03.html Thanks, John On Fri, May 14, 2010 at 3:37 AM, Raoul Bhatia [IPAX] r.bha...@ipax.at wrote: On 05/12/2010 09:30 PM, linuxha@gishpuppy.com wrote: Hello All, I've been testing out the newly added Master/Slave capability for MySQL, but it seems to be missing the monitor operation for the Master instance, i.e. the monitor op for the primitive mysql object only runs on the slave resource, and an added monitor op with role=Master fails. Is this intended behavior? hi, did you specify two monitor operations with different intervals? e.g. [1] crm configure primitive drbd0 ocf:heartbeat:drbd \ params drbd_resource=drbd0 \
Re: [Linux-ha-dev] Monitor Operation on MySQL Master/Slave Group [GishPuppy]
On Wed, May 12, 2010 at 9:30 PM, linuxha@gishpuppy.com wrote: Hello All, I've been testing out the newly added Master/Slave capability for MySQL, but it seems to be missing the monitor operation for the Master instance, i.e. the monitor op for the primitive mysql object only runs on the slave resource, and an added monitor op with role=Master fails. Is this intended behavior? Expected, yes. But not really ideal, we really need to fix that in pacemaker one day :-(
Re: [Linux-ha-dev] Deprecated resource agents
On Tue, Apr 20, 2010 at 8:15 AM, Florian Haas florian.h...@linbit.com wrote: On 04/20/2010 07:03 AM, Tim Serong wrote: On 4/20/2010 at 06:48 AM, Lars Marowsky-Bree l...@novell.com wrote: In general, I think the ability to deprecate functionality is needed, but it shouldn't be slip-streamed into a minor dot release, and we first need to do some more homework to get our infrastructure right before we should consider breaking customer configurations. This'd be easiest if the metadata explicitly said an RA was deprecated, for example something like: ?xml version=1.0? !DOCTYPE resource-agent SYSTEM ra-api-1.dtd resource-agent name=Evmsd version=0.9 deprecated=true ... ATM, the deprecated RAs all seem to follow the same convention of using (deprecated) in the shortdesc, e.g.: shortdesc lang=enControls clustered EVMS volume management (deprecated)/shortdesc ...but grepping arbitrary text out of a description always irks me. It's a little inexact. I'll shoulder the blame for that. I came up with that (deprecated) kludge for fear of lmb jumping in circles about an unauthorized modification of the RA metadata schema. But now you started it! Which allows me to wholeheartedly second your motion. Which brings up another good point... Can we please make OCF relevant again by converting the repo to Hg and allowing access?
Re: [Linux-ha-dev] Deprecated resource agents
On Mon, Apr 19, 2010 at 2:05 PM, Florian Haas florian.h...@linbit.com wrote: Hello, in case you haven't yet noticed: as of resource-agents 1.0.2, several Linux-HA resource agents are marked as deprecated: - EvmsSCC and - Evmsd (both apply to EVMS, which is no longer maintained); - LinuxSCSI (superseded by SCSI reservations and SF-EX); - drbd (superseded by ocf:linbit:drbd); - pingd (superseded by ocf:pacemaker:pingd, which in turn is now considered obsolete and superseded by ocf:pacemaker:ping if I understand correctly). Correct. Everyone should be using ocf:pacemaker:ping It's highly likely that pingd will redirect to ping one of these days. Since nobody should be using these anymore, and we've already kept them for two releases, I suggest that we drop these from 1.0.4. In addition, I suggest that we deprecate (and after a couple more releases, drop) the Linux-HA incarnations of Dummy and Stateful, as duplicates of these already exist in Pacemaker, and Andrew has indicated he wants to maintain them there rather than fix them in the Linux-HA repo. Thoughts and comments appreciated. Cheers, Florian
Re: [Linux-ha-dev] proposed fix for the ABI extension of cluster-glue
On Sat, Apr 17, 2010 at 11:56 AM, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Sat, Apr 17, 2010 at 11:40:36AM +0200, Lars Marowsky-Bree wrote: Lars, I have no other way of saying this, but I still think you're completely misguided in this desire to preserve binary compatibility. What's the point in preserving local ABI compatibility if they have to restart everything anyway? Situation is: I had pacemaker 1.0.8 installed. There is no pacemaker 1.0.9 yet. Cluster glue is updated. I install updated cluster glue, as it better supports pacemaker 1.0.8. I do that, and boom, all my stack segfaults. No, because 1.0.8-4 was rebuilt for the new version of glue. Look, we all know lmb has some crazy-ass ideas, but I'm hard pressed to disagree with anything he's said in this thread. I vote for reapplying the patch, bumping the SO name and forgetting about the whole thing. Why would I require my users to fetch new builds of the very same version of heartbeat and pacemaker, if it is easily avoided? Why would I knowingly break ABI compatibility, if I can avoid it, just for two ints added at the end of a struct instead of in the middle? I have absolutely no understanding for your desire to keep this ABI compatible and make code more complicated by needing to support You don't need to support different semantics. If you want to only support the new semantics, require the 2.1 library. Done. Whether you require 3.0 or 2.1 does not make a difference to you, does it. Anyway, I've had my say. -- : Lars Ellenberg : LINBIT | Your Way to High Availability : DRBD/HA support and consulting http://www.linbit.com DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
Re: [Linux-ha-dev] proposed fix for the ABI extension of cluster-glue
On Sat, Apr 17, 2010 at 7:58 PM, Andrew Beekhof and...@beekhof.net wrote: On Sat, Apr 17, 2010 at 11:56 AM, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Sat, Apr 17, 2010 at 11:40:36AM +0200, Lars Marowsky-Bree wrote: Lars, I have no other way of saying this, but I still think you're completely misguided in this desire to preserve binary compatibility. What's the point in preserving local ABI compatibility if they have to restart everything anyway? Situation is: I had pacemaker 1.0.8 installed. There is no pacemaker 1.0.9 yet. Cluster glue is updated. I install updated cluster glue, as it better supports pacemaker 1.0.8. I do that, and boom, all my stack segfaults. No, because 1.0.8-4 was rebuilt for the new version of glue. Actually, I should probably be clearer on this point in advance... If someone has installed a different version of glue to the one Pacemaker is built with, then I'm not at all interested in looking at any problems the cluster is experiencing. If it works, great. Otherwise the very first thing I'm going to say is to rebuild or update Pacemaker. I burnt WAY too many hours on weird-ass bugs resulting from the debian packaging to even think about condoning this. Look, we all know lmb has some crazy-ass ideas, but I'm hard pressed to disagree with anything he's said in this thread. I vote for reapplying the patch, bumping the SO name and forgetting about the whole thing. Why would I require my users to fetch new builds of the very same version of heartbeat and pacemaker, if it is easily avoided? Why would I knowingly break ABI compatibility, if I can avoid it, just for two ints added at the end of a struct instead of in the middle? I have absolutely no understanding for your desire to keep this ABI compatible and make code more complicated by needing to support You don't need to support different semantics. If you want to only support the new semantics, require the 2.1 library. Done. 
Whether you require 3.0 or 2.1 does not make a difference to you, does it? Anyway, I've had my say. -- : Lars Ellenberg : LINBIT | Your Way to High Availability : DRBD/HA support and consulting http://www.linbit.com DRBD® and LINBIT® are registered trademarks of LINBIT, Austria. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] proposed fix for the ABI extension of cluster-glue
On Sat, Apr 17, 2010 at 8:13 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Sat, Apr 17, 2010 at 07:58:38PM +0200, Andrew Beekhof wrote: I vote for reapplying the patch, bumping the SO name and forgetting about the whole thing. The only thing I do is move the two new members to the end of the struct, Oh without question we should do that. Sorry, didn't mean to imply that this was a waste of time. Just assumed that would be part of reapplying. keeping backwards compatibility, before bumping the SO name anyway, though not from 2.0.0 to 3.0.0, but only to 2.1.0. Because I don't see why we would insist on breaking backwards compatibility, if keeping it is that cheap.
Re: [Linux-ha-dev] proposed fix for the ABI extension of cluster-glue
Sorry, pressed send too quickly... On Sat, Apr 17, 2010 at 8:13 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Sat, Apr 17, 2010 at 07:58:38PM +0200, Andrew Beekhof wrote: I vote for reapplying the patch, bumping the SO name and forgetting about the whole thing. The only thing I do is move the two new members to the end of the struct, keeping backwards compatibility, before bumping the SO name anyway, though not from 2.0.0 to 3.0.0, but only to 2.1.0. Because I don't see why we would insist on breaking backwards compatibility, if keeping it is that cheap. I'll admit to not following the conversation that closely, but the revised patch you posted didn't look all that cheap to me. Or is that a different conversation?
Re: [Linux-ha-dev] [Pacemaker] Announcement: new releases for cluster-glue (1.0.4), resource-agents (1.0.3), and heartbeat (3.0.3)
On Wed, Apr 14, 2010 at 4:01 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: Hello, The new releases of cluster glue (1.0.4), resource agents (1.0.3), and Heartbeat (3.0.3) are finally ready. Nice. The repos up on clusterlabs.org are being rebuilt with all three now and should be ready in an hour or so. It took us a whole week more to put the final touches, apologies if that caused any inconvenience. I guess that the schedule was a bit too tight this time. The highlights:

- cluster-glue
  - interaction between crmd and lrmd changed a bit as of Pacemaker 1.0.8 (affects only resource cleanups)
  - new external/ippower9258 stonith plugin (thanks to Helmut Weymann)
  - hb_report now creates .dot and .png files for the PE input files, and packing CTS tests is more convenient
- resource-agents
  - timeouts in meta-data were reviewed and adjusted in all resource agents; that should make the Pacemaker 1.0.8 crm shell less noisy
  - all agents now print messages which would normally go to the system logs to the terminal if invoked from the command line; that should make resource debugging easier
  - a new ocft RA test suite (thanks to John Shi)
- Heartbeat
  - support for setting the lrmd max children parameter
  - support for sbd fencing

Of course, there's also a bunch of bug fixes, in particular in the agents and glue packages. See the changelogs for more details. Links to the new tarballs are at http://www.linux-ha.org/wiki/Download If you're already running Pacemaker 1.0.8 or planning to upgrade to that release, it would be a good idea to upgrade these releases as well. Of course, don't forget to first test the new packages on your test clusters. Enjoy!
Lars Ellenberg Florian Haas Dejan Muhamedagic ___ Pacemaker mailing list: pacema...@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
[Linux-ha-dev] Purpose of HA_LOGD in .ocf-shellfuncs?
Can anyone explain the purpose of this block?

    if [ x${HA_LOGD} = xyes ] ; then
        ha_logger -t ${HA_LOGTAG} "$@"
        if [ $? -eq 0 ] ; then
            return 0
        fi
    fi

I ask because I can't find anything that actually sets HA_LOGD anywhere. -- Andrew
Re: [Linux-ha-dev] announcement: cluster-glue 1.0.3 release
On Wed, Feb 3, 2010 at 10:46 AM, Dejan Muhamedagic deja...@fastmail.fm wrote: Hi, On Tue, Feb 02, 2010 at 10:41:34AM -0800, Bob Schatz wrote: Does this bug only apply to the 1.0.2 release or was it also in the 1.0.0 release used with fc12? Don't know. The bug was introduced on Dec 07 2009. If you unpack the source tar archive, there should be a file called .hg_archival.txt. Paste the content and I'll be able to tell you if it is affected. Should be unaffected:

    Name    : cluster-glue                   Relocations: (not relocatable)
    Version : 1.0                            Vendor: Fedora Project
    Release : 0.11.b79635605337.hg.fc12      Build Date: Mon 12 Oct 2009 06:06:22 PM CEST

Build date is prior to the fix.
Re: [Linux-ha-dev] ulimit in ocf scripts
On Tue, Jan 12, 2010 at 10:43 AM, Raoul Bhatia [IPAX] r.bha...@ipax.at wrote: On 01/12/2010 10:39 AM, Florian Haas wrote: Why not simply set that for root at boot? (it rhymes too :) because i do not like the idea that each and every process gets elevated limits by default. i think that there *should* be a generic way to configure ulimits on a per-resource basis. I'm confident Dejan would be happy to accept a patch in which you add such a parameter to each resource agent where it makes sense. of course this would be possible. but i *think* it is more helpful to add this to e.g. the cib/lrmd/you name it. so before i/we implement the ulimit stuff *inside* lots of different RAs, i'd like to hear beekhof's or lars' comments. If you want a configurable per-resource limit - that's a resource parameter. Why would we want to implement another mechanism?
Re: [Linux-ha-dev] glue cs#7505f2e115c5 - reintroducing heartbeat name into paths?
On Fri, Dec 11, 2009 at 1:51 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: Hi, On Fri, Dec 11, 2009 at 11:50:10AM +0100, Florian Haas wrote: On 2009-12-10 22:34, Lars Marowsky-Bree wrote: On 2009-12-10T21:45:34, Dejan Muhamedagic deja...@fastmail.fm wrote: There are several packages using /usr/lib/heartbeat and similar. Yeah, but that was mostly a legacy thing, I thought - on a system without heartbeat installed, this is sort of a confusing artifact. The only thing where we have a hard time changing it are binary names (such as hb_report) or public interfaces (provider=heartbeat). Everything else is supposed to use %{name} under share/lib etc; I think FHS suggests that. What's the point of confusing people? There could also be a possibility of a regression. FWIW, I agree with Lars here. I guess it's much more confusing to have directories that belong to one package, yet carry the name of another. I agree in principle too, but the following is really ugly:

    /usr/lib64/cluster-glue/plugins/test/test.so
    /usr/lib64/heartbeat/base64_md5_test
    /usr/share/cluster-glue/lrmtest
    /usr/lib64/heartbeat/lrmd
    /usr/lib64/heartbeat/ha_logd

Don't care about names; it's just that we should use either one or the other and not both. Moving lrmd would break pacemaker. In general, moving things could possibly break some software developed elsewhere. I'd prefer not to cause other users headaches; I think they had enough of chasing moving targets produced by us. So, let's choose one or the other and finally move on, we've really got better things to do, and ask Andrew to change the pacemaker sources (which would make our package incompatible with pacemaker = 1.0.6, but oh well, not the first time we break things). It's not an either-or proposition. Switch completely to the new naming scheme and use symlinks for compatibility. Then after a conservative period of time you can just drop the symlinks.
Thanks, Dejan That's how I stumbled across this, actually; it's changing path names for the SLE packages ;-) Must say that I was also not aware of the change, it probably happened while I was not around. The move happened when the packages were split, I think. P.S. glue doesn't carry much meaning. Correct, but the package name is now cluster-glue (both the autofoo $(PACKAGE_NAME) and the RPM %{name}) and, while not perfect, that is much more meaningful. Florian
Re: [Linux-ha-dev] Make error (Reusable-Cluster-Components-d3036c574587)
Sorry about that. Fix is in changeset 3a005082d4cb On Fri, Dec 4, 2009 at 2:28 AM, renayama19661...@ybb.ne.jp wrote: Hi, An error occurs during a make of Reusable-Cluster-Components-d3036c574587:

    cc1: warnings being treated as errors
    cl_log.c:127: warning: no previous prototype for 'cl_log_enable_stdout'

There is no prototype declaration in the header. I ask for a revision:

    void cl_log_enable_stderr(int truefalse);
    void cl_log_enable_stdout(int truefalse);   - Required

Best Regards, Hideo Yamauchi.
Re: [Linux-ha-dev] Problems starting heartbeat 3.0.1-1 - /etc/ha.d/shellfuncs No such file or directory
On Mon, Nov 16, 2009 at 6:49 PM, Bob Schatz bsch...@yahoo.com wrote: I missed Andrew's reply so I am including his comment and my results below: On my cluster I did:

    [r...@fc11-1 ~]# rpm -qi resource-agents | grep ersion
    Version : 3.0.4    Vendor: Fedora Project

ah, that's the problem. on fc11 resource-agents-3.0.4 doesn't yet have the heartbeat agents. with fc12 that problem goes away. for now, you'll have to remove 3.0.4 and specify 1.0.1 on the command line when you install. so: yum install resource-agents = 1.0.1 pacemaker

    [r...@fc11-1 ~]# rpm -ql resource-agents | grep shellfunc
    /usr/share/cluster/ocf-shellfuncs
    [r...@fc11-1 ~]#

Thanks, Bob --

    [05:08 PM] root[at]f12 ~ # rpm -qi resource-agents | grep ersion
    Version : 3.0.4    Vendor: Fedora Project
    [05:07 PM] root[at]f12 ~ # rpm -ql resource-agents | grep shellfunc
    /etc/ha.d/shellfuncs there
    /usr/lib/ocf/resource.d/heartbeat/.ocf-shellfuncs
    /usr/share/cluster/ocf-shellfuncs

What version of resource-agents do you have installed? On Fri, Nov 13, 2009 at 3:52 AM, Bob Schatz bschatz[at]yahoo.com wrote: For some reason I did not receive the email from Andrew so I am including it below. Cluster glue was installed and I have attached the output from yum at the end of this email. Also, I noticed that the files that used to reside in /usr/lib/ocf/resource.d/heartbeat/* are no longer there. I could not configure an IPaddr resource. Thanks in advance Bob that file should be part of cluster-glue... was that package not installed? On Wed, Nov 11, 2009 at 8:19 PM, Bob Schatz bschatz[at]yahoo.com wrote: Hi, I am new to Linux HA and I am having a problem with heartbeat 3.0.1. It appears that /etc/ha.d/shellfuncs is no longer in the release but it is still called from /etc/init.d/heartbeat.
I reloaded a system with FC11 and then downloaded the pacemaker/heartbeat binaries as follows:

    # wget -O /etc/yum.repos.d/pacemaker.repo http://clusterlabs.org/rpm/fedora-11/clusterlabs.repo
    # yum install -y pacemaker corosync heartbeat

I copied a ha.cf to /etc/ha.d/ha.cf and attempted to start heartbeat as follows:

    root[at]fc11-2:# sh -x /etc/init.d/heartbeat start
    + '[' -f /etc/sysconfig/heartbeat ']'
    + HA_DIR=/etc/ha.d
    + export HA_DIR
    + CONFIG=/etc/ha.d/ha.cf
    + . /etc/ha.d/shellfuncs
    /etc/init.d/heartbeat: line 51: /etc/ha.d/shellfuncs: No such file or directory

I did not see this as a known problem on the mailing lists. Thanks, Bob

    # yum install -y pacemaker corosync heartbeat
    Loaded plugins: refresh-packagekit
    clusterlabs | 1.2 kB 00:00
    clusterlabs/primary | 14 kB 00:00
    clusterlabs 47/47
    Setting up Install Process
    Resolving Dependencies
    --> Running transaction check
    ---> Package corosync.x86_64 0:1.1.2-1.fc11 set to be updated
    --> Processing Dependency: corosynclib = 1.1.2-1.fc11 for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libvotequorum.so.4(COROSYNC_VOTEQUORUM_1.0)(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libconfdb.so.4(COROSYNC_CONFDB_1.0)(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libcfg.so.4(COROSYNC_CFG_0.82)(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libquorum.so.4(COROSYNC_QUORUM_1.0)(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libpload.so.4(COROSYNC_PLOAD_1.0)(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libcoroipcs.so.4()(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: liblogsys.so.4()(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libquorum.so.4()(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libconfdb.so.4()(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libvotequorum.so.4()(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libcfg.so.4()(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libtotem_pg.so.4()(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libcoroipcc.so.4()(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libpload.so.4()(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    ---> Package heartbeat.x86_64 0:3.0.1-1.fc11 set to be updated
    --> Processing Dependency: resource-agents for package: heartbeat-3.0.1-1.fc11.x86_64
    --> Processing Dependency: cluster-glue-libs for package: heartbeat-3.0.1-1.fc11.x86_64
    --> Processing Dependency: PyXML for package: heartbeat-3.0.1-1.fc11.x86_64
    --
Re: [Linux-ha-dev] suggestions and patch for meatclient
On Sat, Nov 14, 2009 at 6:58 PM, Lars Marowsky-Bree l...@suse.de wrote: On 2009-11-13T11:42:31, Dejan Muhamedagic deja...@fastmail.fm wrote: I would like to hear any opinion. Great idea! But I'd like to suggest a bit different execution, i.e. to have usage like this: The idea is nice, but what we actually want is a crm node clean-down-confirmation XXX command, that clears the CIB accordingly. So I think stonithd isn't actually the best place to implement this. Agreed. Someone create a bug for this?
Re: [Linux-ha-dev] [Linux-HA] Problem with gratuitous arps in IPaddr2
On Wed, Sep 16, 2009 at 6:18 PM, Lars Marowsky-Bree l...@suse.de wrote: The send_arp.linux.c code file has different semantics than send_arp.libnet.c. I added extra handling to try and make them match. Did I miss something? Basically I took the arping source code and added extra option handling.
Re: [Linux-ha-dev] Build dependency issue with heartbeat/cluster-glue
export CFLAGS="$CFLAGS -I$PREFIX/include -L$PREFIX/lib"

On Mon, Aug 31, 2009 at 10:01 AM, Florian Haas florian.h...@linbit.com wrote: Andrew, Lars, Dejan, As I'm trying to fix the heartbeat init script (and struggling with automake in the process), I believe I've run into a heartbeat build issue where I'm guessing at some point $PREFIX isn't evaluated correctly. Here's what I do (in my hg clone for glue):

    export PREFIX=/tmp/cluster-build/usr
    export LCRSODIR=$PREFIX/libexec/lcrso
    export CLUSTER_USER=hacluster
    export CLUSTER_GROUP=haclient
    ./configure --prefix=$PREFIX --with-initdir=/tmp/cluster-build/etc/init.d --sysconfdir=/tmp/cluster-build/etc --localstatedir=/tmp/cluster-build/var

Configure now reports:

    glue configuration:
    Version                 = 1.0.0 (Build: unknown)
    Features                =
    Prefix                  = /tmp/cluster-build/usr
    Executables             = /tmp/cluster-build/usr/sbin
    Man pages               = /tmp/cluster-build/usr/share/man
    Libraries               = /tmp/cluster-build/usr/lib
    Header files            = /tmp/cluster-build/usr/include
    Arch-independent files  = /tmp/cluster-build/usr/share
    State information       = /tmp/cluster-build/var
    System configuration    = /tmp/cluster-build/etc
    Use system LTDL         = yes
    HA group name           = haclient
    HA user name            = hacluster
    CFLAGS                  = -g -O2 -ggdb3 -O0 -fstack-protector-all -Wall -Waggregate-return -Wbad-function-cast -Wcast-qual -Wcast-align -Wdeclaration-after-statement -Wendif-labels -Wfloat-equal -Wformat=2 -Wformat-security -Wformat-nonliteral -Winline -Wmissing-prototypes -Wmissing-declarations -Wmissing-format-attribute -Wnested-externs -Wno-long-long -Wno-strict-aliasing -Wpointer-arith -Wstrict-prototypes -Wwrite-strings -ansi -D_GNU_SOURCE -DANSI_ONLY -Werror
    Libraries               = -lbz2 -lxml2 -lc -luuid -lrt -ldl -lglib-2.0 -lltdl
    Stack Libraries         =

... which I am happy with. I now do make and make install, and everything correctly installs into my directory hierarchy under /tmp/cluster-build.
I now change to my hg clone for heartbeat (dev repo from hg.linux-ha.org) and try bootstrap with the same flags as for glue:

    ./bootstrap --prefix=$PREFIX --with-initdir=/tmp/cluster-build/etc/init.d --sysconfdir=/tmp/cluster-build/etc --localstatedir=/tmp/cluster-build/var

... which bombs out with:

    checking heartbeat/glue_config.h usability... no
    checking heartbeat/glue_config.h presence... no
    checking for heartbeat/glue_config.h... no
    configure: error: Core development headers were not found
    See `config.log' for more details.

config.log lists includedir as being '/tmp/cluster-build/usr/include', which does contain heartbeat/glue_config.h. What am I doing wrong? Any advice much appreciated. Thanks! Cheers, Florian
[Linux-ha-dev] ANNOUNCE: New Linux-HA repository structure
Lars has asked me to announce that at long last, we have finalized the new Linux-HA repository/project structure. Effective immediately, Heartbeat 2.x has been split into the following projects:

* cluster-glue 1.0
* resource-agents 1.0
* heartbeat 3.0-beta

### Cluster Glue 1.0

- http://hg.linux-ha.org/glue/
- http://hg.linux-ha.org/glue/archive/glue-1.0.tar.gz

A collection of common tools that are useful for writing cluster stacks such as Heartbeat and cluster managers such as Pacemaker. Provides a local resource manager that understands the OCF and LSB standards, and an interface to common STONITH devices.

### Resource Agents 1.0

- http://hg.linux-ha.org/agents/
- http://hg.linux-ha.org/agents/archive/agents-1.0.tar.gz

OCF-compliant scripts to allow common services to operate in a High Availability environment.

### Heartbeat 3.0-beta

- http://hg.linux-ha.org/dev/

A cluster stack providing messaging and membership services that can be used by resource managers such as Pacemaker. Heartbeat still contains the simple 2-node resource manager (a.k.a. haresources) from before version 2. The board will release 3.0-final at a time of its choosing.

These changes have been put in place to allow the group to release updates at intervals that are suitable to each individual project. This also makes better use of our limited QA resources, as we are no longer forced to test the entire stack in order to release an updated set of resource agents. Additionally, the changes aim to increase the usage of the individual components by allowing them to be used independently. Preliminary packages for the most recent openSUSE, SLES, Fedora and RHEL releases are currently available at http://download.opensuse.org/repositories/server:/ha-clustering:/NG Older distros can be added if there is sufficient demand. The existing repositories will be migrated to the new package layout over the coming days and weeks.
-- Andrew
Re: [Linux-ha-dev] cleanup, coding style, checkpatch.pl
On Mon, Jul 27, 2009 at 11:19 AM, Hannes Eder he...@google.com wrote: On Wed, Jul 22, 2009 at 17:13, Lars Marowsky-Bree l...@suse.de wrote: On 2009-07-21T17:24:56, Hannes Eder he...@google.com wrote: Hi, Some parts of the linux-ha code base might benefit from a little code cleanup. In this case the question arises which coding style should be applied. I did not find any documentation on that in the linux-ha source tree. Did I miss something? What about obeying Documentation/CodingStyle from the linux-kernel? By that means tools like scripts/checkpatch.pl could be used. Comments? I won't mind, but style cleanups for their own sake don't really convince me. If they come as a pre-requisite for a bugfix, sure, but remember that basically the only bits of heartbeat that are still actively maintained are the LRM + resource agents. Agreed, but other parts of linux-ha are still in use, no? So, I think for maintainability This isn't much of a concern. Apart from clplumbing and the pieces lars mentioned, the rest of the code is essentially unmaintained. it's worthwhile spending some effort tidying up the code. I do not ask you to do it; it's merely the question: if one, e.g. me, spends some time cleaning up code, what style should be applied, and is it likely to be merged? I'd say not likely. We very rarely look at that code and, to me, the chance that the cleanup would introduce bugs offsets any positive aspect.
Re: [Linux-ha-dev] What's going to happen with the heartbeat RAs?
On Thu, Jun 25, 2009 at 1:48 PM, Florian Haas florian.h...@linbit.com wrote: Hello everyone, This is something that's been on my mind for a while, and I'm still looking for a definitive answer. :) Just what exactly is the current plan for the recent changes to the RAs provided by Heartbeat (i.e. the ones that install into /usr/lib/ocf/resource.d/heartbeat)? I understand there will be no further Heartbeat releases beyond the current 2.99, so those changed (and new) RAs won't ever be released as part of Heartbeat. Yet AFAICS there is no ongoing effort to move them to Pacemaker. What's the plan? Basically: http://hg.clusterlabs.org/extra/agents/ + http://download.opensuse.org/repositories/server:/ha-clustering:/NG The packaging is still a bit of a work in progress, but the full stack did seem to be working before I left for Italy. Should the submitters of these new RAs re-submit to Pacemaker? No need, http://hg.clusterlabs.org/extra/agents is still a filtered copy of the ha dev repo. We'll make an announcement when everything is ready. Andrew seems to not be so fond of that idea, but I wonder what the alternative is. At this point I guess Lars' idea of all sorts of third parties contributing and maintaining their own RAs, all of them installing into separate provider directories, is just that: a good idea, with little chance of being widely adopted anytime soon. So what should we do? Keep preparing patches against the linux-ha Mercurial repo and submitting them to linux-ha-dev, or create patches against http://hg.clusterlabs.org/extra/agents and submit them to the Pacemaker list, or something completely different? Comments appreciated. Thanks! Cheers, Florian
Re: [Linux-ha-dev] ocf (non) unique parameter
On Fri, May 22, 2009 at 5:52 PM, Raoul Bhatia [IPAX] r.bha...@ipax.at wrote: can someone explain the practical usage of non-unique attributes? in the ocf ra specs [1], one can read The meta data allows the RA to flag one or more instance parameters as 'unique'. This is a hint to the RM or higher level configuration tools that the combination of these parameters must be unique to the given resource type. speaking in the v2 configuration language, i thought that this means that one can specify multiple nvpairs with the same name to the cib.xml but for different resources. i.e. multiple ip addresses can use the same netmask (unique=0) but each must have a different ip address (unique=1). e.g. looking at the pingd ra:

    <parameter name="pidfile" unique="0">
      <longdesc lang="en">PID file</longdesc>
      <shortdesc lang="en">PID file</shortdesc>
      <content type="string" default="$HA_RSCTMP/pingd-${OCF_RESOURCE_INSTANCE}" />
    </parameter>

- e.g.

    <nvpair id="pidfile1" name="pidfile" value="/tmp/pid1.pid" />
    <nvpair id="pidfile2" name="pidfile" value="/tmp/pid2.pid" />

as far as i can see, the first nvpair seems to win though and OCF_RESKEY_pidfile is set to /tmp/pid1.pid. yep can someone please shed some light on how to (not) use unique=0 attributes and how they are passed to the ocf ra scripts? as above. a resource can only have one value for any given attribute
[Linux-ha-dev] Re: [Pacemaker] Heartbeat 2.99.2+sles11-rc3 and Pacemaker 1.0.2 packages for Debian (experimental)
On Sat, Feb 21, 2009 at 00:59, Simon Horman ho...@verge.net.au wrote: Hi, I have heartbeat 2.99.2+sles11-rc3-1 and pacemaker 1.0.2-2 packages prepared for Debian Experimental. They should install on top of the current Debian Unstable (Sid). The packages are available at: http://packages.vergenet.net/experimental/ There are source packages and binary packages for i386 and amd64. At this stage the major problem that I am facing is a failure report from /usr/lib/heartbeat/BasicSanityCheck which looks like the log below. BSC is really only a heartbeat thing and hasn't been maintained since the split. It will get resurrected in Pacemaker in some form, but it's not a high priority. I'd suggest the crm parts of the Heartbeat BSC be removed. Full logs are available at:

- http://packages.vergenet.net/experimental/pacemaker/pacemaker-1.0.2-2_heartbeat-2.99.2+sle11-rc3_1_linux-ha_i386.testlog
- http://packages.vergenet.net/experimental/pacemaker/pacemaker-1.0.2-2_heartbeat-2.99.2+sle11-rc3_1_linux-ha_amd64.testlog
- http://packages.vergenet.net/experimental/pacemaker/pacemaker-1.0.1-1+heartbeat-2.99.2-1_linux-ha_i386.testlog

I would really appreciate some advice on what, if anything, to do about this. In particular, is this a problem?
Thanks -- Simon Horman VA Linux Systems Japan K.K., Sydney, Australia Satellite Office H: www.vergenet.net/~horms/ W: www.valinux.co.jp/en

    pengine: [27031]: ERROR: IDREF attribute rsc references an unknown ID bsc-rsc-yukiko.kent.sydney.vergenet.net-1
    pengine: [27031]: ERROR: update_validation: Transformation /usr/share/pacemaker/upgrade06.xsl did not produce a valid configuration
    pengine: [27031]: ERROR: Element cluster_property_set has extra content: attributes
    pengine: [27031]: ERROR: Element crm_config has extra content: cluster_property_set
    pengine: [27031]: ERROR: Invalid sequence in interleave
    pengine: [27031]: ERROR: Element configuration failed to validate content
    pengine: [27031]: ERROR: Element cib failed to validate content
    cib: [27082]: ERROR: validate_cib_digest: Digest comparision failed: expected 2d643f7c4206c8d16db0331dd98c367d (/var/lib/heartbeat/crm/cib.xml.sig), calculated d18f2b4b9762eed144fe06a91e950b16
    cib: [27082]: ERROR: retrieveCib: Checksum of /var/lib/heartbeat/crm/cib.xml failed! Configuration contents ignored!
    cib: [27082]: ERROR: retrieveCib: Usually this is caused by manual changes, please refer to http://linux-ha.org/v2/faq/cib_changes_detected
    cib: [27082]: ERROR: crm_abort: write_cib_contents: Triggered fatal assert at io.c:698 : retrieveCib(CIB_FILENAME, CIB_FILENAME.sig, FALSE) != NULL
    cib: [27002]: ERROR: Managed write_cib_contents process 27082 dumped core
    cib: [27002]: ERROR: cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
    cib: [27002]: ERROR: cib_diskwrite_complete: Disabling disk writes after write failure
    cib: [27085]: ERROR: validate_cib_digest: Digest comparision failed: expected 2d643f7c4206c8d16db0331dd98c367d (/var/lib/heartbeat/crm/cib.xml.sig), calculated d18f2b4b9762eed144fe06a91e950b16
    cib: [27085]: ERROR: write_cib_contents: /var/lib/heartbeat/crm/cib.xml was manually modified while Heartbeat was active!
    cib: [27002]: ERROR: cib_diskwrite_complete: Disk write failed: status=256, signo=0, exitcode=1
    pengine: [27031]: ERROR: Element cluster_property_set has extra content: attributes
    pengine: [27031]: ERROR: Element crm_config has extra content: cluster_property_set
    pengine: [27031]: ERROR: Invalid sequence in interleave
    pengine: [27031]: ERROR: Element configuration failed to validate content
    pengine: [27031]: ERROR: Element cib failed to validate content
    pengine: [27031]: ERROR: process_pe_message: Your current configuration could only be upgraded to transitional-0.6... the minimum requirement is pacemaker-1.0.
Re: [Linux-ha-dev] HA Version 2 for monitoring apache mysql
Logs? On Mon, Dec 15, 2008 at 05:03, Tanveer Chowdhury tanveer.chowdh...@gmail.com wrote: Hi, I have configured the below in RHEL but it doesn't start the Virtual IP address or even the services. Below are the settings I used. Most probably the cib.xml file is not in the right format. Thanks in advance.

    # cat /etc/ha.d/ha.cf
    debugfile /var/log/ha-debug
    logfile /var/log/ha-log
    logfacility local0
    keepalive 1
    deadtime 5
    warntime 3
    initdead 10
    udpport 694
    bcast eth0
    auto_failback on
    node clusternode1
    node clusternode2
    crm on

    # vi /etc/ha.d/haresources
    clusternode1 10.10.4.xxx httpd

    # vi /etc/ha.d/authkeys
    auth 1
    1 crc

    # ll /etc/init.d/apache
    lrwxrwxrwx 1 root root 31 Feb 1 00:20 /etc/init.d/apache -> /usr/local/apache/bin/apachectl
    # ll /etc/init.d/httpd
    lrwxrwxrwx 1 root root 31 Oct 22 2008 /etc/init.d/httpd -> /usr/local/apache/bin/apachectl
    # rpm -qa | grep heartbeat
    heartbeat-pils-2.1.3-3.el5.centos
    heartbeat-stonith-2.1.3-3.el5.centos
    heartbeat-2.1.3-3.el5.centos
    # uname -a
    Linux clusternode2 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:02 EDT 2007 i686 i686 i386 GNU/Linux
    # cat /etc/issue
    Red Hat Enterprise Linux Server release 5.1 (Tikanga)

This cib.xml file was auto generated after I started the heartbeat service.
# vi /var/lib/heartbeat/crm/cib.xml

<cib generated="true" admin_epoch="0" epoch="1" num_updates="1" have_quorum="true" ignore_dtd="false" ccm_transition="1" num_peers="1" cib_feature_revision="2.0" cib-last-written="Fri Dec 15 00:15:48 2008">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <attributes>
          <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="2.1.3-node: 552305612591183b1628baa5bc6e903e0f1e26a3"/>
        </attributes>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="d354acf4-568d-4583-9fe9-72eabb2835b1" uname="clusternode2" type="normal"/>
    </nodes>
    <resources/>
    <constraints/>
  </configuration>
</cib>

Then I added these lines to it:

<resources>
  <group id="apache_group">
    <primitive id="ip_resource_1" class="ocf" type="IPaddr" provider="heartbeat">
      <instance_attributes>
        <attributes>
          <nvpair name="ip" value="10.10.4.xxx"/>
        </attributes>
      </instance_attributes>
    </primitive>
    <primitive id="apache" class="heartbeat" type="apache"/>
  </group>

So it became:

<cib generated="true" admin_epoch="0" epoch="1" num_updates="1" have_quorum="true" ignore_dtd="false" ccm_transition="1" num_peers="1" cib_feature_revision="2.0" cib-last-written="Fri Dec 15 00:15:48 2008">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <attributes>
          <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="2.1.3-node: 552305612591183b1628baa5bc6e903e0f1e26a3"/>
        </attributes>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="d354acf4-568d-4583-9fe9-72eabb2835b1" uname="clusternode2" type="normal"/>
    </nodes>
    <resources>
      <group id="apache_group">
        <primitive id="ip_resource_1" class="ocf" type="IPaddr" provider="heartbeat">
          <instance_attributes>
            <attributes>
              <nvpair name="ip" value="10.10.4.xxx"/>
            </attributes>
          </instance_attributes>
        </primitive>
        <primitive id="apache" class="heartbeat" type="apache"/>
      </group>
    <resources/>
    <constraints/>
  </configuration>
</cib>
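[Editor's note on the configuration above: the pasted file opens a new <resources> element but never closes it, and leaves the original empty <resources/> in place, so the result is not well-formed XML, which matches the poster's suspicion that the cib.xml is "not in the right format". A corrected sketch of the resources section follows. The id attributes on nvpair and the _attrs/_ip names are additions of mine, not from the post; the CRM of that era generally requires ids on these elements.]

```xml
<!-- Sketch of a well-formed resources section for the posted cib.xml.
     The added id attributes (ip_resource_1_attrs, ip_resource_1_ip) are
     assumptions; 10.10.4.xxx is the redacted address from the post. -->
<resources>
  <group id="apache_group">
    <primitive id="ip_resource_1" class="ocf" type="IPaddr" provider="heartbeat">
      <instance_attributes id="ip_resource_1_attrs">
        <attributes>
          <nvpair id="ip_resource_1_ip" name="ip" value="10.10.4.xxx"/>
        </attributes>
      </instance_attributes>
    </primitive>
    <primitive id="apache" class="heartbeat" type="apache"/>
  </group>
</resources>
```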
Re: [Linux-ha-dev] Need to detect ethN down (but NOT ping)
On Sun, Oct 26, 2008 at 19:37, tje [EMAIL PROTECTED] wrote:

Through an interface that *must* be up (or we should fail over), there is no address that can be the subject of a ping. Everything is dynamically assigned via DHCP, and all L2 switching on that side is incapable of having even a loopback address.

ping the dhcp server? ping google.com?
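[Editor's note: the common ping-free alternative for detecting a dead ethN is to read the kernel's carrier flag from sysfs instead of probing a remote address. A minimal sketch, not from the thread; the link_up helper and its sysfs-root parameter are inventions for illustration.]

```shell
#!/bin/sh
# link_up IFACE [SYSFS_ROOT] - return 0 if the NIC reports link (carrier=1).
# The SYSFS_ROOT parameter exists only so the logic can be exercised
# against a mock directory; it defaults to the real /sys/class/net.
link_up() {
    iface=$1
    sysfs=${2:-/sys/class/net}
    # carrier reads "1" when the link is up; "0" or an error when it is
    # down or the interface is absent.
    [ "$(cat "$sysfs/$iface/carrier" 2>/dev/null)" = "1" ]
}

if link_up eth0; then
    echo "eth0 link is up"
else
    echo "eth0 link is down or missing"
fi
```

A monitor like this could feed a failover decision directly, without depending on any pingable neighbor on the DHCP side.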
Re: [Linux-ha-dev] [patch 0/6] Assorted build cleanups and fixes
Looks fine to me

On Wed, Oct 15, 2008 at 09:02, Simon Horman [EMAIL PROTECTED] wrote:

Hi, I noticed a couple of things while building heartbeat earlier today. Please let me know if there are any objections to any of these.

-- Simon Horman, VA Linux Systems Japan K.K., Sydney, Australia Satellite Office. H: www.vergenet.net/~horms/ W: www.valinux.co.jp/en
Re: [Linux-ha-dev] Re: Announcing: 2.99.1 (beta! release)
On Oct 7, 2008, at 1:28 PM, Dejan Muhamedagic wrote:

Hi,

On Tue, Oct 07, 2008 at 01:30:13PM +0800, Yan Gao wrote:

Hi Lars,

On Mon, 2008-10-06 at 12:30 +0200, Lars Marowsky-Bree wrote:

On 2008-10-06T12:32:11, Yan Gao [EMAIL PROTECTED] wrote:

Hi,

On Sat, 2008-10-04 at 17:39 +0900, [EMAIL PROTECTED] wrote:

Hi Xinwei, understood. I will use the latest GUI next week. If there is a problem with the GUI, I will report it on the Pacemaker mailing list.

Any comments or suggestions are welcome ;-)

I think it definitely needs to be renamed to at least pacemaker-pygui; the package just being called pygui doesn't work.

Yes, definitely. I renamed it to pacemaker-pygui when I added the 1.4 revision tag. Actually, I don't think pacemaker-pygui is ideal either :-) because it also includes the SNMP subagent and the backend of the pygui. Maybe pacemaker-mgmt or something is better?

-mgmt sounds good to me. It would also be good if the GUI were in a separate package, because of its numerous dependencies which typically aren't needed on cluster nodes (e.g. gtk stuff).

agreed on both counts
Re: [Linux-ha-dev] To avoid STONITH for a node which is doing kdump
On Tue, Sep 30, 2008 at 12:24, Satomi TANIGUCHI [EMAIL PROTECTED] wrote:

Hi Dejan, thank you for letting me know! I'll test it. Now, may I ask you a question? cluster-delay still seems to require a value longer than the maximum possible stonith timeout for tengine.

Is this with pacemaker 0.7? If so, can you open a bug for this please? The CRM should be waiting forever.

If cluster-delay is shorter than the sum total of the plugins' timeout values, then tengine detects that a STONITH op timed out and a new STONITH op is executed. Then two or more plugins run in parallel. Will this change in the future? (Are we just in a transition period?) Or was that possibility lost because of my insistence that adding a fence(stonith)-timeout is the better way? I ask because Andrew said that the cluster can just wait forever for stonithd to return, if stonithd no longer needs a timeout value from crmd...

Regards, Satomi TANIGUCHI

Dejan Muhamedagic wrote:

Hi, just to let you know that I renamed fence-timeout to stonith-timeout, because there are already stonith-this and stonith-that among the crm cluster properties. Still, better to be consistent with the stonith- naming: calling this one fence-... would most probably confuse people. As if they weren't confused enough ;-)

Thanks, Dejan

On Fri, Sep 26, 2008 at 11:00:53AM +0200, Dejan Muhamedagic wrote:

Hi Satomi-san,

On Fri, Sep 26, 2008 at 05:39:53PM +0900, Satomi TANIGUCHI wrote:

Hi Dejan, I found some bugs.

1) When fence-timeout is not set and priority is set, priority's value is used as both fence_timeout and priority. The patch for this bug is fence-timeout.patch.

Right. An oversight while pondering whether to have it in a loop, or to wait and see if there will be more stonithd attributes coming :)

2) Stonithd can execute only 2 or fewer plugins. With 3 or more plugins, priority is ignored. The patch for this is stonith_rsc_priorities.patch.

Oh, that code was tricky. Strange that it fails on more than two plugins.

I hope they are helpful to you.

Of course.
Thanks for the patches and testing! Cheers, Dejan

Best Regards, Satomi TANIGUCHI
Re: [Linux-ha-dev] Re: Announcing: 2.99.1 (beta! release)
On Oct 2, 2008, at 3:50 AM, HIDEO YAMAUCHI wrote:

Hi, I tested the GUI from this package, but I was not able to add a new dummy resource. Also, the GUI feels very old: it seems to have fewer functions than the GUI of the 2.1.4 versions. Will the GUI of this version be usable with this package?

Only if you set validate-with=pacemaker-0.6 (which also prevents you from using some of the new pacemaker features), because the GUI doesn't understand the new syntax in 1.0. I believe the Novell China guys are rewriting the GUI and it will eventually be able to understand the new syntax. No idea what the ETA on that is, though.

Contents of the log when I failed to add the dummy:
---
mgmtd[12142]: 2008/10/01_13:15:11 info: on_add_rsc: <primitive id="resource_" class="ocf" type="Dummy" provider="heartbeat"><meta_attributes id="resource__meta_attrs"><attributes><nvpair id="resource__metaattr_target_role" name="target_role" value="stopped"/></attributes></meta_attributes></primitive>
cib[12137]: 2008/10/01_13:15:11 ERROR: Element meta_attributes has extra content: attributes
cib[12137]: 2008/10/01_13:15:11 ERROR: Extra element meta_attributes in interleave
cib[12137]: 2008/10/01_13:15:11 ERROR: Element primitive failed to validate content
cib[12137]: 2008/10/01_13:15:11 ERROR: Element resources has extra content: primitive
cib[12137]: 2008/10/01_13:15:11 ERROR: Invalid sequence in interleave
cib[12137]: 2008/10/01_13:15:11 ERROR: Element cib failed to validate content
cib[12137]: 2008/10/01_13:15:11 ERROR: cib_perform_op: Updated CIB does not validate against pacemaker-1.0 schema/dtd
cib[12137]: 2008/10/01_13:15:11 WARN: cib_diff_notify: Update (client: mgmtd, call:3): 0.13.3 -> 0.14.1 (Update does not conform to the configured schema/DTD)
cib[12137]: 2008/10/01_13:15:11 ERROR: cib_process_request: Operation complete: op cib_create for section resources (origin=local/4a6ff9c3-26e0-4788-a081-fab048da199f/3): Update does not conform to the configured schema/DTD (rc=-47)
mgmtd[12142]: 2008/10/01_13:15:11 WARN: cib_native_perform_op: Call failed: Update does not conform to the configured schema/DTD
mgmtd[12142]: 2008/10/01_13:15:12 WARN: unpack_resources: No STONITH resources have been defined
mgmtd[12142]: 2008/10/01_13:15:12 info: determine_online_status: Node rh52-1hb3 is online
---
Regards, Hideo Yamauchi.
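[Editor's note: the validate-with suggestion above is an attribute on the root cib element. A minimal sketch of what that looks like; the attribute values other than validate-with are placeholders, not taken from the thread.]

```xml
<!-- validate-with pins the schema the CIB is checked against. With it
     set to pacemaker-0.6, the old GUI's syntax (a meta_attributes block
     wrapping an attributes element, as in the log above) validates
     again, at the cost of the pacemaker-1.0-only features.
     admin_epoch/epoch/num_updates here are placeholders. -->
<cib validate-with="pacemaker-0.6" admin_epoch="0" epoch="1" num_updates="1">
  <configuration>
    ...
  </configuration>
</cib>
```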
Re: [Linux-ha-dev] Re: Bug#500219: heartbeat: please don't include hidden files
On Mon, Sep 29, 2008 at 13:29, Ferenc Wagner [EMAIL PROTECTED] wrote:

Andrew Beekhof [EMAIL PROTECTED] writes:

On Mon, Sep 29, 2008 at 01:05, Simon Horman [EMAIL PROTECTED] wrote:

On Fri, Sep 26, 2008 at 12:57:44PM +0200, Ferenc Wagner wrote:

Chkrootkit stumbles upon the hidden files under /usr/lib:

/etc/cron.daily/chkrootkit: The following suspicious files and directories were found:
/usr/lib/ocf/resource.d/heartbeat/.ocf-binaries
/usr/lib/ocf/resource.d/heartbeat/.ocf-directories
/usr/lib/ocf/resource.d/heartbeat/.ocf-returncodes
/usr/lib/ocf/resource.d/heartbeat/.ocf-shellfuncs

Please avoid using such names if possible.

That sounds like a reasonable request to me. I am passing it on to the upstream development mailing list for comment there.

I think chkrootkit is being a little over-protective here.

Sure it is, by design.

These files aren't meant to be included directly by the user, and by naming them with a leading dot we avoid the issue of them showing up as resources.

I see. Isn't it possible to move them into a different directory then?

Possible yes, but this isn't nearly a good enough reason to do so. Their current location is the most appropriate.
Re: [Linux-ha-dev] Re: Bug#500219: heartbeat: please don't include hidden files
On Mon, Sep 29, 2008 at 15:24, Dejan Muhamedagic [EMAIL PROTECTED] wrote:

On Mon, Sep 29, 2008 at 01:10:28PM +0200, Andrew Beekhof wrote:

On Mon, Sep 29, 2008 at 01:05, Simon Horman [EMAIL PROTECTED] wrote:

On Fri, Sep 26, 2008 at 12:57:44PM +0200, Ferenc Wagner wrote:

Package: heartbeat
Version: 2.1.3-6
Severity: wishlist

Hi, Chkrootkit stumbles upon the hidden files under /usr/lib:

/etc/cron.daily/chkrootkit: The following suspicious files and directories were found:
/usr/lib/ocf/resource.d/heartbeat/.ocf-binaries
/usr/lib/ocf/resource.d/heartbeat/.ocf-directories
/usr/lib/ocf/resource.d/heartbeat/.ocf-returncodes
/usr/lib/ocf/resource.d/heartbeat/.ocf-shellfuncs

Please avoid using such names if possible.

Hi Ferenc, that sounds like a reasonable request to me. I am passing it on to the upstream development mailing list for comment there.

I think chkrootkit is being a little over-protective here. These files aren't meant to be included directly by the user, and by naming them with a leading dot we avoid the issue of them showing up as resources.

They won't show up as resource agents if the scripts are not executable. Sourcing such files would work in that case too.

Yes, but it's needlessly confusing (even if in a small way) for anyone looking in that directory. Audit tools such as the one mentioned have their place, but moving files around for no other reason than to keep the tool from complaining is a step too far, particularly when there are sane enough reasons for the files to be located and named as they are. At any rate, even if they were relocated, we'd still have to provide links at the current pathnames for compatibility.

As an aside, are rootkit writers really that lame that they rely on a leading dot to hide the presence of a file? Even my little sister wouldn't be fooled by that.

-EMOVEALONG