Re: [Linux-ha-dev] [Linux-HA] Announcing the Heartbeat 3.0.6 Release
On 11 Feb 2015, at 8:24 am, Lars Ellenberg lars.ellenb...@linbit.com wrote:

> TL;DR: If you intend to set up a new High Availability cluster using the Pacemaker cluster manager, you typically should not care about Heartbeat, but use recent releases (2.3.x) of Corosync. If you don't care about Heartbeat, don't read further. Unless you are beekhof... there's a question below ;-)
>
> After 3½ years since the last officially tagged release of Heartbeat, I have seen the need to do a new maintenance release.
>
> The Heartbeat 3.0.6 release tag: 3d59540cf28d
> and the change set it points to: cceeb47a7d8f
>
> The main reason for this was that Pacemaker more recent than somewhere between 1.1.6 and 1.1.7 would no longer work properly on the Heartbeat cluster stack, because some of the daemons have moved from glue to Pacemaker proper and changed their paths. This has been fixed in Heartbeat.
>
> Also, during that time stonith-ng was refactored: it would still reliably fence, but no longer understood its own confirmation message, so it was effectively broken. This I fixed in Pacemaker.
>
> If you choose to run a new Pacemaker with the Heartbeat communication stack, it should be at least 1.1.12 with a few patches; see my December 2014 commits at the top of https://github.com/lge/pacemaker/commits/linbit-cluster-stack-pcmk-1.1.12
>
> I'm not sure if they got into Pacemaker upstream yet. beekhof? Do I need to rebase? Or did I miss you merging these?

Merged now :-) We're about to start the 1.1.13 release cycle, so it won't be far away.

---

> If you have those patches, consider setting this new ha.cf configuration parameter:
>
> # If pacemaker crmd spawns the pengine itself,
> # it sometimes forgets to kill the pengine on shutdown,
> # which later may confuse the system after cluster restart.
> # Tell the system that Heartbeat is supposed to
> # control the pengine directly.
> crmd_spawns_pengine off
>
> Here is the shortened Heartbeat changelog; the longer version is available in Mercurial: http://hg.linux-ha.org/heartbeat-STABLE_3_0/shortlog
>
> - fix emergency shutdown due to broken update_ackseq
> - fix node dead detection problems
> - fix converging of membership (ccm)
> - fix init script startup glitch (caused by changes in glue/resource-agents)
> - heartbeat.service file for systemd platforms
> - new ucast6 UDP IPv6 communication plugin
> - package ha_api.py in standard package
> - update some man pages, specifically the example ha.cf
> - also report ccm membership status for cl_status hbstatus -v
> - updated some log messages, or their log levels
> - reduce max_delay in broadcast client_status query to one second
> - apply various (mostly cosmetic) patches from Debian
> - drop HBcompress compression plugins: they are part of cluster glue
> - drop openais HBcomm plugin
> - better support for current pacemaker versions
> - try to not miss a SIGTERM (fix problem with very fast respawn/stop cycle)
> - dopd: ignore dead ping nodes
> - cl_status improvements
> - api internals: reduce IPC round-trips to get at status information
> - uid=root is sufficient to use heartbeat api (gid=haclient remains sufficient)
> - fix /dev/null as log- or debugfile setting
> - move daemon binaries into libexecdir
> - document movement of compression plugins into cluster-glue
> - fix usage of SO_REUSEPORT in ucast sockets
> - fix compile issues with recent gcc and -Werror
>
> Note that a number of the mentioned fixes were created two years ago already, and may have been released in packages for a long time, where vendors have chosen to package them.
>
> As to future plans for Heartbeat: Heartbeat is still useful for non-pacemaker, haresources-mode clusters. We (Linbit) will maintain Heartbeat for the foreseeable future.
> That should not be too much of a burden, as it is stable, and due to long years of field exposure, all bugs are known ;-)
>
> The most notable shortcoming when using Heartbeat with Pacemaker clusters is the limited message size. There are currently no plans to remove that limitation.
>
> With its wide choice of communication paths, even exotic communication plugins, and the ability to run arbitrarily many paths, some deployments may still favor it over Corosync. But typically, for new deployments involving Pacemaker, in most cases you should choose Corosync 2.3.x as your membership and communication layer.
>
> For existing deployments using Heartbeat, upgrading to this Heartbeat version is strongly recommended.
>
> Thanks,
> Lars Ellenberg

___
Linux-HA mailing list
linux...@lists.linux-ha.org
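For readers putting these pieces together, a minimal ha.cf might look like the following sketch. The node names, interface, and address are assumptions for illustration only (and the ucast6 plugin is assumed here to follow the existing ucast "device address" syntax), so adjust everything to your site:

```
# /etc/ha.d/ha.cf -- hypothetical example, not from the announcement
node alice bob
# ucast6 is the new UDP IPv6 unicast plugin in 3.0.6
# (assumed syntax, mirroring ucast: device then address)
ucast6 eth0 fe80::10
# run Pacemaker on top of Heartbeat
crm respawn
# with the patched Pacemaker >= 1.1.12: let Heartbeat control the pengine
crmd_spawns_pengine off
```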
Re: [Linux-ha-dev] quorum status
Check corosync.conf, I'm guessing pcs enabled two node mode.

On 29 Jan 2015, at 5:21 pm, Yan, Xiaoping (NSN - CN/Hangzhou) xiaoping@nsn.com wrote:

Hi, I used this command: pcs cluster stop rhel1
So I think both were shut down.
Br, Rip

-----Original Message-----
From: ext Andrew Beekhof [mailto:and...@beekhof.net]
Sent: Thursday, January 29, 2015 1:51 PM
To: Yan, Xiaoping (NSN - CN/Hangzhou)
Cc: Linux-HA-Dev@lists.linux-ha.org
Subject: Re: quorum status

Did you shut down pacemaker or corosync or both?

On 29 Jan 2015, at 4:18 pm, Yan, Xiaoping (NSN - CN/Hangzhou) xiaoping@nsn.com wrote:

Hi, any suggestion please?
Br, Rip

_____
From: Yan, Xiaoping (NSN - CN/Hangzhou)
Sent: Wednesday, January 28, 2015 4:04 PM
To: Linux-HA-Dev@lists.linux-ha.org
Subject: quorum status

Hi experts:

I'm setting up a 2-node (rhel1 and rhel2) Linux cluster following Pacemaker-1.1-Clusters_from_Scratch-en-US. After I shut down one of the nodes (pcs cluster stop rhel1), the other partition still has quorum, while according to the document, chapter 5.3.1, it should be a partition without quorum. (A cluster is said to have quorum when total_nodes < 2 * active_nodes.) What could be the problem? Thank you.

[root@rhel2 ~]# pcs status
Cluster name: mycluster
Last updated: Wed Jan 28 10:49:46 2015
Last change: Wed Jan 28 10:07:10 2015 via cibadmin on rhel1
Stack: corosync
Current DC: rhel2 (2) - partition with quorum
Version: 1.1.10-29.el7-368c726
2 Nodes configured
1 Resources configured

Online: [ rhel2 ]
OFFLINE: [ rhel1 ]

Full list of resources:
 ClusterIP (ocf::heartbeat:IPaddr2): Started rhel2

PCSD Status:
  rhel1: Unable to authenticate
  rhel2: Unable to authenticate

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

[root@rhel2 ~]# pcs status corosync

Membership information
----------------------
    Nodeid      Votes Name
         2          1 rhel2 (local)

Br, Rip

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
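Beekhof's guess refers to corosync's two-node mode. On RHEL 7, pcs typically generates a votequorum section along these lines (a sketch of what to look for in corosync.conf, not taken from the poster's system):

```
quorum {
    provider: corosync_votequorum
    # with two_node set, a two-node cluster keeps quorum when one node
    # stops cleanly -- which would explain the "partition with quorum"
    # output above despite the total_nodes < 2 * active_nodes rule
    two_node: 1
}
```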
Re: [Linux-ha-dev] quorum status
Did you shut down pacemaker or corosync or both?

On 29 Jan 2015, at 4:18 pm, Yan, Xiaoping (NSN - CN/Hangzhou) xiaoping@nsn.com wrote:

Hi, any suggestion please?
Br, Rip

_____
From: Yan, Xiaoping (NSN - CN/Hangzhou)
Sent: Wednesday, January 28, 2015 4:04 PM
To: Linux-HA-Dev@lists.linux-ha.org
Subject: quorum status

Hi experts:

I'm setting up a 2-node (rhel1 and rhel2) Linux cluster following Pacemaker-1.1-Clusters_from_Scratch-en-US. After I shut down one of the nodes (pcs cluster stop rhel1), the other partition still has quorum, while according to the document, chapter 5.3.1, it should be a partition without quorum. (A cluster is said to have quorum when total_nodes < 2 * active_nodes.) What could be the problem? Thank you.

[root@rhel2 ~]# pcs status
Cluster name: mycluster
Last updated: Wed Jan 28 10:49:46 2015
Last change: Wed Jan 28 10:07:10 2015 via cibadmin on rhel1
Stack: corosync
Current DC: rhel2 (2) - partition with quorum
Version: 1.1.10-29.el7-368c726
2 Nodes configured
1 Resources configured

Online: [ rhel2 ]
OFFLINE: [ rhel1 ]

Full list of resources:
 ClusterIP (ocf::heartbeat:IPaddr2): Started rhel2

PCSD Status:
  rhel1: Unable to authenticate
  rhel2: Unable to authenticate

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

[root@rhel2 ~]# pcs status corosync

Membership information
----------------------
    Nodeid      Votes Name
         2          1 rhel2 (local)

Br, Rip
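The quorum rule quoted from Clusters from Scratch (a partition has quorum when total_nodes < 2 * active_nodes) can be sketched as a small shell function; the node counts below mirror the poster's two-node example:

```shell
#!/bin/sh
# Sketch of the quorum rule: a partition has quorum iff
# total_nodes < 2 * active_nodes (i.e. more than half the nodes are active).
has_quorum() {
    total=$1
    active=$2
    [ "$total" -lt $((2 * active)) ]
}

if has_quorum 2 1; then
    echo "2 nodes, 1 active: quorum"
else
    echo "2 nodes, 1 active: no quorum"
fi
if has_quorum 3 2; then
    echo "3 nodes, 2 active: quorum"
fi
```

By this rule a two-node cluster that loses one node loses quorum, so the "partition with quorum" output only makes sense if something (such as corosync's two-node mode) overrides the plain count.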
Re: [Linux-ha-dev] April 1st joke? (pengine: [20168]: ERROR: crm_abort: gregorian_to_ordinal: Triggered assert at iso8601.c:635 : a_date->days > 0)
It's an underflow error that has since been fixed. Sorry for the noise.

On 09/04/2013, at 11:06 PM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

Hi, I found these error messages in syslog on April 1st:

Apr 1 00:04:30 h06 pengine: [20168]: ERROR: crm_abort: gregorian_to_ordinal: Triggered assert at iso8601.c:635 : a_date->days > 0
[...]
Apr 1 01:49:30 h06 pengine: [20168]: ERROR: crm_abort: convert_from_gregorian: Triggered assert at iso8601.c:622 : gregorian_to_ordinal(a_date)

Before and after those times I could not see any of these. (pacemaker-1.1.7-0.13.9 of SLES11 SP2)

Regards,
Ulrich
Re: [Linux-ha-dev] [Linux-HA] Mysql RA issue: Heartbeat/Pacemaker stops switching Master/Slave after killing mysql processes of Master many times (3 times)
On Fri, Jan 18, 2013 at 9:06 PM, Thai Nguyen nqt...@tma.com.vn wrote:

Hello all, I am running Heartbeat/Pacemaker with MySQL master/slave replication on my servers, and I am facing an issue involving the MySQL RA, as follows:

Steps to reproduce:
Step 1: Kill the mysql processes on the Master.
Step 2: Wait until Heartbeat/Pacemaker has switched Master/Slave.
Step 3: Repeat steps 1 and 2 two more times.
Step 4: Observe the Master/Slave status.

Expected result: Heartbeat/Pacemaker switches Master/Slave successfully.
Actual result: Heartbeat/Pacemaker stops switching Master/Slave.

After killing the Master the 2nd time, I checked the new Master's log (ha-log); the message "MySQL monitor succeeded (master)" didn't show up in the log. Then I killed the mysql processes of the new Master (3rd time), and the result is that Heartbeat/Pacemaker stops switching Master/Slave. To work around this issue, I need to restart Heartbeat.

You could have also just run "crm resource cleanup ms_MySQL" to clear out the failures. If that doesn't work, some logs would make it easier to comment.
And this is my pacemaker config:

node $id="fabe2f8e-9ba2-4f85-a644-fa16fe492830" ares \
	attributes apollo-log-file-p_mysql="mysql-bin.67" apollo-log-pos-p_mysql="107"
node $id="fd5a954a-aadc-450e-9dda-ca2c18e980c2" apollo
primitive MailTo ocf:heartbeat:MailTo \
	params email="nqt...@gmail.com"
primitive p_mysql ocf:heartbeat:mysql \
	params config="/etc/mysql/my.cnf" pid="/var/run/mysqld/mysqld.pid" socket="/var/run/mysqld/mysqld.sock" binary="/usr/bin/mysqld_safe" replication_user="root" replication_passwd="nec" test_user="root" test_passwd="nec" max_slave_lag="10" evict_outdated_slaves="false" \
	op monitor interval="1s" role="Master" timeout="120s" \
	op monitor interval="3s" timeout="120s" \
	op start interval="0" role="Stopped" timeout="120s" on-fail="restart" \
	op stop interval="0" timeout="120s" \
	meta is-managed="true"
primitive virtualIP ocf:heartbeat:IPaddr \
	params ip="192.168.103.223" cidr_netmask="255.255.255.0" \
	op monitor interval="1s" \
	meta is-managed="true"
ms ms_MySQL p_mysql \
	meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" globally-unique="false" target-role="Master" is-managed="true"
colocation mysql_co_ip inf: virtualIP ms_MySQL:Master
order my_MySQL_promote_before_vip inf: ms_MySQL:promote virtualIP:start
property $id="cib-bootstrap-options" \
	dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
	cluster-infrastructure="Heartbeat" \
	stonith-enabled="false" \
	default-action-timeout="30" \
	cluster-recheck-interval="30s" \
	no-quorum-policy="ignore"
property $id="mysql_replication" \
	p_mysql_REPL_INFO="ares|mysql-bin.34|107"
rsc_defaults $id="rsc-options" \
	resource-stickiness="1" \
	migration-threshold="1" \
	failure-timeout="15s"

Best regards,
Thai Nguyen
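A plausible reading of why switching stops (an illustration, not Pacemaker internals): with migration-threshold=1, each node is ruled out after a single failure, and once every node carries a fail-count the master has nowhere left to go until the counts are cleared, e.g. by "crm resource cleanup ms_MySQL". A toy model:

```shell
#!/bin/sh
# Illustrative model only: a node may run the resource while its fail-count
# stays below migration-threshold. Entries are "node=failcount" pairs.
eligible_nodes() {
    threshold=$1; shift
    for entry in "$@"; do
        node=${entry%%=*}
        count=${entry##*=}
        if [ "$count" -lt "$threshold" ]; then
            echo "$node"
        fi
    done
}

eligible_nodes 1 ares=1 apollo=0   # only apollo remains eligible
eligible_nodes 1 ares=1 apollo=1   # prints nothing: nowhere left to run
```

This also suggests why restarting Heartbeat "fixes" it: the restart discards the accumulated fail-counts, just as cleanup would.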
Re: [Linux-ha-dev] [resource-agents] Low: pgsql: check existence of instance number in replication mode (#159)
On Mon, Oct 29, 2012 at 9:51 PM, Dejan Muhamedagic de...@suse.de wrote:
On Fri, Oct 26, 2012 at 11:36:53AM +1100, Andrew Beekhof wrote:
On Fri, Oct 26, 2012 at 12:52 AM, Dejan Muhamedagic de...@suse.de wrote:
On Thu, Oct 25, 2012 at 06:09:38AM -0700, Lars Ellenberg wrote:
On Thu, Oct 25, 2012 at 03:38:47AM -0700, Takatoshi MATSUO wrote:

Usually, we use the crm_master command instead of crm_attribute to change the master score in an RA. But PostgreSQL's slave can't get its own replication status, so the Master changes the Slave's master-score using the instance number on Pacemaker 1.0.x. This is probably not ordinary usage.

Would the existing resource agent work with globally-unique=true?

I don't know whether it works with true. I use it with false and it doesn't need true.

I suggested that you actually should use globally-unique clones, as in that case you still get those instance numbers...

Does using different clones make sense in pgsql? What is to be different between them? Or would it be just for the sake of getting instance numbers? If so, then it somehow looks wrong to me :)

But thinking about it once more, I'm not so sure anymore. Correct me where I'm wrong. This is about the master score. In case the Master instance fails, we preferably want to promote the slave instance that is as close as possible to the Master. We only know which *node* was best at the last monitoring interval, which may be good enough. We need to then change the master score for *all possible instances*, for all nodes, accordingly. Which is what that loop did. (I think skipping the current instance is actually a bug; if Pacemaker relabels things in a bad way, you may hit it.)

Now, with Pacemaker 1.1.8, all instances become equal (for anonymous clones, aka globally-unique=false), and we only need to set the score on the resource-id, not for all resource-id:instance combinations.

OK. Which is great.
After all, the master score in this case is attached to the node (or the data set accessible from that node), and not to the (arbitrary, potentially relabeled at any time) instance number Pacemaker assigned to the clone instance running on that node.

And that is exactly what your patch does:
* detect whether a version of Pacemaker is in use that attaches the instance number to the resource id
* if so, do the loop over all possible instance numbers as before
* if not, only set the master score on the resource-id

Is my understanding correct? Then I think your patch is good.

Yes, the patch seems good then. Though there is quite a bit of code repetition. The set-attribute part should be moved to an extra function.

Still, other resource agents that use master scores (or any other attributes that reference instance numbers of anonymous clones) need to be reviewed.

Though this "I'll set scores for other instances, not only myself" logic is unique to pgsql, so most other resource agents should just work with whatever is present in the environment; they typically treat $OCF_RESOURCE_INSTANCE as opaque.

Seems like no other RA uses instance numbers. However, quite a few use OCF_RESOURCE_INSTANCE which, in case of clone/ms resources, may potentially lead to unpredictable results on upgrade to 1.1.8.

No. Otherwise all the regression tests would fail. The PE is smart enough to find promotion scores and failcounts in either case.

Cool.

Also, OCF_RESOURCE_INSTANCE contains whatever the local lrmd knows the resource as, not what we call it internally to the PE.

What I meant was that some RAs use OCF_RESOURCE_INSTANCE to name local files which keep some kind of state. If OCF_RESOURCE_INSTANCE changes on upgrade... Well, I guess the worst that can happen is for the probe to fail.

Right. But only for attach/reattach. And people should have maintenance-mode enabled at the point the probe is run, so there is time to fix things up before the cluster does anything about it.
But I didn't take a closer look.

Thanks,
Dejan

Thanks,
Lars

Cheers,
Dejan
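The version detection the patch is described as doing could be sketched like this (hypothetical helper name; the real pgsql RA may implement it differently):

```shell
#!/bin/sh
# Sketch: decide whether this Pacemaker still appends ":N" instance numbers
# to anonymous clones (pre-1.1.8 behaviour), based on the name the cluster
# hands the agent in OCF_RESOURCE_INSTANCE.
has_instance_suffix() {
    case "$1" in
        *:[0-9]*) return 0 ;;  # e.g. "p_pgsql:1" -> loop over all instances
        *)        return 1 ;;  # e.g. "p_pgsql"   -> set score on resource id
    esac
}

if has_instance_suffix "p_pgsql:1"; then
    echo "old style: set master score for every resource-id:N"
fi
if ! has_instance_suffix "p_pgsql"; then
    echo "new style: set master score on the resource id only"
fi
```

This matches the review's summary: loop over all instance numbers on pre-1.1.8 clusters, otherwise set the single score on the resource id.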
Re: [Linux-ha-dev] [resource-agents] Low: pgsql: check existence of instance number in replication mode (#159)
On Thu, Oct 25, 2012 at 10:01 PM, Takatoshi MATSUO matsuo@gmail.com wrote:

Usually, we use the crm_master command instead of crm_attribute to change our own master score in an RA. But PostgreSQL's Slave can't get its own replication status, so the Master changes the Slave's master-score using the instance number on Pacemaker 1.0.x. This is probably not ordinary usage.

Ouch! No, not ordinary (or recommended) at all :-)

What does the crm_attribute command line look like? Maybe the --node option could help?

So if pgsql thinks it needs these instance numbers, maybe it is not so anonymous a clone after all? Would the existing resource agent work with globally-unique=true?

No, I use it with false and it doesn't need true.

--
Takatoshi MATSUO

2012/10/25 Lars Ellenberg lars.ellenb...@linbit.com:

On Thu, Oct 25, 2012 at 01:24:40AM -0700, Takatoshi MATSUO wrote:

check existence of instance number in replication mode because Pacemaker 1.1.8 or higher does not append instance numbers.

I think this is wrong. It seems this became necessary because of

commit 427c7fe6ea94a566aaa714daf8d214290632f837
Author: Andrew Beekhof and...@beekhof.net
Date:   Fri Jul 13 13:37:42 2012 +1000

    High: PE: Do not append instance numbers to anonymous clones

    Benefits:
    - they shouldnt have been exposed in the first place, but I didnt know how not to back then
    - if admins don't know what they are, they can't be misunderstood or misused
    - more reliable failcount and promotion scores (since you dont have to check for all possible permutations)
    - smaller status section since there cant be entries for each possible :N suffix
    - the name in the config corresponds to the resource in the logs

So if pgsql thinks it needs these instance numbers, maybe it is not so anonymous a clone, after all? Would the existing resource agent work with globally-unique=true?
Lars

You can merge this Pull Request by running:

  git pull https://github.com/t-matsuo/resource-agents check-instance-number

Or you can view, comment on it, or merge it online at:

  https://github.com/ClusterLabs/resource-agents/pull/159

-- Commit Summary --

* Low: pgsql: check existence of instance number in replication mode

-- File Changes --

M heartbeat/pgsql (44)

-- Patch Links --

https://github.com/ClusterLabs/resource-agents/pull/159.patch
https://github.com/ClusterLabs/resource-agents/pull/159.diff

---
Reply to this email directly or view it on GitHub:
https://github.com/ClusterLabs/resource-agents/pull/159

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
Re: [Linux-ha-dev] [resource-agents] Low: pgsql: check existence of instance number in replication mode (#159)
On Fri, Oct 26, 2012 at 12:52 AM, Dejan Muhamedagic de...@suse.de wrote:
On Thu, Oct 25, 2012 at 06:09:38AM -0700, Lars Ellenberg wrote:
On Thu, Oct 25, 2012 at 03:38:47AM -0700, Takatoshi MATSUO wrote:

Usually, we use the crm_master command instead of crm_attribute to change the master score in an RA. But PostgreSQL's slave can't get its own replication status, so the Master changes the Slave's master-score using the instance number on Pacemaker 1.0.x. This is probably not ordinary usage.

Would the existing resource agent work with globally-unique=true?

I don't know whether it works with true. I use it with false and it doesn't need true.

I suggested that you actually should use globally-unique clones, as in that case you still get those instance numbers...

Does using different clones make sense in pgsql? What is to be different between them? Or would it be just for the sake of getting instance numbers? If so, then it somehow looks wrong to me :)

But thinking about it once more, I'm not so sure anymore. Correct me where I'm wrong. This is about the master score. In case the Master instance fails, we preferably want to promote the slave instance that is as close as possible to the Master. We only know which *node* was best at the last monitoring interval, which may be good enough. We need to then change the master score for *all possible instances*, for all nodes, accordingly. Which is what that loop did. (I think skipping the current instance is actually a bug; if Pacemaker relabels things in a bad way, you may hit it.)

Now, with Pacemaker 1.1.8, all instances become equal (for anonymous clones, aka globally-unique=false), and we only need to set the score on the resource-id, not for all resource-id:instance combinations.

OK. Which is great.

After all, the master score in this case is attached to the node (or the data set accessible from that node), and not to the (arbitrary, potentially relabeled at any time) instance number Pacemaker assigned to the clone instance running on that node.
And that is exactly what your patch does:
* detect whether a version of Pacemaker is in use that attaches the instance number to the resource id
* if so, do the loop over all possible instance numbers as before
* if not, only set the master score on the resource-id

Is my understanding correct? Then I think your patch is good.

Yes, the patch seems good then. Though there is quite a bit of code repetition. The set-attribute part should be moved to an extra function.

Still, other resource agents that use master scores (or any other attributes that reference instance numbers of anonymous clones) need to be reviewed.

Though this "I'll set scores for other instances, not only myself" logic is unique to pgsql, so most other resource agents should just work with whatever is present in the environment; they typically treat $OCF_RESOURCE_INSTANCE as opaque.

Seems like no other RA uses instance numbers. However, quite a few use OCF_RESOURCE_INSTANCE which, in case of clone/ms resources, may potentially lead to unpredictable results on upgrade to 1.1.8.

No. Otherwise all the regression tests would fail. The PE is smart enough to find promotion scores and failcounts in either case.

Also, OCF_RESOURCE_INSTANCE contains whatever the local lrmd knows the resource as, not what we call it internally to the PE.

Thanks,
Lars

Cheers,
Dejan
Re: [Linux-ha-dev] [resource-agents] Low: pgsql: check existence of instance number in replication mode (#159)
On Fri, Oct 26, 2012 at 12:49 PM, Takatoshi MATSUO matsuo@gmail.com wrote:
2012/10/26 Andrew Beekhof and...@beekhof.net:
On Thu, Oct 25, 2012 at 10:01 PM, Takatoshi MATSUO matsuo@gmail.com wrote:

Usually, we use the crm_master command instead of crm_attribute to change our own master score in an RA. But PostgreSQL's Slave can't get its own replication status, so the Master changes the Slave's master-score using the instance number on Pacemaker 1.0.x. This is probably not ordinary usage.

Ouch! No, not ordinary (or recommended) at all :-)

What does the crm_attribute command line look like? Maybe the --node option could help?

# crm_attribute -l reboot -N pm02 -n master-pgsql:1 -v 1000

That looks fine, just drop the :1 (or use whatever is in OCF_RESOURCE_INSTANCE).

This line uses crm_master as a reference. I would like crm_master to have a parameter which can set the hostname.

Probably not going to happen. crm_master is a convenience function for the common use case. It's fine to switch to crm_attribute for advanced usage.

But crm_master gets the hostname using the crm_node -n command these days, so I think that I should fix the method of getting the hostname for the next version. It also needs compatibility code for Pacemaker 1.0.x :(

So if pgsql thinks it needs these instance numbers, maybe it is not so anonymous a clone after all? Would the existing resource agent work with globally-unique=true?

No, I use it with false and it doesn't need true.

--
Takatoshi MATSUO

2012/10/25 Lars Ellenberg lars.ellenb...@linbit.com:

On Thu, Oct 25, 2012 at 01:24:40AM -0700, Takatoshi MATSUO wrote:

check existence of instance number in replication mode because Pacemaker 1.1.8 or higher does not append instance numbers.

I think this is wrong.
It seems this became necessary because of

commit 427c7fe6ea94a566aaa714daf8d214290632f837
Author: Andrew Beekhof and...@beekhof.net
Date:   Fri Jul 13 13:37:42 2012 +1000

    High: PE: Do not append instance numbers to anonymous clones

    Benefits:
    - they shouldnt have been exposed in the first place, but I didnt know how not to back then
    - if admins don't know what they are, they can't be misunderstood or misused
    - more reliable failcount and promotion scores (since you dont have to check for all possible permutations)
    - smaller status section since there cant be entries for each possible :N suffix
    - the name in the config corresponds to the resource in the logs

So if pgsql thinks it needs these instance numbers, maybe it is not so anonymous a clone, after all? Would the existing resource agent work with globally-unique=true?

Lars

You can merge this Pull Request by running:

  git pull https://github.com/t-matsuo/resource-agents check-instance-number

Or you can view, comment on it, or merge it online at:

  https://github.com/ClusterLabs/resource-agents/pull/159

-- Commit Summary --

* Low: pgsql: check existence of instance number in replication mode

-- File Changes --

M heartbeat/pgsql (44)

-- Patch Links --

https://github.com/ClusterLabs/resource-agents/pull/159.patch
https://github.com/ClusterLabs/resource-agents/pull/159.diff

---
Reply to this email directly or view it on GitHub:
https://github.com/ClusterLabs/resource-agents/pull/159

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

--
Thanks,
Takatoshi MATSUO
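Following Beekhof's advice in this thread, the same score would be set without the instance suffix; a sketch reusing the node name and score from the example above:

```
# old form, tied to an instance number that Pacemaker 1.1.8+ no longer uses:
crm_attribute -l reboot -N pm02 -n master-pgsql:1 -v 1000

# advised form -- drop the ":1", or reuse the name the cluster provided
# in OCF_RESOURCE_INSTANCE:
crm_attribute -l reboot -N pm02 -n master-pgsql -v 1000
crm_attribute -l reboot -N pm02 -n "master-${OCF_RESOURCE_INSTANCE}" -v 1000
```

The -N option is what lets the Master set the score on the peer node, which is why plain crm_master (which always targets the local node) does not fit this RA's unusual pattern.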
Re: [Linux-ha-dev] [Patch] The problem that the digest code of crmd becomes mismatched.
On Wed, Oct 10, 2012 at 11:21 PM, Dejan Muhamedagic de...@suse.de wrote:

Hi Hideo-san,

On Wed, Oct 10, 2012 at 03:22:08PM +0900, renayama19661...@ybb.ne.jp wrote:

Hi All,

We found that Pacemaker could not judge the result of an lrmd operation correctly. When we run the following crm configuration, a parameter of the start operation is given back to crmd as a result of the monitor operation:

(snip)
primitive prmDiskd ocf:pacemaker:Dummy \
	params name="diskcheck_status_internal" device="/dev/vda" interval="30" \
	op start interval="0" timeout="60s" on-fail="restart" prereq="fencing" \
	op monitor interval="30s" timeout="60s" on-fail="restart" \
	op stop interval="0s" timeout="60s" on-fail="block"
(snip)

This is because lrmd gives back the prereq parameter of start as part of the result of the monitor operation. As a result, crmd judges the parameters of the monitor operation it asked lrmd to run as mismatched against the parameters lrmd reports for that monitor operation.

We can confirm this problem with the following commands in Pacemaker 1.0.12.

Command 1) The crm_verify command outputs the difference in the digest code:

[root@rh63-heartbeat1 ~]# crm_verify -L
crm_verify[19988]: 2012/10/10_20:29:58 CRIT: check_action_definition: Parameters to prmDiskd:0_monitor_3 on rh63-heartbeat1 changed: recorded 7d7c9f601095389fc7cc0c6b29c61a7a vs. d38c85388dea5e8e2568c3d699eb9cce (reload:3.0.1) 0:0;6:1:0:ca6a5ad2-0340-4769-bab7-289a00862ba6

Command 2) The ptest command outputs the difference in the digest code, too:

[root@rh63-heartbeat1 ~]# ptest -L -VV
ptest[19992]: 2012/10/10_20:30:19 WARN: unpack_nodes: Blind faith: not fencing unseen nodes
ptest[19992]: 2012/10/10_20:30:19 CRIT: check_action_definition: Parameters to prmDiskd:0_monitor_3 on rh63-heartbeat1 changed: recorded 7d7c9f601095389fc7cc0c6b29c61a7a vs. d38c85388dea5e8e2568c3d699eb9cce (reload:3.0.1) 0:0;6:1:0:ca6a5ad2-0340-4769-bab7-289a00862ba6
[root@rh63-heartbeat1 ~]#

Command 3) After a cibadmin -B command, pengine restarts the monitor of a resource unnecessarily:
Oct 10 20:31:00 rh63-heartbeat1 pengine: [19842]: CRIT: check_action_definition: Parameters to prmDiskd:0_monitor_3 on rh63-heartbeat1 changed: recorded 7d7c9f601095389fc7cc0c6b29c61a7a vs. d38c85388dea5e8e2568c3d699eb9cce (reload:3.0.1) 0:0;6:1:0:ca6a5ad2-0340-4769-bab7-289a00862ba6
Oct 10 20:31:00 rh63-heartbeat1 pengine: [19842]: notice: RecurringOp: Start recurring monitor (30s) for prmDiskd:0 on rh63-heartbeat1
Oct 10 20:31:00 rh63-heartbeat1 pengine: [19842]: notice: LogActions: Leave resource prmDiskd:0#011(Started rh63-heartbeat1)
Oct 10 20:31:00 rh63-heartbeat1 crmd: [19839]: info: do_state_transition: State transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Oct 10 20:31:00 rh63-heartbeat1 crmd: [19839]: info: unpack_graph: Unpacked transition 2: 1 actions in 1 synapses
Oct 10 20:31:00 rh63-heartbeat1 crmd: [19839]: info: do_te_invoke: Processing graph 2 (ref=pe_calc-dc-1349868660-20) derived from /var/lib/pengine/pe-input-2.bz2
Oct 10 20:31:00 rh63-heartbeat1 crmd: [19839]: info: te_rsc_command: Initiating action 1: monitor prmDiskd:0_monitor_3 on rh63-heartbeat1 (local)
Oct 10 20:31:00 rh63-heartbeat1 crmd: [19839]: info: do_lrm_rsc_op: Performing key=1:2:0:ca6a5ad2-0340-4769-bab7-289a00862ba6 op=prmDiskd:0_monitor_3 )
Oct 10 20:31:00 rh63-heartbeat1 lrmd: [19836]: info: cancel_op: operation monitor[4] on prmDiskd:0 for client 19839, its parameters: CRM_meta_clone=[0] CRM_meta_prereq=[fencing] device=[/dev/vda] name=[diskcheck_status_internal] CRM_meta_clone_node_max=[1] CRM_meta_clone_max=[1] CRM_meta_notify=[false] CRM_meta_globally_unique=[false] crm_feature_set=[3.0.1] interval=[30] prereq=[fencing] CRM_meta_on_fail=[restart] CRM_meta_name=[monitor] CRM_meta_interval=[3] CRM_meta_timeout=[6] cancelled
Oct 10 20:31:00 rh63-heartbeat1 lrmd: [19836]: info: rsc:prmDiskd:0 monitor[5] (pid 20009)
Oct 10 20:31:00 rh63-heartbeat1 crmd: [19839]: info: process_lrm_event: LRM operation prmDiskd:0_monitor_3 (call=4, status=1, cib-update=0, confirmed=true) Cancelled
Oct 10 20:31:00 rh63-heartbeat1 lrmd: [19836]: info: operation monitor[5] on prmDiskd:0 for client 19839: pid 20009 exited with return code 0
Oct 10 20:31:00 rh63-heartbeat1 crmd: [19839]: info: append_digest: yamauchi Calculated digest 7d7c9f601095389fc7cc0c6b29c61a7a for prmDiskd:0_monitor_3 (0:0;1:2:0:ca6a5ad2-0340-4769-bab7-289a00862ba6). Source: parameters device=/dev/vda name=diskcheck_status_internal interval=30 prereq=fencing CRM_meta_timeout=6/
Oct 10 20:31:00 rh63-heartbeat1 crmd: [19839]: info: process_lrm_event: LRM operation prmDiskd:0_monitor_3 (call=5, rc=0, cib-update=53, confirmed=false) ok
Oct 10 20:31:00 rh63-heartbeat1 crmd: [19839]: info: match_graph_event: Action prmDiskd:0_monitor_3 (1)
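To illustrate how one stray parameter produces mismatched digests (md5sum here is purely illustrative; Pacemaker's actual digest input and the values in the logs differ from these):

```shell
#!/bin/sh
# Hash the monitor parameter set with and without the leaked start-only
# "prereq=fencing" parameter: any extra key changes the digest, which is
# what makes crmd/pengine report "Parameters ... changed".
without=$(printf 'device=/dev/vda name=diskcheck_status_internal interval=30' | md5sum | cut -d' ' -f1)
with=$(printf 'device=/dev/vda name=diskcheck_status_internal interval=30 prereq=fencing' | md5sum | cut -d' ' -f1)
echo "without prereq: $without"
echo "with prereq:    $with"
if [ "$without" != "$with" ]; then
    echo "digests differ -> spurious 'parameters changed', monitor restarted"
fi
```

This mirrors the report: the recorded digest was computed without the leaked parameter, the recalculated one with it, so every comparison fails even though nothing was really reconfigured.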
Re: [Linux-ha-dev] Slight bending of OCF specs: Re: Issues found in Apache resource agent
On 06/09/2012, at 12:30 AM, Lars Marowsky-Bree l...@suse.com wrote: On 2012-09-05T15:25:44, Dejan Muhamedagic de...@suse.de wrote: How about a new element? Something like

primitive vm1 ocf:heartbeat:VirtualDomain
require vm1 web-test dns-test

How we map this into Pacemaker's dependency scheme is obviously open to discussion. The require would imply that the resource vm1 requires the monitors of web-test and dns-test to succeed, in addition to its own monitor (if defined). Perhaps. But an as-a-whole attribute for the group's restart handling might already be enough, since we would want the system to eventually stabilize at the same state it reaches today (that is, with the group brought up to the last non-failing resource; otherwise, admins couldn't log in to the VM to fix the problem). Those two requirements seem at odds with each other. I doubt it would end well. I suspect you really want the restart everything trigger to be attached to the monitor-only resource (at the end). Monitor ops of web-test and dns-test will run only on the node where vm1 is started. They could also get the environment (parameters) of vm1. That's implicit in the group. Internally, this could indeed map to a symmetric or whatever aspect of the order dependency, yes, that could be set for the whole group. monocf may be just like ocf, sans start and stop operations. That would make all ocf RAs eligible for this use. None of the current resource agents would be able to cope with the use case I suggested, because they expect to run in the OS image where the service is provided - the idea of using the icinga/nagios plugins is exactly that they don't have this requirement, and thus can monitor the VM externally. For OCF agents, this sort-of already exists: meta is-managed=false. Regards, Lars -- Architect Storage/HA SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg) Experience is the name everyone gives to their mistakes. 
-- Oscar Wilde ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] fence_legacy patch
Thanks Piotr, I've applied your patch and it will be in 1.1.8 Sorry for the delay. On Thu, Aug 23, 2012 at 12:04 AM, Chmylkowski, Piotr piotr.chmylkow...@atos.net wrote: Dear HA-dev I have implemented vcenter fencing, but the following patch was required to get it working. The problem was: the variable HOSTLIST was not passed correctly from the crm configuration to the vcenter stonith script. In the crm configuration, HOSTLIST was something like HOSTLIST=hostname1=vm1;hostname2=vm2 but HOSTLIST was passed to the vcenter script as HOSTLIST=hostname1

--- 8< ---
diff -uN /usr/sbin/fence_legacy /usr/sbin/fence_legacy-patched
--- /usr/sbin/fence_legacy 2011-08-25 18:09:42.0 +0200
+++ /usr/sbin/fence_legacy-patched 2012-08-22 11:55:59.0 +0200
@@ -83,7 +83,7 @@
 $opt=$_;
 next unless $opt;
-($name,$val)=split /\s*=\s*/, $opt;
+($name,$val)=split /\s*=\s*/, $opt, 2;
 if ( $name eq ) {
--- 8< ---

Regards, Piotr
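The effect of the patch - limiting Perl's split to at most two fields - can be illustrated outside Perl. A rough Python equivalent of the before/after behaviour (the sample HOSTLIST value is taken from the report above):

```python
import re

opt = "HOSTLIST=hostname1=vm1;hostname2=vm2"

# Unlimited split breaks the value at every '=', so taking only the
# first two fields loses everything after the first embedded '='.
name, val = re.split(r"\s*=\s*", opt)[:2]
print(val)  # -> "hostname1"

# Splitting at most once (Perl's third argument to split; Python's
# maxsplit) keeps the embedded '=' characters in the value intact.
name, val = re.split(r"\s*=\s*", opt, maxsplit=1)
print(val)  # -> "hostname1=vm1;hostname2=vm2"
```

This reproduces exactly the symptom reported: the script received "hostname1" instead of the full host-to-VM mapping.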
Re: [Linux-ha-dev] apply_xml_diff: Digest mis-match
On Mon, Aug 13, 2012 at 11:39 PM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: Hi! In pacemaker-1.1.6-1.29.1 (SLES11 SP2 x86_64) I see this for an idle cluster with just one stonith resource running, when doing some unrelated change: Aug 13 15:33:19 h3 cib: [31938]: info: apply_xml_diff: Digest mis-match: expected 466ee3cf78eec0772f78b6fc965e9601, calculated e8d85a1134f84f8b6eb8dff8ff598f71 Aug 13 15:33:19 h3 cib: [31938]: notice: cib_process_diff: Diff 0.14.77 -> 0.15.1 not applied to 0.14.77: Failed application of an update diff Aug 13 15:33:19 h3 cib: [31938]: info: cib_server_process_diff: Requesting re-sync from peer I have the feeling there's something wrong in either generating the diff, or applying the diff. I would only expect this if either: - you're changing the order of resources in a group (which you aren't), or - a node first comes online and sees the other node. In the latter case, the diff applies cleanly, but the digest alerts us to the fact that we are missing other (unrelated) changes. Regards, Ulrich
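The point about the digest catching missing unrelated changes can be sketched with a toy model (this is illustrative only, not Pacemaker's CIB code; the dict keys and digest function are invented for the example):

```python
import hashlib

def digest(cib):
    """Toy stand-in for the CIB digest: hash the whole document."""
    return hashlib.md5(repr(sorted(cib.items())).encode()).hexdigest()

# The sender's CIB carries an extra change the receiver never saw.
sender = {"version": "0.15.1", "stonith-enabled": "true", "extra": "x"}
receiver = {"version": "0.14.77", "stonith-enabled": "true"}

diff = {"version": "0.15.1"}   # the only change this diff carries
expected = digest(sender)      # digest of the sender's full CIB

receiver.update(diff)          # the diff itself applies cleanly...
# ...but the digest mismatch reveals the missing "extra" change and
# forces the receiver to request a full re-sync from its peer.
assert digest(receiver) != expected
```

A plain textual diff application would have succeeded silently here; the whole-document digest is what makes the divergence detectable.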
Re: [Linux-ha-dev] STONITH agent for SoftLayer API
On Fri, Jun 8, 2012 at 1:19 PM, Alan Robertson al...@unix.sh wrote: Red Hat invented their own API, then disabled the working API in their version of the code. Of course, they don't have as many agents, and they're not as well tested. Red Hat has had their own API for a very long time - certainly long before Pacemaker was added to RHEL (the LHA agents never appeared there, so your timeline is way off). By my count there are at least 45 agents (more than double the number of non-external agents shipping in cluster-glue), and since RH doesn't ship things they don't test, "not as well tested" is doubtful at best.
Re: [Linux-ha-dev] Pacemaker and conntrackd RA not obeying colocation constraint
On Thu, Jun 7, 2012 at 5:37 PM, aldo sarmiento sarmi...@gmail.com wrote: Hello, I'm having a problem getting the conntrackd ms to work with a colocation constraint. I want to have conntrackd Master only on the node that has an IPaddr2 primitive running on it. Here are my specs:

Ubuntu: 12.04
corosync: 1.4.2
crm: 1.1.6
conntrackd: 1.0.0

Here is my configuration (notice I weighted fw02 higher than fw01 to test my failover):

node fw01 \
        attributes firewall=100
node fw02 \
        attributes firewall=150
primitive lip ocf:heartbeat:IPaddr2 \
        meta target-role=Started \
        params ip=192.168.100.2 cidr_netmask=24 nic=eth0
primitive updated-conntrackd ocf:updates:conntrackd
ms conntrackd updated-conntrackd \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Master
location conntrackd-run conntrackd \
        rule $id=conntrackd-run-rule-0 -inf: not_defined firewall or firewall number:lte 0 \
        rule $id=conntrackd-run-rule-1 firewall: defined firewall
location lip-loc lip \
        rule $id=lip-loc-rule-0 -inf: not_defined firewall or firewall number:lte 0 \
        rule $id=lip-loc-rule-1 firewall: defined firewall
colocation conntrackd-loc inf: conntrackd:Master lip:Started
property $id=cib-bootstrap-options \
        dc-version=1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c \
        cluster-infrastructure=openais \
        expected-quorum-votes=2 \
        symmetric-cluster=false \
        no-quorum-policy=ignore \
        stonith-enabled=false \
        last-lrm-refresh=1339040513

So based on my configuration above, I would expect conntrackd to be Master on fw02, but it's not:

root@fw01:~# crm status
Last updated: Wed Jun 6 20:46:55 2012
Last change: Wed Jun 6 20:41:53 2012 via crmd on fw01
Stack: openais
Current DC: fw01 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
3 Resources configured.
Online: [ fw01 fw02 ]
 Master/Slave Set: conntrackd [updated-conntrackd]
     Masters: [ fw01 ]
     Slaves: [ fw02 ]
 lip (ocf::heartbeat:IPaddr2): Started fw02

Your config looks right to me. Can you attach the output of cibadmin -Ql when the cluster is in this state?

Interesting thing is... if I add lip to a group with another primitive and run the same logic, failover works just fine:

root@fw01:~# crm configure show
node fw01 \
        attributes firewall=100
node fw02 \
        attributes firewall=150
primitive lip ocf:heartbeat:IPaddr2 \
        params ip=192.168.100.2 cidr_netmask=24 nic=eth0
primitive lip2 ocf:heartbeat:IPaddr2 \
        params ip=192.168.100.101 cidr_netmask=24 nic=eth0
primitive updated-conntrackd ocf:updates:conntrackd
group gateway lip lip2 \
        meta target-role=Started
ms conntrackd updated-conntrackd \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Master
location conntrackd-run conntrackd \
        rule $id=conntrackd-run-rule-0 -inf: not_defined firewall or firewall number:lte 0 \
        rule $id=conntrackd-run-rule-1 firewall: defined firewall
location gateway-loc gateway \
        rule $id=lip-loc-rule-0 -inf: not_defined firewall or firewall number:lte 0 \
        rule $id=lip-loc-rule-1 firewall: defined firewall
colocation conntrackd-loc inf: conntrackd:Master gateway:Started
property $id=cib-bootstrap-options \
        dc-version=1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c \
        cluster-infrastructure=openais \
        expected-quorum-votes=2 \
        symmetric-cluster=false \
        no-quorum-policy=ignore \
        stonith-enabled=false \
        last-lrm-refresh=1339041080

root@fw01:~# crm status
Last updated: Wed Jun 6 20:52:17 2012
Last change: Wed Jun 6 20:52:03 2012 via cibadmin on fw01
Stack: openais
Current DC: fw01 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
4 Resources configured.
Online: [ fw01 fw02 ]
 Master/Slave Set: conntrackd [updated-conntrackd]
     Masters: [ fw02 ]
     Slaves: [ fw01 ]
 Resource Group: gateway
     lip (ocf::heartbeat:IPaddr2): Started fw02
     lip2 (ocf::heartbeat:IPaddr2): Started fw02

Thanks, Aldo
Re: [Linux-ha-dev] [rfc] SBD with Pacemaker/Quorum integration
On Sat, May 26, 2012 at 5:56 AM, Lars Marowsky-Bree l...@suse.com wrote: On 2012-05-25T21:44:25, Florian Haas flor...@hastexo.com wrote: If so, the master thread will not self-fence even if the majority of devices is currently unavailable. That's it, nothing more. Does that help? It does. One naive question: what's the rationale of tying in with Pacemaker's view of things? Couldn't you just consume the quorum and membership information from Corosync alone? Yes and no. On SLE HA 11 (which, alas, is still the prime motivator for this), corosync actually gets that state from Pacemaker. And, ultimately, it is Pacemaker's belief (from the CIB) that pengine bases its fencing decisions on, so that's where we need to look. Further, quorum isn't enough. If we have quorum, the local node could still be dirty (as in: stop failures, unclean, ...) that imply that it should self-fence, pronto. Since this overrides the decision to self-fence if the devices are gone, and thus a real poison pill may no longer be delivered, we must take steps to minimize that risk. But yes, what it does now is to sign in both with corosync/ais and the CIB, querying quorum state from both. Fun anecdote, I originally thought being notification-driven might be good enough - until the testers started SIGSTOPping corosync/cib and complaining that the pacemaker watcher didn't pick up on that ;-) I know this is bound to have some holes. It can't perform a comprehensive health check of pacemaker's stack; yet, this only matters for as long as the loss of devices persists. During that degraded phase, the system is a bit more fragile. I'm a bit weary of this, because I'm *sure* these will all get reported one after another and further contribute to the code obfuscation, but such is reality ... (I have opinions on particularly the last failure mode. 
This seems to arise specifically when customers have built setups with two HBAs, two SANs, two storages, but then cross-linked the SANs, connected the HBAs to each, and the storages too. That seems to frequently lead to hiccups where the *entire* fabric is affected. I'm thinking this cross-linking is a case of sham redundancy; it *looks* as if it makes things more redundant, but in reality it reduces redundancy, since faults are no longer independent. Alas, they've not wanted to change that.) Henceforth, I'm going to dangle this thread in front of everyone who believes their SAN can never fail. Thanks. :) Heh. Please dangle it in front of them and explain the benefits of separation/isolation to them. ;-) If they followed our recommendation - 2 independent SANs, and a third iSCSI device over the network (okok, effectively that makes 3 SANs) - they'd never experience this. (Since that's how my lab is actually set up, I had some trouble following the problems they reported initially. Oh, and *don't* get me started on async IO handling in Linux.) Are there any SUSEisms in SBD or would you expect it to be packageable on any platform? Should be packageable on every platform, though I admit that I've not tried building the pacemaker module against anything but the corosync+pacemaker+openais stuff we ship on SLE HA 11 so far. I assume that this may need further work; at least the places I stole code from had special treatment. And the source code to crm_node (ccm_epoche.c) ... I *think* this may indicate opportunities for improving the client libraries in pacemaker to hide all that stuff better. Yep, suggestions are welcome. In theory it shouldn't be required, but in practice there are so many membership/quorum combinations that sadly the compatibility code has become worthy of a real API.
Re: [Linux-ha-dev] heartbeat gmain source priority inversion with rexmit and dead node detection
On Sat, Apr 28, 2012 at 12:11 AM, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661...@ybb.ne.jp wrote: Hi All, We ran a test that assumed a remote cluster environment, and we tested packet loss. You may be interested in this patch I have had lying around for ages. It may be incomplete for one corner case: on a seriously misconfigured and overloaded system, I have seen reports of a single send_local_status() (that is, basically one single send_cluster_msg()) which took longer to execute than deadtime (without even returning to the mainloop!). This corner case should be handled with a watchdog. But without a watchdog, and without stonith, the CCM was confused, because one node saw a leave then re-join after the partition event, while the other node did not even notice it had left and rejoined the membership... and pacemaker ended up being DC on both :-/ A side effect of the ccm being really confused, I assume? So I guess send_local_status() could do with an explicit call to check_for_timeouts(), but that may need recursion protection. I should really polish and push my queue some day soon... Cheers,

diff --git a/heartbeat/hb_rexmit.c b/heartbeat/hb_rexmit.c
--- a/heartbeat/hb_rexmit.c
+++ b/heartbeat/hb_rexmit.c
@@ -168,6 +168,7 @@ send_rexmit_request( gpointer data)
 	if (STRNCMP_CONST(node->status, UPSTATUS) != 0
 	&& STRNCMP_CONST(node->status, ACTIVESTATUS) != 0) {
 		/* no point requesting rexmit from a dead node. */
+		g_hash_table_remove(rexmit_hash_table, ri);
 		return FALSE;
 	}
@@ -243,7 +244,7 @@ schedule_rexmit_request(struct node_info
 	ri->seq = seq;
 	ri->node = node;
-	sourceid = Gmain_timeout_add_full(G_PRIORITY_HIGH - 1, delay,
+	sourceid = Gmain_timeout_add_full(PRI_REXMIT, delay,
 		send_rexmit_request, ri, NULL);
 	G_main_setall_id(sourceid, "retransmit request", config->heartbeat_ms/2, 10);
diff --git a/heartbeat/heartbeat.c b/heartbeat/heartbeat.c
--- a/heartbeat/heartbeat.c
+++ b/heartbeat/heartbeat.c
@@ -1585,7 +1585,7 @@ master_control_process(void)
 	send_local_status();
-	if (G_main_add_input(G_PRIORITY_HIGH, FALSE,
+	if (G_main_add_input(PRI_POLL, FALSE,
 		&polled_input_SourceFuncs) == NULL) {
 		cl_log(LOG_ERR, "master_control_process: G_main_add_input failed");
 	}
diff --git a/include/hb_api_core.h b/include/hb_api_core.h
--- a/include/hb_api_core.h
+++ b/include/hb_api_core.h
@@ -40,6 +40,12 @@
 #define PRI_READPKT (PRI_SENDPKT+1)
 #define PRI_FIFOMSG (PRI_READPKT+1)
+/* PRI_POLL is where the timeout checks on deadtime happen.
+ * Better be sure rexmit requests for lost packets
+ * from a now dead node do not preempt detecting it as being dead. */
+#define PRI_POLL (G_PRIORITY_HIGH)
+#define PRI_REXMIT PRI_POLL
+
 #define PRI_CHECKSIGS (G_PRIORITY_DEFAULT)
 #define PRI_FREEMSG (PRI_CHECKSIGS+1)
 #define PRI_CLIENTMSG (PRI_FREEMSG+1)
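The priority inversion the patch fixes can be shown with a toy event loop. This is a sketch, not GLib or heartbeat code: it only models GLib's rule that, per iteration, only sources at the best ready priority (lowest number) are dispatched, so lower-priority sources can be starved indefinitely. PRI_HIGH is a stand-in value:

```python
def one_iteration(sources):
    """Dispatch every source at the best ready priority, as GLib does;
    lower-priority sources are skipped entirely this iteration."""
    best = min(s["prio"] for s in sources)
    return [s["name"] for s in sources if s["prio"] == best]

PRI_HIGH = 100  # stand-in for G_PRIORITY_HIGH

# Before the patch: rexmit timers ran at G_PRIORITY_HIGH - 1 and so
# outranked the deadtime poll; as long as retransmit requests for lost
# packets kept firing, the poll that would declare the peer dead never ran.
before = [{"name": "rexmit", "prio": PRI_HIGH - 1},
          {"name": "deadtime-poll", "prio": PRI_HIGH}]
assert one_iteration(before) == ["rexmit"]

# After the patch: PRI_REXMIT == PRI_POLL, so both classes of source
# dispatch in the same iteration and dead-node detection is not starved.
after = [{"name": "rexmit", "prio": PRI_HIGH},
         {"name": "deadtime-poll", "prio": PRI_HIGH}]
assert one_iteration(after) == ["rexmit", "deadtime-poll"]
```

This is exactly the comment added to hb_api_core.h: rexmit requests for packets from a now-dead node must not preempt detecting that node as dead.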
Re: [Linux-ha-dev] [PATCH] Medium: Use the resource timeout as an override to the default dbus timeout for upstart RA
On Sat, Feb 18, 2012 at 12:00 AM, Ante Karamatic iv...@ubuntu.com wrote: On 17.02.2012 11:20, Andrew Beekhof wrote: Tangential question... but does upstart also implement the service binary? As in "service pacemaker start"? It does, but the exit status is always '0', which makes the 'service' binary unusable for monitoring the status of the service without parsing the command output.

10 head
20 desk
30 add
40 goto 10
Re: [Linux-ha-dev] [PATCH] Medium: Use the resource timeout as an override to the default dbus timeout for upstart RA
Tangential question... but does upstart also implement the service binary? As in "service pacemaker start"? On Fri, Feb 17, 2012 at 6:52 PM, Ante Karamatic iv...@ubuntu.com wrote:

# HG changeset patch
# User Ante Karamatić ante.karama...@canonical.com
# Date 1329463546 -3600
# Node ID 097ca775d3740a94591fbe0dd50124a51f140fff
# Parent d8c154589a16cb99ab16f36a27756ba94eefdbee
Medium: Use the resource timeout as an override to the default dbus timeout for upstart RA

diff --git a/lib/plugins/lrm/raexecupstart.c b/lib/plugins/lrm/raexecupstart.c
--- a/lib/plugins/lrm/raexecupstart.c
+++ b/lib/plugins/lrm/raexecupstart.c
@@ -169,7 +169,7 @@
 	/* It'd be better if it returned GError, so we can distinguish
 	 * between failure modes (can't contact upstart, no such job,
 	 * or failure to do action. */
-	if (upstart_job_do(rsc_type, cmd)) {
+	if (upstart_job_do(rsc_type, cmd, timeout)) {
 		exit(EXECRA_OK);
 	} else {
 		exit(EXECRA_NO_RA);
diff --git a/lib/plugins/lrm/upstart-dbus.c b/lib/plugins/lrm/upstart-dbus.c
--- a/lib/plugins/lrm/upstart-dbus.c
+++ b/lib/plugins/lrm/upstart-dbus.c
@@ -319,7 +319,7 @@
 }

 gboolean
-upstart_job_do(const gchar *name, UpstartJobCommand cmd)
+upstart_job_do(const gchar *name, UpstartJobCommand cmd, const int timeout)
 {
 	DBusGConnection *conn;
 	DBusGProxy *manager;
@@ -342,7 +342,8 @@
 	switch (cmd) {
 	case UPSTART_JOB_START:
 		cmd_name = "Start";
-		dbus_g_proxy_call (job, cmd_name, &error,
+		dbus_g_proxy_call_with_timeout (job, cmd_name,
+			timeout, &error,
 			G_TYPE_STRV, no_args,
 			G_TYPE_BOOLEAN, TRUE,
 			G_TYPE_INVALID,
@@ -352,7 +353,8 @@
 		break;
 	case UPSTART_JOB_STOP:
 		cmd_name = "Stop";
-		dbus_g_proxy_call(job, cmd_name, &error,
+		dbus_g_proxy_call_with_timeout(job, cmd_name,
+			timeout, &error,
 			G_TYPE_STRV, no_args,
 			G_TYPE_BOOLEAN, TRUE,
 			G_TYPE_INVALID,
@@ -360,7 +362,8 @@
 	case UPSTART_JOB_RESTART:
 		cmd_name = "Restart";
-		dbus_g_proxy_call (job, cmd_name, &error,
+		dbus_g_proxy_call_with_timeout (job, cmd_name,
+			timeout, &error,
 			G_TYPE_STRV, no_args,
 			G_TYPE_BOOLEAN, TRUE,
 			G_TYPE_INVALID,
diff --git a/lib/plugins/lrm/upstart-dbus.h b/lib/plugins/lrm/upstart-dbus.h
--- a/lib/plugins/lrm/upstart-dbus.h
+++ b/lib/plugins/lrm/upstart-dbus.h
@@ -29,7 +29,7 @@
 } UpstartJobCommand;

 G_GNUC_INTERNAL gchar **upstart_get_all_jobs(void);
-G_GNUC_INTERNAL gboolean upstart_job_do(const gchar *name, UpstartJobCommand cmd);
+G_GNUC_INTERNAL gboolean upstart_job_do(const gchar *name, UpstartJobCommand cmd, const int timeout);
 G_GNUC_INTERNAL gboolean upstart_job_is_running (const gchar *name);

 #endif /* _UPSTART_DBUS_H_ */
Re: [Linux-ha-dev] [Patch] OCF_RESKEY_CRM_meta_timeout not matching monitor timeout meta-data.
On Thu, Dec 15, 2011 at 8:45 PM, renayama19661...@ybb.ne.jp wrote: Hi Dejan, Thank you for comment. It looks like a wrong place for a fix. Shouldn't crmd send all environment? It is only by chance that we have the timeout value available in this function. In the case of stop, crmd does not ask lrmd for the substitution of the parameter.

(snip)
	/* reset the resource's parameters? */
	if (op->interval == 0) {
		if (safe_str_eq(CRMD_ACTION_START, operation)
		    || safe_str_eq(CRMD_ACTION_STATUS, operation)) {
			op->copyparams = 1;
		}
	}
(snip)

When a resource parameter is changed, this affects how lrmd stops the resource; the changed parameter must not be copied. When stopping, you always want to use the old parameters (think of someone changing 'ip' for an IPaddr resource). Options that are interpreted by the crmd or lrmd are a different matter, which resulted in: https://github.com/ClusterLabs/pacemaker/commit/fcfe6fe522138343e4138248829926700fac213e My patch is an example of handling it in lrmd. Is there a better patch? * For example, it may be good to give copyparams a different value. Best Regards, Hideo Yamauchi. --- On Thu, 2011/12/15, Dejan Muhamedagic de...@suse.de wrote: Hi Hideo-san, On Thu, Dec 15, 2011 at 06:21:00PM +0900, renayama19661...@ybb.ne.jp wrote: Hi All, I made a patch which fixes the old problem below. * http://www.gossamer-threads.com/lists/linuxha/users/70262 In consideration of the influence when a parameter is changed, I replace only the value of timeout. Please confirm my patch. And please commit the patch. It looks like a wrong place for a fix. Shouldn't crmd send all environment? It is only by chance that we have the timeout value available in this function. Cheers, Dejan Best Regards, Hideo Yamauchi.
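The "stop must use the old parameters" point can be made concrete with a toy model. This is hypothetical illustration, not lrmd code; the addresses and the stop logic are invented for the example:

```python
# After an admin edits 'ip' in the configuration, the address actually
# configured on the node is still the OLD one. A stop that copies the
# freshly edited parameters would try to remove an address that was
# never brought up, and fail.
running = {"ip": "192.168.100.2"}      # what the node actually has up
new_params = {"ip": "192.168.100.9"}   # freshly edited configuration

def stop(params, running):
    """Remove the address named in params; fail if it isn't present."""
    return "ok" if params["ip"] == running["ip"] else "stop failure"

assert stop(new_params, running) == "stop failure"      # copying new params breaks stop
assert stop({"ip": "192.168.100.2"}, running) == "ok"   # cached old params work
```

A stop failure is serious in Pacemaker: the node typically ends up fenced, which is why resource parameters are deliberately not re-copied for stop operations.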
Re: [Linux-ha-dev] [Patch] OCF_RESKEY_CRM_meta_timeout not matching monitor timeout meta-data.
On Fri, Dec 16, 2011 at 1:21 PM, renayama19661...@ybb.ne.jp wrote: Hi Andrew, Thank you for comment. When stopping, you always want to use the old parameters (think of someone changing 'ip' for an IPaddr resource). Options that are interpreted by the crmd or lrmd are a different matter, which resulted in: https://github.com/ClusterLabs/pacemaker/commit/fcfe6fe522138343e4138248829926700fac213e All right. Will you apply this correction to 1.0 of Pacemaker? Sure. We'll pick it up for .13 Best Regards, Hideo Yamauchi. --- On Fri, 2011/12/16, Andrew Beekhof and...@beekhof.net wrote: On Thu, Dec 15, 2011 at 8:45 PM, renayama19661...@ybb.ne.jp wrote: Hi Dejan, Thank you for comment. It looks like a wrong place for a fix. Shouldn't crmd send all environment? It is only by chance that we have the timeout value available in this function. In the case of stop, crmd does not ask lrmd for the substitution of the parameter.

(snip)
	/* reset the resource's parameters? */
	if (op->interval == 0) {
		if (safe_str_eq(CRMD_ACTION_START, operation)
		    || safe_str_eq(CRMD_ACTION_STATUS, operation)) {
			op->copyparams = 1;
		}
	}
(snip)

When a resource parameter is changed, this affects how lrmd stops the resource; the changed parameter must not be copied. When stopping, you always want to use the old parameters (think of someone changing 'ip' for an IPaddr resource). Options that are interpreted by the crmd or lrmd are a different matter, which resulted in: https://github.com/ClusterLabs/pacemaker/commit/fcfe6fe522138343e4138248829926700fac213e My patch is an example of handling it in lrmd. Is there a better patch? * For example, it may be good to give copyparams a different value. Best Regards, Hideo Yamauchi. --- On Thu, 2011/12/15, Dejan Muhamedagic de...@suse.de wrote: Hi Hideo-san, On Thu, Dec 15, 2011 at 06:21:00PM +0900, renayama19661...@ybb.ne.jp wrote: Hi All, I made a patch which fixes the old problem below.
* http://www.gossamer-threads.com/lists/linuxha/users/70262 In consideration of the influence when a parameter is changed, I replace only the value of timeout. Please confirm my patch. And please commit the patch. It looks like a wrong place for a fix. Shouldn't crmd send all environment? It is only by chance that we have the timeout value available in this function. Cheers, Dejan Best Regards, Hideo Yamauchi.
Re: [Linux-ha-dev] In RHEL5 and RHEL6 about different HA_RSCTMP
On Tue, Dec 6, 2011 at 8:00 PM, nozawat noza...@gmail.com wrote: Hi A maintenance mode in the heartbeat stack does not work in RHEL6 because of this difference. The reason is that /var/run/heartbeat/rsctmp is deleted at initialization of Heartbeat. Right, but the location and its deletion date back over 8 years IIRC. Some RAs create a status file there, which makes the resource seem stopped? At the point heartbeat starts, all bets are off, and the RA needs to be able to correctly rediscover its own state. The resource-agents.spec files are as follows.
--
151 %if 0%{?fedora} >= 11 || 0%{?centos_version} > 5 || 0%{?rhel} > 5
152 CFLAGS="$(echo '%{optflags}')"
153 %global conf_opt_rsctmpdir --with-rsctmpdir=%{_var}/run/heartbeat/rsctmp
154 %global conf_opt_fatal --enable-fatal-warnings=no
155 %else
156 CFLAGS="${CFLAGS} ${RPM_OPT_FLAGS}"
157 %global conf_opt_fatal --enable-fatal-warnings=yes
158 %endif
--
Why is rsctmp used in RHEL6? Should the 153 line be deleted if possible? I can contribute a patch if that would be good. Regards, Tomo
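The pitfall being discussed - an RA trusting a status file under a tmpdir that heartbeat wipes at startup - can be sketched as follows. This is hypothetical RA logic, not any real agent; the file name and directory are stand-ins:

```python
import os
import tempfile

RSCTMP = tempfile.mkdtemp()  # stand-in for /var/run/heartbeat/rsctmp
state_file = os.path.join(RSCTMP, "myresource.state")

def naive_monitor():
    """An RA that infers resource state purely from its status file."""
    return "running" if os.path.exists(state_file) else "stopped"

open(state_file, "w").close()   # resource started; status file written
assert naive_monitor() == "running"

os.remove(state_file)           # heartbeat init wipes the tmpdir on (re)start
# The service process may still be alive, yet the RA now reports
# "stopped" - which is why the RA must rediscover real state instead.
assert naive_monitor() == "stopped"
```

This illustrates the reply above: after heartbeat starts, "all bets are off" for cached state, and a monitor must probe the actual service, not a marker file.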
Re: [Linux-ha-dev] VirtualDomain issue
On Mon, Nov 14, 2011 at 9:58 PM, Dejan Muhamedagic de...@suse.de wrote: Hi, On Thu, Jun 23, 2011 at 07:51:48AM +0200, Dominik Klein wrote: Hi code snippet from http://hg.linux-ha.org/agents/raw-file/7a11934b142d/heartbeat/VirtualDomain (which I believe is the current version)

VirtualDomain_Validate_All() {
	snip
	if [ ! -r $OCF_RESKEY_config ]; then
		if ocf_is_probe; then
			ocf_log info "Configuration file $OCF_RESKEY_config not readable during probe."
		else
			ocf_log error "Configuration file $OCF_RESKEY_config does not exist or is not readable."
			return $OCF_ERR_INSTALLED
		fi
	fi
}
snip
VirtualDomain_Validate_All || exit $?
snip
if ocf_is_probe && [ ! -r $OCF_RESKEY_config ]; then
	exit $OCF_NOT_RUNNING
fi

So, say one node does not have the config, but the cluster decides to run the vm on that node. The probe returns NOT_RUNNING, so the cluster tries to start the vm; that start returns ERR_INSTALLED; the cluster has to try to recover from the start failure, so it stops it, but that stop op returns ERR_INSTALLED as well, so we need to be stonith'd. I think this is wrong behaviour. On stop, it should return OCF_SUCCESS. I wonder if it would be safe for the CRM to interpret ERR_INSTALLED on stop as resource stopped. Opinions? Feels dangerous. Even if the binaries are missing, the RA should arguably look for and kill any relevant processes before returning OK. Cheers, Dejan P.S. Very sorry for such a delay! I read the comments about configurations being on shared storage which might not be available at certain points in time and I see the point. But the way this is implemented clearly does not work for everybody. I vote for making this configurable. Unfortunately, due to several reasons, I am not able to contribute this patch myself at the moment.
Regards Dominik
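The failure chain Dominik describes can be modelled in a few lines. This is a toy model of the recovery logic, not real crmd code; only the OCF return-code values are taken from the OCF spec:

```python
# Standard OCF exit codes (per the OCF resource agent API).
OCF_SUCCESS = 0
OCF_ERR_INSTALLED = 5
OCF_NOT_RUNNING = 7

def recover(stop_rc):
    """A failed start is recovered by stopping the resource; a stop
    that itself fails leaves the cluster no option but fencing."""
    if stop_rc in (OCF_SUCCESS, OCF_NOT_RUNNING):
        return "stopped"
    return "fence node"

# Current agent behaviour on the node without the config file:
# stop re-runs validate-all and returns ERR_INSTALLED -> fencing.
assert recover(OCF_ERR_INSTALLED) == "fence node"

# Proposed behaviour: stop reports SUCCESS when nothing is running,
# so the start failure is recovered without stonith.
assert recover(OCF_SUCCESS) == "stopped"
```

This is why stop semantics are so sensitive: any non-success exit from a stop operation escalates straight to node-level fencing.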
Re: [Linux-ha-dev] attrd and repeated changes
On Sat, Oct 22, 2011 at 7:14 AM, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Thu, Oct 20, 2011 at 08:48:36AM -0600, Alan Robertson wrote: On 10/20/2011 03:41 AM, Philipp Marek wrote: Hello, when constantly sending new data via attrd, the changes are never used. Example:

while sleep 1
do
	attrd_updater -l reboot -d 5 -n rep_chg -U try$SECONDS
	cibadmin -Ql | grep rep_chg
done

This always returns the same value - the one that was given with more than 5 seconds delay afterwards, so that the dampen interval wasn't broken by the next change. I've attached two draft patches; one for allowing the _first_ value in a dampen interval to be used (effectively ignoring changes until this value is written), and one for using the _last_ value in the dampen interval (by not changing the dampen timer). [1] *** Note: they are for discussion only! *** I didn't test them, not even for compilation. What is the correct way to handle multiple updates within the dampen interval? Personally, I'd vote for the last value. I agree with you about this being a bug. If the attribute is used to check connectivity changes (ping resource agent), or similar, and we have flaky, flapping connectivity, it would be useful to have a max or min consolidation function for incoming values during a dampen interval. Otherwise, I get + + - + + -|+ + + + and if the dampen interval just happened to expire where I put the | above, it would have pushed a - to the cib, where I'd rather have kept it at +. That's why dampen should typically be a multiple of the monitor interval. We likely want to add an option to attrd_updater (and to the ipc messages it sends to attrd, and to the rest of the chain involved), which can specify the consolidation function to be used. The initial set I suggest would be generic: oldest, latest (default?),
for values assumed to be numeric: max (also a candidate for default behaviour) min avg (with a printf like template for rounding, %.2f or similar, so we could even average boolean values) For avg you'd need to specify how many values to remember. I suggest this behaviour: * If different updates request a different consolidation function, the last one (within the respective dampen interval) wins. * update with the _same_ value: Do not start or modify any timer. If a timer is pending, still add the value to the list of values to be processed by the consolidation function (relevant for avg, possibly not yet listed others). * update with a different value: Start a new timer, unless one is pending already. Do not restart/modify an already pending timer. Add to the list of values for the consolidation function. * Flush message received: expire timer. See below. * Timer expires: Apply consolidation function to list of values. If list is empty (probably flush message without pending timer), use current value. Send that result to the cib. Sounds reasonable, there's no way I'm going to be able to get to implement it any time soon though. If someone else wants to implement it, I think it would be useful to have it be part of a larger rework that ensured atomicity of the updates. I.e. have all nodes send their values to a designated instance which did all the updates. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
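A toy model of the proposal above (illustrative only, not attrd code): values collected during one dampen interval are reduced by a single consolidation function on timer expiry.

```shell
# Sketch of the proposed consolidation step. The function names mirror the
# set suggested in the thread; everything here is hypothetical shell, not
# the real attrd implementation.
consolidate() {  # usage: consolidate <func> <values...>
    func=$1; shift
    case $func in
        latest) for v in "$@"; do last=$v; done; echo "$last" ;;
        oldest) echo "$1" ;;
        max)    printf '%s\n' "$@" | sort -n | tail -n 1 ;;
        min)    printf '%s\n' "$@" | sort -n | head -n 1 ;;
        avg)    printf '%s\n' "$@" | awk '{ s += $1 } END { printf "%.2f\n", s / NR }' ;;
    esac
}

# The flapping +/- series from the example (1=up, 0=down): with "max" the
# interval still reports up, instead of whatever value the timer landed on.
consolidate max 1 0 1 1
```

With "latest" this reduces to Philipp's second draft patch; "oldest" reduces to the first.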
Re: [Linux-ha-dev] How to use reload action of RA agent?
On Fri, Oct 7, 2011 at 2:42 PM, Serge Dubrouski serge...@gmail.com wrote: Hello - How is one supposed to use the reload action of an RA if it's supported by the RA? When I try to set up an order like this: order Reload_After_Start +inf: res1:start res2:reload neither crm nor cibadmin allows me to define such an order. Right, it's not something you specify like that. It happens automatically when the resource definition changes: http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-reload.html -- Serge Dubrouski. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
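For illustration, reload is triggered by changing a resource parameter rather than by an ordering constraint; the resource and parameter names below are made up, and the behaviour assumes the agent advertises a reload action:

```
# Hypothetical resource/parameter; crm_resource is the stock Pacemaker CLI.
crm_resource --resource res2 --set-parameter loglevel --parameter-value debug
# If res2's agent advertises a reload action and 'loglevel' is declared
# unique="0" in its metadata, the next transition runs reload instead of
# a full stop/start cycle.
```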
Re: [Linux-ha-dev] [Linux-HA] [ha-wg] CFP: HA Mini-Conference in Prague on Oct 25th
On Sat, Oct 8, 2011 at 6:03 AM, Digimer li...@alteeve.com wrote: On 10/07/2011 02:58 PM, Florian Haas wrote: Vienna before the early afternoon of Saturday the 29th, so if anyone has plans to do something interesting that Saturday morning I'd be more than happy to join. Cheers, Florian I'm going to be in the city all day Saturday as well. Knowing there will be at least a few who will have trouble making the unofficial meeting on the 26th, The 26th is just the meeting start. perhaps we could have an even more informal meeting/talk/debrief/waffles on Saturday morning? I'll be flying to Munich late on Friday afternoon. -- Digimer E-Mail: digi...@alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org At what point did we forget that the Space Shuttle was, essentially, a program that strapped human beings to an explosion and tried to stab through the sky with fire and math? ___ Linux-HA mailing list linux...@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] [Linux-HA] [ha-wg] CFP: HA Mini-Conference in Prague on Oct 25th
On Thu, Oct 6, 2011 at 1:53 AM, Lars Marowsky-Bree l...@suse.com wrote: On 2011-10-03T11:10:13, Andrew Beekhof and...@beekhof.net wrote: Based on Boston last year, I imagine the conversations will last right up until Lars starts presenting his talk on Friday afternoon. People came and went at random, and if someone essential was missing for a conversation we deferred it until later. Oh, then we're going to not stop, ever - because I don't have a talk at the main conference this time ;-) The schedule has you in a friday afternoon slot iirc. Very informal, but it seemed to work ok. yes, and given that the ha mailing lists are still down, probably the best we can hope for ... indeed ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] [ha-wg] CFP: HA Mini-Conference in Prague on Oct 25th
On Sat, Oct 1, 2011 at 12:55 AM, Digimer li...@alteeve.com wrote: On 09/27/2011 07:58 AM, Lars Marowsky-Bree wrote: Hi all, it turns out that there was zero feedback about people wanting to present, only some about travel budget being too tight to come. So we had some discussions about whether to cancel this completely, as this made planning rather difficult. But just in the last few days, I got a fair share of e-mails asking if this still takes place, and who is going to be there. ;-) So: we have the room. I will be there, and it seems so will at least a few other people, including Andrew. I suggest we do it in an unconference style and draw up the agenda as we go along; you're welcome to stop by and discuss HA/clustering topics that are important to you. It is going to be as successful as we all make it out to be. We share the venue with LinuxCon Europe: Clarion Congress Hotel · Prague, Czech Republic, on Oct 25th. I suggest we start at 9:30 in the morning and go from there. Regards, Lars Is it possible, if this isn't set in stone, to push back to later in the day? I don't fly in until the 25th, and I think there is one other person who wants to attend in the same boat. Based on Boston last year, I imagine the conversations will last right up until Lars starts presenting his talk on Friday afternoon. People came and went at random, and if someone essential was missing for a conversation we deferred it until later. Very informal, but it seemed to work ok. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] [Pacemaker] ping RA question
Dan - any objections if I incorporate the fping parts into the ping RA? On Fri, Jul 29, 2011 at 12:47 AM, Dan Urist dur...@ucar.edu wrote: Here's my fping RA, for anyone who's interested. Note that some of the parameters are different than ping/pingd, since fping works differently. The major advantages of fping over the system ping are that multiple hosts can be pinged with a single fping command, and fping will return as soon as all hosts succeed (the linux system ping will not return until it has exhausted either its count or the timeout, regardless of success). -- Dan Urist dur...@ucar.edu 303-497-2459 ___ Pacemaker mailing list: pacema...@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
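The difference Dan describes can be seen from the command line; hosts and timeout below are illustrative, consult fping(8) for your version's options:

```
# One fping process probes the whole list and returns once every host has
# answered (or its retries are exhausted):
fping -q -c 1 -t 500 192.168.1.1 192.168.1.2 192.168.1.254

# versus one ping per host, each running for its full count/timeout even
# after a success elsewhere:
for h in 192.168.1.1 192.168.1.2 192.168.1.254; do ping -c 1 -W 1 $h; done
```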
Re: [Linux-ha-dev] [PATCH] pacemaker-1.1.5 : fix autotools build system
applied. thanks! On Tue, Jul 12, 2011 at 6:54 PM, Ultrabug ultra...@gentoo.org wrote: Hello mates, I would like you to consider having the attached patch committed in order to fix and improve the build system of pacemaker. We Gentoo compilation lovers have to apply this patch in order to have a clean and working build of pacemaker, mostly because we use LDFLAGS=--as-needed in our default setup. As you may already know, this requires the library linking to be strictly ordered and organized, and this patch is meant to do that. If you remember, we of the Gentoo cluster team already submitted a similar patch (see attached message) but it seems some features/files have slipped since then. Thanks in advance for considering. Ultrabug, Gentoo cluster herd. PS: sorry, posted this message earlier with the wrong mail account ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] pacemaker - migrate RA, based on the state of other RA, w/o clone?
On Thu, Jul 14, 2011 at 1:51 AM, RNZ renoi...@gmail.com wrote: I made the following resource agent - https://github.com/rnz/resource-agents/blob/master/heartbeat/couchdb At the end of the file there is this example configuration:

node vub001
node vub002
primitive couchdb-1 ocf:heartbeat:couchdb \
    params dbuser=user dbpass=pass replcmds="http://admin:pass@192.168.1.2:5984/testdb,http://admin:pass@127.0.0.1:5984/testdb,true;http://admin:pass@192.168.1.2:5984/testdb1,http://admin:pass@127.0.0.1:5984/testdb1,true;;" \
    op start interval=0 timeout=20s \
    op stop interval=0 timeout=20s \
    op monitor interval=10s \
    meta target-role=Started
primitive couchdb-2 ocf:heartbeat:couchdb \
    params dbuser=user dbpass=pass replcmds="http://admin:pass@192.168.1.1:5984/testdb,http://admin:pass@127.0.0.1:5984/testdb,true;http://admin:pass@192.168.1.1:5984/testdb1,http://admin:pass@127.0.0.1:5984/testdb1,true;;" \
    op start interval=0 timeout=20s \
    op stop interval=0 timeout=20s \
    op monitor interval=10s \
    meta target-role=Started
primitive vIP ocf:heartbeat:IPaddr2 \
    params ip=192.168.1.10 nic=eth1 \
    op start interval=0 timeout=20s \
    op stop interval=0 timeout=20s \
    op monitor interval=5s timeout=20s depth=0 \
    meta target-role=Started
location cdb-1-c couchdb-1 inf: vub001
location cdb-1-p couchdb-1 -inf: vub002
location cdb-2-c couchdb-2 inf: vub002
location cdb-2-p couchdb-2 -inf: vub001
location vIP_c vIP 100: vub001
location vIP_p vIP 10: vub002
property $id=cib-bootstrap-options \
    cluster-infrastructure=openais \
    expected-quorum-votes=2 \
    no-quorum-policy=ignore \
    stonith-enabled=false \
    symmetric-cluster=false
rsc_defaults $id=rsc-options \
    resource-stickiness=110

-- Each CouchDB resource stays pinned to its node, because they use master-master/multi-master replication. And I want to make couchdb replication easy to set up, without using external files for the replication configuration. I need the vIP resource to migrate to the other node when the couchdb resource on the current node fails or stops. How can I express that in the Pacemaker configuration?
I think you want to be using the Master/Slave construct of pacemaker. That would let you colocate the vIP with instances of couchdb that have the master role. Maybe use a location rule with #uname, or is an additional RA needed for controlling the state (as pingd does)? One could use the following method:

#!/bin/bash
curl -s http://127.0.0.1:5984 | grep -q 'couchdb'
if [ $? != 0 ]; then
    # ... add control by crm_mon of started couchdb on current node
    crm resource migrate vIP
    crm configure delete cli-standby-vIP
fi

But I think this is not good. P.S. Sorry for my bad English... ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
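A sketch of the Master/Slave arrangement Andrew suggests, assuming the couchdb RA were extended with promote/demote support (the version linked above does not have it); names follow the configuration in the question:

```
primitive couchdb ocf:heartbeat:couchdb \
    params dbuser=user dbpass=pass \
    op monitor interval=10s role=Master \
    op monitor interval=11s role=Slave
ms ms-couchdb couchdb \
    meta master-max=2 clone-max=2 notify=true
# vIP may only run where a couchdb instance currently holds the Master role:
colocation vIP-with-couchdb inf: vIP ms-couchdb:Master
```

With master-max=2 both nodes can hold the master role (matching the multi-master replication), and the colocation moves vIP away from a node whose instance loses it.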
Re: [Linux-ha-dev] Filesystem ocf file
On Fri, May 6, 2011 at 9:37 AM, Florian Haas florian.h...@linbit.com wrote: On 2011-05-06 09:26, Darren Thompson wrote: Team I was reviewing some errors on a cluster mounted file-system that caused me to review the Filesystem ocf file. I notice that it uses an undeclared parameter, OCF_CHECK_LEVEL, to determine what degree of testing of the filesystem is required in monitor. I have now updated it to more formally work with a check_level value with the more obvious values of mounted, read, write (my updated version attached). Could someone (Florian, is this something you can do?) please review this with a view to patching the upstream Filesystem ocf file. NACK, sorry. OCF_CHECK_LEVEL is specific to the monitor action and described as such in the OCF spec; this will not be changed without a change to the spec. To use it, set op monitor interval=X OCF_CHECK_LEVEL=Y Yes, it's poorly designed, it makes no sense that this is pretty much the only sensible place to set a parameter specifically for an operation (as opposed to on a resource), it's inexplicable why it's all caps, etc., but that's the way it is. Honest. It was broken when we got here. Maybe it was the neighbor's dog? ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
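In crm shell terms, Florian's "op monitor interval=X OCF_CHECK_LEVEL=Y" looks roughly like this; device and mount point are illustrative, and (to the best of my knowledge) the shipped Filesystem agent treats level 10 as a read test and 20 as a write test:

```
primitive fs ocf:heartbeat:Filesystem \
    params device=/dev/drbd0 directory=/mnt/data fstype=ext3 \
    op monitor interval=20s OCF_CHECK_LEVEL=10 \
    op monitor interval=60s OCF_CHECK_LEVEL=20
```

Note the two monitors need distinct intervals, since (name, interval) pairs must be unique per resource.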
Re: [Linux-ha-dev] [ha-wg] Cluster Stack - Ubuntu Developer Summit
On Thu, May 5, 2011 at 10:25 AM, Florian Haas florian.h...@linbit.com wrote: On 2011-04-26 19:33, Andres Rodriguez wrote: UDS' are open-to-public events, and I believe it would be great if upstream could participate and maybe even further the discussion about the Cluster Stack. For more information about UDS, please visit [1]. The specific date/time for the Cluster Stack session is not yet available. If you require any further information please don't hesitate to contact me. Andres already knows this, but FWIW I'll repost here that I'll be at UDS in time for the cluster stack session at 12 noon on 5/12. I'll stay in Budapest that evening and will probably join the Budapest sightseeing tour that the Hungarian Ubuntu team is organizing, so if anyone wants to link up with Andres and me for a few beverages please let us know. Andrew, interested in making a day trip to Budapest while you're still on this continent? With under 4 weeks to go - not a chance :-) ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] ACLs and privilege escalation (was Re: New OCF RA: symlink)
On Thu, May 5, 2011 at 9:09 AM, Florian Haas florian.h...@linbit.com wrote: Rather than going into ACLs in more detail, I wanted to highlight that however we limit access to the CIB, the resource agents still _execute_ as root, so we will always have what would normally be considered a privilege escalation issue. Now, we could agree on security guidelines for RAs, and some of those would certainly be no-brainers to define (such as, don't ever eval unsanitized user input), but I refuse to even suggest to tackle any such guidelines before the OCF spec update has gotten off the ground. One such thing that could be added to the spec would be optional meta variables named user and group, directing the LRM (or any successor) to execute the RA as that user rather than root. Just an idea. Seems plausible. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] New OCF RA: symlink
On Wed, May 4, 2011 at 4:36 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote: Services running under Pacemaker control are probably critical, so a malicious person with even only stop access on the CIB can do a DoS. I guess we have to assume people with any write access at all to the CIB are trusted, and not malicious. Exactly. If the cluster (or access to it) has been compromised, you're in for so much pain that a symlink RA is the least of your problems. A generic cluster manager is, by design, a way to run arbitrary scripts as root - there's no coming back from there. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] Translate crm_cli.txt to Japanese
On Wed, Apr 27, 2011 at 12:54 PM, Dejan Muhamedagic de...@suse.de wrote: Hi Junko-san, On Wed, Apr 27, 2011 at 06:42:52PM +0900, Junko IKEDA wrote: Hi, May I suggest that you go with the devel version, because crm_cli.txt was converted to crm.8.txt. There are not many textual changes, just some obsolete parts removed. OK, I got crm.8.txt from devel. The directory structure for each of Pacemaker 1.0, 1.1 and devel is just a bit different. Does 1.0 keep its doc dir structure for now? Until the next release I guess. If so, it seems that just creating the html file is not so difficult when asciidoc is available. No, not difficult. It just depends on the build environment. If asciidoc is found by configure, then it is going to be used to produce the html files. Do any distros _not_ ship asciidoc? ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] Translate crm_cli.txt to Japanese
On Wed, Apr 27, 2011 at 3:47 PM, Dejan Muhamedagic de...@suse.de wrote: On Wed, Apr 27, 2011 at 02:01:40PM +0200, Andrew Beekhof wrote: On Wed, Apr 27, 2011 at 12:54 PM, Dejan Muhamedagic de...@suse.de wrote: Hi Junko-san, On Wed, Apr 27, 2011 at 06:42:52PM +0900, Junko IKEDA wrote: Hi, May I suggest that you go with the devel version, because crm_cli.txt was converted to crm.8.txt. There are not many textual changes, just some obsolete parts removed. OK, I got crm.8.txt from devel. Each directory structure for Pacemaker 1.0,1.1 and devel is just a bit different. Does 1.0 keep its doc dir structure for now? Until the next release I guess. If so, it seems that just create html file is not so difficult when asciidoc is available. No, not difficult. It just depends on the build environment. If asciidoc is found by configure, then it is going to be used to produce the html files. Do any distros _not_ ship asciidoc? AFAIK none of contemporary distributions. And going back three years or so, it's the other way around. How quickly we forget. Anyway, I advocate that the project makes decisions based on it being around (but fails gracefully when its not) and leaves it up to older distros to ship a pre-generated copy if they so desire. I can't imagine lack of HTML versions being a deal breaker. And by fail gracefully, I mean the current behavior of just not building those versions of the doc. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] Bug in crm shell or pengine
On Mon, Apr 18, 2011 at 11:38 PM, Serge Dubrouski serge...@gmail.com wrote: Ok, I've read the documentation. It's not a bug, it's a feature :-) Might be nice if the shell could somehow prevent such configs, but it would be non-trivial to implement. On Mon, Apr 18, 2011 at 3:01 PM, Serge Dubrouski serge...@gmail.com wrote: Hello - Looks like there is a bug in the crm shell (Pacemaker version 1.1.5) or in the pengine.

primitive pg_drbd ocf:linbit:drbd \
    params drbd_resource=drbd0 \
    op monitor interval=60s role=Master timeout=10s \
    op monitor interval=60s role=Slave timeout=10s

Log file:

Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Operation pg_drbd-monitor-60s-0 is a duplicate of pg_drbd-monitor-60s
Apr 17 04:05:29 cs51 crmd: [5535]: info: do_state_transition: Starting PEngine Recheck Timer
Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Do not use the same (name, interval) combination more than once per resource
Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Operation pg_drbd-monitor-60s-0 is a duplicate of pg_drbd-monitor-60s
Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Do not use the same (name, interval) combination more than once per resource
Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Operation pg_drbd-monitor-60s-0 is a duplicate of pg_drbd-monitor-60s

Plus strange behavior of the cluster, like an inability to move resources from one node to another. -- Serge Dubrouski. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
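The feature in question is that (name, interval) pairs must be unique per resource, regardless of role. The customary workaround for role-specific monitors is to give them slightly different intervals:

```
primitive pg_drbd ocf:linbit:drbd \
    params drbd_resource=drbd0 \
    op monitor interval=59s role=Master timeout=10s \
    op monitor interval=60s role=Slave timeout=10s
```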
Re: [Linux-ha-dev] Dovecot OCF Resource Agent
On Fri, Apr 15, 2011 at 12:53 PM, Raoul Bhatia [IPAX] r.bha...@ipax.at wrote: On 04/15/2011 11:10 AM, jer...@intuxicated.org wrote: Yes, it does the same thing but contains some additional features, like logging into a mailbox. first of all, i do not know how the others think about an ocf ra implemented in c. i'll suggest waiting for comments from dejan or fghass. the ipv6addr agent was written in C too the OCF standard does not dictate the language to be used - it's really a matter of whether C is the best tool for this job you could then create a fork on github and make sure it integrates well with the current build environment. second, what do you think about extending this ra to be able to handle multiple email MDAs? deep probing routines would also be needed for other MDAs. i'm thinking about giving this ra a shot but would like to hear some comments on my first remark before doing so. thanks for your work! raoul -- DI (FH) Raoul Bhatia M.Sc. email. r.bha...@ipax.at Technischer Leiter IPAX - Aloy Bhatia Hava OG web. http://www.ipax.at Barawitzkagasse 10/2/2/11 email. off...@ipax.at 1190 Wien tel. +43 1 3670030 FN 277995t HG Wien fax. +43 1 3670030 15 ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] Resource agent implementing SPC-3 Persistent Reservations (contribution from Evgeny Nifontov)
Awesome. I was wondering if someone would ever write one of these :) On Tue, Apr 12, 2011 at 10:29 AM, Florian Haas florian.h...@linbit.com wrote: Hi everyone, Evgeny Nifontov has started to implement sg_persist, a resource agent managing SPC-3 Persistent Reservations (PRs) using the sg_persist binary. He's put up a personal repo on Github and the initial commit is here: https://github.com/nif/ClusterLabs__resource-agents/commit/d0c46fb35338d28de3e2c20c11d0ad01dded13fd I've added some comments for an initial review. Everyone interested please pitch in. Thanks to Evgeny for the contribution! Cheers, Florian ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] new resource agents repository commit policy
On Mon, Mar 14, 2011 at 6:07 PM, Dejan Muhamedagic de...@suse.de wrote: Hello everybody, It's time to figure out how to maintain the new Resource Agents repository. Fabio and I already discussed this a bit in IRC. There are two options: a) everybody gets an account at github.com and commit rights, where "everybody" means all people who had commit rights to the linux-ha.org and rgmanager agents repositories. b) several maintainers have commit rights and everybody else sends patches to a ML; then one of the maintainers does a review and commits the patch (or pulls it from the author's repository). I suspect you want b) with maybe 6 people for redundancy. The pull request workflow should be well suited to a project like this and impose minimal overhead. The ability to comment on patches in-line before merging them should be pretty handy. You're also welcome to put a copy at http://www.clusterlabs.org/git/ It's pretty easy to keep the two repos in sync; for example, I have this in .git/config for matahari:

[remote "origin"]
    fetch = +refs/heads/*:refs/remotes/origin/*
    url = g...@github.com:matahari/matahari.git
    pushurl = g...@github.com:matahari/matahari.git
    pushurl = ssh://beek...@git.fedorahosted.org/git/matahari.git

git push then sends to both locations. Option a) incurs a bit less overhead and that's how our old repositories worked. Option b) gives, at least nominally, more control to the select group of maintainers, but also places even more burden on them. We are open to either of these. Cheers, Fabio and Dejan ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
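The same dual-push arrangement can also be built with the git CLI instead of editing .git/config by hand; the repository URLs below are made up for illustration:

```shell
# Work in a throwaway directory so nothing real is touched.
cd "$(mktemp -d)"
git init -q repo && cd repo
git remote add origin git@github.com:example/project.git

# The first set-url --push replaces the implicit push URL; --add appends
# a second one, so every "git push" goes to both.
git remote set-url --push origin git@github.com:example/project.git
git remote set-url --add --push origin ssh://user@git.example.org/git/project.git

# Lists both push URLs, one per line:
git remote get-url --push --all origin
```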
Re: [Linux-ha-dev] new resource agents repository
On Thu, Feb 24, 2011 at 4:10 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: On Thu, Feb 24, 2011 at 03:56:27PM +0100, Andrew Beekhof wrote: On Thu, Feb 24, 2011 at 2:59 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: Hello, There is a new repository for Resource Agents which contains RA sets from both the Linux HA and Red Hat projects: git://github.com/ClusterLabs/resource-agents.git The purpose of the common repository is to share maintenance load and try to consolidate resource agents. There were no conflicts with the rgmanager RA set and both source layouts remain the same. It is only the autoconf bits that were merged. The only difference is that if you want to get the Linux HA set of resource agents installed, configure should be run like this: configure --with-ras-set=linux-ha ... The new repository is git but the existing history is preserved. People used to Mercurial shouldn't have a hard time working with git. We need to retire the existing repository hg.linux-ha.org. Are there any objections or concerns that still need to be addressed? Might not hurt to leave it around - there might be various URLs that point there. Yes, it will definitely remain there. What I meant by retire is that the developers then start using the git repository exclusively. Yes, and making it read-only on the server is probably a good idea (to avoid pushes). ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] New ocft config file for IBM db2 resource agent
On Tue, Feb 15, 2011 at 10:50 AM, Dejan Muhamedagic deja...@fastmail.fm wrote: Hi Holger, On Tue, Feb 15, 2011 at 09:49:07AM +0100, Holger Teutsch wrote: Hi, please find enclosed an ocft config for db2, for review and inclusion into the project if appropriate. Wonderful! This is the first time somebody contributed an ocft testcase. Looks like lmb owes somebody lunch :-) The current 1.0.4 agent passes the tests 8-). I've never doubted that either. Cheers, Dejan Regards Holger

# db2
#
# This test assumes a db2 ESE instance with two partitions and a database.
# Default is instance=db2inst1, database=ocft
# adapt this in set_testenv below
#
# Simple steps to generate a test environment (if you don't have one):
#
# A virtual machine with 1200MB RAM is sufficient
#
# - download an eval version of DB2 server from IBM
# - create an user db2inst1 in group db2inst1
#
# As root
# - install DB2 software in some location
# - create instance
#   cd this_location/instance
#   ./db2icrt -s ese -u db2inst1 db2inst1
# - adapt profile of db2inst1 as instructed by db2icrt
#
# As db2inst1
#
# allow to run with small memory footprint
# db2set DB2_FCM_SETTINGS=FCM_MAXIMIZE_SET_SIZE:FALSE
# db2start
# db2start dbpartitionnum 1 add dbpartitionnum hostname $(uname -n) port 1 without tablespaces
# db2stop
# db2start
# db2 create database ocft
# Done
# In order to install a real cluster refer to http://www.linux-ha.org/wiki/db2_(resource_agent)

CONFIG
    HangTimeout 40

SETUP-AGENT
    # nothing

CASE-BLOCK set_testenv
    Var OCFT_instance=db2inst1
    Var OCFT_db=ocft

CASE-BLOCK crm_setting
    Var OCF_RESKEY_instance=$OCFT_instance
    Var OCF_RESKEY_CRM_meta_timeout=3

CASE-BLOCK default_status
    AgentRun stop

CASE-BLOCK prepare
    Include set_testenv
    Include crm_setting
    Include default_status

CASE "check base env"
    Include prepare
    AgentRun start OCF_SUCCESS

CASE "check base env: invalid 'OCF_RESKEY_instance'"
    Include prepare
    Var OCF_RESKEY_instance=no_such
    AgentRun start OCF_ERR_INSTALLED

CASE "invalid instance config"
    Include prepare
    Bash eval mv ~$OCFT_instance/sqllib ~$OCFT_instance/sqllib-
    BashAtExit eval mv ~$OCFT_instance/sqllib- ~$OCFT_instance/sqllib
    AgentRun start OCF_ERR_INSTALLED

CASE "unimplemented command"
    Include prepare
    AgentRun no_cmd OCF_ERR_UNIMPLEMENTED

CASE "normal start"
    Include prepare
    AgentRun start OCF_SUCCESS

CASE "normal stop"
    Include prepare
    AgentRun start
    AgentRun stop OCF_SUCCESS

CASE "double start"
    Include prepare
    AgentRun start
    AgentRun start OCF_SUCCESS

CASE "double stop"
    Include prepare
    AgentRun stop OCF_SUCCESS

CASE "started: monitor"
    Include prepare
    AgentRun start
    AgentRun monitor OCF_SUCCESS

CASE "not started: monitor"
    Include prepare
    AgentRun monitor OCF_NOT_RUNNING

CASE "killed instance: monitor"
    Include prepare
    AgentRun start OCF_SUCCESS
    AgentRun monitor OCF_SUCCESS
    BashAtExit rm /tmp/ocft-helper1
    Bash echo "su $OCFT_instance -c '. ~$OCFT_instance/sqllib/db2profile; db2nkill 0 >/dev/null 2>&1'" >/tmp/ocft-helper1
    Bash sh -x /tmp/ocft-helper1
    AgentRun monitor OCF_NOT_RUNNING

CASE "overload param instance by admin"
    Include prepare
    Var OCF_RESKEY_instance=no_such
    Var OCF_RESKEY_admin=$OCFT_instance
    AgentRun start OCF_SUCCESS

CASE "check start really activates db"
    Include prepare
    AgentRun start OCF_SUCCESS
    BashAtExit rm /tmp/ocft-helper2
    Bash echo "su $OCFT_instance -c '. ~$OCFT_instance/sqllib/db2profile; db2 get snapshot for database on $OCFT_db >/dev/null'" >/tmp/ocft-helper2
    Bash sh -x /tmp/ocft-helper2

CASE "multipartition test"
    Include prepare
    AgentRun start OCF_SUCCESS
    AgentRun monitor OCF_SUCCESS
    # start does not start partition 1
    Var OCF_RESKEY_dbpartitionnum=1
    AgentRun monitor OCF_NOT_RUNNING
    # now start 1
    AgentRun start OCF_SUCCESS
    AgentRun monitor OCF_SUCCESS
    # now stop 1
    AgentRun stop OCF_SUCCESS
    AgentRun monitor OCF_NOT_RUNNING
    # does not affect 0
    Var OCF_RESKEY_dbpartitionnum=0
    AgentRun monitor OCF_SUCCESS

# fault injection does not work on the 1.0.4 client due to a hardcoded path
CASE "simulate hanging db2stop (not meaningful for 1.0.4 agent)"
    Include prepare
    AgentRun start OCF_SUCCESS
    Bash [ ! -f /usr/local/bin/db2stop ]
    BashAtExit rm /usr/local/bin/db2stop
    Bash echo -e "#!/bin/sh\necho fake db2stop\nsleep 1" >/usr/local/bin/db2stop
    Bash chmod +x /usr/local/bin/db2stop
    AgentRun stop OCF_SUCCESS
#
Re: [Linux-ha-dev] [PATCH] manage PostgreSQL 9.0 streaming replication using Master/Slave
On Mon, Feb 14, 2011 at 8:46 PM, Serge Dubrouski serge...@gmail.com wrote: On Mon, Feb 14, 2011 at 1:28 AM, Takatoshi MATSUO matsuo@gmail.com wrote: Ideally the demote operation should stop a master node and then restart it in hot-standby mode. It's up to the administrator to make sure that no node with outdated data gets promoted to the master role. One should follow standard procedures: cluster software shouldn't be configured for autostart at boot time, and the administrator has to make sure that data was refreshed if the node was down for some prolonged time. Hmm.. Do you mean that the RA puts recovery.conf in place automatically at demote, to start hot standby? Please give me some time to think it over. Sorry, I got the wrong idea about restoring data. Starting as hot-standby needs a restore every time, because the Time-line ID of PostgreSQL is incremented. In addition, shutting down PostgreSQL with the immediate option causes inconsistent WAL between primary and hot-standby. So I think it's difficult to start the slave automatically at demote. Still, do you think it's better to implement restoring? I'm afraid it's not just better, but a must. We have to play by Pacemaker's rules and that means that we have to properly implement the demote operation, and that means switching from Master to Slave, not just stopping the Master. I do appreciate your efforts, but the implementation has to conform to Pacemaker standards, i.e. the Master has to start where it's configured in Pacemaker, not just where a recovery.conf file exists. That's the ideal at least. Most of the time it should be possible to self-promote and let pacemaker figure out the result. But I can easily imagine there would also be situations where this is going to blow up in your face. The administrator has to be able to easily switch between node roles and so on. I still need some more time to learn PostgreSQL data replication and do some tests. Let's think about whether it's possible to implement a real Master/Slave in the Pacemaker sense of things.
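For context: a 9.0-era PostgreSQL starts as a hot standby when a recovery.conf is present in its data directory, which is what a demote-to-slave implementation would have to write. A minimal sketch with made-up connection values:

```
# $PGDATA/recovery.conf -- its presence makes PostgreSQL 9.0 start in
# standby mode; PostgreSQL renames it to recovery.done on promotion.
standby_mode = 'on'
primary_conninfo = 'host=192.168.1.1 port=5432 user=replicator'
trigger_file = '/tmp/promote_trigger'
```

Writing this file alone does not address the timeline/WAL-consistency problems Takatoshi describes; a fresh base backup from the new master may still be required before the standby can start.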
___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/ -- Serge Dubrouski.
Re: [Linux-ha-dev] ocft: status vs. monitor
On Sun, Feb 13, 2011 at 11:01 AM, Holger Teutsch holger.teut...@web.de wrote:

Hi, to my knowledge OCF *requires* a monitor method, while status is optional (or what is it really for? heritage, compatibility, ...). Shouldn't the ocft configs check monitor instead of status?

Yes, unless it's trying to talk to an LSB resource.

-holger

diff -r 722c8a7a03e9 tools/ocft/apache
--- a/tools/ocft/apache	Fri Feb 11 18:49:09 2011 +0100
+++ b/tools/ocft/apache	Sun Feb 13 10:57:50 2011 +0100
@@ -52,14 +52,14 @@
 	Include prepare
 	AgentRun stop OCF_SUCCESS

-CASE running status
+CASE running monitor
 	Include prepare
 	AgentRun start
-	AgentRun status OCF_SUCCESS
+	AgentRun monitor OCF_SUCCESS

-CASE not running status
+CASE not running monitor
 	Include prepare
-	AgentRun status OCF_NOT_RUNNING
+	AgentRun monitor OCF_NOT_RUNNING

 CASE unimplemented command
 	Include prepare

diff -r 722c8a7a03e9 tools/ocft/mysql
--- a/tools/ocft/mysql	Fri Feb 11 18:49:09 2011 +0100
+++ b/tools/ocft/mysql	Sun Feb 13 10:57:50 2011 +0100
@@ -46,14 +46,14 @@
 	Include prepare
 	AgentRun stop OCF_SUCCESS

-CASE running status
+CASE running monitor
 	Include prepare
 	AgentRun start
-	AgentRun status OCF_SUCCESS
+	AgentRun monitor OCF_SUCCESS

-CASE not running status
+CASE not running monitor
 	Include prepare
-	AgentRun status OCF_NOT_RUNNING
+	AgentRun monitor OCF_NOT_RUNNING

 CASE check lib file
 	Include prepare
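The convention the patch enforces can be sketched as follows. This is not an excerpt from any shipped agent: the agent name, pidfile path, and process check are illustrative assumptions; only the monitor/status semantics and the OCF return codes follow the spec.

```shell
#!/bin/sh
# Sketch of the monitor-vs-status convention: OCF mandates "monitor";
# "status" survives in many agents only as an LSB-compatibility alias.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

PIDFILE="${PIDFILE:-/tmp/myagent.pid}"

myagent_monitor() {
    # Running only if the pidfile exists and the process is still alive
    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

rm -f "$PIDFILE"          # demonstrate the "not running" path
myagent_monitor
echo "monitor rc=$?"      # 7 == OCF_NOT_RUNNING
```

In a full agent, the `case "$1"` dispatcher would route both `monitor` and `status` to the same function, which is exactly why ocft test cases should exercise `monitor`.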
Re: [Linux-ha-dev] New master/slave resource agent for DB2 databases in HADR (High Availability Disaster Recovery) mode
On Wed, Feb 9, 2011 at 12:15 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:
On Wed, Feb 09, 2011 at 12:06:04PM +0100, Florian Haas wrote:
On 2011-02-09 11:56, Dejan Muhamedagic wrote:

It is plugin-compatible with the old version of the agent.

Great! Unfortunately, we can't replace the old db2 now; the number of changes is very large:

db2 | 1076 +++-
1 file changed, 687 insertions(+), 389 deletions(-)

And the code is completely new (though I have no doubt that it is of excellent quality). So, I'd suggest adding this as another db2 RA. Once it gets some field testing we can mark the old one as deprecated. What name would you suggest? db2db2?

Just making sure: Is that a joke?

A bit of a joke, yes. But the alternatives such as db22 or db2new looked a bit boring.

I think boring is the least of our problems with those names. Are you going to change the name of every agent that gets a rewrite? IPaddr2-ng-ng-again-and-one-more-plus-one

Solicit feedback, like was done for kliend's new agent, and replace the existing one if/when people respond positively. That's the ideal at least. It's not like the old one disappears from the face of the earth after you merge the new one:

wget -O /usr/lib/ocf/resource.d/heartbeat/db2 http://hg.linux-ha.org/agents/file/agents-1.0.3/heartbeat/db2

HADR is a very different beast from a non-HADR db, right? Why not then add an hadr boolean parameter and use that instead of checking whether the resource has been configured as multi-state?

I'll take responsibility for suggesting the use of ocf_is_ms(), and I'd be curious to find out what you think is wrong with that approach.

There's nothing wrong in the sense of whether it is going to work. But someday db2 may sport, say, HADR2 or VHA or whatever else that may run as an ms resource. I just think it's better to make it obvious in the configuration that the user runs HADR. Does that make sense? Because if anything is, then the mysql RA needs fixing too.

No idea what's up with mysql.
Cheers, Dejan
Re: [Linux-ha-dev] New master/slave resource agent for DB2 databases in HADR (High Availability Disaster Recovery) mode
On Wed, Feb 9, 2011 at 2:17 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:

Hi Andrew,

On Wed, Feb 09, 2011 at 01:33:03PM +0100, Andrew Beekhof wrote:
On Wed, Feb 9, 2011 at 12:15 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:
On Wed, Feb 09, 2011 at 12:06:04PM +0100, Florian Haas wrote:
On 2011-02-09 11:56, Dejan Muhamedagic wrote:

It is plugin-compatible with the old version of the agent.

Great! Unfortunately, we can't replace the old db2 now; the number of changes is very large:

db2 | 1076 +++-
1 file changed, 687 insertions(+), 389 deletions(-)

And the code is completely new (though I have no doubt that it is of excellent quality). So, I'd suggest adding this as another db2 RA. Once it gets some field testing we can mark the old one as deprecated. What name would you suggest? db2db2?

Just making sure: Is that a joke?

A bit of a joke, yes. But the alternatives such as db22 or db2new looked a bit boring.

I think boring is the least of our problems with those names. Are you going to change the name of every agent that gets a rewrite? IPaddr2-ng-ng-again-and-one-more-plus-one

I don't think it is going to happen that often.

It happens often enough - it's just normally by a core developer. And realistically, almost every RA is going to get similar treatment (over time) as they're merged with the Red Hat ones.

Solicit feedback, like was done for kliend's new agent, and replace the existing one if/when people respond positively.

That would be for the best, but it takes time. We may opt for it, but I wanted to add this agent to the new release.

Understood - but I think the long-term pain that is created outweighs any perceived benefit in the short term.

Also, it is very seldom that people test anything which is not contained in the release. Unless there's no alternative, as was the case with conntrac.

It's not like the old one disappears from the face of the earth after you merge the new one.
wget -O /usr/lib/ocf/resource.d/heartbeat/db2 http://hg.linux-ha.org/agents/file/agents-1.0.3/heartbeat/db2

What do you suggest? That we add to the release announcement: "The db2 RA has been rewritten and hasn't yet had a lot of field testing. Please help test it."?

So don't do that :-) Put up a wiki page with instructions for how to download+use the new agent and give feedback. If the new version is significantly better, you're going to hear people pleading for its inclusion pretty soon.

But, if you want to keep the old agent, download the old one from the repository and use it instead of the new one. And don't forget to do the same when installing the next resource-agents release. At any rate, I wouldn't want to take responsibility for replacing the existing (and working) RA with completely new and not yet tested code. Call me a coward :)

I wouldn't either - which is why I keep saying test then replace :-) Another alternative: create a "testing" provider... not sure if it's a good idea or not, just putting it out there.

Finally, I expected that the new functionality was going to be added without many changes to the existing code. But it turned out to be a rewrite.

Cheers, Dejan

HADR is a very different beast from a non-HADR db, right? Why not then add an hadr boolean parameter and use that instead of checking whether the resource has been configured as multi-state?

I'll take responsibility for suggesting the use of ocf_is_ms(), and I'd be curious to find out what you think is wrong with that approach.

There's nothing wrong in the sense of whether it is going to work. But someday db2 may sport, say, HADR2 or VHA or whatever else that may run as an ms resource. I just think it's better to make it obvious in the configuration that the user runs HADR. Does that make sense? Because if anything is, then the mysql RA needs fixing too.

No idea what's up with mysql.
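The two approaches being debated can be sketched side by side. The first function approximates what ocf_is_ms() in ocf-shellfuncs does (infer master/slave mode from the clone meta attributes Pacemaker exports); the second is Dejan's proposed explicit parameter, which does not exist in the shipped agent — both bodies here are illustrative:

```shell
#!/bin/sh
# Approximation of ocf_is_ms(): a resource is running as master/slave when
# Pacemaker exports a positive master-max meta attribute to the agent.
is_ms() {
    [ -n "$OCF_RESKEY_CRM_meta_master_max" ] &&
    [ "$OCF_RESKEY_CRM_meta_master_max" -gt 0 ]
}

# Explicit-parameter alternative: the user declares HADR in the resource
# configuration instead of the agent inferring it from clone metadata.
hadr_enabled() {
    [ "${OCF_RESKEY_hadr:-false}" = "true" ]
}

# Simulate the environment Pacemaker would set for an ms resource
OCF_RESKEY_CRM_meta_master_max=1
if is_ms; then echo "running as a master/slave resource"; fi
```

The trade-off in the thread is exactly this: inference keeps the configuration short, while the explicit flag makes the operating mode visible in the CIB.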
Cheers, Dejan
Re: [Linux-ha-dev] New master/slave resource agent for DB2 databases in HADR (High Availability Disaster Recovery) mode
On Wed, Feb 9, 2011 at 3:35 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote:
On Wed, Feb 09, 2011 at 02:43:17PM +0100, Andrew Beekhof wrote:

Are you going to change the name of every agent that gets a rewrite? IPaddr2-ng-ng-again-and-one-more-plus-one

I don't think it is going to happen that often.

It happens often enough - it's just normally by a core developer. And realistically, almost every RA is going to get similar treatment (over time) as they're merged with the Red Hat ones.

Solicit feedback, like was done for kliend's new agent, and replace the existing one if/when people respond positively. It's not like the old one disappears from the face of the earth after you merge the new one:

wget -O /usr/lib/ocf/resource.d/heartbeat/db2 http://hg.linux-ha.org/agents/file/agents-1.0.3/heartbeat/db2

What do you suggest? That we add to the release announcement: "The db2 RA has been rewritten and hasn't yet had a lot of field testing. Please help test it."?

So don't do that :-) Put up a wiki page with instructions for how to download+use the new agent and give feedback.

How about a staging area? /usr/lib/ocf/resource.d/staging/

I was thinking along the same lines when I said "testing". Either name works for me :-)

We can also add a /usr/lib/ocf/resource.d/deprecated/. The thing in .../heartbeat/ can become a symlink, and be given config-file status by the package manager?

Something like that.

So we have it bundled with the release; it is readily available without much "go to that web page and download and save to there and make executable and then blah". It would simply pop up in the crm shell and DRBD-MC and so on. We can add a "please give feedback" note to the description, and a "this will replace the current RA with release + 2 unless we get veto-ing feedback" note to the release notes.
Once settled, we copy the staging one over to the real directory, replacing the original one, and add a "please fix your config" note to the thing that remains in staging/, so we will be able to start a further rewrite with the next merge window.

* does not break existing setups
* new RAs and rewrites are readily available

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
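The staging-plus-symlink scheme Lars proposes can be exercised in a scratch directory; the layout below stands in for /usr/lib/ocf/resource.d and the toy "agent" is a placeholder, since none of this was ever shipped:

```shell
#!/bin/sh
# Demonstrates the proposed layout: the rewrite ships in staging/, while
# the name tools look up in heartbeat/ becomes a symlink to it.
root=$(mktemp -d)
mkdir -p "$root/staging" "$root/heartbeat" "$root/deprecated"

printf '#!/bin/sh\necho new-db2\n' > "$root/staging/db2"
chmod +x "$root/staging/db2"

# heartbeat/db2 stays a valid agent path but resolves to the staged copy
ln -s ../staging/db2 "$root/heartbeat/db2"

out=$("$root/heartbeat/db2")
echo "$out"               # prints: new-db2
rm -rf "$root"
```

Flipping the symlink (or replacing it with the real file, as the package manager's "config file status" would allow) is what makes the eventual promotion from staging/ non-disruptive.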
Re: [Linux-ha-dev] Patch: fix wrong variable names in Dummy agent
Applied to Pacemaker too. Thanks!

On Wed, Jan 19, 2011 at 6:45 PM, Holger Teutsch holger.teut...@web.de wrote:

Hi, small fix.

-holger

# HG changeset patch
# User Holger Teutsch holger.teut...@web.de
# Date 1295458942 -3600
# Node ID f9bb7dc26c80aaae2711a1b66e1af7a92d33bbc6
# Parent 2b5603283560ca1c895d610a85155ddde198019e
Low: Dummy: migrate_from/to: correct OCF_RESKEY_CRM_meta_migrate_xxx variable names

diff -r 2b5603283560 -r f9bb7dc26c80 heartbeat/Dummy
--- a/heartbeat/Dummy	Tue Jan 18 18:01:33 2011 +0100
+++ b/heartbeat/Dummy	Wed Jan 19 18:42:22 2011 +0100
@@ -143,10 +143,10 @@
 start)		dummy_start;;
 stop)		dummy_stop;;
 monitor)	dummy_monitor;;
-migrate_to)	ocf_log info "Migrating ${OCF_RESOURCE_INSTANCE} to ${OCF_RESKEY_CRM_meta_migrate_to}."
+migrate_to)	ocf_log info "Migrating ${OCF_RESOURCE_INSTANCE} to ${OCF_RESKEY_CRM_meta_migrate_target}."
 	dummy_stop
 	;;
-migrate_from)	ocf_log info "Migrating ${OCF_RESOURCE_INSTANCE} to ${OCF_RESKEY_CRM_meta_migrated_from}."
+migrate_from)	ocf_log info "Migrating ${OCF_RESOURCE_INSTANCE} from ${OCF_RESKEY_CRM_meta_migrate_source}."
 	dummy_start
 	;;
 reload)	ocf_log err "Reloading..."
Re: [Linux-ha-dev] Antwort: Re: Antwort: Re: OCF RA dev guide: final heads up
On Mon, Dec 13, 2010 at 4:32 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:

Hi,

On Fri, Dec 10, 2010 at 01:48:26PM +0100, Florian Haas wrote:
On 2010-12-10 13:42, alexander.kra...@basf.com wrote:

So, the best thing would be, as you already said: remove it from the environment. I could just save your time answering stupid questions.

Seconded.

@Florian: Isn't OCF_CHECK_LEVEL also missing from the guide? And thank you very much for section 9.4 (fits my questions from yesterday) :-)

OCF_CHECK_LEVEL is such a terrible abomination that I refuse to write about it. Not until lmb has written his updated OCF spec, we've discussed and approved of it, and it's _still_ in there (which I doubt).

While we're at it... Andrew, could you pass the OCF_RESKEY_CRM_meta_depth variable? Then we can update the resource agents and the documentation.

You mean create one and pass it? No such thing currently exists.
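For readers who have not met the abomination in question: OCF_CHECK_LEVEL (the "depth" concept behind the OCF_RESKEY_CRM_meta_depth request above) selects how deep a monitor probe goes. A hedged sketch of the usual branching — the level values follow common RA practice such as the Filesystem agent, but the check bodies here are made-up placeholders:

```shell
#!/bin/sh
# Illustrative only: deeper levels cost more but catch more failure modes.
monitor_depth() {
    case "${OCF_CHECK_LEVEL:-0}" in
        0)  echo "process check" ;;      # cheapest: is the service running?
        10) echo "read-only probe" ;;    # e.g. read a status file
        20) echo "read-write probe" ;;   # e.g. write a token and re-read it
        *)  echo "unknown level" ;;
    esac
}

OCF_CHECK_LEVEL=10
monitor_depth      # prints: read-only probe
```

The debate in the thread is precisely that Pacemaker would need to export a depth meta attribute for agents to receive anything other than the default level.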
Re: [Linux-ha-dev] OCF RA dev guide: final heads up
On Fri, Dec 10, 2010 at 12:06 PM, Florian Haas florian.h...@linbit.com wrote:
On 2010-12-08 18:15, alexander.kra...@basf.com wrote:

Hi Florian,

Section 5.10: The variables are missing a "notify". It is OCF_RESKEY_CRM_meta_notify_start_uname, not OCF_RESKEY_CRM_meta_start_uname.

Thanks! Fixed. http://people.linbit.com/~florian/ra-dev-guide/_literal_notify_literal_action.html

There is also the same set of variables that end in _resource. I'll leave those out for now, as those have never been of any practical relevance to me. If you're actually using them in an agent, please do let me know.

Section 6.2: I think this statement: "should never be changed by a resource agent" is in conflict with Section 4.3.

No it's not. 4.3 says you can override it _from the command line_; 6.2 says the resource agent should not modify it.

Section 8.4: Statement: "Stateful (master/slave) resources may influence their own master preference". IMHO they _must_ influence their own master preference. If not, they will never be promoted.

Correct. Fixed. http://people.linbit.com/~florian/ra-dev-guide/_specifying_a_master_preference.html

A note that 'crm_mon -A' shows the current values might also be very helpful.

Nope. I'll try not to talk about Pacemaker-specific binaries too much.

No section: Is there a reason why the environment variable 'OCF_RESKEY_CRM_meta_role', which is set in the monitor action, isn't mentioned anywhere?

Make a good case for it to be explained, and convince me that it won't just serve to confuse everybody, and I'll include it. My best guess, however, is that once I do include it, we'll see a lot of

monitor() {
    if [ $OCF_RESKEY_CRM_meta_role = Master ]; then
        return $OCF_RUNNING_MASTER
    fi
    ...
}

... and that's clearly nonsense.

And a good way to ensure I strip it from the environment :-)
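The crm_master command asked about above is Pacemaker's wrapper for setting a node's promotion score. A hedged sketch of the usual pattern in an agent — the commands are echoed rather than executed so the shape is visible without a running cluster, and the score 100 is purely illustrative:

```shell
#!/bin/sh
# Sketch only: a real agent would invoke crm_master directly.
set_master_score() {
    # a higher score makes this node preferred for promotion
    echo crm_master -l reboot -v "$1"
}
clear_master_score() {
    # drop the preference, e.g. when replication to this node is broken
    echo crm_master -l reboot -D
}

set_master_score 100
clear_master_score
```

Agents typically call the setter from monitor or notify, so the score tracks the node's actual fitness to be promoted.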
Re: [Linux-ha-dev] LF #1943: alternative implementation of an apache resource agent
On Thu, Dec 9, 2010 at 9:12 AM, Florian Haas florian.h...@linbit.com wrote:
On 2010-12-07 22:08, Andrew Beekhof wrote:
On Tue, Dec 7, 2010 at 4:17 PM, Florian Haas florian.h...@linbit.com wrote:

Folks, shamefully we've had a resource agent contribution sitting around in the LF bugzilla for almost 2 1/2 years without acting on it. It's an alternative implementation of an Apache httpd resource agent: http://developerbugs.linux-foundation.org/show_bug.cgi?id=1943 (I just renamed the bug; previously it was something totally misleading about Debian lenny.) Any objections to getting this RA into decent late-2010 shape and adding it to the repo as ocf:heartbeat:apache2?

Please, not a repeat of ipaddr. Can we not replace the old one and include some compatibility code (that we can remove eventually)?

I believe you sat across the table from me when we discussed resource agent deprecation? Both apache and IPaddr are storybook examples for that.

That's fine, but you didn't actually say that you planned for the existing one to be deprecated. Here's a thought: why don't we hold off a fraction longer and put this one in the new combined namespace, and thus avoid needing the "2" suffix forevermore.

I really have no intention of sanitizing ocf:heartbeat:apache and then leaving things like two parameters named testregex and testregex10 around for compatibility reasons. No sir, I do not.

Florian
Re: [Linux-ha-dev] LF #1943: alternative implementation of an apache resource agent
On Thu, Dec 9, 2010 at 7:28 PM, Florian Haas florian.h...@linbit.com wrote:
On 12/09/2010 10:45 AM, Andrew Beekhof wrote:
On Thu, Dec 9, 2010 at 9:12 AM, Florian Haas florian.h...@linbit.com wrote:
On 2010-12-07 22:08, Andrew Beekhof wrote:
On Tue, Dec 7, 2010 at 4:17 PM, Florian Haas florian.h...@linbit.com wrote:

Folks, shamefully we've had a resource agent contribution sitting around in the LF bugzilla for almost 2 1/2 years without acting on it. It's an alternative implementation of an Apache httpd resource agent: http://developerbugs.linux-foundation.org/show_bug.cgi?id=1943 (I just renamed the bug; previously it was something totally misleading about Debian lenny.) Any objections to getting this RA into decent late-2010 shape and adding it to the repo as ocf:heartbeat:apache2?

Please, not a repeat of ipaddr. Can we not replace the old one and include some compatibility code (that we can remove eventually)?

I believe you sat across the table from me when we discussed resource agent deprecation? Both apache and IPaddr are storybook examples for that.

That's fine, but you didn't actually say that you planned for the existing one to be deprecated.

Oh sorry, I thought the implication was obvious. :)

Here's a thought: why don't we hold off a fraction longer and put this one in the new combined namespace, and thus avoid needing the "2" suffix forevermore.

Hm. Well, I think I'll grudgingly agree. Since we've already let this sit on the shelf for two years, I guess two more months or so doesn't make much difference...

Yeah, might be something we want to think about creating a strategy for, though. Not every day will we be opening up a new namespace.
Re: [Linux-ha-dev] LF #1943: alternative implementation of an apache resource agent
On Tue, Dec 7, 2010 at 4:17 PM, Florian Haas florian.h...@linbit.com wrote:

Folks, shamefully we've had a resource agent contribution sitting around in the LF bugzilla for almost 2 1/2 years without acting on it. It's an alternative implementation of an Apache httpd resource agent: http://developerbugs.linux-foundation.org/show_bug.cgi?id=1943 (I just renamed the bug; previously it was something totally misleading about Debian lenny.) Any objections to getting this RA into decent late-2010 shape and adding it to the repo as ocf:heartbeat:apache2?

Please, not a repeat of ipaddr. Can we not replace the old one and include some compatibility code (that we can remove eventually)?

Cheers, Florian
Re: [Linux-ha-dev] OCF Resource Agent Developer's Guide (Draft)
On Mon, Nov 22, 2010 at 2:40 PM, Florian Haas florian.h...@linbit.com wrote:
On 2010-11-22 11:02, alexander.kra...@basf.com wrote:

Hi Florian, I read through your guide briefly. A very good aggregation and definitely a good starting point for beginning one's own development. Probably you could go into more detail about Master/Slave resources? Concretely, I am missing something about the usage of the crm_master command.

Thanks for the tip. Does this work for you? http://people.linbit.com/~florian/ra-dev-guide/_special_considerations.html#_specifying_a_master_preference

And also something about the monitor function in an M/S case. E.g. should an M/S resource return OCF_NOT_RUNNING or OCF_FAILED_MASTER if there are no more processes of the resource on the node?

Well, to be honest, I don't know how $OCF_NOT_RUNNING and $OCF_FAILED_MASTER are _expected_ to be handled differently, as master/slave resources never made it back into the OCF spec. Maybe Andrew can shed some light on this: is $OCF_FAILED_MASTER expected to mean "some failure occurred while the resource was running in master mode", or "some problem occurred so that the resource is no longer running in master mode while it is expected to, but it is otherwise running normally"?

If only it were documented somewhere like: http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-ocf-return-codes.html

In the former case, I'd expect demote-stop-start-promote as the recovery action; in the latter, just demote-promote. Or am I completely off the mark?

How should the agent remember its last status?

It really shouldn't. That's the cluster manager's job. All the agent has to deliver is the _current_ status, by means of the monitor action.
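The return-code values being debated are fixed by Pacemaker's documentation (OCF_SUCCESS=0, OCF_NOT_RUNNING=7, OCF_RUNNING_MASTER=8, OCF_FAILED_MASTER=9); the little classifier around them below is purely illustrative, not code from any agent:

```shell
#!/bin/sh
# Constants per the Pacemaker OCF return-code table; helper is a sketch.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7
OCF_RUNNING_MASTER=8
OCF_FAILED_MASTER=9

# Hypothetical classification: distinguish "no process at all" from
# "running as master but unhealthy" from "healthy master/slave".
classify_monitor() {
    running=$1; is_master=$2; healthy=$3
    if [ "$running" = no ]; then
        echo $OCF_NOT_RUNNING
    elif [ "$is_master" = yes ] && [ "$healthy" = no ]; then
        echo $OCF_FAILED_MASTER
    elif [ "$is_master" = yes ]; then
        echo $OCF_RUNNING_MASTER
    else
        echo $OCF_SUCCESS
    fi
}

classify_monitor no  no  no    # prints 7
classify_monitor yes yes yes   # prints 8
```

Under this reading, Florian's question boils down to which of these branches a dead master should land in, and hence which recovery sequence Pacemaker schedules.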
Cheers, Florian
Re: [Linux-ha-dev] a scalable membership and LRM proxy proposal
On Tue, Nov 16, 2010 at 4:23 PM, Alan Robertson al...@unix.sh wrote:

Thanks for this information. This is _SOOO_ much better than trying to dig it all out of the web site.

On 11/16/2010 03:04 AM, Andrew Beekhof wrote:
Alan Robertson wrote:

I was hoping for something a little more lightweight - although I clearly understand the benefits of "it already exists" and having some credible claims to security as a goal (since nothing is ever secure).

I wonder if you really want that kind of very strongly guaranteed message delivery

Not always, possibly not ever. But happily this is configurable, so we'd only ask for those guarantees if we needed them.

That's good to know. Is this an Apache extension, or is this part of the standard?

Not sure.

Does Qpid support IPv6?

RH requires everything we ship to support v6, so I'd be highly surprised if it didn't.

- since messages sent to a node that crashes before receiving them are delivered after it comes back up. But, of course, there's always a way to work around things that don't do what you need them to. Presumably you'd also need to clean those messages out of the queues of all senders if the node is going away permanently - at least once you figure that out... Messages to clients seem to better match the semantics of RDS. Messages back to overlords could use AMQP without obvious corresponding issues.

I wonder about latency - particularly when federated - and taking garbage collection into account... I see that Qpid claims to be extremely fast. It probably is pretty fast for a large and complex Java program. Here are the numbers from their website:

"Red Hat MRG product built on Qpid has shown 760,000 msg/sec ingress on an 8 way box or 6,000,000 msg/sec"

Is there something missing from this sentence, or am I just dense? I'm guessing that this is intended to imply that it can process 760K msgs/sec per CPU, giving a projected 6M msgs/sec for an 8-way...
"Latencies have been recorded as low as 180-250us (.18ms-.3ms) for TCP round trip and 60-80us for RDMA round trip using the C++ broker"

For latencies, something more like 99th-percentile guarantees are a better measure than best-case latencies. And if it uses TCP, then the overhead of holding 10K TCP connections open at once seems a bit high - just to do nothing most of the time...

This model is different from the design point for this protocol. I expect that most of the time these connections would sit idle. You'd have to talk to the qpid/qmf teams about how the numbers were obtained; I just copied them verbatim off the website. Current indications from them are that they can easily handle the workloads we're planning for.

One of the cool things about the proposal I made is that the overlords incur near-zero ongoing overhead to monitor a very, very big network, and no network congestion. The work to do this monitoring is spread pretty evenly among all the nodes in the system, such that no node has to keep track of more than a handful of peers (most only have two peers - it looks like it could be bounded to 4 peers worst case). Ring-structured heartbeat communication looks like it should work out very well.

It's not that I'm against your proposal, I just don't know of enough resources to build, test and stabilize a new communication protocol. In that context, an off-the-shelf component that gives us a couple of orders of magnitude of additional scaling looks pretty attractive - and should provide some valuable feedback for how to take it to the next level.

Nevertheless, I see the attraction. Not sure it's what I want, but since I don't know yet quite what I want, that would be hard to say :-).

Yep, nothing forcing everyone down the same path.

Got that. I see advantages to having at least some common APIs/libraries/interfaces/something. Cross-pollination of ideas is good.
Sharing code and having alternatives is better - if not too expensive in code, organizational overhead and emotional energy.

Thanks for taking the time to share ideas and educate me.

NP.
Re: [Linux-ha-dev] [PATCH 06 of 10] cl_log: Always print the?common log entity to syslog messages
On Wed, Nov 17, 2010 at 9:03 AM, Lars Ellenberg lars.ellenb...@linbit.com wrote:
On Wed, Nov 17, 2010 at 08:21:03AM +0100, Andrew Beekhof wrote:
On Tue, Nov 16, 2010 at 2:27 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:

Hi,

On Mon, Nov 15, 2010 at 06:18:29PM +0100, Bernd Schubert wrote:
On Monday, November 15, 2010, Dejan Muhamedagic wrote:

This may truncate the entity, and of course breaks existing filtering setups that trigger on it.

Right. So, this needs to be optional.

Ok, any favourite option keyword in logd.cf? If you ask me, the present logd.cf *suggests* that not adding the common log entity is a bug:

quote
# Entity to be shown at beginning of a message
# for logging daemon
# Default: logd
entity logd
/quote

Was that option only for its own messages?

Yes, and the default in case the client didn't supply its entity.

Isn't that a bit over-due?

I guess you meant overkill? Perhaps. So maybe "entity none" should suppress it in the future?

We can't change the semantics. So, we need a new option name. extra_entity? common_entity?

FWIW, syslog_facility works the way Bernd proposed.

It does not, at least not if you don't have it exclusively.

Pretty sure it does; I recall adding the code, since there was no other way to turn syslog logging off from ha.cf. Has it changed since?

That has been his starting point.
Re: [Linux-ha-dev] a scalable membership and LRM proxy proposal
On Tue, Nov 16, 2010 at 7:06 AM, Alan Robertson al...@unix.sh wrote:

Hi, I missed the "through federation" part. Sorry... As a point of comparison, the proposal as described on my blog does not require federation. Probably at least as scalable, and it's very probable that it's lower latency - and it's pretty much dead certain that it's lower traffic on the network.

https://cwiki.apache.org/qpid/faq.html#FAQ-Heartbeats "Heartbeats are sent by the broker at a client specified, per-connection frequency"

I doubt that the granularity would be 1s for most use-cases.

Certainly there is scope to tune it to match your network.

I assume that QMF is the Qpid Management Framework found here? https://cwiki.apache.org/qpid/qpid-management-framework.html

Correct.

I was hoping for something a little more lightweight - although I clearly understand the benefits of "it already exists" and having some credible claims to security as a goal (since nothing is ever secure).

I wonder if you really want that kind of very strongly guaranteed message delivery

Not always, possibly not ever. But happily this is configurable, so we'd only ask for those guarantees if we needed them.

- since messages sent to a node that crashes before receiving them are delivered after it comes back up. But, of course, there's always a way to work around things that don't do what you need them to. Presumably you'd also need to clean those messages out of the queues of all senders if the node is going away permanently - at least once you figure that out... Messages to clients seem to better match the semantics of RDS. Messages back to overlords could use AMQP without obvious corresponding issues. I wonder about latency - particularly when federated - and taking garbage collection into account... I see that Qpid claims to be extremely fast. It probably is pretty fast for a large and complex Java program.
Here are the numbers from their website:

"Red Hat MRG product built on Qpid has shown 760,000 msg/sec ingress on an 8 way box or 6,000,000 msg/sec"

"Latencies have been recorded as low as 180-250us (.18ms-.3ms) for TCP round trip and 60-80us for RDMA round trip using the C++ broker"

Nevertheless, I see the attraction. Not sure it's what I want, but since I don't know yet quite what I want, that would be hard to say :-).

Yep, nothing forcing everyone down the same path.
Re: [Linux-ha-dev] [PATCH 06 of 10] cl_log: Always print the?common log entity to syslog messages
On Tue, Nov 16, 2010 at 2:27 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:

Hi,

On Mon, Nov 15, 2010 at 06:18:29PM +0100, Bernd Schubert wrote:
On Monday, November 15, 2010, Dejan Muhamedagic wrote:

This may truncate the entity, and of course breaks existing filtering setups that trigger on it.

Right. So, this needs to be optional.

Ok, any favourite option keyword in logd.cf? If you ask me, the present logd.cf *suggests* that not adding the common log entity is a bug:

quote
# Entity to be shown at beginning of a message
# for logging daemon
# Default: logd
entity logd
/quote

Was that option only for its own messages?

Yes, and the default in case the client didn't supply its entity.

Isn't that a bit over-due?

I guess you meant overkill? Perhaps. So maybe "entity none" should suppress it in the future?

We can't change the semantics. So, we need a new option name. extra_entity? common_entity?

FWIW, syslog_facility works the way Bernd proposed.
Re: [Linux-ha-dev] a scalable membership and LRM proxy proposal
Some of your thinking mirrors our own. What we're moving towards is indeed two tiers of membership. One being a small but fully meshed set of, to use your terminology, Overlords running a traditional cluster stack. The other being a much larger set of independent nodes or VMs running only an lrm-like proxy. Members of the second tier have no knowledge of each other's existence, nor even of the cluster itself. The transport layer we plan on using to talk to these nodes is QMF (which implements AMQP). QMF has the nice properties of being cross-platform (i.e. Windows), standards-based and something that already exists. We also know that it is secure, fast, and scales well through federation. Happily it also gives us node up/down information for free. As Lars mentioned, a Matahari agent (essentially the lrm with a QMF interface on top) is intended to act as the proxy. He also mentioned container resources, but this was a red herring. Whether the entities running Matahari are also guests being managed by Pacemaker is irrelevant. They can equally be physical machines or cloud instances. The Matahari and QMF pieces are both generically useful components with no ties to Pacemaker. There will still need to be integration work done to hook up the node liveness information and add the ability to send resource commands via the QMF bus. What form this work takes will depend on which parts of Pacemaker are being used in the overall architecture. On Thu, Nov 4, 2010 at 3:48 PM, Alan Robertson al...@unix.sh wrote: I've been thinking about the idea of very highly scalable membership, and also about the LRM proxy function which is currently being performed by Pacemaker. Towards this end I wrote up a high-level design (or architecture, or design philosophy, or something) for such a scalable membership/LRM proxy service. The design is not specific to working with Pacemaker - it could work with Pacemaker, or a number of other kinds of management entities. 
The kind of membership outlined here would be (in Pacemaker terms) sort of a second-class membership - which has advantages and disadvantages. The blog post can be found here: http://techthoughts.typepad.com/managing_computers/2010/10/big-clusters-scalable-membership-proposal.html Please feel free to comment on it on the blog, or on the mailing list. I've reproduced the blog posting below: Really Big Clusters: A Scalable membership proposal This blog entry is a bit different from previous entries - I'm proposing some enhanced capabilities to go with the LRM and friends from the Linux-HA project. I will update this entry on an ongoing basis to match my current thinking about this proposal. This post outlines a proposed server liveness (membership) design which is intended to scale up to tens of thousands of servers managed as a single entity. Scalability depends on a lot of factors - processor overhead, network bandwidth, and network load. A highly scalable system will take all of these factors into account. From the perspective of the server software author (like, for example, me), one of the easiest to overlook is network load. Network load depends on a number of factors - the number of packets, the size of the packets, how many switches or routers they have to go through, and how many endpoints will receive each packet. To best accomplish this task, it is desirable that the majority of normal traffic be network topology aware. To scale up to very large collections of computers, it is also necessary that as much as possible be monitored as locally as possible. In addition, since switching gear is not optimized for multicast packets, and multicast packets consume significant resources when compared to unicast packets, it is desirable to avoid using multicast packets during normal operation. 
The Basic Concept - network aware liveness Although the LRM in Linux-HA is not network-enabled, it tries to minimize monitoring overhead by distributing the task of monitoring resources to the machine providing the resource, and only reporting failures upstream. To extend this idea of local monitoring into system liveness monitoring, one might imagine a standard 48-port network switch with 48 servers attached to it. If one were to choose a server to act as proxy for monitoring the servers on that switch, then the other 47 nodes on the switch would send unicast heartbeats to that node. That node in turn would report failure-to-receive-heartbeat events from the other 47 nodes on the switch. To ensure detection of failures of the monitoring node itself, that node could send its heartbeat upstream to a process monitoring it. In order to ensure continual service, it may be desirable to have two monitoring nodes per switch, with each one also monitoring the other. This results in a 24-to-one reduction in traffic going off the switch, and a corresponding decrease in the workload to monitor these 48 servers. If one were to implement this in the context
Re: [Linux-ha-dev] incorrect diff of sysinfo.txt
On Wed, Sep 22, 2010 at 4:07 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: Hi, On Wed, Sep 22, 2010 at 12:50:53PM +0200, Raoul Bhatia [IPAX] wrote: running debian 5.0.6 and cluster-glue 1.0.6-1~bpo50+1, hb_report's analysis.txt shows: Diff sysinfo.txt... --- /root/move_ftp_group/wc01/sysinfo.txt 2010-09-22 12:17:21.0 +0200 +++ /root/move_ftp_group/wc02/sysinfo.txt 2010-09-22 12:17:21.0 +0200 @@ -2,7 +2,7 @@ cluster-glue: 1.0.6 (1c87a0c58c59fc384b93ec11476cefdbb6ddc1e1) resource-agents: # Build version: 5ae70412eec8099b25e352110596dd279d267a8a CRM Version: 1.0.9 (74392a28b7f31d7ddc86689598bd23114f58978b) - 1.0.9.1+hg15626-1~bpo50+1 1.2.1-1~bpo50+1 1.0.6-1~bpo50+1 1:3.0.3-2~bpo50+1 1:3.0.3-2~bpo50+1 2.02.39-8Platform: Linux + 1.0.9.1+hg15626-1~bpo50+1 1.2.1-1~bpo50+1 1.0.6-1~bpo50+1 1:3.0.3-2~bpo50+1 1:3.0.3-2~bpo50+1 2.02.39-8 Platform: Linux Kernel release: 2.6.27.54+ipax Architecture: x86_64 Distribution: Description: Debian GNU/Linux 5.0.6 (lenny) There is one space missing before Platform: Linux. I do not know where this difference comes from, but what about using diff -wu instead of diff -u in txtdiff()? We could try that, but it would be better to fix the actual output. It seems like part of the output of crmd version (perhaps others too) goes to stderr, which is then mixed randomly with the stdout material. Andrew: fprintf(stderr, CRM Version: ); fprintf(stdout, %s (%s)\n, VERSION, BUILD_VERSION); Is that intentional? I probably had some notion of allowing just the version to be captured in a shell variable. Happy if you want to change it to stdout instead.
Re: [Linux-ha-dev] About movement of the Quorum control.
On Fri, Aug 27, 2010 at 8:12 AM, renayama19661...@ybb.ne.jp wrote: Hi Andrew, Thank you for the comment. I doubt it, I think development on heartbeat is at an end and maintenance is limited to regressions. We understand that well enough. The biggest problem is that none of us actually understand the CCM. I looked at it once a long time ago and I've never been so frightened in my life. But there's nothing stopping someone at NTT becoming a CCM expert ;-) However, it is very difficult for us to wait for corosync to be stable. It's pretty close these days. It does lack ucast and bcast support, but there are plans to address that outside of corosync. But maybe lge would like to comment further. Well. Let's wait for more comment. Best Regards, --- Andrew Beekhof and...@beekhof.net wrote: On Wed, Aug 25, 2010 at 3:20 AM, renayama19661...@ybb.ne.jp wrote: Hi Developers of Heartbeat, When we combine Pacemaker with Heartbeat, we find that Quorum control does not work well. For example, it occurs in a cluster of multiple nodes whenever we set anything other than no-quorum-policy=ignore. We know that this is a long-standing, already-known problem. It occurs because of differences in the timing of detection when the nodes are partitioned. We think the problem lies in Heartbeat. * There may be a problem in CCM. * The reason is that the problem does not occur when Pacemaker is combined with corosync, which reports node partitions reliably. Many of our users are going to use the Quorum control in Heartbeat. Heartbeat has to notify Pacemaker of changes in node membership, as corosync does. Is there a plan for when Quorum control in Heartbeat will be made correct? I doubt it, I think development on heartbeat is at an end and maintenance is limited to regressions. But maybe lge would like to comment further. 
Re: [Linux-ha-dev] About movement of the Quorum control.
On Fri, Aug 27, 2010 at 4:04 PM, Lars Marowsky-Bree l...@novell.com wrote: On 2010-08-27T09:53:32, Andrew Beekhof and...@beekhof.net wrote: However, it is very difficult for us to wait for corosync to be stable. It's pretty close these days. It does lack ucast and bcast support, but there are plans to address that outside of corosync. The current corosync is really quite stable, certainly as stable as CCM. If it isn't, there's at least someone we can ask for active help. bcast is handled by corosync, by the way; and unicast support may also be coming. I'm curious what you mean by outside of corosync, though - how would the network protocol be addressed outside of corosync? http://www.linuxplumbersconf.org/2010/ocw/proposals/1065 Basically it's intended to be a generically useful layer sitting underneath corosync. Steve was skeptical at first but is apparently on board with the idea now - to the point where it's the preferred approach for providing ucast capabilities.
Re: [Linux-ha-dev] About movement of the Quorum control.
On Fri, Aug 27, 2010 at 4:04 PM, Lars Marowsky-Bree l...@novell.com wrote: bcast is handled by corosync, by the way; and unicast support may also be coming. Yeah, but it's not supported on RHEL, which means he's not actively testing it or fixing bugs. Your mileage may vary :-)
Re: [Linux-ha-dev] About movement of the Quorum control.
On Wed, Aug 25, 2010 at 3:20 AM, renayama19661...@ybb.ne.jp wrote: Hi Developers of Heartbeat, When we combine Pacemaker with Heartbeat, we find that Quorum control does not work well. For example, it occurs in a cluster of multiple nodes whenever we set anything other than no-quorum-policy=ignore. We know that this is a long-standing, already-known problem. It occurs because of differences in the timing of detection when the nodes are partitioned. We think the problem lies in Heartbeat. * There may be a problem in CCM. * The reason is that the problem does not occur when Pacemaker is combined with corosync, which reports node partitions reliably. Many of our users are going to use the Quorum control in Heartbeat. Heartbeat has to notify Pacemaker of changes in node membership, as corosync does. Is there a plan for when Quorum control in Heartbeat will be made correct? I doubt it, I think development on heartbeat is at an end and maintenance is limited to regressions. But maybe lge would like to comment further.
Re: [Linux-ha-dev] [PATCH] IPv6addr: removing libnet dependency
On Fri, Jul 30, 2010 at 11:42 AM, Keisuke MORI keisuke.mori...@gmail.com wrote: 2010/7/27 Andrew Beekhof and...@beekhof.net: On Tue, Jul 27, 2010 at 8:44 AM, Keisuke MORI keisuke.mori...@gmail.com wrote: For heartbeat, I personally like pacemaker on in ha.cf :) One thing that's coming in 1.1.3 is an mcp (master control process) and associated init script for pacemaker. This means that Pacemaker is started/stopped independently of the messaging layer. Currently this is only written for corosync[1], but I've been toying with the idea of extending it to Heartbeat. In which case, if you're already changing the option, you might want to make it: legacy on/off Where off would be the equivalent of starting with -M (no resource management) but wouldn't spawn any daemons. Thoughts? I have several concerns with that change: 1) Is it possible to recover or cause a fail-over correctly when any of the Pacemaker/Heartbeat processes fails? (In particular, for a failure of the new mcp process of pacemaker, and for a failure of the current heartbeat MCP process) If the MCP dies, so will the crmd and cib (and by extension, everything else except the PE and LRMd). These types of failures are well tested. Failure of heartbeat also results in the same types of secondary failures and recovery as we see now. 2) Would the daemons used with the respawn directive, such as hbagent (SNMP daemon) or pingd, keep working compatibly? Pingd will no longer exist in 1.1, if I've not removed it already. It is completely replaced by ocf:pacemaker:ping hbagent might have to be a bit smarter about what to do when the cib isn't around, but otherwise it shouldn't be a problem. 3) After all, what would be the benefit for end users with the change? For corosync-based clusters it's a clear win - we get a much more reliable startup/shutdown sequence. It's also far more obvious what is happening at each stage, and one can also stop all resources on a node without taking the node offline. 
If we changed it for heartbeat, it would be mostly for consistency. I feel like it's only adding some complexity to the operations and the diagnostics by the end users. I guess that I would only use legacy on on the heartbeat stack... Correct.
Re: [Linux-ha-dev] [PATCH] IPv6addr: removing libnet dependency
On Tue, Jul 27, 2010 at 8:44 AM, Keisuke MORI keisuke.mori...@gmail.com wrote: 2010/7/26 Lars Ellenberg lars.ellenb...@linbit.com: On Mon, Jul 26, 2010 at 06:39:50PM +0900, Keisuke MORI wrote: By the way, do we have any plan to release the next agents/glue/heartbeat packages from the Linux-HA project? I think it's a good time to consider them for the best use of pacemaker-1.0.9. I think glue was released by dejan just before he went on vacation, though the release announcement is missing (1.0.6). Heartbeat does not have many changes (apart from some cleanup in the build dependencies), so there is no urge to release a 3.0.4, but we could do so any time. Agents has a few fixes, but also has some big changes. I have to take another close look, but yes, I think we should release an agents 1.0.4 within the next few weeks. Great! Then let's go for the next release for agents/heartbeat along with glue. My main concern about agents is LF#2378: http://developerbugs.linux-foundation.org/show_bug.cgi?id=2378 It is a change, but it's a necessary change to make the maintenance mode work fine. For heartbeat, I personally like pacemaker on in ha.cf :) One thing that's coming in 1.1.3 is an mcp (master control process) and associated init script for pacemaker. This means that Pacemaker is started/stopped independently of the messaging layer. Currently this is only written for corosync[1], but I've been toying with the idea of extending it to Heartbeat. In which case, if you're already changing the option, you might want to make it: legacy on/off Where off would be the equivalent of starting with -M (no resource management) but wouldn't spawn any daemons. Thoughts? [1] Which has an API call which allows Pacemaker to prevent Corosync from shutting down if it's still running. 
So service corosync stop will fail if you didn't already run service pacemaker stop
Re: [Linux-ha-dev] [PATCH] IPv6addr: removing libnet dependency
On Fri, Jul 23, 2010 at 5:09 AM, Simon Horman ho...@verge.net.au wrote: On Fri, Jul 23, 2010 at 09:19:44AM +0900, Keisuke MORI wrote: The attached patch removes the libnet dependency from the IPv6addr RA by implementing the same functionality using the standard socket API. Currently there are the following problems with the resource-agents package: - The IPv6addr RA requires an extra libnet package in the run-time environment. That is pretty inconvenient, particularly for RHEL users, because it's not included in the standard distribution. - The pre-built RPMs from ClusterLabs do not include the IPv6addr RA. This was once reported on the pacemaker list: http://www.gossamer-threads.com/lists/linuxha/pacemaker/64295#64295 The patch will resolve those issues. I believe that none of the Pacemaker/Heartbeat related packages would depend on the libnet library any more once this is applied. Hi Mori-san, I will add that libnet seems to be more or less unmaintained. Someone recently picked it up again, but I'm in favor of the patch for the reasons Mori-san already stated. You seem to make using libnet optional, is there a reason not to just remove it? portability? Agreed, let's just drop it.
Re: [Linux-ha-dev] Upstart RA
2010/5/16 Ante Karamatić iv...@ubuntu.com: Hi all I'm working on an OCF RA for upstart. In its current state, upstart doesn't return exit codes for 'start', 'stop' or 'status'. Or, to be precise, the exit code is always 0. Exit codes weren't implemented since upstart knows a few more states than just 'running' or 'not running', i.e. it knows the distinction between 'running, but stopping' and 'running'. Which is still no excuse for them not doing exit codes properly. They should have just added a few more, not thrown them out and made automation that much harder. I'm pretty sure that internally they're not using regexes to parse the state of services :-/ Nevertheless, it has exit statuses which are machine readable with grep/awk/whatever. Exit codes will be implemented in the future (probably in a couple of months). So, to create a resource agent that would utilize upstart, we could rely on grepping the output of initctl commands or we could rely on dbus. The approach I've taken was to utilize the python interface to dbus. The reason for this is that upstream prefers communication over dbus, as explained at http://upstart.ubuntu.com/wiki/DBusInterface. I should have this RA done very soon and my plan was to name it upstart-dbus, since it would depend on dbus. dbus isn't installed by default on ubuntu server and probably it isn't installed on other server distributions (correct me if I'm wrong). Would depending on dbus be a problem? I think I'd not make it a strict dependency, and instead make sure the RA checked for dbus and produced OCF_NOT_INSTALLED if it wasn't available. and if not, would a python based RA be acceptable at all? I think we're supposed to be language agnostic so I'd not imagine that to be a problem. Being a plugin is probably a better solution in the long term though, since then we might be able to take advantage of the upstart events. 
It also uses 0.1% fewer characters to configure too I guess :-) I'm aware that it is a bit slower than just grepping, but measured with time(1) the worst result for 'status' was 0.05 seconds.
Re: [Linux-ha-dev] Monitor Operation on MySQL Master/Slave Group
Try removing all the operations in p_mysql. They should only be defined in ms_mysql On Fri, May 14, 2010 at 9:20 PM, John Ratz john.mark.r...@gmail.com wrote: Hi, I did have different interval times, but the problem seems to be that a separate monitor function (monitor_Master_0) in the mysql script is needed in order to support a separate monitor op. Here's my config before and after the change: crm(live)configure# show node $id=58397566-b452-43f3-ac8e-45549776d863 node2 node $id=611d50bd-62c1-4629-bb16-c29ec6210807 node1 primitive p_clusterip ocf:heartbeat:IPaddr2 \ params ip=10.176.1.167 cidr_netmask=32 \ op monitor interval=3s \ meta target-role=Started primitive p_mysql ocf:heartbeat:mysql \ params binary=/usr/bin/mysqld_safe config=/etc/my.cnf socket=/var/run/mysqld/mysql.sock datadir=/var/lib/mysql replication_user=root replication_passwd= pid=/var/run/mysqld/mysqld.pid test_user=root test_passwd= \ op start interval=0 timeout=120s \ op stop interval=0 timeout=120s \ op monitor interval=30 timeout=30 OCF_LEVEL_CHECK=1 ms ms_mysql p_mysql \ meta notify=true master-max=1 is-managed=true target-role=Started location l_master ms_mysql \ rule $id=l_master-rule $role=Master 50: #uname eq node1 colocation mysql_master_on_ip inf: p_clusterip ms_mysql:Master order mysql_before_ip inf: ms_mysql:promote p_clusterip:start property $id=cib-bootstrap-options \ dc-version=1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7 \ cluster-infrastructure=Heartbeat \ stonith-enabled=false \ no-quorum-policy=ignore \ last-lrm-refresh=1273611238 rsc_defaults $id=rsc-options \ resource-stickiness=100 crm(live)configure# edit p_mysql WARNING: p_mysql: action monitor_Master_0 not advertised in meta-data, it may not be supported by the RA crm(live)configure# show node $id=58397566-b452-43f3-ac8e-45549776d863 node2 node $id=611d50bd-62c1-4629-bb16-c29ec6210807 node1 primitive p_clusterip ocf:heartbeat:IPaddr2 \ params ip=10.176.1.167 cidr_netmask=32 \ op monitor interval=3s \ meta 
target-role=Started primitive p_mysql ocf:heartbeat:mysql \ params binary=/usr/bin/mysqld_safe config=/etc/my.cnf socket=/var/run/mysqld/mysql.sock datadir=/var/lib/mysql replication_user=root replication_passwd= pid=/var/run/mysqld/mysqld.pid test_user=root test_passwd= \ op start interval=0 timeout=120s \ op stop interval=0 timeout=120s \ op monitor interval=20 role=Slave timeout=20 OCF_LEVEL_CHECK=1 \ op monitor interval=10 role=Master timeout=20 OCF_LEVEL_CHECK=1 ms ms_mysql p_mysql \ meta notify=true master-max=1 is-managed=true target-role=Started location l_master ms_mysql \ rule $id=l_master-rule $role=Master 50: #uname eq node1 colocation mysql_master_on_ip inf: p_clusterip ms_mysql:Master order mysql_before_ip inf: ms_mysql:promote p_clusterip:start property $id=cib-bootstrap-options \ dc-version=1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7 \ cluster-infrastructure=Heartbeat \ stonith-enabled=false \ no-quorum-policy=ignore \ last-lrm-refresh=1273611238 rsc_defaults $id=rsc-options \ resource-stickiness=100 crm(live)configure# commit WARNING: p_mysql: action monitor_Master_0 not advertised in meta-data, it may not be supported by the RA crm(live)configure# I know that the drbd script file supports a separate monitor op for the Master. I tried changing action name=monitor depth=0 timeout=30 interval=10 / in the mysql file to action name=monitor depth=0 timeout=20 interval=20 role=Slave / action name=monitor depth=0 timeout=20 interval=10 role=Master / so that it matched what's in the drbd script file. Doing this prevents the above warning, but this still fails and crashes the Master/Slave set. I think that it's attempting to run the same monitor op with the same interval on both nodes despite having specified different intervals in the config. Perhaps a fix is needed somewhere in Pacemaker itself? 
It seems that it is not conforming to http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ch10s03s03s03.html Thanks, John On Fri, May 14, 2010 at 3:37 AM, Raoul Bhatia [IPAX] r.bha...@ipax.at wrote: On 05/12/2010 09:30 PM, linuxha@gishpuppy.com wrote: Hello All, I've been testing out the newly added Master/Slave capability for MySQL, but it seems to be missing the monitor operation for the Master instance, i.e. the monitor op for the primitive mysql object only runs on the slave resource, and an added monitor op with role=Master fails. Is this intended behavior? hi, did you specify two monitor operations with different intervals? e.g. [1] crm configure primitive drbd0 ocf:heartbeat:drbd \ params drbd_resource=drbd0 \
Re: [Linux-ha-dev] Monitor Operation on MySQL Master/Slave Group [GishPuppy]
On Wed, May 12, 2010 at 9:30 PM, linuxha@gishpuppy.com wrote: Hello All, I've been testing out the newly added Master/Slave capability for MySQL, but it seems to be missing the monitor operation for the Master instance, i.e. the monitor op for the primitive mysql object only runs on the slave resource, and an added monitor op with role=Master fails. Is this intended behavior? Expected, yes. But not really ideal, we really need to fix that in pacemaker one day :-(
Re: [Linux-ha-dev] Deprecated resource agents
On Tue, Apr 20, 2010 at 8:15 AM, Florian Haas florian.h...@linbit.com wrote: On 04/20/2010 07:03 AM, Tim Serong wrote: On 4/20/2010 at 06:48 AM, Lars Marowsky-Bree l...@novell.com wrote: In general, I think the ability to deprecate functionality is needed, but it shouldn't be slip-streamed into a minor dot release, and we first need to do some more homework to get our infrastructure right before we should consider breaking customer configurations. This'd be easiest if the metadata explicitly said an RA was deprecated, for example something like: ?xml version=1.0? !DOCTYPE resource-agent SYSTEM ra-api-1.dtd resource-agent name=Evmsd version=0.9 deprecated=true ... ATM, the deprecated RAs all seem to follow the same convention of using (deprecated) in the shortdesc, e.g.: shortdesc lang=enControls clustered EVMS volume management (deprecated)/shortdesc ...but grepping arbitrary text out of a description always irks me. It's a little inexact. I'll shoulder the blame for that. I came up with that (deprecated) kludge for fear of lmb jumping in circles about an unauthorized modification of the RA metadata schema. But now you started it! Which allows me to wholeheartedly second your motion. Which brings up another good point... Can we please make OCF relevant again by converting the repo to Hg and allowing access?
Re: [Linux-ha-dev] Deprecated resource agents
On Mon, Apr 19, 2010 at 2:05 PM, Florian Haas florian.h...@linbit.com wrote: Hello, in case you haven't yet noticed: as of resource-agents 1.0.2, several Linux-HA resource agents are marked as deprecated: - EvmsSCC and - Evmsd (both apply to EVMS, which is no longer maintained); - LinuxSCSI (superseded by SCSI reservations and SF-EX); - drbd (superseded by ocf:linbit:drbd); - pingd (superseded by ocf:pacemaker:pingd, which in turn is now considered obsolete and superseded by ocf:pacemaker:ping if I understand correctly). Correct. Everyone should be using ocf:pacemaker:ping It's highly likely that pingd will redirect to ping one of these days. Since nobody should be using these anymore, and we've already kept them for two releases, I suggest that we drop these from 1.0.4. In addition, I suggest that we deprecate (and after a couple more releases, drop) the Linux-HA incarnations of Dummy and Stateful, as duplicates of these already exist in Pacemaker, and Andrew has indicated he wants to maintain them there rather than fix them in the Linux-HA repo. Thoughts and comments appreciated. Cheers, Florian
Re: [Linux-ha-dev] proposed fix for the ABI extension of cluster-glue
On Sat, Apr 17, 2010 at 11:56 AM, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Sat, Apr 17, 2010 at 11:40:36AM +0200, Lars Marowsky-Bree wrote: Lars, I have no other way of saying this, but I still think you're completely misguided in this desire to preserve binary compatibility. What's the point in preserving local ABI compatibility if they have to restart everything anyway? Situation is: I had pacemaker 1.0.8 installed. There is no pacemaker 1.0.9 yet. Cluster glue is updated. I install updated cluster glue, as it better supports pacemaker 1.0.8. I do that, and boom, all my stack segfaults. No, because 1.0.8-4 was rebuilt for the new version of glue. Look, we all know lmb has some crazy-ass ideas, but I'm hard pressed to disagree with anything he's said in this thread. I vote for reapplying the patch, bumping the SO name and forgetting about the whole thing. Why would I require my users to fetch new builds of the very same version of heartbeat and pacemaker, if it is easily avoided? Why would I knowingly break ABI compatibility, if I can avoid it, just for two ints added at the end of a struct instead of in the middle? I have absolutely no understanding for your desire to keep this ABI compatible and make code more complicated by needing to support You don't need to support different semantics. If you want to only support the new semantics, require the 2.1 library. Done. Whether you require 3.0 or 2.1 does not make a difference to you, does it. Anyway, I've had my say. -- : Lars Ellenberg : LINBIT | Your Way to High Availability : DRBD/HA support and consulting http://www.linbit.com DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
Re: [Linux-ha-dev] proposed fix for the ABI extension of cluster-glue
On Sat, Apr 17, 2010 at 7:58 PM, Andrew Beekhof and...@beekhof.net wrote: On Sat, Apr 17, 2010 at 11:56 AM, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Sat, Apr 17, 2010 at 11:40:36AM +0200, Lars Marowsky-Bree wrote: Lars, I have no other way of saying this, but I still think you're completely misguided in this desire to preserve binary compatibility. What's the point in preserving local ABI compatibility if they have to restart everything anyway? Situation is: I had pacemaker 1.0.8 installed. There is no pacemaker 1.0.9 yet. Cluster glue is updated. I install updated cluster glue, as it better supports pacemaker 1.0.8. I do that, and boom, all my stack segfaults. No, because 1.0.8-4 was rebuilt for the new version of glue. Actually, I should probably be clearer on this point in advance... If someone has installed a different version of glue to the one Pacemaker is built with, then I'm not at all interested in looking at any problems the cluster is experiencing. If it works, great. Otherwise the very first thing I'm going to say is to rebuild or update Pacemaker. I burnt WAY too many hours on weird-ass bugs resulting from the debian packaging to even think about condoning this. Look, we all know lmb has some crazy-ass ideas, but I'm hard pressed to disagree with anything he's said in this thread. I vote for reapplying the patch, bumping the SO name and forgetting about the whole thing. Why would I require my users to fetch new builds of the very same version of heartbeat and pacemaker, if it is easily avoided? Why would I knowingly break ABI compatibility, if I can avoid it, just for two ints added at the end of a struct instead of in the middle? I have absolutely no understanding for your desire to keep this ABI compatible and make code more complicated by needing to support You don't need to support different semantics. If you want to only support the new semantics, require the 2.1 library. Done. 
Whether you require 3.0 or 2.1 does not make a difference to you, does it? Anyway, I've had my say. -- : Lars Ellenberg : LINBIT | Your Way to High Availability : DRBD/HA support and consulting http://www.linbit.com DRBD® and LINBIT® are registered trademarks of LINBIT, Austria. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] proposed fix for the ABI extension of cluster-glue
On Sat, Apr 17, 2010 at 8:13 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Sat, Apr 17, 2010 at 07:58:38PM +0200, Andrew Beekhof wrote: I vote for reapplying the patch, bumping the SO name and forgetting about the whole thing. The only thing I do is move the two new members to the end of the struct, Oh without question we should do that. Sorry, didn't mean to imply that this was a waste of time. Just assumed that would be part of reapplying. keeping backwards compatibility, before bumping the SO name anyway, though not from 2.0.0 to 3.0.0, but only to 2.1.0. Because I don't see why we would insist on breaking backwards compatibility, if keeping it is that cheap.
Re: [Linux-ha-dev] proposed fix for the ABI extension of cluster-glue
Sorry, pressed send too quickly... On Sat, Apr 17, 2010 at 8:13 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Sat, Apr 17, 2010 at 07:58:38PM +0200, Andrew Beekhof wrote: I vote for reapplying the patch, bumping the SO name and forgetting about the whole thing. The only thing I do is move the two new members to the end of the struct, keeping backwards compatibility, before bumping the SO name anyway, though not from 2.0.0 to 3.0.0, but only to 2.1.0. Because I don't see why we would insist on breaking backwards compatibility, if keeping it is that cheap. I'll admit to not following the conversation that closely, but the revised patch you posted didn't look all that cheap to me. Or is that a different conversation?
Re: [Linux-ha-dev] [Pacemaker] Announcement: new releases for cluster-glue (1.0.4), resource-agents (1.0.3), and heartbeat (3.0.3)
On Wed, Apr 14, 2010 at 4:01 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: Hello, The new releases of cluster glue (1.0.4), resource agents (1.0.3), and Heartbeat (3.0.3) are finally ready. Nice. The repos up on clusterlabs.org are being rebuilt with all three now and should be ready in an hour or so. It took us a whole week more to put the final touches, apologies if that caused any inconvenience. I guess that the schedule was a bit too tight this time. The highlights:

- cluster-glue
  - interaction between crmd and lrmd changed a bit as of Pacemaker 1.0.8 (affects only resource cleanups)
  - new external/ippower9258 stonith plugin (thanks to Helmut Weymann)
  - hb_report now creates .dot and .png files for the PE input files, and packing CTS tests is more convenient
- resource-agents
  - timeouts in meta-data were reviewed and adjusted in all resource agents; that should make the Pacemaker 1.0.8 crm shell less noisy
  - all agents now print messages which would normally go to the system logs to the terminal if invoked from the command line; that should make resource debugging easier
  - a new ocft RA test suite (thanks to John Shi)
- Heartbeat
  - support for setting the lrmd max children parameter
  - support for sbd fencing

Of course, there's also a bunch of bug fixes, in particular in the agents and glue packages. See the changelogs for more details. Links to the new tarballs are at http://www.linux-ha.org/wiki/Download If you're already running Pacemaker 1.0.8 or planning to upgrade to that release, it would be a good idea to upgrade these releases as well. Of course, don't forget to first test the new packages on your test clusters. Enjoy!
Lars Ellenberg Florian Haas Dejan Muhamedagic ___ Pacemaker mailing list: pacema...@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
[Linux-ha-dev] Purpose of HA_LOGD in .ocf-shellfuncs?
Can anyone explain the purpose of this block?

    if [ x${HA_LOGD} = xyes ] ; then
        ha_logger -t ${HA_LOGTAG} "$@"
        if [ $? -eq 0 ] ; then
            return 0
        fi
    fi

I ask because I can't find anything that actually sets HA_LOGD anywhere. -- Andrew
Re: [Linux-ha-dev] announcement: cluster-glue 1.0.3 release
On Wed, Feb 3, 2010 at 10:46 AM, Dejan Muhamedagic deja...@fastmail.fm wrote: Hi, On Tue, Feb 02, 2010 at 10:41:34AM -0800, Bob Schatz wrote: Does this bug only apply to the 1.0.2 release or was it also in the 1.0.0 release used with fc12? Don't know. The bug was introduced on Dec 07 2009. If you unpack the source tar archive, there should be a file called .hg_archival.txt. Paste the content and I'll be able to tell you if it is affected. Should be unaffected:

    Name    : cluster-glue                   Relocations: (not relocatable)
    Version : 1.0                            Vendor: Fedora Project
    Release : 0.11.b79635605337.hg.fc12      Build Date: Mon 12 Oct 2009 06:06:22 PM CEST

Build date is prior to the fix.
Re: [Linux-ha-dev] ulimit in ocf scripts
On Tue, Jan 12, 2010 at 10:43 AM, Raoul Bhatia [IPAX] r.bha...@ipax.at wrote: On 01/12/2010 10:39 AM, Florian Haas wrote: Why not simply set that for root at boot? (it rhymes too :) because i do not like the idea that each and every process gets elevated limits by default. i think that there *should* be a generic way to configure ulimits on a per-resource basis. I'm confident Dejan would be happy to accept a patch in which you add such a parameter to each resource agent where it makes sense. of course this would be possible. but i *think* it is more helpful to add this to e.g. the cib/lrmd/you name it. so before i/we implement the ulimit stuff *inside* lots of different RAs, i'd like to hear beekhof's or lars' comments. If you want a configurable per-resource limit - that's a resource parameter. Why would we want to implement another mechanism?
Re: [Linux-ha-dev] glue cs#7505f2e115c5 - reintroducing heartbeat name into paths?
On Fri, Dec 11, 2009 at 1:51 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: Hi, On Fri, Dec 11, 2009 at 11:50:10AM +0100, Florian Haas wrote: On 2009-12-10 22:34, Lars Marowsky-Bree wrote: On 2009-12-10T21:45:34, Dejan Muhamedagic deja...@fastmail.fm wrote: There are several packages using /usr/lib/heartbeat and similar. Yeah, but that was mostly a legacy thing, I thought - on a system without heartbeat installed, this is sort of a confusing artifact. The only thing where we have a hard time changing it are binary names (such as hb_report) or public interfaces (provider=heartbeat). Everything else is supposed to use %{name} under share/lib etc; I think FHS suggests that. What's the point of confusing people? There could also be a possibility of a regression. FWIW, I agree with Lars here. I guess it's much more confusing to have directories that belong to one package, yet carry the name of another. I agree in principle too, but the following is really ugly:

    /usr/lib64/cluster-glue/plugins/test/test.so
    /usr/lib64/heartbeat/base64_md5_test
    /usr/share/cluster-glue/lrmtest
    /usr/lib64/heartbeat/lrmd
    /usr/lib64/heartbeat/ha_logd

Don't care about names; it's just that we should use either one or the other and not both. Moving lrmd would break pacemaker. In general, moving things could possibly break some software developed elsewhere. I'd prefer not to cause other users headaches; I think they had enough of chasing moving targets produced by us. So, let's choose one or the other and finally move on, we've really got better things to do, and ask Andrew to change the pacemaker sources (which would make our package incompatible with pacemaker = 1.0.6, but oh well, not the first time we break things). It's not an either-or proposition. Switch completely to the new naming scheme and use symlinks for compatibility. Then after a conservative period of time you can just drop the symlinks.
Thanks, Dejan That's how I stumbled across this, actually; it's changing path names for the SLE packages ;-) Must say that I was also not aware of the change, it probably happened while I was not around. The move happened when the packages were split, I think. P.S. glue doesn't carry much meaning. Correct, but the package name is now cluster-glue (both the autofoo $(PACKAGE_NAME) and the RPM %{name}) and, while not perfect, that is much more meaningful. Florian
Re: [Linux-ha-dev] Make error (Reusable-Cluster-Components-d3036c574587)
Sorry about that. Fix is in changeset 3a005082d4cb On Fri, Dec 4, 2009 at 2:28 AM, renayama19661...@ybb.ne.jp wrote: Hi, An error occurs during a make of Reusable-Cluster-Components-d3036c574587:

    cc1: warnings being treated as errors
    cl_log.c:127: warning: no previous prototype for 'cl_log_enable_stdout'

There is no prototype declaration in the header. I ask for a revision:

    void cl_log_enable_stderr(int truefalse);
    void cl_log_enable_stdout(int truefalse);   - Required

Best Regards, Hideo Yamauchi.
Re: [Linux-ha-dev] Problems starting heartbeat 3.0.1-1 - /etc/ha.d/shellfuncs No such file or directory
On Mon, Nov 16, 2009 at 6:49 PM, Bob Schatz bsch...@yahoo.com wrote: I missed Andrew's reply so I am including his comment and my results below: On my cluster I did:

    [r...@fc11-1 ~]# rpm -qi resource-agents | grep ersion
    Version : 3.0.4    Vendor: Fedora Project

ah, that's the problem. on fc11 resource-agents-3.0.4 doesn't yet have the heartbeat agents. with fc12 that problem goes away. for now, you'll have to remove 3.0.4 and specify 1.0.1 on the command line when you install. so: yum install resource-agents = 1.0.1 pacemaker

    [r...@fc11-1 ~]# rpm -ql resource-agents | grep shellfunc
    /usr/share/cluster/ocf-shellfuncs
    [r...@fc11-1 ~]#

Thanks, Bob --

    [05:08 PM] root[at]f12 ~ # rpm -qi resource-agents | grep ersion
    Version : 3.0.4    Vendor: Fedora Project
    [05:07 PM] root[at]f12 ~ # rpm -ql resource-agents | grep shellfunc
    /etc/ha.d/shellfuncs there
    /usr/lib/ocf/resource.d/heartbeat/.ocf-shellfuncs
    /usr/share/cluster/ocf-shellfuncs

What version of resource-agents do you have installed? On Fri, Nov 13, 2009 at 3:52 AM, Bob Schatz bschatz[at]yahoo.com wrote: For some reason I did not receive the email from Andrew so I am including it below. Cluster glue was installed and I have attached the output from yum at the end of this email. Also, I noticed that the files that used to reside in /usr/lib/ocf/resource.d/heartbeat/* are no longer there. I could not configure an IPaddr resource. Thanks in advance Bob that file should be part of cluster-glue... was that package not installed? On Wed, Nov 11, 2009 at 8:19 PM, Bob Schatz bschatz[at]yahoo.com wrote: Hi, I am new to Linux HA and I am having a problem with heartbeat 3.0.1. It appears that /etc/ha.d/shellfuncs is no longer in the release but it is still called from /etc/init.d/heartbeat.
I reloaded a system with FC11 and then downloaded the pacemaker/heartbeat binaries as follows:

    # wget -O /etc/yum.repos.d/pacemaker.repo http://clusterlabs.org/rpm/fedora-11/clusterlabs.repo
    # yum install -y pacemaker corosync heartbeat

I copied a ha.cf to /etc/ha.d/ha.cf and attempted to start heartbeat as follows:

    root[at]fc11-2:# sh -x /etc/init.d/heartbeat start
    + '[' -f /etc/sysconfig/heartbeat ']'
    + HA_DIR=/etc/ha.d
    + export HA_DIR
    + CONFIG=/etc/ha.d/ha.cf
    + . /etc/ha.d/shellfuncs
    /etc/init.d/heartbeat: line 51: /etc/ha.d/shellfuncs: No such file or directory

I did not see this as a known problem on the mailing lists. Thanks, Bob

    # yum install -y pacemaker corosync heartbeat
    Loaded plugins: refresh-packagekit
    clusterlabs | 1.2 kB 00:00
    clusterlabs/primary | 14 kB 00:00
    clusterlabs 47/47
    Setting up Install Process
    Resolving Dependencies
    --> Running transaction check
    ---> Package corosync.x86_64 0:1.1.2-1.fc11 set to be updated
    --> Processing Dependency: corosynclib = 1.1.2-1.fc11 for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libvotequorum.so.4(COROSYNC_VOTEQUORUM_1.0)(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libconfdb.so.4(COROSYNC_CONFDB_1.0)(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libcfg.so.4(COROSYNC_CFG_0.82)(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libquorum.so.4(COROSYNC_QUORUM_1.0)(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libpload.so.4(COROSYNC_PLOAD_1.0)(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libcoroipcs.so.4()(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: liblogsys.so.4()(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libquorum.so.4()(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libconfdb.so.4()(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libvotequorum.so.4()(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libcfg.so.4()(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libtotem_pg.so.4()(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libcoroipcc.so.4()(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    --> Processing Dependency: libpload.so.4()(64bit) for package: corosync-1.1.2-1.fc11.x86_64
    ---> Package heartbeat.x86_64 0:3.0.1-1.fc11 set to be updated
    --> Processing Dependency: resource-agents for package: heartbeat-3.0.1-1.fc11.x86_64
    --> Processing Dependency: cluster-glue-libs for package: heartbeat-3.0.1-1.fc11.x86_64
    --> Processing Dependency: PyXML for package: heartbeat-3.0.1-1.fc11.x86_64
    --
Re: [Linux-ha-dev] suggestions and patch for meatclient
On Sat, Nov 14, 2009 at 6:58 PM, Lars Marowsky-Bree l...@suse.de wrote: On 2009-11-13T11:42:31, Dejan Muhamedagic deja...@fastmail.fm wrote: I would like to hear any opinion. Great idea! But I'd like to suggest a bit different execution, i.e. to have usage like this: The idea is nice, but what we actually want is a crm node clean-down-confirmation XXX command, that clears the CIB accordingly. So I think stonithd isn't actually the best place to implement this. Agreed. Someone create a bug for this?
Re: [Linux-ha-dev] [Linux-HA] Problem with gratuitous arps in IPaddr2
On Wed, Sep 16, 2009 at 6:18 PM, Lars Marowsky-Bree l...@suse.de wrote: The send_arp.linux.c code file has different semantics than send_arp.libnet.c. I added extra handling to try and make them match. Did I miss something? Basically I took the arping source code and added extra option handling.
Re: [Linux-ha-dev] Build dependency issue with heartbeat/cluster-glue
export CFLAGS="$CFLAGS -I$PREFIX/include -L$PREFIX/lib"

On Mon, Aug 31, 2009 at 10:01 AM, Florian Haas florian.h...@linbit.com wrote: Andrew, Lars, Dejan, As I'm trying to fix the heartbeat init script (and struggling with automake in the process), I believe I've run into a heartbeat build issue where I'm guessing at some point $PREFIX isn't evaluated correctly. Here's what I do (in my hg clone for glue):

    export PREFIX=/tmp/cluster-build/usr
    export LCRSODIR=$PREFIX/libexec/lcrso
    export CLUSTER_USER=hacluster
    export CLUSTER_GROUP=haclient
    ./configure --prefix=$PREFIX --with-initdir=/tmp/cluster-build/etc/init.d --sysconfdir=/tmp/cluster-build/etc --localstatedir=/tmp/cluster-build/var

Configure now reports:

    glue configuration:
    Version                 = 1.0.0 (Build: unknown)
    Features                =
    Prefix                  = /tmp/cluster-build/usr
    Executables             = /tmp/cluster-build/usr/sbin
    Man pages               = /tmp/cluster-build/usr/share/man
    Libraries               = /tmp/cluster-build/usr/lib
    Header files            = /tmp/cluster-build/usr/include
    Arch-independent files  = /tmp/cluster-build/usr/share
    State information       = /tmp/cluster-build/var
    System configuration    = /tmp/cluster-build/etc
    Use system LTDL         = yes
    HA group name           = haclient
    HA user name            = hacluster
    CFLAGS                  = -g -O2 -ggdb3 -O0 -fstack-protector-all -Wall -Waggregate-return -Wbad-function-cast -Wcast-qual -Wcast-align -Wdeclaration-after-statement -Wendif-labels -Wfloat-equal -Wformat=2 -Wformat-security -Wformat-nonliteral -Winline -Wmissing-prototypes -Wmissing-declarations -Wmissing-format-attribute -Wnested-externs -Wno-long-long -Wno-strict-aliasing -Wpointer-arith -Wstrict-prototypes -Wwrite-strings -ansi -D_GNU_SOURCE -DANSI_ONLY -Werror
    Libraries               = -lbz2 -lxml2 -lc -luuid -lrt -ldl -lglib-2.0 -lltdl
    Stack Libraries         =

... which I am happy with. I now do make and make install, and everything correctly installs into my directory hierarchy under /tmp/cluster-build.
I now change to my hg clone for heartbeat (dev repo from hg.linux-ha.org) and try bootstrap with the same flags as for glue:

    ./bootstrap --prefix=$PREFIX --with-initdir=/tmp/cluster-build/etc/init.d --sysconfdir=/tmp/cluster-build/etc --localstatedir=/tmp/cluster-build/var

... which bombs out with:

    checking heartbeat/glue_config.h usability... no
    checking heartbeat/glue_config.h presence... no
    checking for heartbeat/glue_config.h... no
    configure: error: Core development headers were not found
    See `config.log' for more details.

config.log lists includedir as being '/tmp/cluster-build/usr/include', which does contain heartbeat/glue_config.h. What am I doing wrong? Any advice much appreciated. Thanks! Cheers, Florian
[Linux-ha-dev] ANNOUNCE: New Linux-HA repository structure
Lars has asked me to announce that at long last, we have finalized the new Linux-HA repository/project structure. Effective immediately, Heartbeat 2.x has been split into the following projects:

* cluster-glue 1.0
* resource-agents 1.0
* heartbeat 3.0-beta

### Cluster Glue 1.0

- http://hg.linux-ha.org/glue/
- http://hg.linux-ha.org/glue/archive/glue-1.0.tar.gz

A collection of common tools that are useful for writing cluster stacks such as Heartbeat and cluster managers such as Pacemaker. Provides a local resource manager that understands the OCF and LSB standards, and an interface to common STONITH devices.

### Resource Agents 1.0

- http://hg.linux-ha.org/agents/
- http://hg.linux-ha.org/agents/archive/agents-1.0.tar.gz

OCF-compliant scripts to allow common services to operate in a High Availability environment.

### Heartbeat 3.0-beta

- http://hg.linux-ha.org/dev/

A cluster stack providing messaging and membership services that can be used by resource managers such as Pacemaker. Heartbeat still contains the simple 2-node resource manager (a.k.a. haresources) from before version 2. The board will release 3.0-final at a time of its choosing.

These changes have been put in place to allow the group to release updates at intervals that are suitable to each individual project. This also makes better use of our limited QA resources, as we are no longer forced to test the entire stack in order to release an updated set of resource agents. Additionally, the changes aim to increase the usage of the individual components by allowing them to be used independently. Preliminary packages for the most recent openSUSE, SLES, Fedora and RHEL releases are currently available at http://download.opensuse.org/repositories/server:/ha-clustering:/NG Older distros can be added if there is sufficient demand. The existing repositories will be migrated to the new package layout over the coming days and weeks.
-- Andrew
Re: [Linux-ha-dev] cleanup, coding style, checkpatch.pl
On Mon, Jul 27, 2009 at 11:19 AM, Hannes Eder he...@google.com wrote: On Wed, Jul 22, 2009 at 17:13, Lars Marowsky-Bree l...@suse.de wrote: On 2009-07-21T17:24:56, Hannes Eder he...@google.com wrote: Hi, Some parts of the linux-ha code base might benefit from a little code cleanup. In this case the question arises which coding style should be applied. I did not find any documentation on that in the linux-ha source tree. Did I miss something? What about obeying Documentation/CodingStyle from the linux-kernel? By that means tools like scripts/checkpatch.pl could be used. Comments? I won't mind, but style cleanups for their own sake don't really convince me. If they come as a pre-requisite for a bugfix, sure, but remember that basically the only bits of heartbeat that are still actively maintained are the LRM + resource agents. Agreed, but other parts of linux-ha are still in use, no? So, I think for maintainability This isn't much of a concern. Apart from clplumbing and the pieces lars mentioned, the rest of the code is essentially unmaintained. it's worthwhile spending some effort tidying up the code. I do not ask you to do it; it's merely the question: if one, e.g. me, spends some time cleaning up code, what style should be applied, and is it likely to be merged? I'd say not likely. We very rarely look at that code and, to me, the chance that the cleanup would introduce bugs offsets any positive aspect.
Re: [Linux-ha-dev] What's going to happen with the heartbeat RAs?
On Thu, Jun 25, 2009 at 1:48 PM, Florian Haas florian.h...@linbit.com wrote: Hello everyone, This is something that's been on my mind for a while, and I'm still looking for a definitive answer. :) Just what exactly is the current plan for the recent changes to the RAs provided by Heartbeat (i.e. the ones that install into /usr/lib/ocf/resource.d/heartbeat)? I understand there will be no further Heartbeat releases beyond the current 2.99, so those changed (and new) RAs won't ever be released as part of Heartbeat. Yet AFAICS there is no ongoing effort to move them to Pacemaker. What's the plan? Basically: http://hg.clusterlabs.org/extra/agents/ + http://download.opensuse.org/repositories/server:/ha-clustering:/NG The packaging is still a bit of a work in progress, but the full stack did seem to be working before I left for Italy. Should the submitters of these new RAs re-submit to Pacemaker? No need, http://hg.clusterlabs.org/extra/agents is still a filtered copy of the ha dev repo. We'll make an announcement when everything is ready. Andrew seems to not be so fond of that idea, but I wonder what the alternative is. At this point I guess Lars' idea of all sorts of third parties contributing and maintaining their own RAs, all of them installing into separate provider directories, is just that: a good idea, with little chance of being widely adopted anytime soon. So what should we do? Keep preparing patches against the linux-ha Mercurial repo and submitting them to linux-ha-dev, or create patches against http://hg.clusterlabs.org/extra/agents and submit them to the Pacemaker list, or something completely different? Comments appreciated. Thanks! Cheers, Florian
Re: [Linux-ha-dev] ocf (non) unique parameter
On Fri, May 22, 2009 at 5:52 PM, Raoul Bhatia [IPAX] r.bha...@ipax.at wrote: can someone explain the practical usage of non-unique attributes? in the ocf ra specs [1], one can read The meta data allows the RA to flag one or more instance parameters as 'unique'. This is a hint to the RM or higher level configuration tools that the combination of these parameters must be unique to the given resource type. speaking in the v2 configuration language, i thought that this means that one can specify multiple nvpairs with the same name to the cib.xml but for different resources. i.e. multiple ip addresses can use the same netmask (unique=0) but each must have a different ip address (unique=1). e.g. looking at the pingd ra:

    <parameter name="pidfile" unique="0">
      <longdesc lang="en">PID file</longdesc>
      <shortdesc lang="en">PID file</shortdesc>
      <content type="string" default="$HA_RSCTMP/pingd-${OCF_RESOURCE_INSTANCE}" />
    </parameter>

- e.g.

    <nvpair id="pidfile1" name="pidfile" value="/tmp/pid1.pid" />
    <nvpair id="pidfile2" name="pidfile" value="/tmp/pid2.pid" />

as far as i can see, the first nvpair seems to win though and OCF_RESKEY_pidfile is set to /tmp/pid1.pid. yep can someone please shed some light on how to (not) use unique=0 attributes and how they are passed to the ocf ra scripts? as above. a resource can only have one value for any given attribute
[Linux-ha-dev] Re: [Pacemaker] Heartbeat 2.99.2+sles11-rc3 and Pacemaker 1.0.2 packages for Debian (experimental)
On Sat, Feb 21, 2009 at 00:59, Simon Horman ho...@verge.net.au wrote: Hi, I have heartbeat 2.99.2+sles11-rc3-1 and pacemaker 1.0.2-2 packages prepared for Debian Experimental. They should install on top of the current Debian Unstable (Sid). The packages are available at: http://packages.vergenet.net/experimental/ There are source packages and binary packages for i386 and amd64. At this stage the major problem that I am facing is a failure report from /usr/lib/heartbeat/BasicSanityCheck which looks like the log below. BSC is really only a heartbeat thing and hasn't been maintained since the split. It will get resurrected in Pacemaker in some form, but it's not a high priority. I'd suggest the crm parts of the Heartbeat BSC be removed. Full logs are available at:

- http://packages.vergenet.net/experimental/pacemaker/pacemaker-1.0.2-2_heartbeat-2.99.2+sle11-rc3_1_linux-ha_i386.testlog
- http://packages.vergenet.net/experimental/pacemaker/pacemaker-1.0.2-2_heartbeat-2.99.2+sle11-rc3_1_linux-ha_amd64.testlog
- http://packages.vergenet.net/experimental/pacemaker/pacemaker-1.0.1-1+heartbeat-2.99.2-1_linux-ha_i386.testlog

I would really appreciate some advice on what, if anything, to do about this. In particular, is this a problem?
Thanks -- Simon Horman VA Linux Systems Japan K.K., Sydney, Australia Satellite Office H: www.vergenet.net/~horms/ W: www.valinux.co.jp/en

    pengine: [27031]: ERROR: IDREF attribute rsc references an unknown ID bsc-rsc-yukiko.kent.sydney.vergenet.net-1
    pengine: [27031]: ERROR: update_validation: Transformation /usr/share/pacemaker/upgrade06.xsl did not produce a valid configuration
    pengine: [27031]: ERROR: Element cluster_property_set has extra content: attributes
    pengine: [27031]: ERROR: Element crm_config has extra content: cluster_property_set
    pengine: [27031]: ERROR: Invalid sequence in interleave
    pengine: [27031]: ERROR: Element configuration failed to validate content
    pengine: [27031]: ERROR: Element cib failed to validate content
    cib: [27082]: ERROR: validate_cib_digest: Digest comparision failed: expected 2d643f7c4206c8d16db0331dd98c367d (/var/lib/heartbeat/crm/cib.xml.sig), calculated d18f2b4b9762eed144fe06a91e950b16
    cib: [27082]: ERROR: retrieveCib: Checksum of /var/lib/heartbeat/crm/cib.xml failed! Configuration contents ignored!
    cib: [27082]: ERROR: retrieveCib: Usually this is caused by manual changes, please refer to http://linux-ha.org/v2/faq/cib_changes_detected
    cib: [27082]: ERROR: crm_abort: write_cib_contents: Triggered fatal assert at io.c:698 : retrieveCib(CIB_FILENAME, CIB_FILENAME.sig, FALSE) != NULL
    cib: [27002]: ERROR: Managed write_cib_contents process 27082 dumped core
    cib: [27002]: ERROR: cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
    cib: [27002]: ERROR: cib_diskwrite_complete: Disabling disk writes after write failure
    cib: [27085]: ERROR: validate_cib_digest: Digest comparision failed: expected 2d643f7c4206c8d16db0331dd98c367d (/var/lib/heartbeat/crm/cib.xml.sig), calculated d18f2b4b9762eed144fe06a91e950b16
    cib: [27085]: ERROR: write_cib_contents: /var/lib/heartbeat/crm/cib.xml was manually modified while Heartbeat was active!
    cib: [27002]: ERROR: cib_diskwrite_complete: Disk write failed: status=256, signo=0, exitcode=1
    pengine: [27031]: ERROR: Element cluster_property_set has extra content: attributes
    pengine: [27031]: ERROR: Element crm_config has extra content: cluster_property_set
    pengine: [27031]: ERROR: Invalid sequence in interleave
    pengine: [27031]: ERROR: Element configuration failed to validate content
    pengine: [27031]: ERROR: Element cib failed to validate content
    pengine: [27031]: ERROR: process_pe_message: Your current configuration could only be upgraded to transitional-0.6... the minimum requirement is pacemaker-1.0.
Re: [Linux-ha-dev] HA Version 2 for monitoring apache mysql
Logs? On Mon, Dec 15, 2008 at 05:03, Tanveer Chowdhury tanveer.chowdh...@gmail.com wrote: Hi, I have configured the below in RHEL but it doesn't start the Virtual IP address or even the services. Below are the settings I used. Most probably the cib.xml file is not in the right format. Thanks in advance.

    # cat /etc/ha.d/ha.cf
    debugfile /var/log/ha-debug
    logfile /var/log/ha-log
    logfacility local0
    keepalive 1
    deadtime 5
    warntime 3
    initdead 10
    udpport 694
    bcast eth0
    auto_failback on
    node clusternode1
    node clusternode2
    crm on

    # vi /etc/ha.d/haresources
    clusternode1 10.10.4.xxx httpd

    # vi /etc/ha.d/authkeys
    auth 1
    1 crc

    # ll /etc/init.d/apache
    lrwxrwxrwx 1 root root 31 Feb 1 00:20 /etc/init.d/apache -> /usr/local/apache/bin/apachectl
    # ll /etc/init.d/httpd
    lrwxrwxrwx 1 root root 31 Oct 22 2008 /etc/init.d/httpd -> /usr/local/apache/bin/apachectl
    # rpm -qa | grep heartbeat
    heartbeat-pils-2.1.3-3.el5.centos
    heartbeat-stonith-2.1.3-3.el5.centos
    heartbeat-2.1.3-3.el5.centos
    # uname -a
    Linux clusternode2 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:02 EDT 2007 i686 i686 i386 GNU/Linux
    # cat /etc/issue
    Red Hat Enterprise Linux Server release 5.1 (Tikanga)

This cib.xml file was auto generated after I started the heartbeat service.
# vi /var/lib/heartbeat/crm/cib.xml

<cib generated="true" admin_epoch="0" epoch="1" num_updates="1" have_quorum="true" ignore_dtd="false" ccm_transition="1" num_peers="1" cib_feature_revision="2.0" cib-last-written="Fri Dec 15 00:15:48 2008">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <attributes>
          <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="2.1.3-node: 552305612591183b1628baa5bc6e903e0f1e26a3"/>
        </attributes>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="d354acf4-568d-4583-9fe9-72eabb2835b1" uname="clusternode2" type="normal"/>
    </nodes>
    <resources/>
    <constraints/>
  </configuration>
</cib>

Then I added these lines to it:

<resources>
  <group id="apache_group">
    <primitive id="ip_resource_1" class="ocf" type="IPaddr" provider="heartbeat">
      <instance_attributes>
        <attributes>
          <nvpair name="ip" value="10.10.4.xxx"/>
        </attributes>
      </instance_attributes>
    </primitive>
    <primitive id="apache" class="heartbeat" type="apache"/>
  </group>

So it became:

<cib generated="true" admin_epoch="0" epoch="1" num_updates="1" have_quorum="true" ignore_dtd="false" ccm_transition="1" num_peers="1" cib_feature_revision="2.0" cib-last-written="Fri Dec 15 00:15:48 2008">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <attributes>
          <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="2.1.3-node: 552305612591183b1628baa5bc6e903e0f1e26a3"/>
        </attributes>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="d354acf4-568d-4583-9fe9-72eabb2835b1" uname="clusternode2" type="normal"/>
    </nodes>
    <resources>
      <group id="apache_group">
        <primitive id="ip_resource_1" class="ocf" type="IPaddr" provider="heartbeat">
          <instance_attributes>
            <attributes>
              <nvpair name="ip" value="10.10.4.xxx"/>
            </attributes>
          </instance_attributes>
        </primitive>
        <primitive id="apache" class="heartbeat" type="apache"/>
      </group>
    <resources/>
    <constraints/>
  </configuration>
</cib>
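[Editor's note on the configuration above: the pasted file opens a new <resources> element but never closes it, and leaves the original empty <resources/> in place, so the result is not well-formed XML, which matches the poster's suspicion that the cib.xml is "not in the right format". A corrected sketch of the resources section follows. The id attributes on nvpair and the _attrs/_ip names are additions of mine, not from the post; the CRM of that era generally requires ids on these elements.]

```xml
<!-- Sketch of a well-formed resources section for the posted cib.xml.
     The added id attributes (ip_resource_1_attrs, ip_resource_1_ip) are
     assumptions; 10.10.4.xxx is the redacted address from the post. -->
<resources>
  <group id="apache_group">
    <primitive id="ip_resource_1" class="ocf" type="IPaddr" provider="heartbeat">
      <instance_attributes id="ip_resource_1_attrs">
        <attributes>
          <nvpair id="ip_resource_1_ip" name="ip" value="10.10.4.xxx"/>
        </attributes>
      </instance_attributes>
    </primitive>
    <primitive id="apache" class="heartbeat" type="apache"/>
  </group>
</resources>
```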
Re: [Linux-ha-dev] Need to detect ethN down (but NOT ping)
On Sun, Oct 26, 2008 at 19:37, tje [EMAIL PROTECTED] wrote:

Through an interface that *must* be up (or we should fail over), there is no address that can be the subject of a ping. Everything is dynamically assigned via DHCP, and all L2 switching on that side is incapable of having even a loopback address.

ping the dhcp server? ping google.com?
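[Editor's note: the common ping-free alternative for detecting a dead ethN is to read the kernel's carrier flag from sysfs instead of probing a remote address. A minimal sketch, not from the thread; the link_up helper and its sysfs-root parameter are inventions for illustration.]

```shell
#!/bin/sh
# link_up IFACE [SYSFS_ROOT] - return 0 if the NIC reports link (carrier=1).
# The SYSFS_ROOT parameter exists only so the logic can be exercised
# against a mock directory; it defaults to the real /sys/class/net.
link_up() {
    iface=$1
    sysfs=${2:-/sys/class/net}
    # carrier reads "1" when the link is up; "0" or an error when it is
    # down or the interface is absent.
    [ "$(cat "$sysfs/$iface/carrier" 2>/dev/null)" = "1" ]
}

if link_up eth0; then
    echo "eth0 link is up"
else
    echo "eth0 link is down or missing"
fi
```

A monitor like this could feed a failover decision directly, without depending on any pingable neighbor on the DHCP side.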
Re: [Linux-ha-dev] [patch 0/6] Assorted build cleanups and fixes
Looks fine to me

On Wed, Oct 15, 2008 at 09:02, Simon Horman [EMAIL PROTECTED] wrote:

Hi, I noticed a couple of things while building heartbeat earlier today. Please let me know if there are any objections to any of these.

-- Simon Horman, VA Linux Systems Japan K.K., Sydney, Australia Satellite Office. H: www.vergenet.net/~horms/ W: www.valinux.co.jp/en
Re: [Linux-ha-dev] Re: Announcing: 2.99.1 (beta! release)
On Oct 7, 2008, at 1:28 PM, Dejan Muhamedagic wrote:

Hi,

On Tue, Oct 07, 2008 at 01:30:13PM +0800, Yan Gao wrote:

Hi Lars,

On Mon, 2008-10-06 at 12:30 +0200, Lars Marowsky-Bree wrote:

On 2008-10-06T12:32:11, Yan Gao [EMAIL PROTECTED] wrote:

Hi,

On Sat, 2008-10-04 at 17:39 +0900, [EMAIL PROTECTED] wrote:

Hi Xinwei, understood. I will use the latest GUI next week. If there is a problem with the GUI, I will report it on the Pacemaker mailing list.

Any comments or suggestions are welcome ;-)

I think it definitely needs to be renamed to at least pacemaker-pygui; the package just being called pygui doesn't work.

Yes, definitely. I renamed it to pacemaker-pygui when I added the 1.4 revision tag. Actually, I don't think pacemaker-pygui is ideal either :-) because it also includes the SNMP subagent and the backend of the pygui. Maybe pacemaker-mgmt or something is better?

-mgmt sounds good to me. It would also be good if the GUI were in a separate package, because of its numerous dependencies which typically aren't needed on cluster nodes (e.g. gtk stuff).

agreed on both counts
Re: [Linux-ha-dev] To avoid STONITH for a node which is doing kdump
On Tue, Sep 30, 2008 at 12:24, Satomi TANIGUCHI [EMAIL PROTECTED] wrote:

Hi Dejan, thank you for letting me know! I'll test it. Now, may I ask you a question? cluster-delay still seems to require a value longer than the maximum possible stonith timeout for tengine.

Is this with pacemaker 0.7? If so, can you open a bug for this please? The CRM should be waiting forever.

If cluster-delay is shorter than the sum total of the plugins' timeout values, then tengine detects that a STONITH op timed out and a new STONITH op is executed. Then two or more plugins run in parallel. Will this change in the future? (Are we just in a transition period?) Or was that possibility lost because of my insistence that adding a fence(stonith)-timeout is the better way? I ask because Andrew said that the cluster can just wait forever for stonithd to return, if stonithd no longer needs a timeout value from crmd...

Regards, Satomi TANIGUCHI

Dejan Muhamedagic wrote:

Hi, just to let you know that I renamed fence-timeout to stonith-timeout, because there are already stonith-this and stonith-that among the crm cluster properties. Still, better to be consistent with the stonith- naming: calling this one fence-... would most probably confuse people. As if they weren't confused enough ;-)

Thanks, Dejan

On Fri, Sep 26, 2008 at 11:00:53AM +0200, Dejan Muhamedagic wrote:

Hi Satomi-san,

On Fri, Sep 26, 2008 at 05:39:53PM +0900, Satomi TANIGUCHI wrote:

Hi Dejan, I found some bugs.

1) When fence-timeout is not set and priority is set, priority's value is used as both fence_timeout and priority. The patch for this bug is fence-timeout.patch.

Right. An oversight while pondering whether to have it in a loop, or to wait and see if there will be more stonithd attributes coming :)

2) Stonithd can execute only 2 or fewer plugins. With 3 or more plugins, priority is ignored. The patch for this is stonith_rsc_priorities.patch.

Oh, that code was tricky. Strange that it fails on more than two plugins.

I hope they are helpful to you.

Of course.
Thanks for the patches and testing! Cheers, Dejan

Best Regards, Satomi TANIGUCHI
Re: [Linux-ha-dev] Re: Announcing: 2.99.1 (beta! release)
On Oct 2, 2008, at 3:50 AM, HIDEO YAMAUCHI wrote:

Hi, I tested the GUI from this package, but I was not able to add a new dummy resource. Also, the GUI feels very old: it seems to have fewer functions than the GUI of the 2.1.4 versions. Will the GUI of this version be usable with this package?

Only if you set validate-with=pacemaker-0.6 (which also prevents you from using some of the new pacemaker features), because the GUI doesn't understand the new syntax in 1.0. I believe the Novell China guys are rewriting the GUI and it will eventually be able to understand the new syntax. No idea what the ETA on that is, though.

Contents of the log when I failed to add the dummy:
---
mgmtd[12142]: 2008/10/01_13:15:11 info: on_add_rsc: <primitive id="resource_" class="ocf" type="Dummy" provider="heartbeat"><meta_attributes id="resource__meta_attrs"><attributes><nvpair id="resource__metaattr_target_role" name="target_role" value="stopped"/></attributes></meta_attributes></primitive>
cib[12137]: 2008/10/01_13:15:11 ERROR: Element meta_attributes has extra content: attributes
cib[12137]: 2008/10/01_13:15:11 ERROR: Extra element meta_attributes in interleave
cib[12137]: 2008/10/01_13:15:11 ERROR: Element primitive failed to validate content
cib[12137]: 2008/10/01_13:15:11 ERROR: Element resources has extra content: primitive
cib[12137]: 2008/10/01_13:15:11 ERROR: Invalid sequence in interleave
cib[12137]: 2008/10/01_13:15:11 ERROR: Element cib failed to validate content
cib[12137]: 2008/10/01_13:15:11 ERROR: cib_perform_op: Updated CIB does not validate against pacemaker-1.0 schema/dtd
cib[12137]: 2008/10/01_13:15:11 WARN: cib_diff_notify: Update (client: mgmtd, call:3): 0.13.3 -> 0.14.1 (Update does not conform to the configured schema/DTD)
cib[12137]: 2008/10/01_13:15:11 ERROR: cib_process_request: Operation complete: op cib_create for section resources (origin=local/4a6ff9c3-26e0-4788-a081-fab048da199f/3): Update does not conform to the configured schema/DTD (rc=-47)
mgmtd[12142]: 2008/10/01_13:15:11 WARN: cib_native_perform_op: Call failed: Update does not conform to the configured schema/DTD
mgmtd[12142]: 2008/10/01_13:15:12 WARN: unpack_resources: No STONITH resources have been defined
mgmtd[12142]: 2008/10/01_13:15:12 info: determine_online_status: Node rh52-1hb3 is online
---
Regards, Hideo Yamauchi.
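[Editor's note: the validate-with suggestion above is an attribute on the root cib element. A minimal sketch of what that looks like; the attribute values other than validate-with are placeholders, not taken from the thread.]

```xml
<!-- validate-with pins the schema the CIB is checked against. With it
     set to pacemaker-0.6, the old GUI's syntax (a meta_attributes block
     wrapping an attributes element, as in the log above) validates
     again, at the cost of the pacemaker-1.0-only features.
     admin_epoch/epoch/num_updates here are placeholders. -->
<cib validate-with="pacemaker-0.6" admin_epoch="0" epoch="1" num_updates="1">
  <configuration>
    ...
  </configuration>
</cib>
```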
Re: [Linux-ha-dev] Re: Bug#500219: heartbeat: please don't include hidden files
On Mon, Sep 29, 2008 at 13:29, Ferenc Wagner [EMAIL PROTECTED] wrote:

Andrew Beekhof [EMAIL PROTECTED] writes:

On Mon, Sep 29, 2008 at 01:05, Simon Horman [EMAIL PROTECTED] wrote:

On Fri, Sep 26, 2008 at 12:57:44PM +0200, Ferenc Wagner wrote:

Chkrootkit stumbles upon the hidden files under /usr/lib:

/etc/cron.daily/chkrootkit: The following suspicious files and directories were found:
/usr/lib/ocf/resource.d/heartbeat/.ocf-binaries
/usr/lib/ocf/resource.d/heartbeat/.ocf-directories
/usr/lib/ocf/resource.d/heartbeat/.ocf-returncodes
/usr/lib/ocf/resource.d/heartbeat/.ocf-shellfuncs

Please avoid using such names if possible.

That sounds like a reasonable request to me. I am passing it on to the upstream development mailing list for comment there.

I think chkrootkit is being a little over-protective here.

Sure it is, by design.

These files aren't meant to be included directly by the user, and by naming them with a leading dot we avoid the issue of them showing up as resources.

I see. Isn't it possible to move them into a different directory then?

Possible yes, but this isn't nearly a good enough reason to do so. Their current location is the most appropriate.
Re: [Linux-ha-dev] Re: Bug#500219: heartbeat: please don't include hidden files
On Mon, Sep 29, 2008 at 15:24, Dejan Muhamedagic [EMAIL PROTECTED] wrote:

On Mon, Sep 29, 2008 at 01:10:28PM +0200, Andrew Beekhof wrote:

On Mon, Sep 29, 2008 at 01:05, Simon Horman [EMAIL PROTECTED] wrote:

On Fri, Sep 26, 2008 at 12:57:44PM +0200, Ferenc Wagner wrote:

Package: heartbeat
Version: 2.1.3-6
Severity: wishlist

Hi, Chkrootkit stumbles upon the hidden files under /usr/lib:

/etc/cron.daily/chkrootkit: The following suspicious files and directories were found:
/usr/lib/ocf/resource.d/heartbeat/.ocf-binaries
/usr/lib/ocf/resource.d/heartbeat/.ocf-directories
/usr/lib/ocf/resource.d/heartbeat/.ocf-returncodes
/usr/lib/ocf/resource.d/heartbeat/.ocf-shellfuncs

Please avoid using such names if possible.

Hi Ferenc, that sounds like a reasonable request to me. I am passing it on to the upstream development mailing list for comment there.

I think chkrootkit is being a little over-protective here. These files aren't meant to be included directly by the user, and by naming them with a leading dot we avoid the issue of them showing up as resources.

They won't show up as resource agents if the scripts are not executable. Sourcing such files would work in that case too.

Yes, but it's needlessly confusing (even if in a small way) for anyone looking in that directory. Audit tools such as the one mentioned have their place, but moving files around for no other reason than to keep the tool from complaining is a step too far, particularly when there are sane enough reasons for the files to be located and named as they are. At any rate, even if they were relocated, we'd still have to provide links at the current pathnames for compatibility.

As an aside, are rootkit writers really that lame that they rely on a leading dot to hide the presence of a file? Even my little sister wouldn't be fooled by that.

-EMOVEALONG