Re: [Pacemaker] rename all nodes of a cluster

2010-10-05 Thread Andrew Beekhof
Just bring up the cluster with the new names and use crm configure
node delete to remove the old names.

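A rough outline of that procedure, as a sketch only (it assumes the crm shell and
the openais init script shipped with SLES 11; exact crmsh syntax varies between
versions, and <old-name> is a placeholder):

  # stop the stack on all nodes, change the hostnames and eth0 addresses, then:
  /etc/init.d/openais start     # bring the stack back up with the new names
  crm_mon -1                    # confirm the new node names appear online
  crm node delete <old-name>    # repeat once for each old name
                                # (the thread's phrasing: "crm configure node delete")
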
On Mon, Oct 4, 2010 at 2:12 PM, Karl Rößmann k.roessm...@fkf.mpg.de wrote:
 Hi Clusterlabs mailing list,

 I have a running cluster with three nodes.
 For some reason I had to change all host names
 and their IP addresses on interface eth0.

 The communication channel is not affected;
 bindnetaddr, mcastaddr and mcastport will stay the same.

 Is there an easy way to rename the nodes?

 We have SuSE SLES11 SP1, including:

 corosync-1.2.1-0.5.1
 openais-1.1.2-0.5.19



 Karl Roessmann
 --
 Karl Rößmann                            Tel. +49-711-689-1657
 Max-Planck-Institut FKF                 Fax. +49-711-689-1198
 Postfach 800 665
 70506 Stuttgart                         email k.roessm...@fkf.mpg.de

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs:
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Fail over algorithm used by Pacemaker

2010-10-05 Thread Andrew Beekhof
On Sun, Oct 3, 2010 at 4:01 PM, hudan studiawan studia...@gmail.com wrote:
 Hi,

 I want to start contributing to the Pacemaker project. I have started to read
 the documentation and tried some basic configurations. I have a question: what
 kind of algorithm is used by Pacemaker to choose another node when a node dies
 in a cluster? Is there any manual or documentation I can read?

We figure out the next best node based on the location and colocation
constraints you specified in the configuration.
See the Pacemaker Explained and Cluster from Scratch docs at
http://www.clusterlabs.org/doc/
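
For illustration only (hypothetical resource and node names, crm shell syntax),
the constraints that drive that choice look roughly like this:

  location prefer-node-a my-ip 100: node-a        # prefer running my-ip on node-a
  colocation web-with-ip inf: my-webserver my-ip  # keep the web server with the IP
  order ip-before-web inf: my-ip my-webserver     # and start the IP first

When node-a dies, the scores coming from such constraints decide which of the
surviving nodes takes the resources over.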


 Thank you,
 Hudan

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs:
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] cib

2010-10-05 Thread Andrew Beekhof
On Fri, Oct 1, 2010 at 3:45 PM, Shravan Mishra shravan.mis...@gmail.com wrote:
 Hi,

 Just a quick question, who generates the very first cib.xml when
 pacemaker processes are initialized?

The cib daemon itself; if no cib.xml exists on disk, it creates an empty one.

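One way to inspect what it generated (a sketch; cibadmin ships with pacemaker):

  cibadmin --query                      # dump the live CIB as XML
  cibadmin -Q > /tmp/cib-snapshot.xml   # or save a copy for inspection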

 Thanks
 Shravan

 On Thu, Sep 30, 2010 at 4:22 AM, Andrew Beekhof and...@beekhof.net wrote:
 On Tue, Sep 28, 2010 at 11:47 AM, Andrew Beekhof and...@beekhof.net wrote:
 On Mon, Sep 27, 2010 at 6:26 AM, Shravan Mishra
 shravan.mis...@gmail.com wrote:
 Thanks Raoul for the response.

 Changing the permission to hacluster:haclient did stop that error.

 Now I'm hitting another problem whereby cib is failing to start

 Very strange logs.
 Which distribution is this?

   

 What does your corosync.conf look like?


 =
 Sep 27 00:16:29 corosync [pcmk  ] info: update_member: Node
 ha2.itactics.com now has process list:
 00110012 (1114130)
 Sep 27 00:16:29 corosync [pcmk  ] info: update_member: Node
 ha2.itactics.com now has 1 quorum votes (was 0)
 Sep 27 00:16:29 corosync [pcmk  ] info: send_member_notification:
 Sending membership update 100 to 0 children
 Sep 27 00:16:29 corosync [MAIN  ] Completed service synchronization,
 ready to provide service.
 Sep 27 00:16:30 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
 process cib exited (pid=14889, rc=127)
 Sep 27 00:16:30 corosync [pcmk  ] notice: pcmk_wait_dispatch:
 Respawning failed child process: cib
 Sep 27 00:16:30 corosync [pcmk  ] info: spawn_child: Forked child
 14896 for process cib
 crmd[14893]: 2010/09/27_00:16:30 WARN: do_cib_control: Couldn't
 complete CIB registration 1 times... pause and retry
 Sep 27 00:16:31 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
 process cib exited (pid=14896, rc=127)
 Sep 27 00:16:31 corosync [pcmk  ] notice: pcmk_wait_dispatch:
 Respawning failed child process: cib
 Sep 27 00:16:31 corosync [pcmk  ] info: spawn_child: Forked child
 14901 for process cib
 Sep 27 00:16:32 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
 process cib exited (pid=14901, rc=1
 ==


 I have attached the full logs.

 We are using  corosync 1.2.8 and pacemaker 1.1.3.


  Thanks.
 Shravan



 On Sat, Sep 25, 2010 at 4:36 AM, Raoul Bhatia [IPAX] r.bha...@ipax.at 
 wrote:
 On 24.09.2010 21:41, Shravan Mishra wrote:

 crmd[20612]: 2010/09/24_15:29:57 ERROR: crm_log_init_worker: Cannot
 change active directory to /var/lib/heartbeat/cores/hacluster:
 Permission denied (13)

 ls -ald /var/lib/heartbeat/cores/hacluster /var/lib/heartbeat/cores/
 /var/lib/heartbeat/ /var/lib/ /var/

 is haclient allowed to cd all the way into
 /var/lib/heartbeat/cores/hacluster ?

 cheers,


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker




 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Fail-count and failure timeout

2010-10-05 Thread Andrew Beekhof
On Fri, Oct 1, 2010 at 3:40 PM,  holger.teut...@fresenius-netcare.com wrote:
 Hi,
 I observed the following in pacemaker Versions 1.1.3 and tip up to patch
 10258.

 In a small test environment to study fail-count behavior I have one resource
 of type anything (ocf:heartbeat:anything) doing sleep 600 with a monitoring
 interval of 10 secs.

 The failure-timeout is 300.

 I would expect to never see a failcount higher than 1.

Why?

The fail-count is only reset when the PE runs... which is on a failure
and/or after the cluster-recheck-interval.
So I'd expect a maximum of two.

   cluster-recheck-interval = time [15min]
      Polling interval for time-based changes to options, resource
      parameters and constraints.

      The Cluster is primarily event driven, however the configuration can
      have elements that change based on time. To ensure these changes take
      effect, we can optionally poll the cluster's status for changes.
      Allowed values: Zero disables polling. Positive values are an interval
      in seconds (unless other SI units are specified, e.g. 5min)
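
For reference, checking or clearing the count by hand and shortening the recheck
interval look roughly like this (a sketch using crmsh; the resource and node
names are the ones from the CIB below, and exact subcommand syntax can differ
between crmsh versions):

  crm resource failcount test show hotlx    # query the current fail count
  crm resource cleanup test                 # clear failures, resetting the count
  crm configure property cluster-recheck-interval="5min"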




 I observed some sporadic clears, but mostly the count increases by 1 every
 10 minutes.

 Am I mistaken, or is this a bug?

Hard to say without logs.  What value did it reach?


 Regards
 Holger

 -- complete cib for reference ---

 <cib epoch="32" num_updates="0" admin_epoch="0"
     validate-with="pacemaker-1.2" crm_feature_set="3.0.4" have-quorum="0"
     cib-last-written="Fri Oct  1 14:17:31 2010" dc-uuid="hotlx">
   <configuration>
     <crm_config>
       <cluster_property_set id="cib-bootstrap-options">
         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
             value="1.1.3-09640bd6069e677d5eed65203a6056d9bf562e67"/>
         <nvpair id="cib-bootstrap-options-cluster-infrastructure"
             name="cluster-infrastructure" value="openais"/>
         <nvpair id="cib-bootstrap-options-expected-quorum-votes"
             name="expected-quorum-votes" value="2"/>
         <nvpair id="cib-bootstrap-options-no-quorum-policy"
             name="no-quorum-policy" value="ignore"/>
         <nvpair id="cib-bootstrap-options-stonith-enabled"
             name="stonith-enabled" value="false"/>
         <nvpair id="cib-bootstrap-options-start-failure-is-fatal"
             name="start-failure-is-fatal" value="false"/>
         <nvpair id="cib-bootstrap-options-last-lrm-refresh"
             name="last-lrm-refresh" value="1285926879"/>
       </cluster_property_set>
     </crm_config>
     <nodes>
       <node id="hotlx" uname="hotlx" type="normal"/>
     </nodes>
     <resources>
       <primitive class="ocf" id="test" provider="heartbeat" type="anything">
         <meta_attributes id="test-meta_attributes">
           <nvpair id="test-meta_attributes-target-role" name="target-role"
               value="started"/>
           <nvpair id="test-meta_attributes-failure-timeout"
               name="failure-timeout" value="300"/>
         </meta_attributes>
         <operations id="test-operations">
           <op id="test-op-monitor-10" interval="10" name="monitor"
               on-fail="restart" timeout="20s"/>
           <op id="test-op-start-0" interval="0" name="start"
               on-fail="restart" timeout="20s"/>
         </operations>
         <instance_attributes id="test-instance_attributes">
           <nvpair id="test-instance_attributes-binfile" name="binfile"
               value="sleep 600"/>
         </instance_attributes>
       </primitive>
     </resources>
     <constraints/>
   </configuration>
 </cib>

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs:
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] resources are restarted without obvious reasons

2010-10-05 Thread Andrew Beekhof
On Fri, Oct 1, 2010 at 9:53 AM, Pavlos Parissis
pavlos.paris...@gmail.com wrote:
 Hi,
 It seems that it happens every time the PE wants to check the conf:
 09:23:55 crmd: [3473]: info: crm_timer_popped: PEngine Recheck Timer
 (I_PE_CALC) just popped!

 and then check_rsc_parameters() wants to reset my resources

  09:23:55 pengine: [3979]: notice: check_rsc_parameters: Forcing restart of
  pbx_02 on node-02, provider changed: heartbeat -> null
  09:23:55 pengine: [3979]: notice: DeleteRsc: Removing pbx_02 from node-02
  09:23:55 pengine: [3979]: notice: check_rsc_parameters: Forcing restart of
  pbx_01 on node-01, provider changed: heartbeat -> null

Could be a bug in the code that detects changes to the resource definition.
Could you file a bug please?
http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

  Looking at the code I can't conclude where the issue could be: in the
  actual conf, or whether I am hitting a bug.
  static gboolean
  check_rsc_parameters(resource_t *rsc, node_t *node, xmlNode *rsc_entry,
                       pe_working_set_t *data_set)
  {
      int attr_lpc = 0;
      gboolean force_restart = FALSE;
      gboolean delete_resource = FALSE;

      const char *value = NULL;
      const char *old_value = NULL;
      const char *attr_list[] = {
          XML_ATTR_TYPE,
          XML_AGENT_ATTR_CLASS,
          XML_AGENT_ATTR_PROVIDER
      };

      for(; attr_lpc < DIMOF(attr_list); attr_lpc++) {
          value = crm_element_value(rsc->xml, attr_list[attr_lpc]);
          old_value = crm_element_value(rsc_entry, attr_list[attr_lpc]);
          if(value == old_value /* ie. NULL */
             || crm_str_eq(value, old_value, TRUE)) {
              continue;
          }

          force_restart = TRUE;
          crm_notice("Forcing restart of %s on %s, %s changed: %s -> %s",
                     rsc->id, node->details->uname, attr_list[attr_lpc],
                     crm_str(old_value), crm_str(value));
      }
      if(force_restart) {
          /* make sure the restart happens */
          stop_action(rsc, node, FALSE);
          set_bit(rsc->flags, pe_rsc_start_pending);
          delete_resource = TRUE;
      }
      return delete_resource;
  }


 On 1 October 2010 09:13, Pavlos Parissis pavlos.paris...@gmail.com wrote:

 Hi
 Could be related to a possible bug mentioned here[1]?

 BTW here is the conf of pacemaker
 node $id=b8ad13a6-8a6e-4304-a4a1-8f69fa735100 node-02
 node $id=d5557037-cf8f-49b7-95f5-c264927a0c76 node-01
 node $id=e5195d6b-ed14-4bb3-92d3-9105543f9251 node-03
 primitive drbd_01 ocf:linbit:drbd \
     params drbd_resource=drbd_pbx_service_1 \
     op monitor interval=30s \
     op start interval=0 timeout=240s \
     op stop interval=0 timeout=120s
 primitive drbd_02 ocf:linbit:drbd \
     params drbd_resource=drbd_pbx_service_2 \
     op monitor interval=30s \
     op start interval=0 timeout=240s \
     op stop interval=0 timeout=120s
 primitive fs_01 ocf:heartbeat:Filesystem \
     params device=/dev/drbd1 directory=/pbx_service_01
 fstype=ext3 \
     meta migration-threshold=3 failure-timeout=60 \
     op monitor interval=20s timeout=40s OCF_CHECK_LEVEL=20 \
     op start interval=0 timeout=60s \
     op stop interval=0 timeout=60s
 primitive fs_02 ocf:heartbeat:Filesystem \
     params device=/dev/drbd2 directory=/pbx_service_02
 fstype=ext3 \
     meta migration-threshold=3 failure-timeout=60 \
     op monitor interval=20s timeout=40s OCF_CHECK_LEVEL=20 \
     op start interval=0 timeout=60s \
     op stop interval=0 timeout=60s
 primitive ip_01 ocf:heartbeat:IPaddr2 \
     params ip=192.168.78.10 cidr_netmask=24
 broadcast=192.168.78.255 \
     meta failure-timeout=120 migration-threshold=3 \
     op monitor interval=5s
 primitive ip_02 ocf:heartbeat:IPaddr2 \
     params ip=192.168.78.20 cidr_netmask=24
 broadcast=192.168.78.255 \
     op monitor interval=5s
 primitive pbx_01 lsb:test-01 \
     meta failure-timeout=60 migration-threshold=3
 target-role=Started \
     op monitor interval=20s \
     op start interval=0 timeout=60s \
     op stop interval=0 timeout=60s
 primitive pbx_02 lsb:test-02 \
     meta failure-timeout=60 migration-threshold=3
 target-role=Started \
     op monitor interval=20s \
     op start interval=0 timeout=60s \
     op stop interval=0 timeout=60s
 group pbx_service_01 ip_01 fs_01 pbx_01 \
     meta target-role=Started
 group pbx_service_02 ip_02 fs_02 pbx_02 \
     meta target-role=Started
 ms ms-drbd_01 drbd_01 \
     meta master-max=1 master-node-max=1 clone-max=2
 clone-node-max=1 notify=true target-role=Started
 ms ms-drbd_02 drbd_02 \
     meta master-max=1 master-node-max=1 clone-max=2
 clone-node-max=1 notify=true target-role=Started
 location PrimaryNode-drbd_01 ms-drbd_01 100: node-01
 location PrimaryNode-drbd_02 ms-drbd_02 100: node-02
 location PrimaryNode-pbx_service_01 pbx_service_01 200: node-01
 location 

Re: [Pacemaker] [Problem or Enhancement]When attrd reboots, a fail count is initialized.

2010-10-05 Thread renayama19661014
Hi Andrew,

I registered these contents with Bugzilla as enhancement of the functions.

 * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2501

Thanks,
Hideo Yamauchi.


--- renayama19661...@ybb.ne.jp wrote:

 Hi Andrew,
 
 Thank you for comment.
 
   Is the change of this attrd and crmd difficult?
  
  I don't think so.
  But it's not a huge priority because I've never heard of attrd actually
  crashing.

  So while I agree that it's theoretically a problem, in practice no-one
  is going to hit this in production.
  Even if they were unlucky enough to see it, at worst the resource is
  able to run on the node again - which doesn't seem that bad for a HA
  cluster :-)
 
 
 All right.
 
 I register this problem with Bugzilla as a demand first of all. 
 I will wait for the opinion from other users already appearing a little.
 
 Thanks,
 Hideo Yamauchi.
 
 --- Andrew Beekhof and...@beekhof.net wrote:
 
  On Fri, Oct 1, 2010 at 4:00 AM,  renayama19661...@ybb.ne.jp wrote:
   Hi Andrew,
  
   Thank you for comment.
  
   During crmd startup, one could read all the values from attrd into the
   hashtable.
   So the hashtable would only do something if only attrd went down.
  
   If attrd communicates with crmd at start time and reads back the data of
   the hash table, the problem seems to be solvable.
  
   Is the change of this attrd and crmd difficult?
  
   I don't think so.
   But it's not a huge priority because I've never heard of attrd actually
   crashing.

   So while I agree that it's theoretically a problem, in practice no-one
  is going to hit this in production.
  Even if they were unlucky enough to see it, at worst the resource is
  able to run on the node again - which doesn't seem that bad for a HA
  cluster :-)
  
  
  
   I mean: did you see this behavior in a production system, or only
   during testing when you manually killed attrd?
  
    We run the kill command manually as one of our tests of process failures.
    Our users care very much about how the cluster behaves when a process fails.
  
   Best Regards,
   Hideo Yamauchi.
  
   --- Andrew Beekhof and...@beekhof.net wrote:
  
    On Wed, Sep 29, 2010 at 3:59 AM, renayama19661...@ybb.ne.jp
    wrote:
Hi Andrew,
   
Thank you for comment.
   
The problem here is that attrd is supposed to be the authoritative
source for this sort of data.
   
Yes. I understand.
   
Additionally, you don't always want attrd reading from the status
section - like after the cluster restarts.
   
 The problem seems solvable if attrd retrieves the status section from the
 cib after attrd has rebooted.
 Method 2, which I suggested, means exactly that:
  method 2) When attrd starts, attrd communicates with the cib and
  receives the fail-count.
   
For failcount, the crmd could keep a hashtable of the current values
which it could re-send to attrd if it detects a disconnection.
But that might not be a generic-enough solution.
   
 If a hash table in crmd can maintain it, that may be a good idea.
 However, I have a feeling that the same problem happens when crmd fails and
 is rebooted.
  
   During crmd startup, one could read all the values from attrd into the
   hashtable.
   So the hashtable would only do something if only attrd went down.
  
   
The chance that attrd dies _and_ there were relevant values for
fail-count is pretty remote though... is this a real problem you've
experienced or a theoretical one?
   
 I did not understand the meaning well.
 Does this mean that there is a fail-count for attrd on the other node?
  
   I mean: did you see this behavior in a production system, or only
   during testing when you manually killed attrd?
  
   
Best Regards,
Hideo Yamauchi.
   
--- Andrew Beekhof and...@beekhof.net wrote:
   
 On Mon, Sep 27, 2010 at 7:26 AM, renayama19661...@ybb.ne.jp
 wrote:
 Hi,

 When I investigated another problem, I discovered this phenomenon.
 If attrd causes process trouble and does not restart, the problem 
 does not occur.

 Step1) After start, it causes a monitor error in UmIPaddr twice.

 Online: [ srv01 srv02 ]

   Resource Group: UMgroup01
       UmVIPcheck (ocf::heartbeat:Dummy):  Started srv01
       UmIPaddr   (ocf::heartbeat:Dummy2): Started srv01

 Migration summary:
 * Node srv02:
 * Node srv01:
    UmIPaddr: migration-threshold=10 fail-count=2

 Step2) Kill Attrd and Attrd reboots.

 Online: [ srv01 srv02 ]

   Resource Group: UMgroup01
       UmVIPcheck (ocf::heartbeat:Dummy):  Started srv01
       UmIPaddr   (ocf::heartbeat:Dummy2): Started srv01

 Migration summary:
 * Node srv02:
 * Node srv01:
 

Re: [Pacemaker] Missing lrm_opstatus

2010-10-05 Thread Andrew Beekhof
Dejan: looks like something in the lrm library.
Any idea why the message doesn't contain lrm_opstatus?
lrm_targetrc also looks strange.

On Thu, Sep 30, 2010 at 9:41 PM, Ron Kerry rke...@sgi.com wrote:
 Folks -

 I am seeing the following message sequence that results in a bogus
 declaration of monitor failures for two resources very quickly after a
 failover completes (from hendrix to genesis) with all resources coming up.
 The scenario is the same for both resources.

  The CXFS resource monitor is invoked after a successful start, but the response
  is faked, likely due to the start-delay defined for monitoring.

 Sep 30 10:23:33 genesis crmd: [12176]: info: te_rsc_command: Initiating
 action 8: monitor CXFS_monitor_3 on genesis (local)
 Sep 30 10:23:33 genesis crmd: [12176]: info: do_lrm_rsc_op: Performing
 key=8:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c op=CXFS_monitor_3 )
 Sep 30 10:23:33 genesis lrmd: [12173]: debug: on_msg_perform_op: an
 operation operation monitor[15] on ocf::cxfs::CXFS for client 12176, its
 parameters: CRM_meta_name=[monitor] CRM_meta_start_delay=[60]
 crm_feature_set=[3.0.2] CRM_meta_on_fail=[restar..
 Sep 30 10:23:33 genesis crmd: [12176]: info: do_lrm_rsc_op: Faking
 confirmation of CXFS_monitor_3: execution postponed for over 5 minutes
 Sep 30 10:23:33 genesis crmd: [12176]: info: send_direct_ack: ACK'ing
 resource op CXFS_monitor_3 from
 8:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c: lrm_invoke-lrmd-1285860213-14
 Sep 30 10:23:33 genesis crmd: [12176]: info: process_te_message: Processing
 (N)ACK lrm_invoke-lrmd-1285860213-14 from genesis
 Sep 30 10:23:33 genesis crmd: [12176]: info: match_graph_event: Action
 CXFS_monitor_3 (8) confirmed on genesis (rc=0)

 Similar sequence for the TMF resource ...

 Sep 30 10:23:44 genesis crmd: [12176]: info: te_rsc_command: Initiating
 action 12: monitor TMF_monitor_6 on genesis (local)
 Sep 30 10:23:44 genesis crmd: [12176]: info: do_lrm_rsc_op: Performing
 key=12:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c op=TMF_monitor_6 )
 Sep 30 10:23:44 genesis lrmd: [12173]: debug: on_msg_perform_op: an
 operation operation monitor[19] on ocf::tmf::TMF for client 12176, its
 parameters: admin_emails=[rke...@sgi.com] loader_hosts=[ibm3494cps]
 devgrpnames=[ibm3592] loader_names=[ibm3494] loa...
 Sep 30 10:23:44 genesis crmd: [12176]: info: do_lrm_rsc_op: Faking
 confirmation of TMF_monitor_6: execution postponed for over 5 minutes
 Sep 30 10:23:44 genesis crmd: [12176]: info: send_direct_ack: ACK'ing
 resource op TMF_monitor_6 from
 12:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c: lrm_invoke-lrmd-1285860224-19
 Sep 30 10:23:44 genesis crmd: [12176]: info: process_te_message: Processing
 (N)ACK lrm_invoke-lrmd-1285860224-19 from genesis
 Sep 30 10:23:44 genesis crmd: [12176]: info: match_graph_event: Action
 TMF_monitor_6 (12) confirmed on genesis (rc=0)

 TMF monitor operation state gets an error. Note that the operation id
 matches the above invocation.

 Sep 30 10:26:12 genesis lrmd: [12173]: debug: on_msg_get_state:state of rsc
 TMF is LRM_RSC_IDLE
 Sep 30 10:26:12 genesis crmd: [12176]: WARN: msg_to_op(1326): failed to get
 the value of field lrm_opstatus from a ha_msg
 Sep 30 10:26:12 genesis crmd: [12176]: info: msg_to_op: Message follows:
 Sep 30 10:26:12 genesis crmd: [12176]: info: MSG: Dumping message with 16
 fields
 Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[0] : [lrm_t=op]
 Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[1] : [lrm_rid=TMF]
 Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[2] : [lrm_op=monitor]
 Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[3] : [lrm_timeout=63]
 Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[4] : [lrm_interval=6]
 Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[5] : [lrm_delay=60]
 Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[6] : [lrm_copyparams=0]
 Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[7] : [lrm_t_run=0]
 Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[8] : [lrm_t_rcchange=0]
 Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[9] : [lrm_exec_time=0]
 Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[10] : [lrm_queue_time=0]
 Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[11] : [lrm_targetrc=-2]
 Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[12] : [lrm_app=crmd]
 Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[13] :
 [lrm_userdata=12:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c]
 Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[14] :
 [(2)lrm_param=0x60081260(331 413)]

 Same for the CXFS monitor operation state ...

 Sep 30 10:26:12 genesis lrmd: [12173]: debug: on_msg_get_state:state of rsc
 CXFS is LRM_RSC_IDLE
 Sep 30 10:26:12 genesis crmd: [12176]: WARN: msg_to_op(1326): failed to get
 the value of field lrm_opstatus from a ha_msg
 Sep 30 10:26:12 genesis crmd: [12176]: info: msg_to_op: Message follows:
 Sep 30 10:26:12 genesis crmd: [12176]: info: MSG: Dumping message with 16
 fields
 Sep 30 10:26:12 genesis crmd: [12176]: 

Re: [Pacemaker] crm_mon SNMP function

2010-10-05 Thread Michael Schhwartzkopff
On Monday 04 October 2010 15:00:25 mathias.enzensber...@knapp.com wrote:
 Hi all,
 
 I use openais/pacemaker v.1.1.2 on SLES 11.1 and would like to use the
 SNMP function of crm_mon.
  But this part is only scantily documented (e.g. the part about configuring
  SNMP notifications is blank).
  I found out that there is a special MIB named linux-ha-mib, but I don't
  know how to use this MIB in connection with the crm_mon command and its
  SNMP function.
 
 Does anyone of you have experience with that, or can someone document it
 shortly for me?
 
 Thank you in advance.
 
 Mit freundlichen Grüßen / Best Regards
 
 Mathias Enzensberger
 Systemadministration

Hi,

They say that there is a good German book about Linux clusters published by
O'Reilly.  ;-)
The book deals with the SNMP part of the cluster software in a separate
chapter.

As far as I know there are two SNMP agents implemented in the cluster software
at the moment:

1) One within the crm_mon program. This one is only able to send out traps,
which it does in case of an unexpected event. Configure it with the -s (?)
option of the crm_mon program. Please check, because I'm not quite sure about
the option.

2) A full-blown SNMP subagent program called hbagent. It uses the AgentX
socket of the net-snmp agent. It provides a complete information MIB,
including useful things like fail counters.

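A sketch of both approaches (hedged, since option names vary between builds;
the trap destination is a placeholder):

  crm_mon --help | grep -i snmp      # check whether this build has SNMP support
  # if it does, something along these lines runs crm_mon as a daemon sending traps:
  crm_mon --daemonize --snmp-traps snmp-host.example.com

  # for the hbagent subagent, net-snmp must act as an AgentX master:
  echo "master agentx" >> /etc/snmp/snmpd.conf    # then restart snmpd
  hbagent &                                       # start the Linux-HA SNMP subagent
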
Please mail again if you are stuck using the agent and want to know more.

Greetings,


-- 
Dr. Michael Schwartzkopff
Guardinistr. 63
81375 München

Tel: (0163) 172 50 98

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] resources are restarted without obvious reasons

2010-10-05 Thread Pavlos Parissis
On 5 October 2010 11:15, Andrew Beekhof and...@beekhof.net wrote:

 On Fri, Oct 1, 2010 at 9:53 AM, Pavlos Parissis
 pavlos.paris...@gmail.com wrote:
  Hi,
   It seems that it happens every time the PE wants to check the conf:
  09:23:55 crmd: [3473]: info: crm_timer_popped: PEngine Recheck Timer
  (I_PE_CALC) just popped!
 
  and then check_rsc_parameters() wants to reset my resources
 
   09:23:55 pengine: [3979]: notice: check_rsc_parameters: Forcing restart of
   pbx_02 on node-02, provider changed: heartbeat -> null
   09:23:55 pengine: [3979]: notice: DeleteRsc: Removing pbx_02 from node-02
   09:23:55 pengine: [3979]: notice: check_rsc_parameters: Forcing restart of
   pbx_01 on node-01, provider changed: heartbeat -> null

 Could be a bug in the code that detects changes to the resource definition.
 Could you file a bug please?

 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


here it is http://developerbugs.linux-foundation.org/show_bug.cgi?id=2504
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] init Script fails in 1 of LSB Compatible test

2010-10-05 Thread Pavlos Parissis
Hi,

I am thinking of putting sshd under cluster control, and I am checking whether
the /etc/init.d/sshd supplied by RedHat 5.4 is compatible with LSB.
So, I ran the test mentioned here [1] and it fails at test 6: it returns 1
and a failed message.
Could this create problems within pacemaker?

Regards,
Pavlos




[1]
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-lsb.html
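
(For reference, the checks in that appendix amount to running the script by hand
and comparing exit codes with what the LSB spec requires; a condensed sketch:)

  /etc/init.d/sshd start  ; echo "start:  $?"   # expect 0
  /etc/init.d/sshd status ; echo "status: $?"   # expect 0 while running
  /etc/init.d/sshd stop   ; echo "stop:   $?"   # expect 0
  /etc/init.d/sshd status ; echo "status: $?"   # expect 3 once stopped, not 1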
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Fail-count and failure timeout

2010-10-05 Thread Holger . Teutsch
The resource failed when the sleep expired, i.e. every 600 secs.
Now I changed the resource to

sleep 7200, failure-timeout 3600

i.e. to values far beyond the recheck-interval of 15m.

Now everything behaves as expected.
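
(Expressed in crm shell syntax, the adjusted resource would look roughly like
the sketch below, reconstructed from the CIB quoted further down; only the
binfile and failure-timeout values changed:)

  crm configure primitive test ocf:heartbeat:anything \
        params binfile="sleep 7200" \
        op monitor interval="10" timeout="20s" on-fail="restart" \
        meta target-role="started" failure-timeout="3600"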
 
Mit freundlichen Grüßen / Kind regards 

Holger Teutsch 





From:   Andrew Beekhof and...@beekhof.net
To: The Pacemaker cluster resource manager 
pacemaker@oss.clusterlabs.org
Date:   05.10.2010 11:09
Subject:Re: [Pacemaker] Fail-count and failure timeout



On Tue, Oct 5, 2010 at 11:07 AM, Andrew Beekhof and...@beekhof.net 
wrote:
 On Fri, Oct 1, 2010 at 3:40 PM,  holger.teut...@fresenius-netcare.com 
wrote:
 Hi,
 I observed the following in pacemaker Versions 1.1.3 and tip up to 
patch
 10258.

 In a small test environment to study fail-count behavior I have one 
resource

 anything
 doing sleep 600 with monitoring interval 10 secs.

 The failure-timeout is 300.

 I would expect to never see a failcount higher than 1.

 Why?

 The fail-count is only reset when the PE runs... which is on a failure
 and/or after the cluster-recheck-interval
 So I'd expect a maximum of two.

Actually this is wrong.
There is no maximum, because there needs to have been 300s since the
last failure when the PE runs.
And since it only runs when the resource fails, it is never reset.


   cluster-recheck-interval = time [15min]
  Polling interval for time based changes to options,
 resource parameters and constraints.

  The Cluster is primarily event driven, however the
 configuration can have elements that change based on time. To ensure
 these changes take effect, we can optionally poll  the  cluster’s
  status for changes. Allowed values: Zero disables
 polling. Positive values are an interval in seconds (unless other SI
 units are specified. eg. 5min)




 I observed some sporadic clears but mostly the count is increasing by 1 
each
 10 minutes.

 Am I mistaken or is this a bug ?

 Hard to say without logs.  What value did it reach?


 Regards
 Holger

 -- complete cib for reference ---

 cib epoch=32 num_updates=0 admin_epoch=0
 validate-with=pacemaker-1.2 crm_feature_set=3.0.4 have-quorum=0
 cib-last-written=Fri Oct  1 14:17:31 2010 dc-uuid=hotlx
   configuration
 crm_config
   cluster_property_set id=cib-bootstrap-options
 nvpair id=cib-bootstrap-options-dc-version name=dc-version
 value=1.1.3-09640bd6069e677d5eed65203a6056d9bf562e67/
 nvpair id=cib-bootstrap-options-cluster-infrastructure
 name=cluster-infrastructure value=openais/
 nvpair id=cib-bootstrap-options-expected-quorum-votes
 name=expected-quorum-votes value=2/
 nvpair id=cib-bootstrap-options-no-quorum-policy
 name=no-quorum-policy value=ignore/
 nvpair id=cib-bootstrap-options-stonith-enabled
 name=stonith-enabled value=false/
 nvpair id=cib-bootstrap-options-start-failure-is-fatal
 name=start-failure-is-fatal value=false/
 nvpair id=cib-bootstrap-options-last-lrm-refresh
 name=last-lrm-refresh value=1285926879/
   /cluster_property_set
 /crm_config
 nodes
   node id=hotlx uname=hotlx type=normal/
 /nodes
 resources
   primitive class=ocf id=test provider=heartbeat 
type=anything
 meta_attributes id=test-meta_attributes
   nvpair id=test-meta_attributes-target-role 
name=target-role
 value=started/
   nvpair id=test-meta_attributes-failure-timeout
 name=failure-timeout value=300/
 /meta_attributes
 operations id=test-operations
   op id=test-op-monitor-10 interval=10 name=monitor
 on-fail=restart timeout=20s/
   op id=test-op-start-0 interval=0 name=start
 on-fail=restart timeout=20s/
 /operations
 instance_attributes id=test-instance_attributes
   nvpair id=test-instance_attributes-binfile name=binfile
 value=sleep 600/
 /instance_attributes
   /primitive
 /resources
 constraints/
   /configuration
 /cib

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: 
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs:
 
http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: 
http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: 

Re: [Pacemaker] init Script fails in 1 of LSB Compatible test

2010-10-05 Thread Andrew Beekhof
On Tue, Oct 5, 2010 at 12:51 PM, Pavlos Parissis
pavlos.paris...@gmail.com wrote:
 Hi,

 I am thinking to put under cluster control the sshd and I am checking if the
 /etc/init.d/sshd supplied by RedHat 5.4 is compatible with LSB.
 So, I run the test mentioned here [1] and it fails at test 6, it returns 1
 and failed message.
 Could this create problems within pacemaker?

yes


 Regards,
 Pavlos




 [1]
 http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-lsb.html

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs:
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] init Script fails in 1 of LSB Compatible test

2010-10-05 Thread Pavlos Parissis
On 5 October 2010 13:19, Andrew Beekhof and...@beekhof.net wrote:

 On Tue, Oct 5, 2010 at 12:51 PM, Pavlos Parissis
 pavlos.paris...@gmail.com wrote:
  Hi,
 
  I am thinking to put under cluster control the sshd and I am checking if
 the
  /etc/init.d/sshd supplied by RedHat 5.4 is compatible with LSB.
  So, I run the test mentioned here [1] and it fails at test 6, it returns
 1
  and failed message.
  Could this create problems within pacemaker?

 yes


What kind of problems, and why?

Regards,
Pavlos
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Dependency on either of two resources

2010-10-05 Thread Vladislav Bogdanov
05.10.2010 12:12, Andrew Beekhof wrote:
 On Mon, Oct 4, 2010 at 6:31 AM, Vladislav Bogdanov bub...@hoster-ok.com 
 wrote:
 Hi all,

 just wondering, is there a way to make resource depend on (be colocated
 with) either of two other resources?
 
  Not yet.  It's something we want to support eventually though.

That would be a killer feature. Hope you will find time for it before 1.2.

So for now I should probably exploit the monitor op to try to repair failed
portals. That is a hack, and it doesn't let me see in the CIB whether there
are actual problems, but it is better than nothing.
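
(What is expressible today is only the hard form of the use case quoted below,
where the dependent resource is stopped if either connection goes away; with
hypothetical resource names, roughly:

  colocation mpath-with-portal1 inf: p_multipath p_iscsi_portal1
  colocation mpath-with-portal2 inf: p_multipath p_iscsi_portal2

An either-of / OR colocation has no crm shell equivalent yet.)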

 
  The use case is an iSCSI initiator connection to an iSCSI target with two
  portals. The idea is to have e.g. a device-mapper multipath resource depend
  on both iSCSI connection resources, but in a soft way, so that failure of any
  single iSCSI connection will not cause the multipath resource to stop, while
  failure of both connections will.

  I may be missing something, but I cannot find an answer on whether this is
  possible with current pacemaker. Can someone shed some light?

 Best,
 Vladislav

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] pacemaker version

2010-10-05 Thread Shravan Mishra
Hi,

I was interested in knowing: if I have to choose between pacemaker
1.0 and 1.1, which one should I use?

Thanks
Shravan

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Online and Offline status when doing crm_mon

2010-10-05 Thread Mike A Meyer
We are set up in a two-node active/passive cluster using pacemaker/corosync.
We shut down pacemaker/corosync on both nodes and changed uname -n on our
nodes to show the short name instead of the FQDN. We started pacemaker/corosync
back up, and ever since we did that, when we run the crm_mon command we see
the output below.


Last updated: Tue Oct 5 13:28:16 2010
Stack: openais
Current DC: e-magdb2 - partition with quorum
Version: 1.0.9-89bd754939df5150de7cd76835f98fe90851b677
4 Nodes configured, 2 expected votes
2 Resources configured.


Online: [ e-magdb2 e-magdb1 ]
OFFLINE: [ e-magdb1.testingpcmk.com e-magdb2.testingpcmkr.com ]

We did edit the crm configuration file to use short names for both nodes.
We can ping both the short name and the FQDN on our internal network, and both
come back with the right IP address. We are running on RHEL 5. Does anybody
have any ideas why the FQDN entries show as offline now that we configured
pacemaker/corosync to use short names? Is it grabbing them from internal DNS
based on the IP address we have in the /etc/corosync.conf file? Everything
seems to be working and failing over correctly. Should this be something to
worry about, or is it maybe a display bug? Below is the corosync.conf file.

# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
        version: 2
        secauth: off
        threads: 0
        interface {
                ringnumber: 0
                bindnetaddr: 172.26.5.167
                mcastaddr: 226.94.5.1
                mcastport: 5405
        }
}

logging {
        fileline: off
        to_stderr: no
        to_logfile: yes
        to_syslog: yes
        logfile: /var/log/cluster/corosync.log
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}

amf {
        mode: disabled
}

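(One way to see where the extra OFFLINE entries come from is to look at the
node objects recorded in the CIB; stale entries for the old FQDNs would show
up there. A sketch:)

  cibadmin -Q -o nodes    # list the node entries kept in the CIB
  crm_mon -1              # compare against what the membership layer reports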

Thanks,
Mike

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] how to test network access and fail over accordingly?

2010-10-05 Thread Craig Hurley
Hello,

I have a 2 node cluster, running DRBD, heartbeat and pacemaker in
active/passive mode.  On both nodes, eth0 is connected to the main
network, eth1 is used to connect the nodes directly to each other.
The nodes share a virtual IP address on eth0.  Pacemaker is also
controlling a custom service with an LSB compliant script in
/etc/init.d/.  All of this is working fine and I'm happy with it.

I'd like to configure the nodes so that they fail over if eth0 goes
down (or if they cannot access a particular gateway), so I tried
adding the following (as per
http://www.clusterlabs.org/wiki/Example_configurations#Set_up_pingd)

primitive p_pingd ocf:pacemaker:pingd params host_list=172.20.0.254 op
monitor interval=15s timeout=5s
clone c_pingd p_pingd meta globally-unique=false
location loc_pingd g_cluster_services rule -inf: not_defined p_pingd
or p_pingd lte 0

... but when I do add that, all resources are stopped and they don't
come back up on either node.  Am I making a basic mistake or do you
need more info from me?

All help is appreciated,
Craig.

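(For comparison only: the pingd example on the referenced wiki page keys the
location rule off the node attribute that the pingd daemon writes (named via
the name parameter, defaulting to pingd), rather than off the primitive id.
Roughly:)

  primitive p_pingd ocf:pacemaker:pingd \
        params host_list="172.20.0.254" name="pingd" \
        op monitor interval="15s" timeout="5s"
  clone c_pingd p_pingd meta globally-unique="false"
  location loc_pingd g_cluster_services \
        rule -inf: not_defined pingd or pingd lte 0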

pacemaker
Version: 1.0.8+hg15494-2ubuntu2

heartbeat
Version: 1:3.0.3-1ubuntu1

drbd8-utils
Version: 2:8.3.7-1ubuntu2.1


r...@rpalpha:~$ sudo crm configure show
node $id=32482293-7b0f-466e-b405-c64bcfa2747d rpalpha
node $id=3f2aac12-05aa-4ac7-b91f-c47fa28efb44 rpbravo
primitive p_drbd_data ocf:linbit:drbd \
params drbd_resource=data \
op monitor interval=30s
primitive p_fs_data ocf:heartbeat:Filesystem \
params device=/dev/drbd/by-res/data directory=/mnt/data
fstype=ext4
primitive p_ip ocf:heartbeat:IPaddr2 \
params ip=172.20.50.3 cidr_netmask=255.255.0.0 nic=eth0 \
op monitor interval=30s
primitive p_rp lsb:rp \
op monitor interval=30s \
meta target-role=Started
group g_cluster_services p_ip p_fs_data p_rp
ms ms_drbd p_drbd_data \
meta master-max=1 master-node-max=1 clone-max=2
clone-node-max=1 notify=true
location loc_preferred_master g_cluster_services inf: rpalpha
colocation colo_mnt_on_master inf: g_cluster_services ms_drbd:Master
order ord_mount_after_drbd inf: ms_drbd:promote g_cluster_services:start
property $id=cib-bootstrap-options \
dc-version=1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd \
cluster-infrastructure=Heartbeat \
no-quorum-policy=ignore \
stonith-enabled=false \
expected-quorum-votes=2 \


r...@rpalpha:~$ sudo cat /etc/ha.d/ha.cf
node rpalpha
node rpbravo

keepalive 2
warntime 5
deadtime 15
initdead 60

mcast eth0 239.0.0.43 694 1 0
bcast eth1

use_logd yes
autojoin none
crm respawn


r...@rpalpha:~$ sudo cat /etc/drbd.conf
global {
usage-count no;
}
common {
protocol C;

handlers {}

startup {}

disk {}

net {
cram-hmac-alg sha1;
shared-secret foobar;
}

syncer {
verify-alg sha1;
rate 100M;
}
}
resource data {
device /dev/drbd0;
meta-disk internal;
on rpalpha {
disk /dev/mapper/rpalpha-data;
address 192.168.1.1:7789;
}
on rpbravo {
disk /dev/mapper/rpbravo-data;
address 192.168.1.2:7789;
}
}

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker