Re: [Pacemaker] failure handling on a cloned resource

2013-05-15 Thread Andrew Beekhof

On 16/05/2013, at 12:45 AM, Johan Huysmans  wrote:

> Hi Andrew,
> 
> Thx!
> 
> I tested your github pacemaker repository by building an rpm from it and 
> installing it on my testsetup.
> 
> Before I could build the rpm I had to change 2 things in the GNUmakefile:
> * --without=doc should be --without doc

That would be a dependency issue, i.e. something needed was not installed.

> * --target i686 was missing
> If I didn't make these modifications the rpmbuild command failed (on CentOS 6).

What was the command you ran?

> 
> I performed the test which failed before and everything seems OK.
> Once the failing resource was restored the depending resources were 
> automatically started.
> 
> Thanks for this fast fix!
> 
> 
> In which release can I expect this fix? And when is it planned?

1.1.10 is planned for as soon as all the bugs are fixed :)
We're at rc2 now; rc3 should be today/tomorrow.

> For now I will use the head build I created. This is OK for my test setup,
> but I don't want to run this version in production.
> 
> Greetings,
> Johan Huysmans
> 
> On 2013-05-10 06:55, Andrew Beekhof wrote:
>> Fixed!
>> 
>>   https://github.com/beekhof/pacemaker/commit/d87de1b
>> 
>> On 10/05/2013, at 11:59 AM, Andrew Beekhof  wrote:
>> 
>>> On 07/05/2013, at 5:15 PM, Johan Huysmans  wrote:
>>> 
 Hi,
 
 I only keep a couple of pe-input files, and that pe-input-1 version was 
 already overwritten.
 I redid my tests as described in my previous mails.
 
 At the end of the test it was again written to pe-input-1, which is 
 included as an attachment.
>>> Perfect.
>>> Basically the PE doesn't know how to correctly recognise that 
>>> d_tomcat_monitor_15000 needs to be processed after d_tomcat_last_failure_0:
>>> 
>>> <lrm_rsc_op operation_key="d_tomcat_monitor_15000" operation="monitor" 
>>> crm-debug-origin="do_update_resource" crm_feature_set="3.0.7" 
>>> transition-key="18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89" 
>>> transition-magic="0:0;18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89" 
>>> call-id="44" rc-code="0" op-status="0" interval="15000" 
>>> last-rc-change="1367910303" exec-time="0" queue-time="0" 
>>> op-digest="0c738dfc69f09a62b7ebf32344fddcf6"/>
>>> <lrm_rsc_op operation_key="d_tomcat_monitor_15000" operation="monitor" 
>>> crm-debug-origin="do_update_resource" crm_feature_set="3.0.7" 
>>> transition-key="18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89" 
>>> transition-magic="0:1;18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89" 
>>> call-id="44" rc-code="1" op-status="0" interval="15000" 
>>> last-rc-change="1367909258" exec-time="0" queue-time="0" 
>>> op-digest="0c738dfc69f09a62b7ebf32344fddcf6"/>
>>> 
>>> which would allow it to recognise that the resource is healthy once again.
>>> 
>>> I'll see what I can do...
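
For reference, the two operation records quoted above differ only in rc-code,
transition-magic and last-rc-change; decoding the epoch timestamps (GNU date
shown, purely illustrative) makes the intended ordering clear:

  # decode the last-rc-change values from the two lrm_rsc_op entries
  date -u -d @1367909258   # -> 2013-05-07 06:47:38 UTC (rc-code=1, the failure)
  date -u -d @1367910303   # -> 2013-05-07 07:05:03 UTC (rc-code=0, healthy again)
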
>>> 
 gr.
 Johan
 
 On 2013-05-07 04:08, Andrew Beekhof wrote:
> I have a much clearer idea of the problem you're seeing now, thank you.
> 
> Could you attach /var/lib/pacemaker/pengine/pe-input-1.bz2 from CSE-1 ?
> 
> On 03/05/2013, at 10:40 PM, Johan Huysmans  
> wrote:
> 
>> Hi,
>> 
>> Below you can see my setup and my test, this shows that my cloned 
>> resource with on-fail=block does not recover automatically.
>> 
>> My Setup:
>> 
>> # rpm -aq | grep -i pacemaker
>> pacemaker-libs-1.1.9-1512.el6.i686
>> pacemaker-cluster-libs-1.1.9-1512.el6.i686
>> pacemaker-cli-1.1.9-1512.el6.i686
>> pacemaker-1.1.9-1512.el6.i686
>> 
>> # crm configure show
>> node CSE-1
>> node CSE-2
>> primitive d_tomcat ocf:ntc:tomcat \
>>   op monitor interval="15s" timeout="510s" on-fail="block" \
>>   op start interval="0" timeout="510s" \
>>   params instance_name="NMS" monitor_use_ssl="no" 
>> monitor_urls="/cse/health" monitor_timeout="120" \
>>   meta migration-threshold="1"
>> primitive ip_11 ocf:heartbeat:IPaddr2 \
>>   op monitor interval="10s" \
>>   params broadcast="172.16.11.31" ip="172.16.11.31" nic="bond0.111" 
>> iflabel="ha" \
>>   meta migration-threshold="1" failure-timeout="10"
>> primitive ip_19 ocf:heartbeat:IPaddr2 \
>>   op monitor interval="10s" \
>>   params broadcast="172.18.19.31" ip="172.18.19.31" nic="bond0.119" 
>> iflabel="ha" \
>>   meta migration-threshold="1" failure-timeout="10"
>> group svc-cse ip_19 ip_11
>> clone cl_tomcat d_tomcat
>> colocation colo_tomcat inf: svc-cse cl_tomcat
>> order order_tomcat inf: cl_tomcat svc-cse
>> property $id="cib-bootstrap-options" \
>>   dc-version="1.1.9-1512.el6-2a917dd" \
>>   cluster-infrastructure="cman" \
>>   pe-warn-series-max="9" \
>>   no-quorum-policy="ignore" \
>>   stonith-enabled="false" \
>>   pe-input-series-max="9" \
>>   pe-error-series-max="9" \
>>   last-lrm-refresh="1367582088"
>> 
>> Currently only 1 node is available, CSE-1.
>> 
>> 
>> This is how I am currently 

Re: [Pacemaker] failure handling on a cloned resource

2013-05-15 Thread Johan Huysmans

Hi Andrew,

Thx!

I tested your github pacemaker repository by building an rpm from it and 
installing it on my testsetup.


Before I could build the rpm I had to change 2 things in the GNUmakefile:
* --without=doc should be --without doc
* --target i686 was missing
If I didn't make these modifications the rpmbuild command failed (on CentOS 6).
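
For reference, the working invocation ends up being roughly of this shape
(illustrative only; the real command, spec file name and source tarball are
generated by the GNUmakefile's rpm target):

  # build a binary rpm with the two corrected flags from above
  rpmbuild -bb --without doc --target i686 pacemaker.spec
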

I performed the test which failed before and everything seems OK.
Once the failing resource was restored the depending resources were 
automatically started.


Thanks for this fast fix!


In which release can I expect this fix? And when is it planned?
For now I will use the head build I created. This is OK for my test setup,
but I don't want to run this version in production.

Greetings,
Johan Huysmans

On 2013-05-10 06:55, Andrew Beekhof wrote:

Fixed!

   https://github.com/beekhof/pacemaker/commit/d87de1b

On 10/05/2013, at 11:59 AM, Andrew Beekhof  wrote:


On 07/05/2013, at 5:15 PM, Johan Huysmans  wrote:


Hi,

I only keep a couple of pe-input files, and that pe-input-1 version was already 
overwritten.
I redid my tests as described in my previous mails.

At the end of the test it was again written to pe-input-1, which is included as 
an attachment.

Perfect.
Basically the PE doesn't know how to correctly recognise that 
d_tomcat_monitor_15000 needs to be processed after d_tomcat_last_failure_0:




which would allow it to recognise that the resource is healthy once again.

I'll see what I can do...


gr.
Johan

On 2013-05-07 04:08, Andrew Beekhof wrote:

I have a much clearer idea of the problem you're seeing now, thank you.

Could you attach /var/lib/pacemaker/pengine/pe-input-1.bz2 from CSE-1 ?

On 03/05/2013, at 10:40 PM, Johan Huysmans  wrote:


Hi,

Below you can see my setup and my test, this shows that my cloned resource with 
on-fail=block does not recover automatically.

My Setup:

# rpm -aq | grep -i pacemaker
pacemaker-libs-1.1.9-1512.el6.i686
pacemaker-cluster-libs-1.1.9-1512.el6.i686
pacemaker-cli-1.1.9-1512.el6.i686
pacemaker-1.1.9-1512.el6.i686

# crm configure show
node CSE-1
node CSE-2
primitive d_tomcat ocf:ntc:tomcat \
   op monitor interval="15s" timeout="510s" on-fail="block" \
   op start interval="0" timeout="510s" \
   params instance_name="NMS" monitor_use_ssl="no" monitor_urls="/cse/health" 
monitor_timeout="120" \
   meta migration-threshold="1"
primitive ip_11 ocf:heartbeat:IPaddr2 \
   op monitor interval="10s" \
   params broadcast="172.16.11.31" ip="172.16.11.31" nic="bond0.111" 
iflabel="ha" \
   meta migration-threshold="1" failure-timeout="10"
primitive ip_19 ocf:heartbeat:IPaddr2 \
   op monitor interval="10s" \
   params broadcast="172.18.19.31" ip="172.18.19.31" nic="bond0.119" 
iflabel="ha" \
   meta migration-threshold="1" failure-timeout="10"
group svc-cse ip_19 ip_11
clone cl_tomcat d_tomcat
colocation colo_tomcat inf: svc-cse cl_tomcat
order order_tomcat inf: cl_tomcat svc-cse
property $id="cib-bootstrap-options" \
   dc-version="1.1.9-1512.el6-2a917dd" \
   cluster-infrastructure="cman" \
   pe-warn-series-max="9" \
   no-quorum-policy="ignore" \
   stonith-enabled="false" \
   pe-input-series-max="9" \
   pe-error-series-max="9" \
   last-lrm-refresh="1367582088"

Currently only 1 node is available, CSE-1.


This is how I am currently testing my setup:

=> Starting point: Everything up and running

# crm resource status
Resource Group: svc-cse
ip_19(ocf::heartbeat:IPaddr2):Started
ip_11(ocf::heartbeat:IPaddr2):Started
Clone Set: cl_tomcat [d_tomcat]
Started: [ CSE-1 ]
Stopped: [ d_tomcat:1 ]

=> Causing failure: Change system so tomcat is running but has a failure (in 
attachment step_2.log)

# crm resource status
Resource Group: svc-cse
ip_19(ocf::heartbeat:IPaddr2):Stopped
ip_11(ocf::heartbeat:IPaddr2):Stopped
Clone Set: cl_tomcat [d_tomcat]
d_tomcat:0(ocf::ntc:tomcat):Started (unmanaged) FAILED
Stopped: [ d_tomcat:1 ]

=> Fixing failure: Revert system so tomcat is running without failure (in 
attachment step_3.log)

# crm resource status
Resource Group: svc-cse
ip_19(ocf::heartbeat:IPaddr2):Stopped
ip_11(ocf::heartbeat:IPaddr2):Stopped
Clone Set: cl_tomcat [d_tomcat]
d_tomcat:0(ocf::ntc:tomcat):Started (unmanaged) FAILED
Stopped: [ d_tomcat:1 ]

As you can see in the logs, the OCF script doesn't return any failure. This is 
noticed by pacemaker; however, it isn't reflected in crm_mon and the depending 
resources aren't started.
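
As a stopgap until a fixed release is out, the blocked state can be cleared by
hand once tomcat is healthy again; either of the following should work, using
the resource name from the configuration above:

  # manually clear the recorded failure so the PE re-evaluates the clone
  crm resource cleanup d_tomcat
  # or the lower-level equivalent
  crm_resource --cleanup --resource d_tomcat
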

Gr.
Johan

On 2013-05-03 03:04, Andrew Beekhof wrote:

On 02/05/2013, at 5:45 PM, Johan Huysmans  wrote:


On 2013-05-01 05:48, Andrew Beekhof wrote:

On 17/04/2013, at 9:54 PM, Johan Huysmans  wrote:


Hi All,

I'm trying to set up a specific configuration in our cluster; however, I'm 
struggling with my configuration.

This is what I'm trying to achieve:
On both nodes of the cluster a daemon must be running (tomcat).
Some failover addresses are configured and must 

Re: [Pacemaker] failure handling on a cloned resource

2013-05-09 Thread Andrew Beekhof
Fixed!

  https://github.com/beekhof/pacemaker/commit/d87de1b

On 10/05/2013, at 11:59 AM, Andrew Beekhof  wrote:

> 
> On 07/05/2013, at 5:15 PM, Johan Huysmans  wrote:
> 
>> Hi,
>> 
>> I only keep a couple of pe-input files, and that pe-input-1 version was 
>> already overwritten.
>> I redid my tests as described in my previous mails.
>> 
>> At the end of the test it was again written to pe-input-1, which is included 
>> as an attachment.
> 
> Perfect.
> Basically the PE doesn't know how to correctly recognise that 
> d_tomcat_monitor_15000 needs to be processed after d_tomcat_last_failure_0:
> 
> <lrm_rsc_op operation_key="d_tomcat_monitor_15000" operation="monitor" 
> crm-debug-origin="do_update_resource" crm_feature_set="3.0.7" 
> transition-key="18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89" 
> transition-magic="0:0;18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89" 
> call-id="44" rc-code="0" op-status="0" interval="15000" 
> last-rc-change="1367910303" exec-time="0" queue-time="0" 
> op-digest="0c738dfc69f09a62b7ebf32344fddcf6"/>
> <lrm_rsc_op operation_key="d_tomcat_monitor_15000" operation="monitor" 
> crm-debug-origin="do_update_resource" crm_feature_set="3.0.7" 
> transition-key="18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89" 
> transition-magic="0:1;18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89" 
> call-id="44" rc-code="1" op-status="0" interval="15000" 
> last-rc-change="1367909258" exec-time="0" queue-time="0" 
> op-digest="0c738dfc69f09a62b7ebf32344fddcf6"/>
> 
> which would allow it to recognise that the resource is healthy once again.
> 
> I'll see what I can do...
> 
>> 
>> gr.
>> Johan
>> 
>> On 2013-05-07 04:08, Andrew Beekhof wrote:
>>> I have a much clearer idea of the problem you're seeing now, thank you.
>>> 
>>> Could you attach /var/lib/pacemaker/pengine/pe-input-1.bz2 from CSE-1 ?
>>> 
>>> On 03/05/2013, at 10:40 PM, Johan Huysmans  wrote:
>>> 
 Hi,
 
 Below you can see my setup and my test, this shows that my cloned resource 
 with on-fail=block does not recover automatically.
 
 My Setup:
 
 # rpm -aq | grep -i pacemaker
 pacemaker-libs-1.1.9-1512.el6.i686
 pacemaker-cluster-libs-1.1.9-1512.el6.i686
 pacemaker-cli-1.1.9-1512.el6.i686
 pacemaker-1.1.9-1512.el6.i686
 
 # crm configure show
 node CSE-1
 node CSE-2
 primitive d_tomcat ocf:ntc:tomcat \
   op monitor interval="15s" timeout="510s" on-fail="block" \
   op start interval="0" timeout="510s" \
   params instance_name="NMS" monitor_use_ssl="no" 
 monitor_urls="/cse/health" monitor_timeout="120" \
   meta migration-threshold="1"
 primitive ip_11 ocf:heartbeat:IPaddr2 \
   op monitor interval="10s" \
   params broadcast="172.16.11.31" ip="172.16.11.31" nic="bond0.111" 
 iflabel="ha" \
   meta migration-threshold="1" failure-timeout="10"
 primitive ip_19 ocf:heartbeat:IPaddr2 \
   op monitor interval="10s" \
   params broadcast="172.18.19.31" ip="172.18.19.31" nic="bond0.119" 
 iflabel="ha" \
   meta migration-threshold="1" failure-timeout="10"
 group svc-cse ip_19 ip_11
 clone cl_tomcat d_tomcat
 colocation colo_tomcat inf: svc-cse cl_tomcat
 order order_tomcat inf: cl_tomcat svc-cse
 property $id="cib-bootstrap-options" \
   dc-version="1.1.9-1512.el6-2a917dd" \
   cluster-infrastructure="cman" \
   pe-warn-series-max="9" \
   no-quorum-policy="ignore" \
   stonith-enabled="false" \
   pe-input-series-max="9" \
   pe-error-series-max="9" \
   last-lrm-refresh="1367582088"
 
 Currently only 1 node is available, CSE-1.
 
 
 This is how I am currently testing my setup:
 
 => Starting point: Everything up and running
 
 # crm resource status
 Resource Group: svc-cse
ip_19(ocf::heartbeat:IPaddr2):Started
ip_11(ocf::heartbeat:IPaddr2):Started
 Clone Set: cl_tomcat [d_tomcat]
Started: [ CSE-1 ]
Stopped: [ d_tomcat:1 ]
 
 => Causing failure: Change system so tomcat is running but has a failure 
 (in attachment step_2.log)
 
 # crm resource status
 Resource Group: svc-cse
ip_19(ocf::heartbeat:IPaddr2):Stopped
ip_11(ocf::heartbeat:IPaddr2):Stopped
 Clone Set: cl_tomcat [d_tomcat]
d_tomcat:0(ocf::ntc:tomcat):Started (unmanaged) FAILED
Stopped: [ d_tomcat:1 ]
 
 => Fixing failure: Revert system so tomcat is running without failure (in 
 attachment step_3.log)
 
 # crm resource status
 Resource Group: svc-cse
ip_19(ocf::heartbeat:IPaddr2):Stopped
ip_11(ocf::heartbeat:IPaddr2):Stopped
 Clone Set: cl_tomcat [d_tomcat]
d_tomcat:0(ocf::ntc:tomcat):Started (unmanaged) FAILED
Stopped: [ d_tomcat:1 ]
 
 As you can see in the logs the OCF script doesn't return any failure. This 
 is noticed by pacemaker,
 h

Re: [Pacemaker] failure handling on a cloned resource

2013-05-09 Thread Andrew Beekhof

On 07/05/2013, at 5:15 PM, Johan Huysmans  wrote:

> Hi,
> 
> I only keep a couple of pe-input files, and that pe-input-1 version was 
> already overwritten.
> I redid my tests as described in my previous mails.
> 
> At the end of the test it was again written to pe-input-1, which is included 
> as an attachment.

Perfect.
Basically the PE doesn't know how to correctly recognise that 
d_tomcat_monitor_15000 needs to be processed after d_tomcat_last_failure_0:




which would allow it to recognise that the resource is healthy once again.

I'll see what I can do...

> 
> gr.
> Johan
> 
> On 2013-05-07 04:08, Andrew Beekhof wrote:
>> I have a much clearer idea of the problem you're seeing now, thank you.
>> 
>> Could you attach /var/lib/pacemaker/pengine/pe-input-1.bz2 from CSE-1 ?
>> 
>> On 03/05/2013, at 10:40 PM, Johan Huysmans  wrote:
>> 
>>> Hi,
>>> 
>>> Below you can see my setup and my test, this shows that my cloned resource 
>>> with on-fail=block does not recover automatically.
>>> 
>>> My Setup:
>>> 
>>> # rpm -aq | grep -i pacemaker
>>> pacemaker-libs-1.1.9-1512.el6.i686
>>> pacemaker-cluster-libs-1.1.9-1512.el6.i686
>>> pacemaker-cli-1.1.9-1512.el6.i686
>>> pacemaker-1.1.9-1512.el6.i686
>>> 
>>> # crm configure show
>>> node CSE-1
>>> node CSE-2
>>> primitive d_tomcat ocf:ntc:tomcat \
>>>op monitor interval="15s" timeout="510s" on-fail="block" \
>>>op start interval="0" timeout="510s" \
>>>params instance_name="NMS" monitor_use_ssl="no" 
>>> monitor_urls="/cse/health" monitor_timeout="120" \
>>>meta migration-threshold="1"
>>> primitive ip_11 ocf:heartbeat:IPaddr2 \
>>>op monitor interval="10s" \
>>>params broadcast="172.16.11.31" ip="172.16.11.31" nic="bond0.111" 
>>> iflabel="ha" \
>>>meta migration-threshold="1" failure-timeout="10"
>>> primitive ip_19 ocf:heartbeat:IPaddr2 \
>>>op monitor interval="10s" \
>>>params broadcast="172.18.19.31" ip="172.18.19.31" nic="bond0.119" 
>>> iflabel="ha" \
>>>meta migration-threshold="1" failure-timeout="10"
>>> group svc-cse ip_19 ip_11
>>> clone cl_tomcat d_tomcat
>>> colocation colo_tomcat inf: svc-cse cl_tomcat
>>> order order_tomcat inf: cl_tomcat svc-cse
>>> property $id="cib-bootstrap-options" \
>>>dc-version="1.1.9-1512.el6-2a917dd" \
>>>cluster-infrastructure="cman" \
>>>pe-warn-series-max="9" \
>>>no-quorum-policy="ignore" \
>>>stonith-enabled="false" \
>>>pe-input-series-max="9" \
>>>pe-error-series-max="9" \
>>>last-lrm-refresh="1367582088"
>>> 
>>> Currently only 1 node is available, CSE-1.
>>> 
>>> 
>>> This is how I am currently testing my setup:
>>> 
>>> => Starting point: Everything up and running
>>> 
>>> # crm resource status
>>> Resource Group: svc-cse
>>> ip_19(ocf::heartbeat:IPaddr2):Started
>>> ip_11(ocf::heartbeat:IPaddr2):Started
>>> Clone Set: cl_tomcat [d_tomcat]
>>> Started: [ CSE-1 ]
>>> Stopped: [ d_tomcat:1 ]
>>> 
>>> => Causing failure: Change system so tomcat is running but has a failure 
>>> (in attachment step_2.log)
>>> 
>>> # crm resource status
>>> Resource Group: svc-cse
>>> ip_19(ocf::heartbeat:IPaddr2):Stopped
>>> ip_11(ocf::heartbeat:IPaddr2):Stopped
>>> Clone Set: cl_tomcat [d_tomcat]
>>> d_tomcat:0(ocf::ntc:tomcat):Started (unmanaged) FAILED
>>> Stopped: [ d_tomcat:1 ]
>>> 
>>> => Fixing failure: Revert system so tomcat is running without failure (in 
>>> attachment step_3.log)
>>> 
>>> # crm resource status
>>> Resource Group: svc-cse
>>> ip_19(ocf::heartbeat:IPaddr2):Stopped
>>> ip_11(ocf::heartbeat:IPaddr2):Stopped
>>> Clone Set: cl_tomcat [d_tomcat]
>>> d_tomcat:0(ocf::ntc:tomcat):Started (unmanaged) FAILED
>>> Stopped: [ d_tomcat:1 ]
>>> 
>>> As you can see in the logs, the OCF script doesn't return any failure. This 
>>> is noticed by pacemaker; however, it isn't reflected in crm_mon and the 
>>> depending resources aren't started.
>>> 
>>> Gr.
>>> Johan
>>> 
>>> On 2013-05-03 03:04, Andrew Beekhof wrote:
 On 02/05/2013, at 5:45 PM, Johan Huysmans  wrote:
 
> On 2013-05-01 05:48, Andrew Beekhof wrote:
>> On 17/04/2013, at 9:54 PM, Johan Huysmans  
>> wrote:
>> 
>>> Hi All,
>>> 
>>> I'm trying to setup a specific configuration in our cluster, however 
>>> I'm struggling with my configuration.
>>> 
>>> This is what I'm trying to achieve:
>>> On both nodes of the cluster a daemon must be running (tomcat).
>>> Some failover addresses are configured and must be running on the node 
>>> with a correctly running tomcat.
>>> 
>>> I have this achieved with a cloned tomcat resource and a colocation 
>>> between the cloned tomcat and the failover addresses.
>>> When I cause a failure in the tomcat on the node running the failover 
>>> addresses, the failover addresses will failover to the other node as 
>>> expected.
>>> crm_mo

Re: [Pacemaker] failure handling on a cloned resource

2013-05-07 Thread Johan Huysmans

Hi,

I only keep a couple of pe-input files, and that pe-input-1 version was 
already overwritten.

I redid my tests as described in my previous mails.

At the end of the test it was again written to pe-input-1, which is 
included as an attachment.


gr.
Johan

On 2013-05-07 04:08, Andrew Beekhof wrote:

I have a much clearer idea of the problem you're seeing now, thank you.

Could you attach /var/lib/pacemaker/pengine/pe-input-1.bz2 from CSE-1 ?

On 03/05/2013, at 10:40 PM, Johan Huysmans  wrote:


Hi,

Below you can see my setup and my test, this shows that my cloned resource with 
on-fail=block does not recover automatically.

My Setup:

# rpm -aq | grep -i pacemaker
pacemaker-libs-1.1.9-1512.el6.i686
pacemaker-cluster-libs-1.1.9-1512.el6.i686
pacemaker-cli-1.1.9-1512.el6.i686
pacemaker-1.1.9-1512.el6.i686

# crm configure show
node CSE-1
node CSE-2
primitive d_tomcat ocf:ntc:tomcat \
op monitor interval="15s" timeout="510s" on-fail="block" \
op start interval="0" timeout="510s" \
params instance_name="NMS" monitor_use_ssl="no" monitor_urls="/cse/health" 
monitor_timeout="120" \
meta migration-threshold="1"
primitive ip_11 ocf:heartbeat:IPaddr2 \
op monitor interval="10s" \
params broadcast="172.16.11.31" ip="172.16.11.31" nic="bond0.111" 
iflabel="ha" \
meta migration-threshold="1" failure-timeout="10"
primitive ip_19 ocf:heartbeat:IPaddr2 \
op monitor interval="10s" \
params broadcast="172.18.19.31" ip="172.18.19.31" nic="bond0.119" 
iflabel="ha" \
meta migration-threshold="1" failure-timeout="10"
group svc-cse ip_19 ip_11
clone cl_tomcat d_tomcat
colocation colo_tomcat inf: svc-cse cl_tomcat
order order_tomcat inf: cl_tomcat svc-cse
property $id="cib-bootstrap-options" \
dc-version="1.1.9-1512.el6-2a917dd" \
cluster-infrastructure="cman" \
pe-warn-series-max="9" \
no-quorum-policy="ignore" \
stonith-enabled="false" \
pe-input-series-max="9" \
pe-error-series-max="9" \
last-lrm-refresh="1367582088"

Currently only 1 node is available, CSE-1.


This is how I am currently testing my setup:

=> Starting point: Everything up and running

# crm resource status
Resource Group: svc-cse
 ip_19(ocf::heartbeat:IPaddr2):Started
 ip_11(ocf::heartbeat:IPaddr2):Started
Clone Set: cl_tomcat [d_tomcat]
 Started: [ CSE-1 ]
 Stopped: [ d_tomcat:1 ]

=> Causing failure: Change system so tomcat is running but has a failure (in 
attachment step_2.log)

# crm resource status
Resource Group: svc-cse
 ip_19(ocf::heartbeat:IPaddr2):Stopped
 ip_11(ocf::heartbeat:IPaddr2):Stopped
Clone Set: cl_tomcat [d_tomcat]
 d_tomcat:0(ocf::ntc:tomcat):Started (unmanaged) FAILED
 Stopped: [ d_tomcat:1 ]

=> Fixing failure: Revert system so tomcat is running without failure (in 
attachment step_3.log)

# crm resource status
Resource Group: svc-cse
 ip_19(ocf::heartbeat:IPaddr2):Stopped
 ip_11(ocf::heartbeat:IPaddr2):Stopped
Clone Set: cl_tomcat [d_tomcat]
 d_tomcat:0(ocf::ntc:tomcat):Started (unmanaged) FAILED
 Stopped: [ d_tomcat:1 ]

As you can see in the logs, the OCF script doesn't return any failure. This is 
noticed by pacemaker; however, it isn't reflected in crm_mon and the depending 
resources aren't started.

Gr.
Johan

On 2013-05-03 03:04, Andrew Beekhof wrote:

On 02/05/2013, at 5:45 PM, Johan Huysmans  wrote:


On 2013-05-01 05:48, Andrew Beekhof wrote:

On 17/04/2013, at 9:54 PM, Johan Huysmans  wrote:


Hi All,

I'm trying to setup a specific configuration in our cluster, however I'm 
struggling with my configuration.

This is what I'm trying to achieve:
On both nodes of the cluster a daemon must be running (tomcat).
Some failover addresses are configured and must be running on the node with a 
correctly running tomcat.

I have this achieved with a cloned tomcat resource and a colocation between 
the cloned tomcat and the failover addresses.
When I cause a failure in the tomcat on the node running the failover 
addresses, the failover addresses will failover to the other node as expected.
crm_mon shows that this tomcat has a failure.
When I configure the tomcat resource with failure-timeout=0, the failure alarm 
in crm_mon isn't cleared whenever the tomcat failure is fixed.

All sounds right so far.

If my broken tomcat is automatically fixed, I expect this to be noticed by 
pacemaker and that that node will be able to run my failover addresses,
however I don't see this happening.

This is very hard to discuss without seeing logs.

So you created a tomcat error, waited for pacemaker to notice, fixed the error 
and observed that pacemaker did not re-notice?
How long did you wait? More than the 15s repeat interval I assume?  Did at 
least the resource agent notice?


When I configure the tomcat resource with failure-timeout=30, the failure alarm 
in crm_mon is cleared after 30 seconds; however, the tomcat is still having a 
failure.

Can you define "still

Re: [Pacemaker] failure handling on a cloned resource

2013-05-06 Thread Andrew Beekhof
I have a much clearer idea of the problem you're seeing now, thank you.

Could you attach /var/lib/pacemaker/pengine/pe-input-1.bz2 from CSE-1 ?

On 03/05/2013, at 10:40 PM, Johan Huysmans  wrote:

> Hi,
> 
> Below you can see my setup and my test, this shows that my cloned resource 
> with on-fail=block does not recover automatically.
> 
> My Setup:
> 
> # rpm -aq | grep -i pacemaker
> pacemaker-libs-1.1.9-1512.el6.i686
> pacemaker-cluster-libs-1.1.9-1512.el6.i686
> pacemaker-cli-1.1.9-1512.el6.i686
> pacemaker-1.1.9-1512.el6.i686
> 
> # crm configure show
> node CSE-1
> node CSE-2
> primitive d_tomcat ocf:ntc:tomcat \
>op monitor interval="15s" timeout="510s" on-fail="block" \
>op start interval="0" timeout="510s" \
>params instance_name="NMS" monitor_use_ssl="no" monitor_urls="/cse/health" 
> monitor_timeout="120" \
>meta migration-threshold="1"
> primitive ip_11 ocf:heartbeat:IPaddr2 \
>op monitor interval="10s" \
>params broadcast="172.16.11.31" ip="172.16.11.31" nic="bond0.111" 
> iflabel="ha" \
>meta migration-threshold="1" failure-timeout="10"
> primitive ip_19 ocf:heartbeat:IPaddr2 \
>op monitor interval="10s" \
>params broadcast="172.18.19.31" ip="172.18.19.31" nic="bond0.119" 
> iflabel="ha" \
>meta migration-threshold="1" failure-timeout="10"
> group svc-cse ip_19 ip_11
> clone cl_tomcat d_tomcat
> colocation colo_tomcat inf: svc-cse cl_tomcat
> order order_tomcat inf: cl_tomcat svc-cse
> property $id="cib-bootstrap-options" \
>dc-version="1.1.9-1512.el6-2a917dd" \
>cluster-infrastructure="cman" \
>pe-warn-series-max="9" \
>no-quorum-policy="ignore" \
>stonith-enabled="false" \
>pe-input-series-max="9" \
>pe-error-series-max="9" \
>last-lrm-refresh="1367582088"
> 
> Currently only 1 node is available, CSE-1.
> 
> 
> This is how I am currently testing my setup:
> 
> => Starting point: Everything up and running
> 
> # crm resource status
> Resource Group: svc-cse
> ip_19(ocf::heartbeat:IPaddr2):Started
> ip_11(ocf::heartbeat:IPaddr2):Started
> Clone Set: cl_tomcat [d_tomcat]
> Started: [ CSE-1 ]
> Stopped: [ d_tomcat:1 ]
> 
> => Causing failure: Change system so tomcat is running but has a failure (in 
> attachment step_2.log)
> 
> # crm resource status
> Resource Group: svc-cse
> ip_19(ocf::heartbeat:IPaddr2):Stopped
> ip_11(ocf::heartbeat:IPaddr2):Stopped
> Clone Set: cl_tomcat [d_tomcat]
> d_tomcat:0(ocf::ntc:tomcat):Started (unmanaged) FAILED
> Stopped: [ d_tomcat:1 ]
> 
> => Fixing failure: Revert system so tomcat is running without failure (in 
> attachment step_3.log)
> 
> # crm resource status
> Resource Group: svc-cse
> ip_19(ocf::heartbeat:IPaddr2):Stopped
> ip_11(ocf::heartbeat:IPaddr2):Stopped
> Clone Set: cl_tomcat [d_tomcat]
> d_tomcat:0(ocf::ntc:tomcat):Started (unmanaged) FAILED
> Stopped: [ d_tomcat:1 ]
> 
> As you can see in the logs, the OCF script doesn't return any failure. This is 
> noticed by pacemaker; however, it isn't reflected in crm_mon and the depending 
> resources aren't started.
> 
> Gr.
> Johan
> 
> On 2013-05-03 03:04, Andrew Beekhof wrote:
>> On 02/05/2013, at 5:45 PM, Johan Huysmans  wrote:
>> 
>>> On 2013-05-01 05:48, Andrew Beekhof wrote:
 On 17/04/2013, at 9:54 PM, Johan Huysmans  wrote:
 
> Hi All,
> 
> I'm trying to setup a specific configuration in our cluster, however I'm 
> struggling with my configuration.
> 
> This is what I'm trying to achieve:
> On both nodes of the cluster a daemon must be running (tomcat).
> Some failover addresses are configured and must be running on the node 
> with a correctly running tomcat.
> 
> I have this achieved with a cloned tomcat resource and a colocation 
> between the cloned tomcat and the failover addresses.
> When I cause a failure in the tomcat on the node running the failover 
> addresses, the failover addresses will failover to the other node as 
> expected.
> crm_mon shows that this tomcat has a failure.
> When I configure the tomcat resource with failure-timeout=0, the failure 
> alarm in crm_mon isn't cleared whenever the tomcat failure is fixed.
 All sounds right so far.
>>> If my broken tomcat is automatically fixed, I expect this to be noticed by 
>>> pacemaker and that that node will be able to run my failover addresses,
>>> however I don't see this happening.
>> This is very hard to discuss without seeing logs.
>> 
>> So you created a tomcat error, waited for pacemaker to notice, fixed the 
>> error and observed the pacemaker did not re-notice?
>> How long did you wait? More than the 15s repeat interval I assume?  Did at 
>> least the resource agent notice?
>> 
> When I configure the tomcat resource with failure-timeout=30, the failure 
> alarm in crm_mon is cleared after 30seconds however the tomcat is still 
> having a fai

Re: [Pacemaker] failure handling on a cloned resource

2013-05-03 Thread Johan Huysmans

Hi,

Below you can see my setup and my test, this shows that my cloned 
resource with on-fail=block does not recover automatically.


My Setup:

# rpm -aq | grep -i pacemaker
pacemaker-libs-1.1.9-1512.el6.i686
pacemaker-cluster-libs-1.1.9-1512.el6.i686
pacemaker-cli-1.1.9-1512.el6.i686
pacemaker-1.1.9-1512.el6.i686

# crm configure show
node CSE-1
node CSE-2
primitive d_tomcat ocf:ntc:tomcat \
op monitor interval="15s" timeout="510s" on-fail="block" \
op start interval="0" timeout="510s" \
params instance_name="NMS" monitor_use_ssl="no" 
monitor_urls="/cse/health" monitor_timeout="120" \

meta migration-threshold="1"
primitive ip_11 ocf:heartbeat:IPaddr2 \
op monitor interval="10s" \
params broadcast="172.16.11.31" ip="172.16.11.31" nic="bond0.111" 
iflabel="ha" \

meta migration-threshold="1" failure-timeout="10"
primitive ip_19 ocf:heartbeat:IPaddr2 \
op monitor interval="10s" \
params broadcast="172.18.19.31" ip="172.18.19.31" nic="bond0.119" 
iflabel="ha" \

meta migration-threshold="1" failure-timeout="10"
group svc-cse ip_19 ip_11
clone cl_tomcat d_tomcat
colocation colo_tomcat inf: svc-cse cl_tomcat
order order_tomcat inf: cl_tomcat svc-cse
property $id="cib-bootstrap-options" \
dc-version="1.1.9-1512.el6-2a917dd" \
cluster-infrastructure="cman" \
pe-warn-series-max="9" \
no-quorum-policy="ignore" \
stonith-enabled="false" \
pe-input-series-max="9" \
pe-error-series-max="9" \
last-lrm-refresh="1367582088"

Currently only 1 node is available, CSE-1.


This is how I am currently testing my setup:

=> Starting point: Everything up and running

# crm resource status
 Resource Group: svc-cse
 ip_19(ocf::heartbeat:IPaddr2):Started
 ip_11(ocf::heartbeat:IPaddr2):Started
 Clone Set: cl_tomcat [d_tomcat]
 Started: [ CSE-1 ]
 Stopped: [ d_tomcat:1 ]

=> Causing failure: Change system so tomcat is running but has a failure 
(in attachment step_2.log)


# crm resource status
 Resource Group: svc-cse
 ip_19(ocf::heartbeat:IPaddr2):Stopped
 ip_11(ocf::heartbeat:IPaddr2):Stopped
 Clone Set: cl_tomcat [d_tomcat]
 d_tomcat:0(ocf::ntc:tomcat):Started (unmanaged) FAILED
 Stopped: [ d_tomcat:1 ]

=> Fixing failure: Revert system so tomcat is running without failure 
(in attachment step_3.log)


# crm resource status
 Resource Group: svc-cse
 ip_19(ocf::heartbeat:IPaddr2):Stopped
 ip_11(ocf::heartbeat:IPaddr2):Stopped
 Clone Set: cl_tomcat [d_tomcat]
 d_tomcat:0(ocf::ntc:tomcat):Started (unmanaged) FAILED
 Stopped: [ d_tomcat:1 ]

As you can see in the logs, the OCF script doesn't return any failure. 
This is noticed by pacemaker; however, it isn't reflected in crm_mon and 
the depending resources aren't started.


Gr.
Johan

On 2013-05-03 03:04, Andrew Beekhof wrote:

On 02/05/2013, at 5:45 PM, Johan Huysmans  wrote:


On 2013-05-01 05:48, Andrew Beekhof wrote:

On 17/04/2013, at 9:54 PM, Johan Huysmans  wrote:


Hi All,

I'm trying to setup a specific configuration in our cluster, however I'm 
struggling with my configuration.

This is what I'm trying to achieve:
On both nodes of the cluster a daemon must be running (tomcat).
Some failover addresses are configured and must be running on the node with a 
correctly running tomcat.

I have this achieved with a cloned tomcat resource and a colocation between 
the cloned tomcat and the failover addresses.
When I cause a failure in the tomcat on the node running the failover 
addresses, the failover addresses will failover to the other node as expected.
crm_mon shows that this tomcat has a failure.
When I configure the tomcat resource with failure-timeout=0, the failure alarm 
in crm_mon isn't cleared whenever the tomcat failure is fixed.

All sounds right so far.

If my broken tomcat is automatically fixed, I expect this to be noticed by 
pacemaker and that that node will be able to run my failover addresses,
however I don't see this happening.

This is very hard to discuss without seeing logs.

So you created a tomcat error, waited for pacemaker to notice, fixed the error 
and observed that pacemaker did not re-notice?
How long did you wait? More than the 15s repeat interval I assume?  Did at 
least the resource agent notice?


When I configure the tomcat resource with failure-timeout=30, the failure alarm 
in crm_mon is cleared after 30 seconds; however, the tomcat is still having a 
failure.

Can you define "still having a failure"?
You mean it still shows up in crm_mon?
Have you read this link?

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-rules-recheck.html

"Still having a failure" means that the tomcat is still broken and my OCF 
script reports it as a failure.

What I expect is that pacemaker reports the failure for as long as it exists, 
and that pacemaker reports that everything is ok once everything is back ok.

Do I do so

Re: [Pacemaker] failure handling on a cloned resource

2013-05-02 Thread Andrew Beekhof

On 02/05/2013, at 5:45 PM, Johan Huysmans  wrote:

> 
> On 2013-05-01 05:48, Andrew Beekhof wrote:
>> On 17/04/2013, at 9:54 PM, Johan Huysmans  wrote:
>> 
>>> Hi All,
>>> 
>>> I'm trying to setup a specific configuration in our cluster, however I'm 
>>> struggling with my configuration.
>>> 
>>> This is what I'm trying to achieve:
>>> On both nodes of the cluster a daemon must be running (tomcat).
>>> Some failover addresses are configured and must be running on the node with 
>>> a correctly running tomcat.
>>> 
>>> I have this achieved with a cloned tomcat resource and a colocation 
>>> between the cloned tomcat and the failover addresses.
>>> When I cause a failure in the tomcat on the node running the failover 
>>> addresses, the failover addresses will failover to the other node as 
>>> expected.
>>> crm_mon shows that this tomcat has a failure.
>>> When I configure the tomcat resource with failure-timeout=0, the failure 
>>> alarm in crm_mon isn't cleared whenever the tomcat failure is fixed.
>> All sounds right so far.
> If my broken tomcat is automatically fixed, I expect this to be noticed by 
> pacemaker and that that node will be able to run my failover addresses,
> however I don't see this happening.

This is very hard to discuss without seeing logs.

So you created a tomcat error, waited for pacemaker to notice, fixed the error 
and observed that pacemaker did not re-notice?
How long did you wait? More than the 15s repeat interval I assume?  Did at 
least the resource agent notice?
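
One way to see what the agent itself reports, independent of pacemaker, is to
invoke its monitor action by hand; the provider path and parameters below are
taken from the configuration quoted in this thread and are only illustrative:

  # run the agent's monitor action directly and inspect its exit code
  export OCF_ROOT=/usr/lib/ocf
  export OCF_RESKEY_instance_name="NMS"
  export OCF_RESKEY_monitor_urls="/cse/health"
  export OCF_RESKEY_monitor_use_ssl="no"
  export OCF_RESKEY_monitor_timeout="120"
  /usr/lib/ocf/resource.d/custom/tomcat monitor; echo "exit code: $?"
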

>> 
>>> When I configure the tomcat resource with failure-timeout=30, the failure 
>>> alarm in crm_mon is cleared after 30 seconds; however, the tomcat is still 
>>> having a failure.
>> Can you define "still having a failure"?
>> You mean it still shows up in crm_mon?
>> Have you read this link?
>>
>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-rules-recheck.html
> "Still having a failure" means that the tomcat is still broken and my OCF 
> script reports it as a failure.
>> 
>>> What I expect is that pacemaker reports the failure as the failure exists 
>>> and as long as it exists and that pacemaker reports that everything is ok 
>>> once everything is back ok.
>>> 
>>> Do I do something wrong with my configuration?
>>> Or how can I achieve my wanted setup?
>>> 
>>> Here is my configuration:
>>> 
>>> node CSE-1
>>> node CSE-2
>>> primitive d_tomcat ocf:custom:tomcat \
>>>op monitor interval="15s" timeout="510s" on-fail="block" \
>>>op start interval="0" timeout="510s" \
>>>params instance_name="NMS" monitor_use_ssl="no" 
>>> monitor_urls="/cse/health" monitor_timeout="120" \
>>>meta migration-threshold="1" failure-timeout="0"
>>> primitive ip_1 ocf:heartbeat:IPaddr2 \
>>>op monitor interval="10s" \
>>>params nic="bond0" broadcast="10.1.1.1" iflabel="ha" ip="10.1.1.1"
>>> primitive ip_2 ocf:heartbeat:IPaddr2 \
>>>op monitor interval="10s" \
>>>params nic="bond0" broadcast="10.1.1.2" iflabel="ha" ip="10.1.1.2"
>>> group svc-cse ip_1 ip_2
>>> clone cl_tomcat d_tomcat
>>> colocation colo_tomcat inf: svc-cse cl_tomcat
>>> order order_tomcat inf: cl_tomcat svc-cse
>>> property $id="cib-bootstrap-options" \
>>>dc-version="1.1.8-7.el6-394e906" \
>>>cluster-infrastructure="cman" \
>>>no-quorum-policy="ignore" \
>>>stonith-enabled="false"
>>> 
>>> Thanks!
>>> 
>>> Greetings,
>>> Johan Huysmans
>>> 
>>> ___
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>> 
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>> 
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> 
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] failure handling on a cloned resource

2013-05-02 Thread Johan Huysmans


On 2013-05-01 05:48, Andrew Beekhof wrote:

On 17/04/2013, at 9:54 PM, Johan Huysmans  wrote:


Hi All,

I'm trying to set up a specific configuration in our cluster; however, I'm 
struggling with my configuration.

This is what I'm trying to achieve:
On both nodes of the cluster a daemon must be running (tomcat).
Some failover addresses are configured and must be running on the node with a 
correctly running tomcat.

I have this achieved with a cloned tomcat resource and a colocation between 
the cloned tomcat and the failover addresses.
When I cause a failure in the tomcat on the node running the failover 
addresses, the failover addresses will failover to the other node as expected.
crm_mon shows that this tomcat has a failure.
When I configure the tomcat resource with failure-timeout=0, the failure alarm 
in crm_mon isn't cleared whenever the tomcat failure is fixed.

All sounds right so far.
If my broken tomcat is automatically fixed, I expect this to be noticed 
by pacemaker and that that node will be able to run my failover addresses; 
however, I don't see this happening.



When I configure the tomcat resource with failure-timeout=30, the failure alarm 
in crm_mon is cleared after 30 seconds; however, the tomcat is still having a 
failure.

Can you define "still having a failure"?
You mean it still shows up in crm_mon?
Have you read this link?

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-rules-recheck.html
"Still having a failure" means that the tomcat is still broken and my 
OCF script reports it as a failure.



What I expect is that pacemaker reports the failure for as long as it exists, 
and that pacemaker reports that everything is ok once everything is back ok.

Do I do something wrong with my configuration?
Or how can I achieve my wanted setup?

Here is my configuration:

node CSE-1
node CSE-2
primitive d_tomcat ocf:custom:tomcat \
op monitor interval="15s" timeout="510s" on-fail="block" \
op start interval="0" timeout="510s" \
params instance_name="NMS" monitor_use_ssl="no" monitor_urls="/cse/health" 
monitor_timeout="120" \
meta migration-threshold="1" failure-timeout="0"
primitive ip_1 ocf:heartbeat:IPaddr2 \
op monitor interval="10s" \
params nic="bond0" broadcast="10.1.1.1" iflabel="ha" ip="10.1.1.1"
primitive ip_2 ocf:heartbeat:IPaddr2 \
op monitor interval="10s" \
params nic="bond0" broadcast="10.1.1.2" iflabel="ha" ip="10.1.1.2"
group svc-cse ip_1 ip_2
clone cl_tomcat d_tomcat
colocation colo_tomcat inf: svc-cse cl_tomcat
order order_tomcat inf: cl_tomcat svc-cse
property $id="cib-bootstrap-options" \
dc-version="1.1.8-7.el6-394e906" \
cluster-infrastructure="cman" \
no-quorum-policy="ignore" \
stonith-enabled="false"

Thanks!

Greetings,
Johan Huysmans

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] failure handling on a cloned resource

2013-04-30 Thread Andrew Beekhof

On 17/04/2013, at 9:54 PM, Johan Huysmans  wrote:

> Hi All,
> 
> I'm trying to setup a specific configuration in our cluster, however I'm 
> struggling with my configuration.
> 
> This is what I'm trying to achieve:
> On both nodes of the cluster a daemon must be running (tomcat).
> Some failover addresses are configured and must be running on the node with a 
> correctly running tomcat.
> 
> I have this achieved with a cloned tomcat resource and a colocation between 
> the cloned tomcat and the failover addresses.
> When I cause a failure in the tomcat on the node running the failover 
> addresses, the failover addresses will failover to the other node as expected.
> crm_mon shows that this tomcat has a failure.
> When I configure the tomcat resource with failure-timeout=0, the failure 
> alarm in crm_mon isn't cleared whenever the tomcat failure is fixed.

All sounds right so far.

> When I configure the tomcat resource with failure-timeout=30, the failure 
> alarm in crm_mon is cleared after 30 seconds; however, the tomcat is still 
> having a failure.

Can you define "still having a failure"?
You mean it still shows up in crm_mon?
Have you read this link?
   
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-rules-recheck.html
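
The gist of that link: failure-timeout expiry is only evaluated when the policy
engine runs, so on an otherwise quiet cluster it can take up to
cluster-recheck-interval before the expiry is acted on. Lowering that interval
helps; the value below is only an example:

  # have the cluster re-run the policy engine more often (example value)
  crm configure property cluster-recheck-interval="60s"
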

> 
> What I expect is that pacemaker reports the failure as the failure exists and 
> as long as it exists and that pacemaker reports that everything is ok once 
> everything is back ok.
> 
> Do I do something wrong with my configuration?
> Or how can I achieve my wanted setup?
> 
> Here is my configuration:
> 
> node CSE-1
> node CSE-2
> primitive d_tomcat ocf:custom:tomcat \
>op monitor interval="15s" timeout="510s" on-fail="block" \
>op start interval="0" timeout="510s" \
>params instance_name="NMS" monitor_use_ssl="no" monitor_urls="/cse/health" 
> monitor_timeout="120" \
>meta migration-threshold="1" failure-timeout="0"
> primitive ip_1 ocf:heartbeat:IPaddr2 \
>op monitor interval="10s" \
>params nic="bond0" broadcast="10.1.1.1" iflabel="ha" ip="10.1.1.1"
> primitive ip_2 ocf:heartbeat:IPaddr2 \
>op monitor interval="10s" \
>params nic="bond0" broadcast="10.1.1.2" iflabel="ha" ip="10.1.1.2"
> group svc-cse ip_1 ip_2
> clone cl_tomcat d_tomcat
> colocation colo_tomcat inf: svc-cse cl_tomcat
> order order_tomcat inf: cl_tomcat svc-cse
> property $id="cib-bootstrap-options" \
>dc-version="1.1.8-7.el6-394e906" \
>cluster-infrastructure="cman" \
>no-quorum-policy="ignore" \
>stonith-enabled="false"
> 
> Thanks!
> 
> Greetings,
> Johan Huysmans
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] failure handling on a cloned resource

2013-04-22 Thread Johan Huysmans

Hi All,

I've created a bug for this issue I'm having:
http://bugs.clusterlabs.org/show_bug.cgi?id=5154

I think this is a bug, since this worked on older releases.

Can someone verify whether it really is a bug, or just a configuration mistake?
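
For the bug report, a crm_report archive covering the test window usually gives
the developers the logs, CIB and PE inputs they need; the time range and output
name below are placeholders:

  # collect cluster logs and PE inputs from around the time of the test
  crm_report -f "2013-04-22 09:00" -t "2013-04-22 10:00" /tmp/cl5154-report
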

Thanks!

Greetings,
Johan Huysmans


On 17-04-13 13:54, Johan Huysmans wrote:

Hi All,

I'm trying to set up a specific configuration in our cluster; however, 
I'm struggling with my configuration.


This is what I'm trying to achieve:
On both nodes of the cluster a daemon must be running (tomcat).
Some failover addresses are configured and must be running on the node 
with a correctly running tomcat.


I have this achieved with a cloned tomcat resource and a colocation 
between the cloned tomcat and the failover addresses.
When I cause a failure in the tomcat on the node running the failover 
addresses, the failover addresses will failover to the other node as 
expected.

crm_mon shows that this tomcat has a failure.
When I configure the tomcat resource with failure-timeout=0, the 
failure alarm in crm_mon isn't cleared whenever the tomcat failure is 
fixed.
When I configure the tomcat resource with failure-timeout=30, the 
failure alarm in crm_mon is cleared after 30 seconds; however, the tomcat 
is still having a failure.


What I expect is that pacemaker reports the failure for as long as it 
exists, and that pacemaker reports that everything is ok once everything 
is back ok.


Do I do something wrong with my configuration?
Or how can I achieve my wanted setup?

Here is my configuration:

node CSE-1
node CSE-2
primitive d_tomcat ocf:custom:tomcat \
op monitor interval="15s" timeout="510s" on-fail="block" \
op start interval="0" timeout="510s" \
params instance_name="NMS" monitor_use_ssl="no" 
monitor_urls="/cse/health" monitor_timeout="120" \

meta migration-threshold="1" failure-timeout="0"
primitive ip_1 ocf:heartbeat:IPaddr2 \
op monitor interval="10s" \
params nic="bond0" broadcast="10.1.1.1" iflabel="ha" ip="10.1.1.1"
primitive ip_2 ocf:heartbeat:IPaddr2 \
op monitor interval="10s" \
params nic="bond0" broadcast="10.1.1.2" iflabel="ha" ip="10.1.1.2"
group svc-cse ip_1 ip_2
clone cl_tomcat d_tomcat
colocation colo_tomcat inf: svc-cse cl_tomcat
order order_tomcat inf: cl_tomcat svc-cse
property $id="cib-bootstrap-options" \
dc-version="1.1.8-7.el6-394e906" \
cluster-infrastructure="cman" \
no-quorum-policy="ignore" \
stonith-enabled="false"

Thanks!

Greetings,
Johan Huysmans

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] failure handling on a cloned resource

2013-04-17 Thread Johan Huysmans

Hi All,

I'm trying to set up a specific configuration in our cluster; however, I'm 
struggling with my configuration.


This is what I'm trying to achieve:
On both nodes of the cluster a daemon must be running (tomcat).
Some failover addresses are configured and must be running on the node 
with a correctly running tomcat.


I have this achieved with a cloned tomcat resource and a colocation 
between the cloned tomcat and the failover addresses.
When I cause a failure in the tomcat on the node running the failover 
addresses, the failover addresses will failover to the other node as 
expected.

crm_mon shows that this tomcat has a failure.
When I configure the tomcat resource with failure-timeout=0, the failure 
alarm in crm_mon isn't cleared whenever the tomcat failure is fixed.
When I configure the tomcat resource with failure-timeout=30, the 
failure alarm in crm_mon is cleared after 30 seconds; however, the tomcat 
is still having a failure.


What I expect is that pacemaker reports the failure for as long as it 
exists, and that pacemaker reports that everything is ok once everything 
is back ok.


Do I do something wrong with my configuration?
Or how can I achieve my wanted setup?

Here is my configuration:

node CSE-1
node CSE-2
primitive d_tomcat ocf:custom:tomcat \
op monitor interval="15s" timeout="510s" on-fail="block" \
op start interval="0" timeout="510s" \
params instance_name="NMS" monitor_use_ssl="no" 
monitor_urls="/cse/health" monitor_timeout="120" \

meta migration-threshold="1" failure-timeout="0"
primitive ip_1 ocf:heartbeat:IPaddr2 \
op monitor interval="10s" \
params nic="bond0" broadcast="10.1.1.1" iflabel="ha" ip="10.1.1.1"
primitive ip_2 ocf:heartbeat:IPaddr2 \
op monitor interval="10s" \
params nic="bond0" broadcast="10.1.1.2" iflabel="ha" ip="10.1.1.2"
group svc-cse ip_1 ip_2
clone cl_tomcat d_tomcat
colocation colo_tomcat inf: svc-cse cl_tomcat
order order_tomcat inf: cl_tomcat svc-cse
property $id="cib-bootstrap-options" \
dc-version="1.1.8-7.el6-394e906" \
cluster-infrastructure="cman" \
no-quorum-policy="ignore" \
stonith-enabled="false"
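
When experimenting with this, a dry run of the policy engine against the live
CIB can show what pacemaker intends to do with a recorded failure; illustrative
invocation, run on the DC node:

  # print the transition the policy engine would compute, with allocation scores
  crm_simulate -L -s
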

Thanks!

Greetings,
Johan Huysmans

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org