Re: [Pacemaker] 1st monitor is too fast after the start

2010-10-13 Thread Dan Frincu

Hi,

I've noticed the same type of behavior, although in a different context:
my setup includes 3 drbd devices and a group of resources, all of which
have to run on the same node and move together to other nodes. My issue
was with the first resource that required access to a drbd device, namely
the ocf:heartbeat:Filesystem RA trying to do a mount and failing.


The reason: it was trying to mount the drbd device before the drbd
device had finished migrating to the Primary state. Like you, I
introduced a start-delay, but on the start action. This proved to be of
no use, as the behavior persisted even with an increased start-delay.
However, it only happened when performing a fail-back operation; during
fail-over everything was OK, during fail-back, error.


The fix I made was to remove any start-delay and to add group
collocation constraints to all ms_drbd resources. Before that, I only
had one collocation constraint, for the drbd device being promoted last.
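
In crm shell terms, the idea is one collocation plus one ordering
constraint per ms_drbd resource, so the group only starts after every
drbd device has been promoted. A minimal sketch (the resource names here
are made up for illustration, not taken from my actual configuration):

    colocation grp_on_drbd0 inf: my_group ms_drbd0:Master
    order drbd0_before_grp inf: ms_drbd0:promote my_group:start
    colocation grp_on_drbd1 inf: my_group ms_drbd1:Master
    order drbd1_before_grp inf: ms_drbd1:promote my_group:start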


I hope this helps.

Regards,

Dan

Pavlos Parissis wrote:

Hi,

I noticed a race condition while I was integrating an application with
Pacemaker and thought I'd share it with you.

The init script of the application is LSB-compliant and passes the
tests mentioned in the Pacemaker documentation. Moreover, the init
script uses the functions supplied by the system[1] for starting,
stopping and checking the application.
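
For reference, such a script on CentOS typically follows this shape (a
sketch only; my_app and its path are placeholders, not the real
application):

    #!/bin/sh
    # LSB-style init script skeleton using the system-supplied helpers
    . /etc/init.d/functions
    case "$1" in
        start)  daemon /usr/sbin/my_app ;;   # start and daemonize
        stop)   killproc my_app ;;           # stop by process name
        status) status my_app ;;             # report the running state
        *)      echo "Usage: $0 {start|stop|status}"; exit 2 ;;
    esac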

I observed a few times that the monitor action was failing after the
startup of the cluster or after the movement of the resource group.
Because it was not happening every time, and a manual start/status was
always working, it was quite tricky and difficult to find the root cause
of the failure.
After a few hours of troubleshooting, I found out that the 1st monitor
action after the start action was executed too fast for the application
to create its pid file. As a result, the monitor action was receiving an
error.

I know it sounds a bit strange, but it happened on my systems. The fact
that my systems are basically VMware images on a laptop could be related
to the issue.

Nevertheless, I would like to ask whether you are thinking of
implementing an init_wait on the 1st monitor action. It could be useful.
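
If I read the documentation correctly, the start-delay attribute on the
monitor operation already gets close to this. A sketch, with an arbitrary
10s value and a made-up resource name:

    primitive my_app lsb:my_app \
        op start interval=0 timeout=60s \
        op monitor interval=20s timeout=20s start-delay=10s

But it has to be configured per resource, rather than applied
automatically to every 1st monitor.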

To solve my issue I put a sleep after the start of the application in
the init script. This gives the application enough time to create its
pid file, and the 1st monitor doesn't fail.
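
A slightly more robust variant of the same idea is to poll for the pid
file with a timeout instead of sleeping a fixed time. A sketch of the
start function (the pid file path and binary are placeholders):

    #!/bin/sh
    . /etc/init.d/functions
    PIDFILE=/var/run/my_app.pid

    start() {
        daemon /usr/sbin/my_app
        # wait up to 10s for the pid file, so the first monitor
        # cannot race against the application's startup
        for i in 1 2 3 4 5 6 7 8 9 10; do
            [ -s "$PIDFILE" ] && return 0
            sleep 1
        done
        return 1
    }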


Cheers,
Pavlos


[1] CentOS 5.4



--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania




Re: [Pacemaker] 1st monitor is too fast after the start

2010-10-13 Thread Pavlos Parissis
On 13 October 2010 09:48, Dan Frincu dfri...@streamwide.ro wrote:
 Hi,

 I've noticed the same type of behavior, although in a different context: my
 setup includes 3 drbd devices and a group of resources, all of which have to
 run on the same node and move together to other nodes. My issue was with the
 first resource that required access to a drbd device, namely the
 ocf:heartbeat:Filesystem RA trying to do a mount and failing.

 The reason: it was trying to mount the drbd device before the drbd device
 had finished migrating to the Primary state. Like you, I introduced a
 start-delay, but on the start action. This proved to be of no use, as the
 behavior persisted even with an increased start-delay. However, it only
 happened when performing a fail-back operation; during fail-over everything
 was OK, during fail-back, error.

 The fix I made was to remove any start-delay and to add group collocation
 constraints to all ms_drbd resources. Before that, I only had one
 collocation constraint, for the drbd device being promoted last.

 I hope this helps.


I am glad that somebody else experienced the same issue :)

In my mail I was talking about the monitor action which was failing,
but the behavior you described also happened on my system with the same
setup, a drbd and a fs resource. It also happened with the application
resource: the start was too fast and the FS was not mounted (yet) when
the start action fired for the application resource. A delay in the
start function of the application's resource agent fixed my issue.

In my setup I have all the necessary constraints to avoid this, at
least that is what I believe :-)

Cheers,
Pavlos


[r...@node-01 sysconfig]# crm configure show
node $id=059313ce-c6aa-4bd5-a4fb-4b781de6d98f node-03
node $id=d791b1f5-9522-4c84-a66f-cd3d4e476b38 node-02
node $id=e388e797-21f4-4bbe-a588-93d12964b4d7 node-01
primitive drbd_01 ocf:linbit:drbd \
    params drbd_resource=drbd_pbx_service_1 \
    op monitor interval=30s \
    op start interval=0 timeout=240s \
    op stop interval=0 timeout=120s
primitive drbd_02 ocf:linbit:drbd \
    params drbd_resource=drbd_pbx_service_2 \
    op monitor interval=30s \
    op start interval=0 timeout=240s \
    op stop interval=0 timeout=120s
primitive fs_01 ocf:heartbeat:Filesystem \
    params device=/dev/drbd1 directory=/pbx_service_01 fstype=ext3 \
    meta migration-threshold=3 failure-timeout=60 \
    op monitor interval=20s timeout=40s OCF_CHECK_LEVEL=20 \
    op start interval=0 timeout=60s \
    op stop interval=0 timeout=60s
primitive fs_02 ocf:heartbeat:Filesystem \
    params device=/dev/drbd2 directory=/pbx_service_02 fstype=ext3 \
    meta migration-threshold=3 failure-timeout=60 \
    op monitor interval=20s timeout=40s OCF_CHECK_LEVEL=20 \
    op start interval=0 timeout=60s \
    op stop interval=0 timeout=60s
primitive ip_01 ocf:heartbeat:IPaddr2 \
    params ip=192.168.78.10 cidr_netmask=24 broadcast=192.168.78.255 \
    meta failure-timeout=120 migration-threshold=3 \
    op monitor interval=5s
primitive ip_02 ocf:heartbeat:IPaddr2 \
    meta failure-timeout=120 migration-threshold=3 \
    params ip=192.168.78.20 cidr_netmask=24 broadcast=192.168.78.255 \
    op monitor interval=5s
primitive pbx_01 lsb:znd-pbx_01 \
    meta migration-threshold=3 failure-timeout=60 target-role=Started \
    op monitor interval=20s timeout=20s \
    op start interval=0 timeout=60s \
    op stop interval=0 timeout=60s
primitive pbx_02 lsb:znd-pbx_02 \
    meta migration-threshold=3 failure-timeout=60 \
    op monitor interval=20s timeout=20s \
    op start interval=0 timeout=60s \
    op stop interval=0 timeout=60s
primitive sshd_01 lsb:znd-sshd-pbx_01 \
    meta target-role=Started is-managed=true \
    op monitor on-fail=stop interval=10m \
    op start interval=0 timeout=60s on-fail=stop \
    op stop interval=0 timeout=60s on-fail=stop
primitive sshd_02 lsb:znd-sshd-pbx_02 \
    meta target-role=Started \
    op monitor on-fail=stop interval=10m \
    op start interval=0 timeout=60s on-fail=stop \
    op stop interval=0 timeout=60s on-fail=stop
group pbx_service_01 ip_01 fs_01 pbx_01 sshd_01 \
    meta target-role=Started
group pbx_service_02 ip_02 fs_02 pbx_02 sshd_02
ms ms-drbd_01 drbd_01 \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
ms ms-drbd_02 drbd_02 \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
location PrimaryNode-drbd_01 ms-drbd_01 100: node-01
location PrimaryNode-drbd_02 ms-drbd_02 100: node-02
location PrimaryNode-pbx_service_01 pbx_service_01 200: node-01
location PrimaryNode-pbx_service_02 pbx_service_02 200: node-02
location SecondaryNode-drbd_01 ms-drbd_01 0: node-03
location SecondaryNode-drbd_02 ms-drbd_02 0: node-03
location 

Re: [Pacemaker] 1st monitor is too fast after the start

2010-10-13 Thread Pavlos Parissis
On 13 October 2010 10:50, Dan Frincu dfri...@streamwide.ro wrote:
 From what I see you have a dual-primary setup with failover on the third
 node; basically, if you have one drbd resource for which you have both
 ordering and collocation, I don't think you need to improve it. If it
 ain't broke, don't fix it :)

 Regards,


No, I don't have dual-primary; my DRBD is in single-primary mode for
both DRBD resources.
I use an N+1 setup: I have 2 resource groups, each with a unique primary
and a shared secondary.
pbx_service_01 resource group has primary node-01 and secondary node-03.
pbx_service_02 resource group has primary node-02 and secondary node-03.

I use an asymmetric cluster with specific location constraints in order
to implement the above.
The DRBD resource will never be in primary mode on 2 nodes at the same time.
I have set specific collocation and order constraints in order to
bind each DRBD ms resource to the appropriate resource group.
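
In crm syntax the binding looks roughly like this (the constraint ids
are made up here, since that part of my configuration was cut off
above):

    colocation pbx_01-with-drbd_01 inf: pbx_service_01 ms-drbd_01:Master
    order drbd_01-before-pbx_01 inf: ms-drbd_01:promote pbx_service_01:start
    colocation pbx_02-with-drbd_02 inf: pbx_service_02 ms-drbd_02:Master
    order drbd_02-before-pbx_02 inf: ms-drbd_02:promote pbx_service_02:start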

I hope it is clear now.

Cheers and thanks for looking at my conf,
Pavlos



Re: [Pacemaker] 1st monitor is too fast after the start

2010-10-13 Thread Dan Frincu

Pavlos Parissis wrote:

On 13 October 2010 10:50, Dan Frincu dfri...@streamwide.ro wrote:

 From what I see you have a dual-primary setup with failover on the third
 node; basically, if you have one drbd resource for which you have both
 ordering and collocation, I don't think you need to improve it. If it
 ain't broke, don't fix it :)

 Regards,


No, I don't have dual-primary; my DRBD is in single-primary mode for
both DRBD resources.
I use an N+1 setup: I have 2 resource groups, each with a unique primary
and a shared secondary.
pbx_service_01 resource group has primary node-01 and secondary node-03.
pbx_service_02 resource group has primary node-02 and secondary node-03.

I use an asymmetric cluster with specific location constraints in order
to implement the above.
The DRBD resource will never be in primary mode on 2 nodes at the same time.
I have set specific collocation and order constraints in order to
bind each DRBD ms resource to the appropriate resource group.

I hope it is clear now.

Cheers and thanks for looking at my conf,
Pavlos

True, my bad, dual-primary does not apply to your setup. I formulated it
wrong, I meant what you said :)


Regards,

Dan



--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania
