Re: [Pacemaker] problem about move node from one cluster to another cluster
Hi,

Thank you for your help. I want to upgrade my openais. Do I need to
reinstall Linux and download the latest version of openais, or is there a
simpler way? Thanks :-)

Dan Frincu wrote:
> Hi,
>
> Depending on the openais version (please mention it)

Hi,

Thank you for your reply. My openais version is openais-0.80.5-15.1 and my
pacemaker version is pacemaker-1.0.5-4.1. I used restart but it does not
work; I found it could not stop.

> this behavior could happen, I've seen it as well, on openais-0.8.0. What
> I've done to fix it was to restart the openais process via
> /etc/init.d/openais restart. And then it worked, however, this was one of
> the reasons I updated the packages to the latest versions of corosync,
> pacemaker, etc. The tricky part was doing the migration procedure for
> upgrading production servers without service downtime, but that's another
> story.
>
> Regards,
> Dan
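For readers following along: the fix described above amounts to restarting
the stack and, when the stop phase hangs (Dan notes later in this digest
that the process sometimes has to be killed manually), removing the stuck
daemon by hand. A minimal sketch of that sequence; the daemon name matches
the openais 0.80.x packaging discussed here, but treat it as an
assumption, not a verified recipe:

    # try a normal restart first
    /etc/init.d/openais restart

    # if the stop never finishes, kill the daemon manually and start again
    killall -9 aisexec      # aisexec is the openais 0.80.x daemon (assumption)
    /etc/init.d/openais start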
Re: [Pacemaker] stand_alone_ping stop Node start
Now you're just being annoying. This is the third copy of this email I've
received. On principle I am now not going to give you an answer.

On Thu, Oct 14, 2010 at 3:47 AM, jiaju liu wrote:
> Hi,
> I rebooted my node and it shows:
> node2 pingd: [3932]: info: stand_alone_ping: Node 192.168.10.100 is
> unreachable (read)
> and the node could not start.
>
> 192.168.10.100 is on the IB network, which I bring up after the node
> starts, so do you have any idea how to let the node start first? Thanks
> very much. :-)
[Pacemaker] stand_alone_ping stop Node start
Hi,

I rebooted my node and it shows:

    node2 pingd: [3932]: info: stand_alone_ping: Node 192.168.10.100 is
    unreachable (read)

and the node could not start.

192.168.10.100 is on the IB network, which I bring up after the node
starts, so do you have any idea how to let the node start first? Thanks
very much. :-)
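No answer was given in the thread, but the symptom (resources refusing to
start while the pingd attribute is undefined) usually comes from a -inf
location rule like the one shown in the "Migrate resources based on
connectivity" thread later in this digest. A hedged sketch of one way to
soften such a rule; the resource and attribute names here are
hypothetical. A finite negative score makes a node with an unreachable
ping target less attractive instead of forbidding startup outright:

    # instead of: rule -inf: not_defined pingd or pingd number:lte 0
    location prefer_connected my_group \
        rule $id="prefer_connected-rule" -1000: not_defined pingd or pingd number:lte 0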
Re: [Pacemaker] crm resource move doesn't move the resource
On 11 October 2010 11:16, Pavlos Parissis wrote:
> On 8 October 2010 09:29, Andrew Beekhof wrote:
>> On Fri, Oct 8, 2010 at 8:34 AM, Pavlos Parissis wrote:
>>> On 8 October 2010 08:29, Andrew Beekhof wrote:
>>>> On Thu, Oct 7, 2010 at 9:58 PM, Pavlos Parissis wrote:
>>>>> On 7 October 2010 09:01, Andrew Beekhof wrote:
>>>>>> On Sat, Oct 2, 2010 at 6:31 PM, Pavlos Parissis wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am having the same issue again, in a different set of 3 nodes.
>>>>>>> When I try to fail over the resource group manually to the standby
>>>>>>> node, the ms-drbd resource is not moved as well, and as a result
>>>>>>> the resource group is not fully started; only the ip resource is
>>>>>>> started. Any ideas why I am having this issue?
>>>>>>
>>>>>> I think it's a bug that was fixed recently. Could you try the
>>>>>> latest code from Mercurial?
>>>>>
>>>>> 1.1 or 1.2 branch?
>>>>
>>>> 1.1
>>>
>>> To save time on compiling stuff I want to use the available rpms of
>>> version 1.1.3 from the rpm-next repo. But before I go and recreate
>>> the scenario, which means rebuilding 3 nodes, I would like to know if
>>> this bug is fixed in 1.1.3.
>>
>> As I said, I believe so.
>
> I recreated the 3-node cluster and I didn't face that issue, but I am
> going to keep an eye on it for a few days and even rerun the whole
> scenario (recreate the 3-node cluster, etc.) just to be very sure. If I
> don't see it again I will also close the bug report.
>
> Thanks,
> Pavlos

I recreated the 3-node cluster using version 1.1.3 just to see if it is
solved, but the issue appeared again. So, Andrew, the issue is not solved
in 1.1.3. I am going to update the bug report accordingly.

Cheers,
Pavlos
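For context, the manual failover being tested here is done with the crm
shell's move/unmove commands (aliases for migrate/unmigrate). A sketch,
with resource and node names borrowed from Pavlos' configuration later in
this digest; his actual names in this thread are not shown:

    # push the group to the standby node (creates a temporary location constraint)
    crm resource move pbx_service_01 node-03

    # ... check whether the ms-drbd master followed the group ...

    # remove the temporary constraint so the cluster is free to place the group
    crm resource unmove pbx_service_01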
Re: [Pacemaker] problem about move node from one cluster to another cluster
Hi,

Yes, it sometimes needs to be killed manually because the process hangs
and the restart operation never seems to end. Yet another reason to
upgrade.

All,

A question: this type of software usually gets installed on a platform
once and then goes into service for many years, on servers where downtime
should be kept to a minimum (gee, that's why you use a cluster :)). How
does that fit the release schedule? There are plenty of users out there
with questions related to Heartbeat 2, openais-0.8.0, and so forth. Some
environments cannot be changed lightly, others not at all, so what is the
response to "this feature doesn't work on that version of the software"?
Upgrade? If so, at what interval (keeping in mind that you probably want
the stable packages on your system)? I'm asking this because when I
started working with openais, the latest version available was 0.8.0 on
some SUSE repos that aren't available anymore.

Regards,
Dan

jiaju liu wrote:
> Hi,
>
> Depending on the openais version (please mention it)
>
> Hi,
> Thank you for your reply. My openais version is openais-0.80.5-15.1 and
> my pacemaker version is pacemaker-1.0.5-4.1. I used restart but it does
> not work. I found it could not stop.
>
> this behavior could happen, I've seen it as well, on openais-0.8.0.
> What I've done to fix it was to restart the openais process via
> /etc/init.d/openais restart. And then it worked, however, this was one
> of the reasons I updated the packages to the latest versions of
> corosync, pacemaker, etc. The tricky part was doing the migration
> procedure for upgrading production servers without service downtime,
> but that's another story.

-- 
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania
Re: [Pacemaker] 1st monitor is too fast after the start
Pavlos Parissis wrote:
> On 13 October 2010 10:50, Dan Frincu wrote:
>> From what I see you have a dual-primary setup with failover on the
>> third node; basically, you have one drbd resource for which you have
>> both ordering and collocation. I don't think you need to "improve" it;
>> if it ain't broke, don't fix it :)
>
> No, I don't have dual-primary. My DRBD is in single-primary mode for
> both DRBD resources. I use an N+1 setup. I have two resource groups,
> each with a unique primary and a shared secondary:
>
> pbx_service_01 has primary node-01 and secondary node-03
> pbx_service_02 has primary node-02 and secondary node-03
>
> I use an asymmetric cluster with specific location constraints to
> implement the above. A DRBD resource will never be in primary mode on
> two nodes at the same time. I have set specific collocation and order
> constraints to "bond" each DRBD ms resource to the appropriate resource
> group.
>
> I hope it is clear now.
>
> Cheers and thanks for looking at my conf,
> Pavlos

True, my bad, dual-primary does not apply to your setup. I formulated it
wrong; I meant what you said :)

Regards,
Dan

-- 
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania
Re: [Pacemaker] 1st monitor is too fast after the start
On 13 October 2010 10:50, Dan Frincu wrote:
> From what I see you have a dual-primary setup with failover on the third
> node; basically, you have one drbd resource for which you have both
> ordering and collocation. I don't think you need to "improve" it; if it
> ain't broke, don't fix it :)
>
> Regards,

No, I don't have dual-primary. My DRBD is in single-primary mode for both
DRBD resources. I use an N+1 setup. I have two resource groups, each with
a unique primary and a shared secondary:

pbx_service_01 has primary node-01 and secondary node-03
pbx_service_02 has primary node-02 and secondary node-03

I use an asymmetric cluster with specific location constraints to
implement the above. A DRBD resource will never be in primary mode on two
nodes at the same time. I have set specific collocation and order
constraints to "bond" each DRBD ms resource to the appropriate resource
group, as sketched below.

I hope it is clear now.

Cheers and thanks for looking at my conf,
Pavlos
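A minimal sketch of the kind of constraints described above, using the
resource names from Pavlos' configuration elsewhere in this digest. The
constraint names are made up, since the original constraint lines are cut
off in the archived message:

    # keep the group where the DRBD master is, and start it only after promotion
    colocation pbx_01-with-drbd inf: pbx_service_01 ms-drbd_01:Master
    order pbx_01-after-drbd inf: ms-drbd_01:promote pbx_service_01:start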
Re: [Pacemaker] Migrate resources based on connectivity
Hi,

Pavlos Parissis wrote:
> On 12 October 2010 20:00, Dan Frincu wrote:
>> Hi,
>>
>> Lars Ellenberg wrote:
>>> On Mon, Oct 11, 2010 at 03:50:01PM +0300, Dan Frincu wrote:
>>>> Hi,
>>>>
>>>> Dejan Muhamedagic wrote:
>>>>> Hi,
>>>>>
>>>>> On Sun, Oct 10, 2010 at 10:27:13PM +0300, Dan Frincu wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I have the following setup:
>>>>>> - order drbd0:promote drbd1:promote
>>>>>> - order drbd1:promote drbd2:promote
>>>>>> - order drbd2:promote all:start
>>>>>> - collocation all drbd2:Master
>>>>>> - all is a group of resources, drbd{0..3} are drbd ms resources.
>>>>>>
>>>>>> I want to migrate the resources based on ping connectivity to a
>>>>>> default gateway. Based on
>>>>>> http://www.clusterlabs.org/wiki/Pingd_with_resources_on_different_networks
>>>>>> and http://www.clusterlabs.org/wiki/Example_configurations I've
>>>>>> tried the following:
>>>>>> - primitive ping ocf:pacemaker:ping params host_list=1.2.3.4
>>>>>>   multiplier=100 op monitor interval=5s timeout=5s
>>>>>> - clone ping_clone ping meta globally-unique=false
>>>>>> - location ping_nok all \
>>>>>>   rule $id="ping_nok-rule" -inf: not_defined ping_clone or
>>>>>>   ping_clone number:lte 0
>>>>>
>>>>> Use pingd to reference the attribute in the location constraint.
>>>>
>>>> Not to be disrespectful, but after 3 days of being stuck on this
>>>> issue, I don't exactly understand how to do that. Could you please
>>>> provide an example? Thank you in advance.
>>>
>>> The example you reference lists:
>>>
>>>     primitive pingdnet1 ocf:pacemaker:pingd \
>>>         params host_list=192.168.23.1 \
>>>         name=pingdnet1
>>>     clone cl-pingdnet1 pingdnet1
>>>
>>> The default for the name param is pingd, and it is the attribute name
>>> to be used in the location constraints. You will need to reference
>>> pingd in your location constraint, or set an explicit name in the
>>> primitive definition and reference that. Your ping primitive sets the
>>> default 'pingd' attribute, but you reference some 'ping_clone'
>>> attribute, which apparently no-one really references.
>>
>> I've finally managed to finish the setup with the indications received
>> above; the behavior is the expected one. Also, I've tried
>> ocf:pacemaker:pingd, and even though it does the reachability tests
>> properly, it fails to update the cib upon restoring connectivity; I
>> had to manually run attrd_updater -R to get the resources to start
>> again, therefore I'm going with ocf:pacemaker:ping.
>
> It would be quite useful for the rest of the people if you posted your
> final and working configuration.
>
> Cheers,
> Pavlos

The relevant stuff is related to the group and the ping location
constraint.
primitive ping_gw ocf:pacemaker:ping \
    params host_list="1.1.1.99" multiplier="100" name="ping_gw_name" \
    op monitor interval="5s" timeout="60s" \
    op start interval="0s" timeout="60s"
group all virtual_ip_1 virtual_ip_2 Failover_Alert fs_home fs_mysql \
    fs_storage httpd mysqld \
    meta target-role="Started"
ms ms_drbd_home drbd_home \
    meta notify="true" globally-unique="false" target-role="Started" \
    master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
ms ms_drbd_mysql drbd_mysql \
    meta notify="true" globally-unique="false" target-role="Started" \
    master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
ms ms_drbd_storage drbd_storage \
    meta notify="true" globally-unique="false" target-role="Started" \
    master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
clone ping_gw_clone ping_gw \
    meta globally-unique="false" target-role="Started"
location nok_ping_to_gw all \
    rule $id="nok_ping_to_gw-rule" -inf: not_defined ping_gw_name or \
    ping_gw_name lte 0
colocation all_on_home inf: all ms_drbd_home:Master
colocation all_on_mysql inf: all ms_drbd_mysql:Master
colocation all_on_storage inf: all ms_drbd_storage:Master
order all_after_storage inf: ms_drbd_storage:promote all:start
order ms_drbd_home_after_ms_drbd_mysql inf: ms_drbd_mysql:promote ms_drbd_home:promote
order ms_drbd_storage_after_ms_drbd_home inf: ms_drbd_home:promote ms_drbd_storage:promote
property $id="cib-bootstrap-options" \
    expected-quorum-votes="2" \
    stonith-enabled="false" \
    symmetric-cluster="true" \
    dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
    no-quorum-policy="ignore" \
    cluster-infrastructure="openais" \
    last-lrm-refresh="1286905225"
rsc_defaults $id="rsc-options" \
    multiple-active="block" \
    resource-stickiness="1000"

I hope this helps.

Regards,
Dan

-- 
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania
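As a follow-up to the attrd_updater -R note above: if resources stay
stopped after connectivity returns, it can help to refresh the attribute
daemon and then confirm that the ping attribute is present in the status
section of the CIB. A sketch, assuming the attribute name from the
configuration above:

    # force attrd to re-write its attributes into the CIB
    attrd_updater -R

    # confirm the ping attribute now appears in the live CIB
    cibadmin -Q | grep ping_gw_name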
Re: [Pacemaker] 1st monitor is too fast after the start
Pavlos Parissis wrote:
> On 13 October 2010 09:48, Dan Frincu wrote:
>> Hi,
>>
>> I've noticed the same type of behavior, however in a different
>> context; my setup includes 3 drbd devices and a group of resources,
>> all of which have to run on the same node and move together to other
>> nodes. My issue was with the first resource that required access to a
>> drbd device: the ocf:heartbeat:Filesystem RA was trying to do a mount
>> and failing.
>>
>> The reason: it was trying to mount the drbd device before the drbd
>> device had finished migrating to the primary state. Same as you, I
>> introduced a start-delay, but on the start action. This proved to be
>> of no use, as the behavior persisted even with an increased
>> start-delay. However, it only happened when performing a fail-back
>> operation; during fail-over everything was ok, during fail-back,
>> error.
>>
>> The fix I've made was to remove any start-delay and to add group
>> collocation constraints to all ms_drbd resources. Before that, I only
>> had one collocation constraint, for the drbd device being promoted
>> last.
>>
>> I hope this helps.
>
> I am glad that somebody else experienced the same issue :)
>
> In my mail I was talking about the monitor action which was failing,
> but the behavior you described happened on my system on the same
> setup, drbd and fs resource. It also happened on the application
> resource: the start was too fast and the FS was not mounted (yet) when
> the start action fired for the application resource. A delay in the
> start function of the application's resource agent fixed my issue. In
> my setup I have all the necessary constraints to avoid this, at least
> this is what I believe :-)
>
> Cheers,
> Pavlos
>
> [crm configure show output snipped; it is quoted in full in Pavlos'
> message below]

From what I see you have a dual-primary setup with failover on the third
node; basically, you have one drbd resource for which you have both
ordering and collocation. I don't think you need to "improve" it; if it
ain't broke, don't fix it :)

Regards,
Dan
Re: [Pacemaker] 1st monitor is too fast after the start
On 13 October 2010 09:48, Dan Frincu wrote:
> Hi,
>
> I've noticed the same type of behavior, however in a different context;
> my setup includes 3 drbd devices and a group of resources, all of which
> have to run on the same node and move together to other nodes. My issue
> was with the first resource that required access to a drbd device: the
> ocf:heartbeat:Filesystem RA was trying to do a mount and failing.
>
> The reason: it was trying to mount the drbd device before the drbd
> device had finished migrating to the primary state. Same as you, I
> introduced a start-delay, but on the start action. This proved to be of
> no use, as the behavior persisted even with an increased start-delay.
> However, it only happened when performing a fail-back operation; during
> fail-over everything was ok, during fail-back, error.
>
> The fix I've made was to remove any start-delay and to add group
> collocation constraints to all ms_drbd resources. Before that, I only
> had one collocation constraint, for the drbd device being promoted
> last.
>
> I hope this helps.

I am glad that somebody else experienced the same issue :)

In my mail I was talking about the monitor action which was failing, but
the behavior you described happened on my system on the same setup, drbd
and fs resource. It also happened on the application resource: the start
was too fast and the FS was not mounted (yet) when the start action fired
for the application resource. A delay in the start function of the
application's resource agent fixed my issue. In my setup I have all the
necessary constraints to avoid this, at least this is what I believe :-)

Cheers,
Pavlos

[r...@node-01 sysconfig]# crm configure show
node $id="059313ce-c6aa-4bd5-a4fb-4b781de6d98f" node-03
node $id="d791b1f5-9522-4c84-a66f-cd3d4e476b38" node-02
node $id="e388e797-21f4-4bbe-a588-93d12964b4d7" node-01
primitive drbd_01 ocf:linbit:drbd \
    params drbd_resource="drbd_pbx_service_1" \
    op monitor interval="30s" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="120s"
primitive drbd_02 ocf:linbit:drbd \
    params drbd_resource="drbd_pbx_service_2" \
    op monitor interval="30s" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="120s"
primitive fs_01 ocf:heartbeat:Filesystem \
    params device="/dev/drbd1" directory="/pbx_service_01" fstype="ext3" \
    meta migration-threshold="3" failure-timeout="60" \
    op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="60s"
primitive fs_02 ocf:heartbeat:Filesystem \
    params device="/dev/drbd2" directory="/pbx_service_02" fstype="ext3" \
    meta migration-threshold="3" failure-timeout="60" \
    op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="60s"
primitive ip_01 ocf:heartbeat:IPaddr2 \
    params ip="192.168.78.10" cidr_netmask="24" broadcast="192.168.78.255" \
    meta failure-timeout="120" migration-threshold="3" \
    op monitor interval="5s"
primitive ip_02 ocf:heartbeat:IPaddr2 \
    meta failure-timeout="120" migration-threshold="3" \
    params ip="192.168.78.20" cidr_netmask="24" broadcast="192.168.78.255" \
    op monitor interval="5s"
primitive pbx_01 lsb:znd-pbx_01 \
    meta migration-threshold="3" failure-timeout="60" target-role="Started" \
    op monitor interval="20s" timeout="20s" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="60s"
primitive pbx_02 lsb:znd-pbx_02 \
    meta migration-threshold="3" failure-timeout="60" \
    op monitor interval="20s" timeout="20s" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="60s"
primitive sshd_01 lsb:znd-sshd-pbx_01 \
    meta target-role="Started" is-managed="true" \
    op monitor on-fail="stop" interval="10m" \
    op start interval="0" timeout="60s" on-fail="stop" \
    op stop interval="0" timeout="60s" on-fail="stop"
primitive sshd_02 lsb:znd-sshd-pbx_02 \
    meta target-role="Started" \
    op monitor on-fail="stop" interval="10m" \
    op start interval="0" timeout="60s" on-fail="stop" \
    op stop interval="0" timeout="60s" on-fail="stop"
group pbx_service_01 ip_01 fs_01 pbx_01 sshd_01 \
    meta target-role="Started"
group pbx_service_02 ip_02 fs_02 pbx_02 sshd_02
ms ms-drbd_01 drbd_01 \
    meta master-max="1" master-node-max="1" clone-max="2" \
    clone-node-max="1" notify="true" target-role="Started"
ms ms-drbd_02 drbd_02 \
    meta master-max="1" master-node-max="1" clone-max="2" \
    clone-node-max="1" notify="true" target-role="Started"
location PrimaryNode-drbd_01 ms-drbd_01 100: node-01
location PrimaryNode-drbd_02 ms-drbd_02 100: node-02
location PrimaryNode-pbx_service_01 pbx_service_01 200: node-01
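For anyone wanting to handle the race at the cluster level rather than
inside the init script: Pacemaker operations accept a start-delay
attribute, which postpones the first monitor after start. A sketch of
applying it to one of the primitives above; the 10s value is an arbitrary
assumption:

    primitive pbx_01 lsb:znd-pbx_01 \
        meta migration-threshold="3" failure-timeout="60" \
        op monitor interval="20s" timeout="20s" start-delay="10s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"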
Re: [Pacemaker] 1st monitor is too fast after the start
Hi,

I've noticed the same type of behavior, however in a different context;
my setup includes 3 drbd devices and a group of resources, all of which
have to run on the same node and move together to other nodes. My issue
was with the first resource that required access to a drbd device: the
ocf:heartbeat:Filesystem RA was trying to do a mount and failing.

The reason: it was trying to mount the drbd device before the drbd device
had finished migrating to the primary state. Same as you, I introduced a
start-delay, but on the start action. This proved to be of no use, as the
behavior persisted even with an increased start-delay. However, it only
happened when performing a fail-back operation; during fail-over
everything was ok, during fail-back, error.

The fix I've made was to remove any start-delay and to add group
collocation constraints to all ms_drbd resources. Before that, I only had
one collocation constraint, for the drbd device being promoted last.

I hope this helps.

Regards,
Dan

Pavlos Parissis wrote:
> Hi,
>
> I noticed a race condition while I was integrating an application with
> Pacemaker and thought to share it with you.
>
> The init script of the application is LSB-compliant and passes the
> tests mentioned in the Pacemaker documentation. Moreover, the init
> script uses the functions supplied by the system [1] for starting,
> stopping and checking the application.
>
> I observed a few times that the monitor action was failing after the
> startup of the cluster or the movement of the resource group. Because
> it was not happening always, and a manual start/status always worked,
> it was quite tricky and difficult to find the root cause of the
> failure. After a few hours of troubleshooting, I found out that the
> first monitor action after the start action was executed too fast for
> the application to create its pid file. As a result, the monitor action
> received an error.
>
> I know it sounds a bit strange, but it happened on my systems. The fact
> that my systems are basically vmware images on a laptop could have a
> relation to the issue. Nevertheless, I would like to ask if you are
> thinking of implementing an "init_wait" on the first monitor action.
> It could be useful.
>
> To solve my issue I put a sleep after the start of the application in
> the init script. This gives the application enough time to create its
> pid file, and the first monitor doesn't fail.
>
> Cheers,
> Pavlos
>
> [1] CentOS 5.4

-- 
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania
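A sketch of the workaround Pavlos describes: have the init script's start
action wait for the pid file before returning, so the first monitor finds
it. The daemon name, paths and timeout here are hypothetical, and polling
is used instead of a fixed sleep. The daemon helper is the standard one
from /etc/init.d/functions on CentOS 5, which the original script is said
to use:

    start() {
        daemon /usr/local/pbx/bin/pbxd     # hypothetical daemon path
        # wait up to 10 seconds for the daemon to write its pid file
        for i in $(seq 1 10); do
            [ -s /var/run/pbxd.pid ] && return 0
            sleep 1
        done
        return 1    # pid file never appeared; report start failure
    }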