[Pacemaker] Different value on cluster-infrastructure between 2 nodes

2013-04-12 Thread Pavlos Parissis
Hi

I am doing a rolling upgrade of Pacemaker from CentOS 6.3 to 6.4, and
when the 1st node is upgraded and gets version 1.1.8 it doesn't join the
cluster, so I ended up with 2 clusters.

In the logs of node1 I see

cluster-infrastructure="classic openais (with plugin)"

but node2 (still on CentOS 6.3 with Pacemaker 1.1.7) has

cluster-infrastructure="openais"

I also see a different dc-version on each node.
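
For reference, this is how I am comparing the two properties on each
node (a sketch using crm_attribute; adjust to your tooling):

crm_attribute --type crm_config --name cluster-infrastructure --query
crm_attribute --type crm_config --name dc-version --query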

Does anyone know if this could be the reason node1 doesn't join the
cluster and decides to form its own?

Corosync communication looks fine:

Printing ring status.
Local node ID 484162314
RING ID 0
id  = 10.187.219.28
status  = ring 0 active with no faults
RING ID 1
id  = 192.168.1.2
status  = ring 1 active with no faults


Cheers,
Pavlos




[Pacemaker] 1.1.8 not compatible with 1.1.7?

2013-04-12 Thread Pavlos Parissis
Hoi,

As I wrote in another post[1], I failed to upgrade a 2-node cluster
to 1.1.8.

Before the upgrade both nodes were running CentOS 6.3, corosync
1.4.1-7 and pacemaker 1.1.7.

I followed the rolling upgrade process, so I stopped pacemaker and then
corosync on node1 and upgraded to CentOS 6.4. The OS upgrade also brings
pacemaker to 1.1.8-7 and corosync to 1.4.1-15.
The RPM upgrade went smoothly; I knew about the crmsh issue, so I made
sure I had the crmsh RPM in my repos.

Corosync started without any problems and both nodes could see each
other[2]. But for some reason node2 never received a reply to the join
offer from node1, and node1 never joined the cluster. Node1 formed a
new cluster as it never got a reply from node2, so I ended up with a
split-brain situation.

Logs of node1 can be found here
https://dl.dropboxusercontent.com/u/1773878/pacemaker-issue/node1.log
and of node2 here
https://dl.dropboxusercontent.com/u/1773878/pacemaker-issue/node2.log

I have found this thread[3], which could be related to my problem, but
the bug that caused the join failure in that case is fixed in 1.1.8.

Any ideas?

Cheers,
Pavlos





[1] Subject: Different value on cluster-infrastructure between 2 nodes
[2]
https://dl.dropboxusercontent.com/u/1773878/pacemaker-issue/corosync.status
[3] http://comments.gmane.org/gmane.linux.highavailability.pacemaker/13185





Re: [Pacemaker] Disable startup fencing with cman

2013-04-14 Thread Pavlos Parissis
On 14/04/2013 10:47 AM, Andreas Mock wrote:
> Hi all,
>
> in a two node cluster (RHEL6.x, cman, pacemaker), when I start up the
> very first node, this node will try to fence the other node if it
> can't see it. This can be true in case of maintenance. How do I avoid
> this startup fencing temporarily when I know that the other node is
> down?

Have you tried putting the node in standby? I don't know if it will
work; just sharing my idea here.
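
E.g., with the crm shell (the node name here is hypothetical):

crm node standby node2
# and when the maintenance is over:
crm node online node2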




Re: [Pacemaker] 1.1.8 not compatible with 1.1.7?

2013-04-14 Thread Pavlos Parissis
On 12/04/2013 09:37 PM, Pavlos Parissis wrote:
> Hoi,
>
> As I wrote in another post[1] I failed to upgrade a 2-node cluster
> to 1.1.8.
> [...snip...]
>

Doing a Disconnect & Reattach upgrade of both nodes at the same time
gives me a working 1.1.8 cluster. Every attempt to make a 1.1.8 node
join a cluster with a 1.1.7 node failed.
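
For reference, the Disconnect & Reattach procedure boils down to
something like this (a sketch; the property name is per the Pacemaker
1.0 upgrade docs, so double-check it for your version):

# tell the cluster to stop managing services
crm configure property is-managed-default="false"
# stop pacemaker/corosync on all nodes, upgrade the packages, start
# them again, then hand the still-running services back to the cluster
crm configure property is-managed-default="true"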

Cheers,
Pavlos






Re: [Pacemaker] 1.1.8 not compatible with 1.1.7?

2013-04-15 Thread Pavlos Parissis
Hoi,

I upgraded the 1st node; here are the logs:
https://dl.dropboxusercontent.com/u/1773878/pacemaker-issue/node1.debuglog
https://dl.dropboxusercontent.com/u/1773878/pacemaker-issue/node2.debuglog

Enabling tracing on the mentioned function didn't give me any more
information, at least as far as I can tell.
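
For the record, tracing was enabled via /etc/sysconfig/pacemaker as
Andrew suggests below:

export PCMK_trace_functions=ais_dispatch_message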

Cheers,
Pavlos


On 15 April 2013 01:42, Andrew Beekhof  wrote:

>
> On 15/04/2013, at 7:31 AM, Pavlos Parissis wrote:
>
> [...snip...]
> >
> > Doing a Disconnect & Reattach upgrade of both nodes at the same time
> > gives me a working 1.1.8 cluster. Every attempt to make a 1.1.8 node
> > join a cluster with a 1.1.7 node failed.
>
> There wasn't enough detail in the logs to suggest a solution, but if you
> add the following to /etc/sysconfig/pacemaker and re-test, it might shed
> some additional light on the problem.
>
> export PCMK_trace_functions=ais_dispatch_message
>
> Certainly there was no intention to make them incompatible.


[Pacemaker] location constraint question

2010-09-20 Thread Pavlos Parissis
Hi,
I am having problems understanding why my DRBD ms resource needs a
location constraint. My setup is quite simple:
3 nodes
2 resource groups which hold the ip, fs and dummy resources
2 resources for the 2 DRBD devices
2 master/slave resources for the 2 DRBD resources.

The objective is to have pbx_service_01 use node-01 as primary and
node-03 as secondary, and pbx_service_02 use node-02 as primary and
node-03 as secondary. So, an N+1 architecture. With the configuration
[1] everything works as I want [2]. But I found a comment from Lars
Ellenberg [3] which basically says to use a location constraint on the
ms DRBD resource.
So, I deleted the PrimaryNode-drbd_01 and SecondaryNode-drbd_01
location constraints just to see the impact on 1 of the 2 resource
groups.
I noticed that from the pbx_service_01 resource group only ip_01 is
started, and not fs_01 and pbx_01 (pbx_01 not starting is normal
because of the order constraint).
I thought that having a location constraint for the resource group
would be enough.
What have I understood incorrectly?

BTW, why does crm_mon report only 4 resources?

Thanks,
Pavlos




[1]
[r...@node-01 ~]# crm configure show
node $id="b8ad13a6-8a6e-4304-a4a1-8f69fa735100" node-02
node $id="d5557037-cf8f-49b7-95f5-c264927a0c76" node-01
node $id="e5195d6b-ed14-4bb3-92d3-9105543f9251" node-03
primitive drbd_01 ocf:linbit:drbd \
    params drbd_resource="drbd_pbx_service_1" \
    op monitor interval="30s"
primitive drbd_02 ocf:linbit:drbd \
    params drbd_resource="drbd_pbx_service_2" \
    op monitor interval="30s"
primitive fs_01 ocf:heartbeat:Filesystem \
    params device="/dev/drbd1" directory="/pbx_service_01" fstype="ext3"
primitive fs_02 ocf:heartbeat:Filesystem \
    params device="/dev/drbd2" directory="/pbx_service_02" fstype="ext3"
primitive ip_01 ocf:heartbeat:IPaddr2 \
    params ip="10.10.10.10" cidr_netmask="28" broadcast="10.10.10.127" \
    op monitor interval="5s"
primitive ip_02 ocf:heartbeat:IPaddr2 \
    params ip="10.10.10.11" cidr_netmask="28" broadcast="10.10.10.127" \
    op monitor interval="5s"
primitive pbx_01 ocf:heartbeat:Dummy \
    params state="/pbx_service_01/Dummy.state"
primitive pbx_02 ocf:heartbeat:Dummy \
    params state="/pbx_service_02/Dummy.state"
group pbx_service_01 ip_01 fs_01 pbx_01
group pbx_service_02 ip_02 fs_02 pbx_02
ms ms-drbd_01 drbd_01 \
    meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true"
ms ms-drbd_02 drbd_02 \
    meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true"
location PrimaryNode-drbd_01 ms-drbd_01 100: node-01
location PrimaryNode-drbd_02 ms-drbd_02 100: node-02
location PrimaryNode-pbx_service_01 pbx_service_01 200: node-01
location PrimaryNode-pbx_service_02 pbx_service_02 200: node-02
location SecondaryNode-drbd_01 ms-drbd_01 0: node-03
location SecondaryNode-drbd_02 ms-drbd_02 0: node-03
location SecondaryNode-pbx_service_01 pbx_service_01 10: node-03
location SecondaryNode-pbx_service_02 pbx_service_02 10: node-03
colocation fs-on-drbd_01 inf: fs_01 ms-drbd_01:Master
colocation fs-on-drbd_02 inf: fs_02 ms-drbd_02:Master
colocation pbx_01-with-fs_01 inf: pbx_01 fs_01
colocation pbx_01-with-ip_01 inf: pbx_01 ip_01
colocation pbx_02-with-fs_02 inf: pbx_02 fs_02
colocation pbx_02-with-ip_02 inf: pbx_02 ip_02
order fs_01-after-drbd_01 inf: ms-drbd_01:promote fs_01:start
order fs_02-after-drbd_02 inf: ms-drbd_02:promote fs_02:start
order pbx_01-after-fs_01 inf: fs_01 pbx_01
order pbx_01-after-ip_01 inf: ip_01 pbx_01
order pbx_02-after-fs_02 inf: fs_02 pbx_02
order pbx_02-after-ip_02 inf: ip_02 pbx_02
property $id="cib-bootstrap-options" \
    dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
    cluster-infrastructure="Heartbeat" \
    stonith-enabled="false" \
    symmetric-cluster="false"
rsc_defaults $id="rsc-options" \
    resource-stickiness="1000"




[2]
[r...@node-03 ~]# crm_mon -1

Last updated: Mon Sep 20 15:36:46 2010
Stack: Heartbeat
Current DC: node-03 (e5195d6b-ed14-4bb3-92d3-9105543f9251) - partition
with quorum
Version: 1.0.9-89bd754939df5150de7cd76835f98fe90851b677
3 Nodes configured, unknown expected votes
4 Resources configured.


Online: [ node-03 node-01 node-02 ]

 Resource Group: pbx_service_01
 ip_01  (ocf::heartbeat:IPaddr2):   Started node-01
 fs_01  (ocf::heartbeat:Filesystem):    Started node-01
 pbx_01 (ocf::heartbeat:Dummy): Started node-01
 Resource Group: pbx_service_02
 ip_02  (ocf::heartbeat:IPaddr2):   Started node-02
 fs_02  (ocf::heartbeat:Filesystem):    Started node-02
 pbx_02 (ocf::heartbeat:Dummy): Started node-02
 Master/Slave Set: ms-drbd_01
 Masters: [ node-01 ]
 Slaves: [ node-03 ]
 Master/Slave Set: ms-drbd_02
 Masters: [ node-02 ]
 Slaves: [ node-03 ]

[3] http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg04105.html


Re: [Pacemaker] location constraint question

2010-09-21 Thread Pavlos Parissis
On 21 September 2010 08:38, Andrew Beekhof  wrote:
>> BTW, why does crm_mon report only 4 resource?
>
> Because the drbd resources were made into master/slaves.
>
> See:
>   ms ms-drbd_01 drbd_01 \
>        meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
>
OK, thanks.

I tried several things today to avoid a location constraint directly on
the drbd ms resource, but nothing worked.
I am pretty sure that if you have an asymmetric cluster you need a
location constraint on the drbd ms resource.
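
That is, something like the scores already in my config:

location PrimaryNode-drbd_01 ms-drbd_01 100: node-01
location SecondaryNode-drbd_01 ms-drbd_01 0: node-03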

BTW Andrew, why shouldn't we have location preference constraints on
the Master role directly?

Cheers,
Pavlos



Re: [Pacemaker] location constraint question

2010-09-21 Thread Pavlos Parissis
On 21 September 2010 09:04, Andrew Beekhof  wrote:
> On Tue, Sep 21, 2010 at 8:58 AM, Pavlos Parissis
>  wrote:
>> On 21 September 2010 08:38, Andrew Beekhof  wrote:
>>>> BTW, why does crm_mon report only 4 resource?
>>>
>>> Because the drbd resources were made into master/slaves.
>>>
>>> See:
>>>   ms ms-drbd_01 drbd_01 \
>>>        meta master-max="1" master-node-max="1" clone-max="2"
>>> clone-node-max="1" notify="true"
>>>
>> OK, thanks.
>>
>> I tried several things today to avoid a location constraint directly
>> on the drbd ms resource, but nothing worked.
>> I am pretty sure that if you have an asymmetric cluster you need a
>> location constraint on the drbd ms resource.
>
> yep, otherwise it doesn't know where it's allowed to start
>
>>
>> BTW Andrew, why shouldn't we have location preference constraints on
>> the Master role directly?
>
> No reason at all. It's allowed.

Thanks for the clarification,
Pavlos



[Pacemaker] migration-threshold and failure-timeout

2010-09-21 Thread Pavlos Parissis
Hi,

I am trying to figure out a way to do the following: if the monitor of
resource x fails N times in a period Z, then fail over to the other
node and clear the fail-count.
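
Something along these lines is what I have in mind (a hedged sketch
with N=3 and Z=10 minutes; the resource x and its agent are
hypothetical):

crm configure primitive x ocf:heartbeat:Dummy \
    meta migration-threshold="3" failure-timeout="600" \
    op monitor interval="20s"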

Regards,
Pavlos


Re: [Pacemaker] migration-threshold and failure-timeout

2010-09-21 Thread Pavlos Parissis
On 21 September 2010 15:28, Vadym Chepkov  wrote:

> On Tue, Sep 21, 2010 at 9:14 AM, Dan Frincu  wrote:
> > Hi,
> >
> > This =>
> > http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-failure-migration.html
> > explains it pretty well. Notice the INFINITY score and what sets it.
> >
> > However I don't know of any automatic method to clear the failcount.
> >
> > Regards,
> > Dan
>
>
> in pacemaker 1.0 nothing will clean failcount automatically, this is a
> feature of pacemaker 1.1, imho
>
> But,
>
> crm configure rsc_defaults failure-timeout="10min"
>
> will make the cluster "forget" about a previous failure in 10 minutes.
> If you want to further decrease this parameter, you might need to decrease
>
> crm configure property cluster-recheck-interval="10min"
>
> Cheers,
> Vadym
>
>
Ok guys thank you very much for the info,
Pavlos


[Pacemaker] target-role default value

2010-09-24 Thread Pavlos Parissis
Hi,

What is the default value for target-role on a resource?
I tried to query it with crm_resource, but without success:
 crm_resource pbx_02 --get-property target-role
crm_resource pbx_02 --get-parameter target-role --meta


Cheers,
Pavlos


Re: [Pacemaker] target-role default value

2010-09-24 Thread Pavlos Parissis
On 24 September 2010 11:40, Michael Schwartzkopff wrote:

> On Friday 24 September 2010 11:34:11 Pavlos Parissis wrote:
> > Hi,
> >
> > What is the default value for target-role in resource?
> > I tried to query it with crm_resource but without success.
> >  crm_resource pbx_02 --get-property target-role
> > crm_resource pbx_02 --get-parameter target-role --meta
> >
> >
> > Cheers,
> > Pavlos
>
> started
>
>
Thanks.
How do I get the default values for parameters which are not set?
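
In the meantime, the defaults for RA parameters (as opposed to
meta-attributes like target-role) can at least be read from the agent
metadata, e.g. with the crm shell:

crm ra meta ocf:heartbeat:Dummy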

Thanks again,
Pavlos


[Pacemaker] default timeout for op start/stop

2010-09-24 Thread Pavlos Parissis
Hi,

When I verify my conf I get complaints about the timeouts on the start
and stop operations:
crm(live)configure# verify
WARNING: drbd_01: default timeout 20s for start is smaller than the advised 240
WARNING: drbd_01: default timeout 20s for stop is smaller than the advised 100
WARNING: drbd_02: default timeout 20s for start is smaller than the advised 240
WARNING: drbd_02: default timeout 20s for stop is smaller than the advised 100

Since I don't specifically set a timeout for the mentioned resources, I
thought this 20s was coming from the defaults.
So, I queried the defaults and got the following:
[r...@node-03 ~]# crm_attribute --type op_defaults --name timeout
scope=op_defaults  name=timeout value=(null)

So, I am wondering where this 20s is coming from.

I had the same issue for IP and Filesystem type resources, and to get
rid of the warning I explicitly set the timeout to 60s.
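
That is, per operation on each resource, as in the configuration below:

op start interval="0" timeout="60s" \
op stop interval="0" timeout="60s"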

Regards,
Pavlos


[r...@node-03 ~]# crm configure show
node $id="b8ad13a6-8a6e-4304-a4a1-8f69fa735100" node-02
node $id="d5557037-cf8f-49b7-95f5-c264927a0c76" node-01
node $id="e5195d6b-ed14-4bb3-92d3-9105543f9251" node-03
primitive drbd_01 ocf:linbit:drbd \
params drbd_resource="drbd_pbx_service_1" \
op monitor interval="30s"
primitive drbd_02 ocf:linbit:drbd \
params drbd_resource="drbd_pbx_service_2" \
op monitor interval="30s"
primitive fs_01 ocf:heartbeat:Filesystem \
params device="/dev/drbd1" directory="/pbx_service_01" fstype="ext3" \
meta migration-threshold="3" failure-timeout="60" \
op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="60s"
primitive fs_02 ocf:heartbeat:Filesystem \
params device="/dev/drbd2" directory="/pbx_service_02" fstype="ext3" \
meta migration-threshold="3" failure-timeout="60" \
op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="60s"
primitive ip_01 ocf:heartbeat:IPaddr2 \
params ip="10.10.10.10" cidr_netmask="25" broadcast="10.10.10.127" \
meta failure-timeout="120" migration-threshold="3" \
op monitor interval="5s"
primitive ip_02 ocf:heartbeat:IPaddr2 \
params ip="10.10.10.11" cidr_netmask="25" broadcast="10.10.10.127" \
op monitor interval="5s"
primitive pbx_01 ocf:heartbeat:Dummy \
params state="/pbx_service_01/Dummy.state" \
meta failure-timeout="60" migration-threshold="3" \
op monitor interval="20s" timeout="40s"
primitive pbx_02 ocf:heartbeat:Dummy \
params state="/pbx_service_02/Dummy.state" \
meta failure-timeout="60" migration-threshold="3"
group pbx_service_01 ip_01 fs_01 pbx_01 \
meta target-role="Started"
group pbx_service_02 ip_02 fs_02 pbx_02 \
meta target-role="Started"
ms ms-drbd_01 drbd_01 \
meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true"
ms ms-drbd_02 drbd_02 \
meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true" target-role="Started"
location PrimaryNode-drbd_01 ms-drbd_01 100: node-01
location PrimaryNode-drbd_02 ms-drbd_02 100: node-02
location PrimaryNode-pbx_service_01 pbx_service_01 200: node-01
location PrimaryNode-pbx_service_02 pbx_service_02 200: node-02
location SecondaryNode-drbd_01 ms-drbd_01 0: node-03
location SecondaryNode-drbd_02 ms-drbd_02 0: node-03
location SecondaryNode-pbx_service_01 pbx_service_01 10: node-03
location SecondaryNode-pbx_service_02 pbx_service_02 10: node-03
colocation fs_01-on-drbd_01 inf: fs_01 ms-drbd_01:Master
colocation fs_02-on-drbd_02 inf: fs_02 ms-drbd_02:Master
colocation pbx_01-with-fs_01 inf: pbx_01 fs_01
colocation pbx_01-with-ip_01 inf: pbx_01 ip_01
colocation pbx_02-with-fs_02 inf: pbx_02 fs_02
colocation pbx_02-with-ip_02 inf: pbx_02 ip_02
order fs_01-after-drbd_01 inf: ms-drbd_01:promote fs_01:start
order fs_02-after-drbd_02 inf: ms-drbd_02:promote fs_02:start
order pbx_01-after-fs_01 inf: fs_01 pbx_01
order pbx_01-after-ip_01 inf: ip_01 pbx_01
order pbx_02-after-fs_02 inf: fs_02 pbx_02
order pbx_02-after-ip_02 inf: ip_02 pbx_02
property $id="cib-bootstrap-options" \
dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
cluster-infrastructure="Heartbeat" \
stonith-enabled="false" \
symmetric-cluster="false" \
last-lrm-refresh="1285323745"
rsc_defaults $id="rsc-options" \
resource-stickiness="1000"


Re: [Pacemaker] default timeout for op start/stop

2010-09-24 Thread Pavlos Parissis
On 24 September 2010 13:54, Michael Schwartzkopff wrote:

> On Friday 24 September 2010 13:50:49 Pavlos Parissis wrote:
> > Hi,
> >
> > When I verify my conf I get complaints about the timeouts on the start
> > and stop operations:
> > crm(live)configure# verify
> > WARNING: drbd_01: default timeout 20s for start is smaller than the
> > advised 240
> > [...snip...]
> >
> > Since I don't specifically set a timeout for the mentioned resources, I
> > thought this 20s was coming from the defaults.
> > So, I queried the defaults and got the following:
> > [r...@node-03 ~]# crm_attribute --type op_defaults --name timeout
> > scope=op_defaults  name=timeout value=(null)
>
> Default timeout is coded into the resource agent. You can safely ignore
> the WARNINGs. These are also removed from more recent versions of
> pacemaker.

thanks again
Pavlos


Re: [Pacemaker] default timeout for op start/stop

2010-09-27 Thread Pavlos Parissis
On 24 September 2010 18:12, Dejan Muhamedagic  wrote:
[...snip...]

> > Default timeout is coded into the resource agent. You can safely
> > ignore the WARNINGs. These are also removed from more recent
> > versions of pacemaker.
>
> These warnings shouldn't be ignored. The defaults which are coded
> in the RA are what the author of the RA advised as minimum. These
> values are, however, not used automatically by the CRM, so they
> need to be specified in the configuration. And then the resources
> should be thoroughly tested to see if the timeouts are meaningful
> in the given environment.
>
> Thanks,

Are you saying that if timeouts are not set, the CRM will wait forever
for each operation to finish?

Regards,
Pavlos


Re: [Pacemaker] default timeout for op start/stop

2010-09-27 Thread Pavlos Parissis
On 27 September 2010 12:17, Dejan Muhamedagic  wrote:

> Hi,
>
> On Mon, Sep 27, 2010 at 12:00:19PM +0200, Pavlos Parissis wrote:
> [...snip...]
> > Are you saying that if timeouts are not set, the CRM will wait forever
> > for each operation to finish?
>
> No. It will use the global default timeout value
> (default-action-timeout) which is set to 20s. That's why the
> shell issues the warnings: 20s is shorter than what has been
> advertised in the meta-data of the RA you want to configure.
>

OK, thanks.
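
For the archives, raising the global default would be one way to
silence these warnings (a sketch; default-action-timeout as named
above):

crm configure property default-action-timeout="60s"
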
Pavlos


[Pacemaker] crm resource move doesn't move the resource

2010-09-28 Thread Pavlos Parissis
Hi,


When I issue "crm resource move pbx_service_01 node-0N" it moves the
resource group, but the fs_01 resource is not started because drbd_01
is still running on the other node; it is not moved to node-0N as well,
even though I have colocation constraints.
I am pretty sure that I had this working before, but I can't figure out
why it doesn't work anymore.
The pbx_service_01 and drbd_01 resources are moved to another node in
case of failure, but for some reason not manually.

Can you see in my conf where the problem could be? I have already spent
some time on it and I think I can't see the obvious anymore :-(
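
To see what the PE decides, I compare the allocation scores right after
the move (ptest ships with pacemaker; the grep just cuts the noise):

crm resource move pbx_service_01 node-03
ptest -Ls | grep -E 'drbd_01|pbx_service_01'
crm resource unmove pbx_service_01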


node $id="b8ad13a6-8a6e-4304-a4a1-8f69fa735100" node-02
node $id="d5557037-cf8f-49b7-95f5-c264927a0c76" node-01
node $id="e5195d6b-ed14-4bb3-92d3-9105543f9251" node-03
primitive drbd_01 ocf:linbit:drbd \
params drbd_resource="drbd_pbx_service_1" \
op monitor interval="30s"
primitive drbd_02 ocf:linbit:drbd \
params drbd_resource="drbd_pbx_service_2" \
op monitor interval="30s"
primitive fs_01 ocf:heartbeat:Filesystem \
params device="/dev/drbd1" directory="/pbx_service_01" fstype="ext3" \
meta migration-threshold="3" failure-timeout="60" \
op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20"
primitive fs_02 ocf:heartbeat:Filesystem \
params device="/dev/drbd2" directory="/pbx_service_02" fstype="ext3" \
meta migration-threshold="3" failure-timeout="60" \
op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20"
primitive ip_01 ocf:heartbeat:IPaddr2 \
params ip="10.10.10.10" cidr_netmask="25" broadcast="10.10.10.127" \
meta failure-timeout="120" migration-threshold="3" \
op monitor interval="5s"
primitive ip_02 ocf:heartbeat:IPaddr2 \
params ip="10.10.10.11" cidr_netmask="25" broadcast="10.10.10.127" \
op monitor interval="5s"
primitive pbx_01 ocf:heartbeat:Dummy \
params state="/pbx_service_01/Dummy.state" \
meta failure-timeout="60" migration-threshold="3"
target-role="Started" \
op monitor interval="20s" timeout="40s"
primitive pbx_02 ocf:heartbeat:Dummy \
params state="/pbx_service_02/Dummy.state" \
meta failure-timeout="60" migration-threshold="3"
group pbx_service_01 ip_01 fs_01 pbx_01 \
meta target-role="Started"
group pbx_service_02 ip_02 fs_02 pbx_02 \
meta target-role="Started"
ms ms-drbd_01 drbd_01 \
meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true" target-role="Started"
ms ms-drbd_02 drbd_02 \
meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true" target-role="Started"
location PrimaryNode-drbd_01 ms-drbd_01 100: node-01
location PrimaryNode-drbd_02 ms-drbd_02 100: node-02
location PrimaryNode-pbx_service_01 pbx_service_01 200: node-01
location PrimaryNode-pbx_service_02 pbx_service_02 200: node-02
location SecondaryNode-drbd_01 ms-drbd_01 0: node-03
location SecondaryNode-drbd_02 ms-drbd_02 0: node-03
location SecondaryNode-pbx_service_01 pbx_service_01 10: node-03
location SecondaryNode-pbx_service_02 pbx_service_02 10: node-03
colocation fs_01-on-drbd_01 inf: fs_01 ms-drbd_01:Master
colocation fs_02-on-drbd_02 inf: fs_02 ms-drbd_02:Master
colocation pbx_01-with-fs_01 inf: pbx_01 fs_01
colocation pbx_01-with-ip_01 inf: pbx_01 ip_01
colocation pbx_02-with-fs_02 inf: pbx_02 fs_02
colocation pbx_02-with-ip_02 inf: pbx_02 ip_02
order fs_01-after-drbd_01 inf: ms-drbd_01:promote fs_01:start
order fs_02-after-drbd_02 inf: ms-drbd_02:promote fs_02:start
order pbx_01-after-fs_01 inf: fs_01 pbx_01
order pbx_01-after-ip_01 inf: ip_01 pbx_01
order pbx_02-after-fs_02 inf: fs_02 pbx_02
order pbx_02-after-ip_02 inf: ip_02 pbx_02
property $id="cib-bootstrap-options" \
dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
cluster-infrastructure="Heartbeat" \
stonith-enabled="false" \
symmetric-cluster="false" \
last-lrm-refresh="1285323745"
rsc_defaults $id="rsc-options" \
resource-stickiness="1000"


[Pacemaker] promote a ms resource to a node

2010-09-28 Thread Pavlos Parissis
Hi,

Let's say that I have manually demoted a ms resource and have the
following situation:
crm(live)resource# demote ms-drbd_01
crm(live)resource# status
[..snip..]
Master/Slave Set: ms-drbd_01
 Slaves: [ node-01 node-03 ]

How can I manually promote ms-drbd_01 on node-03?
The promote command doesn't accept node names, and the move command on
ms-drbd_01 says it can't find the resource.

Cheers,
Pavlos


Re: [Pacemaker] crm resource move doesn't move the resource

2010-09-29 Thread Pavlos Parissis
On 28 September 2010 15:09, Pavlos Parissis wrote:

> [...snip...]

Just to note that this issue applies to only one of the resource
groups, even though the conf is the same for both of them!

So, after hours of running the same test again and again, and reading
10 lines of logs (BTW, it seems that they say in a clear way why
certain things happen), I decided to recreate the drbd_01 and
ms-drbd_01 resources and adjust the order constraints.
Before, it was like this:
order fs_01-after-drbd_01 inf: ms-drbd_01:promote fs_01:start
order fs_02-after-drbd_02 inf: ms-drbd_02:promote fs_02:start
order pbx_01-after-fs_01 inf: fs_01 pbx_01
order pbx_01-after-ip_01 inf: ip_01 pbx_01
order pbx_02-after-fs_02 inf: fs_02 pbx_02
order pbx_02-after-ip_02 inf: ip_02 pbx_02

and now it is like this:
order fs_02-after-drbd_02 inf: ms-drbd_02:promote fs_02:start
order pbx_02-after-fs_02 inf: fs_02 pbx_02
order pbx_02-after-ip_02 inf: ip_02 pbx_02
order pbx_service_01-after-drbd_01 inf: ms-drbd_01:promote
pbx_service_01:start

As you can see, no major changes.

The end result is that now, every time I issue "crm resource move
pbx_service_01 node-0N", drbd_01 is promoted on that node as well and
the whole resource group is started! So, the issue is solved, but I
don't like it for a very simple reason: I don't know why it didn't work
before, and that scares me!

Cheers,
Pavlos


Re: [Pacemaker] Does bond0 network interface work with corosync/pacemaker

2010-09-29 Thread Pavlos Parissis
Please paste the conf of corosync; without supplying the conf it is
quite difficult to help you.
Cheers,
Pavlos



Re: [Pacemaker] Does bond0 network interface work with corosync/pacemaker

2010-09-29 Thread Pavlos Parissis
On 29 September 2010 21:01, Andreas Hofmeister  wrote:

>  On 29.09.2010 19:59, Mike A Meyer wrote:
>
> We have two nodes that we have the IP address assigned to a bond0 network
> interface instead of the usual eth0 network interface.  We are wondering if
> there are issues with trying to configure corosync/pacemaker with an IP
> assigned to a bond0 network interface.  We are seeing that
> corosync/pacemaker will start on both nodes, but it doesn't detect other
> nodes in the cluster.  We do have SELinux and the firewall shut off on both
> nodes.  Any information would be helpful.
>
>
> We run the cluster stuff on bonding devices (actually on a VLan on top of a
> bond)  and it works well. We use it in a two-node setup in round-robin mode,
> the nodes are connected back-to-back (i.e. no Switch in between).
>
> If you use bonding over a Switch, check your bonding mode - round-robin
> just won't work. Try LACP if you have connected each node to  a single
> switch or if your Switches support link aggregation over multiple Devices
> (the cheaper ones won't). Try "active-backup" with multiple switches.
>
> To check your configuration, use "ping" and check the "icmp_seq" in the
> replies. If some sequence number is missing, your setup is probably broken.
>
>
It is quite common to connect both interfaces of a bond to the same
switch and then face issues.
Mike, you need to tell us a bit more about the layer 2 connectivity and
what it looks like.

We also use active-backup mode on our bond interfaces, but with 2
switches, and it works without any problem.
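
A quick way to sanity-check the bond itself (a sketch; replace the peer
address with your own):

# which slave is active, and are the MII links up?
cat /proc/net/bonding/bond0
# watch icmp_seq for gaps while failing each link over
ping -i 0.2 <address-of-peer-node>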

Cheers,
Pavlos


Re: [Pacemaker] Does bond0 network interface work with corosync/pacemaker

2010-09-30 Thread Pavlos Parissis
On 30 September 2010 15:23, Mike A Meyer  wrote:

> Pavlos,
>
> Thanks for helping out on this.  We are running on RHEL 5.5 running on the
> iron and not a VM.   We don't have SELinux turned on and the firewall is
> disabled.  Here is information in the /etc/modprobe.conf file.
>
> alias eth0 bnx2
> alias eth1 bnx2
> alias scsi_hostadapter cciss
> alias scsi_hostadapter1 qla2xxx
> alias scsi_hostadapter2 usb-storage
> alias bond0 bonding
> options bond0 mode=1 miimon=100
> options lpfc lpfc_lun_queue_depth=16 lpfc_nodev_tmo=30
> lpfc_discovery_threads=32
>
>
> We did take off the bond0 as a test and now only have our IP address
> assigned to eth0 and still having the same problem when starting corosync.
> The problem we are finding in the /var/log/cluster/corosync.log file is
> below.
>
> Sep 30 07:58:57 e-magdb1.buysub.com crmd: [28406]: info: crm_timer_popped:
> Election Trigger (I_DC_TIMEOUT) just popped!
> Sep 30 07:58:57 e-magdb1.buysub.com crmd: [28406]: WARN: do_log: FSA:
> Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
> Sep 30 07:58:57 e-magdb1.buysub.com crmd: [28406]: info:
> do_state_transition: State transition S_PENDING -> S_ELECTION [
> input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped ]
>
> What could this 'just popped' message mean?

I have no idea about the meaning of this message. But let's exclude any
network issues first.
Does ping between the nodes work?
If you run tcpdump on the interface and then start corosync, do you see
multicast packets arriving?

Unfortunately I don't use corosync (I use heartbeat), so I can't compare
your conf with mine or tell whether you have any conf issue.
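
For the tcpdump check, something along these lines should do (a sketch;
5405 is corosync's default mcastport, adjust to your corosync.conf):

tcpdump -ni eth0 udp port 5405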

Cheers,
Pavlos


Re: [Pacemaker] resources are restarted without obvious reasons

2010-10-01 Thread Pavlos Parissis
Hi,
Could it be related to the possible bug mentioned here[1]?

BTW, here is the conf of pacemaker:
node $id="b8ad13a6-8a6e-4304-a4a1-8f69fa735100" node-02
node $id="d5557037-cf8f-49b7-95f5-c264927a0c76" node-01
node $id="e5195d6b-ed14-4bb3-92d3-9105543f9251" node-03
primitive drbd_01 ocf:linbit:drbd \
params drbd_resource="drbd_pbx_service_1" \
op monitor interval="30s" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="120s"
primitive drbd_02 ocf:linbit:drbd \
params drbd_resource="drbd_pbx_service_2" \
op monitor interval="30s" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="120s"
primitive fs_01 ocf:heartbeat:Filesystem \
params device="/dev/drbd1" directory="/pbx_service_01" fstype="ext3" \
meta migration-threshold="3" failure-timeout="60" \
op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="60s"
primitive fs_02 ocf:heartbeat:Filesystem \
params device="/dev/drbd2" directory="/pbx_service_02" fstype="ext3" \
meta migration-threshold="3" failure-timeout="60" \
op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="60s"
primitive ip_01 ocf:heartbeat:IPaddr2 \
params ip="192.168.78.10" cidr_netmask="24"
broadcast="192.168.78.255" \
meta failure-timeout="120" migration-threshold="3" \
op monitor interval="5s"
primitive ip_02 ocf:heartbeat:IPaddr2 \
params ip="192.168.78.20" cidr_netmask="24"
broadcast="192.168.78.255" \
op monitor interval="5s"
primitive pbx_01 lsb:test-01 \
meta failure-timeout="60" migration-threshold="3"
target-role="Started" \
op monitor interval="20s" \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="60s"
primitive pbx_02 lsb:test-02 \
meta failure-timeout="60" migration-threshold="3"
target-role="Started" \
op monitor interval="20s" \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="60s"
group pbx_service_01 ip_01 fs_01 pbx_01 \
meta target-role="Started"
group pbx_service_02 ip_02 fs_02 pbx_02 \
meta target-role="Started"
ms ms-drbd_01 drbd_01 \
meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true" target-role="Started"
ms ms-drbd_02 drbd_02 \
meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true" target-role="Started"
location PrimaryNode-drbd_01 ms-drbd_01 100: node-01
location PrimaryNode-drbd_02 ms-drbd_02 100: node-02
location PrimaryNode-pbx_service_01 pbx_service_01 200: node-01
location PrimaryNode-pbx_service_02 pbx_service_02 200: node-02
location SecondaryNode-drbd_01 ms-drbd_01 0: node-03
location SecondaryNode-drbd_02 ms-drbd_02 0: node-03
location SecondaryNode-pbx_service_01 pbx_service_01 10: node-03
location SecondaryNode-pbx_service_02 pbx_service_02 10: node-03
colocation fs_01-on-drbd_01 inf: fs_01 ms-drbd_01:Master
colocation fs_02-on-drbd_02 inf: fs_02 ms-drbd_02:Master
order pbx_service_01-after-drbd_01 inf: ms-drbd_01:promote
pbx_service_01:start
order pbx_service_02-after-drbd_02 inf: ms-drbd_02:promote
pbx_service_02:start
property $id="cib-bootstrap-options" \
dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
cluster-infrastructure="Heartbeat" \
stonith-enabled="false" \
symmetric-cluster="false" \
last-lrm-refresh="1285323745"
rsc_defaults $id="rsc-options" \

Cheers,
Pavlos




[1]
http://oss.clusterlabs.org/pipermail/pacemaker/2010-September/007624.html


Re: [Pacemaker] resources are restarted without obvious reasons

2010-10-01 Thread Pavlos Parissis
Hi,
It seems that it happens every time the PE wants to re-check the conf:
09:23:55 crmd: [3473]: info: crm_timer_popped: PEngine Recheck Timer
(I_PE_CALC) just popped!

and then check_rsc_parameters() wants to restart my resources:

09:23:55 pengine: [3979]: notice: check_rsc_parameters: Forcing restart of
pbx_02 on node-02, provider changed: heartbeat -> 
09:23:55 pengine: [3979]: notice: DeleteRsc: Removing pbx_02 from node-02
09:23:55 pengine: [3979]: notice: check_rsc_parameters: Forcing restart of
pbx_01 on node-01, provider changed: heartbeat -> 

Looking at the code, I can't tell whether the issue is in the actual
conf or whether I am hitting a bug:
static gboolean
check_rsc_parameters(resource_t *rsc, node_t *node, xmlNode *rsc_entry,
 pe_working_set_t *data_set)
{
int attr_lpc = 0;
gboolean force_restart = FALSE;
gboolean delete_resource = FALSE;

const char *value = NULL;
const char *old_value = NULL;
const char *attr_list[] = {
XML_ATTR_TYPE,
XML_AGENT_ATTR_CLASS,
XML_AGENT_ATTR_PROVIDER
};

for(; attr_lpc < DIMOF(attr_list); attr_lpc++) {
value = crm_element_value(rsc->xml, attr_list[attr_lpc]);
old_value = crm_element_value(rsc_entry, attr_list[attr_lpc]);
if(value == old_value /* ie. NULL */
   || crm_str_eq(value, old_value, TRUE)) {
continue;
}

force_restart = TRUE;
crm_notice("Forcing restart of %s on %s, %s changed: %s -> %s",
   rsc->id, node->details->uname, attr_list[attr_lpc],
   crm_str(old_value), crm_str(value));
}
if(force_restart) {
/* make sure the restart happens */
stop_action(rsc, node, FALSE);
set_bit(rsc->flags, pe_rsc_start_pending);
delete_resource = TRUE;
}
return delete_resource;
}


On 1 October 2010 09:13, Pavlos Parissis  wrote:

> Hi
> Could it be related to the possible bug mentioned here[1]?
>
> BTW, here is the conf of pacemaker:
> [...snip...]
Re: [Pacemaker] crm resource move doesn't move the resource

2010-10-02 Thread Pavlos Parissis
Hi,

I am having the same issue again, on a different set of 3 nodes. When I
try to manually fail over the resource group to the standby node, the
ms-drbd resource is not moved as well, and as a result the resource
group is not fully started; only the ip resource is started.
Any ideas why I am having this issue?

Here is the info:
[r...@node-01 ~]# crm resource move pbx_service_01 node-03
[r...@node-01 ~]# crm resource unmove pbx_service_01
[r...@node-01 ~]# ptest -Ls
Allocation scores:
clone_color: ms-drbd_01 allocation score on node-01: 100
clone_color: ms-drbd_01 allocation score on node-03: 0
clone_color: drbd_01:0 allocation score on node-01: 11100
clone_color: drbd_01:0 allocation score on node-03: 0
clone_color: drbd_01:1 allocation score on node-01: 100
clone_color: drbd_01:1 allocation score on node-03: 11000
native_color: drbd_01:0 allocation score on node-01: 11100
native_color: drbd_01:0 allocation score on node-03: 0
native_color: drbd_01:1 allocation score on node-01: -100
native_color: drbd_01:1 allocation score on node-03: 11000
drbd_01:0 promotion score on node-01: 10100
drbd_01:1 promotion score on node-03: 1
drbd_01:2 promotion score on none: 0
group_color: pbx_service_01 allocation score on node-01: 200
group_color: pbx_service_01 allocation score on node-03: 10
group_color: ip_01 allocation score on node-01: 200
group_color: ip_01 allocation score on node-03: 1010
group_color: fs_01 allocation score on node-01: 0
group_color: fs_01 allocation score on node-03: 0
group_color: pbx_01 allocation score on node-01: 0
group_color: pbx_01 allocation score on node-03: 0
native_color: ip_01 allocation score on node-01: 200
native_color: ip_01 allocation score on node-03: 1010
drbd_01:0 promotion score on node-01: 100
drbd_01:1 promotion score on node-03: -100
drbd_01:2 promotion score on none: 0
native_color: fs_01 allocation score on node-01: -100
native_color: fs_01 allocation score on node-03: -100
native_color: pbx_01 allocation score on node-01: -100
native_color: pbx_01 allocation score on node-03: -100


[r...@node-01 ~]# crm status

Last updated: Sat Oct  2 18:27:32 2010
Stack: Heartbeat
Current DC: node-03 (3dd75a8f-9819-450f-9f18-c27730665925) - partition with
quorum
Version: 1.0.9-89bd754939df5150de7cd76835f98fe90851b677
3 Nodes configured, unknown expected votes
2 Resources configured.


Online: [ node-03 node-01 node-02 ]

 Master/Slave Set: ms-drbd_01
 Masters: [ node-01 ]
 Slaves: [ node-03 ]
 Resource Group: pbx_service_01
 ip_01  (ocf::heartbeat:IPaddr2):   Started node-03
 fs_01  (ocf::heartbeat:Filesystem):Stopped
 pbx_01 (lsb:test-01):  Stopped


[r...@node-01 ~]# crm configure show
node $id="3dd75a8f-9819-450f-9f18-c27730665925" node-03
node $id="4e47db29-5f14-4371-9734-317bf342b8ed" node-02
node $id="a8f56e42-438f-4ea5-a6ba-a7f1d23ed401" node-01
primitive drbd_01 ocf:linbit:drbd \
params drbd_resource="drbd_pbx_service_1" \
op monitor interval="30s" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="120s"
primitive fs_01 ocf:heartbeat:Filesystem \
params device="/dev/drbd1" directory="/pbx_service_01" fstype="ext3" \
meta migration-threshold="3" failure-timeout="60" \
op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="60s"
primitive ip_01 ocf:heartbeat:IPaddr2 \
params ip="192.168.78.10" cidr_netmask="24"
broadcast="192.168.78.255" \
meta failure-timeout="120" migration-threshold="3" \
op monitor interval="5s"
primitive pbx_01 lsb:test-01 \
meta migration-threshold="3" failure-timeout="60" \
op monitor interval="20s" timeout="20s" \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="60s"
group pbx_service_01 ip_01 fs_01 pbx_01 \
meta target-role="Started"
ms ms-drbd_01 drbd_01 \
meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true" target-role="Started"
location PrimaryNode-drbd_01 ms-drbd_01 100: node-01
location PrimaryNode-pbx_service_01 pbx_service_01 200: node-01
location SecondaryNode-drbd_01 ms-drbd_01 0: node-03
location SecondaryNode-pbx_service_01 pbx_service_01 10: node-03
colocation fs_01-on-drbd_01 inf: fs_01 ms-drbd_01:Master
order pbx_service_01-after-drbd_01 inf: ms-drbd_01:promote
pbx_service_01:start
property $id="cib-bootstrap-options" \
dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
cluster-infrastructure="Heartbeat" \
symmetric-cluster="false" \
stonith-enabled="false"
rsc_defaults $id="rsc-options" \
resource-stickiness="1000"
[r...@node-01 ~]#

Thanks,
Pavlos

Re: [Pacemaker] crm resource move doesn't move the resource

2010-10-02 Thread Pavlos Parissis
I am wondering if resource-stickiness="1000" could be the reason for the
behavior I see, but then again, when I recreated the ms-drbd resource on
the other cluster, the issue was solved.


Re: [Pacemaker] promote a ms resource to a node

2010-10-03 Thread Pavlos Parissis
Just for the record, here is the constraint:
location master-location ms-drbd_02 \
rule $id="master-rule" $role="Master" 1000: #uname eq node-03
Cheers,
Pavlos


On 30 September 2010 10:24, Andrew Beekhof  wrote:

> A resource location constraint with role=Master would do it.
> Not sure about the shell syntax though.
>
> On Tue, Sep 28, 2010 at 3:51 PM, Pavlos Parissis wrote:
> [...snip...]


[Pacemaker] Recommend Fencing device

2010-10-04 Thread Pavlos Parissis
Hi

Which fencing devices would you recommend? I want to use a device that
will give as few problems as possible when configuring a fencing
resource for a 3-node cluster.

Regards,
Pavlos


Re: [Pacemaker] resources are restarted without obvious reasons

2010-10-05 Thread Pavlos Parissis
On 5 October 2010 11:15, Andrew Beekhof  wrote:

> On Fri, Oct 1, 2010 at 9:53 AM, Pavlos Parissis
>  wrote:
> > Hi,
> > It seams that it happens every time PE wants to check the conf
> > 09:23:55 crmd: [3473]: info: crm_timer_popped: PEngine Recheck Timer
> > (I_PE_CALC) just popped!
> >
> > and then check_rsc_parameters() wants to reset my resources
> >
> > 09:23:55 pengine: [3979]: notice: check_rsc_parameters: Forcing restart of
> > pbx_02 on node-02, provider changed: heartbeat -> 
> > 09:23:55 pengine: [3979]: notice: DeleteRsc: Removing pbx_02 from node-02
> > 09:23:55 pengine: [3979]: notice: check_rsc_parameters: Forcing restart of
> > pbx_01 on node-01, provider changed: heartbeat -> 
>
> Could be a bug in the code that detects changes to the resource definition.
> Could you file a bug please?
>
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>
>
here it is http://developerbugs.linux-foundation.org/show_bug.cgi?id=2504
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] init Script fails in 1 of LSB Compatible test

2010-10-05 Thread Pavlos Parissis
Hi,

I am thinking of putting sshd under cluster control and I am checking whether
the /etc/init.d/sshd supplied by RedHat 5.4 is LSB-compatible.
So I ran the tests mentioned here [1]; it fails at test 6, returning 1
and a failed message.
Could this create problems within pacemaker?
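
For reference, my reading of the failing check: test 6 of [1] stops an
already-stopped service and expects exit code 0. A minimal sketch:

/etc/init.d/sshd stop   # sshd is already stopped at this point
echo $?                 # LSB expects 0; the RedHat script prints "FAILED"
                        # and returns 1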

Regards,
Pavlos




[1]
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-lsb.html
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] init Script fails in 1 of LSB Compatible test

2010-10-05 Thread Pavlos Parissis
On 5 October 2010 13:19, Andrew Beekhof  wrote:

> On Tue, Oct 5, 2010 at 12:51 PM, Pavlos Parissis
>  wrote:
> > Hi,
> >
> > I am thinking to put under cluster control the sshd and I am checking if
> the
> > /etc/init.d/sshd supplied by RedHat 5.4 is compatible with LSB.
> > So, I run the test mentioned here [1] and it fails at test 6, it returns
> 1
> > and failed message.
> > Could this create problems within pacemaker?
>
> yes
>
>
What kind of problems, and why?

Regards,
Pavlos
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Online and Offline status when doing crm_mon

2010-10-05 Thread Pavlos Parissis
On 5 October 2010 22:12, Mike A Meyer  wrote:

> We are setup in a two node active/passive cluster using pacemaker/corosync.
>  We shutdown the pacemaker/corosync on both nodes and changed the uname -n
> on our nodes to show the short name instead of the FQDN.  Started up
> pacemaker/corosync and ever since we did that, when we run the crm_mon
> command, we see this below.
>
> 
> Last updated: Tue Oct  5 13:28:16 2010
> Stack: openais
> Current DC: e-magdb2 - partition with quorum
> Version: 1.0.9-89bd754939df5150de7cd76835f98fe90851b677
> 4 Nodes configured, 2 expected votes
> 2 Resources configured.
> 
>
> Online: [ e-magdb2 e-magdb1 ]
> OFFLINE: [ e-magdb1.testingpcmk.com e-magdb2.testingpcmkr.com ]
>
> We did edit the crm configuration file to use short names for both nodes.
> We can ping both the short name and the FQDN on our internal network and
> both come back with the right IP address.  We are running on RHEL 5.
>  Anybody have any ideas why the FQDN entries show offline since we
> configured pacemaker/corosync to use short names?  Is it grabbing it from
> internal DNS from the IP address we have in the /etc/corosync.conf file?
>  Everything seems to be working correctly and failing over correctly.
>  Should this be something to worry about though or is it a display bug
> maybe?  Below is the corosync.conf file.
>

Did you follow this[1] procedure?

Changing the names in the corosync conf file is not enough.
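
As far as I understand, that procedure boils down to deleting the stale
node entries after the rename; a sketch with the node names from your
output (crm shell):

crm node delete e-magdb1.testingpcmk.com
crm node delete e-magdb2.testingpcmkr.com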


[1]
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-node-delete.html
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] crm resource move doesn't move the resource

2010-10-07 Thread Pavlos Parissis
On 7 October 2010 09:01, Andrew Beekhof  wrote:

> On Sat, Oct 2, 2010 at 6:31 PM, Pavlos Parissis
>  wrote:
> > Hi,
> >
> > I am having again the same issue, in a different set of 3 nodes. When I
> try
> > to failover manually the resource group on the standby node, the ms-drbd
> > resource is not moved as well and as a result the resource group is not
> > fully started, only the ip resource is started.
> > Any ideas why I am having this issue?
>
> I think it's a bug that was fixed recently.  Could you try the latest
> code from Mercurial?


1.1 or 1.2 branch?
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] pacemaker version

2010-10-07 Thread Pavlos Parissis
On 7 October 2010 08:33, Andrew Beekhof  wrote:
>
> On Wed, Oct 6, 2010 at 5:04 PM, Gianluca Cecchi
>  wrote:
> > On Wed, Oct 6, 2010 at 4:25 PM, Shravan Mishra  
> > wrote:
> >> That is what I heard too, that's the reason for this question.
> >>
> >
> > On June, inside a complex thread regarding "colocation -inf", Andrew
> > reported the link and also several clarifications after some questions
> > of mine...
> >
> > See in particular:
> > http://oss.clusterlabs.org/pipermail/pacemaker/2010-June/006606.html
> > http://oss.clusterlabs.org/pipermail/pacemaker/2010-June/006610.html
> > http://oss.clusterlabs.org/pipermail/pacemaker/2010-June/006620.html
> >
> > I think they are still valid...
>
> Absolutely.  These will all be valid until 1.2 comes out (and then
> they'll apply to 1.3 instead :-)

I am a bit confused about the meaning of the pacemaker-1.0, 1.1 and 1.2
schemas, mentioned here
http://theclusterguy.clusterlabs.org/post/441442543/new-pacemaker-release-series

Let's say I have installed 1.1; how do I know if I use the pacemaker-1.0 or 1.2 schema?
Sorry if it sounds stupid but I simply don't understand it.

Cheers,
Pavlos

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] crm resource move doesn't move the resource

2010-10-07 Thread Pavlos Parissis
On 8 October 2010 04:26, jiaju liu  wrote:

> Message: 2
> Date: Thu, 7 Oct 2010 21:58:29 +0200
> From: Pavlos Parissis 
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] crm resource move doesn't move the resource
> Content-Type: text/plain; charset="utf-8"
>
> On 7 October 2010 09:01, Andrew Beekhof  wrote:
>
> > On Sat, Oct 2, 2010 at 6:31 PM, Pavlos Parissis
> >  wrote:
> > > Hi,
> > >
> > > I am having again the same issue, in a different set of 3 nodes. When I
> > > try to failover manually the resource group on the standby node, the
> > > ms-drbd resource is not moved as well and as a result the resource
> > > group is not fully started, only the ip resource is started.
> > > Any ideas why I am having this issue?
> >
> > I think it's a bug that was fixed recently.  Could you try the latest
> > code from Mercurial?
>
> Maybe you should clear failcount
>

The failcount was 0.
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] pacemaker version

2010-10-07 Thread Pavlos Parissis
On 8 October 2010 07:47, Andrew Beekhof  wrote:
> On Thu, Oct 7, 2010 at 10:10 PM, Pavlos Parissis
>  wrote:
>> On 7 October 2010 08:33, Andrew Beekhof  wrote:
>>>
>>> On Wed, Oct 6, 2010 at 5:04 PM, Gianluca Cecchi
>>>  wrote:
>>> > On Wed, Oct 6, 2010 at 4:25 PM, Shravan Mishra  
>>> > wrote:
>>> >> That is what I heard too, that's the reason for this question.
>>> >>
>>> >
>>> > On June, inside a complex thread regarding "colocation -inf", Andrew
>>> > reported the link and also several clarifications after some questions
>>> > of mine...
>>> >
>>> > See in particular:
>>> > http://oss.clusterlabs.org/pipermail/pacemaker/2010-June/006606.html
>>> > http://oss.clusterlabs.org/pipermail/pacemaker/2010-June/006610.html
>>> > http://oss.clusterlabs.org/pipermail/pacemaker/2010-June/006620.html
>>> >
>>> > I think they are still valid...
>>>
>>> Absolutely.  These will all be valid until 1.2 comes out (and then
>>> they'll apply to 1.3 instead :-)
>>
>> I am a bit confused about the meaning of pacemaker-1.0,1.1 and 1.2
>> schema, mentioned here
>> http://theclusterguy.clusterlabs.org/post/441442543/new-pacemaker-release-series
>>
>> Let's I have installed 1.1 how do I know if I use pacemaker-1.o or 1.2 
>> schema?
>> Sorry if it sounds stupid but I simple don't understand it
>
> cibadmin -Ql | grep validate
>


[r...@node-01 ~]# cibadmin -Ql | grep validate
<cib validate-with="pacemaker-1.0" have-quorum="1" dc-uuid="b7764e7b-0a00-4745-8d9e-6911271eefb2" admin_epoch="0" epoch="271" num_updates="3">

Since validate-with is set to pacemaker-1.0, I am using the pacemaker-1.0
schema, right?
So, if I upgrade to 1.1.3 and leave validate-with set to pacemaker-1.0, I
will run a stable 1.1.3, but if I set it to pacemaker-1.1 I will be
running a "testing|unstable" 1.1.3. Have I understood it correctly?
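
For what it's worth, a sketch of how I understand the schema can be bumped
later on, assuming cibadmin's --upgrade option:

cibadmin --upgrade            # rewrites validate-with to the newest schema
cibadmin -Ql | grep validate  # confirm the change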

Thanks,
Pavlos

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] crm resource move doesn't move the resource

2010-10-07 Thread Pavlos Parissis
On 8 October 2010 08:29, Andrew Beekhof  wrote:
> On Thu, Oct 7, 2010 at 9:58 PM, Pavlos Parissis
>  wrote:
>>
>>
>> On 7 October 2010 09:01, Andrew Beekhof  wrote:
>>>
>>> On Sat, Oct 2, 2010 at 6:31 PM, Pavlos Parissis
>>>  wrote:
>>> > Hi,
>>> >
>>> > I am having again the same issue, in a different set of 3 nodes. When I
>>> > try
>>> > to failover manually the resource group on the standby node, the ms-drbd
>>> > resource is not moved as well and as a result the resource group is not
>>> > fully started, only the ip resource is started.
>>> > Any ideas why I am having this issue?
>>>
>>> I think it's a bug that was fixed recently.  Could you try the latest
>>> code from Mercurial?
>>
>> 1.1 or 1.2 branch?
>
> 1.1
>
To save time on compiling stuff I want to use the available 1.1.3 rpms
from the rpm-next repo.
But before I go and recreate the scenario, which means rebuilding 3
nodes, I would like to know if this bug is fixed in 1.1.3.

Thanks,
Pavlos

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] pacemaker version

2010-10-08 Thread Pavlos Parissis
On 8 October 2010 09:28, Andrew Beekhof  wrote:
> On Fri, Oct 8, 2010 at 8:31 AM, Pavlos Parissis
>  wrote:
>> On 8 October 2010 07:47, Andrew Beekhof  wrote:
>>> On Thu, Oct 7, 2010 at 10:10 PM, Pavlos Parissis
>>>  wrote:
>>>> On 7 October 2010 08:33, Andrew Beekhof  wrote:
>>>>>
>>>>> On Wed, Oct 6, 2010 at 5:04 PM, Gianluca Cecchi
>>>>>  wrote:
>>>>> > On Wed, Oct 6, 2010 at 4:25 PM, Shravan Mishra 
>>>>> >  wrote:
>>>>> >> That is what I heard too, that's the reason for this question.
>>>>> >>
>>>>> >
>>>>> > On June, inside a complex thread regarding "colocation -inf", Andrew
>>>>> > reported the link and also several clarifications after some questions
>>>>> > of mine...
>>>>> >
>>>>> > See in particular:
>>>>> > http://oss.clusterlabs.org/pipermail/pacemaker/2010-June/006606.html
>>>>> > http://oss.clusterlabs.org/pipermail/pacemaker/2010-June/006610.html
>>>>> > http://oss.clusterlabs.org/pipermail/pacemaker/2010-June/006620.html
>>>>> >
>>>>> > I think they are still valid...
>>>>>
>>>>> Absolutely.  These will all be valid until 1.2 comes out (and then
>>>>> they'll apply to 1.3 instead :-)
>>>>
>>>> I am a bit confused about the meaning of pacemaker-1.0,1.1 and 1.2
>>>> schema, mentioned here
>>>> http://theclusterguy.clusterlabs.org/post/441442543/new-pacemaker-release-series
>>>>
>>>> Let's I have installed 1.1 how do I know if I use pacemaker-1.o or 1.2 
>>>> schema?
>>>> Sorry if it sounds stupid but I simple don't understand it
>>>
>>> cibadmin -Ql | grep validate
>>>
>>
>>
>> [r...@node-01 ~]# cibadmin -Ql | grep validate
>> <cib validate-with="pacemaker-1.0" have-quorum="1" dc-uuid="b7764e7b-0a00-4745-8d9e-6911271eefb2"
>> admin_epoch="0" epoch="271" num_updates="3">
>>
>> since validate-with is set to pacemaker-1.0, I am using pacemaker-1.0
>> schema, right?
>
> Right.
>
>> So, if I upgrade to 1.1.3 and leave validate-with to pacemaker-1.0, I
>> will run a stable 1.1.3, but if I set to pacemaker-1.1 I will be
>> running a "testing|unstable" 1.1.3.
>
> You'll be enabling some unfinished features.
>
>> Have I understood it correctly?
>
> Essentially, yes.

thanks

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] crm resource move doesn't move the resource

2010-10-08 Thread Pavlos Parissis
On 8 October 2010 09:29, Andrew Beekhof  wrote:
> On Fri, Oct 8, 2010 at 8:34 AM, Pavlos Parissis
>  wrote:
>> On 8 October 2010 08:29, Andrew Beekhof  wrote:
>>> On Thu, Oct 7, 2010 at 9:58 PM, Pavlos Parissis
>>>  wrote:
>>>>
>>>>
>>>> On 7 October 2010 09:01, Andrew Beekhof  wrote:
>>>>>
>>>>> On Sat, Oct 2, 2010 at 6:31 PM, Pavlos Parissis
>>>>>  wrote:
>>>>> > Hi,
>>>>> >
>>>>> > I am having again the same issue, in a different set of 3 nodes. When I
>>>>> > try
>>>>> > to failover manually the resource group on the standby node, the ms-drbd
>>>>> > resource is not moved as well and as a result the resource group is not
>>>>> > fully started, only the ip resource is started.
>>>>> > Any ideas why I am having this issue?
>>>>>
>>>>> I think it's a bug that was fixed recently.  Could you try the latest
>>>>> code from Mercurial?
>>>>
>>>> 1.1 or 1.2 branch?
>>>
>>> 1.1
>>>
>> to save time on compiling stuff I want to use the available rpms on
>> 1.1.3 version from rpm-next repo.
>> But before I go and recreate the scenario, which means rebuild 3
>> nodes, I would like to know if this bug is fixed in 1.1.3
>
> As I said, I believe so.
>

I've just upgraded[1] my pacemaker to 1.1.3 and stonithd cannot be
started. Am I missing something?

Oct 08 21:08:01 node-02 heartbeat: [14192]: info: Starting
"/usr/lib/heartbeat/stonithd" as uid 0  gid 0 (pid 14192)
Oct 08 21:08:01 node-02 heartbeat: [14193]: info: Starting
"/usr/lib/heartbeat/attrd" as uid 101  gid 103 (pid 14193)
Oct 08 21:08:01 node-02 heartbeat: [14194]: info: Starting
"/usr/lib/heartbeat/crmd" as uid 101  gid 103 (pid 14194)
Oct 08 21:08:01 node-02 ccm: [14189]: info: Hostname: node-02
Oct 08 21:08:01 node-02 cib: [14190]: WARN: ccm_connect: CCM Activation failed
Oct 08 21:08:01 node-02 cib: [14190]: WARN: ccm_connect: CCM
Connection failed 1 times (30 max)
Oct 08 21:08:01 node-02 attrd: [14193]: info: Invoked: /usr/lib/heartbeat/attrd
Oct 08 21:08:01 node-02 stonith-ng: [14192]: info: Invoked:
/usr/lib/heartbeat/stonithd
Oct 08 21:08:01 node-02 stonith-ng: [14192]: info:
G_main_add_SignalHandler: Added signal handler for signal 17
Oct 08 21:08:01 node-02 heartbeat: [14158]: WARN: Client [stonith-ng]
pid 14192 failed authorization [no default client auth]
Oct 08 21:08:01 node-02 heartbeat: [14158]: ERROR:
api_process_registration_msg: cannot add client(stonith-ng)
Oct 08 21:08:01 node-02 stonith-ng: [14192]: ERROR:
register_heartbeat_conn: Cannot sign on with heartbeat:
Oct 08 21:08:01 node-02 stonith-ng: [14192]: CRIT: main: Cannot sign
in to the cluster... terminating
Oct 08 21:08:01 node-02 heartbeat: [14158]: WARN: Managed
/usr/lib/heartbeat/stonithd process 14192 exited with return code 100.
Oct 08 21:08:01 node-02 crmd: [14194]: info: Invoked: /usr/lib/heartbeat/crmd
Oct 08 21:08:01 node-02 crmd: [14194]: info: G_main_add_SignalHandler:
Added signal handler for signal 17
Oct 08 21:08:02 node-02 crmd: [14194]: WARN: do_cib_control: Couldn't
complete CIB registration 1 times... pause and retry
Oct 08 21:08:04 node-02 cib: [14190]: WARN: ccm_connect: CCM Activation failed
Oct 08 21:08:04 node-02 cib: [14190]: WARN: ccm_connect: CCM
Connection failed 2 times (30 max)
Oct 08 21:08:05 node-02 crmd: [14194]: WARN: do_cib_control: Couldn't
complete CIB registration 2 times... pause and retry
[..snip...]
Oct 08 21:08:33 node-02 crmd: [14194]: ERROR: te_connect_stonith:
Sign-in failed: triggered a retry


[1] I use CentOS 5.4 and when I did the installation I used the
following repository
[r...@node-02 ~]# cat /etc/yum.repos.d/pacemaker.repo
[clusterlabs]
name=High Availability/Clustering server technologies (epel-5)
baseurl=http://www.clusterlabs.org/rpm/epel-5
type=rpm-md
gpgcheck=0
enabled=1

and in order to perform the upgrade I added the following rep.

[clusterlabs-next]
name=High Availability/Clustering server technologies (epel-5-next)
baseurl=http://www.clusterlabs.org/rpm-next/epel-5
metadata_expire=45m
type=rpm-md
gpgcheck=0
enabled=1

And here is the installation/upgrade log, where you can see that only
pacemaker-libs and pacemaker were upgraded.
Oct 03 21:06:20 Installed: libibverbs-1.1.3-2.el5.i386
Oct 03 21:06:25 Installed: lm_sensors-2.10.7-9.el5.i386
Oct 03 21:06:31 Installed: 1:net-snmp-5.3.2.2-9.el5_5.1.i386
Oct 03 21:06:31 Installed: librdmacm-1.0.10-1.el5.i386
Oct 03 21:06:32 Installed: openhpi-libs-2.14.0-5.el5.i386
Oct 03 21:06:33 Installed: OpenIPMI-libs-2.0.16-7.el5.i386
Oct 03 21:06:35 Installed: libesmtp-1.0.4-5.el5.i386
Oct 03 21:06:36 Installed: cluster-glue-libs-1.0.6-1.6.el5.i386
Oct 03 21:06:37

[Pacemaker] unpack_rsc_op: Hard error

2010-10-09 Thread Pavlos Parissis
Hi,

Does anyone know why the PE wants to unpack resources on nodes where they
will never run due to location constraints?
I am getting these messages and I am wondering whether they are harmless or not.

23:12:38 pengine: [7705]: notice: unpack_rsc_op: Hard error -
sshd-pbx_01_monitor_0 failed with rc=5: Preventing sshd-pbx_01 from
re-starting on node-02
23:12:38 pengine: [7705]: notice: unpack_rsc_op: Hard error -
pbx_01_monitor_0 failed with rc=5: Preventing pbx_01 from re-starting
on node-02

Cheers,
Pavlos

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] crm resource move doesn't move the resource

2010-10-09 Thread Pavlos Parissis
On 8 October 2010 22:05, Pavlos Parissis  wrote:
> On 8 October 2010 09:29, Andrew Beekhof  wrote:
>> On Fri, Oct 8, 2010 at 8:34 AM, Pavlos Parissis
>>  wrote:
>>> On 8 October 2010 08:29, Andrew Beekhof  wrote:
>>>> On Thu, Oct 7, 2010 at 9:58 PM, Pavlos Parissis
>>>>  wrote:
>>>>>
>>>>>
>>>>> On 7 October 2010 09:01, Andrew Beekhof  wrote:
>>>>>>
>>>>>> On Sat, Oct 2, 2010 at 6:31 PM, Pavlos Parissis
>>>>>>  wrote:
>>>>>> > Hi,
>>>>>> >
>>>>>> > I am having again the same issue, in a different set of 3 nodes. When I
>>>>>> > try
>>>>>> > to failover manually the resource group on the standby node, the 
>>>>>> > ms-drbd
>>>>>> > resource is not moved as well and as a result the resource group is not
>>>>>> > fully started, only the ip resource is started.
>>>>>> > Any ideas why I am having this issue?
>>>>>>
>>>>>> I think it's a bug that was fixed recently.  Could you try the latest
>>>>>> code from Mercurial?
>>>>>
>>>>> 1.1 or 1.2 branch?
>>>>
>>>> 1.1
>>>>
>>> to save time on compiling stuff I want to use the available rpms on
>>> 1.1.3 version from rpm-next repo.
>>> But before I go and recreate the scenario, which means rebuild 3
>>> nodes, I would like to know if this bug is fixed in 1.1.3
>>
>> As I said, I believe so.
>>
>
> I've just upgraded[1] my pacemaker to 1.1.3 and stonithd can not be
> started, am I missing something?
>
> Oct 08 21:08:01 node-02 heartbeat: [14192]: info: Starting
> "/usr/lib/heartbeat/stonithd" as uid 0  gid 0 (pid 14192)
> Oct 08 21:08:01 node-02 heartbeat: [14193]: info: Starting
> "/usr/lib/heartbeat/attrd" as uid 101  gid 103 (pid 14193)
> Oct 08 21:08:01 node-02 heartbeat: [14194]: info: Starting
> "/usr/lib/heartbeat/crmd" as uid 101  gid 103 (pid 14194)
> Oct 08 21:08:01 node-02 ccm: [14189]: info: Hostname: node-02
> Oct 08 21:08:01 node-02 cib: [14190]: WARN: ccm_connect: CCM Activation failed
> Oct 08 21:08:01 node-02 cib: [14190]: WARN: ccm_connect: CCM
> Connection failed 1 times (30 max)
> Oct 08 21:08:01 node-02 attrd: [14193]: info: Invoked: 
> /usr/lib/heartbeat/attrd
> Oct 08 21:08:01 node-02 stonith-ng: [14192]: info: Invoked:
> /usr/lib/heartbeat/stonithd
> Oct 08 21:08:01 node-02 stonith-ng: [14192]: info:
> G_main_add_SignalHandler: Added signal handler for signal 17
> Oct 08 21:08:01 node-02 heartbeat: [14158]: WARN: Client [stonith-ng]
> pid 14192 failed authorization [no default client auth]
> Oct 08 21:08:01 node-02 heartbeat: [14158]: ERROR:
> api_process_registration_msg: cannot add client(stonith-ng)
> Oct 08 21:08:01 node-02 stonith-ng: [14192]: ERROR:
> register_heartbeat_conn: Cannot sign on with heartbeat:
> Oct 08 21:08:01 node-02 stonith-ng: [14192]: CRIT: main: Cannot sign
> in to the cluster... terminating
> Oct 08 21:08:01 node-02 heartbeat: [14158]: WARN: Managed
> /usr/lib/heartbeat/stonithd process 14192 exited with return code 100.
> Oct 08 21:08:01 node-02 crmd: [14194]: info: Invoked: /usr/lib/heartbeat/crmd
> Oct 08 21:08:01 node-02 crmd: [14194]: info: G_main_add_SignalHandler:
> Added signal handler for signal 17
> Oct 08 21:08:02 node-02 crmd: [14194]: WARN: do_cib_control: Couldn't
> complete CIB registration 1 times... pause and retry
> Oct 08 21:08:04 node-02 cib: [14190]: WARN: ccm_connect: CCM Activation failed
> Oct 08 21:08:04 node-02 cib: [14190]: WARN: ccm_connect: CCM
> Connection failed 2 times (30 max)
> Oct 08 21:08:05 node-02 crmd: [14194]: WARN: do_cib_control: Couldn't
> complete CIB registration 2 times... pause and retry
> [..snip...]
> Oct 08 21:08:33 node-02 crmd: [14194]: ERROR: te_connect_stonith:
> Sign-in failed: triggered a retry
>
Solved by adding "apiauth stonith-ng uid=root" to ha.cf.
It was mentioned here:
http://www.gossamer-threads.com/lists/linuxha/users/67189#67189
and a patch exists which will make heartbeat not require apiauth:
http://hg.linux-ha.org/dev/rev/9624b66a6b82
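
For reference, a sketch of how the line sits in ha.cf (the surrounding
directives are illustrative, from a typical heartbeat/pacemaker setup):

# /etc/ha.d/ha.cf
crm respawn
# allow pacemaker's stonith daemon to sign on to the heartbeat API
apiauth stonith-ng uid=root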

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] crmd thinks lsb returns error on monito

2010-10-09 Thread Pavlos Parissis
Hi,

My resource is not started because I get this

00:44:27 crmd: [3141]: WARN: status_from_rc: Action 16
(pbx_02_monitor_0) on node-02 failed (target: 7 vs. rc: 5): Error

but when I run the status manually I get 3, which is OK because the
application is stopped

[r...@node-02 ~]# /etc/init.d/znd-pbx_02 status
pbx_02 is stopped
[r...@node-02 ~]# echo $?
3

Why does crm get an error in this case?

Thanks,
Pavlos

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] unpack_rsc_op: Hard error

2010-10-10 Thread Pavlos Parissis
On 9 October 2010 23:20, Pavlos Parissis  wrote:
> Hi,
>
> Does anyone know why PE wants to unpack resources on nodes that will
> never run due to location constraints?
> I am getting this messages and I am wondering if they harmless or not.
>
> 23:12:38 pengine: [7705]: notice: unpack_rsc_op: Hard error -
> sshd-pbx_01_monitor_0 failed with rc=5: Preventing sshd-pbx_01 from
> re-starting on node-02
> 23:12:38 pengine: [7705]: notice: unpack_rsc_op: Hard error -
> pbx_01_monitor_0 failed with rc=5: Preventing pbx_01 from re-starting
> on node-02
>
> Cheers,
> Pavlos
>

It seems that a return code of 5 from an LSB script confuses the cluster.
I have made my init script LSB-compliant and it passes the tests
here[1], but I have also implemented what is mentioned here [2]
regarding the exit codes.
I implemented exit code 5, which causes trouble because when
the cluster runs the monitor on the slave node, where no resources are
active, it gets rc=5.
If I remove the exit 5 everything is fine. Is this expected behavior?


[1]http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-lsb.html

[2]http://refspecs.freestandards.org/LSB_3.1.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html
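
For context, my reading of the two exit codes involved, written as
comments on the status branch of the script below (the interpretation of
rc=5 matches the "Hard error ... Preventing ..." log lines above):

status)
        # LSB: rc=3 means "program is not running"; this is what the
        # cluster expects from a probe of a stopped resource.
        # rc=5 means "program is not installed"; pacemaker treats it as a
        # hard error and prevents the resource from re-starting there.
        status -p $PID_FILE $PBX
        RETVAL=$?
        ;;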

the init script
[r...@node-03 ~]# cat /etc/init.d/znd-pbx_01
#!/bin/bash
#
### BEGIN INIT INFO
# Provides: pbx_01
# Required-Start: $local_fs $network
# Required-Stop: $local_fs $network
# Default-Start:   3 4 5
# Default-Stop: 0 1 2 6
# Short-Description: start and stop pbx_01
# Description: Init script fro pbxnsip.
### END INIT INFO

# source function library
. /etc/init.d/functions

RETVAL=0

# Installation location
INSTALLDIR=/pbx_service_01/pbxnsip
PBX_CONFIG=$INSTALLDIR/pbx.xml
PBX=pbx_01
PID_FILE=/var/run/$PBX.pid
LOCK_FILE=/var/lock/subsys/$PBX
PBX_OPTIONS="--dir $INSTALLDIR --config $PBX_CONFIG --pidfile $PID_FILE"

#sleep 10;

#[ -x $INSTALLDIR/$PBX ] || exit 5

start()
{
        echo -n "Starting PBX: "
        daemon --pidfile $PID_FILE $INSTALLDIR/$PBX $PBX_OPTIONS
        RETVAL=$?
        echo
        [ $RETVAL -eq 0 ] && touch $LOCK_FILE
        return $RETVAL
}

stop()
{
        echo -n "Stopping PBX: "
        killproc -p $PID_FILE $PBX
        RETVAL=$?
        echo
        [ $RETVAL -eq 0 ] && rm -f $LOCK_FILE
        return $RETVAL
}

case "$1" in
        start)
                start
                ;;
        stop)
                stop
                ;;
        restart)
                stop
                start
                ;;
        force-reload)
                stop
                start
                ;;
        status)
                status -p $PID_FILE $PBX
                RETVAL=$?
                ;;
        *)
                echo $"Usage: $0 {start|stop|restart|force-reload|status}"
                exit 2
esac
exit $RETVAL
[r...@node-03 ~]#

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] crm resource move doesn't move the resource

2010-10-11 Thread Pavlos Parissis
On 8 October 2010 09:29, Andrew Beekhof  wrote:
> On Fri, Oct 8, 2010 at 8:34 AM, Pavlos Parissis
>  wrote:
>> On 8 October 2010 08:29, Andrew Beekhof  wrote:
>>> On Thu, Oct 7, 2010 at 9:58 PM, Pavlos Parissis
>>>  wrote:
>>>>
>>>>
>>>> On 7 October 2010 09:01, Andrew Beekhof  wrote:
>>>>>
>>>>> On Sat, Oct 2, 2010 at 6:31 PM, Pavlos Parissis
>>>>>  wrote:
>>>>> > Hi,
>>>>> >
>>>>> > I am having again the same issue, in a different set of 3 nodes. When I
>>>>> > try
>>>>> > to failover manually the resource group on the standby node, the ms-drbd
>>>>> > resource is not moved as well and as a result the resource group is not
>>>>> > fully started, only the ip resource is started.
>>>>> > Any ideas why I am having this issue?
>>>>>
>>>>> I think it's a bug that was fixed recently.  Could you try the latest
>>>>> code from Mercurial?
>>>>
>>>> 1.1 or 1.2 branch?
>>>
>>> 1.1
>>>
>> to save time on compiling stuff I want to use the available rpms on
>> 1.1.3 version from rpm-next repo.
>> But before I go and recreate the scenario, which means rebuild 3
>> nodes, I would like to know if this bug is fixed in 1.1.3
>
> As I said, I believe so.

I recreated the 3-node cluster and I didn't face that issue, but I am
going to keep an eye on it for a few days and even rerun the whole
scenario (recreate the 3-node cluster ...) just to be very sure. If I
don't see it again I will also close the bug report.

Thanks,
Pavlos

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] crmd thinks lsb returns error on monito

2010-10-11 Thread Pavlos Parissis
On 10 October 2010 17:40, Andrew Beekhof  wrote:
> On Sun, Oct 10, 2010 at 12:47 AM, Pavlos Parissis
>  wrote:
>> Hi,
>>
>> My resource is not started because I get this
>>
>> 00:44:27 crmd: [3141]: WARN: status_from_rc: Action 16
>> (pbx_02_monitor_0) on node-02 failed (target: 7 vs. rc: 5): Error
>>
>> but when I run manually the status I get 3, which ok because the
>> application is stopped
>>
>> [r...@node-02 ~]# /etc/init.d/znd-pbx_02 status
>> pbx_02 is stopped
>> [r...@node-02 ~]# echo $?
>> 3
>>
>> why does crm get error in this case?
>
> I imagine because when pacemaker ran it, the script didn't return 3.
>
Pacemaker got 5 because the script returns 5 when the application is
not available on the system, which happens only when the fs is not
active. What actually happened in this particular case is that the
start actions on the fs and on the resource which holds the application
ran in the same second. I am pretty sure that the start of the
application resource went too fast and that at the time the LSB script
was executed the fs was not available yet, even though the fs resource
returned 0 on start and on the first monitor.
This issue doesn't always happen, but if I put a sleep in the LSB script
for the application resource I don't run into it.
The resources are in a group with order ip fs app.
I also removed exit code 5 from the LSB script; it confuses the
cluster when the monitor action takes place on the slave node.

Cheers,
Pavlos

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] unpack_rsc_op: Hard error

2010-10-11 Thread Pavlos Parissis
On 10 October 2010 17:39, Andrew Beekhof  wrote:
> On Sat, Oct 9, 2010 at 11:20 PM, Pavlos Parissis
>  wrote:
>> Hi,
>>
>> Does anyone know why PE wants to unpack resources on nodes that will
>> never run due to location constraints?
>
> Because part of its job is to make sure they dont run there.
>
>> I am getting this messages and I am wondering if they harmless or not.
>
> Basically yes.  We've since reduced this to an informational message.
>
So, it is not necessary to place the LSB script of a resource on nodes
where the resource will never run due to location constraints. Am I
right?

Cheers,
Pavlos

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] resource is stuck

2010-10-11 Thread Pavlos Parissis
On 11 October 2010 11:12, Pavlos Parissis  wrote:
> Hi,
>
> Cluster got an error on monitor and stop action on a resource and
> since then I can't do stop/start/manage/unmanage that resource.
> For some strange reason the actions monitor/stop failed, manually
> worked, but i can't figure out why they failed when cluster run status
> and stop on the specific lsb resource.
>
> The issue now is that I can't do anything about that resource, even I
> have  cleared out the failcount counter.
>
> How can i escape from the situation?
>
> hb_report attached
>
> Regards,
> Pavlos
>

After reading again and again the "Configuration Explained" document,
and especially page 18, I found a solution. Adding on-fail="stop" to the
monitor/stop/start operations of the resource got me out of that
situation. After I added this setting the cluster initiated a stop
action, which was successful!

The resource was stuck, actually blocked, because blocking is the default
action when a stop action fails and stonith is disabled.
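
For anyone hitting the same thing, a minimal sketch of the operation
definitions I mean (crm syntax; resource name and timeouts illustrative):

primitive app_01 lsb:app_01 \
        op monitor interval="20s" on-fail="stop" \
        op start interval="0" timeout="60s" on-fail="stop" \
        op stop interval="0" timeout="60s" on-fail="stop"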

Blame on me for not remembering page 18 :-)

Cheers,
Pavlos

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] 1st monitor is too fast after the start

2010-10-12 Thread Pavlos Parissis
Hi,

I noticed a race condition while I was integrating an application with
Pacemaker and thought to share it with you.

The init script of the application is LSB-compliant and passes the
tests mentioned in the Pacemaker documentation. Moreover, the init
script uses the functions supplied by the system[1] for starting,
stopping and checking the application.

I observed a few times that the monitor action was failing after the
startup of the cluster or the movement of the resource group.
Because it was not happening every time and a manual start/status always
worked, it was quite tricky and difficult to find the root cause
of the failure.
After a few hours of troubleshooting, I found out that the 1st monitor
action after the start action was executed too fast for the
application to create the pid file. As a result the monitor action was
receiving an error.

I know it sounds a bit strange, but it happened on my systems. The fact
that my systems are basically vmware images on a laptop could be
related to the issue.

Nevertheless, I would like to ask if you are thinking of implementing an
"init_wait" on the 1st monitor action. It could be useful.

To solve my issue I put a sleep after the start of the application in
the init script. This gives the application enough time to create
its pid file, and the 1st monitor doesn't fail.
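
A minimal sketch of the workaround in the init script's start path ($APP,
$APP_OPTIONS and the 1-second value are illustrative; daemon is the usual
RedHat init helper):

start()
{
        daemon --pidfile $PID_FILE $APP $APP_OPTIONS
        RETVAL=$?
        # give the application time to write its pid file before the
        # cluster's 1st monitor action probes it
        [ $RETVAL -eq 0 ] && sleep 1
        return $RETVAL
}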


Cheers,
Pavlos


[1] CentOS 5.4

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] sshd under cluster

2010-10-12 Thread Pavlos Parissis
Hi,

I was asked to place the sshd daemon under cluster control and, because I
faced a few challenges, I thought to share them with you.

The 1st challenge was to clone the sshd daemon, its init script and its
configuration. The procedure is at the bottom of this mail.

The 2nd challenge was the init script of sshd in CentOS. It has 2
issues. The 1st issue was that it was failing at test 6 mentioned here
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-lsb.html.

The 2nd issue was that during shutdown or reboot of the cluster node,
the stop action on the resource was receiving return code 143 from the
init script and the whole shutdown/reboot process was stuck for a few
minutes. The root cause of that was the killall command which is called
by the init script. The init script calls killall, only on shutdown or
reboot, to close any open connections. But that call was also killing
the script itself! Because of that the cluster was getting an error on
the stop action, and the lock file of the sshd was not removed either.
You can imagine the consequences.

For both issues I filed a bug report and hacked the init script in
order to have a short-term resolution.

The last challenge was related to a mail I sent a few hours ago. The 1st
monitor action after the start action was too fast and sshd didn't
have enough time to create its pid file. As a result the monitor was
thinking that sshd was down, but it wasn't.
A sleep 1 after the start function in the init script solved the issue.

Cheers,
Pavlos

Clone SSH for pbx_0N
Prerequisite: the default sshd must listen only on the node's IP and not on all IPs.
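
A sketch of that prerequisite in the default /etc/ssh/sshd_config (the
address is illustrative); otherwise the default sshd binds the wildcard
address and the cloned instances cannot bind the cluster IPs:

ListenAddress 192.168.78.1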

cp -p /etc/init.d/sshd /etc/init.d/sshd-pbx_02

cp -p /etc/pam.d/sshd /etc/pam.d/sshd-pbx_02 # optional because it is
needed only if UsePam true - On RH is true by default

ln -s /usr/sbin/sshd /usr/sbin/sshd-pbx_02

touch /etc/sysconfig/sshd-pbx_02
echo 'OPTIONS="-f /etc/ssh/sshd_config-pbx_02"' > /etc/sysconfig/sshd-pbx_02

cp -p /etc/ssh/sshd_config /etc/ssh/sshd_config-pbx_02

[r...@node-02 ~]# diff -wu /etc/init.d/sshd /etc/init.d/sshd-pbx_02
--- /etc/init.d/sshd2009-09-03 20:12:38.0 +0200
+++ /etc/init.d/sshd-pbx_02 2010-10-12 12:25:50.0 +0200
@@ -1,33 +1,33 @@
-#!/bin/bash
+#!/bin/bash -x
 #
-# Init file for OpenSSH server daemon
+# Init file for OpenSSH server daemon used by pbx_02
 #
 # chkconfig: 2345 55 25
-# description: OpenSSH server daemon
+# description: OpenSSH server daemon for pbx_02
 #
-# processname: sshd
-# config: /etc/ssh/ssh_host_key
-# config: /etc/ssh/ssh_host_key.pub
+# processname: sshd-pbx_02
+# config: /etc/ssh/ssh_host_key-pbx_02
+# config: /etc/ssh/ssh_host_key-pbx_02.pub
 # config: /etc/ssh/ssh_random_seed
-# config: /etc/ssh/sshd_config
-# pidfile: /var/run/sshd.pid
+# config: /etc/ssh/sshd_config-pbx_02
+# pidfile: /var/run/sshd-pbx_02.pid

 # source function library
 . /etc/rc.d/init.d/functions

 # pull in sysconfig settings
-[ -f /etc/sysconfig/sshd ] && . /etc/sysconfig/sshd
+[ -f /etc/sysconfig/sshd-pbx_02 ] && . /etc/sysconfig/sshd-pbx_02

 RETVAL=0
-prog="sshd"
+prog="sshd-pbx_02"

 # Some functions to make the below more readable
 KEYGEN=/usr/bin/ssh-keygen
-SSHD=/usr/sbin/sshd
-RSA1_KEY=/etc/ssh/ssh_host_key
-RSA_KEY=/etc/ssh/ssh_host_rsa_key
-DSA_KEY=/etc/ssh/ssh_host_dsa_key
-PID_FILE=/var/run/sshd.pid
+SSHD=/usr/sbin/sshd-pbx_02
+RSA1_KEY=/etc/ssh/ssh_host_key-pbx_02
+RSA_KEY=/etc/ssh/ssh_host_rsa_key-pbx_02
+DSA_KEY=/etc/ssh/ssh_host_dsa_key-pbx_02
+PID_FILE=/var/run/sshd-pbx_02.pid

 runlevel=$(set -- $(runlevel); eval "echo \$$#" )

@@ -110,7 +110,11 @@
echo -n $"Starting $prog: "
$SSHD $OPTIONS && success || failure
RETVAL=$?
-   [ "$RETVAL" = 0 ] && touch /var/lock/subsys/sshd
+   [ "$RETVAL" = 0 ] && touch /var/lock/subsys/sshd-pbx_02
+# to avoid a race condition, 1st cluster monitor after start fails
+# because the pid file is not created yet. Few msecs detail on the
+# creation of pid file is enough to cause issues.
+sleep 1
echo
 }

@@ -119,16 +123,25 @@
echo -n $"Stopping $prog: "
if [ -n "`pidfileofproc $SSHD`" ] ; then
killproc $SSHD
+   elif [ -z "`pidfileofproc $SSHD`" ] && [ ! -f
/var/lock/subsys/sshd-pbx_02 ] ; then
+success
+RETVAL=0
else
failure $"Stopping $prog"
fi
RETVAL=$?
+
+### Added by Pavlos Parissis ###
+# Disable the below bit because killall kills the script itself.
+# This causes problems within the cluster, shutdown of a node fails.
+# Any open connections will be killed by /etc/init.d.halt anyways
+
# if we are in halt or reboot runlevel kill all running sessions
# so the TCP connections are closed cleanly
-   if [ "x$runlevel"

Re: [Pacemaker] Migrate resources based on connectivity

2010-10-12 Thread Pavlos Parissis
On 12 October 2010 20:00, Dan Frincu  wrote:
> Hi,
>
> Lars Ellenberg wrote:
>
> On Mon, Oct 11, 2010 at 03:50:01PM +0300, Dan Frincu wrote:
>
>
> Hi,
>
> Dejan Muhamedagic wrote:
>
>
> Hi,
>
> On Sun, Oct 10, 2010 at 10:27:13PM +0300, Dan Frincu wrote:
>
>
> Hi,
>
> I have the following setup:
> - order drbd0:promote drbd1:promote
> - order drbd1:promote drbd2:promote
> - order drbd2:promote all:start
> - collocation all drbd2:Master
> - all is a group of resources, drbd{0..3} are drbd ms resources.
>
> I want to migrate the resources based on ping connectivity to a
> default gateway. Based on
> http://www.clusterlabs.org/wiki/Pingd_with_resources_on_different_networks
> and http://www.clusterlabs.org/wiki/Example_configurations I've
> tried the following:
> - primitive ping ocf:pacemaker:ping params host_list=1.2.3.4
> multiplier=100 op monitor interval=5s timeout=5s
> - clone ping_clone ping meta globally-unique=false
> - location ping_nok all \
>   rule $id="ping_nok-rule" -inf: not_defined ping_clone or
> ping_clone number:lte 0
>
>
> Use pingd to reference the attribute in the location constraint.
>
>
> Not to be disrespectful, but after 3 days being stuck on this issue,
> I don't exactly understand how to do that. Could you please provide
> an example.
>
> Thank you in advance.
>
>
> The example you reference lists:
>
>   primitive pingdnet1 ocf:pacemaker:pingd \
>   params host_list=192.168.23.1 \
>   name=pingdnet1
>   ^^
>
>   clone cl-pingdnet1 pingdnet1
>  ^
>
> param name default is pingd,
> and is the attribute name to be used in the location constraints.
>
> You will need to reference pingd in you location constraint, or set an
> explicit name in the primitive definition, and reference that.
>
> Your ping primitive sets the default 'pingd' attribute,
> but you reference some 'ping_clone' attribute,
> which apparently no-one really references.
>
>
>
> I've finally managed to finish the setup with the indications received
> above, the behavior is the expected one. Also, I've tried the
> ocf:pacemaker:pingd and even though it does the reachability tests properly,
> it fails to update the cib upon restoring the connectivity, I had to
> manually run attrd_updater -R to get the resources to start again, therefore
> I'm going with ocf:pacemaker:ping.
>
It would be quite useful for the rest of us if you posted your final
working configuration.
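
For reference, my understanding of the corrected form in crm syntax:
either keep the default attribute name "pingd" or set a name explicitly
and reference that same name in the rule (values illustrative):

primitive ping ocf:pacemaker:ping \
        params host_list="1.2.3.4" multiplier="100" name="pingval" \
        op monitor interval="5s" timeout="5s"
clone ping_clone ping meta globally-unique="false"
location ping_nok all \
        rule $id="ping_nok-rule" -inf: not_defined pingval or pingval number:lte 0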
Cheers,
Pavlos

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] 1st monitor is too fast after the start

2010-10-13 Thread Pavlos Parissis
On 13 October 2010 09:48, Dan Frincu  wrote:
> Hi,
>
> I've noticed the same type of behavior, however in a different context, my
> setup includes 3 drbd devices and a group of resources, all have to run on
> the same node and move together to other nodes. My issue was with the first
> resource that required access to a drbd device, which was the
> ocf:heartbeat:Filesystem RA trying to do a mount and failing.
>
> The reason, it was trying to do the mount of the drbd device before the drbd
> device had finished migrating to primary state. Same as you, I introduced a
> start-delay, but on the start action. This proved to be of no use as the
> behavior persisted, even with an increased start-delay. However, it only
> happened when performing a fail-back operation, during fail-over, everything
> was ok, during fail-back, error.
>
> The fix I've made was to remove any start-delay and to add group collocation
> constraints to all ms_drbd resources. Before that I only had one collocation
> constraint for the drbd device being promoted last.
>
> I hope this helps.
>

I am glad that somebody else experienced the same issue :-)

In my mail I was talking about the monitor action which was failing,
but the behavior you described also happened on my system with the same
setup, a drbd and an fs resource. It also happened with the application
resource: the start was too fast and the FS was not mounted (yet) when
the start action fired for the application resource. A delay in the start
function of the resource agent of the application fixed my issue.

In my setup I have all the necessary constraints to avoid this, at
least that is what I believe :-)

Cheers,
Pavlos


[r...@node-01 sysconfig]# crm configure show
node $id="059313ce-c6aa-4bd5-a4fb-4b781de6d98f" node-03
node $id="d791b1f5-9522-4c84-a66f-cd3d4e476b38" node-02
node $id="e388e797-21f4-4bbe-a588-93d12964b4d7" node-01
primitive drbd_01 ocf:linbit:drbd \
params drbd_resource="drbd_pbx_service_1" \
op monitor interval="30s" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="120s"
primitive drbd_02 ocf:linbit:drbd \
params drbd_resource="drbd_pbx_service_2" \
op monitor interval="30s" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="120s"
primitive fs_01 ocf:heartbeat:Filesystem \
params device="/dev/drbd1" directory="/pbx_service_01" fstype="ext3" \
meta migration-threshold="3" failure-timeout="60" \
op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="60s"
primitive fs_02 ocf:heartbeat:Filesystem \
params device="/dev/drbd2" directory="/pbx_service_02" fstype="ext3" \
meta migration-threshold="3" failure-timeout="60" \
op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="60s"
primitive ip_01 ocf:heartbeat:IPaddr2 \
params ip="192.168.78.10" cidr_netmask="24" broadcast="192.168.78.255" \
meta failure-timeout="120" migration-threshold="3" \
op monitor interval="5s"
primitive ip_02 ocf:heartbeat:IPaddr2 \
meta failure-timeout="120" migration-threshold="3" \
params ip="192.168.78.20" cidr_netmask="24" broadcast="192.168.78.255" \
op monitor interval="5s"
primitive pbx_01 lsb:znd-pbx_01 \
meta migration-threshold="3" failure-timeout="60"
target-role="Started" \
op monitor interval="20s" timeout="20s" \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="60s"
primitive pbx_02 lsb:znd-pbx_02 \
meta migration-threshold="3" failure-timeout="60" \
op monitor interval="20s" timeout="20s" \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="60s"
primitive sshd_01 lsb:znd-sshd-pbx_01 \
meta target-role="Started" is-managed="true" \
op monitor on-fail="stop" interval="10m" \
op start interval="0" timeout="60s" on-fail="stop" \
op stop interval="0" timeout="60s" on-fail="stop"
primitive sshd_02 lsb:znd-sshd-pbx_02 \
meta target-role="Started" \
op monitor on-fail="stop" interval="10m" \
op start interval="0" timeout="60s" on-fail="stop" \
op stop interval="0" timeout="60s" on-fail="stop"
group pbx_service_01 ip_01 fs_01 pbx_01 sshd_01 \
meta target-role="Started"
group pbx_service_02 ip_02 fs_02 pbx_02 sshd_02
ms ms-drbd_01 drbd_01 \
meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true" target-role="Started"
ms ms-drbd_02 drbd_02 \
meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true" target-role="Started"
location PrimaryNode-drbd_01 ms-drbd_01 100: node-01
location PrimaryNode-drbd_02 ms-drbd_02 100: node-02
location PrimaryNode-pbx_service_01 pbx_service_01 200: node

Re: [Pacemaker] 1st monitor is too fast after the start

2010-10-13 Thread Pavlos Parissis
On 13 October 2010 10:50, Dan Frincu  wrote:
> From what I see you have a dual primary setup with failover on the third
> node, basically if you have one drbd resource for which you have both
> ordering and collocation, I don't think you need to "improve" it, if it
> ain't broke, don't fix it :)
>
> Regards,
>

No, I don't have dual-primary. My DRBD is in single-primary mode for
both DRBD resources.
I use an N+1 setup. I have 2 resource groups, each with a unique primary
and a shared secondary:
pbx_service_01 resource group has primary node-01 and secondary node-03
pbx_service_02 resource group has primary node-02 and secondary node-03

I use an asymmetric cluster with specific location constraints in order
to implement the above.
A DRBD resource will never be in primary mode on 2 nodes at the same time.
I have set specific collocation and order constraints in order to
"bond" each DRBD ms resource to the appropriate resource group.

I hope it is clear now.
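
For completeness, the "bond" constraints look roughly like this in crm
syntax (a sketch consistent with the config I posted earlier; ids
illustrative):

colocation pbx_01-on-drbd_01 inf: pbx_service_01 ms-drbd_01:Master
order drbd_01-before-pbx_01 inf: ms-drbd_01:promote pbx_service_01:start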

Cheers and thanks for looking at my conf,
Pavlos

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] crm resource move doesn't move the resource

2010-10-13 Thread Pavlos Parissis
On 11 October 2010 11:16, Pavlos Parissis  wrote:
> On 8 October 2010 09:29, Andrew Beekhof  wrote:
>> On Fri, Oct 8, 2010 at 8:34 AM, Pavlos Parissis
>>  wrote:
>>> On 8 October 2010 08:29, Andrew Beekhof  wrote:
>>>> On Thu, Oct 7, 2010 at 9:58 PM, Pavlos Parissis
>>>>  wrote:
>>>>>
>>>>>
>>>>> On 7 October 2010 09:01, Andrew Beekhof  wrote:
>>>>>>
>>>>>> On Sat, Oct 2, 2010 at 6:31 PM, Pavlos Parissis
>>>>>>  wrote:
>>>>>> > Hi,
>>>>>> >
>>>>>> > I am having again the same issue, in a different set of 3 nodes. When I
>>>>>> > try
>>>>>> > to failover manually the resource group on the standby node, the 
>>>>>> > ms-drbd
>>>>>> > resource is not moved as well and as a result the resource group is not
>>>>>> > fully started, only the ip resource is started.
>>>>>> > Any ideas why I am having this issue?
>>>>>>
>>>>>> I think it's a bug that was fixed recently.  Could you try the latest
>>>>>> code from Mercurial?
>>>>>
>>>>> 1.1 or 1.2 branch?
>>>>
>>>> 1.1
>>>>
>>> to save time on compiling stuff I want to use the available rpms on
>>> 1.1.3 version from rpm-next repo.
>>> But before I go and recreate the scenario, which means rebuild 3
>>> nodes, I would like to know if this bug is fixed in 1.1.3
>>
>> As I said, I believe so.
>
> I recreated the 3 node cluster and I didn't face that issue, but I am
> going to keep an eye on it for few days and even rerun the whole
> scenario (recreate 3 node cluster ...) just to be very sure. If I
> don't the see it again I will also close the bug report
>
> Thanks,
> Pavlos
>


I recreated the 3-node cluster using version 1.1.3 just to see if the
issue is solved, but it appeared again.
So, Andrew, the issue is not solved in 1.1.3. I am going to update the
bug report accordingly.

Cheers,
Pavlos

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Active-Active HA Firewall

2010-10-15 Thread Pavlos Parissis
On 15 October 2010 09:47, Marcel Hauser  wrote:

>
> But that is no problem. firewalling is no hard job any more. A reasonable
>> machine can firewall 1 GBit/s traffic.
>>
>
> valid point. my only "concern" is/was that i don't like the idea of a
> passive firewall because when you need it to failover (maybe after 2
> years :-) ) you may just realize that it's somehow broken too.
>

A monitoring system should help you out with this.


>
> In an active-active like setup you basically know that both system are
> actually working as expected.
>
>
>  - how would you guys detect a firewall failure on any node (pingd ??)...
>>> and if a failure occurs... will the crm automatically unconfigure the
>>> cloned ip's on that node ?
>>>
>>
>> pingd to check the availability of the attached network. The cluste
>> resource
>> manager takes care for the failover. See the "from the scratch" doc.
>>
>
> Yes i've read that in the docs. But is this really common practice for
> firewall clusters ? i don't want the firewall to failover if i'm having
> "internal problems with internal hosts/pingable addresses"!?
>
> otherwise i have to build an internal ping cluster ;-)
>

I have always believed that you should only trigger a failover when
something that is needed to offer the service is not available (a disk, a
filesystem, a NIC etc).

Having said that, I believe a firewall, in order to be operational, needs
access to common elements like disk/fs/nic and, on top of that, to uplink
routers or to any routers that are part of its routing table. Furthermore,
a firewall needs access to any layer-2 switch which gives it access to the
attached LANs.

But deciding which elements should be part of the "health system" has to do
with the network design and with whether layer-2 or layer-3 redundancy
exists in your environment. If layer-2 or layer-3 redundancy is not
available, it makes little sense to add those elements to your "health
system", because in case of a failure the element won't be accessible by
the standby firewall either.


> why did you choose to run conntrackd and heartbeat over a dedicated bonding
> interface in your pdf, compared to the FW builder docs which say to run
> heartbeat over every interface of the firewall, which therefore might enable
> the cluster to detect network card failures... because the heartbeat is not
> received over a given failed interface anymore ?
>
>
>  Rumors say that the is a good German book about clusters from O'Reilly. In
>> the
>> examples chapter the author exactly describes the setup you mentioned. ;-)
>>
>
> :-) i've seen that... but i hate reading books (no matter on what
> topic)... and my learning curve is much more efficient if i learn it myself
> :-)
>

I did a quick search and I couldn't find it; what is the name of the book?


> but thanks for the hint... any i really appreciate your and any other help!
>
>
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Help understanding why a failover occurred.

2010-10-16 Thread Pavlos Parissis
On 16 October 2010 00:45, Jai  wrote:

> I have setup a DRBD->Xen failover cluster. Last night at around 02:50 it
> failed the resources from server "bravo" to "alpha". I'm trying to find out
> what caused the failover of resources. I don't see anything in the logs that
> indicate the cause but I don't really know what to look for. If someone
> could help me understand these logs and what I'm looking for would be great.
> I'm not even sure how far back I need to go.
>
>
>
I don't see anything either, but I am not surprised by that. I have seen a
similar issue on my cluster where the logs weren't that helpful.

Cheers,
Pavlos
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] using xml for rules

2010-10-17 Thread Pavlos Parissis
Hi,

I am trying to make a rule to control the failback of resources. I want
resource-stickiness to be 1000 during working days from 06:00 to 23:00 and
on weekends from 08:00 to 16:00, and zero during the remaining hours, so
the cluster can fail back any resource which failed over during the
working hours.
I wrote the following but I get the error below [1]. I am no XML guru, so
I must have made some stupid mistake here. Any hints?


<rsc_defaults>
  <meta_attributes id="rsc-options-working-hours">
    <rule id="working-hours-rule" score="0" boolean-op="or">
      <date_expression id="weekdays" operation="date_spec">
        <date_spec id="weekdays-spec" hours="6-23" weekdays="1-5"/>
      </date_expression>
      <date_expression id="weekend" operation="date_spec">
        <date_spec id="weekend-spec" hours="8-16" weekdays="6-7"/>
      </date_expression>
    </rule>
    <nvpair id="stickiness-working-hours" name="resource-stickiness" value="1000"/>
  </meta_attributes>
  <meta_attributes id="rsc-options-default" score="0">
    <nvpair id="stickiness-default" name="resource-stickiness" value="0"/>
  </meta_attributes>
</rsc_defaults>

[1]
cibadmin -V --replace --obj_type rsc_defaults --xml-file tmp.xml
Call cib_replace failed (-47): Update does not conform to the configured
schema/DTD
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Help understanding why a failover occurred.

2010-10-18 Thread Pavlos Parissis
On 18 October 2010 04:03, Tim Serong  wrote:

> On 10/16/2010 at 09:45 AM, Jai  wrote:
> > I have setup a DRBD->Xen failover cluster. Last night at around 02:50 it
> failed
> > the resources from server "bravo" to "alpha". I'm trying to find out what
> > caused the failover of resources. I don't see anything in the logs that
> > indicate the cause but I don't really know what to look for. If someone
> could
> > help me understand these logs and what I'm looking for would be great.
> I'm
> > not even sure how far back I need to go.
>
> I reckon it's this:
>
> Oct 16 02:46:04 bravo attrd: [25098]: info: attrd_perform_update: Sent
> update 161: pingval=0
>
> Which suggests bravo lost connectivity to 12.12.12.1 around that time,
> causing
> the failover.
>
> For reference, if you're looking at pengine logs...  A few lines above
> where
> it says "info: process_pe_message: Transition NNN: PEngine Input stored in:
> /var/lib/pengine/pe-input-MMM.bz2", you'll see what it's about to do to
> your
> resources.  If this is just: "Leave resource FOO (Started/Master/Slave
> etc.)"
> that transition is probably boring.  If it says "Start FOO (...)" or
> "Promote/Demote/Stop FOO (...)", it means something has changed.  Scroll up
> a bit, to above where pengine is saying "unpack_config",
> "determine_node_status"
> etc. and you should see a message suggesting the cause for the change
> (failed
> op, timeout, ping attribute modified, etc.)  It might be a bit inscrutable
> sometimes, but it'll be there somewhere...
>
> HTH
>
>
These are very useful tips for understanding the logs.
Pavlos


Re: [Pacemaker] Help understanding why a failover occurred.

2010-10-18 Thread Pavlos Parissis
On 18 October 2010 05:17, Jai  wrote:

> > I don't seen anything as well, but I am not surprised by that. I have
> seen a
> > similar issue on my cluster where logs weren't that helpful.
>
> Does it still occur on your cluster?
>
No, I haven't seen it again. But it could also be that I simply couldn't
see it in the logs.
If I were you, I would follow Tim's advice; you may find out what really
happened.

Pavlos


Re: [Pacemaker] Question: How many nodes can join a cluster?

2010-10-18 Thread Pavlos Parissis
On 18 October 2010 10:52, Florian Haas  wrote:

> - Original Message -
> > From: "Andreas Vogelsang" 
> > To: pacemaker@oss.clusterlabs.org
> > Sent: Monday, October 18, 2010 9:46:12 AM
> > Subject: [Pacemaker] Question: How many nodes can join a cluster?
> > Hello,
> >
> >
> >
> > I’m creating a presentation about a virtual Linux-HA Cluster. I just
> > asked me how many nodes pacemaker can handle. Mr. Schwartzkopff wrote
> > in his Book that Linux-HA version 2 can handle up to 16 Nodes. Is this
> > also true for pacemaker?
>

I have been asked the same question, and my answer was: let's say it is
126; what is the use of having 126 nodes in a cluster?
Can you imagine going through the logs to find out why resource-XXX failed
when there are 200 resources?!

The only use for having 126 nodes is HPC, but HPC is a totally different
story from highly available clusters.
Even in an N+N setup I wouldn't go with more than 4 or 6 nodes.


My 2 cents,
Pavlos


Re: [Pacemaker] Question: How many nodes can join a cluster?

2010-10-18 Thread Pavlos Parissis
On 18 October 2010 11:13, Dan Frincu  wrote:

>  Pavlos Parissis wrote:
>
>
>
> On 18 October 2010 10:52, Florian Haas  wrote:
>
>> - Original Message -
>> > From: "Andreas Vogelsang" 
>> > To: pacemaker@oss.clusterlabs.org
>> > Sent: Monday, October 18, 2010 9:46:12 AM
>> > Subject: [Pacemaker] Question: How many nodes can join a cluster?
>> > Hello,
>> >
>> >
>> >
>> > I’m creating a presentation about a virtual Linux-HA Cluster. I just
>> > asked me how many nodes pacemaker can handle. Mr. Schwartzkopff wrote
>> > in his Book that Linux-HA version 2 can handle up to 16 Nodes. Is this
>> > also true for pacemaker?
>>
>
> I have been asked the same question and I said to them, let's say it is
> 126, what is the use of having 126 nodes in the cluster?
> Can someone imagine himself going through the logs to find why the
> resource-XXX failed while there are 200 resources?!!
>
> The only use of having 126 nodes is if you want to have HPC, but HPC is
> total different story than high available clusters.
> Even in N+N setup I would go with more than 4 or 6 nodes.
>
>
> My 2 cents,
> Pavlos
>
>
>  Actually, the syslog_facility in corosync.conf allows you to specify
> either a log file for each node in the cluster (locally), or setting up a
> remote syslog server. Either way, identifying the node by hostname or some
> other identifier should point out what is going on where. Granted, it's a
> large amount of data to process, therefore (such is the case with any large
> deployment) SNMP is a much better alternative for tracking issues, or (if
> you have _126_ times the same resource) adding some notification options to
> the RA might be a choice, such as SNMP trap, or even email.
>
> BTW, I'm also interested in this, I remember reading something about 64
> nodes, but I'd appreciate an official response.
>
Have you ever troubleshot a 4-node cluster at 01:00 at night? Believe me,
it is not fun.

I am not saying there are no use cases that require a lot of nodes, but I
doubt there are many of them for highly available clusters.
Adding nodes and services without a second thought increases complexity,
which is one of the main root causes of major problems.


Cheers,
Pavlos


Re: [Pacemaker] Move DRBD master

2010-10-19 Thread Pavlos Parissis
On 19 October 2010 01:18, Vadym Chepkov  wrote:

> Hi,
>
> What is the crm shell command to move drbd master to a different node?
>
>
Take a look at this:
http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg06300.html


Re: [Pacemaker] unpack_rsc_op: Hard error

2010-10-19 Thread Pavlos Parissis
On 19 October 2010 14:16, Andrew Beekhof  wrote:

> On Mon, Oct 11, 2010 at 11:25 AM, Pavlos Parissis
>  wrote:
> > On 10 October 2010 17:39, Andrew Beekhof  wrote:
> >> On Sat, Oct 9, 2010 at 11:20 PM, Pavlos Parissis
> >>  wrote:
> >>> Hi,
> >>>
> >>> Does anyone know why PE wants to unpack resources on nodes that will
> >>> never run due to location constraints?
> >>
> >> Because part of its job is to make sure they dont run there.
> >>
> >>> I am getting this messages and I am wondering if they harmless or not.
> >>
> >> Basically yes.  We've since reduced this to an informational message.
> >>
> > So, it is not necessary to place the LSB script of a resource to nodes
> > where the resource will never run, due to location constraints.Am I
> > right?
>
> Correct, though the probes might show up in crm_mon as "failed".
>
>
>
Even though that is correct, I placed the script on all nodes, just to
avoid the warnings.

Thanks,
Pavlos


Re: [Pacemaker] Failover domains?

2010-10-26 Thread Pavlos Parissis
On 25 October 2010 19:50, David Quenzler  wrote:

> Is there a way to limit failover behavior to a subset of cluster nodes
> or pin a resource to a node?
>
>
Yes, there is a way.

Make sure you have an asymmetric cluster by setting symmetric-cluster to
false, and then configure your location constraints so that you get the
failover domains you want.
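
In the crm shell the property is set like this:

crm configure property symmetric-cluster=false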

Here is an example from my cluster, where I have 3 nodes and 2 resource
groups. Each resource group has its own primary node, but both share the
same secondary node.

location PrimaryNode-pbx_service_01 pbx_service_01 200: node-01
location PrimaryNode-pbx_service_02 pbx_service_02 200: node-02

location SecondaryNode-pbx_service_01 pbx_service_01 10: node-03
location SecondaryNode-pbx_service_02 pbx_service_02 10: node-03


Cheers,
Pavlos


[Pacemaker] AP9606 fencing device

2010-10-27 Thread Pavlos Parissis
Hi,

I have an APC AP9606 PDU and I am trying to find a stonith agent which
works with that PDU.
The apcmaster and apcmastersnmp agents don't work, as you can see below. I
managed to get rackpdu working by setting the outlet config (the OID used
for snmpwalk fails) and by also setting the command OID.
Here is the full command:

stonith -t external/rackpdu hostlist=node-01,node-02,node-03
pduip=192.168.100.100 oid=.1.3.6.1.4.1.318.1.1.4.4.2.1.3
community=private outlet_config=/tmp/outlet_config -T on node-01


Does anyone know any other PDU which works out of the box with the
supplied stonith agents?

Regards,
Pavlos

[r...@node-01 ~]# stonith -t apcmastersnmp ipaddr=192.168.100.100
port=161 community=private -S

** (process:3887): CRITICAL **: APC_read: error in response packet,
reason 2 [(noSuchName) There is no such variable name in this MIB.].

** (process:3887): CRITICAL **: apcmastersnmp_set_config: cannot read
number of outlets.
Invalid config info for apcmastersnmp device
Valid config names are:
ipaddr
port
community


[r...@node-01 ~]# stonith -t apcmaster ipaddr=192.168.100.100
login=stonith password=stonith -S

** (process:4215): CRITICAL **: Did not find string Escape character
is '^]'. from APC MasterSwitch.

** (process:4215): CRITICAL **: Received
[\xff\xfb\u0001\xff\xfb\u0003\xff\xfd\u0003
\u000dUser Name : ]

** (process:4215): CRITICAL **: Did not find string Escape character
is '^]'. from APC MasterSwitch.

** (process:4215): CRITICAL **: Received []
connect() failed: Connection reset by peer

** (process:4215): CRITICAL **: Did not find string Escape character
is '^]'. from APC MasterSwitch.

** (process:4215): CRITICAL **: Received []
connect() failed: Connection reset by peer
connect() failed: Connection reset by peer

** (process:4215): CRITICAL **: Did not find string Escape character
is '^]'. from APC MasterSwitch.



Re: [Pacemaker] AP9606 fencing device

2010-10-27 Thread Pavlos Parissis
On 27 October 2010 13:12, Vadym Chepkov  wrote:
>
> On Oct 27, 2010, at 3:47 AM, Pavlos Parissis wrote:
>>
>> Does anyone know any other PDU which works out of box with the
>> supplied stonith agents?
>>
>
> I use APC AP7901, works like a charm:
>
> primitive pdu stonith:external/rackpdu \
>        params pduip="10.6.6.6" community="pdu-6" hostlist="AUTO"
> clone fencing pdu
>
> Vadym

Then most likely the default OIDs of the rackpdu agent match the OIDs of
the AP7901.
In my case I have to set the OID for the device itself,
1.3.6.1.4.1.318.1.1.4.4.2.1.3, and the OID for retrieving (via snmpwalk)
the outlet list, .1.3.6.1.4.1.318.1.1.4.4.2.1.4.
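
For example, the outlet names can be listed with plain snmpwalk (community
string and PDU IP as in my earlier command):

snmpwalk -v1 -c private 192.168.100.100 .1.3.6.1.4.1.318.1.1.4.4.2.1.4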

Hold on a sec, are you using a clone on the AP7901? Does it support
multiple connections? Mine doesn't.

Cheers,
Pavlos



Re: [Pacemaker] AP9606 fencing device

2010-10-27 Thread Pavlos Parissis
On 27 October 2010 13:43, Vadym Chepkov  wrote:

>
> On Oct 27, 2010, at 7:27 AM, Pavlos Parissis wrote:
>
> > On 27 October 2010 13:12, Vadym Chepkov  wrote:
> >>
> >> On Oct 27, 2010, at 3:47 AM, Pavlos Parissis wrote:
> >>>
> >>> Does anyone know any other PDU which works out of box with the
> >>> supplied stonith agents?
> >>>
> >>
> >> I use APC AP7901, works like a charm:
> >>
> >> primitive pdu stonith:external/rackpdu \
> >>params pduip="10.6.6.6" community="pdu-6" hostlist="AUTO"
> >> clone fencing pdu
> >>
> >> Vadym
> >
> > Then most likely the defaults OIDs of the rackpdu agents matches the
> > OIDs of the AP7901.
> > In my case I have to use OID for the device itself
> > 1.3.6.1.4.1.318.1.1.4.4.2.1.3  and OID for retrieving (snmpwalk) the
> > outlet list .1.3.6.1.4.1.318.1.1.4.4.2.1.4 .
> >
> > Hold on a sec, are you using clone on AP7901? Does it support multiple
> > connections? Mine it doesn't.
>
> Then it's useless regardless clone or not, you have to have multiple
> instances, because server can't reliable fence itself, right?
>
>
>
My understanding is/was that I need one stonith resource running on one of
the 3 nodes in the cluster, and if a fence event has to be triggered,
pacemaker will send it to that one stonith resource. I am planning to test
that in the coming days.[1]
Am I right? If not, I have to buy a different PDU! :-(

Cheers,
Pavlos


[1] By testing I mean killing the heartbeat links on one node; the DC node
should then fence that node.


Re: [Pacemaker] AP9606 fencing device

2010-10-27 Thread Pavlos Parissis
On 27 October 2010 14:09, Vadym Chepkov  wrote:
>
[...snip...]

>> > Hold on a sec, are you using clone on AP7901? Does it support multiple
>> > connections? Mine it doesn't.
>>
>> Then it's useless regardless clone or not, you have to have multiple
>> instances, because server can't reliable fence itself, right?
>>
>>
>
> My understanding is/was that I need to have one resource running on 1 of
the
> 3 nodes in the cluster and if a fence event has to be triggered then
> pacemaker will send to it to the one stonith resource. I am planning to
test
> that the coming days.[1]
> Am I right? if not then I have to buy a different PDU! :-(
>
> My understanding is you have to have a fencing device for each of your
> hosts. Are you sure one connection limitation applies for SNMP? Most
likely
> it's only for tcp sessions - ssh/http ?

Valid point, Vadym; SNMP runs over UDP, so the communication is
connectionless.
I am wondering how I can test whether cloning works on this PDU.

> If you look into rackpdu log you will see this:
> Oct 19 12:39:00 xen-11 stonithd: [8606]: debug: external_run_cmd: Calling
> '/usr/lib64/stonith/plugins/external/rackpdu gethosts'
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_run_cmd:
> '/usr/lib64/stonith/plugins/external/rackpdu gethosts' output: xen-11
xen-12
> Outlet_3 Outlet_4 Outlet_5 Outlet_6 Outlet_7 Outlet_8
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: running
> 'rackpdu gethosts' returned 0
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu
> host xen-11
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu
> host xen-12
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu
> host Outlet_3
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu
> host Outlet_4
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu
> host Outlet_5
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu
> host Outlet_6
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu
> host Outlet_7
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu
> host Outlet_8
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: remove us (xen-11) from
the
> host list for pdu:0
> check the last line - the agent is smart enough to know it can't fence
> itself.
> Vadym
>


Re: [Pacemaker] AP9606 fencing device

2010-10-27 Thread Pavlos Parissis
On 27 October 2010 14:11, Dejan Muhamedagic  wrote:

> Hi,
>
> On Wed, Oct 27, 2010 at 01:58:20PM +0200, Pavlos Parissis wrote:
> > On 27 October 2010 13:43, Vadym Chepkov  wrote:
> >
> > >
> > > On Oct 27, 2010, at 7:27 AM, Pavlos Parissis wrote:
> > >
> > > > On 27 October 2010 13:12, Vadym Chepkov  wrote:
> > > >>
> > > >> On Oct 27, 2010, at 3:47 AM, Pavlos Parissis wrote:
> > > >>>
> > > >>> Does anyone know any other PDU which works out of box with the
> > > >>> supplied stonith agents?
> > > >>>
> > > >>
> > > >> I use APC AP7901, works like a charm:
> > > >>
> > > >> primitive pdu stonith:external/rackpdu \
> > > >>params pduip="10.6.6.6" community="pdu-6" hostlist="AUTO"
> > > >> clone fencing pdu
> > > >>
> > > >> Vadym
> > > >
> > > > Then most likely the defaults OIDs of the rackpdu agents matches the
> > > > OIDs of the AP7901.
> > > > In my case I have to use OID for the device itself
> > > > 1.3.6.1.4.1.318.1.1.4.4.2.1.3  and OID for retrieving (snmpwalk) the
> > > > outlet list .1.3.6.1.4.1.318.1.1.4.4.2.1.4 .
> > > >
> > > > Hold on a sec, are you using clone on AP7901? Does it support
> multiple
> > > > connections? Mine it doesn't.
> > >
> > > Then it's useless regardless clone or not, you have to have multiple
> > > instances, because server can't reliable fence itself, right?
> > >
> > >
> > >
> > My understanding is/was that I need to have one resource running on 1 of
> the
> > 3 nodes in the cluster and if a fence event has to be triggered then
> > pacemaker will send to it to the one stonith resource. I am planning to
> test
> > that the coming days.[1]
> > Am I right? if not then I have to buy a different PDU! :-(
>
> Yes. In case a node which is currently running the stonith
> resource is to be fenced, then the stonith resource would move
> elsewhere first. But, yes, you should test this just like
> anything else. Make sure to test both the "node gone" event
> (failed links) and a critical action failing (such as stop).
>
>
>
I am going to test this.

Cheers,
Pavlos


Re: [Pacemaker] AP9606 fencing device

2010-10-27 Thread Pavlos Parissis
Hi,

I quickly tested the cloned version of this fencing resource and it
worked. I used iptables to break the heartbeat link on node-01, and it was
fenced by the other node, the DC.
In the coming days I will test without cloning the fencing device.
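
For the record, this is roughly how I broke the link (a sketch; heartbeat's
default communication port is 694/udp):

iptables -A INPUT  -p udp --dport 694 -j DROP
iptables -A OUTPUT -p udp --dport 694 -j DROP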

Cheers,
Pavlos


Re: [Pacemaker] AP9606 fencing device

2010-10-27 Thread Pavlos Parissis
On 27 October 2010 14:09, Vadym Chepkov  wrote:

>
> On Oct 27, 2010, at 7:58 AM, Pavlos Parissis wrote:
>
>
> On 27 October 2010 13:43, Vadym Chepkov  wrote:
>
>>
>> On Oct 27, 2010, at 7:27 AM, Pavlos Parissis wrote:
>>
>> > On 27 October 2010 13:12, Vadym Chepkov  wrote:
>> >>
>> >> On Oct 27, 2010, at 3:47 AM, Pavlos Parissis wrote:
>> >>>
>> >>> Does anyone know any other PDU which works out of box with the
>> >>> supplied stonith agents?
>> >>>
>> >>
>> >> I use APC AP7901, works like a charm:
>> >>
>> >> primitive pdu stonith:external/rackpdu \
>> >>params pduip="10.6.6.6" community="pdu-6" hostlist="AUTO"
>> >> clone fencing pdu
>> >>
>> >> Vadym
>> >
>> > Then most likely the defaults OIDs of the rackpdu agents matches the
>> > OIDs of the AP7901.
>> > In my case I have to use OID for the device itself
>> > 1.3.6.1.4.1.318.1.1.4.4.2.1.3  and OID for retrieving (snmpwalk) the
>> > outlet list .1.3.6.1.4.1.318.1.1.4.4.2.1.4 .
>> >
>> > Hold on a sec, are you using clone on AP7901? Does it support multiple
>> > connections? Mine it doesn't.
>>
>> Then it's useless regardless clone or not, you have to have multiple
>> instances, because server can't reliable fence itself, right?
>>
>>
>>
> My understanding is/was that I need to have one resource running on 1 of
> the 3 nodes in the cluster and if a fence event has to be triggered then
> pacemaker will send to it to the one stonith resource. I am planning to test
> that the coming days.[1]
> Am I right? if not then I have to buy a different PDU! :-(
>
>
> My understanding is you have to have a fencing device for each of your
> hosts. Are you sure one connection limitation applies for SNMP? Most likely
> it's only for tcp sessions - ssh/http ?
> If you look into rackpdu log you will see this:
>
> Oct 19 12:39:00 xen-11 stonithd: [8606]: debug: external_run_cmd: Calling
> '/usr/lib64/stonith/plugins/external/rackpdu gethosts'
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_run_cmd:
> '/usr/lib64/stonith/plugins/external/rackpdu gethosts' output: xen-11 xen-12
> Outlet_3 Outlet_4 Outlet_5 Outlet_6 Outlet_7 Outlet_8
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: running
> 'rackpdu gethosts' returned 0
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu
> host xen-11
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu
> host xen-12
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu
> host Outlet_3
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu
> host Outlet_4
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu
> host Outlet_5
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu
> host Outlet_6
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu
> host Outlet_7
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu
> host Outlet_8
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: remove us (xen-11) from the
> host list for pdu:0
>
> check the last line - the agent is smart enough to know it can't fence
> itself.
>
>
>
Do you enable debugging by setting debug 1 in ha.cf?
And do you see this WARN on your system?
stonith-ng: [3369]: WARN: parse_host_line: Could not parse (0 42):
/usr/lib/stonith/plugins/external/rackpdu: line 125: local: can only be used
in a function
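
For what it's worth, that message comes from a 'local' declaration used at
the top level of the script, which the shell only allows inside functions.
A sketch of the kind of one-line hack that silences it (not the literal
plugin line):

local outlets="$names"   # fails at top level with the WARN above
outlets="$names"         # dropping the keyword avoids it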

Cheers,
Pavlos


Re: [Pacemaker] AP9606 fencing device

2010-10-27 Thread Pavlos Parissis
On 27 October 2010 19:23, Vadym Chepkov  wrote:

>
> On Oct 27, 2010, at 1:18 PM, Pavlos Parissis wrote:
>
>
> ok, i have done the same hack but i will remove it. I think 1.1.4 will be
> out before we go on production and hopefully this will be fixed in 1.1.4.
>
>
>
> This is part of cluster-glue, not pacemaker and it's 1.0.6 now
>
>
Yeah, you are right and I am wrong.


Re: [Pacemaker] AP9606 fencing device

2010-10-27 Thread Pavlos Parissis
On 27 October 2010 19:25, Vadym Chepkov  wrote:

>
> On Oct 27, 2010, at 1:19 PM, Pavlos Parissis wrote:
>
>
>
> On 27 October 2010 17:08, Vadym Chepkov  wrote:
>
>>
>> On Oct 27, 2010, at 11:02 AM, Pavlos Parissis wrote:
>>
>> > BTW
>> > here is my conf for the fencing
>> > primitive pdu stonith:external/rackpdu \
>> > params community="empisteftiko"
>> names_oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.4"
>> oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.3" hostlist="AUTO" pduip="192.168.100.100"
>> stonith-timeout="30"
>> > clone fencing pdu \
>> > meta target-role="Started"
>> > location fencing-on-node-01 fencing 1: node-01
>> > location fencing-on-node-02 fencing 1: node-02
>> > location fencing-on-node-03 fencing 1: node-03
>> >
>> > am I missing something?
>> >
>>
>> I would say you have "extra" :)
>> why do you need location constraints for this device?
>>
>> Vadym
>>
>>
> I use symmetric-cluster="false" and if I don't set location constraints the
> resource will not start.
>
>
>
>
> oh, I would expect stonith resources to be "exempt"
>
>
It is a typical resource like any other, although I expected the same.


Re: [Pacemaker] AP9606 fencing device

2010-10-27 Thread Pavlos Parissis
I did more testing using the cloned fencing resource, and it worked as I
expected.

Test 1: hacked the init script to return 1 on stop, then ran a crm
resource move on that resource.
Result: the node was fenced and the resource was started on the other node.

Test 2: used the firewall to break the heartbeat links on the node holding
the resource.
Result: the node was fenced and the resource was started on the other node.

As Dejan suggested, I am going to run the same type of tests with a single
(non-cloned) fencing resource.
In that test I will try to cause fencing of the node which has the fencing
resource running on it, and see if pacemaker moves the resource before it
fences the node.
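
A sketch of test 1 (resource and node names as in my earlier posts; the
only change is the stop branch of the LSB script):

stop)
    # force a stop failure so pacemaker escalates to fencing
    exit 1
    ;;

crm resource move pbx_service_01 node-03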


Cheers,
Pavlos


Re: [Pacemaker] AP9606 fencing device

2010-10-27 Thread Pavlos Parissis
On 27 October 2010 19:46, Pavlos Parissis  wrote:

> I did more testing using the clone type of fencing and worked as I
> expected.
>
> test1 hack init script to return 1 on stop and run a crm resource move on
> that resource
> result node it was fenced and resource was started on the other node
>
> test2 using firewall to break the heartbeat links on node with resource
> result node it was fenced and resource was started on the other node
>
> As Dejan suggested I am going to run the same type of tests when 1 fence
> resource is used.
> In this test I will try to cause a fencing on the node which has fencing
> resource running on it and see if pacemaker moves the resource before it
> fences the node.
>
>
>
>
I did the same tests without cloning, and pacemaker moves the fencing
resource before it triggers a reboot on the node where the fencing
resource was running.
So a cloned fencing resource and a single fencing resource show the same
behaviour, at least for these 2 tests.
Now I don't know which configuration I should choose!

Cheers,
Pavlos


Re: [Pacemaker] Multiple independent two-node clusters side-by-side?

2010-10-27 Thread Pavlos Parissis
This post has a lot of information for you on this subject:
http://www.gossamer-threads.com/lists/linuxha/users/67482?search_string=Redundant%20Rings%20%26quot;Still%20Not%20There%3F;#67482

Cheers,
Pavlos


Re: [Pacemaker] AP9606 fencing device

2010-10-28 Thread Pavlos Parissis
On 28 October 2010 10:21, Dejan Muhamedagic  wrote:

> Hi,
>
> On Wed, Oct 27, 2010 at 08:15:09PM +0200, Pavlos Parissis wrote:
> > On 27 October 2010 19:46, Pavlos Parissis 
> wrote:
> >
> > > I did more testing using the clone type of fencing and worked as I
> > > expected.
> > >
> > > test1 hack init script to return 1 on stop and run a crm resource move
> on
> > > that resource
> > > result node it was fenced and resource was started on the other node
> > >
> > > test2 using firewall to break the heartbeat links on node with resource
> > > result node it was fenced and resource was started on the other node
> > >
> > > As Dejan suggested I am going to run the same type of tests when 1
> fence
> > > resource is used.
> > > In this test I will try to cause a fencing on the node which has
> fencing
> > > resource running on it and see if pacemaker moves the resource before
> it
> > > fences the node.
> > >
> > >
> > >
> > >
> > I did the same tests without cloning and pacemaker moves fencing resource
> > before triggers a reboot on the node where fencing resource was running.
> > So, cloning fencing resource and having just one fence resource have the
> > same behaviour! at least for these 2 tests.
> > now I don't know which configuration solution I should choose!
>
> Whichever you feel more comfortable with, providing that the
> device really can support multiple connections simultaneously.
> I'd opt for non-cloned version. It's simpler, it avoids possible
> device contention.
>
> Thanks,
>
>
Under which conditions does pacemaker initiate multiple connections to a
fencing device?
Given that the rackpdu agent uses SNMP, so no per-session connection
limits should apply, I don't quite understand how the cloned version would
give me issues. I am making a big assumption here, namely that connection
limits are not applicable when the fencing device is contacted over SNMP.

Furthermore, with the cloned version a fence event triggers faster than
with the non-cloned version, because in the non-cloned case the resource
must first move to another node if the node to be fenced holds the fencing
resource.

Because of the above, I have selected the cloned version for now. But your
mail worries me a bit.

What test can I do in order to make sure that the cloned version will not
give me issues?
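
One way I can think of to approximate concurrent access (a sketch; node
names and parameters as in my configuration) is to fire the status check
from all three nodes at the same time and watch for errors:

for n in node-01 node-02 node-03; do
    ssh $n 'stonith -t external/rackpdu pduip=192.168.100.100 \
        community=empisteftiko hostlist=AUTO -S' &
done
wait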

Cheers,
Pavlos


[Pacemaker] Pacemaker-1.1.4, when?

2010-10-28 Thread Pavlos Parissis
Hi,

When do we expect to have Pacemaker-1.1.4 available?

Cheers,
Pavlos



Re: [Pacemaker] Impossible to add a 4th node to a cluster

2010-10-28 Thread Pavlos Parissis
On 28 October 2010 16:09, Guillaume Chanaud
 wrote:
>  Hello,
>
> i have a cluster of two master/slave drbd server running into a vlan
> (machines are dedicated servers)
> (filer1 and filer2)
> I added a third node to the cluster (a "blank node" for the moment)
> correctly
> (server1)
> When i add a 4th node to the cluster (which is a "mirror" of server1)
> (server2)
> this node start as standalone...Here is the message.log :
>
> Oct 28 15:59:27 ns209045 corosync[16543]:   [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> Oct 28 15:59:28 ns209045 corosync[16543]:   [pcmk  ] notice:
> pcmk_peer_update: Transitional membership event on ring 945392: memb=1,
> new=0, lost=0
> Oct 28 15:59:28 ns209045 corosync[16543]:   [pcmk  ] info: pcmk_peer_update:
> memb: server2 16820416
> Oct 28 15:59:28 ns209045 corosync[16543]:   [pcmk  ] notice:
> pcmk_peer_update: Stable membership event on ring 945392: memb=1, new=0,
> lost=0
> Oct 28 15:59:28 ns209045 corosync[16543]:   [pcmk  ] info: pcmk_peer_update:
> MEMB: server2 16820416
> Oct 28 15:59:28 ns209045 corosync[16543]:   [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> Oct 28 15:59:29 ns209045 corosync[16543]:   [pcmk  ] notice:
> pcmk_peer_update: Transitional membership event on ring 945416: memb=1,
> new=0, lost=0
> Oct 28 15:59:29 ns209045 corosync[16543]:   [pcmk  ] info: pcmk_peer_update:
> memb: server2 16820416
> Oct 28 15:59:29 ns209045 corosync[16543]:   [pcmk  ] notice:
> pcmk_peer_update: Stable membership event on ring 945416: memb=1, new=0,
> lost=0
> Oct 28 15:59:29 ns209045 corosync[16543]:   [pcmk  ] info: pcmk_peer_update:
> MEMB: server2 16820416
>
> [...] Message repeat many many times
>
> Now i stop the server1, and i start the server2...server2 start correctly
> and is added to the cluster...but when
> i want to start server1, same thing happens...(so things are inverted but
> result is the same...when i start one the serverX, the other can't start...)
>
> My corosync.conf is configured in broadcast, not multicastI have lots of
> problem with multicast because lots of briged VM on the vlan
> doesn't see the multicast packets, or doesn't join the multicast group
> correctly...
>
> Any hint on this ??

Are the corosync and auth files the same on server2?



Re: [Pacemaker] Impossible to add a 4th node to a cluster

2010-10-28 Thread Pavlos Parissis
On 28 October 2010 18:30, Guillaume Chanaud
 wrote:
[...snip...]
>> corosync and auth files are the same on server2?
>>
>
> Yes of course :D (copied by scp), as i told server1 can join when server2 is
> offline, and server 2 can join when server1 is offline, but if one is
> online, the other can't join and log the above things in loop...

Hmm, you said that your server2 is a clone of server1; check whether they
have different UUIDs.

> In fact i have lttss of problem with
> corosync/pacemaker...multicast/broadcast between physical
> servers/virtuallots of different shit everywhere, error log are always
> different depending on what i try...

Try to go step by step: make sure you have correct rings, and check the
related threads about rings.

>
> The strange things is that the filer1 filer2 server2 and server1 are all
> running the same distro (gentoo) with same tools and are on the same vlan
> (which is working for lots of services like nfs...)
>



Re: [Pacemaker] Pacemaker-1.1.4, when?

2010-10-28 Thread Pavlos Parissis
On 28 October 2010 22:55, Andrew Beekhof  wrote:
> Its released already, but the wrong packages got built because I ran
> the wrong command :-(
> Fedora 13 packages are uploading now, I'll do opensuse 11.3 in the morning

I have seen the tag on Mercurial, but I haven't seen any rpm in rpm-next
for EPEL, so I thought you were still testing the release.
When do you expect to have the builds for EPEL?

Thanks,
Pavlos



[Pacemaker] PE ignores monitor failure of stonith:external/rackpdu

2010-10-28 Thread Pavlos Parissis
Hi,

I wanted to check what happens when the monitor of a fencing agent fails,
so I disconnected the PDU from the network, reduced the monitor interval,
and put debug statements in the fencing script.

Here are the debug statements in the status branch:
status)
    if [ -z "$pduip" ]; then
        exit 1
    fi
    date >> /tmp/pdu.monitor
    if ping -w1 -c1 $pduip >/dev/null 2>&1; then
        exit 0
    else
        echo "failed" >> /tmp/pdu.monitor
        exit 1
    fi
    ;;


Here is the debug output, which shows that the monitor failed:
[r...@node-03 tmp]# cat pdu.monitor
Fri Oct 29 08:29:20 CEST 2010
Fri Oct 29 08:31:05 CEST 2010
failed
Fri Oct 29 08:32:50 CEST 2010
failed

But pacemaker thinks it is fine:
[r...@node-03 tmp]# crm status|grep pdu
 pdu(stonith:external/rackpdu): Started node-03
[r...@node-03 tmp]#


And here is the resource:
primitive pdu stonith:external/rackpdu \
params community="empisteftiko"
names_oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.4"
oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.3" hostlist="AUTO"
pduip="192.168.100.100" stonith-timeout="30" \
op monitor interval="1m" timeout="60s"


Is it the expected behaviour?

Cheers,
Pavlos



Re: [Pacemaker] Pacemaker-1.1.4, when?

2010-10-29 Thread Pavlos Parissis
On 29 October 2010 10:25, Andrew Beekhof  wrote:
> On Fri, Oct 29, 2010 at 8:15 AM, Pavlos Parissis
>  wrote:
>> On 28 October 2010 22:55, Andrew Beekhof  wrote:
>>> Its released already, but the wrong packages got built because I ran
>>> the wrong command :-(
>>> Fedora 13 packages are uploading now, I'll do opensuse 11.3 in the morning
>>
>> I have seen the tag on Mercurial but I haven't seen any rpm on
>> rpm-next for EPEL and I thought  you are still testing the release.
>> When do you expect to have the builds for EPEL?
>
> There wont be unfortunately.
> Some of the changes we needed to make involved the use of
> g_hash_table_get_values() which only appeared in glib 2.14
> So EPEL5 is stuck on the 1.0 series.

Does that mean I shouldn't use 1.1.x (x<4) on EPEL? I guess not, since the
change you mentioned is only in 1.1.4.
I currently use 1.1.3 on EPEL 5.4.

Cheers,
Pavlos



Re: [Pacemaker] Pacemaker-1.1.4, when?

2010-10-29 Thread Pavlos Parissis
On 29 October 2010 11:47, Andrew Beekhof  wrote:
[...snip..]
>>> There wont be unfortunately.
>>> Some of the changes we needed to make involved the use of
>>> g_hash_table_get_values() which only appeared in glib 2.14
>>> So EPEL5 is stuck on the 1.0 series.
>>
>> Does that mean I shouldn't use 1.1.x (x<4) on EPEL? I guess not since
>> the change you mentioned is only in 1.1.4
>> I currently use 1.1.3 on EPEL 5.4
>
> 1.1.3 is generally ok still, it was mostly performance stuff that went into .4
> You could update glib manually and rebuild the 1.1.4 packages though...

I won't go down this path. So for EPEL, 1.1.3 is the last available
release, with no upgrade path; that's not very nice for production
systems. I am considering switching back to 1.0.9, which hopefully keeps
getting updated.

Cheers,
Pavlos



Re: [Pacemaker] Pacemaker-1.1.4, when?

2010-10-29 Thread Pavlos Parissis
On 29 October 2010 12:23, Andrew Beekhof  wrote:
> On Fri, Oct 29, 2010 at 11:58 AM, Pavlos Parissis
>  wrote:
>> On 29 October 2010 11:47, Andrew Beekhof  wrote:
>> [...snip..]
>>>>> There wont be unfortunately.
>>>>> Some of the changes we needed to make involved the use of
>>>>> g_hash_table_get_values() which only appeared in glib 2.14
>>>>> So EPEL5 is stuck on the 1.0 series.
>>>>
>>>> Does that mean I shouldn't use 1.1.x (x<4) on EPEL? I guess not since
>>>> the change you mentioned is only in 1.1.4
>>>> I currently use 1.1.3 on EPEL 5.4
>>>
>>> 1.1.3 is generally ok still, it was mostly performance stuff that went into .4
>>> You could update glib manually and rebuild the 1.1.4 packages though...
>>
>> I wont go down this path. So, for EPEL 1.1.3 is the last available
>> release without any upgrade paths,
>
> to be fair, there is an upgrade path, it just involves a version of
> glib2 that was released less than 4 years ago
>
>> that's not very nice for production
>> systems. I consider switching back to 1.0.9, which hopefully gets
>> updated.
>
> 1.0.10 is almost done
>
>
Initially I moved to 1.1.3 in order to see whether it solves bug #2500 (it
does not), and I stayed on 1.1.3, even though I am using the pacemaker-1.0
schema, because I wanted to use the latest/greatest and get regular
updates.

Since there is no realistic upgrade path to 1.1.4 on EPEL, I am wondering
if there is any benefit in staying on 1.1.3 compared to using 1.0.10.


Andrew, thanks for the clarifications, very much appreciated.
Pavlos



[Pacemaker] IP Power 9258HP with external/ippower9258

2010-10-30 Thread Pavlos Parissis
Hi,

Does anyone know if the fencing agent ippower9258 works with the IP Power
9258HP PDU?
The readme file of the fencing agent mentions the following:

 Especially "IP Power 9258 HP" uses a different http command interface

Doesn't that mean that it won't work with the 9258 HP?
The fact that Aviosys sells different types of the IP9258 makes it a bit
confusing which one someone should buy.

Any ideas?

Cheers,
Pavlos



Re: [Pacemaker] IP Power 9258HP with external/ippower9258

2010-10-30 Thread Pavlos Parissis
On 30 October 2010 16:03, Pavlos Parissis  wrote:
> Hi,
>
> Does anyone know if the fencing agent ippower9258 works with IP Power
> 9258HP PDU?
> The readme file of the fencing agent mentions the following
>
>  Especially "IP Power 9258 HP" uses a different http command interface
>
> Doesn't that mean that it wont with 9258 HP?
> The fact that Aviosys has different type of ip9258 makes a bit
> confusing on what someone should buy.
>
> Any ideas?

I was too fast in sending out the above mail.
Reading http://www.aviosys.com/downloads/manuals/power/9258hp_en.pdf
and http://www.aviosys.com/downloads/manuals/power/9258st_en.pdf
gave me the answer: external/ippower9258 does not work with the 9258 HP,
due to the different http command interface, exactly as the readme file of
the agent says.

Sorry for the noise.
Pavlos



Re: [Pacemaker] Ordering clones and primitives

2010-10-31 Thread Pavlos Parissis
On 30 October 2010 19:55, Lars Kellogg-Stedman  wrote:

> I have a two node cluster that hosts two virtual ips on the same network:
>
>  primitive proxy_0_ip ocf:heartbeat:IPaddr \
>params ip="10.10.10.20" cidr_netmask="255.255.255.0" nic="eth3"
>  primitive proxy_1_ip ocf:heartbeat:IPaddr \
>params ip="10.10.10.30" cidr_netmask="255.255.255.0" nic="eth3"
>
> After the ip address comes up, the system must establish a network
> route and a default route.  I'm having trouble defining the
> relationships between these services.  I started with this:
>
>  primitive public_net_route ocf:heartbeat:Route \
>params destination="10.10.10.0/24"
>   device="eth3" table="1"
>  primitive public_def_route ocf:heartbeat:Route \
>params destination="default" gateway="10.10.10.1"
>   device="eth3" table="1"
>
>  clone clone_public_def_route public_def_route
>  clone clone_public_net_route public_net_route
>

Why do you need/want to clone these 2 resources?
To me it would make more sense to have one group per IP and place the
resources inside it in the order you want; see the sketch below.
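
Something like this is what I have in mind (the Route primitive names are
made up, and you may need a distinct routing table per group so the two
sets of routes don't collide):

primitive net_route_0 ocf:heartbeat:Route \
    params destination="10.10.10.0/24" device="eth3" table="1"
primitive def_route_0 ocf:heartbeat:Route \
    params destination="default" gateway="10.10.10.1" device="eth3" table="1"
group grp_proxy_0 proxy_0_ip net_route_0 def_route_0

Resources in a group start in the listed order, so the routes only come up
after the IP is active on that node.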


> But having got this before, I don't understand how to estbalish the
> necessary ordering between the routes and the ip address resources.
> The clones can't come up on a host until one of the ip addresses are
> available on the host.  In other words, the cloned resources cannot be
> active on a host unless an ip address resource is also active on that
> host.
>
> I tried this:
>
>  order ip_0_before_routes inf: proxy_0_ip clone_public_net_route
>  order ip_1_before_routes inf: proxy_1_ip clone_public_net_route
>  order net_route_before_def_route \
>   inf: clone_public_net_route clone_public_def_route
>
> ...but the clone services in this case don't start unless both ips are
> started.  Shutting down either ip takes down *all* of the clone
> resources on both nodes.
>
> Is it possible to do what I want?  This seems like exactly the same
> relationship that would exist betwee, say, a cloned Apache instance
> and a set of ip address resources, but I can't find a good example.
>
>
I am not sure you can place order constraints like this on clones.
More experienced users will know better.

Cheers,
Pavlos


[Pacemaker] downgrading to pacemaker-1.0.9.1-1.15.el5

2010-11-01 Thread Pavlos Parissis
Hi,

I have been using 1.1.3 on CentOS and I decided to downgrade to
1.0.9.1-1.15.el5.

The procedure was the following.

Stop heartbeat on all cluster members.

Downgrade to 1.0.9 by running the following on all cluster members:
yum downgrade pacemaker-1.0.9.1-1.15.el5 pacemaker-libs-1.0.9.1-1.15.el5
pacemaker-debuginfo-1.0.9.1-1.15.el5

Starting heartbeat gave me the following, and crmd was stopped:
crmd: [10772]: debug: debug3: compare_version: 3.0.2 > 3.0.1 (3)
crmd: [10772]: ERROR: revision_check_callback: This build (1.0.9) does not
support the current resource configuration
crmd: [10772]: ERROR: revision_check_callback: We can support up to CRM
feature set 3.0.2 (current=3.0.1)
crmd: [10772]: ERROR: revision_check_callback: Shutting down the CRM

Why does crmd complain about the resource configuration?
Even though I was using 1.1.3, I had the pacemaker-1.0 schema:
validate-with="pacemaker-1.0" crm_feature_set="3.0.2"

Could the following be the root cause of the problem?
<nvpair name="dc-version" value="1.1.3-9c2342c0378140df9bed7d192f2b9ed157908007"/>

Any ideas?
Pavlos


Re: [Pacemaker] downgrading to pacemaker-1.0.9.1-1.15.el5

2010-11-01 Thread Pavlos Parissis
On 1 November 2010 09:19, Pavlos Parissis  wrote:

> Hi,
>
> I have been using 1.1.3 on CentOS and I decided to downgrade to
> 1.0.9.1-1.15.el5.
>
> The procedure was the following
> stop heartbeat on all cluster members
>
> downgrade to 1.0.9 doing the following on all cluster memebrs
> yum downgrade pacemaker-1.0.9.1-1.15.el5 pacemaker-libs-1.0.9.1-1.15.el5
> pacemaker-debuginfo-1.0.9.1-1.15.el5
>
> starting heartbeat gave me the following and crmd was stopped
> crmd: [10772]: debug: debug3: compare_version: 3.0.2 > 3.0.1 (3)
> crmd: [10772]: ERROR: revision_check_callback: This build (1.0.9) does not
> support the current resource configuration
> crmd: [10772]: ERROR: revision_check_callback: We can support up to CRM
> feature set 3.0.2 (current=3.0.1)
> crmd: [10772]: ERROR: revision_check_callback: Shutting down the CRM
>
> why does crm complain about the resource configuration?
> Even I was using 1.1.3, I had pacemaker-schema 1.0
> validate-with="pacemaker-1.0" crm_feature_set="3.0.2"
>
> Could the following be the root cause of the problem?
> <nvpair name="dc-version" value="1.1.3-9c2342c0378140df9bed7d192f2b9ed157908007"/>
>
> Any ideas?
> Pavlos
>
>
>
Yes, I have!
Solved by doing the following:

yum upgrade pacemaker pacemaker-libs
gave me a working crmd on 1.1.3

cibadmin --modify --crm_xml '<cib crm_feature_set="3.0.1"/>'
sets the feature set back to 3.0.1; I had to look at the code to realize
that this was the cause of the problem.
The log line "We can support up to CRM feature set 3.0.2 (current=3.0.1)"
is a bit confusing and makes you think the feature set version is not the
issue here.

heartbeat stop

yum downgrade pacemaker-1.0.9.1-1.15.el5 pacemaker-libs-1.0.9.1-1.15.el5
pacemaker-debuginfo-1.0.9.1-1.15.el5

heartbeat start

And everything is fine again!

Cheers,
Pavlos


Re: [Pacemaker] Stonith Device APC AP7900

2010-11-02 Thread Pavlos Parissis
On 1 November 2010 15:01, Rick Cone  wrote:

> Dejan,
>
> Below I had:
>
> primitive res_stonith stonith:apcmastersnmp \
> params ipaddr="192.1.1.109" port="161" community="sps" \
> op start interval="0" timeout="60s" \
> op monitor interval="60s" timeout="60s" \
>  op stop interval="0" timeout="60s"
> clone rc_res_stonith res_stonith \
> meta target-role="Started"
>
> And you commented:
>
> You can also use a single instance setup, i.e. without clones.
>
> What is this "single instance setup", and what would it look like in the
> crm
> configure?


You have one already: res_stonith is your single-instance setup.

> What are the pros/cons to this compared to the clone setup I have?
>

The decision between a cloned and a non-cloned stonith resource is mainly
driven by the ability of the fencing device to accept multiple connections
simultaneously.
If your fencing device doesn't allow that, then you can only use a
non-cloned resource.
The cloned resource will fence a node a bit faster when the stonith
resource is running on the node to be fenced, since there is no need to
move the resource to another node before fencing the node.
But the cloned resource has a slightly more complex configuration; cloning
looks easy, but I bet there is some complexity behind it.

I have experimented with both versions on the AP9606 using the rackpdu
agent, and in both cases it worked as expected. Now I am waiting for an
Aviosys 9258ST, and I will use the non-cloned version with it.

My 2 cents,
Pavlos


Re: [Pacemaker] Stonith Device APC AP7900

2010-11-02 Thread Pavlos Parissis
On 2 November 2010 11:04, Dejan Muhamedagic  wrote:

> Hi,
>
> On Tue, Nov 02, 2010 at 08:08:32AM +0100, Pavlos Parissis wrote:
> > On 1 November 2010 15:01, Rick Cone 
> wrote:
> >
> > > Dejan,
> > >
> > > Below I had:
> > >
> > > primitive res_stonith stonith:apcmastersnmp \
> > > params ipaddr="192.1.1.109" port="161" community="sps" \
> > > op start interval="0" timeout="60s" \
> > > op monitor interval="60s" timeout="60s" \
> > >  op stop interval="0" timeout="60s"
> > > clone rc_res_stonith res_stonith \
> > > meta target-role="Started"
> > >
> > > And you commented:
> > >
> > > You can also use a single instance setup, i.e. without clones.
> > >
> > > What is this "single instance setup", and what would it look like in
> the
> > > crm
> > > configure?
> >
> >
> > You have one already, res_stonith  is your single instance setup.
>
> Yes, just don't clone that resource.
>
> > What are the pros/cons to this compared to the clone setup I
> > > have?
> > >
> >
> > The decision of cloned or non-cloned stonith resource is mainly driven
> about
> > the ability of the fencing device to accept multiple connections
> > simultaneously.
> > If you fencing device doesn't allow that then you can only use a
> non-cloned
> > resource.
>
> Right.
>
>
Do you know under which conditions pacemaker initiates multiple connections
to a fencing device?

> The cloned resource will fence a node a bit faster in a case stonith
> > resource is running on the node to be fenced, there is no need to move
> the
> > resource to another node and then fence the node.
>
> Did you measure the time it takes to start the stonith resource?
>

The last time I tested it was with rackpdu, and it took 5 seconds for
pacemaker to move the resource and trigger the reboot event.


>
> > But, cloned resource is has a bit more complex configuration. cloning
> looks
> > easy but I bet there is some complexity behind it.
>
> There is, but in this simple case where clones are not in any
> relation to other resources, that shouldn't pose a problem.
>
> > I have experiment with both versions on AP9606 using the rackpdu agent
> and
> > in both cases worked as expected. Now I am expecting a Aviosys 9258ST and
> I
> > will use the non-cloned version.
>
> Thanks,
>
> Dejan
>
>


Re: [Pacemaker] PE ignores monitor failure of stonith:external/rackpdu

2010-11-02 Thread Pavlos Parissis
On 2 November 2010 11:22, Dejan Muhamedagic  wrote:

> Hi,
>
> On Fri, Oct 29, 2010 at 08:37:04AM +0200, Pavlos Parissis wrote:
> > Hi,
> >
> > I wanted to check what happens when the monitor of a fencing agent
> > fails, so I disconnected the PDU from the network, reduced the monitor
> > interval and put debug statements in the fencing script.
> >
> > Here are the debug statements in the status code:
> > status)
> >     if [ -z "$pduip" ]; then
> >         exit 1
> >     fi
> >     # log every monitor invocation, and log "failed" when the ping fails
> >     date >> /tmp/pdu.monitor
> >     if ping -w1 -c1 $pduip >/dev/null 2>&1; then
> >         exit 0
> >     else
> >         echo "failed" >> /tmp/pdu.monitor
> >         exit 1
> >     fi
> >     ;;
> >
> >
> > here is the debug output which states that monitor failed
> > [r...@node-03 tmp]# cat pdu.monitor
> > Fri Oct 29 08:29:20 CEST 2010
> > Fri Oct 29 08:31:05 CEST 2010
> > failed
> > Fri Oct 29 08:32:50 CEST 2010
> > failed
> >
> > but pacemaker thinks it is fine
> > [r...@node-03 tmp]# crm status|grep pdu
> >  pdu(stonith:external/rackpdu): Started node-03
> > [r...@node-03 tmp]#
> >
> >
> > and here is the resource
> > primitive pdu stonith:external/rackpdu \
> > params community="empisteftiko"
> > names_oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.4"
> > oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.3" hostlist="AUTO"
> > pduip="192.168.100.100" stonith-timeout="30" \
> > op monitor interval="1m" timeout="60s"
> >
> > Is it the expected behaviour?
>
> Definitely not. If you do the monitor action from the command
> line does that also return the unexpected exit code:
>

from the code I pasted you can see it returned 1.

>
> # stonith -t external/rackpdu community="empisteftiko"
> names_oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.4" ... -lS
>
> Which pacemaker release do you run? I couldn't reproduce this
> with a recent Pacemaker.
>

That was on 1.1.3, and now I run 1.0.9.
Do you want me to run the test on 1.0.9?
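
If it helps, the full form of that command-line test with my parameters would
be something like this (-S asks the device for its status):

stonith -t external/rackpdu hostlist="node-01,node-02,node-03" \
        pduip="192.168.100.100" community="empisteftiko" \
        names_oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.4" \
        oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.3" -S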


>
> Thanks,
>
> Dejan
>
> > Cheers,
> > Pavlos
> >
> > ___
> > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs:
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs:
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Stonith Device APC AP7900

2010-11-02 Thread Pavlos Parissis
On 2 November 2010 12:58, Dejan Muhamedagic  wrote:
[...snip...]

>
> > Do you know under which conditions pacemaker initiates multiple
> connections
> > to a fencing device?
>
> There are no specific conditions. It can happen by chance because
> individual clone instances run independently.


> > > > The cloned resource will fence a node a bit faster in the case where
> > > > the stonith resource is running on the node to be fenced; there is no
> > > > need to move the resource to another node and then fence the node.
> > >
> > > Did you measure the time it takes to start the stonith resource?
> > >
> >
> > The last time I tested it was with rackpdu and it took 5 secs for
> > pacemaker to move the resource and trigger the reboot event.
>
> So, the time difference is 5 seconds in your case, right?
>

No, this is the time it took to fence a node with the non-cloned version.
I don't remember exactly how many seconds it took with the cloned version,
but I do remember that it was faster.

When I get the new PDU (aviosys 9258) I will run the test again and get back
on this.

Cheers,
Pavlos
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] PE ignores monitor failure of stonith:external/rackpdu

2010-11-02 Thread Pavlos Parissis
On 2 November 2010 13:02, Dejan Muhamedagic  wrote:
[...snip...]

>
> > > Definitely not. If you do the monitor action from the command
> > > line does that also return the unexpected exit code:
> > >
> >
> > from the code I pasted you can see it returned 1.
>
> There is a difference. stonith-ng (stonithd) is a daemon that
> runs a perl script (fencing_legacy) which invokes stonith which
> then invokes the plugin. A problem can occur in any of these
> components. It's important to find out where.
>
> > > # stonith -t external/rackpdu community="empisteftiko"
> > > names_oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.4" ... -lS
> > >
> > > Which pacemaker release do you run? I couldn't reproduce this
> > > with a recent Pacemaker.
> > >
> >
> > That was on 1.1.3, and now I run 1.0.9.
> > Do you want me to run the test on 1.0.9?
>
> Yes, please. 1.0.9 is still running the old, and well tested,
> stonithd, so the result could be different.
>
>
I have the PDU off because it stopped working! As a result the
resource is stopped.
But I did the test, and I see that even though rackpdu returns 1 on status,
stonithd reports 256 (which looks like a raw wait(2) status, i.e. exit code 1
shifted left by 8 bits, rather than the plugin's exit code itself).

Here is a run of stonith; remember the PDU is off.


[r...@node-01 ~]# stonith -d -t external/rackpdu
hostlist="node-01,node-02,node-03" pduip="192.168.100.100"
community="empisteftiko" names_oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.4"  -l
** (process:8115): DEBUG: NewPILPluginUniv(0x8f690c8)
** (process:8115): DEBUG: PILS: Plugin path =
/usr/lib/stonith/plugins:/usr/lib/heartbeat/plugins
** (process:8115): DEBUG: NewPILInterfaceUniv(0x8f69768)
** (process:8115): DEBUG: NewPILPlugintype(0x8f69a28)
** (process:8115): DEBUG: NewPILPlugin(0x8f69a40)
** (process:8115): DEBUG: NewPILInterface(0x8f69b50)
** (process:8115): DEBUG:
NewPILInterface(0x8f69b50:InterfaceMgr/InterfaceMgr)*** user_data: 0x0
***
** (process:8115): DEBUG:
InterfaceManager_plugin_init(0x8f69b50/InterfaceMgr)
** (process:8115): DEBUG: Registering Implementation manager for Interface
type 'InterfaceMgr'
** (process:8115): DEBUG: PILS: Looking for InterfaceMgr/generic =>
[/usr/lib/stonith/plugins/InterfaceMgr/generic.so]
** (process:8115): DEBUG: Plugin file
/usr/lib/stonith/plugins/InterfaceMgr/generic.so does not exist
** (process:8115): DEBUG: PILS: Looking for InterfaceMgr/generic =>
[/usr/lib/heartbeat/plugins/InterfaceMgr/generic.so]
** (process:8115): DEBUG: Plugin path for InterfaceMgr/generic =>
[/usr/lib/heartbeat/plugins/InterfaceMgr/generic.so]
** (process:8115): DEBUG: PluginType InterfaceMgr already present
** (process:8115): DEBUG: Plugin InterfaceMgr/generic  init function:
InterfaceMgr_LTX_generic_pil_plugin_init
** (process:8115): DEBUG: NewPILPlugin(0x8f6a1d8)
** (process:8115): DEBUG: Plugin InterfaceMgr/generic loaded and
constructed.
** (process:8115): DEBUG: Calling init function in plugin
InterfaceMgr/generic.
** (process:8115): DEBUG: NewPILInterface(0x8f69cd8)
** (process:8115): DEBUG:
NewPILInterface(0x8f69cd8:InterfaceMgr/stonith2)*** user_data: 0x8f69b18
***
** (process:8115): DEBUG: Registering Implementation manager for Interface
type 'stonith2'
** (process:8115): DEBUG: IfIncrRefCount(1 + 1 )
** (process:8115): DEBUG: PluginIncrRefCount(0 + 1 )
** (process:8115): DEBUG: IfIncrRefCount(1 + 100 )
** (process:8115): DEBUG: PILS: Looking for stonith2/external =>
[/usr/lib/stonith/plugins/stonith2/external.so]
** (process:8115): DEBUG: Plugin path for stonith2/external =>
[/usr/lib/stonith/plugins/stonith2/external.so]
** (process:8115): DEBUG: Creating PluginType for stonith2
** (process:8115): DEBUG: NewPILPlugintype(0x8f6a398)
** (process:8115): DEBUG: Plugin stonith2/external  init function:
stonith2_LTX_external_pil_plugin_init
** (process:8115): DEBUG: NewPILPlugin(0x8f69d68)
** (process:8115): DEBUG: Plugin stonith2/external loaded and constructed.
** (process:8115): DEBUG: Calling init function in plugin stonith2/external.
** (process:8115): DEBUG: NewPILInterface(0x8f6a3b0)
** (process:8115): DEBUG: NewPILInterface(0x8f6a3b0:stonith2/external)***
user_data: 0x9e9fbc ***
** (process:8115): DEBUG: IfIncrRefCount(101 + 1 )
** (process:8115): DEBUG: PluginIncrRefCount(0 + 1 )
** (process:8115): DEBUG: external_set_config: called.
** (process:8115): DEBUG: external_get_confignames: called.
** (process:8115): DEBUG: external_run_cmd: Calling
'/usr/lib/stonith/plugins/external/rackpdu getconfignames'
** (process:8115): DEBUG: external_run_cmd:
'/usr/lib/stonith/plugins/external/rackpdu getconfignames' output: hostlist
pduip community

** (process:8115): DEBUG: external_get_confignames: 'rackpdu getconfignames'
returned 0
** (process:8115): DEBUG: plugin output: hostlist pduip community

** (process:8115): DEBUG: external_get_confignames: rackpdu configname
hostlist
** (process:8115): DEBUG: external_get_confignames: rackpdu configname pduip
** (process:8115): DEBUG: external_get_confignames: rackpdu configname
community
** (process:8115): DEBUG: external_status: called.
** (process:8115): DEBUG: external_run_cmd: C

Re: [Pacemaker] IP Power 9258HP with external/ippower9258

2010-11-02 Thread Pavlos Parissis
On 2 November 2010 13:13, Dejan Muhamedagic  wrote:

> Hi,
>
> On Sat, Oct 30, 2010 at 04:31:38PM +0200, Pavlos Parissis wrote:
> > On 30 October 2010 16:03, Pavlos Parissis 
> wrote:
> > > Hi,
> > >
> > > Does anyone know if the fencing agent ippower9258 works with IP Power
> > > 9258HP PDU?
> > > The readme file of the fencing agent mentions the following
> > >
> > >  Especially "IP Power 9258 HP" uses a different http command interface
> > >
> > > Doesn't that mean that it won't work with the 9258 HP?
> > > The fact that Aviosys has different types of ip9258 makes it a bit
> > > confusing as to which one someone should buy.
> > >
> > > Any ideas?
> >
> > I was too fast to send out the above mail.
> > Reading http://www.aviosys.com/downloads/manuals/power/9258hp_en.pdf
> > and http://www.aviosys.com/downloads/manuals/power/9258st_en.pdf
> > gave me the answer. The external/ippower9258 doesn't work with the 9258 HP
> > due to its different http command interface, as mentioned in the readme
> > file of the agent.
>
> There was a fairly good implementation of the ippower9258hp
> posted by Johan Verrept. Unfortunately, the discussion somehow
> petered out when the plugin was almost done. Can't recall the
> details anymore, they should be in the list archives, but I guess
> that we can revive it.
>
> Thanks,
>
>
I ordered the ST version of the 9258, and as a result I can't test that RA.

Cheers,
Pavlos
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] PE ignores monitor failure of stonith:external/rackpdu

2010-11-02 Thread Pavlos Parissis
On 2 November 2010 13:18, Dejan Muhamedagic  wrote:

> Hi,
>
> On Tue, Nov 02, 2010 at 01:09:02PM +0100, Pavlos Parissis wrote:
> > On 2 November 2010 13:02, Dejan Muhamedagic  wrote:
> > [...snip...]
> >
> > >
> > > > > Definitely not. If you do the monitor action from the command
> > > > > line does that also return the unexpected exit code:
> > > > >
> > > >
> > > > from the code I pasted you can see it returned 1.
> > >
> > > There is a difference. stonith-ng (stonithd) is a daemon that
> > > runs a perl script (fencing_legacy) which invokes stonith which
> > > then invokes the plugin. A problem can occur in any of these
> > > components. It's important to find out where.
> > >
> > > > > # stonith -t external/rackpdu community="empisteftiko"
> > > > > names_oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.4" ... -lS
> > > > >
> > > > > Which pacemaker release do you run? I couldn't reproduce this
> > > > > with a recent Pacemaker.
> > > > >
> > > >
> > > > that it was on 1.1.3 and now I run 1.0.9.
> > > > Do you want me to run the test on 1.0.9?
> > >
> > > Yes, please. 1.0.9 is still running the old, and well tested,
> > > stonithd, so the result could be different.
> > >
> > >
> > I have the PDU off because it stopped working! As a result the
> > resource is stopped.
> > But I did the test, and I see that even though rackpdu returns 1 on
> > status, stonithd reports 256
>
> Ah, I understand what's going on now. It's a bug in the interface
> to external plugins which was exposed by stonith-ng. It has been
> fixed in August. The fix is here (in hg.linux-ha.org/glue):
>
> changeset:   2427:b7df127fc09e
> user:Dejan Muhamedagic 
> date:Thu Aug 12 14:01:10 2010 +0200
> summary: High: stonith: external: interpret properly exit codes from
> external stonith plugins (bnc#630357)
>
> There hasn't been a glue release since then, but there should be
> one fairly soon. Note that this affects only Pacemaker 1.1.
>
> Thanks,
>
> Dejan
>
>
>
>
Does this bug have anything to do with the PE ignoring the monitor failure?
Pavlos
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] drbd on heartbeat links

2010-11-02 Thread Pavlos Parissis
Hi,

I am trying to figure out how I can resolve the following scenario

Facts
3 nodes
2 DRBD ms resources
2 group resources
by default drbd1/group1 runs on node-01 and drbd2/group2 runs on node-02
drbd1/group1 can only run on node-01 and node-03
drbd2/group2 can only run on node-02 and node-03
DRBD fencing_policy is resource-only [1]
2 heartbeat links, one of which is used for DRBD communication
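
In crm terms, the placement facts above translate roughly to the following
(resource names are illustrative, and the same pattern applies to
drbd2/group2):

ms ms_drbd1 p_drbd1 \
        meta master-max="1" clone-max="2" notify="true"
location loc_drbd1_nodes ms_drbd1 \
        rule -inf: #uname ne node-01 and #uname ne node-03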

Scenario
1) node-01 loses both heartbeat links
2) the DRBD monitor detects the absence of the drbd communication first
and does resource fencing by adding a location constraint which prevents
drbd1 from running on node-03
3) pacemaker fencing kicks in and kills node-01

Due to the location constraint created at step 2, drbd1/group1 cannot run
anywhere in the cluster.
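
For reference, the constraint the crm-fence-peer.sh handler adds looks
roughly like this in crm syntax, pinning the Master role to the last
known-good primary (ids are illustrative, assuming the master/slave
resource is called ms_drbd1):

location drbd-fence-by-handler-ms_drbd1 ms_drbd1 \
        rule $role="Master" -inf: #uname ne node-01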


Any ideas?

Cheers,
Pavlos




[1] It is not resource-and-stonith because, in the scenario where a
node has the role of primary for drbd1 and secondary for drbd2, that node
could be fenced, since the primary node of drbd2 would have
resource-and-stonith as its fencing policy.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] drbd on heartbeat links

2010-11-02 Thread Pavlos Parissis
On 2 November 2010 16:15, Dan Frincu  wrote:
> Hi,
>
> Pavlos Parissis wrote:
>>
>> Hi,
>>
>> I am trying to figure out how I can resolve the following scenario
>>
>> Facts
>> 3 nodes
>> 2 DRBD ms resources
>> 2 group resources
>> by default drbd1/group1 runs on node-01 and drbd2/group2 runs on node-02
>> drbd1/group1 can only run on node-01 and node-03
>> drbd2/group2 can only run on node-02 and node-03
>> DRBD fencing_policy is resource-only [1]
>> 2 heartbeat links, one of which is used for DRBD communication
>>
>> Scenario
>> 1) node-01 loses both heartbeat links
>> 2) the DRBD monitor detects the absence of the drbd communication first
>> and does resource fencing by adding a location constraint which prevents
>> drbd1 from running on node-03
>> 3) pacemaker fencing kicks in and kills node-01
>>
>> Due to the location constraint created at step 2, drbd1/group1 cannot run
>> anywhere in the cluster.
>>
>>
>
> I don't understand exactly what you mean by this. Resource-only fencing
> would create a -inf score on node1 when the node loses the drbd
> communication channel (the only one drbd uses),
Because node-01 is the primary at the moment of the failure,
resource fencing will create a -inf score for node-03.

> however you could still have
> heartbeat communication available via the secondary link, then you shouldn't
As I wrote, none of the heartbeat links is available.
After I sent the mail, I realized that node-03 will not see the
location constraint created by node-01 because there is no heartbeat
communication!
Thus I think my scenario has a flaw, since none of the heartbeat links
are available on node-01.
Resource fencing from DRBD will be triggered but without any effect,
and node-03 or node-02 will fence node-01, and node-03 will become
the primary for drbd1.

> fence the entire node, the resource-only fencing does that for you, the only
> thing you need to do is to add the drbd fence handlers in /etc/drbd.conf.
>       handlers {
>               fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>               after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>       }
>
> Is this what you meant?

No.
Dan, thanks for your mail.


Since there is a flaw in that scenario, let's define a similar one.

Status
node-01 is primary for drbd1 and group1 runs on it
node-02 is primary for drbd2 and group2 runs on it
node-03 is secondary for drbd1 and drbd2

2 heartbeat links, one of which is used for DRBD communication

Here is the scenario:
1) on node-01 the heartbeat link which also carries the DRBD communication is lost
2) node-01 does resource fencing and places a -inf score for drbd1 on node-03
3) on node-01 the second heartbeat link is lost
4) node-01 is fenced by one of the other cluster members
5) drbd1 can't run on node-03 due to the location constraint created at step 2

The problem here is that the location constraint will still be active even
after node-01 is fenced.

Any ideas?

Pavlos


drbd.conf
global {
  usage-count yes;
}
common {
  protocol C;

  syncer {
    csums-alg sha1;
    verify-alg sha1;
    rate 10M;
  }

  net {
    data-integrity-alg sha1;
    max-buffers 20480;
    max-epoch-size 16384;
  }

  disk {
    on-io-error detach;
    ### Only when DRBD is under cluster ###
    fencing resource-only;
    ### --- ###
  }

  startup {
    wfc-timeout 60;
    degr-wfc-timeout 30;
    outdated-wfc-timeout 15;
  }

  ### Only when DRBD is under cluster ###
  handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  ### --- ###
}

resource drbd_resource_01 {

  on node-01 {
    device    /dev/drbd1;
    disk      /dev/sdb1;
    address   10.10.10.129:7789;
    meta-disk internal;
  }
  on node-03 {
    device    /dev/drbd1;
    disk      /dev/sdb1;
    address   10.10.10.131:7789;
    meta-disk internal;
  }

  syncer {
    cpu-mask 2;
  }
}

resource drbd_resource_02 {

  on node-02 {
    device    /dev/drbd2;
    disk      /dev/sdb1;
    address   10.10.10.130:7790;
    meta-disk internal;
  }
  on node-03 {
    device    /dev/drbd2;
    disk      /dev/sdc1;
    address   10.10.10.131:7790;
    meta-disk internal;
  }

  syncer {
    cpu-mask 1;
  }
}

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] drbd on heartbeat links

2010-11-02 Thread Pavlos Parissis
On 2 November 2010 22:07, Pavlos Parissis  wrote:
> On 2 November 2010 16:15, Dan Frincu  wrote:
>> Hi,
>>
>> Pavlos Parissis wrote:
>>>
>>> Hi,
>>>
>>> I am trying to figure out how I can resolve the following scenario
>>>
>>> Facts
>>> 3 nodes
>>> 2 DRBD ms resources
>>> 2 group resources
>>> by default drbd1/group1 runs on node-01 and drbd2/group2 runs on node-02
>>> drbd1/group1 can only run on node-01 and node-03
>>> drbd2/group2 can only run on node-02 and node-03
>>> DRBD fencing_policy is resource-only [1]
>>> 2 heartbeat links, one of which is used for DRBD communication
>>>
>>> Scenario
>>> 1) node-01 loses both heartbeat links
>>> 2) the DRBD monitor detects the absence of the drbd communication first
>>> and does resource fencing by adding a location constraint which prevents
>>> drbd1 from running on node-03
>>> 3) pacemaker fencing kicks in and kills node-01
>>>
>>> Due to the location constraint created at step 2, drbd1/group1 cannot run
>>> anywhere in the cluster.
>>>
>>>
>>
>> I don't understand exactly what you mean by this. Resource-only fencing
>> would create a -inf score on node1 when the node loses the drbd
>> communication channel (the only one drbd uses),
> Because node-01 is the primary at the moment of the failure,
> resource fencing will create a -inf score for node-03.
>
>> however you could still have
>> heartbeat communication available via the secondary link, then you shouldn't
> As I wrote, none of the heartbeat links is available.
> After I sent the mail, I realized that node-03 will not see the
> location constraint created by node-01 because there is no heartbeat
> communication!
> Thus I think my scenario has a flaw, since none of the heartbeat links
> are available on node-01.
> Resource fencing from DRBD will be triggered but without any effect,
> and node-03 or node-02 will fence node-01, and node-03 will become
> the primary for drbd1.
>
>> fence the entire node, the resource-only fencing does that for you, the only
>> thing you need to do is to add the drbd fence handlers in /etc/drbd.conf.
>>       handlers {
>>               fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>>               after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>>       }
>>
>> Is this what you meant?
>
> No.
> Dan, thanks for your mail.
>
>
> Since there is a flaw in that scenario, let's define a similar one.
>
> Status
> node-01 is primary for drbd1 and group1 runs on it
> node-02 is primary for drbd2 and group2 runs on it
> node-03 is secondary for drbd1 and drbd2
>
> 2 heartbeat links, one of which is used for DRBD communication
>
> Here is the scenario:
> 1) on node-01 the heartbeat link which also carries the DRBD communication is lost
> 2) node-01 does resource fencing and places a -inf score for drbd1 on node-03
> 3) on node-01 the second heartbeat link is lost
> 4) node-01 is fenced by one of the other cluster members
> 5) drbd1 can't run on node-03 due to the location constraint created at step 2
>
> The problem here is that the location constraint will still be active even
> after node-01 is fenced.
>
> Any ideas?
>

I found this related thread,
http://www.gossamer-threads.com/lists/drbd/users/15380#15380

Wouldn't it be better if pacemaker/drbd did this instead? Manual actions
add delay to recovery.
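
For the record, the manual action described in that thread boils down to
deleting the handler's constraint once you are sure the surviving node has
the good data, for example (the constraint id is assumed):

crm configure delete drbd-fence-by-handler-drbd_resource_01

after which pacemaker is free to promote node-03.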

Cheers,
Pavlos

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] [DRBD-user] drbd on heartbeat links

2010-11-03 Thread Pavlos Parissis
On 2 November 2010 22:57, Lars Ellenberg  wrote:
> On Tue, Nov 02, 2010 at 10:07:17PM +0100, Pavlos Parissis wrote:
>> On 2 November 2010 16:15, Dan Frincu  wrote:
>> > Hi,
>> >
>> > Pavlos Parissis wrote:
>> >>
>> >> Hi,
>> >>
>> >> I am trying to figure out how I can resolve the following scenario
>> >>
>> >> Facts
>> >> 3 nodes
>> >> 2 DRBD ms resources
>> >> 2 group resources
>> >> by default drbd1/group1 runs on node-01 and drbd2/group2 runs on node-02
>> >> drbd1/group1 can only run on node-01 and node-03
>> >> drbd2/group2 can only run on node-02 and node-03
>> >> DRBD fencing_policy is resource-only [1]
>> >> 2 heartbeat links, one of which is used for DRBD communication
>> >>
>> >> Scenario
>> >> 1) node-01 loses both heartbeat links
>> >> 2) the DRBD monitor detects the absence of the drbd communication first
>> >> and does resource fencing by adding a location constraint which prevents
>> >> drbd1 from running on node-03
>> >> 3) pacemaker fencing kicks in and kills node-01
>> >>
>> >> Due to the location constraint created at step 2, drbd1/group1 cannot run
>> >> anywhere in the cluster.
>> >>
>> >>
>> >
>> > I don't understand exactly what you mean by this. Resource-only fencing
>> > would create a -inf score on node1 when the node loses the drbd
>> > communication channel (the only one drbd uses),
>> Because node-01 is the primary at the moment of the failure,
>> resource fencing will create a -inf score for node-03.
>>
>> > however you could still have
>> > heartbeat communication available via the secondary link, then you 
>> > shouldn't
>> As I wrote, none of the heartbeat links is available.
>> After I sent the mail, I realized that node-03 will not see the
>> location constraint created by node-01 because there is no heartbeat
>> communication!
>> Thus I think my scenario has a flaw, since none of the heartbeat links
>> are available on node-01.
>> Resource fencing from DRBD will be triggered but without any effect,
>> and node-03 or node-02 will fence node-01, and node-03 will become
>> the primary for drbd1.
>>
>> > fence the entire node, the resource-only fencing does that for you, the 
>> > only
>> > thing you need to do is to add the drbd fence handlers in /etc/drbd.conf.
>> >       handlers {
>> >               fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>> >               after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>> >       }
>> >
>> > Is this what you meant?
>>
>> No.
>> Dan, thanks for your mail.
>>
>>
>> Since there is a flaw in that scenario, let's define a similar one.
>>
>> Status
>> node-01 is primary for drbd1 and group1 runs on it
>> node-02 is primary for drbd2 and group2 runs on it
>> node-03 is secondary for drbd1 and drbd2
>>
>> 2 heartbeat links, one of which is used for DRBD communication
>>
>> Here is the scenario:
>> 1) on node-01 the heartbeat link which also carries the DRBD communication is lost
>> 2) node-01 does resource fencing and places a -inf score for drbd1 on node-03
>> 3) on node-01 the second heartbeat link is lost
>> 4) node-01 is fenced by one of the other cluster members
>> 5) drbd1 can't run on node-03 due to the location constraint created at step 2
>>
>> The problem here is that the location constraint will still be active even
>> after node-01 is fenced.
>
> Which is good, and intended behaviour, as it protects you from
> going online with stale data (changes between 1) and 4) would be lost).
>
>> Any ideas?
>
> The drbd setting "resource-and-stonith" simply tells DRBD
> that you have stonith configured in your cluster.
> It does not by itself trigger any stonith action.
>
> So if you have stonith enabled, and you want to protect against being
> shot while modifying data, you should say "resource-and-stonith".

I do have stonith enabled in my cluster, but I don't quite understand
what you wrote.
The resource-and-stonith setting will add the location constraint just as
fencing resource-only does, and it will also prevent a node with the role
of primary from being fenced, am I right?
So, what happens when the cluster sends a fence event?
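
For clarity, the setting under discussion is, in drbd.conf terms, just:

  disk {
    fencing resource-and-stonith;
  }

combined with the same fence-peer handler as in my config above.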

Initially, I thought this setting would trigger a fence event, and I
didn't use it because I wanted to avoid a node which has the role of
secondary for drbd1 and the role primar
