Re: [ClusterLabs] resource-stickiness

2015-08-27 Thread Rakovec Jost
Hi


it doesn't work as I expected. I changed the name to:

location loc-aapche-sles1 aapche role=Started 10: sles1


but after I manually move the resource via HAWK to the other node, it automatically adds this line:

location cli-prefer-aapche aapche role=Started inf: sles1


so now I have both lines:

location cli-prefer-aapche aapche role=Started inf: sles1
location loc-aapche-sles1 aapche role=Started 10: sles1


and resource-stickiness doesn't work: after node1 is fenced, the resource is 
moved back to node1 once node1 comes back, and this is what I don't like. I know 
that I can remove the line that was added by the cluster, but this is not the proper 
solution. Please tell me what is wrong. Thanks.  My config: 

node sles1
node sles2
primitive filesystem Filesystem \
params fstype=ext3 directory="/srv/www/vhosts" device="/dev/xvdd1" \
op start interval=0 timeout=60 \
op stop interval=0 timeout=60 \
op monitor interval=20 timeout=40
primitive myip IPaddr2 \
params ip=10.9.131.86 \
op start interval=0 timeout=20s \
op stop interval=0 timeout=20s \
op monitor interval=10s timeout=20s
primitive stonith_sbd stonith:external/sbd \
params pcmk_delay_max=30
primitive web apache \
params configfile="/etc/apache2/httpd.conf" \
op start interval=0 timeout=40s \
op stop interval=0 timeout=60s \
op monitor interval=10 timeout=20s
group aapche filesystem myip web \
meta target-role=Started is-managed=true resource-stickiness=1000
location cli-prefer-aapche aapche role=Started inf: sles1
location loc-aapche-sles1 aapche role=Started 10: sles1
property cib-bootstrap-options: \
stonith-enabled=true \
no-quorum-policy=ignore \
placement-strategy=balanced \
expected-quorum-votes=2 \
dc-version=1.1.12-f47ea56 \
cluster-infrastructure="classic openais (with plugin)" \
last-lrm-refresh=1440502955 \
stonith-timeout=40s
rsc_defaults rsc-options: \
resource-stickiness=1000 \
migration-threshold=3
op_defaults op-options: \
timeout=600 \
record-pending=true


BR

Jost




From: Andrew Beekhof 
Sent: Thursday, August 27, 2015 12:20 AM
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] resource-stickiness

> On 26 Aug 2015, at 10:09 pm, Rakovec Jost  wrote:
>
> Sorry  one typo: problem is the same
>
>
> location cli-prefer-aapche aapche role=Started 10: sles2

Change the name of your constraint.
The 'cli-prefer-' prefix is reserved for "temporary" constraints created by the 
command line tools (which therefore feel entitled to delete them as necessary).

>
> to:
>
> location cli-prefer-aapche aapche role=Started inf: sles2
>
>
> It keeps changing to infinity.
>
>
>
> my configuration is:
>
> node sles1
> node sles2
> primitive filesystem Filesystem \
>params fstype=ext3 directory="/srv/www/vhosts" device="/dev/xvdd1" \
>op start interval=0 timeout=60 \
>op stop interval=0 timeout=60 \
>op monitor interval=20 timeout=40
> primitive myip IPaddr2 \
>params ip=x.x.x.x \
>op start interval=0 timeout=20s \
>op stop interval=0 timeout=20s \
>op monitor interval=10s timeout=20s
> primitive stonith_sbd stonith:external/sbd \
>params pcmk_delay_max=30
> primitive web apache \
>params configfile="/etc/apache2/httpd.conf" \
>op start interval=0 timeout=40s \
>op stop interval=0 timeout=60s \
>op monitor interval=10 timeout=20s
> group aapche filesystem myip web \
>meta target-role=Started is-managed=true resource-stickiness=1000
> location cli-prefer-aapche aapche role=Started 10: sles2
> property cib-bootstrap-options: \
>stonith-enabled=true \
>no-quorum-policy=ignore \
>placement-strategy=balanced \
>expected-quorum-votes=2 \
>dc-version=1.1.12-f47ea56 \
>cluster-infrastructure="classic openais (with plugin)" \
>last-lrm-refresh=1440502955 \
>stonith-timeout=40s
> rsc_defaults rsc-options: \
>resource-stickiness=1000 \
>migration-threshold=3
> op_defaults op-options: \
>timeout=600 \
>record-pending=true
>
>
>
> and after migration:
>
>
> node sles1
> node sles2
> primitive filesystem Filesystem \
>params fstype=ext3 directory="/srv/www/vhosts" device="/dev/xvdd1" \
>op start interval=0 timeout=60 \
>op stop interval=0 timeout=60 \
>op monitor interval=20 timeout=40
> primitive myip IPaddr2 \
>params ip=10.9.131.86 \
>op start interval=0 timeout=20s \
>op stop interval=0 timeout=20s \
>op monitor interval=10s timeout=20s
> primitive stonith_sbd stonith:external/sbd \
>params pcmk_delay_max=30
> primitive web apache \
>params configfile="/etc/apache2/httpd.conf" \
>op start interval=0 time

Re: [ClusterLabs] CRM managing ADSL connection; failure not handled

2015-08-27 Thread Tom Yates

On Mon, 24 Aug 2015, Andrei Borzenkov wrote:


24.08.2015 13:32, Tom Yates wrote:

 if i understand you aright, my problem is that the stop script didn't
 return a 0 (OK) exit status, so CRM didn't know where to go.  is the
 exit status of the stop script how CRM determines the status of the stop
 operation?


correct


 does CRM also use the output of "/etc/init.d/script status" to determine
 continuing successful operation?


It definitely does not use the *output* of the script - only the return code. If the 
question is whether it probes the resource in addition to checking the stop exit 
code - I do not think so (I know it does this in some cases for systemd 
resources).


i just thought i'd come back and follow up.  in testing this morning, i 
can confirm that the "pppoe-stop" command returns status 1 if pppd isn't 
running.  that makes a standard init.d script, which passes on the return 
code of the stop command, unhelpful to CRM.


i changed the script so that on stop, having run pppoe-stop, it checks for 
the existence of a working ppp0 interface, and returns 0 iff there is 
none.
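
a rough sketch of that kind of stop logic (not my exact script; checking the 
interface with "ip link show" is just one way to do it):

    stop() {
        pppoe-stop
        # report success only if no ppp0 interface remains, regardless of
        # what pppoe-stop itself returned
        if ip link show ppp0 >/dev/null 2>&1; then
            return 1
        fi
        return 0
    }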


If resource was previously active and stop was attempted as cleanup after 
resource failure - yes, it should attempt to start it again.


that is now what happens.  it seems to try three times to bring up pppd, 
then kicks the service over to the other node.


in the case of extended outages (ie, the ISP goes away for more than about 
10 minutes), where both nodes have time to fail, we end up back in the bad 
old state (service failed on both nodes):


[root@positron ~]# crm status
[...]
Online: [ electron positron ]

 Resource Group: BothIPs
 InternalIP (ocf::heartbeat:IPaddr):Started electron
 ExternalIP (lsb:hb-adsl-helper):   Stopped

Failed actions:
ExternalIP_monitor_6 (node=positron, call=15, rc=7, status=complete): 
not running
ExternalIP_start_0 (node=positron, call=17, rc=-2, status=Timed Out): 
unknown exec error
ExternalIP_start_0 (node=electron, call=6, rc=-2, status=Timed Out): 
unknown exec error

is there any way to configure CRM to keep kicking the service between the 
two nodes forever (ie, try three times on positron, kick service group to 
electron, try three times on electron, kick back to positron, lather rinse 
repeat...)?


for a service like DSL, which can go away for extended periods through no 
local fault then suddenly and with no announcement come back, this would 
be most useful behaviour.


thanks to all for help with this.  thanks also to those who have suggested 
i rewrite this as an OCF agent (especially to ken gaillot who was kind 
enough to point me to documentation); i will look at that if time permits.



--

  Tom Yates  -  http://www.teaparty.net


Re: [ClusterLabs] resource-stickiness

2015-08-27 Thread Ken Gaillot
On 08/27/2015 02:42 AM, Rakovec Jost wrote:
> Hi
> 
> 
> it doesn't work as I expected. I changed the name to:
> 
> location loc-aapche-sles1 aapche role=Started 10: sles1
> 
> 
> but after I manually move the resource via HAWK to the other node, it automatically adds this line:
> 
> location cli-prefer-aapche aapche role=Started inf: sles1
> 
> 
> so now I have both lines:
> 
> location cli-prefer-aapche aapche role=Started inf: sles1
> location loc-aapche-sles1 aapche role=Started 10: sles1

When you manually move a resource using a command-line tool, those tools
accomplish the move by adding a constraint, like the one you see added
above.

Such tools generally provide another option to clear any constraints
they added, which you can manually run after you are satisfied with the
state of things. Until you do so, the added constraint will remain, and
will affect resource placement.

> 
> and resource-stickiness doesn't work: after node1 is fenced, the resource is 
> moved back to node1 once node1 comes back, and this is what I don't like. I 
> know that I can remove the line that was added by the cluster, but this is not the 
> proper solution. Please tell me what is wrong. Thanks.  My config: 

Resource placement depends on many factors. "Scores" affect the outcome;
stickiness has a score, and each constraint has a score, and the active
node with the highest score wins.

In your config, resource-stickiness has a score of 1000, but
cli-prefer-aapche has a score of "inf" (infinity), so sles1 wins when it
comes back online (infinity > 1000). By contrast, loc-aapche-sles1 has a
score of 10, so by itself, it would not cause the resource to move back
(10 < 1000).
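
If you want to see the scores the cluster is actually using, something like
this should display them from the live CIB:

    crm_simulate -sL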

To achieve what you want, clear the temporary constraint added by hawk,
before sles1 comes back.
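
For example (assuming crmsh is available, as on SLES), either of these should
remove the temporary constraint:

    crm resource unmigrate aapche
    # or delete the constraint directly by its id:
    crm configure delete cli-prefer-aapche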

> node sles1
> node sles2
> primitive filesystem Filesystem \
> params fstype=ext3 directory="/srv/www/vhosts" device="/dev/xvdd1" \
> op start interval=0 timeout=60 \
> op stop interval=0 timeout=60 \
> op monitor interval=20 timeout=40
> primitive myip IPaddr2 \
> params ip=10.9.131.86 \
> op start interval=0 timeout=20s \
> op stop interval=0 timeout=20s \
> op monitor interval=10s timeout=20s
> primitive stonith_sbd stonith:external/sbd \
> params pcmk_delay_max=30
> primitive web apache \
> params configfile="/etc/apache2/httpd.conf" \
> op start interval=0 timeout=40s \
> op stop interval=0 timeout=60s \
> op monitor interval=10 timeout=20s
> group aapche filesystem myip web \
> meta target-role=Started is-managed=true resource-stickiness=1000
> location cli-prefer-aapche aapche role=Started inf: sles1
> location loc-aapche-sles1 aapche role=Started 10: sles1
> property cib-bootstrap-options: \
> stonith-enabled=true \
> no-quorum-policy=ignore \
> placement-strategy=balanced \
> expected-quorum-votes=2 \
> dc-version=1.1.12-f47ea56 \
> cluster-infrastructure="classic openais (with plugin)" \
> last-lrm-refresh=1440502955 \
> stonith-timeout=40s
> rsc_defaults rsc-options: \
> resource-stickiness=1000 \
> migration-threshold=3
> op_defaults op-options: \
> timeout=600 \
> record-pending=true
> 
> 
> BR
> 
> Jost
> 
> 
> 
> 
> From: Andrew Beekhof 
> Sent: Thursday, August 27, 2015 12:20 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] resource-stickiness
> 
>> On 26 Aug 2015, at 10:09 pm, Rakovec Jost  wrote:
>>
>> Sorry  one typo: problem is the same
>>
>>
>> location cli-prefer-aapche aapche role=Started 10: sles2
> 
> Change the name of your constraint.
> The 'cli-prefer-' prefix is reserved for "temporary" constraints created by 
> the command line tools (which therefore feel entitled to delete them as 
> necessary).
> 
>>
>> to:
>>
>> location cli-prefer-aapche aapche role=Started inf: sles2
>>
>>
>> It keeps changing to infinity.
>>
>>
>>
>> my configuration is:
>>
>> node sles1
>> node sles2
>> primitive filesystem Filesystem \
>>params fstype=ext3 directory="/srv/www/vhosts" device="/dev/xvdd1" \
>>op start interval=0 timeout=60 \
>>op stop interval=0 timeout=60 \
>>op monitor interval=20 timeout=40
>> primitive myip IPaddr2 \
>>params ip=x.x.x.x \
>>op start interval=0 timeout=20s \
>>op stop interval=0 timeout=20s \
>>op monitor interval=10s timeout=20s
>> primitive stonith_sbd stonith:external/sbd \
>>params pcmk_delay_max=30
>> primitive web apache \
>>params configfile="/etc/apache2/httpd.conf" \
>>op start interval=0 timeout=40s \
>>op stop interval=0 timeout=60s \
>>op monitor interval=10 timeout=20s
>> group aapche filesystem myip web \
>>meta target-role=Started is-managed=true resource-stickiness=1000
>> location cli-prefer-aapche aapche role=Started 10:

Re: [ClusterLabs] CRM managing ADSL connection; failure not handled

2015-08-27 Thread Ken Gaillot
On 08/27/2015 03:04 AM, Tom Yates wrote:
> On Mon, 24 Aug 2015, Andrei Borzenkov wrote:
> 
>> 24.08.2015 13:32, Tom Yates wrote:
>>>  if i understand you aright, my problem is that the stop script didn't
>>>  return a 0 (OK) exit status, so CRM didn't know where to go.  is the
>>>  exit status of the stop script how CRM determines the status of the
>>> stop
>>>  operation?
>>
>> correct
>>
>>>  does CRM also use the output of "/etc/init.d/script status" to
>>> determine
>>>  continuing successful operation?
>>
>> It definitely does not use *output* of script - only return code. If
>> the question is whether it probes resource additionally to checking
>> stop exit code - I do not think so (I know it does it in some cases
>> for systemd resources).
> 
> i just thought i'd come back and follow-up.  in testing this morning, i
> can confirm that the "pppoe-stop" command returns status 1 if pppd isn't
> running.  that makes a standard init.d script, which passes on the
> return code of the stop command, unhelpful to CRM.
> 
> i changed the script so that on stop, having run pppoe-stop, it checks
> for the existence of a working ppp0 interface, and returns 0 iff there
> is none.

Nice

>> If resource was previously active and stop was attempted as cleanup
>> after resource failure - yes, it should attempt to start it again.
> 
> that is now what happens.  it seems to try three times to bring up pppd,
> then kicks the service over to the other node.
> 
> in the case of extended outages (ie, the ISP goes away for more than
> about 10 minutes), where both nodes have time to fail, we end up back in
> the bad old state (service failed on both nodes):
> 
> [root@positron ~]# crm status
> [...]
> Online: [ electron positron ]
> 
>  Resource Group: BothIPs
>  InternalIP (ocf::heartbeat:IPaddr):Started electron
>  ExternalIP (lsb:hb-adsl-helper):   Stopped
> 
> Failed actions:
> ExternalIP_monitor_6 (node=positron, call=15, rc=7,
> status=complete): not running
> ExternalIP_start_0 (node=positron, call=17, rc=-2, status=Timed
> Out): unknown exec error
> ExternalIP_start_0 (node=electron, call=6, rc=-2, status=Timed Out):
> unknown exec error
> 
> is there any way to configure CRM to keep kicking the service between
> the two nodes forever (ie, try three times on positron, kick service
> group to electron, try three times on electron, kick back to positron,
> lather rinse repeat...)?
> 
> for a service like DSL, which can go away for extended periods through
> no local fault then suddenly and with no announcement come back, this
> would be most useful behaviour.

Yes, see migration-threshold and failure-timeout.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#s-resource-options
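
For example (a sketch only; the 10min value is arbitrary, and ExternalIP is
the resource id from your status output):

    crm_resource --resource ExternalIP --meta \
        --set-parameter failure-timeout --parameter-value 10min

With a failure-timeout set, old failures eventually expire, so a node that
previously hit migration-threshold becomes eligible to run the resource
again instead of the cluster giving up once both nodes have failed it.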

> thanks to all for help with this.  thanks also to those who have
> suggested i rewrite this as an OCF agent (especially to ken gaillot who
> was kind enough to point me to documentation); i will look at that if
> time permits.



[ClusterLabs] Resources within a Group

2015-08-27 Thread Jorge Fábregas
Hi,

I'm on SLES 11 SP4 (Pacemaker 1.1.12) and still learning all this :)
I'm wondering if there's a way to control the resource startup behaviour
within a group?

For example, I have an LVM resource (to activate a VG) and the next one:
the Filesystem resource (to mount it).  If the VG activation fails I see
errors afterwards trying to mount the filesystem.  Is there something
like "if the first resource fails, stop further processing"? (sort of
like the way one can control the stacking of PAM modules).

Thanks,
Jorge



[ClusterLabs] wait_for_all in SLES 11

2015-08-27 Thread Jorge Fábregas
Hi,

Is there a way to recreate the newest Corosync option, wait_for_all, in
SLES 11?  Does anyone (anyone from SUSE?) know if there are any plans to
backport this into SLES 11?

I can't say I miss this option (since I'm just starting out in HA), but
after evaluating many possible situations with fence loops, it's one of
the greatest ideas I've seen.  Disabling autostart of openais doesn't
quite achieve what wait_for_all does.

Perhaps I might create a wrapper script around /etc/init.d/openais.

Thanks,
Jorge



Re: [ClusterLabs] Resources within a Group

2015-08-27 Thread Andrew Beekhof

> On 28 Aug 2015, at 7:54 am, Jorge Fábregas  wrote:
> 
> Hi,
> 
> I'm on SLES 11 SP4 (Pacemaker 1.1.12) and still learning all this :)
> I'm wondering if there's a way to control the resource startup behaviour
> within a group?
> 
> For example, I have an LVM resource (to activate a VG) and the next one:
> the Filesystem resource (to mount it).  If the VG activation fails I see
> errors afterwards trying to mount the filesystem.  Is there something
> like "if the first resource fails, stop further processing"? (sort of
> like the way one can control the stacking of PAM modules).

That's how it normally works.
Perhaps the VG agent is swallowing the error instead of reporting it to the 
cluster?
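
One way to check (a sketch; substitute your real VG name for "myvg") is to run
the agent by hand with ocf-tester and see what it reports when the VG is
unavailable:

    ocf-tester -n test-lvm -o volgrpname=myvg \
        /usr/lib/ocf/resource.d/heartbeat/LVM

If that claims success while the VG cannot be activated, the agent (or its
parameters) is the problem rather than the group logic.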

> 
> Thanks,
> Jorge
> 




Re: [ClusterLabs] Antw: Re: fence_sanlock and pacemaker

2015-08-27 Thread Andrew Beekhof

> On 27 Aug 2015, at 4:20 pm, Ulrich Windl  
> wrote:
> 
 "Laurent B."  schrieb am 27.08.2015 um 08:06 in
> Nachricht
> <55dea8cc.3080...@qmail.re>:
>> Hello,
>> 
>>> You’d have to build it yourself, but sbd could be an option
>>> 
>> 
>> do you have any clue on how to install it on redhat (6.5)? I installed
>> the cluster glue package and the sbd package (provided by openSUSE) but
>> now I'm stuck. The stonith resource creation gives me an error saying
>> that the sbd resource was not found.
> 
> sbd has to be started before the cluster software. SUSE does something like:

on RHEL7 the sbd systemd unit file arranges for it to be started/stopped 
whenever corosync is (started/stopped)

> SBD_CONFIG=/etc/sysconfig/sbd
> SBD_BIN="/usr/sbin/sbd"
> if [ -f $SBD_CONFIG ]; then
>     . $SBD_CONFIG
> fi
> 
> [ -x "$exec" ] || exit 0
> 
> SBD_DEVS=${SBD_DEVICE%;}
> SBD_DEVICE=${SBD_DEVS//;/ -d }
> 
> : ${SBD_DELAY_START:="no"}
> 
> StartSBD() {
>     test -x $SBD_BIN || return
>     if [ -n "$SBD_DEVICE" ]; then
>         if ! pidofproc $SBD_BIN >/dev/null 2>&1 ; then
>             echo -n "Starting SBD - "
>             if ! $SBD_BIN -d $SBD_DEVICE -D $SBD_OPTS watch ; then
>                 echo "SBD failed to start; aborting."
>                 exit 1
>             fi
>             if env_is_true ${SBD_DELAY_START} ; then
>                 sleep $(sbd -d "$SBD_DEVICE" dump | grep -m 1 msgwait | awk '{print $4}') 2>/dev/null
>             fi
>         fi
>     fi
> }
> 
> StopSBD() {
>     test -x $SBD_BIN || return
>     if [ -n "$SBD_DEVICE" ]; then
>         echo -n "Stopping SBD - "
>         if ! $SBD_BIN -d $SBD_DEVICE -D $SBD_OPTS message LOCAL exit ; then
>             echo "SBD failed to stop; aborting."
>             exit 1
>         fi
>     fi
>     while pidofproc $SBD_BIN >/dev/null 2>&1 ; do
>         sleep 1
>     done
>     echo -n "done "
> }
> 
>> 
>> Thank you,
>> 
>> Laurent
>> 
>> 
> 
> 
> 
> 




Re: [ClusterLabs] Antw: Re: resource-stickiness

2015-08-27 Thread Andrew Beekhof

> On 27 Aug 2015, at 4:12 pm, Ulrich Windl  
> wrote:
> 
> Andrew Beekhof  wrote on 27.08.2015 at 00:20 in
> message
> :
> 
>>> On 26 Aug 2015, at 10:09 pm, Rakovec Jost  wrote:
>>> 
>>> Sorry  one typo: problem is the same
>>> 
>>> 
>>> location cli-prefer-aapche aapche role=Started 10: sles2
>> 
>> Change the name of your constraint.
>> The 'cli-prefer-' prefix is reserved for "temporary" constraints created by 
>> the command line tools (which therefore feel entitled to delete them as 
>> necessary).
> 
> In which ways is "cli-prefer-" handled specially, if I may ask…

we delete them when you use the cli tools to move the resource somewhere else 
(crm_resource --ban, --move, --clear)
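
e.g. something along these lines:

    crm_resource --resource aapche --move --node sles2   # creates cli-prefer-aapche
    crm_resource --resource aapche --clear                # removes it again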

> 
>> 
>>> 
>>> to:
>>> 
>>> location cli-prefer-aapche aapche role=Started inf: sles2 
>>> 
>>> 
>>> It keeps changing to infinity. 
>>> 
>>> 
>>> 
>>> my configuration is:
>>> 
>>> node sles1 
>>> node sles2 
>>> primitive filesystem Filesystem \ 
>>>   params fstype=ext3 directory="/srv/www/vhosts" device="/dev/xvdd1" \
> 
>>>   op start interval=0 timeout=60 \ 
>>>   op stop interval=0 timeout=60 \ 
>>>   op monitor interval=20 timeout=40 
>>> primitive myip IPaddr2 \ 
>>>   params ip=x.x.x.x \ 
>>>   op start interval=0 timeout=20s \ 
>>>   op stop interval=0 timeout=20s \ 
>>>   op monitor interval=10s timeout=20s 
>>> primitive stonith_sbd stonith:external/sbd \ 
>>>   params pcmk_delay_max=30 
>>> primitive web apache \ 
>>>   params configfile="/etc/apache2/httpd.conf" \ 
>>>   op start interval=0 timeout=40s \ 
>>>   op stop interval=0 timeout=60s \ 
>>>   op monitor interval=10 timeout=20s 
>>> group aapche filesystem myip web \ 
>>>   meta target-role=Started is-managed=true resource-stickiness=1000 
>>> location cli-prefer-aapche aapche role=Started 10: sles2 
>>> property cib-bootstrap-options: \ 
>>>   stonith-enabled=true \ 
>>>   no-quorum-policy=ignore \ 
>>>   placement-strategy=balanced \ 
>>>   expected-quorum-votes=2 \ 
>>>   dc-version=1.1.12-f47ea56 \ 
>>>   cluster-infrastructure="classic openais (with plugin)" \ 
>>>   last-lrm-refresh=1440502955 \ 
>>>   stonith-timeout=40s 
>>> rsc_defaults rsc-options: \ 
>>>   resource-stickiness=1000 \ 
>>>   migration-threshold=3 
>>> op_defaults op-options: \ 
>>>   timeout=600 \ 
>>>   record-pending=true 
>>> 
>>> 
>>> 
>>> and after migration:
>>> 
>>> 
>>> node sles1 
>>> node sles2 
>>> primitive filesystem Filesystem \ 
>>>   params fstype=ext3 directory="/srv/www/vhosts" device="/dev/xvdd1" \
> 
>>>   op start interval=0 timeout=60 \ 
>>>   op stop interval=0 timeout=60 \ 
>>>   op monitor interval=20 timeout=40 
>>> primitive myip IPaddr2 \ 
>>>   params ip=10.9.131.86 \ 
>>>   op start interval=0 timeout=20s \ 
>>>   op stop interval=0 timeout=20s \ 
>>>   op monitor interval=10s timeout=20s 
>>> primitive stonith_sbd stonith:external/sbd \ 
>>>   params pcmk_delay_max=30 
>>> primitive web apache \ 
>>>   params configfile="/etc/apache2/httpd.conf" \ 
>>>   op start interval=0 timeout=40s \ 
>>>   op stop interval=0 timeout=60s \ 
>>>   op monitor interval=10 timeout=20s 
>>> group aapche filesystem myip web \ 
>>>   meta target-role=Started is-managed=true resource-stickiness=1000 
>>> location cli-prefer-aapche aapche role=Started inf: sles2 
>>> property cib-bootstrap-options: \ 
>>>   stonith-enabled=true \ 
>>>   no-quorum-policy=ignore \ 
>>>   placement-strategy=balanced \ 
>>>   expected-quorum-votes=2 \ 
>>>   dc-version=1.1.12-f47ea56 \ 
>>>   cluster-infrastructure="classic openais (with plugin)" \ 
>>>   last-lrm-refresh=1440502955 \ 
>>>   stonith-timeout=40s 
>>> rsc_defaults rsc-options: \ 
>>>   resource-stickiness=1000 \ 
>>>   migration-threshold=3 
>>> op_defaults op-options: \ 
>>>   timeout=600 \ 
>>>   record-pending=true
>>> 
>>> 
>>> From: Rakovec Jost
>>> Sent: Wednesday, August 26, 2015 1:33 PM
>>> To: users@clusterlabs.org 
>>> Subject: resource-stickiness
>>> 
>>> Hi list,
>>> 
>>> 
>>> I have configured a simple cluster on sles 11 sp4 and have a problem with 
>>> "auto_failover off". The problem is that whenever I migrate the resource group 
>>> via HAWK my configuration changes from:
>>> 
>>> location cli-prefer-aapche aapche role=Started 10: sles2
>>> 
>>> to:
>>> 
>>> location cli-ban-aapche-on-sles1 aapche role=Started -inf: sles1
>>> 
>>> 
>>> It keeps changing to inf. 
>>> 
>>> 
>>> and then after fencing the node, the resource moves back to the original node, 
>>> which I don't want. How can I avoid this situation?
>>> 
>>> my configuration is:
>>> 
>>> node sles1 
>>> node sles2 
>>> primitive filesystem Filesystem \ 
>>>   params fstype=ext3 directory="/srv/www/vhosts" device="/dev/xvdd1" \
> 
>>>   op start interval=0 timeout=60 \ 
>>>   op stop interval=0 timeout=60 \ 
>>>   op monitor interval=20 tim

Re: [ClusterLabs] wait_for_all in SLES 11

2015-08-27 Thread Digimer
Not a SUSE user, so I'm not familiar with what is shipped with SLES 11,
but if it's corosync v2.x, you already have it.

If it's not corosync v2, then you might be interested in how I solved
this problem in RHEL 6 (corosync v1 + cman + rgmanager);
'safe_anvil_start'
(https://github.com/digimer/striker/blob/master/tools/safe_anvil_start).
It's perl, but if you're OK with that, it should be fairly easy to port.

Basically, it tries to reach the peer on boot. If it can, it does some
other sanity checks (it expects drbd, cman and rgmanager, hence the need
to port most likely). So long as it can reach the peer, it will start
the cluster. If it can't, it will just sit there. So it's the same idea
as wait_for_all.
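
Stripped of the striker-specific checks, the core idea is roughly this (a
sketch only; PEER is a placeholder for the other node's address):

    #!/bin/sh
    PEER=192.168.1.2
    # wait until the peer is reachable, same effect as wait_for_all
    while ! ping -c1 -W1 "$PEER" >/dev/null 2>&1; do
        sleep 5
    done
    exec /etc/init.d/openais start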

It's triggered by an rc3.d script that it will create/remove when called
with --enable/--disable. It expects a file called
/etc/striker/striker.conf and looks for
"tools::safe_anvil_start::enabled = [0|1]", and you can use it with a
skeleton file with just that value in it.

If you're interested in this and if you have any trouble, I'll be happy
to help you adapt it. With luck though, you'll already have corosync v2
and it'll be moot.

Cheers

On 27/08/15 08:21 PM, Jorge Fábregas wrote:
> Hi,
> 
> Is there a way to recreate the newest Corosync option, wait_for_all, in
> SLES 11?  Does anyone (anyone from SUSE?) know if there are any plans to
> backport this into SLES 11?
> 
> I can't say I miss this option (since I'm just starting out in HA), but
> after evaluating many possible situations with fence loops, it's one of
> the greatest ideas I've seen.  Disabling autostart of openais doesn't
> quite achieve what wait_for_all does.
> 
> Perhaps I might create a wrapper script around /etc/init.d/openais.
> 
> Thanks,
> Jorge
> 


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



Re: [ClusterLabs] "0 Nodes configured" in crm_mon

2015-08-27 Thread Andrew Beekhof

> On 25 Aug 2015, at 1:45 am, Stanislav Kopp  wrote:
> 
> Hi all,
> 
> I'm trying to run a corosync2 + pacemaker setup on Debian Jessie (only
> for testing purposes); I've successfully compiled all components using
> this guide: http://clusterlabs.org/wiki/Compiling_on_Debian
> 
> Unfortunately, if I run "crm_mon" I don't see any nodes.
> 
> ###
> Last updated: Mon Aug 24 17:36:00 2015
> Last change: Mon Aug 24 17:17:42 2015
> Current DC: NONE
> 0 Nodes configured
> 0 Resources configured
> 
> 
> I don't see any errors in corosync log either: http://pastebin.com/bJX66B9e

really?

Aug 24 17:16:10 [1723] pm1   crmd:error: cluster_connect_quorum:
Corosync quorum is not configured

Looks like you forgot to uncomment:

   #provider: corosync_votequorum
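
i.e. something like (two_node is optional, but probably what you want for a
two-node test cluster; it also implies wait_for_all):

    quorum {
        provider: corosync_votequorum
        two_node: 1
    }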

> 
> This is my corosync.conf
> 
> ###
> 
> # Please read the corosync.conf.5 manual page
> totem {
>version: 2
> 
>crypto_cipher: none
>crypto_hash: none
> 
>interface {
>ringnumber: 0
>bindnetaddr: 192.168.122.0
>mcastport: 5405
>ttl: 1
>}
>transport: udpu
> }
> 
> logging {
>fileline: off
>to_logfile: yes
>to_syslog: no
>logfile: /var/log/cluster/corosync.log
>debug: off
>timestamp: on
>logger_subsys {
>subsys: QUORUM
>debug: off
>}
> }
> 
> nodelist {
>node {
>ring0_addr: 192.168.122.172
>#nodeid: 1
>}
> 
>node {
>ring0_addr: 192.168.122.113
>#nodeid: 2
>}
> }
> 
> quorum {
># Enable and configure quorum subsystem (default: off)
># see also corosync.conf.5 and votequorum.5
>#provider: corosync_votequorum
> }
> 
> 
> 
> used components:
> 
> pacemaker: 1.1.12
> corosync: 2.3.5
> libqb: 0.17.1
> 
> 
> Did I miss something?
> 
> Thanks!
> Stan
> 




Re: [ClusterLabs] [ClusterLabs Developers] Resource Agent language discussion

2015-08-27 Thread Andrew Beekhof

> On 21 Aug 2015, at 2:21 am, Jehan-Guillaume de Rorthais  
> wrote:
> 
> On Thu, 20 Aug 2015 15:05:24 +1000
> Andrew Beekhof  wrote:
> 
>> 
>>> On 19 Aug 2015, at 6:59 pm, Jehan-Guillaume de Rorthais 
>>> wrote:
>>> 
>>> On Mon, 17 Aug 2015 09:42:35 +1000
>>> Andrew Beekhof  wrote:
>>> 
> On 11 Aug 2015, at 5:34 pm, Jehan-Guillaume de Rorthais 
> wrote:
> 
> On Tue, 11 Aug 2015 11:30:03 +1000
> Andrew Beekhof  wrote:
>>> [...]
>> You can and should use whatever language you like for your own private
>> RAs. But if you want it accepted and maintained by the resource-agents
>> project, you would be advised to use the language they have standardised
>> on.
> 
> Well, let's imagine our RA was written in bash (in fact, we have a bash
> version pretty close to the current perl version we abandoned). I wonder
> if it would be accepted in the resource-agents project anyway as another
> one already exists there. I can easily list the reasons we rewrote a new
> one, but this is not the subject here.
> 
> The discussion here is more about the language, if I should extract a
> ocf-perl-module from my RA and if there is any chance the resource-agents
> project would accept it.
 
 Well, it depends on the reasons you didn’t list :-)
>>> 
>>> Ok, let's answer the questions then :)
>>> 
 The first questions any maintainer is going to ask are:
 - why did you write a new one?
>>> 
>>> About the existing pgsql RA:
>>> * it supports stateless AND multistate pgsql resources. This makes the code
>>>   bigger, more complex, and harder to follow and understand
>>> * some params are for multistate usage only, some other for stateless only,
>>>   some for both, making the configuration harder to understand
>>> * some params are required for multistate where a recent PostgreSQL can
>>> live without them (restore_command)
>>> * it achieves operations a RA should not take care of (switching from
>>>   synchronous to asynchronous replication on the master if slaves are gone,
>>>   killing all existing xact)
>>> * ...and this makes the code even bigger and more complex again
>>> * it supports too many options and has some conventions the DBA should take
>>>   care of themselves. This makes it way too complex and touchy to set up and maintain
>>> * it does not support demote, making the code lie about the real
>>>   state of the resource to Pacemaker. This was because demote/switchover
>>> was unsafe for postgresql < 9.3.
>>> 
>>> What we tried to achieve with a new pgsql RA:
>>> * multistate only (we already have a stateless RA, in bash)
>>> * should have simple code: easier to understand and maintain, achieving one
>>>   goal at a time
>>> * be simple to setup
>>> * should not substitute itself to the DBA
>>> * support safe ("cold") demote/switchover
>>> 
 - can we merge this with the old one?
>>> 
>>> Well, it would make the code even bigger, maybe conflicting and harder to
>>> understand. I can already hear questions about such a Frankenstein RA
>>> ("why am I able to set up two different multistate architectures?" "why does this
>>> one not support this parameter?" "should I create my recovery.conf or
>>> not?")
>>> 
>>> Some of our ideas could be merged into the old one, though; we could discuss
>>> and help the maintainers if they are interested and have time. But we only have
>>> limited R&D time and no time to lead such a development.
>>> 
 - can the new one replace the old one? (ie. full superset)
>>> 
>>> No. It does not support stateless resource, does not mess with replication
>>> synchronism, does not kill queries if all the slaves are gone, does not
>>> "lock" an instance when it failed, only promote the resource using "pg_ctl
>>> promote" (with no restart), ...
>>> 
 Because if both are included, then they will forevermore be answering the
 question “which one should I use?”.
>>> 
>>> True.
>>> 
 Basically, if you want it accepted upstream, then yes, you probably want to
 ditch the perl bit. But not having seen the agent or knowing why it exists,
 its hard to say.
>>> 
>>> Well, it seems our RA will not make it to the upstream repository,
>> 
>> You made a fairly reasonable argument for separate stateless and stateful
>> variants.
> 
> BTW, how do other official RAs deal with this?

They’re not, but in this case it seems there could be good reasons to separate 
them

> A quick look at RA names
> seems to reveal that no service has dedicated stateless and stateful RA scripts.
> 
> [...]
>>> What I was discussing here was:
>>> 
>>> * if not using bash, is there any trap we should avoid that are already
>>>   addressed in the ocf-shellfuncs library?
>> 
>> No, you just might have to re-implement some things.
>> Particularly logging.
> 
> Ok, that was my conclusion so far. I'll have a look at the logging funcs then.
> 
>>> * is there a chance a perl version of such library would be accepted
>>> upstream?
>> 
>> Depends if you’re

Re: [ClusterLabs] Pacemaker tries to demote resource that isn't running and returns OCF_FAILED_MASTER

2015-08-27 Thread Andrew Beekhof

> On 21 Aug 2015, at 1:32 pm, Andrei Borzenkov  wrote:
> 
> 21.08.2015 00:35, Brian Campbell wrote:
>> I have a master/slave resource (with a custom resource agent) which,
>> if it uncleanly shut down, will return OCF_FAILED_MASTER on the next
>> "monitor" operation. This seems to be what
>> http://www.linux-ha.org/doc/dev-guides/_literal_ocf_failed_master_literal_9.html
>> suggests that exit code should be used for.
>> 
>> After the node is fenced, and comes up again, Pacemaker probes all of
>> the resources. It gets the OCF_FAILED_MASTER exit code, and decides
>> that it needs to demote the resource. So it executes the demote
>> action. My resource agent returns an error on a demote action if it is
>> not running, which seems to be the suggested behavior according to
>> http://www.linux-ha.org/doc/dev-guides/_literal_demote_literal_action.html
>> 
>> This then causes Pacemaker to log a failure for the "demote" action,
>> and then try to recover by stopping (which succeeds cleanly because
>> the resource is stopped) followed by starting it again (which again
>> succeeds, as we can start in slave mode from a failed state). So the
>> end state is correct, but crm_mon shows a failed action that you need
>> to clear out:
>> 
>> Failed actions:
>> 
>> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_demote_0
>> (node=es-efs-master2, call=73, rc=1, status=complete, l
>> ast-rc-change=Thu Aug 20 12:52:21 2015
>> , queued=54ms, exec=1ms
>> ): unknown error
>> 
>> I'm curious about whether the behavior of my resource agent is
>> correct. Should I not be returning OCF_FAILED_MASTER upon the
>> "monitor" operation if the resource isn't started?
> 
> Correct. If the resource is not started it cannot be master or slave; it can 
> become master only after pacemaker has requested it. An unexpected master would 
> be just as much of an error.
> 
> If you can determine that one resource instance is more suitable to become 
> master than another one, you should set the master score accordingly so 
> pacemaker will promote the correct instance.
> 
>>   Or should the
>> "demote" operation do something different in this state, like actually
>> starting up the slave?
>> 
> 
> In general, if the current resource state is the same as it would be after the 
> operation completed, there is absolutely no reason to return an error - just 
> report that the operation succeeded.

Always return the actual state. ie. OCF_NOT_RUNNING in these two cases.

Only return OCF_FAILED_MASTER if you know enough to say that it's in the master 
state (ie. via a lock file or similar mechanism) but not able to handle requests.
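
A sketch of that monitor logic (the pid/lock file paths and the
check_master_health helper are placeholders, not from any particular agent,
and it assumes ocf-shellfuncs has been sourced so the OCF_* codes are set):

    monitor() {
        if ! kill -0 "$(cat /var/run/myservice.pid 2>/dev/null)" 2>/dev/null; then
            return $OCF_NOT_RUNNING        # not running at all
        fi
        if [ -e /var/run/myservice.master ]; then
            # claims to be master: report FAILED_MASTER only if it cannot
            # actually serve requests
            check_master_health || return $OCF_FAILED_MASTER
            return $OCF_RUNNING_MASTER
        fi
        return $OCF_SUCCESS                # running as a slave
    }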

> 
>> It seems like the behavior of Pacemaker is different than what's
>> documented in the resource agent guide, so I'm trying to figure out if
>> this is a bug in my resource agent, a bug in Pacemaker, a
>> misunderstanding on my part, or actually intended behavior.
>> 
>> -- Brian
>> 
> 
> 




Re: [ClusterLabs] Resources within a Group

2015-08-27 Thread Zhen Ren
Hi Jorge,

Like "colocation" is what you want. 

Please take a look at here:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-sets-colocation.html
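
For example, instead of (or in addition to) a group you can spell out the
dependency explicitly (a sketch in crm shell syntax; vg1 and fs1 are
placeholder resource ids):

    colocation fs1-with-vg1 inf: fs1 vg1
    order vg1-before-fs1 inf: vg1 fs1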
 

 >>>
> Hi, 
>  
> I'm on SLES 11 SP4 (Pacemaker 1.1.12) and still learning all this :) 
> I'm wondering if there's a way to control the resource startup behaviour 
> within a group? 
>  
> For example, I have an LVM resource (to activate a VG) and the next one: 
> the Filesystem resource (to mount it).  If the VG activation fails I see 
> errors afterwards trying to mount the filesystem.  Is there something 
> like "if the first resource fails, stop further processing"? (sort of 
> like the way one can control the stacking of PAM modules). 
>  
> Thanks, 
> Jorge 
>  



--
Eric, Ren



