Re: [ClusterLabs] resource-stickiness
Hi,

It doesn't work as I expected. I changed the constraint name to:

location loc-aapche-sles1 aapche role=Started 10: sles1

but after I manually move the resource to the other node via HAWK, it automatically adds this line:

location cli-prefer-aapche aapche role=Started inf: sles1

so now I have both lines:

location cli-prefer-aapche aapche role=Started inf: sles1
location loc-aapche-sles1 aapche role=Started 10: sles1

and resource-stickiness has no effect: after node1 is fenced, the resource moves back to node1 once node1 comes back, and this is what I don't want. I know that I can remove the line that was added by the cluster, but this is not the proper solution. Please tell me what is wrong. Thanks.

My config:

node sles1
node sles2
primitive filesystem Filesystem \
	params fstype=ext3 directory="/srv/www/vhosts" device="/dev/xvdd1" \
	op start interval=0 timeout=60 \
	op stop interval=0 timeout=60 \
	op monitor interval=20 timeout=40
primitive myip IPaddr2 \
	params ip=10.9.131.86 \
	op start interval=0 timeout=20s \
	op stop interval=0 timeout=20s \
	op monitor interval=10s timeout=20s
primitive stonith_sbd stonith:external/sbd \
	params pcmk_delay_max=30
primitive web apache \
	params configfile="/etc/apache2/httpd.conf" \
	op start interval=0 timeout=40s \
	op stop interval=0 timeout=60s \
	op monitor interval=10 timeout=20s
group aapche filesystem myip web \
	meta target-role=Started is-managed=true resource-stickiness=1000
location cli-prefer-aapche aapche role=Started inf: sles1
location loc-aapche-sles1 aapche role=Started 10: sles1
property cib-bootstrap-options: \
	stonith-enabled=true \
	no-quorum-policy=ignore \
	placement-strategy=balanced \
	expected-quorum-votes=2 \
	dc-version=1.1.12-f47ea56 \
	cluster-infrastructure="classic openais (with plugin)" \
	last-lrm-refresh=1440502955 \
	stonith-timeout=40s
rsc_defaults rsc-options: \
	resource-stickiness=1000 \
	migration-threshold=3
op_defaults op-options: \
	timeout=600 \
	record-pending=true

BR
Jost

From: Andrew Beekhof
Sent: Thursday, August 27, 2015 12:20 AM
To:
Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] resource-stickiness

> On 26 Aug 2015, at 10:09 pm, Rakovec Jost wrote:
>
> Sorry one typo: problem is the same
>
> location cli-prefer-aapche aapche role=Started 10: sles2

Change the name of your constraint.
The 'cli-prefer-' prefix is reserved for "temporary" constraints created by the command line tools (which therefore feel entitled to delete them as necessary).

> to:
>
> location cli-prefer-aapche aapche role=Started inf: sles2
>
> It keeps changing to infinity.
>
> my configuration is:
>
> node sles1
> node sles2
> primitive filesystem Filesystem \
>	params fstype=ext3 directory="/srv/www/vhosts" device="/dev/xvdd1" \
>	op start interval=0 timeout=60 \
>	op stop interval=0 timeout=60 \
>	op monitor interval=20 timeout=40
> primitive myip IPaddr2 \
>	params ip=x.x.x.x \
>	op start interval=0 timeout=20s \
>	op stop interval=0 timeout=20s \
>	op monitor interval=10s timeout=20s
> primitive stonith_sbd stonith:external/sbd \
>	params pcmk_delay_max=30
> primitive web apache \
>	params configfile="/etc/apache2/httpd.conf" \
>	op start interval=0 timeout=40s \
>	op stop interval=0 timeout=60s \
>	op monitor interval=10 timeout=20s
> group aapche filesystem myip web \
>	meta target-role=Started is-managed=true resource-stickiness=1000
> location cli-prefer-aapche aapche role=Started 10: sles2
> property cib-bootstrap-options: \
>	stonith-enabled=true \
>	no-quorum-policy=ignore \
>	placement-strategy=balanced \
>	expected-quorum-votes=2 \
>	dc-version=1.1.12-f47ea56 \
>	cluster-infrastructure="classic openais (with plugin)" \
>	last-lrm-refresh=1440502955 \
>	stonith-timeout=40s
> rsc_defaults rsc-options: \
>	resource-stickiness=1000 \
>	migration-threshold=3
> op_defaults op-options: \
>	timeout=600 \
>	record-pending=true
>
> and after migration:
>
> node sles1
> node sles2
> primitive filesystem Filesystem \
>	params fstype=ext3 directory="/srv/www/vhosts" device="/dev/xvdd1" \
>	op start interval=0 timeout=60 \
>	op stop interval=0 timeout=60 \
>	op monitor interval=20 timeout=40
> primitive myip IPaddr2 \
>	params ip=10.9.131.86 \
>	op start interval=0 timeout=20s \
>	op stop interval=0 timeout=20s \
>	op monitor interval=10s timeout=20s
> primitive stonith_sbd stonith:external/sbd \
>	params pcmk_delay_max=30
> primitive web apache \
>	params configfile="/etc/apache2/httpd.conf" \
>	op start interval=0 time
Re: [ClusterLabs] CRM managing ADSL connection; failure not handled
On Mon, 24 Aug 2015, Andrei Borzenkov wrote:

> On 24.08.2015 13:32, Tom Yates wrote:
>> if i understand you aright, my problem is that the stop script didn't
>> return a 0 (OK) exit status, so CRM didn't know where to go. is the
>> exit status of the stop script how CRM determines the status of the
>> stop operation?
>
> correct
>
>> does CRM also use the output of "/etc/init.d/script status" to
>> determine continuing successful operation?
>
> It definitely does not use *output* of script - only return code. If
> the question is whether it probes resource additionally to checking
> stop exit code - I do not think so (I know it does it in some cases
> for systemd resources).

i just thought i'd come back and follow-up. in testing this morning, i can confirm that the "pppoe-stop" command returns status 1 if pppd isn't running. that makes a standard init.d script, which passes on the return code of the stop command, unhelpful to CRM.

i changed the script so that on stop, having run pppoe-stop, it checks for the existence of a working ppp0 interface, and returns 0 iff there is none.

> If resource was previously active and stop was attempted as cleanup
> after resource failure - yes, it should attempt to start it again.

that is now what happens. it seems to try three times to bring up pppd, then kicks the service over to the other node.

in the case of extended outages (ie, the ISP goes away for more than about 10 minutes), where both nodes have time to fail, we end up back in the bad old state (service failed on both nodes):

[root@positron ~]# crm status
[...]
Online: [ electron positron ]

 Resource Group: BothIPs
     InternalIP (ocf::heartbeat:IPaddr):	Started electron
     ExternalIP (lsb:hb-adsl-helper):	Stopped

Failed actions:
    ExternalIP_monitor_6 (node=positron, call=15, rc=7, status=complete): not running
    ExternalIP_start_0 (node=positron, call=17, rc=-2, status=Timed Out): unknown exec error
    ExternalIP_start_0 (node=electron, call=6, rc=-2, status=Timed Out): unknown exec error

is there any way to configure CRM to keep kicking the service between the two nodes forever (ie, try three times on positron, kick service group to electron, try three times on electron, kick back to positron, lather rinse repeat...)?

for a service like DSL, which can go away for extended periods through no local fault then suddenly and with no announcement come back, this would be most useful behaviour.

thanks to all for help with this. thanks also to those who have suggested i rewrite this as an OCF agent (especially to ken gaillot who was kind enough to point me to documentation); i will look at that if time permits.

-- 
Tom Yates - http://www.teaparty.net

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
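The stop-script change described above can be sketched roughly like this. It is a sketch only: `pppoe-stop` behaviour and the `ppp0` interface name come from the thread, while the wrapper function itself is hypothetical.

```shell
#!/bin/sh
# Hypothetical LSB-style stop handler: run pppoe-stop, but report success
# to the cluster based on whether the ppp0 interface is actually gone,
# rather than on pppoe-stop's exit status (which is 1 when pppd wasn't
# running, even though that is a perfectly good "stopped" state).
stop_ppp() {
    pppoe-stop >/dev/null 2>&1 || true   # exit status deliberately ignored
    if [ -e /sys/class/net/ppp0 ]; then
        return 1    # interface still present: the stop really failed
    fi
    return 0        # no ppp0 interface: report a clean stop to CRM
}
```

On a machine with no ppp0 interface, `stop_ppp` returns 0 even when `pppoe-stop` itself fails, which is exactly the behaviour the poster wanted.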
Re: [ClusterLabs] resource-stickiness
On 08/27/2015 02:42 AM, Rakovec Jost wrote:
> Hi
>
> it doesn't work as I expected, I change name to:
>
> location loc-aapche-sles1 aapche role=Started 10: sles1
>
> but after I manual move resource via HAWK to other node it auto add this line:
>
> location cli-prefer-aapche aapche role=Started inf: sles1
>
> so now I have both lines:
>
> location cli-prefer-aapche aapche role=Started inf: sles1
> location loc-aapche-sles1 aapche role=Started 10: sles1

When you manually move a resource using a command-line tool, the tool accomplishes the move by adding a constraint, like the one you see added above. Such tools generally provide another option to clear any constraints they added, which you can run manually once you are satisfied with the state of things. Until you do so, the added constraint will remain, and will affect resource placement.

> and resource-stickiness doesn't work since after fence node1 the resource is
> move back to node1 after node1 come back and this is what I don't like. I
> know that I can remove line that was added by cluster, but this is not the
> proper solution. Please tell me what is wrong. Thanks. My config:

Resource placement depends on many factors. "Scores" affect the outcome: stickiness has a score, each constraint has a score, and the active node with the highest score wins. In your config, resource-stickiness has a score of 1000, but cli-prefer-aapche has a score of "inf" (infinity), so sles1 wins when it comes back online (infinity > 1000). By contrast, loc-aapche-sles1 has a score of 10, so by itself it would not cause the resource to move back (10 < 1000). To achieve what you want, clear the temporary constraint added by HAWK before sles1 comes back.
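The clearing step described here can be done from the shell. These are sketches only; the exact option names vary across crmsh and Pacemaker versions, so check the local man pages:

```
# crmsh: drop the cli-prefer-/cli-ban- constraint left by a manual move
crm resource unmigrate aapche

# low-level equivalent; spelled -U/--un-move in older Pacemaker
# releases and --clear in newer ones
crm_resource -U -r aapche

# or remove the constraint by name
crm configure delete cli-prefer-aapche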
> node sles1
> node sles2
> primitive filesystem Filesystem \
>	params fstype=ext3 directory="/srv/www/vhosts" device="/dev/xvdd1" \
>	op start interval=0 timeout=60 \
>	op stop interval=0 timeout=60 \
>	op monitor interval=20 timeout=40
> primitive myip IPaddr2 \
>	params ip=10.9.131.86 \
>	op start interval=0 timeout=20s \
>	op stop interval=0 timeout=20s \
>	op monitor interval=10s timeout=20s
> primitive stonith_sbd stonith:external/sbd \
>	params pcmk_delay_max=30
> primitive web apache \
>	params configfile="/etc/apache2/httpd.conf" \
>	op start interval=0 timeout=40s \
>	op stop interval=0 timeout=60s \
>	op monitor interval=10 timeout=20s
> group aapche filesystem myip web \
>	meta target-role=Started is-managed=true resource-stickiness=1000
> location cli-prefer-aapche aapche role=Started inf: sles1
> location loc-aapche-sles1 aapche role=Started 10: sles1
> property cib-bootstrap-options: \
>	stonith-enabled=true \
>	no-quorum-policy=ignore \
>	placement-strategy=balanced \
>	expected-quorum-votes=2 \
>	dc-version=1.1.12-f47ea56 \
>	cluster-infrastructure="classic openais (with plugin)" \
>	last-lrm-refresh=1440502955 \
>	stonith-timeout=40s
> rsc_defaults rsc-options: \
>	resource-stickiness=1000 \
>	migration-threshold=3
> op_defaults op-options: \
>	timeout=600 \
>	record-pending=true
>
> BR
>
> Jost
>
> From: Andrew Beekhof
> Sent: Thursday, August 27, 2015 12:20 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] resource-stickiness
>
>> On 26 Aug 2015, at 10:09 pm, Rakovec Jost wrote:
>>
>> Sorry one typo: problem is the same
>>
>> location cli-prefer-aapche aapche role=Started 10: sles2
>
> Change the name of your constraint.
> The 'cli-prefer-' prefix is reserved for "temporary" constraints created by
> the command line tools (which therefore feel entitled to delete them as
> necessary).
>
>> to:
>>
>> location cli-prefer-aapche aapche role=Started inf: sles2
>>
>> It keeps changing to infinity.
>>
>> my configuration is:
>>
>> node sles1
>> node sles2
>> primitive filesystem Filesystem \
>>	params fstype=ext3 directory="/srv/www/vhosts" device="/dev/xvdd1" \
>>	op start interval=0 timeout=60 \
>>	op stop interval=0 timeout=60 \
>>	op monitor interval=20 timeout=40
>> primitive myip IPaddr2 \
>>	params ip=x.x.x.x \
>>	op start interval=0 timeout=20s \
>>	op stop interval=0 timeout=20s \
>>	op monitor interval=10s timeout=20s
>> primitive stonith_sbd stonith:external/sbd \
>>	params pcmk_delay_max=30
>> primitive web apache \
>>	params configfile="/etc/apache2/httpd.conf" \
>>	op start interval=0 timeout=40s \
>>	op stop interval=0 timeout=60s \
>>	op monitor interval=10 timeout=20s
>> group aapche filesystem myip web \
>>	meta target-role=Started is-managed=true resource-stickiness=1000
>> location cli-prefer-aapche aapche role=Started 10:
Re: [ClusterLabs] CRM managing ADSL connection; failure not handled
On 08/27/2015 03:04 AM, Tom Yates wrote:
> On Mon, 24 Aug 2015, Andrei Borzenkov wrote:
>
>> On 24.08.2015 13:32, Tom Yates wrote:
>>> if i understand you aright, my problem is that the stop script didn't
>>> return a 0 (OK) exit status, so CRM didn't know where to go. is the
>>> exit status of the stop script how CRM determines the status of the
>>> stop operation?
>>
>> correct
>>
>>> does CRM also use the output of "/etc/init.d/script status" to
>>> determine continuing successful operation?
>>
>> It definitely does not use *output* of script - only return code. If
>> the question is whether it probes resource additionally to checking
>> stop exit code - I do not think so (I know it does it in some cases
>> for systemd resources).
>
> i just thought i'd come back and follow-up. in testing this morning, i
> can confirm that the "pppoe-stop" command returns status 1 if pppd isn't
> running. that makes a standard init.d script, which passes on the
> return code of the stop command, unhelpful to CRM.
>
> i changed the script so that on stop, having run pppoe-stop, it checks
> for the existence of a working ppp0 interface, and returns 0 iff there
> is none.

Nice

>> If resource was previously active and stop was attempted as cleanup
>> after resource failure - yes, it should attempt to start it again.
>
> that is now what happens. it seems to try three times to bring up pppd,
> then kicks the service over to the other node.
>
> in the case of extended outages (ie, the ISP goes away for more than
> about 10 minutes), where both nodes have time to fail, we end up back in
> the bad old state (service failed on both nodes):
>
> [root@positron ~]# crm status
> [...]
> Online: [ electron positron ]
>
>  Resource Group: BothIPs
>      InternalIP (ocf::heartbeat:IPaddr):	Started electron
>      ExternalIP (lsb:hb-adsl-helper):	Stopped
>
> Failed actions:
>     ExternalIP_monitor_6 (node=positron, call=15, rc=7, status=complete): not running
>     ExternalIP_start_0 (node=positron, call=17, rc=-2, status=Timed Out): unknown exec error
>     ExternalIP_start_0 (node=electron, call=6, rc=-2, status=Timed Out): unknown exec error
>
> is there any way to configure CRM to keep kicking the service between
> the two nodes forever (ie, try three times on positron, kick service
> group to electron, try three times on electron, kick back to positron,
> lather rinse repeat...)?
>
> for a service like DSL, which can go away for extended periods through
> no local fault then suddenly and with no announcement come back, this
> would be most useful behaviour.

Yes, see migration-threshold and failure-timeout.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#s-resource-options

> thanks to all for help with this. thanks also to those who have
> suggested i rewrite this as an OCF agent (especially to ken gaillot who
> was kind enough to point me to documentation); i will look at that if
> time permits.
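A sketch of how the two suggested options could be combined, in the same crm syntax the thread's configs use (the 600s value is purely illustrative, not a recommendation from the thread):

```
rsc_defaults rsc-options: \
        migration-threshold=3 \
        failure-timeout=600s
```

With failure-timeout set, a resource's fail count expires once it has gone that long without failing, so each node eventually becomes eligible again instead of staying banned after migration-threshold failures, and the group can keep moving between the two nodes.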
[ClusterLabs] Resources within a Group
Hi,

I'm on SLES 11 SP4 (Pacemaker 1.1.12) and still learning all this :)
I'm wondering if there's a way to control the resource startup behaviour within a group?

For example, I have an LVM resource (to activate a VG) and the next one: the Filesystem resource (to mount it). If the VG activation fails, I see errors afterwards from trying to mount the filesystem. Is there something like "if the first resource fails, stop further processing"? (sort of like one can control the stacking of PAM modules).

Thanks,
Jorge
[ClusterLabs] wait_for_all in SLES 11
Hi,

Is there a way to recreate the newest Corosync option, wait_for_all, in SLES 11? Does anyone (anyone from SUSE?) know if there are any plans to backport this into SLES 11?

I can't say I miss this option (since I'm just starting out in HA), but after evaluating many possible situations with fence loops, it's one of the greatest ideas I've seen. Disabling autostart of openais doesn't quite give the same effect as wait_for_all.

Perhaps I might create a wrapper script around /etc/init.d/openais.

Thanks,
Jorge
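For comparison, on corosync 2.x the option being asked about is just a quorum-section setting. A sketch for a two-node cluster (values illustrative; note that two_node: 1 already implies wait_for_all unless it is explicitly overridden):

```
quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
    wait_for_all: 1
}
```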
Re: [ClusterLabs] Resources within a Group
> On 28 Aug 2015, at 7:54 am, Jorge Fábregas wrote:
>
> Hi,
>
> I'm on SLES 11 SP4 (Pacemaker 1.1.12) and still learning all this :)
> I'm wondering if there's a way to control the resource startup behaviour
> within a group?
>
> For example, I have an LVM resource (to activate a VG) and the next one:
> the Filesystem resource (to mount it). If the VG activation fails I see
> errors afterwards trying to mount the filesystem. Is there something
> like "if the first resource fails, stop further processing"? (sort of
> like one can control the stacking of PAM modules).

That's how it normally works. Perhaps the VG agent is swallowing the error instead of reporting it to the cluster?

> Thanks,
> Jorge
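As background for this answer: a group is shorthand for ordering plus colocation of its members, so a later member cannot start while an earlier one has failed. Roughly, in crm syntax (resource and constraint names here are illustrative, not from the thread):

```
group grp-example my-lvm my-fs
# behaves approximately like the pair of constraints:
#   order      grp-example-order inf: my-lvm my-fs
#   colocation grp-example-coloc inf: my-fs my-lvm
```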
Re: [ClusterLabs] Antw: Re: fence_sanlock and pacemaker
> On 27 Aug 2015, at 4:20 pm, Ulrich Windl wrote:
>
> "Laurent B." wrote on 27.08.2015 at 08:06 in message
> <55dea8cc.3080...@qmail.re>:
>> Hello,
>>
>>> You'd have to build it yourself, but sbd could be an option
>>
>> do you have any clue on how to install it on redhat (6.5)? I installed
>> the cluster glue package and the sbd package (provided by OpenSUSE) but
>> now I'm stuck. The stonith resource creation gives me an error saying
>> that the sbd resource was not found.
>
> sbd has to be started before the cluster software. SUSE does something like:

on RHEL7 the sbd systemd unit file arranges for it to be started/stopped whenever corosync is (started/stopped)

> SBD_CONFIG=/etc/sysconfig/sbd
> SBD_BIN="/usr/sbin/sbd"
> if [ -f $SBD_CONFIG ]; then
>     . $SBD_CONFIG
> fi
>
> [ -x "$exec" ] || exit 0
>
> SBD_DEVS=${SBD_DEVICE%;}
> SBD_DEVICE=${SBD_DEVS//;/ -d }
>
> : ${SBD_DELAY_START:="no"}
>
> StartSBD() {
>     test -x $SBD_BIN || return
>     if [ -n "$SBD_DEVICE" ]; then
>         if ! pidofproc $SBD_BIN >/dev/null 2>&1 ; then
>             echo -n "Starting SBD - "
>             if ! $SBD_BIN -d $SBD_DEVICE -D $SBD_OPTS watch ; then
>                 echo "SBD failed to start; aborting."
>                 exit 1
>             fi
>             if env_is_true ${SBD_DELAY_START} ; then
>                 sleep $(sbd -d "$SBD_DEVICE" dump | grep -m 1 msgwait | awk '{print $4}') 2>/dev/null
>             fi
>         fi
>     fi
> }
>
> StopSBD() {
>     test -x $SBD_BIN || return
>     if [ -n "$SBD_DEVICE" ]; then
>         echo -n "Stopping SBD - "
>         if ! $SBD_BIN -d $SBD_DEVICE -D $SBD_OPTS message LOCAL exit ; then
>             echo "SBD failed to stop; aborting."
>             exit 1
>         fi
>     fi
>     while pidofproc $SBD_BIN >/dev/null 2>&1 ; do
>         sleep 1
>     done
>     echo -n "done "
> }

>> Thank you,
>>
>> Laurent
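The variables the init snippet reads (SBD_DEVICE, SBD_OPTS, SBD_DELAY_START) come from /etc/sysconfig/sbd. A minimal sketch of that file, with the device path as a placeholder rather than a real value:

```
# /etc/sysconfig/sbd
SBD_DEVICE="/dev/sdX1"
SBD_OPTS=""
SBD_DELAY_START="no"
```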
Re: [ClusterLabs] Antw: Re: resource-stickiness
> On 27 Aug 2015, at 4:12 pm, Ulrich Windl wrote:
>
> Andrew Beekhof wrote on 27.08.2015 at 00:20 in message:
>
>>> On 26 Aug 2015, at 10:09 pm, Rakovec Jost wrote:
>>>
>>> Sorry one typo: problem is the same
>>>
>>> location cli-prefer-aapche aapche role=Started 10: sles2
>>
>> Change the name of your constraint.
>> The 'cli-prefer-' prefix is reserved for "temporary" constraints created by
>> the command line tools (which therefore feel entitled to delete them as
>> necessary).
>
> In which ways is "cli-prefer-" handled specially, if I may ask…

we delete them when you use the cli tools to move the resource somewhere else (crm_resource --ban, --move, --clear)

>>> to:
>>>
>>> location cli-prefer-aapche aapche role=Started inf: sles2
>>>
>>> It keeps changing to infinity.
>>>
>>> my configuration is:
>>>
>>> node sles1
>>> node sles2
>>> primitive filesystem Filesystem \
>>>	params fstype=ext3 directory="/srv/www/vhosts" device="/dev/xvdd1" \
>>>	op start interval=0 timeout=60 \
>>>	op stop interval=0 timeout=60 \
>>>	op monitor interval=20 timeout=40
>>> primitive myip IPaddr2 \
>>>	params ip=x.x.x.x \
>>>	op start interval=0 timeout=20s \
>>>	op stop interval=0 timeout=20s \
>>>	op monitor interval=10s timeout=20s
>>> primitive stonith_sbd stonith:external/sbd \
>>>	params pcmk_delay_max=30
>>> primitive web apache \
>>>	params configfile="/etc/apache2/httpd.conf" \
>>>	op start interval=0 timeout=40s \
>>>	op stop interval=0 timeout=60s \
>>>	op monitor interval=10 timeout=20s
>>> group aapche filesystem myip web \
>>>	meta target-role=Started is-managed=true resource-stickiness=1000
>>> location cli-prefer-aapche aapche role=Started 10: sles2
>>> property cib-bootstrap-options: \
>>>	stonith-enabled=true \
>>>	no-quorum-policy=ignore \
>>>	placement-strategy=balanced \
>>>	expected-quorum-votes=2 \
>>>	dc-version=1.1.12-f47ea56 \
>>>	cluster-infrastructure="classic openais (with plugin)" \
>>>	last-lrm-refresh=1440502955 \
>>>	stonith-timeout=40s
>>> rsc_defaults rsc-options: \
>>>	resource-stickiness=1000 \
>>>	migration-threshold=3
>>> op_defaults op-options: \
>>>	timeout=600 \
>>>	record-pending=true
>>>
>>> and after migration:
>>>
>>> node sles1
>>> node sles2
>>> primitive filesystem Filesystem \
>>>	params fstype=ext3 directory="/srv/www/vhosts" device="/dev/xvdd1" \
>>>	op start interval=0 timeout=60 \
>>>	op stop interval=0 timeout=60 \
>>>	op monitor interval=20 timeout=40
>>> primitive myip IPaddr2 \
>>>	params ip=10.9.131.86 \
>>>	op start interval=0 timeout=20s \
>>>	op stop interval=0 timeout=20s \
>>>	op monitor interval=10s timeout=20s
>>> primitive stonith_sbd stonith:external/sbd \
>>>	params pcmk_delay_max=30
>>> primitive web apache \
>>>	params configfile="/etc/apache2/httpd.conf" \
>>>	op start interval=0 timeout=40s \
>>>	op stop interval=0 timeout=60s \
>>>	op monitor interval=10 timeout=20s
>>> group aapche filesystem myip web \
>>>	meta target-role=Started is-managed=true resource-stickiness=1000
>>> location cli-prefer-aapche aapche role=Started inf: sles2
>>> property cib-bootstrap-options: \
>>>	stonith-enabled=true \
>>>	no-quorum-policy=ignore \
>>>	placement-strategy=balanced \
>>>	expected-quorum-votes=2 \
>>>	dc-version=1.1.12-f47ea56 \
>>>	cluster-infrastructure="classic openais (with plugin)" \
>>>	last-lrm-refresh=1440502955 \
>>>	stonith-timeout=40s
>>> rsc_defaults rsc-options: \
>>>	resource-stickiness=1000 \
>>>	migration-threshold=3
>>> op_defaults op-options: \
>>>	timeout=600 \
>>>	record-pending=true
>>>
>>> From: Rakovec Jost
>>> Sent: Wednesday, August 26, 2015 1:33 PM
>>> To: users@clusterlabs.org
>>> Subject: resource-stickiness
>>>
>>> Hi list,
>>>
>>> I have configured a simple cluster on sles 11 sp4 and have a problem with
>>> "auto_failover off". The problem is that whenever I migrate the resource
>>> group via HAWK my configuration changes from:
>>>
>>> location cli-prefer-aapche aapche role=Started 10: sles2
>>>
>>> to:
>>>
>>> location cli-ban-aapche-on-sles1 aapche role=Started -inf: sles1
>>>
>>> It keeps changing to inf.
>>>
>>> and then after fencing a node, the resource moves back to the original
>>> node, which I don't want. How can I avoid this situation?
>>>
>>> my configuration is:
>>>
>>> node sles1
>>> node sles2
>>> primitive filesystem Filesystem \
>>>	params fstype=ext3 directory="/srv/www/vhosts" device="/dev/xvdd1" \
>>>	op start interval=0 timeout=60 \
>>>	op stop interval=0 timeout=60 \
>>>	op monitor interval=20 tim
Re: [ClusterLabs] wait_for_all in SLES 11
Not a SUSE user, so I'm not familiar with what is shipped with SLES 11, but if it's corosync v2.x, you already have it.

If it's not corosync v2, then you might be interested in how I solved this problem in RHEL 6 (corosync v1 + cman + rgmanager): 'safe_anvil_start' (https://github.com/digimer/striker/blob/master/tools/safe_anvil_start). It's perl, but if you're OK with that, it should be fairly easy to port.

Basically, it tries to reach the peer on boot. If it can, it does some other sanity checks (it expects drbd, cman and rgmanager, hence the need to port most likely). So long as it can reach the peer, it will start the cluster. If it can't, it will just sit there. So it's the same idea as wait_for_all.

It's triggered by an rc3.d script that it will create/remove when called with --enable/--disable. It expects a file called /etc/striker/striker.conf and looks for "tools::safe_anvil_start::enabled = [0|1]", and you can use it with a skeleton file with just that value in it.

If you're interested in this and have any trouble, I'll be happy to help you adapt it. With luck though, you'll already have corosync v2 and it'll be moot.

Cheers

On 27/08/15 08:21 PM, Jorge Fábregas wrote:
> Hi,
>
> Is there a way to recreate the newest Corosync option, wait_for_all, in
> SLES 11? Does anyone (anyone from SUSE?) know if there are any plans to
> backport this into SLES 11?
>
> I can't say I miss this option (since I'm just starting out in HA) but
> after evaluating many possible situations with fence loops, it's one of
> the greatest ideas I've seen. Disabling autostart of openais doesn't
> quite give the same effect as wait_for_all.
>
> Perhaps I might create a wrapper script around /etc/init.d/openais.
>
> Thanks,
> Jorge

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
Re: [ClusterLabs] "0 Nodes configured" in crm_mon
> On 25 Aug 2015, at 1:45 am, Stanislav Kopp wrote:
>
> Hi all,
>
> I'm trying to run corosync2 + pacemaker setup on Debian Jessie (only
> for testing purpose), I've successfully compiled all components using
> this guide: http://clusterlabs.org/wiki/Compiling_on_Debian
>
> Unfortunately, if I run "crm_mon" I don't see any nodes.
>
> ###
> Last updated: Mon Aug 24 17:36:00 2015
> Last change: Mon Aug 24 17:17:42 2015
> Current DC: NONE
> 0 Nodes configured
> 0 Resources configured
>
> I don't see any errors in corosync log either: http://pastebin.com/bJX66B9e

really?

Aug 24 17:16:10 [1723] pm1 crmd: error: cluster_connect_quorum: Corosync quorum is not configured

Looks like you forgot to uncomment:

#provider: corosync_votequorum

> This is my corosync.conf
>
> ###
>
> # Please read the corosync.conf.5 manual page
> totem {
>     version: 2
>
>     crypto_cipher: none
>     crypto_hash: none
>
>     interface {
>         ringnumber: 0
>         bindnetaddr: 192.168.122.0
>         mcastport: 5405
>         ttl: 1
>     }
>     transport: udpu
> }
>
> logging {
>     fileline: off
>     to_logfile: yes
>     to_syslog: no
>     logfile: /var/log/cluster/corosync.log
>     debug: off
>     timestamp: on
>     logger_subsys {
>         subsys: QUORUM
>         debug: off
>     }
> }
>
> nodelist {
>     node {
>         ring0_addr: 192.168.122.172
>         #nodeid: 1
>     }
>
>     node {
>         ring0_addr: 192.168.122.113
>         #nodeid: 2
>     }
> }
>
> quorum {
>     # Enable and configure quorum subsystem (default: off)
>     # see also corosync.conf.5 and votequorum.5
>     #provider: corosync_votequorum
> }
>
> used components:
>
> pacemaker: 1.1.12
> corosync: 2.3.5
> libqb: 0.17.1
>
> Did I miss something?
>
> Thanks!
> Stan
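The fix being pointed at is uncommenting the quorum provider. A corrected quorum section for this two-node test setup might look like the following; the two_node setting is an assumption for a two-node cluster, not something from the thread:

```
quorum {
    provider: corosync_votequorum
    two_node: 1
}
```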
Re: [ClusterLabs] [ClusterLabs Developers] Resource Agent language discussion
> On 21 Aug 2015, at 2:21 am, Jehan-Guillaume de Rorthais wrote:
>
> On Thu, 20 Aug 2015 15:05:24 +1000
> Andrew Beekhof wrote:
>
>>
>>> On 19 Aug 2015, at 6:59 pm, Jehan-Guillaume de Rorthais wrote:
>>>
>>> On Mon, 17 Aug 2015 09:42:35 +1000
>>> Andrew Beekhof wrote:
>>>
> On 11 Aug 2015, at 5:34 pm, Jehan-Guillaume de Rorthais wrote:
>
> On Tue, 11 Aug 2015 11:30:03 +1000
> Andrew Beekhof wrote:
>
>>> [...]
>
>> You can and should use whatever language you like for your own private
>> RAs. But if you want it accepted and maintained by the resource-agents
>> project, you would be advised to use the language they have standardised
>> on.
>
> Well, let's imagine our RA was written in bash (in fact, we have a bash
> version pretty close to the current Perl version we abandoned). I wonder
> whether it would be accepted into the resource-agents project anyway, as
> another one already exists there. I can easily list the reasons we wrote
> a new one, but that is not the subject here.
>
> The discussion here is more about the language, whether I should extract
> an ocf-perl module from my RA, and whether there is any chance the
> resource-agents project would accept it.

Well, it depends on the reasons you didn't list :-)

>>>
>>> Ok, let's answer the questions then :)
>>>
>>> The first questions any maintainer is going to ask are:
>>> - why did you write a new one?
>>>
>>> About the existing pgsql RA:
>>> * it supports stateless AND multistate pgsql resources. This makes the
>>>   code bigger, more complex, and harder to follow and understand
>>> * some params are for multistate usage only, some others for stateless
>>>   only, some for both, making the configuration harder to understand
>>> * some params are required for multistate where a recent PostgreSQL can
>>>   live without them (restore_command)
>>> * it performs operations an RA should not take care of (switching from
>>>   synchronous to asynchronous replication on the master if slaves are
>>>   gone, killing all existing xacts)
>>> * ...and this makes the code even bigger and more complex again
>>> * it supports too many options and has some conventions the DBAs should
>>>   handle themselves. This makes it way too complex and touchy to set up
>>>   and maintain
>>> * it does not support demote, making the code lie to Pacemaker about the
>>>   real state of the resource. This was because demote/switchover was
>>>   unsafe for PostgreSQL < 9.3.
>>>
>>> What we tried to achieve with a new pgsql RA:
>>> * multistate only (we already have a stateless RA, in bash)
>>> * it should have simple code: easier to understand and maintain,
>>>   achieving one goal at a time
>>> * be simple to set up
>>> * it should not substitute itself for the DBA
>>> * support safe ("cold") demote/switchover
>>>
>>> - can we merge this with the old one?
>>>
>>> Well, it would make the code even bigger, maybe conflicting and harder
>>> to understand. I can already hear the questions about such a
>>> Frankenstein RA ("why am I able to set up two different multistate
>>> architectures?", "why does this one not support this parameter?",
>>> "should I create my recovery.conf or not?")
>>>
>>> Some of our ideas could be merged into the old one, though; we could
>>> discuss and help the maintainers if they are interested and have time.
>>> But we only have limited R&D time and have no time to lead such a
>>> development.
>>>
>>> - can the new one replace the old one? (ie. full superset)
>>>
>>> No. It does not support stateless resources, does not mess with
>>> replication synchronism, does not kill queries if all the slaves are
>>> gone, does not "lock" an instance when it failed, only promotes the
>>> resource using "pg_ctl promote" (with no restart), ...
>>>
>>> Because if both are included, then they will forevermore be answering
>>> the question "which one should I use?".
>>>
>>> True.
>>>
>>> Basically, if you want it accepted upstream, then yes, you probably want
>>> to ditch the Perl bit. But not having seen the agent or knowing why it
>>> exists, it's hard to say.
>>>
>>> Well, it seems our RA will not make it to the upstream repository,
>>
>> You made a fairly reasonable argument for separate stateless and stateful
>> variants.
>
> BTW, how are other official RAs dealing with this?

They're not, but in this case it seems there could be good reasons to
separate them.

> A quick look at RA names seems to reveal that no service has dedicated
> stateless and stateful RA scripts.
>
> [...]
>
>>> What I was discussing here was:
>>>
>>> * if not using bash, are there any traps we should avoid that are
>>>   already addressed in the ocf-shellfuncs library?
>>
>> No, you just might have to re-implement some things.
>> Particularly logging.
>
> Ok, that was my conclusion so far. I'll have a look at the logging funcs
> then.
>
>>> * is there a chance a Perl version of such a library would be accepted
>>>   upstream?
>>
>> Depends if you're
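[Editor's sketch] The logging point above is the kind of thing a non-shell RA would have to re-implement. The snippet below is a rough, simplified sketch of what an ocf_log-style helper does; the `HA_LOGFACILITY` / `OCF_RESOURCE_INSTANCE` handling is my reading of the OCF shell conventions, and the real ocf-shellfuncs covers many more cases (log files, debug logs, heartbeat-era loggers):

```shell
#!/bin/sh
# Minimal sketch of an ocf_log-style helper (assumed conventions, not the
# real ocf-shellfuncs implementation).
ra_log() {
    severity="$1"; shift
    if [ -n "${HA_LOGFACILITY:-}" ]; then
        # The cluster configured a syslog facility: log through syslog.
        logger -t "${OCF_RESOURCE_INSTANCE:-unknown-RA}" \
               -p "${HA_LOGFACILITY}.${severity}" -- "$*"
    else
        # Fall back to stderr so the message lands in the daemon's log.
        printf '%s [%s]: %s\n' "${OCF_RESOURCE_INSTANCE:-unknown-RA}" \
               "$severity" "$*" >&2
    fi
}

ra_log info "demo: starting resource"
```

A Perl RA would need an equivalent routine (plus the exit-code constants) before it could behave like the shipped shell agents.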
Re: [ClusterLabs] Pacemaker tries to demote resource that isn't running and returns OCF_FAILED_MASTER
> On 21 Aug 2015, at 1:32 pm, Andrei Borzenkov wrote:
>
> 21.08.2015 00:35, Brian Campbell writes:
>> I have a master/slave resource (with a custom resource agent) which,
>> if it was shut down uncleanly, will return OCF_FAILED_MASTER on the next
>> "monitor" operation. This seems to be what
>> http://www.linux-ha.org/doc/dev-guides/_literal_ocf_failed_master_literal_9.html
>> suggests that exit code should be used for.
>>
>> After the node is fenced and comes up again, Pacemaker probes all of
>> the resources. It gets the OCF_FAILED_MASTER exit code and decides
>> that it needs to demote the resource, so it executes the demote
>> action. My resource agent returns an error on a demote action if it is
>> not running, which seems to be the suggested behavior according to
>> http://www.linux-ha.org/doc/dev-guides/_literal_demote_literal_action.html
>>
>> This then causes Pacemaker to log a failure for the "demote" action,
>> and then try to recover by stopping (which succeeds cleanly because
>> the resource is stopped) followed by starting it again (which again
>> succeeds, as we can start in slave mode from a failed state). So the
>> end state is correct, but crm_mon shows a failed action that you need
>> to clear out:
>>
>> Failed actions:
>>
>> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_demote_0
>> (node=es-efs-master2, call=73, rc=1, status=complete,
>> last-rc-change=Thu Aug 20 12:52:21 2015, queued=54ms, exec=1ms):
>> unknown error
>>
>> I'm curious about whether the behavior of my resource agent is
>> correct. Should I not be returning OCF_FAILED_MASTER on the
>> "monitor" operation if the resource isn't started?
>
> Correct. If a resource is not started it cannot be master or slave; it
> can become master only after Pacemaker has requested it. An unexpected
> master would be just the same kind of error.
>
> If you can determine that one resource instance is more suitable to
> become master than another, you should set its master score accordingly
> so that Pacemaker will promote the correct instance.
>
>> Or should the
>> "demote" operation do something different in this state, like actually
>> starting up the slave?
>
> In general, if the current resource state is the same as it would be
> after the operation completed, there is absolutely no reason to return
> an error; just pretend the operation succeeded.

Always return the actual state, i.e. OCF_NOT_RUNNING in these two cases.
Only return OCF_FAILED_MASTER if you know enough to say that it is in the
master state (i.e. via a lock file or a similar mechanism) but not able to
handle requests.

>
>> It seems like the behavior of Pacemaker is different from what's
>> documented in the resource agent guide, so I'm trying to figure out
>> whether this is a bug in my resource agent, a bug in Pacemaker, a
>> misunderstanding on my part, or actually intended behavior.
>>
>> -- Brian

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
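[Editor's sketch] The advice above (report the actual state; reserve OCF_FAILED_MASTER for positive evidence of a broken master) can be sketched as a monitor function. Everything here is hypothetical scaffolding: the pidfile/lockfile layout and the `demo_*` names are illustrative, not code from any shipped agent.

```shell
#!/bin/sh
# OCF exit codes relevant to a multistate monitor.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7
OCF_RUNNING_MASTER=8
OCF_FAILED_MASTER=9

# Illustrative health probe; a real agent would query the service itself.
demo_healthy() { [ "${DEMO_HEALTHY:-yes}" = yes ]; }

# demo_monitor <pidfile> <master-lock-file>
demo_monitor() {
    pid=$(cat "$1" 2>/dev/null)
    if [ -z "$pid" ] || ! kill -0 "$pid" 2>/dev/null; then
        # Stopped is stopped, even if a stale master lock survived a crash:
        # a probe after fencing should report NOT_RUNNING, not FAILED_MASTER.
        return $OCF_NOT_RUNNING
    fi
    if [ -f "$2" ]; then
        # Positive evidence the instance is (or claims to be) master.
        demo_healthy && return $OCF_RUNNING_MASTER
        return $OCF_FAILED_MASTER   # master that cannot handle requests
    fi
    return $OCF_SUCCESS             # running as a slave
}
```

With this shape, the post-fencing probe in Brian's scenario returns OCF_NOT_RUNNING, so Pacemaker never schedules the spurious demote.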
Re: [ClusterLabs] Resources within a Group
Hi Jorge,

It sounds like "colocation" is what you want. Please take a look here:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-sets-colocation.html

> Hi,
>
> I'm on SLES 11 SP4 (Pacemaker 1.1.12) and still learning all this :)
> I'm wondering if there's a way to control the resource startup behaviour
> within a group?
>
> For example, I have an LVM resource (to activate a VG) and the next one:
> the Filesystem resource (to mount it). If the VG activation fails, I see
> errors afterwards from trying to mount the filesystem. Is there something
> like "if the first resource fails, stop further processing"? (sort of
> like one can control the stacking of PAM modules)
>
> Thanks,
> Jorge

--
Eric, Ren

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
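[Editor's sketch] For reference on the question above: a Pacemaker group already implies ordering and colocation between consecutive members, so with the default mandatory ordering a failed LVM start keeps the Filesystem from being started. The same constraints can also be spelled out explicitly, which lets you relax them per pair. A sketch in crm shell syntax, with illustrative resource names (`vg1`, `fs1`):

```shell
# Roughly what "group grp vg1 fs1" implies, written as explicit constraints:
# start fs1 only after vg1 has started successfully ...
crm configure order o-vg1-before-fs1 inf: vg1 fs1
# ... and only on the node where vg1 is running.
crm configure colocation c-fs1-with-vg1 inf: fs1 vg1
```

With a mandatory (inf:) order constraint, if `vg1` fails to start then `fs1` is never attempted, which is the "stop further processing" behaviour the question asks about.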