Re: [ClusterLabs] resource start after network reconnected

2021-11-21 Thread Strahil Nikolov via Users
You are right, but when the SBD disk has failed, I always focus on
recovering it as soon as possible. Once the disk is recovered and the watcher
detects it again, shutting down is possible.

And of course disk-based sbd is better than nothing.

Best Regards,
Strahil Nikolov

On Sun, Nov 21, 2021 at 8:47, Andrei Borzenkov wrote:
On 21.11.2021 00:39, Strahil Nikolov via Users wrote:
> Nope, as long as you use SBD's integration with pacemaker. As the 2 nodes can
> communicate with each other, sbd won't act. I think it was an entry like
> this in /etc/sysconfig/sbd: 'SBD_PACEMAKER=yes'
>  

That's correct except it is impossible to stop pacemaker on one node under this 
condition because the remaining node will immediately commit suicide. It is not 
even possible to perform normal cluster shutdown.

I wish SBD supported "deactivate" message to stop pretending that it knows 
better than administrator or - even better - understood that pacemaker is 
stopping intentionally. Currently there is no way around it (short of pkill -9 
sbd) because systemd unit refuses manual SBD stop.

>
> On Sat, Nov 20, 2021 at 23:24, Valentin Vidić via Users wrote:
> On Sat, Nov 20, 2021 at 08:33:26PM +, Strahil Nikolov via Users wrote:
>> You can also use this 3rd node to provide iSCSI and then the SBD will
>> be disk-full :D . The good thing about this type of setup is that you
>> won't need to put location constraints for the 3rd node.
> 
> Wouldn't that make the iSCSI node a SPOF? If the iSCSI goes down, SBD
> resets both cluster nodes.
> 
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 



Re: [ClusterLabs] resource start after network reconnected

2021-11-20 Thread Andrei Borzenkov
On 21.11.2021 00:39, Strahil Nikolov via Users wrote:
> Nope, as long as you use SBD's integration with pacemaker. As the 2 nodes can
> communicate with each other, sbd won't act. I think it was an entry like
> this in /etc/sysconfig/sbd: 'SBD_PACEMAKER=yes'
>  

That's correct except it is impossible to stop pacemaker on one node under this 
condition because the remaining node will immediately commit suicide. It is not 
even possible to perform normal cluster shutdown.

I wish SBD supported "deactivate" message to stop pretending that it knows 
better than administrator or - even better - understood that pacemaker is 
stopping intentionally. Currently there is no way around it (short of pkill -9 
sbd) because systemd unit refuses manual SBD stop.

>
> On Sat, Nov 20, 2021 at 23:24, Valentin Vidić via Users wrote:
> On Sat, Nov 20, 2021 at 08:33:26PM +, Strahil Nikolov via Users wrote:
>> You can also use this 3rd node to provide iSCSI and then the SBD will
>> be disk-full :D . The good thing about this type of setup is that you
>> won't need to put location constraints for the 3rd node.
> 
> Wouldn't that make the iSCSI node a SPOF? If the iSCSI goes down, SBD
> resets both cluster nodes.
> 
> 


Re: [ClusterLabs] resource start after network reconnected

2021-11-20 Thread Strahil Nikolov via Users
Nope, as long as you use SBD's integration with pacemaker. As the 2 nodes can
communicate with each other, sbd won't act. I think it was an entry like
this in /etc/sysconfig/sbd: 'SBD_PACEMAKER=yes'
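
For reference, a minimal sketch of what the relevant part of that file might look like. Only SBD_PACEMAKER=yes comes from this message; the watchdog path and the commented-out device path are assumptions for illustration and will differ per system (Debian-based systems typically keep the file under /etc/default/sbd).

```sh
# /etc/sysconfig/sbd (excerpt). Illustrative values, not taken from the thread.

# Let sbd take Pacemaker/corosync health into account instead of acting on
# the loss of the SBD disk alone.
SBD_PACEMAKER=yes

# Watchdog device that sbd arms (assumed default path).
SBD_WATCHDOG_DEV=/dev/watchdog

# Leave SBD_DEVICE unset for diskless (watchdog-only) operation, or point it
# at the shared disk, for example an iSCSI LUN (hypothetical path):
# SBD_DEVICE="/dev/disk/by-id/scsi-EXAMPLE-sbd"
```

The file is read when sbd starts, so a change normally only takes effect after the cluster stack on that node has been restarted.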
 
 
On Sat, Nov 20, 2021 at 23:24, Valentin Vidić via Users wrote:
On Sat, Nov 20, 2021 at 08:33:26PM +, Strahil Nikolov via Users wrote:
> You can also use this 3rd node to provide iSCSI and then the SBD will
> be disk-full :D . The good thing about this type of setup is that you
> won't need to put location constraints for the 3rd node.

Wouldn't that make the iSCSI node a SPOF? If the iSCSI goes down, SBD
resets both cluster nodes.

-- 
Valentin


Re: [ClusterLabs] resource start after network reconnected

2021-11-20 Thread Valentin Vidić via Users
On Sat, Nov 20, 2021 at 08:33:26PM +, Strahil Nikolov via Users wrote:
> You can also use this 3rd node to provide iSCSI and then the SBD will
> be disk-full :D . The good thing about this type of setup is that you
> won't need to put location constraints for the 3rd node.

Wouldn't that make the iSCSI node a SPOF? If the iSCSI goes down, SBD
resets both cluster nodes.

-- 
Valentin


Re: [ClusterLabs] resource start after network reconnected

2021-11-20 Thread Strahil Nikolov via Users
You can also use this 3rd node to provide iSCSI and then the SBD will be
disk-full :D . The good thing about this type of setup is that you won't
need to put location constraints for the 3rd node.


Also, check the ping resource: you can set it up to "kick out" all resources
from a node when pings to a specific IP (for example the gateway) fail. Once the
network is restored, the node automatically becomes eligible to host the
resources again.
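
The ping setup described above could be sketched with pcs roughly as follows. This is an assumption-laden illustration, not a tested configuration: the gateway address and the resource name ping-gw are hypothetical, spmVIP is borrowed from the configuration posted elsewhere in this thread, and the dampen/multiplier values are arbitrary. The commands are only printed unless APPLY=1 is set.

```sh
#!/bin/sh
# Sketch only: commands are printed, not executed, unless APPLY=1 is set.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then "$@"; fi
}

GATEWAY_IP="192.0.2.1"   # hypothetical gateway address (documentation range)
RESOURCE="spmVIP"        # resource name borrowed from the thread

# Cloned ping resource: every node pings the gateway and records the result
# in the node attribute "pingd" (the agent's default attribute name).
run pcs resource create ping-gw ocf:pacemaker:ping \
    host_list="$GATEWAY_IP" dampen=5s multiplier=1000 \
    op monitor interval=10s clone

# Keep the resource off any node whose pingd attribute is missing or zero.
run pcs constraint location "$RESOURCE" rule score=-INFINITY \
    pingd lt 1 or not_defined pingd
```

Because the constraint is evaluated against a node attribute, the node becomes eligible again as soon as the ping monitor updates pingd after connectivity returns; no manual cleanup is involved.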


Also consider a more advanced resource agent such as ocf:heartbeat:mysql, which
can control your MySQL/MariaDB database and also manage replication between a
primary and a secondary (a.k.a. master-slave replication).


Best Regards,
Strahil Nikolov
On Friday, 19 November 2021, 21:46:22 GMT+2, john tillman wrote:

> On Fri, Nov 19, 2021 at 11:26:01AM -0500, john tillman wrote:
>> Anyone have any other ideas for a configuration setting that will
>> effectively do whatever 'pcs resource refresh' is doing when quorum is
>> restored?
>
> Since you have three nodes you may want to use the third node as QDevice
> instead:
>
> https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-qdevice.html
>
> After that SBD can be configured in diskless mode to reset the node that
> loses quorum:
>
> https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-storage-protect.html#sec-ha-storage-protect-diskless-sbd
>

Thank you.  I'll look into using the Qdevice in the next release.  For
now, I just have the three nodes with "vanilla" cluster packages.

> --
> Valentin


Re: [ClusterLabs] resource start after network reconnected

2021-11-19 Thread Andrei Borzenkov
On 19.11.2021 20:45, Ken Gaillot wrote:
> On Fri, 2021-11-19 at 10:40 -0500, john tillman wrote:
> 
> 
> 
>>> If pacemaker tries to stop resources due to out of quorum
>>> condition, you
>>> could set suitable failure-timeout; this will be equivalent to
>>> using "pcs
>>> resource refresh". Keep in mind that pacemaker only checks for
>>> failure-timeout expiration every cluster-recheck-interval (15 
> 
> That's true only for Pacemaker versions less than 2.0.3; since 2.0.3,
> the cluster rechecks as soon as the timeout hits.
> 

Indeed. Thank you!


Re: [ClusterLabs] resource start after network reconnected

2021-11-19 Thread Andrei Borzenkov
On 19.11.2021 19:26, john tillman wrote:
...
>>>
>>> If pacemaker tries to stop resources due to out of quorum condition, you
>>> could set suitable failure-timeout; this will be equivalent to using
>>> "pcs
>>> resource refresh". Keep in mind that pacemaker only checks for
>>> failure-timeout expiration every cluster-recheck-interval (15 minutes by
>>> default). This still is not directly related to network availability,
>>> but
>>> if network outage resulted in node going out of quorum, when network is
>>> back and node joined cluster again it will allow resources to be started
>>> on node.
>>>
>>
>> When quorum is lost I want all the resources to stop.  The cluster is
>> performing this step correctly for me.
>>
>> That cluster-recheck-interval would explain the intermittence I saw this
>> morning.  If I set that to 1 minute would that cause any gross negative
>> issues?
>>
> 
> 
> I tried setting cluster-recheck-interval to 1 minute and I saw no change
> to the resources after reconnecting the network.  They were still listed
> as However, "pcs resource refresh" started it, as usual in this scenario.
> 
> Anyone have any other ideas for a configuration setting that will
> effectively do whatever 'pcs resource refresh' is doing when quorum is
> restored?
> 

I already told you above and it most certainly works here.

Without failure-timeout resource is stuck in blocked state:

Cluster Summary:
  * Stack: corosync
  * Current DC: ha1 (version 2.1.0+20210816.c6a4f6e6c-1.1-2.1.0+20210816.c6a4f6e6c) - partition with quorum
  * Last updated: Sat Nov 20 10:48:48 2021
  * Last change:  Sat Nov 20 10:46:55 2021 by root via cibadmin on ha1
  * 3 nodes configured
  * 3 resource instances configured (1 BLOCKED from further action due to failure)

Node List:
  * Online: [ ha1 ha2 qnetd ]

Full List of Resources:
  * Clone Set: cln_Test [rsc_Test]:
    * rsc_Test  (ocf::_local:Dummy): FAILED ha1 (blocked)
    * Started: [ ha2 ]
    * Stopped: [ qnetd ]

Operations:
  * Node: ha2:
    * rsc_Test: migration-threshold=100:
      * (10) start
      * (11) monitor: interval="1ms"
  * Node: ha1:
    * rsc_Test: migration-threshold=100 fail-count=100 last-failure='Sat Nov 20 10:47:14 2021':
      * (18) start
      * (30) stop

Failed Resource Actions:
  * rsc_Test_stop_0 on ha1 'error' (1): call=30, status='complete', exitreason='forced to fail stop operation', last-rc-change='2021-11-20 10:47:14 +03:00', queued=0ms, exec=27ms



With failure-timeout resource is restarted after expiration.

Cluster Summary:
  * Stack: corosync
  * Current DC: ha1 (version 2.1.0+20210816.c6a4f6e6c-1.1-2.1.0+20210816.c6a4f6e6c) - partition with quorum
  * Last updated: Sat Nov 20 10:53:51 2021
  * Last change:  Sat Nov 20 10:50:37 2021 by root via cibadmin on ha2
  * 3 nodes configured
  * 3 resource instances configured

Node List:
  * Online: [ ha1 ha2 qnetd ]

Full List of Resources:
  * Clone Set: cln_Test [rsc_Test]:
    * Started: [ ha1 ha2 ]
    * Stopped: [ qnetd ]

Operations:
  * Node: ha2:
    * rsc_Test: migration-threshold=100:
      * (18) probe
      * (18) probe
      * (19) monitor: interval="1ms"
  * Node: ha1:
    * rsc_Test: migration-threshold=100:
      * (40) probe
      * (40) probe
      * (41) monitor: interval="1ms"


Configuration:

node 1: ha1 \
        attributes pingd=1 \
        utilization cpu=20
node 2: ha2 \
        attributes pingd=1 \
        utilization cpu=20
node 3: qnetd
primitive rsc_Test ocf:_local:Dummy \
        meta failure-timeout=30s \
        op monitor interval=10s
clone cln_Test rsc_Test
location not_on_qnetd cln_Test -inf: qnetd
property cib-bootstrap-options: \
        cluster-infrastructure=corosync \
        cluster-name=ha \
        dc-version="2.1.0+20210816.c6a4f6e6c-1.1-2.1.0+20210816.c6a4f6e6c" \
        last-lrm-refresh=1637394576 \
        stonith-enabled=false \
        have-watchdog=true \
        stonith-watchdog-timeout=0 \
        placement-strategy=balanced



Re: [ClusterLabs] resource start after network reconnected

2021-11-19 Thread Ken Gaillot
On Fri, 2021-11-19 at 14:57 -0500, john tillman wrote:
> > On Fri, 2021-11-19 at 10:40 -0500, john tillman wrote:
> > 
> > 
> > 
> > > > If pacemaker tries to stop resources due to out of quorum
> > > > condition, you
> > > > could set suitable failure-timeout; this will be equivalent to
> > > > using "pcs
> > > > resource refresh". Keep in mind that pacemaker only checks for
> > > > failure-timeout expiration every cluster-recheck-interval (15
> > 
> > That's true only for Pacemaker versions less than 2.0.3; since
> > 2.0.3,
> > the cluster rechecks as soon as the timeout hits.
> 
> I'm using pacemaker 2.0.5 and it is *not* starting MySQL when quorum
> is
> restored, at least not every time (~1 in 10).  So I have seen it work

That's due to a stop failure, not the recheck interval

> before but I'm more willing to believe that there was a user error in
> that
> one successful sample.
> 
> We (actually a teammate) got mysql to start when quorum is
> restored.  It
> required both setting the cluster-recheck-interval to something more
> frequent than 15min  and  setting the mysql resource's failure-
> timeout to
> non-zero.  In our case we set both to 1 minute with good results for
> the
> last few tests.  We can raise the frequency to something greater than
> 1
> but for our tests, 1 proves it out.

The failure-timeout is equivalent to running refresh when the timeout
hits. The cluster will then re-probe the status of the resource and
decide what, if anything, needs to be done about it.

I can only see that working if the stop failure is transient -- i.e.,
either the stop actually succeeded but returned a failure code (or
maybe timed out), and when the failure timeout or refresh happens, the
re-probe sees the database is actually not running; or the stop really
does fail, but by the time the failure timeout or refresh happens,
another stop attempt after the re-probe is able to succeed.
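
As a rough sketch of the above with pcs 0.10, assuming the spmDB resource name from this thread and an arbitrary 60-second value (the right timeout depends on how long the stop failure takes to clear). The commands are only printed unless APPLY=1 is set.

```sh
#!/bin/sh
# Sketch only: commands are printed, not executed, unless APPLY=1 is set.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then "$@"; fi
}

RESOURCE="spmDB"   # primitive name taken from the thread
TIMEOUT="60s"      # illustrative value, not a recommendation

# Expire recorded failures after $TIMEOUT, which has the same effect as
# running "pcs resource refresh" at that moment.
run pcs resource update "$RESOURCE" meta failure-timeout="$TIMEOUT"

# Show the failures currently recorded for the resource.
run pcs resource failcount show "$RESOURCE"
```

This only helps in the transient case described above. If the stop keeps failing, the resource will be blocked again after every expiry.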

> 
> 
> > > > minutes by
> > > > default). This still is not directly related to network
> > > > availability, but
> > > > if network outage resulted in node going out of quorum, when
> > > > network is
> > > > back and node joined cluster again it will allow resources to
> > > > be
> > > > started
> > > > on node.
> > > > 
> > > 
> > > When quorum is lost I want all the resources to stop.  The
> > > cluster is
> > > performing this step correctly for me.
> > 
> > As long as it's working properly. If quorum is lost because one of
> > the
> > nodes is malfunctioning -- maybe a device driver locked up the
> > system,
> > or CPU wait is horrific due to an out-of-control process or disk
> > failure -- then that node will not know quorum has been lost and
> > will
> > not stop resources. If the condition then clears up, suddenly you
> > have
> > split-brain with two nodes running resources.
> > 
> > > That cluster-recheck-interval would explain the intermittence I
> > > saw
> > > this
> > > morning.  If I set that to 1 minute would that cause any gross
> > > negative
> > > issues?
> > 
> > It increases CPU usage and IPC traffic. For Pacemaker 2.0.3 or
> > later, I
> > definitely wouldn't bother. For older versions, 1 minute feels a
> > bit
> > much, I would go with around 5.
> > 
> > > Is there another setting besides cluster-recheck-interval to
> > > consider
> > > adjusting to start mysql when quorum is returned?
> > > 
> > > Thank you for the feedback.
> > > 
> > > -John
> > 
> > --
> > Ken Gaillot 
> > 
-- 
Ken Gaillot 



Re: [ClusterLabs] resource start after network reconnected

2021-11-19 Thread john tillman
> On Fri, 2021-11-19 at 10:40 -0500, john tillman wrote:
>
> 
>
>> > If pacemaker tries to stop resources due to out of quorum
>> > condition, you
>> > could set suitable failure-timeout; this will be equivalent to
>> > using "pcs
>> > resource refresh". Keep in mind that pacemaker only checks for
>> > failure-timeout expiration every cluster-recheck-interval (15
>
> That's true only for Pacemaker versions less than 2.0.3; since 2.0.3,
> the cluster rechecks as soon as the timeout hits.


I'm using pacemaker 2.0.5 and it is *not* starting MySQL when quorum is
restored, at least not every time (~1 in 10).  So I have seen it work
before but I'm more willing to believe that there was a user error in that
one successful sample.

We (actually a teammate) got mysql to start when quorum is restored.  It
required both setting the cluster-recheck-interval to something more
frequent than 15min  and  setting the mysql resource's failure-timeout to
non-zero.  In our case we set both to 1 minute with good results for the
last few tests.  We can raise the frequency to something greater than 1
but for our tests, 1 proves it out.


>
>> > minutes by
>> > default). This still is not directly related to network
>> > availability, but
>> > if network outage resulted in node going out of quorum, when
>> > network is
>> > back and node joined cluster again it will allow resources to be
>> > started
>> > on node.
>> >
>>
>> When quorum is lost I want all the resources to stop.  The cluster is
>> performing this step correctly for me.
>
> As long as it's working properly. If quorum is lost because one of the
> nodes is malfunctioning -- maybe a device driver locked up the system,
> or CPU wait is horrific due to an out-of-control process or disk
> failure -- then that node will not know quorum has been lost and will
> not stop resources. If the condition then clears up, suddenly you have
> split-brain with two nodes running resources.
>
>>
>> That cluster-recheck-interval would explain the intermittence I saw
>> this
>> morning.  If I set that to 1 minute would that cause any gross
>> negative
>> issues?
>
> It increases CPU usage and IPC traffic. For Pacemaker 2.0.3 or later, I
> definitely wouldn't bother. For older versions, 1 minute feels a bit
> much, I would go with around 5.
>
>>
>> Is there another setting besides cluster-recheck-interval to consider
>> adjusting to start mysql when quorum is returned?
>>
>> Thank you for the feedback.
>>
>> -John
>
> --
> Ken Gaillot 
>


Re: [ClusterLabs] resource start after network reconnected

2021-11-19 Thread john tillman
> On Fri, Nov 19, 2021 at 11:26:01AM -0500, john tillman wrote:
>> Anyone have any other ideas for a configuration setting that will
>> effectively do whatever 'pcs resource refresh' is doing when quorum is
>> restored?
>
> Since you have three nodes you may want to use the third node as QDevice
> instead:
>
> https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-qdevice.html
>
> After that SBD can be configured in diskless mode to reset the node that
> loses quorum:
>
> https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-storage-protect.html#sec-ha-storage-protect-diskless-sbd
>

Thank you.  I'll look into using the Qdevice in the next release.  For
now, I just have the three nodes with "vanilla" cluster packages.

> --
> Valentin


Re: [ClusterLabs] resource start after network reconnected

2021-11-19 Thread Valentin Vidić via Users
On Fri, Nov 19, 2021 at 11:26:01AM -0500, john tillman wrote:
> Anyone have any other ideas for a configuration setting that will
> effectively do whatever 'pcs resource refresh' is doing when quorum is
> restored?

Since you have three nodes you may want to use the third node as QDevice
instead:

https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-qdevice.html

After that SBD can be configured in diskless mode to reset the node that
loses quorum:

https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-storage-protect.html#sec-ha-storage-protect-diskless-sbd
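
The linked documents use crmsh; on a pcs-based cluster the same idea might look roughly like the sketch below. It rests on several assumptions: the host name is hypothetical, the host is already running corosync-qnetd and has been removed from the cluster as a full member, every cluster node has a working watchdog device, and the timeout value is only an example. The commands are only printed unless APPLY=1 is set.

```sh
#!/bin/sh
# Sketch only: commands are printed, not executed, unless APPLY=1 is set.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then "$@"; fi
}

QNETD_HOST="qnetd.example.com"   # hypothetical host running corosync-qnetd
WATCHDOG_TIMEOUT="10s"           # illustrative value

# Point the two remaining cluster nodes at the quorum device.
run pcs quorum device add model net host="$QNETD_HOST" algorithm=ffsplit

# Enable SBD without a shared disk (watchdog only). pcs warns that the
# cluster has to be restarted before this takes effect.
run pcs stonith sbd enable

# Tell Pacemaker how long to wait before assuming that a node which lost
# quorum has been reset by its watchdog, and turn fencing on.
run pcs property set stonith-watchdog-timeout="$WATCHDOG_TIMEOUT"
run pcs property set stonith-enabled=true
```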

-- 
Valentin


Re: [ClusterLabs] resource start after network reconnected

2021-11-19 Thread Ken Gaillot
On Fri, 2021-11-19 at 10:40 -0500, john tillman wrote:



> > If pacemaker tries to stop resources due to out of quorum
> > condition, you
> > could set suitable failure-timeout; this will be equivalent to
> > using "pcs
> > resource refresh". Keep in mind that pacemaker only checks for
> > failure-timeout expiration every cluster-recheck-interval (15 

That's true only for Pacemaker versions less than 2.0.3; since 2.0.3,
the cluster rechecks as soon as the timeout hits.

> > minutes by
> > default). This still is not directly related to network
> > availability, but
> > if network outage resulted in node going out of quorum, when
> > network is
> > back and node joined cluster again it will allow resources to be
> > started
> > on node.
> > 
> 
> When quorum is lost I want all the resources to stop.  The cluster is
> performing this step correctly for me.

As long as it's working properly. If quorum is lost because one of the
nodes is malfunctioning -- maybe a device driver locked up the system,
or CPU wait is horrific due to an out-of-control process or disk
failure -- then that node will not know quorum has been lost and will
not stop resources. If the condition then clears up, suddenly you have
split-brain with two nodes running resources.

> 
> That cluster-recheck-interval would explain the intermittence I saw
> this
> morning.  If I set that to 1 minute would that cause any gross
> negative
> issues?

It increases CPU usage and IPC traffic. For Pacemaker 2.0.3 or later, I
definitely wouldn't bother. For older versions, 1 minute feels a bit
much, I would go with around 5.
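
The version rule above can be written down as a small shell helper. This is only an illustration: the function names are made up for this example, the comparison relies on the -V (version sort) option of GNU sort, and the 5-minute value is simply the figure suggested above.

```sh
#!/bin/sh
# version_ge A B: succeed when dotted version A is >= B.
# Assumes a "sort" that supports -V (GNU coreutils does).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# recheck_advice VERSION: print what to do about cluster-recheck-interval
# for the given Pacemaker version.
recheck_advice() {
    if version_ge "$1" "2.0.3"; then
        # 2.0.3 and later recheck as soon as a failure-timeout expires.
        echo "leave cluster-recheck-interval at its default"
    else
        # Older versions only notice the expiry at the next recheck.
        echo "pcs property set cluster-recheck-interval=5min"
    fi
}

recheck_advice "2.0.5"   # the version reported in this thread
```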

> 
> Is there another setting besides cluster-recheck-interval to consider
> adjusting to start mysql when quorum is returned?
> 
> Thank you for the feedback.
> 
> -John

-- 
Ken Gaillot 



Re: [ClusterLabs] resource start after network reconnected

2021-11-19 Thread john tillman
>> On 19.11.2021 17:36, john tillman wrote:
>>>> On 18.11.2021 22:33, john tillman wrote:
>>>>>
>>>>> Greetings all,
>>>>>
>>>>> preamble: RHEL8, PCS 0.10.8, COROSYNC 3.1.0, PACEMAKER 2.0.5
>>>>>
>>>>> I have a mysql resource, cloned, that is behaving the way I wanted.  When
>>>>> the node it is on is unplugged from the network quorum is lost and the
>>>>> mysqld service stops.  Great.  Oh, and fencing is disabled.
>>>>>
>>>>> When the network connectivity is restored I'd like it to restart but it
>>>>> doesn't.  What needs to be done to make this happen automatically?  Or
>>>>> what section of the doc should I reread more thoroughly?
>>>>>
>>>>> When mysql is stopped because of the above, if I run "pcs resource
>>>>> refresh" it starts.  Any ideas why the "refresh" would do that?
>>>>>
>>>>
>>>> You provided zero information about your setup and how you configured
>>>> pacemaker to stop mysqld on network connectivity loss, so it is rather
>>>> hard to guess.
>>>>
>>>> Logs covering period when you unplug network, and later plug again,
>>>> could also be helpful.
>>>>
>>>
>>> Fair point.  I didn't want to put too much into the first email.  There
>>> are 3 nodes but 2 nodes are actually used for processing and the 3rd
>>> node
>>> is there just for quorum purposes.  When quorum is lost my resources
>>> stop.
>>>  There are 3 resources: a VIP, MySQL service, and controld (a project
>>> specific service).
>>>
>>> And this problem has now become intermittent as 1 in 4 tests this
>>> morning
>>> succeeded in starting mysqld when the network was reconnected.  Figures
>>> :-/
>>>
>>> More info.  After reconnecting the network on spm238 the mysql resource
>>> was listed as:
>>>   * spmDB   (systemd:mysqld):   FAILED spm238 (blocked)
>>>
>>> This was cleared and mysqld started after issuing a "pcs resource
>>> refresh".
>>>
>>
>> pcs resource refresh deletes failure history so pacemaker tries to start
>> resource again. It is completely unrelated to network interface
>> conditions.
>>
>> "blocked" is default when resource stop operation fails and stonith is
>> disabled.
>>
>>> So as requested here's how I set up my cluster.  It's copied from an
>>> ansible playbook so there are some variables shown but should be easy
>>> enough to understand.  If not, I will gladly clarify anything.
>>>
>>> My 3 resources:
>>>
>>> pcs resource create spmVIP ocf:heartbeat:IPaddr2 ip={{ spmvip }}
>>> cidr_netmask=24 op monitor interval=10s
>>> pcs resource create spmControl systemd:controld op monitor interval=10s
>>> pcs resource create spmDB systemd:mysqld op monitor interval=10s clone
>>>
>>> My constraints:
>>> pcs constraint colocation add spmControl with spmVIP INFINITY
>>> pcs constraint colocation add spmVIP with spmDB-clone 200
>>> crm_resource -r spmVIP -p resource-stickiness -m -v 100
>>> crm_resource -r spmControl -p resource-stickiness -m -v 100
>>>
>>> Don't run resources on the quorum only node:
>>> pcs constraint location spmVIP avoids {{ QOnlynode }}=INFINITY
>>> pcs constraint location spmControl avoids {{ QOnlynode }}=INFINITY
>>> pcs constraint location spmDB-clone avoids {{ QOnlynode }}=INFINITY
>>>
>>
>> I have no idea what QOnlynode means here.
>>
>
> This is the quorum-only node of mine.  Resources are not run on it and the
> 3 constraints above are how I configured that.
>
>>> and stonith is false:
>>> pcs property set stonith-enabled=false
>>>
>>
>> I do not see anything in your configuration that would cause mysql to be
>> stopped on network connectivity issues. Either mysql does it on its own,
>> or pacemaker attempts to stop all resources on node when it goes out of
>> quorum.
>>
>> If mysql does it on its own, there is nothing that can be done from
>> pacemaker side. Pacemaker is not aware of network state at all and
>> certainly cannot initiate actions when network becomes available.
>>
>> If pacemaker tries to stop resources due to out of quorum condition, you
>> could set suitable failure-timeout; this will be equivalent to using
>> "pcs
>> resource refresh". Keep in mind that pacemaker only checks for
>> failure-timeout expiration every cluster-recheck-interval (15 minutes by
>> default). This still is not directly related to network availability,
>> but
>> if network outage resulted in node going out of quorum, when network is
>> back and node joined cluster again it will allow resources to be started
>> on node.
>>
>
> When quorum is lost I want all the resources to stop.  The cluster is
> performing this step correctly for me.
>
> That cluster-recheck-interval would explain the intermittence I saw this
> morning.  If I set that to 1 minute would that cause any gross negative
> issues?
>


I tried setting cluster-recheck-interval to 1 minute and I saw no change
to the resources after reconnecting the network.  They were still listed
as However, "pcs resource refresh" started it, as usual in this scenario.

Anyone have any other ideas for a configuration setting that will

Re: [ClusterLabs] resource start after network reconnected

2021-11-19 Thread john tillman
> On 19.11.2021 17:36, john tillman wrote:
>>> On 18.11.2021 22:33, john tillman wrote:

>>>> Greetings all,
>>>>
>>>> preamble: RHEL8, PCS 0.10.8, COROSYNC 3.1.0, PACEMAKER 2.0.5
>>>>
>>>> I have a mysql resource, cloned, that is behaving the way I wanted.  When
>>>> the node it is on is unplugged from the network quorum is lost and the
>>>> mysqld service stops.  Great.  Oh, and fencing is disabled.
>>>>
>>>> When the network connectivity is restored I'd like it to restart but it
>>>> doesn't.  What needs to be done to make this happen automatically?  Or
>>>> what section of the doc should I reread more thoroughly?
>>>>
>>>> When mysql is stopped because of the above, if I run "pcs resource
>>>> refresh" it starts.  Any ideas why the "refresh" would do that?
>>>>
>>>
>>> You provided zero information about your setup and how you configured
>>> pacemaker to stop mysqld on network connectivity loss, so it is rather
>>> hard to guess.
>>>
>>> Logs covering period when you unplug network, and later plug again,
>>> could
>>> be also helpful.
>>>
>>
>> Fair point.  I didn't want to put too much into the first email.  There
>> are 3 nodes but 2 nodes are actually used for processing and the 3rd
>> node
>> is there just for quorum purposes.  When quorum is lost my resources
>> stop.
>>  There are 3 resources: a VIP, MySQL service, and controld (a project
>> specific service).
>>
>> And this problem has now become intermittent as 1 in 4 tests this
>> morning
>> succeeded in starting mysqld when the network was reconnected.  Figures
>> :-/
>>
>> More info.  After reconnecting the network on spm238 the mysql resource
>> was listed as:
>>   * spmDB   (systemd:mysqld):   FAILED spm238 (blocked)
>>
>> This was cleared and mysqld started after issuing a "pcs resource
>> refresh".
>>
>
> pcs resource refresh deletes failure history so pacemaker tries to start
> resource again. It is completely unrelated to network interface
> conditions.
>
> "blocked" is default when resource stop operation fails and stonith is
> disabled.
>
>> So as requested here's how I set up my cluster.  It's copied from an
>> ansible playbook so there are some variables shown but should be easy
>> enough to understand.  If not, I will gladly clarify anything.
>>
>> My 3 resources:
>>
>> pcs resource create spmVIP ocf:heartbeat:IPaddr2 ip={{ spmvip }}
>> cidr_netmask=24 op monitor interval=10s
>> pcs resource create spmControl systemd:controld op monitor interval=10s
>> pcs resource create spmDB systemd:mysqld op monitor interval=10s clone
>>
>> My constraints:
>> pcs constraint colocation add spmControl with spmVIP INFINITY
>> pcs constraint colocation add spmVIP with spmDB-clone 200
>> crm_resource -r spmVIP -p resource-stickiness -m -v 100
>> crm_resource -r spmControl -p resource-stickiness -m -v 100
>>
>> Don't run resources on the quorum only node:
>> pcs constraint location spmVIP avoids {{ QOnlynode }}=INFINITY
>> pcs constraint location spmControl avoids {{ QOnlynode }}=INFINITY
>> pcs constraint location spmDB-clone avoids {{ QOnlynode }}=INFINITY
>>
>
> I have no idea what QOnlynode means here.
>

This is the quorum-only node of mine.  Resources are not run on it and the
3 constraints above are how I configured that.

>> and stonith is false:
>> pcs property set stonith-enabled=false
>>
>
> I do not see anything in your configuration that would cause mysql to be
> stopped on network connectivity issues. Either mysql does it on its own,
> or pacemaker attempts to stop all resources on the node when it goes out
> of quorum.
>
> If mysql does it on its own, there is nothing that can be done from the
> pacemaker side. Pacemaker is not aware of network state at all and
> certainly cannot initiate actions when the network becomes available.
>
> If pacemaker tries to stop resources due to an out-of-quorum condition,
> you could set a suitable failure-timeout; this will be equivalent to using
> "pcs resource refresh". Keep in mind that pacemaker only checks for
> failure-timeout expiration every cluster-recheck-interval (15 minutes by
> default). This still is not directly related to network availability, but
> if a network outage resulted in the node going out of quorum, then once
> the network is back and the node has rejoined the cluster, it will allow
> resources to be started on the node.
>

When quorum is lost I want all the resources to stop.  The cluster is
performing this step correctly for me.

That cluster-recheck-interval would explain the intermittence I saw this
morning.  If I set that to 1 minute would that cause any gross negative
issues?

Is there another setting besides cluster-recheck-interval to consider
adjusting to start mysql when quorum is returned?

Thank you for the feedback.

-John


>> If you'd rather see the cib file I can supply that.
>>
>> With respect to logs, pacemaker.log has the most relevant info, right, but
>> there's a lot.  It's 900+ lines from the time I unplug the network until
>> mysql is restarted by the 'pcs resource refresh'.  Any 

Re: [ClusterLabs] resource start after network reconnected

2021-11-19 Thread Andrei Borzenkov
On 19.11.2021 17:36, john tillman wrote:
>> On 18.11.2021 22:33, john tillman wrote:
>>>
>>> Greetings all,
>>>
>>> preamble: RHEL8, PCS 0.10.8, COROSYNC 3.1.0, PACEMAKER 2.0.5
>>>
>>> I have a mysql resource, cloned, that is behaving the way I wanted.  When
>>> the node it is on is unplugged from the network quorum is lost and the
>>> mysqld service stops.  Great.  Oh, and fencing is disabled.
>>>
>>> When the network connectivity is restored I'd like it to restart but it
>>> doesn't.  What needs to be done to make this happen automatically?  Or
>>> what section of the doc should I reread more thoroughly?
>>>
>>> When mysql is stopped because of the above, if I run "pcs resource
>>> refresh" it starts.  Any ideas why the "refresh" would do that?
>>>
>>
>> You provided zero information about your setup and how you configured
>> pacemaker to stop mysqld on network connectivity loss, so it is rather
>> hard to guess.
>>
>> Logs covering the period when you unplug the network and later plug it in
>> again could also be helpful.
>>
> 
> Fair point.  I didn't want to put too much into the first email.  There
> are 3 nodes but 2 nodes are actually used for processing and the 3rd node
> is there just for quorum purposes.  When quorum is lost my resources stop.
> There are 3 resources: a VIP, a MySQL service, and controld (a
> project-specific service).
> 
> And this problem has now become intermittent as 1 in 4 tests this morning
> succeeded in starting mysqld when the network was reconnected.  Figures
> :-/
> 
> More info.  After reconnecting the network on spm238 the mysql resource
> was listed as:
>   * spmDB   (systemd:mysqld):   FAILED spm238 (blocked)
> 
> This was cleared and mysqld started after issuing a "pcs resource refresh".
> 

"pcs resource refresh" deletes the failure history, so pacemaker tries to
start the resource again. It is completely unrelated to network interface
conditions.

"blocked" is the default when a resource stop operation fails and stonith is
disabled.
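
As a rough sketch of the point above, the recorded failures could be looked at
before the history is cleared. This assumes pcs 0.10 as in the original post;
the function name is hypothetical and spmDB is the resource name from the
configuration quoted in this thread.

```shell
# Hypothetical helper: show the recorded failures for a resource, then clear
# its history so pacemaker re-probes it and tries to start it again.
#   $1 = resource id (for example spmDB)
show_and_refresh() {
    pcs resource failcount show "$1"   # failures recorded for this resource
    pcs resource refresh "$1"          # delete history; pacemaker re-detects state
}
```

It would be called as "show_and_refresh spmDB" on any cluster node.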

> So as requested here's how I set up my cluster.  It's copied from an
> ansible playbook so there are some variables shown but should be easy
> enough to understand.  If not, I will gladly clarify anything.
> 
> My 3 resources:
> 
> pcs resource create spmVIP ocf:heartbeat:IPaddr2 ip={{ spmvip }}
> cidr_netmask=24 op monitor interval=10s
> pcs resource create spmControl systemd:controld op monitor interval=10s
> pcs resource create spmDB systemd:mysqld op monitor interval=10s clone
> 
> My constraints:
> pcs constraint colocation add spmControl with spmVIP INFINITY
> pcs constraint colocation add spmVIP with spmDB-clone 200
> crm_resource -r spmVIP -p resource-stickiness -m -v 100
> crm_resource -r spmControl -p resource-stickiness -m -v 100
> 
> Don't run resources on the quorum only node:
> pcs constraint location spmVIP avoids {{ QOnlynode }}=INFINITY
> pcs constraint location spmControl avoids {{ QOnlynode }}=INFINITY
> pcs constraint location spmDB-clone avoids {{ QOnlynode }}=INFINITY
> 

I have no idea what QOnlynode means here.

> and stonith is false:
> pcs property set stonith-enabled=false
> 

I do not see anything in your configuration that would cause mysql to be
stopped on network connectivity issues. Either mysql does it on its own, or
pacemaker attempts to stop all resources on the node when it goes out of
quorum.

If mysql does it on its own, there is nothing that can be done from the
pacemaker side. Pacemaker is not aware of network state at all and certainly
cannot initiate actions when the network becomes available.

If pacemaker tries to stop resources due to an out-of-quorum condition, you
could set a suitable failure-timeout; this will be equivalent to using "pcs
resource refresh". Keep in mind that pacemaker only checks for failure-timeout
expiration every cluster-recheck-interval (15 minutes by default). This still
is not directly related to network availability, but if a network outage
resulted in the node going out of quorum, then once the network is back and
the node has rejoined the cluster, it will allow resources to be started on
the node.
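
A minimal sketch of what setting those two options could look like with pcs
0.10. The function name and the example values (60s for both) are assumptions
for illustration only, not tested recommendations; spmDB-clone is the clone id
from the configuration quoted above.

```shell
# Hypothetical helper: let recorded failures expire on their own.
#   $1 = resource or clone id (for example spmDB-clone)
#   $2 = failure-timeout (for example 60s, an illustrative value)
#   $3 = cluster-recheck-interval (for example 60s; the default is 15min)
set_failure_expiry() {
    # Forget recorded failures of the resource once they are older than $2.
    pcs resource meta "$1" failure-timeout="$2"
    # Re-evaluate the cluster state, including failure expiry, every $3.
    pcs property set cluster-recheck-interval="$3"
}
```

It would be called as "set_failure_expiry spmDB-clone 60s 60s". A shorter
recheck interval makes pacemaker recompute the cluster state more often, so
the value is a trade-off rather than something to minimise.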

> If you'd rather see the cib file I can supply that.
> 
> With respect to logs, pacemaker.log has the most relevant info, right, but
> there's a lot.  It's 900+ lines from the time I unplug the network until
> mysql is restarted by the 'pcs resource refresh'.  Any suggestions for how
> to present the info here?  Maybe use grep for some key words and include
> those lines here?
> 
> 
>>> It is definitely that call to refresh that triggers the start because I've
>>> run a handful of tests and the time between reconnecting the network and
>>> pcs resource refresh call varied by as much as 10 minutes.
>>>
>>> Any suggestion would be appreciated.
>>>
>>> Regards,
>>> -John

Re: [ClusterLabs] resource start after network reconnected

2021-11-19 Thread john tillman
> On 18.11.2021 22:33, john tillman wrote:
>>
>> Greetings all,
>>
>> preamble: RHEL8, PCS 0.10.8, COROSYNC 3.1.0, PACEMAKER 2.0.5
>>
>> I have a mysql resource, cloned, that is behaving the way I wanted.  When
>> the node it is on is unplugged from the network quorum is lost and the
>> mysqld service stops.  Great.  Oh, and fencing is disabled.
>>
>> When the network connectivity is restored I'd like it to restart but it
>> doesn't.  What needs to be done to make this happen automatically?  Or
>> what section of the doc should I reread more thoroughly?
>>
>> When mysql is stopped because of the above, if I run "pcs resource
>> refresh" it starts.  Any ideas why the "refresh" would do that?
>>
>
> You provided zero information about your setup and how you configured
> pacemaker to stop mysqld on network connectivity loss, so it is rather
> hard to guess.
>
> Logs covering the period when you unplug the network and later plug it in
> again could also be helpful.
>

Fair point.  I didn't want to put too much into the first email.  There
are 3 nodes but 2 nodes are actually used for processing and the 3rd node
is there just for quorum purposes.  When quorum is lost my resources stop.
There are 3 resources: a VIP, a MySQL service, and controld (a
project-specific service).

And this problem has now become intermittent as 1 in 4 tests this morning
succeeded in starting mysqld when the network was reconnected.  Figures
:-/

More info.  After reconnecting the network on spm238 the mysql resource
was listed as:
  * spmDB   (systemd:mysqld):   FAILED spm238 (blocked)

This was cleared and mysqld started after issuing a "pcs resource refresh".

So as requested here's how I set up my cluster.  It's copied from an
ansible playbook so there are some variables shown but should be easy
enough to understand.  If not, I will gladly clarify anything.

My 3 resources:

pcs resource create spmVIP ocf:heartbeat:IPaddr2 ip={{ spmvip }}
cidr_netmask=24 op monitor interval=10s
pcs resource create spmControl systemd:controld op monitor interval=10s
pcs resource create spmDB systemd:mysqld op monitor interval=10s clone

My constraints:
pcs constraint colocation add spmControl with spmVIP INFINITY
pcs constraint colocation add spmVIP with spmDB-clone 200
crm_resource -r spmVIP -p resource-stickiness -m -v 100
crm_resource -r spmControl -p resource-stickiness -m -v 100

Don't run resources on the quorum only node:
pcs constraint location spmVIP avoids {{ QOnlynode }}=INFINITY
pcs constraint location spmControl avoids {{ QOnlynode }}=INFINITY
pcs constraint location spmDB-clone avoids {{ QOnlynode }}=INFINITY

and stonith is false:
pcs property set stonith-enabled=false

If you'd rather see the cib file I can supply that.

With respect to logs, pacemaker.log has the most relevant info, right, but
there's a lot.  It's 900+ lines from the time I unplug the network until
mysql is restarted by the 'pcs resource refresh'.  Any suggestions for how
to present the info here?  Maybe use grep for some key words and include
those lines here?
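
One possible way to cut the log down along those lines, as a sketch only: keep
the lines that mention the resource and also contain one of a few keywords.
The function name and the keyword list are assumptions, and the log path in
the usage note is the usual RHEL 8 location, which may differ on other
systems.

```shell
# Hypothetical helper: print only the log lines that mention a resource
# together with one of a few keywords (the keyword list is an assumption).
#   $1 = log file
#   $2 = resource name (matched as a fixed string)
filter_resource_log() {
    grep -F -- "$2" "$1" | grep -E -i 'quorum|fail|block|stop|start'
}
```

It would be called as
"filter_resource_log /var/log/pacemaker/pacemaker.log spmDB", with the output
redirected to a file that is small enough to attach.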


>> It is definitely that call to refresh that triggers the start because I've
>> run a handful of tests and the time between reconnecting the network and
>> pcs resource refresh call varied by as much as 10 minutes.
>>
>> Any suggestion would be appreciated.
>>
>> Regards,
>> -John


Re: [ClusterLabs] resource start after network reconnected

2021-11-19 Thread john tillman
> On Thu, Nov 18, 2021 at 03:42:48PM -0500, john tillman wrote:
>> I don't believe I can since I do not have a fencing device available.
>
> As this page explains, fencing is required for the cluster to behave
> correctly:
>
>   https://www.alteeve.com/w/The_2-Node_Myth
>
> Can you share what kind of nodes you are working with? Perhaps some
> simple form of fencing is possible.

Thank you, Valentin.  I'll add more information in my next response.

> --
> Valentin


Re: [ClusterLabs] resource start after network reconnected

2021-11-18 Thread Andrei Borzenkov
On 18.11.2021 22:33, john tillman wrote:
> 
> Greetings all,
> 
> preamble: RHEL8, PCS 0.10.8, COROSYNC 3.1.0, PACEMAKER 2.0.5
> 
> I have a mysql resource, cloned, that is behaving the way I wanted.  When
> the node it is on is unplugged from the network quorum is lost and the
> mysqld service stops.  Great.  Oh, and fencing is disabled.
> 
> When the network connectivity is restored I'd like it to restart but it
> doesn't.  What needs to be done to make this happen automatically?  Or
> what section of the doc should I reread more thoroughly?
>
> When mysql is stopped because of the above, if I run "pcs resource
> refresh" it starts.  Any ideas why the "refresh" would do that?
> 

You provided zero information about your setup and how you configured pacemaker 
to stop mysqld on network connectivity loss, so it is rather hard to guess.

Logs covering the period when you unplug the network and later plug it in
again could also be helpful.

> It is definitely that call to refresh that triggers the start because I've
> run a handful of tests and the time between reconnecting the network and
> pcs resource refresh call varied by as much as 10 minutes.
> 
> Any suggestion would be appreciated.
> 
> Regards,
> -John


Re: [ClusterLabs] resource start after network reconnected

2021-11-18 Thread Valentin Vidić via Users
On Thu, Nov 18, 2021 at 03:42:48PM -0500, john tillman wrote:
> I don't believe I can since I do not have a fencing device available.

As this page explains, fencing is required for the cluster to behave
correctly:

  https://www.alteeve.com/w/The_2-Node_Myth

Can you share what kind of nodes you are working with? Perhaps some
simple form of fencing is possible.

-- 
Valentin


Re: [ClusterLabs] resource start after network reconnected

2021-11-18 Thread john tillman
> On Thu, Nov 18, 2021 at 02:33:28PM -0500, john tillman wrote:
>> preamble: RHEL8, PCS 0.10.8, COROSYNC 3.1.0, PACEMAKER 2.0.5
>>
>> I have a mysql resource, cloned, that is behaving the way I wanted.  When
>> the node it is on is unplugged from the network quorum is lost and the
>> mysqld service stops.  Great.  Oh, and fencing is disabled.
>
> Can you test how it behaves with fencing enabled?

I don't believe I can since I do not have a fencing device available.

>
> --
> Valentin


Re: [ClusterLabs] resource start after network reconnected

2021-11-18 Thread Valentin Vidić via Users
On Thu, Nov 18, 2021 at 02:33:28PM -0500, john tillman wrote:
> preamble: RHEL8, PCS 0.10.8, COROSYNC 3.1.0, PACEMAKER 2.0.5
> 
> I have a mysql resource, cloned, that is behaving the way I wanted.  When
> the node it is on is unplugged from the network quorum is lost and the
> mysqld service stops.  Great.  Oh, and fencing is disabled.

Can you test how it behaves with fencing enabled?

-- 
Valentin


[ClusterLabs] resource start after network reconnected

2021-11-18 Thread john tillman


Greetings all,

preamble: RHEL8, PCS 0.10.8, COROSYNC 3.1.0, PACEMAKER 2.0.5

I have a mysql resource, cloned, that is behaving the way I wanted.  When
the node it is on is unplugged from the network quorum is lost and the
mysqld service stops.  Great.  Oh, and fencing is disabled.

When the network connectivity is restored I'd like it to restart but it
doesn't.  What needs to be done to make this happen automatically?  Or
what section of the doc should I reread more thoroughly?

When mysql is stopped because of the above, if I run "pcs resource
refresh" it starts.  Any ideas why the "refresh" would do that?

It is definitely that call to refresh that triggers the start because I've
run a handful of tests and the time between reconnecting the network and
pcs resource refresh call varied by as much as 10 minutes.

Any suggestion would be appreciated.

Regards,
-John


