Re: [ClusterLabs] [EXTERNAL] What's the best practice to scale-out/increase the cluster size?

Tomas Jelinek Mon, 15 Jul 2019 05:20:26 -0700

Dne 11. 07. 19 v 16:01 Michael Powell napsal(a):

Thanks again for the feedback.  As a novice to Pacemaker, I am learning a great 
deal and have a great deal more to learn.


I'm afraid I was not precise in my choice of the term "stand-alone".  As you point out, 
our situation is really case b) "as the 2-node cluster but only one is up at the moment". 
 That said, there are cases where we would want to bring up a single node at a time.  These occur 
during maintenance periods, though, not during normal operation.  Hence we believe we can live with 
the STONITH/reboot.

I was not aware that corosync.conf could be reloaded while Corosync was 
running.  I'll review the Corosync documentation again.

Be aware that not all corosync settings can be changed while corosync isrunning. See corosync wiki[1] for details.


[1]: https://github.com/corosync/corosync/wiki/Config-file-values


Regards,
Tomas


Regards,
   Michael

-----Original Message-----
From: Roger Zhou <zz...@suse.com>
Sent: Thursday, July 11, 2019 1:16 AM
To: Cluster Labs - All topics related to open-source clustering welcomed 
<users@clusterlabs.org>; Michael Powell <michael.pow...@harmonicinc.com>; Ken Gaillot 
<kgail...@redhat.com>
Cc: Venkata Reddy Chappavarapu <venkata.chappavar...@harmonicinc.com>
Subject: [EXTERNAL] What's the best practice to scale-out/increase the cluster size? 
(was: [ClusterLabs] "node is unclean" leads to gratuitous reboot)


On 7/11/19 2:15 AM, Michael Powell wrote:

Thanks to you and Andrei for your responses.  In our particular situation, we 
want to be able to operate with either node in stand-alone mode, or with both 
nodes protected by HA.  I did not mention this, but I am working on upgrading 
our product from a version which used Pacemaker version 1.0.13 and Heartbeat to 
run under CentOS 7.6 (later 8.0).  The older version did not exhibit this 
behavior, hence my concern.

I do understand the "wait_for_all" option better, and now that I know why the 
"gratuitous" reboot is happening, I'm more comfortable with that behavior.  I think the 
biggest operational risk would occur following a power-up of the chassis.  If one node were 
significantly delayed during bootup, e.g. because of networking issues, the other node would issue 
the STONITH and reboot the delayed node.  That would be an annoyance, but it would be relatively 
infrequent.  Our customers almost always keep at least one node (and usually both nodes) 
operational 24/7.


2 cents,

I think your requirement is very clear. Well, I view this is a tricky design 
challenge. There are two different situations likely fool people:

a) the situation of being stand-alone (one node, really not a cluster)
b) the situation as the 2-node cluster but only one is up at the moment

Being not define the concepts clearly and not clarify their difference, people 
could mix them together and set wrong expectation but on a different concept, 
really.

In your case, the configuration is 2-node cluster. The log indicates the 
correct behavior for b), eg. those STONITH actions are by-design indeed.
But, people set the wrong expectation to treat it as a).

With that, could be a cleaner design to let it be a stand-alone system first, 
then smoothly grow it to two nodes?

Furthermore, this trigger me to raise a question, mostly for corosync:

What's the best practice to scale-out/increase the cluster size?

I can think one of the approach is to modify corosync.conf and reload it in 
run-time. Well, it doesn't look like as smart as the reverse way, namely, 
allow_downscale/auto_tie_breaker/last_man_standing of the advanced corosync 
feature set, see `man votequorum`.


Cheers,
Roger

Regards,
    Michael

-----Original Message-----
From: Ken Gaillot <kgail...@redhat.com>
Sent: Tuesday, July 09, 2019 12:42 PM
To: Cluster Labs - All topics related to open-source clustering
welcomed <users@clusterlabs.org>
Cc: Michael Powell <michael.pow...@harmonicinc.com>; Venkata Reddy
Chappavarapu <venkata.chappavar...@harmonicinc.com>
Subject: [EXTERNAL] Re: [ClusterLabs] "node is unclean" leads to
gratuitous reboot

On Tue, 2019-07-09 at 12:54 +0000, Michael Powell wrote:

I have a two-node cluster with a problem.  If I start


Not so much a problem as a configuration choice :)

There are trade-offs in any case.

- wait_for_all in corosync.conf: If set, this will make each starting node wait 
until it sees the other before gaining quorum for the first time. The downside 
is that both nodes must be up for the cluster to start; the upside is a clean 
starting point and no fencing.

- startup-fencing in pacemaker properties: If disabled, either node
can start without fencing the other. This is unsafe; if the other node
is actually active and running resources, but unreachable from the
newly up node, the newly up node may start the same resources, causing
split- brain. (Easier than you might think: consider taking a node
down for hardware maintenance, bringing it back up without a network,
then plugging it back into the network -- by that point it may have
brought up resources and starts causing havoc.)

- Start corosync on both nodes, then start pacemaker. This avoids start-up 
fencing since when pacemaker starts on either node, it already sees the other 
node present, even if that node's pacemaker isn't up yet.

Personally I'd go for wait_for_all in normal operation. You can always disable 
it if there are special circumstances where a node is expected to be out of the 
cluster for a long time.

Corosync/Pacemaker on one node, and then delay startup on the 2nd
node (which is otherwise up and running), the 2nd node will be
rebooted very soon after STONITH is enabled on the first node.  This
reboot seems to be gratuitous and could under some circumstances be
problematic.  While, at present,  I “manually” start
Corosync/Pacemaker by invoking a script from an ssh session, in a
production environment, this script would be started by a systemd
service.  It’s not hard to imagine that if both nodes were started at
approximately the same time (each node runs on a separate motherboard
in the same chassis), this behavior could cause one of the nodes to
be rebooted while it’s in the process of booting up.

The two nodes’ host names are mgraid-16201289RN00023-0 and mgraid-

16201289RN00023-1.  Both hosts are running, but Pacemaker has been
started on neither.  If Pacemaker is started on mgraid-
16201289RN00023-0, within a few seconds after STONITH is enabled, the
following messages will appear in the system log file, and soon
thereafter STONITH will be invoked to reboot the other node, on which
Pacemaker has not yet been started.  (NB: The fence agent is a
process named mgpstonith which uses the ipmi interface to reboot the
other node.  For debugging, it prints the data it receives from
stdin. )

2019-07-08T13:11:14.907668-07:00 mgraid-16201289RN00023-0

HA_STARTSTOP: Configure mgraid-stonith    # This message indicates
that STONITH is about to be configured and enabled …
2019-07-08T13:11:15.018131-07:00 mgraid-16201289RN00023-0
MGPSTONITH[16299]: info:... action=metadata#012
2019-07-08T13:11:15.050817-07:00 mgraid-16201289RN00023-0
MGPSTONITH[16301]: info:... action=metadata#012 …
2019-07-08T13:11:21.085092-07:00 mgraid-16201289RN00023-0
pengine[16216]:  warning: Scheduling Node mgraid-16201289RN00023-1
for STONITH
2019-07-08T13:11:21.085615-07:00 mgraid-16201289RN00023-0
pengine[16216]:   notice:  * Fence (reboot) mgraid-16201289RN00023-1
'node is unclean'
2019-07-08T13:11:21.085663-07:00 mgraid-16201289RN00023-0
pengine[16216]:   notice:  * Promote    SS16201289RN00023:0     (
Stopped -> Master mgraid-16201289RN00023-0 )
2019-07-08T13:11:21.085704-07:00 mgraid-16201289RN00023-0
pengine[16216]:   notice:  * Start      mgraid-stonith:0
(                   mgraid-16201289RN00023-0 )
2019-07-08T13:11:21.091673-07:00 mgraid-16201289RN00023-0
pengine[16216]:  warning: Calculated transition 0 (with warnings),
saving inputs in /var/lib/pacemaker/pengine/pe-warn-3.bz2
2019-07-08T13:11:21.093155-07:00 mgraid-16201289RN00023-0
crmd[16218]:   notice: Initiating monitor operation
SS16201289RN00023:0_monitor_0 locally on mgraid-16201289RN00023-0
2019-07-08T13:11:21.124403-07:00 mgraid-16201289RN00023-0
crmd[16218]:   notice: Initiating monitor operation mgraid-
stonith:0_monitor_0 locally on mgraid-16201289RN00023-0 …
2019-07-08T13:11:21.132994-07:00 mgraid-16201289RN00023-0
MGPSTONITH[16361]: info:... action=metadata#012 …
2019-07-08T13:11:22.128139-07:00 mgraid-16201289RN00023-0
crmd[16218]:   notice: Requesting fencing (reboot) of node mgraid-
16201289RN00023-1
2019-07-08T13:11:22.129150-07:00 mgraid-16201289RN00023-0
crmd[16218]:   notice: Result of probe operation for
SS16201289RN00023 on mgraid-16201289RN00023-0: 7 (not running)
2019-07-08T13:11:22.129191-07:00 mgraid-16201289RN00023-0
crmd[16218]:   notice: mgraid-16201289RN00023-0-
SS16201289RN00023_monitor_0:6 [ \n\n ]
2019-07-08T13:11:22.133846-07:00 mgraid-16201289RN00023-0 stonith-
ng[16213]:   notice: Client crmd.16218.a7e3cbae wants to fence
(reboot) 'mgraid-16201289RN00023-1' with device '(any)'
2019-07-08T13:11:22.133997-07:00 mgraid-16201289RN00023-0 stonith-
ng[16213]:   notice: Requesting peer fencing (reboot) of mgraid-
16201289RN00023-1
2019-07-08T13:11:22.136287-07:00 mgraid-16201289RN00023-0
crmd[16218]:   notice: Result of probe operation for mgraid-stonith
on mgraid-16201289RN00023-0: 7 (not running)
2019-07-08T13:11:22.141393-07:00 mgraid-16201289RN00023-0
MGPSTONITH[16444]: info:... action=status#012   # Status requests
always return 0.
2019-07-08T13:11:22.141418-07:00 mgraid-16201289RN00023-0
MGPSTONITH[16444]: info:... nodename=mgraid-16201289RN00023-1#012
2019-07-08T13:11:22.141432-07:00 mgraid-16201289RN00023-0
MGPSTONITH[16444]: info:... port=mgraid-16201289RN00023-1#012
2019-07-08T13:11:22.141444-07:00 mgraid-16201289RN00023-0
MGPSTONITH[16444]: info:Ignoring: port …
2019-07-08T13:11:22.148973-07:00 mgraid-16201289RN00023-0
MGPSTONITH[16445]: info:... action=status#012
2019-07-08T13:11:22.148997-07:00 mgraid-16201289RN00023-0
MGPSTONITH[16445]: info:... nodename=mgraid-16201289RN00023-1#012
2019-07-08T13:11:22.149009-07:00 mgraid-16201289RN00023-0
MGPSTONITH[16445]: info:... port=mgraid-16201289RN00023-1#012
2019-07-08T13:11:22.149019-07:00 mgraid-16201289RN00023-0
MGPSTONITH[16445]: info:Ignoring: port …
2019-07-08T13:11:22.155226-07:00 mgraid-16201289RN00023-0
MGPSTONITH[16446]: info:... action=reboot#012
2019-07-08T13:11:22.155250-07:00 mgraid-16201289RN00023-0
MGPSTONITH[16446]: info:... nodename=mgraid-16201289RN00023-1#012
2019-07-08T13:11:22.155263-07:00 mgraid-16201289RN00023-0
MGPSTONITH[16446]: info:... port=mgraid-16201289RN00023-1#012
2019-07-08T13:11:22.155273-07:00 mgraid-16201289RN00023-0
MGPSTONITH[16446]: info:Ignoring: port Following is a relevant
excerpt of the corosync.log file –
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: unpack_config:  STONITH timeout: 60000
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: unpack_config:  STONITH of failed nodes is enabled
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: unpack_config:  Concurrent fencing is disabled
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: unpack_config:  Stop all active resources: false
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: unpack_config:  Cluster is symmetric - resources can run
anywhere by default
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: unpack_config:  Default stickiness: 0
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: unpack_config:  On loss of CCM Quorum: Stop ALL resources
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: unpack_config:  Node scores: 'red' = -INFINITY, 'yellow' = 0,
'green' = 0
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
info: determine_online_status_fencing: Node mgraid-16201289RN00023-0
is active
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
info: determine_online_status:       Node mgraid-16201289RN00023-0 is
online
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
info: unpack_node_loop:     Node 1 is already processed
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
info: unpack_node_loop:     Node 1 is already processed
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
info: clone_print:    Master/Slave Set: ms-SS16201289RN00023
[SS16201289RN00023]
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
info: short_print:         Stopped: [ mgraid-16201289RN00023-0
mgraid-16201289RN00023-1 ]
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
info: clone_print:    Clone Set: mgraid-stonith-clone [mgraid-
stonith]
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
info: short_print:         Stopped: [ mgraid-16201289RN00023-0
mgraid-16201289RN00023-1 ]
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: native_rsc_location:       Constraint (ms-SS16201289RN00023-
master-w1-rule) is not active (role : Master vs. Unknown)
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: native_rsc_location:       Constraint (ms-SS16201289RN00023-
master-w1-rule) is not active (role : Master vs. Unknown)
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: native_rsc_location:       Constraint (ms-SS16201289RN00023-
master-w1-rule) is not active (role : Master vs. Unknown)
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: distribute_children:       Allocating up to 2 ms-
SS16201289RN00023 instances to a possible 1 nodes (at most 1 per
host,
2 optimal)
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: native_assign_node:       Assigning mgraid-16201289RN00023-0
to SS16201289RN00023:0
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: native_assign_node:   All nodes for resource
SS16201289RN00023:1 are unavailable, unclean or shutting down
(mgraid-16201289RN00023-1: 0, -1000000)
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: native_assign_node:   Could not allocate a node for
SS16201289RN00023:1
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
info: native_color:   Resource SS16201289RN00023:1 cannot run
anywhere
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: distribute_children:       Allocated 1 ms-SS16201289RN00023
instances of a possible 2
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: master_color:       SS16201289RN00023:0 master score: 99
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
info: master_color:   Promoting SS16201289RN00023:0 (Stopped mgraid-
16201289RN00023-0)
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: master_color:       SS16201289RN00023:1 master score: 0
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
info: master_color:   ms-SS16201289RN00023: Promoted 1 instances of a
possible 1 to master
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: distribute_children:       Allocating up to 2 mgraid-stonith-
clone instances to a possible 1 nodes (at most 1 per host, 2 optimal)
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: native_assign_node:       Assigning mgraid-16201289RN00023-0
to mgraid-stonith:0
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: native_assign_node:   All nodes for resource mgraid-stonith:1
are unavailable, unclean or shutting down (mgraid-16201289RN00023-1:
0, -1000000)
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: native_assign_node:   Could not allocate a node for mgraid-
stonith:1
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
info: native_color:   Resource mgraid-stonith:1 cannot run anywhere
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: distribute_children:       Allocated 1 mgraid-stonith-clone
instances of a possible 2
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: native_create_probe:       Probing SS16201289RN00023:0 on
mgraid-16201289RN00023-0 (Stopped) 1 (nil)
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: native_create_probe:       Probing mgraid-stonith:0 on mgraid-
16201289RN00023-0 (Stopped) 1 (nil)
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: master_create_actions:       Creating actions for ms-
SS16201289RN00023
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
info: RecurringOp:    Start recurring monitor (3s) for
SS16201289RN00023:0 on mgraid-16201289RN00023-0
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
info: RecurringOp:    Start recurring monitor (3s) for
SS16201289RN00023:0 on mgraid-16201289RN00023-0
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
warning: stage6:  Scheduling Node mgraid-16201289RN00023-1 for STONITH
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: native_start_constraints:       Ordering mgraid-
stonith:0_start_0 after mgraid-16201289RN00023-1 recovery
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
notice: LogNodeActions: * Fence (reboot) mgraid-16201289RN00023-1
'node is unclean'
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
notice: LogAction:      * Promote    SS16201289RN00023:0     (
Stopped -> Master mgraid-16201289RN00023-0 )
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
info: LogActions:     Leave   SS16201289RN00023:1 (Stopped)
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
notice: LogAction:      * Start      mgraid-stonith:0
(                   mgraid-16201289RN00023-0 )
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
info: LogActions:     Leave   mgraid-stonith:1    (Stopped)
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: action2xml:     Using anonymous clone name SS16201289RN00023
for SS16201289RN00023:0 (aka. (null))
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: action2xml:     Using anonymous clone name SS16201289RN00023
for SS16201289RN00023:0 (aka. (null))
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: action2xml:     Using anonymous clone name SS16201289RN00023
for SS16201289RN00023:0 (aka. (null))
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: action2xml:     Using anonymous clone name SS16201289RN00023
for SS16201289RN00023:0 (aka. (null))
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: action2xml:     Using anonymous clone name SS16201289RN00023
for SS16201289RN00023:0 (aka. (null))
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: action2xml:     Using anonymous clone name SS16201289RN00023
for SS16201289RN00023:0 (aka. (null))
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: action2xml:     Using anonymous clone name SS16201289RN00023
for SS16201289RN00023:0 (aka. (null))
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: action2xml:     Using anonymous clone name mgraid-stonith for
mgraid-stonith:0 (aka. (null))
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
debug: action2xml:     Using anonymous clone name mgraid-stonith for
mgraid-stonith:0 (aka. (null))
Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
warning: process_pe_message:       Calculated transition 0 (with
warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-3.bz2

Here is the status of the first node, once Pacemaker is started –[root@mgraid-16201289RN00023-0 bin]# pcs status Cluster name:

Stack: corosync
Current DC: mgraid-16201289RN00023-0 (version 1.1.19-8.el7-
c3c624ea3d) - partition with quorum
Last updated: Mon Jul  8 17:51:22 2019 Last change: Mon Jul  8
16:11:23 2019 by root via cibadmin on mgraid-
16201289RN00023-0

2 nodes configured

4 resources configured

Online: [ mgraid-16201289RN00023-0 ]

OFFLINE: [ mgraid-16201289RN00023-1 ]

Full list of resources:Master/Slave Set: ms-SS16201289RN00023 [SS16201289RN00023]

       SS16201289RN00023  (ocf::omneon:ss):       Starting mgraid-
16201289RN00023-0
       Stopped: [ mgraid-16201289RN00023-1 ] Clone Set:
mgraid-stonith-clone [mgraid-stonith]
       Started: [ mgraid-16201289RN00023-0 ]
       Stopped: [ mgraid-16201289RN00023-1 ]

Daemon Status:

    corosync: active/disabled
    pacemaker: active/disabled
    pcsd: inactive/disabled
Here’s the configuration, from the first node –

[root@mgraid-16201289RN00023-0 bin]# pcs status Cluster name:

Stack: corosync
Current DC: mgraid-16201289RN00023-0 (version 1.1.19-8.el7-
c3c624ea3d) - partition with quorum
Last updated: Mon Jul  8 17:51:22 2019 Last change: Mon Jul  8
16:11:23 2019 by root via cibadmin on mgraid-
16201289RN00023-0

2 nodes configured

4 resources configured

Online: [ mgraid-16201289RN00023-0 ]

OFFLINE: [ mgraid-16201289RN00023-1 ]

Full list of resources:Master/Slave Set: ms-SS16201289RN00023 [SS16201289RN00023]

       SS16201289RN00023  (ocf::omneon:ss):       Starting mgraid-
16201289RN00023-0
       Stopped: [ mgraid-16201289RN00023-1 ] Clone Set:
mgraid-stonith-clone [mgraid-stonith]
       Started: [ mgraid-16201289RN00023-0 ]
       Stopped: [ mgraid-16201289RN00023-1 ]

Daemon Status:

    corosync: active/disabled
    pacemaker: active/disabled
    pcsd: inactive/disabled
[root@mgraid-16201289RN00023-0 bin]# pcs config Cluster Name:
Corosync Nodes:
mgraid-16201289RN00023-0 mgraid-16201289RN00023-1 Pacemaker Nodes:
mgraid-16201289RN00023-0 mgraid-16201289RN00023-1

Resources:

Master: ms-SS16201289RN00023
    Meta Attrs: clone-max=2 notify=true globally-unique=false target-
role=Started
    Resource: SS16201289RN00023 (class=ocf provider=omneon type=ss)
     Attributes: ss_resource=SS16201289RN00023
ssconf=/var/omneon/config/config.16201289RN00023
     Operations: monitor interval=3s role=Master timeout=7s
(SS16201289RN00023-monitor-3s)
                 monitor interval=10s role=Slave timeout=7
(SS16201289RN00023-monitor-10s)
                 stop interval=0 timeout=20 (SS16201289RN00023-stop-0)
                 start interval=0 timeout=300
(SS16201289RN00023-start-
0)
Clone: mgraid-stonith-clone
    Resource: mgraid-stonith (class=stonith type=mgpstonith)
     Operations: monitor interval=0 timeout=20s (mgraid-stonith-
monitor-interval-0)

Stonith Devices:

Fencing Levels:

Location Constraints:

    Resource: ms-SS16201289RN00023
      Constraint: ms-SS16201289RN00023-master-w1
        Rule: role=master score=100  (id:ms-SS16201289RN00023-master-
w1-rule)
          Expression: #uname eq mgraid-16201289rn00023-0  (id:ms-
SS16201289RN00023-master-w1-rule-expression)
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:

Alerts:

No alerts defined

Resources Defaults:

failure-timeout: 1min
Operations Defaults:
No defaults set

Cluster Properties:

cluster-infrastructure: corosync
cluster-recheck-interval: 1min
dc-deadtime: 5s
dc-version: 1.1.19-8.el7-c3c624ea3d
have-watchdog: false
last-lrm-refresh: 1562513532
stonith-enabled: true

Quorum:

    Options:
      wait_for_all: 0
Interestingly, as you’ll note below, the “two_node” option is also
set to 1, but is not reported as such above.

Finally, here’s /etc/corosync/corosync.conf – totem {

          version: 2

crypto_cipher: none

          crypto_hash: none

interface {

                  ringnumber: 0
          bindnetaddr: 169.254.1.1
                  mcastaddr: 239.255.1.1
                  mcastport: 5405
                  ttl: 1
          }
}

logging {

          fileline: off
          to_stderr: no
          to_logfile: yes
          logfile: /var/log/cluster/corosync.log
          to_syslog: yes
          debug: on
          timestamp: on
          logger_subsys {
                  subsys: QUORUM
                  debug: on
          }
}

nodelist {

          node {
                  ring0_addr: mgraid-16201289RN00023-0
                  nodeid: 1
          }

node {

                  ring0_addr: mgraid-16201289RN00023-1
                  nodeid: 2
          }
}

quorum {

          provider: corosync_votequorum

two_node: 1wait_for_all: 0

I’d appreciate any insight you can offer into this behavior, and any

suggestions you may have.

Regards,
    Michael

Michael Powell

      Sr. Staff Engineer

15220 NW Greenbrier Pkwy

          Suite 290
      Beaverton, OR   97006
      T 503-372-7327    M 503-789-3019   H 503-625-5332

_______________________________________________

Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

--
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] [EXTERNAL] What's the best practice to scale-out/increase the cluster size?

Reply via email to