Re: [ClusterLabs] Pacemaker startup retries

2018-08-30 Thread Cesar Hernandez
Hi

> 
> 
> Do you mean you have a custom fencing agent configured? If so, check
> the return value of each attempt. Pacemaker should request fencing only
> once as long as it succeeds (returns 0), but if the agent fails
> (returns nonzero or times out), it will retry, even if the reboot
> worked in reality.
> 

Yes, a custom fencing agent, and it always returns 0.
> 
> 
> FYI, corosync 2 has a "two_node" setting that includes "wait_for_all"
> -- with that, you don't need to ignore quorum in pacemaker, and the
> cluster won't start until both nodes have seen each other at least
> once.

Well, I'm OK with the quorum behaviour, but I want to know why it reboots the 
node 3 times on startup.
When both nodes are up and running and one node stops responding, the other 
node fences it only once, not 3 times.
> 
> 
Do you know why it happens?

Thanks
Cesar



Re: [ClusterLabs] Pacemaker startup retries

2018-08-30 Thread Ken Gaillot
On Thu, 2018-08-30 at 17:24 +0200, Cesar Hernandez wrote:
> Hi
> 
> I have a two-node corosync+pacemaker cluster which, when only one node
> is started, fences the other node. That is fine, as the default
> "startup-fencing" is set to true.
> But the other node is rebooted 3 times, and only then does the remaining
> node start resources and stop fencing it.
> 
> How can I change these 3 attempts to, for example, 1 reboot, or more,
> say 5? I use a custom fencing script, so I'm sure these retries are not
> done by the script but by Pacemaker, and I also see the reboot
> operations in the logs:
> 
> Aug 30 17:22:08 [12978] 1       crmd:   notice: te_fence_node:
> Executing reboot fencing operation (81) on 2 (timeout=18)
> Aug 30 17:22:31 [12978] 1       crmd:   notice: te_fence_node:
> Executing reboot fencing operation (87) on 2 (timeout=18)
> Aug 30 17:22:48 [12978] 1       crmd:   notice: te_fence_node:
> Executing reboot fencing operation (89) on 2 (timeout=18)

Do you mean you have a custom fencing agent configured? If so, check
the return value of each attempt. Pacemaker should request fencing only
once as long as it succeeds (returns 0), but if the agent fails
(returns nonzero or times out), it will retry, even if the reboot
worked in reality.

If instead you mean you have a script that can request fencing (e.g.
via stonith_admin), then check the logs before each attempt to see if
the request was initiated by the cluster (which should show a policy
engine transition for it) or your script.
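
As a rough sketch of how to check both from the command line (stonith_admin
is Pacemaker's fencing CLI; "node2" below is only a placeholder for your
second node's name):

  stonith_admin --history node2    # fencing attempts recorded for that node
  stonith_admin --reboot node2 ; echo "exit status: $?"    # one manual reboot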

FYI, corosync 2 has a "two_node" setting that includes "wait_for_all"
-- with that, you don't need to ignore quorum in pacemaker, and the
cluster won't start until both nodes have seen each other at least
once.
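
For reference, a minimal corosync 2 quorum section for that would look
roughly like this (a sketch; see the votequorum(5) man page -- note that
two_node: 1 enables wait_for_all by default):

  quorum {
      provider: corosync_votequorum
      two_node: 1
      # implied by two_node, shown explicitly here for clarity
      wait_for_all: 1
  }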

> Software versions:
> 
> corosync-1.4.8
> crmsh-2.1.5
> libqb-0.17.2
> Pacemaker-1.1.14
> resource-agents-3.9.6
> Reusable-Cluster-Components-glue--glue-1.0.12
> 
> Some parameters:
> 
> property cib-bootstrap-options: \
>   have-watchdog=false \
>   dc-version=1.1.14-70404b0e5e \
>   cluster-infrastructure="classic openais (with plugin)" \
>   expected-quorum-votes=2 \
>   stonith-enabled=true \
>   no-quorum-policy=ignore \
>   default-resource-stickiness=200 \
>   stonith-timeout=180s \
>   last-lrm-refresh=1534489943
> 
> 
> Thanks
> 
> César Hernández Bañó
-- 
Ken Gaillot 


Re: [ClusterLabs] Q: native_color scores for clones

2018-08-30 Thread Ken Gaillot
On Thu, 2018-08-30 at 12:23 +0200, Ulrich Windl wrote:
> Hi!
> 
> After having found showscores.sh, I thought I could improve the
> performance by porting it to Perl, but it seems the slow part is
> actually calling Pacemaker's helper scripts like crm_attribute,
> crm_failcount, etc...
> 
> But anyway: being quite confident in what my program produces (;-)), I
> found some odd score values for clones that run in a two-node
> cluster. For example:
> 
> Resource         Node  Score      Stickiness  Fail Count  Migr. Thr.
> ---------------  ----  ---------  ----------  ----------  ----------
> prm_DLM:1        h02   1          0           0           0
> prm_DLM:1        h06   0          0           0           0
> prm_DLM:0        h02   -INFINITY  0           0           0
> prm_DLM:0        h06   1          0           0           0
> prm_O2CB:1       h02   1          0           0           0
> prm_O2CB:1       h06   -INFINITY  0           0           0
> prm_O2CB:0       h02   -INFINITY  0           0           0
> prm_O2CB:0       h06   1          0           0           0
> prm_cfs_locks:0  h02   -INFINITY  0           0           0
> prm_cfs_locks:0  h06   1          0           0           0
> prm_cfs_locks:1  h02   1          0           0           0
> prm_cfs_locks:1  h06   -INFINITY  0           0           0
> prm_s02_ctdb:0   h02   -INFINITY  0           0           0
> prm_s02_ctdb:0   h06   1          0           0           0
> prm_s02_ctdb:1   h02   1          0           0           0
> prm_s02_ctdb:1   h06   -INFINITY  0           0           0
> 
> For prm_DLM:1, for example, one node has score 0 and the other node has
> score 1, but for prm_DLM:0 the host that has 1 for prm_DLM:1 has
> -INFINITY (not 0), while the other host has the usual 1. So I guess
> that even without -INFINITY the configuration would be stable. For
> prm_O2CB two nodes have -INFINITY as score. For prm_cfs_locks the
> pattern is as usual, and for prm_s02_ctdb two nodes have -INFINITY
> again.
> 
> I don't understand where those -INFINITY scores come from. Pacemaker
> is SLES11 SP4 (1.1.12-f47ea56).

Scores are something of a mystery without tracing through the code line by
line. It's similar to (though much simpler than) AI in that the cause of a
given output is obscured by the combination of so many inputs and processing
steps.

Resources (including clone instances) are placed one by one. My guess
is that prm_DLM:1 was placed first, on h02, which made it no longer
possible for any other instance to be placed on h02 due to clone-node-
max being 1 (which I'm assuming is the case here, the default). All
instances processed after that (which is only prm_DLM:0 in this case)
get -INFINITY on h02 for that reason.

The clones with mirrored -INFINITY scores likely had additional
processing due to constraints or whatnot.
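
If you want to see the scores exactly as the scheduler computes them, rather
than reassembling them from the helper tools, crm_simulate can print them
(the file path below is only an example):

  crm_simulate -sL                      # allocation scores from the live CIB
  crm_simulate -s -x /tmp/cib-copy.xml  # the same, against a saved CIB copy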

> 
> It might also be a bug, because when I look at a three-node cluster,
> I see that a ":0" resource had score 1 once, and 0 twice, but the
> corresponding ":2" resource has scores 0, 1, and -INFINITY, and the
> ":1" resource has score 1 once and -INFINITY twice.
> 
> When I look at the "clone_color" scores, the prm_DLM:* primitives
> look as expected (no -INFINITY). However, the cln_DLM clones have
> scores like 1, 8200 and 2200 (depending on the node).
> 
> Can someone explain, please?
> 
> Regards,
> Ulrich
-- 
Ken Gaillot 


[ClusterLabs] Pacemaker startup retries

2018-08-30 Thread Cesar Hernandez
Hi

I have a two-node corosync+pacemaker cluster which, when only one node is 
started, fences the other node. That is fine, as the default 
"startup-fencing" is set to true.
But the other node is rebooted 3 times, and only then does the remaining node 
start resources and stop fencing it.

How can I change these 3 attempts to, for example, 1 reboot, or more, say 5? 
I use a custom fencing script, so I'm sure these retries are not done by the 
script but by Pacemaker, and I also see the reboot operations in the logs:

Aug 30 17:22:08 [12978] 1   crmd:   notice: te_fence_node:  
Executing reboot fencing operation (81) on 2 (timeout=18)
Aug 30 17:22:31 [12978] 1   crmd:   notice: te_fence_node:  
Executing reboot fencing operation (87) on 2 (timeout=18)
Aug 30 17:22:48 [12978] 1   crmd:   notice: te_fence_node:  
Executing reboot fencing operation (89) on 2 (timeout=18)



Software versions:

corosync-1.4.8
crmsh-2.1.5
libqb-0.17.2
Pacemaker-1.1.14
resource-agents-3.9.6
Reusable-Cluster-Components-glue--glue-1.0.12

Some parameters:

property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.14-70404b0e5e \
cluster-infrastructure="classic openais (with plugin)" \
expected-quorum-votes=2 \
stonith-enabled=true \
no-quorum-policy=ignore \
default-resource-stickiness=200 \
stonith-timeout=180s \
last-lrm-refresh=1534489943


Thanks

César Hernández Bañó




Re: [ClusterLabs] Antw: Re: Q: ordering clones with interleave=false

2018-08-30 Thread Ken Gaillot
On Thu, 2018-08-30 at 08:28 +0200, Ulrich Windl wrote:
> > > > Ken Gaillot  wrote on 29.08.2018 at 20:30 in message
> <1535567455.5594.5.ca...@redhat.com>:
> > On Wed, 2018-08-29 at 13:30 +0200, Ulrich Windl wrote:
> > > Hi!
> > > 
> > > Reading the docs I have a question: When I run a clone with
> > > interleave=false in a three-node cluster, and the clone cannot be
> > > started on one node, will ordering for such a clone be possible?
> > > Does it make a difference whether the resource cannot run on an
> > > online node, or whether it cannot run because a node is in standby
> > > or offline?
> > > 
> > > Regards,
> > > Ulrich
> > 
> > Interleave=false only applies to instances that will be started in
> > the current transition, so offline nodes don't prevent dependent
> > resources from starting on online nodes.
> 
> Ken,
> 
> thanks for responding and explaining. However, as I have read about
> "the transition thing" more than once in an answer: where is that
> concept explained in greater detail?

Good question; it's not, currently. A Troubleshooting chapter is planned
for the new "Pacemaker Administration" document, whenever that mythical
free time appears, and that would be a good place for it.

A transition is simply the set of actions Pacemaker plans in response
to a new situation. The situation is defined by the CIB (including the
status section). Whenever something interesting happens (a configuration
change, an unexpected action result, or the cluster-recheck interval
expiring), the scheduler uses the current live CIB to calculate what (if
anything) needs to be done, and that plan is a transition.
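
If you want to see one being calculated without the cluster acting on it,
crm_simulate can run the scheduler by hand against the live CIB (a sketch;
the file names are just placeholders):

  crm_simulate --live-check --save-graph transition.xml \
      --save-dotfile transition.dot
  # transition.dot can be rendered with graphviz to inspect the planned
  # actions and their ordering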

> 
> Regards,
> Ulrich
-- 
Ken Gaillot 


[ClusterLabs] Q: native_color scores for clones

2018-08-30 Thread Ulrich Windl
Hi!

After having found showscores.sh, I thought I could improve the performance by 
porting it to Perl, but it seems the slow part is actually calling Pacemaker's 
helper scripts like crm_attribute, crm_failcount, etc...
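
(For what it's worth, most of those per-value calls could probably be replaced
by a single CIB dump that the script parses itself, e.g.

  cibadmin -Q > /tmp/cib.xml   # one query instead of many helper invocations

-- just a sketch, with the XML parsing left out.)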

But anyway: being quite confident in what my program produces (;-)), I found 
some odd score values for clones that run in a two-node cluster. For example:

Resource         Node  Score      Stickiness  Fail Count  Migr. Thr.
---------------  ----  ---------  ----------  ----------  ----------
prm_DLM:1        h02   1          0           0           0
prm_DLM:1        h06   0          0           0           0
prm_DLM:0        h02   -INFINITY  0           0           0
prm_DLM:0        h06   1          0           0           0
prm_O2CB:1       h02   1          0           0           0
prm_O2CB:1       h06   -INFINITY  0           0           0
prm_O2CB:0       h02   -INFINITY  0           0           0
prm_O2CB:0       h06   1          0           0           0
prm_cfs_locks:0  h02   -INFINITY  0           0           0
prm_cfs_locks:0  h06   1          0           0           0
prm_cfs_locks:1  h02   1          0           0           0
prm_cfs_locks:1  h06   -INFINITY  0           0           0
prm_s02_ctdb:0   h02   -INFINITY  0           0           0
prm_s02_ctdb:0   h06   1          0           0           0
prm_s02_ctdb:1   h02   1          0           0           0
prm_s02_ctdb:1   h06   -INFINITY  0           0           0

For prm_DLM:1, for example, one node has score 0 and the other node has score 1, 
but for prm_DLM:0 the host that has 1 for prm_DLM:1 has -INFINITY (not 0), while 
the other host has the usual 1. So I guess that even without -INFINITY the 
configuration would be stable. For prm_O2CB two nodes have -INFINITY as score. 
For prm_cfs_locks the pattern is as usual, and for prm_s02_ctdb two nodes have 
-INFINITY again.

I don't understand where those -INFINITY scores come from. Pacemaker is SLES11 
SP4 (1.1.12-f47ea56).

It might also be a bug, because when I look at a three-node cluster, I see that 
a ":0" resource had score 1 once, and 0 twice, but the corresponding ":2" 
resource has scores 0, 1, and -INFINITY, and the ":1" resource has score 1 once 
and -INFINITY twice.

When I look at the "clone_color" scores, the prm_DLM:* primitives look as 
expected (no -INFINITY). However, the cln_DLM clones have scores like 1, 8200 
and 2200 (depending on the node).

Can someone explain, please?

Regards,
Ulrich

