Re: [ClusterLabs] Antw: [EXT] True time periods/CLOCK_MONOTONIC node vs. cluster wide (Was: Coming in Pacemaker 2.0.4: dependency on monotonic clock for systemd resources)

2020-03-12 Thread Jan Pokorný
On 12/03/20 08:22 +0100, Ulrich Windl wrote:
> Sorry for top-posting, but if you have NTP-synced your nodes,
> CLOCK_MONOTONIC will not have much advantage over CLOCK_REALTIME
> as the clocks will be rather the same,

I guess you mean they will be rather the same amongst the nodes
relative to each other (not MONOTONIC vs. REALTIME, since those
will trivially differ at some point), regardless of whether you
use either of these.

How is synchronizing to exactly the same point in time going to be
achieved?  If it isn't, you'll suffer from "clocks not the same"
eventually.

How is a failure to NTP-sync going to be detected, and what
consequences will it impose on the cluster?  If you happily continue
regardless, you'll suffer from "clocks not the same" eventually.

NTP is currently highly recommended as upstream guidance, but
it can easily be out of the scope of HA projects ... just consider
the autonomous units Digimer was talking about.  Still, they
could at least in theory keep some cluster-private time measurement
sync'd without any exact external calendar anchoring
(i.e., "stop-watching" rather than "calendar-watching").

Also, at the barebone level, when REALTIME would never be used
for anything in the cluster (except perhaps logging, for which
"calendar-watching" [see above] may be of practical value), the cluster
would _not_ be interested in calendar-correct time relating (which is
what NTP primarily provides) --- rather just in measuring the
periods, hence in the particular clocks amongst the nodes being
reasonably comparable (ticking rather synchronously).

> and they won't "jump".
> IMHO the latter is the main reason for using CLOCK_MONOTONIC

yes

> (if the admin

or said NTP client, a conventional change (see "leap second",
for instance), an HW fault (the RTC battery, for instance) or anything else

> decides to adjust the real-time clock).  So far the theory. In
> practice the clock jumps, even with NTP, especially if the node had
> been running for a long time, is not updating the RTC, and then
> is fenced. The clock may be off by minutes after boot then, and NTP
> is quite conservative when adjusting the time (first it won't
> believe that the clock is off that far, then after minutes the clock
> will actually jump. That's why some fast pre-ntpd adjustment is
> generally used.) The point is (if you can sync your clocks to
> real-time at all): How long do you want to wait for all your nodes
> to agree on some common time?

Specifically in the NTP and systemd picture, I think you can order
startup of some units so that they wait until the "NTP synchronized"
target is reached.
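
For illustration, a minimal sketch of such ordering via a drop-in -- the
unit name "my-cluster-app.service" is made up, and note that on many
distros time-sync.target only really waits for synchronization when
a waiting service (e.g. systemd-time-wait-sync.service, where shipped)
is enabled:

  # mkdir -p /etc/systemd/system/my-cluster-app.service.d
  # printf '[Unit]\nWants=time-sync.target\nAfter=time-sync.target\n' \
> /etc/systemd/system/my-cluster-app.service.d/wait-timesync.conf
  # systemctl daemon-reload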

> Maybe CLOCK_MONOTONIC could help here...

NTP in the calendar-correct sense can be skipped altogether
when the cluster is happy just relying on MONOTONIC (and a
MONOTONIC-like synchronization exists across nodes for good measure).

> The other useful application is any sort of timeout or repeat thing
> that should not be affected by adjusting the real-time clock.

That's exactly the point of using it for node-local measurements,
and why I propose that users would need to willingly opt in for
timeout/interval-sensitive cluster stack components to compile
at all where MONOTONIC is not present.

-- 
Jan (Poki)




[ClusterLabs] True time periods/CLOCK_MONOTONIC node vs. cluster wide (Was: Coming in Pacemaker 2.0.4: dependency on monotonic clock for systemd resources)

2020-03-11 Thread Jan Pokorný
On 11/03/20 09:04 -0500, Ken Gaillot wrote:
> On Wed, 2020-03-11 at 08:20 +0100, Ulrich Windl wrote:
>> You only have to take care not to compare CLOCK_MONOTONIC
>> timestamps between nodes or node restarts. 
> 
> Definitely :)
> 
> They are used only to calculate action queue and run durations

Both of these ... from an isolated perspective of a single node only.
E.g., run durations relate to the node currently responsible for acting
upon the resource in some way (the "atomic" operation is always
bound to a single host context, and when retried or logically
followed by another operation, it's measured anew on the pertaining,
perhaps different, node).

I feel that's a rather important detail, and just recently this
surface received some slight scratching on the conceptual level...

The current inability to synchronize measurements of CLOCK_MONOTONIC-
like notions of time amongst nodes (especially the transfer from an old,
possibly failed DC to a new DC, likely involving some admitted loss
of preciseness -- mind you, a cluster is never fully synchronous,
you'd need the help of specialized HW for that) in as lossless
a way as possible is what I believe is the main show stopper for
being able to accurately express the actual "availability score"
for a given resource or resource group --- yep, that famous number,
the holy grail of anyone taking HA seriously --- while at the
same time, something the cluster stack currently cannot readily
present to users (despite it having all or most of the relevant
information, just piecewise).

IOW, this sort of non-localized measurement is what asks for
emulation of cluster-wide CLOCK_MONOTONIC-like measurement, which is
not that trivial if you think about it.  Sort of a corollary of what
Ulrich said, because emulating that pushes you exactly into these waters
of relating CLOCK_MONOTONIC measurements from different nodes
together.

  Not to speak of evaluating whether any node is totally off in its
  own CLOCK_MONOTONIC measurements and hence shall rather be fenced
  as "brain damaged", and perhaps even using the measurements of the
  nodes keeping up together to somehow calculate what's the average
  rate of measured time progress so as to self-maintain time-bound
  cluster-wide integrity, which may just as well be important for
  sbd(!).  (nope, this doesn't get anywhere close to near-light
  speed concerns, just imprecise HW and possibly implied and/or
  inter-VM differences)

Perhaps the cheapest way out would be to use NTP-level algorithms to
synchronize two CLOCK_MONOTONIC timers, between the worker node for the
resource in question and the DC, at the point this worker node claims
"resource stopped", so that the DC can synchronize again like that
with a new worker node at the point in time when this new one claims
"resource started".  At that point, the DC would have rather accurate
knowledge of how long this fail-/move-over, hence down-time, lasted,
hence being able to reflect it in the "availability score" equations.

  Hmm, no wonder that businesses with deep pockets and serious
  synchronicity requirements across the globe resort to using atomic
  clocks, incredibly precise CLOCK_MONOTONIC by default :-)

> For most resource types those are optional (for reporting only), but
> systemd resources require them (multiple status checks are usually
> necessary to verify a start or stop worked, and we need to check the
> remaining timeout each time).

Coincidentally, IIRC systemd alone strictly requires CLOCK_MONOTONIC
(and we shall get a lot more strict as well, to provide reasonable
expectations to the users, as mentioned recently[*]), so said
requirement is just a logical extension without corner cases.

[*] https://lists.clusterlabs.org/pipermail/users/2019-November/026647.html

-- 
Jan (Poki)




Re: [ClusterLabs] Finding attributes of a past resource agent invocation

2020-03-04 Thread Jan Pokorný
Hi Feri,

just to this one...

On 03/03/20 15:22 +0100, wf...@niif.hu wrote:
> Is there a way to find out what attributes were passed to the OCF
> agent in that fateful invocation?

AFAIK, not possible after-the-fact, unless you add TRACE_RA=1 as
another (real) parameter to the agent and it happens to respond
to it (very likely with standard agents from resource-agents project).
And even then, logs generated like that will likely get lost when
the node is fenced (depends on path/mount particulars).
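
For completeness, a rough sketch of how that opt-in commonly looks --
the resource name is made up; with agents sourcing ocf-shellfuncs from
the resource-agents project the parameter is usually spelled trace_ra,
and the traces land somewhere like /var/lib/heartbeat/trace_ra/ by
default (verify on your installation, and consider redirecting that to
persistent/remote storage if fencing is on the menu):

  # crm_resource --resource myResource --set-parameter trace_ra \
--parameter-value 1
  # ls /var/lib/heartbeat/trace_ra/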

I think that not exposing such details about the invocation directly
at the pacemaker logging level is by design ... safer than letting
the cat out of the bag.  Consider that any reported incidents are
promptly followed by soliciting the logs, and having authentication
tokens, passwords and other secrets leak this way would be bad
for the general reputation, wouldn't it?

(this was also part of the reasoning behind CVE-2019-3885)

-- 
Poki




Re: [ClusterLabs] [RFC] LoadSharingIP agent idea, xt_cluster/IPv6/nftables

2020-01-15 Thread Jan Pokorný
On 02/01/20 21:45 +0100, Jan Pokorný wrote:
> So, any takers? :-)
> 
> Of course, as discussed before, this overhaul would deserve a strict
> separation of the function in question under a dedicated agent like
> LoadSharingIP, letting the equivalent one in IPaddr2 agent and based
> on something practically deprecated on arrival slowly die out.
> Existing agent is pretty complex even without that semantically
> overriding piggy-backing already -- no more gravity boosting like
> that, please (not to speak of its naming that simply won't "click").
> 
> Also note there are missing puzzles regarding IPv6 support, still:
> https://lists.clusterlabs.org/pipermail/users/2019-December/026710.html

Shameless plug: I am not aware of any central place for nice-to-have
(wish someone worked on them) projects, so I've stuck this one on the
ClusterLabs wiki in case anyone would be interested in scratching their
heads further and moving it on from the mere concept:

https://wiki.clusterlabs.org/wiki/Project_Ideas#new_LoadSharingIP_agent

Of course, other such project ideas may emerge there alike.

-- 
Jan (Poki)



Re: [ClusterLabs] Making xt_cluster IP load-sharing work with IPv6

2020-01-14 Thread Jan Pokorný

On 14/01/20 21:16 +0300, Andrei Borzenkov wrote:
> 14.01.2020 17:47, Jan Pokorný wrote:
>> What confused me is that 00:zz:yy:xx:5a:27 appears as if the same
>> address shall be used -- but in your explanation, it would definitely
>> be that case, correct?
> 
> I expect MAC addresses be different (they are on different
> interfaces).  Copy-paste result?

Mea culpa, at least one thing I managed to distort ... meant to say
"it would definitely _not_ be the case" ... and we are in concordance,
eventually.

> If this is intentional and actually denotes same MAC, I have no
> explanation and my guess is probably wrong.

Hopefully anyone more knowledgeable would enlighten us if need be.
Many thanks for the enlightenment so far!

Definitely stuff that will come in handy if anyone sets out to
create that LoadSharingIP (or whatever it'd be called) agent.

>> ($DEITY bless all the good people documenting even what
>> seems obvious to them at the moment :-)

(Yep, such an agent shall rather explain each step along.)

-- 
Poki



Re: [ClusterLabs] Making xt_cluster IP load-sharing work with IPv6

2020-01-14 Thread Jan Pokorný
On 11/01/20 19:47 +0300, Andrei Borzenkov wrote:
> 04.01.2020 01:42, Valentin Vidić wrote:
>> On Thu, Jan 02, 2020 at 09:52:09PM +0100, Jan Pokorný wrote:
>>> What you've used appears to be akin to what this chunk of manpage
>>> suggests (amongst others):
>>> https://git.netfilter.org/iptables/tree/extensions/libxt_cluster.man
>>> 
>>> which is (yet another) indicator to me that xt_cluster extension
>>> doesn't carry that functionality on its own (like CLUSTERIP target
>>> did, as mentioned).
>> 
> ...
>> 
>>> * But it doesn't explain the suggested destination MAC renormalization
>>> * on INPUT, which is currently yet to be heard of for our purpose...
>> 
>> I did not use the INPUT rules from the xt_cluster documentation and
>> to be honest don't understand the setup described there.
>> 
> 
> ARP RFC says that on reply source and target hardware addresses are
> swapped, so reply is supposed to carry original source MAC as target
> MAC. AFAICT Linux ARP driver does not check it, but I guess it is good
> practice to make sure received packet conforms to standard's requirement.

Ah, thanks.

So does it mean that the initiator of the ARP request would assume the
native MAC address of the interface was used (possibly remembering it),
then OUTPUT rule would overwrite the source unconditionally, and upon
delivery of the response back (with said source-target flip performed
by the responder), the INPUT rule would overwrite it back, so that
said initiator would be happy even if it performed said
guarantee-verification per said RFC (or possibly connection
tracking facility of the firewall that might make these
RFC-imposed assumptions, even!)?

Makes sense, unless I am distorting it even more :-)

What confused me is that 00:zz:yy:xx:5a:27 appears as if the same
address shall be used -- but in your explanation, it would definitely
be that case, correct?

($DEITY bless all the good people documenting even what
seems obvious to them at the moment :-)

-- 
Poki



Re: [ClusterLabs] Prevent Corosync Qdevice Failback in split brain scenario.

2020-01-07 Thread Jan Pokorný
On 02/01/20 14:30 +0100, Jan Friesse wrote:
>> I am planning to use Corosync Qdevice version 3.0.0 with corosync
>> version 2.4.4 and pacemaker 1.1.16 in a two node cluster.
>> 
>> I want to know if failback can be avoided in the below situation.
>> 
>> 
>>   1.  The pcs cluster is in split brain scenario after a network
>>   break between two nodes. But both nodes are visible and reachable
>>   from qdevice node.
>>   2.  The qdevice with ffsplit algorithm selects node with id 1
>>   (lowest node id) and node 1 becomes quorate.
>>   3.  Now if node 1 goes down/is not reachable from qdevice node,
>>   the node 2 becomes quorate.
>> 
>> But when node 1 becomes again reachable from qdevice , it becomes
>> quorate and node 2 again goes down. i.e The resources failback to
>> node 1.
>> 
>> Is there any way to prevent this failback.
> 
> No on the qdevice/qnetd level (or at least not generic solution - it
> is possible to set tie_breaker to node 2, so resources wouldn't be
> then shifted for this specific example, but it is probably not what
> was question about).
> 
> I can imagine to have an algorithm option so qnetd would try to keep active
> partition active as long as other requirements are fulfilled - so basically
> add one more test just before calling tie-breaker.
> 
> Such option shouldn't be hard to implement and right now I'm not
> aware about any inconsistencies which may it bring.

While this is not to bring anything new, it may be interesting to
point out the similarities to the more generic:

  "you gotta keep in about the best shape possible"

  vs.

  "oh no, you are taking it rather too literally, man,
  calm down, each transition is costly so you must
  use your head, capito?!"
  
dilemma.

We observe the same in the field of resource allocation across the
cluster: a resource is preferred to run on a particular node, but when
circumstances prevent that while allowing for alternative nodes,
the respective migration happens; when the original node is back
later on, it may be very undesirable to redraw the map once again,
i.e. preferable to tolerate suboptimality wrt. the original
predestination.  In pacemaker, this suboptimality tolerance (or
original-predestination neutralizer) is coined as the "stickiness"
property, and it depends on whether this tolerance wins over the
"strength" (score) of the resource-to-node intention.
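
(For anyone mapping that to configuration: a minimal sketch with a
made-up resource name, setting said stickiness as a meta-attribute so
that a healthy-again preferred node does not immediately pull the
resource back:

  # crm_resource --resource myResource --meta \
--set-parameter resource-stickiness --parameter-value 100

whether the 100 "wins" then depends on how it compares to the location
preference scores in play.)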

My guess is that corosync, free of such fuzzy scales of preference
strengths (when there are still competing partitions, all other
break-the-symmetry options have been exhausted already), would just
want a binary toggle: "keep the per-partition quorum decision sticky
if there's no other winner in the standard-except-tie_breaker
competition".

Also, any attempt to go further and melt strict rules into a
relativizing score system akin to pacemaker's (e.g. do not even change
the "quorate partition" decision when the winning partition has a
summary score just up to X points higher) would, at the quorum level,
effectively subvert the goals of perpetually sustainable HA.  That
cannot be said about the binary toggle, I think -- that'd be merely
a non-conflicting (as Honza also evaluated) balance tweak in
particular corner cases.

-- 
Jan (Poki)



Re: [ClusterLabs] pacemaker-controld getting respawned

2020-01-07 Thread Jan Pokorný
On 06/01/20 11:53 -0600, Ken Gaillot wrote:
> On Fri, 2020-01-03 at 13:23 +, S Sathish S wrote:
>> Pacemaker-controld process is getting restarted frequently reason for
>> failure disconnect from CIB/Internal Error (or) high cpu on the
>> system, same has been recorded in our system logs, Please find the
>> pacemaker and corosync version installed on the system.  
>>  
>> Kindly let us know why we are getting below error on the system.
>>  
>> corosync-2.4.4 -> https://github.com/corosync/corosync/tree/v2.4.4
>> pacemaker-2.0.2 ->
>> https://github.com/ClusterLabs/pacemaker/tree/Pacemaker-2.0.2

libqb version is missing (to be explained later on)

>> [root@vmc0621 ~]# ps -eo pid,lstart,cmd  | grep -iE
>> 'corosync|pacemaker' | grep -v grep
>> 2039 Wed Dec 25 15:56:15 2019 corosync
>> 3048 Wed Dec 25 15:56:15 2019 /usr/sbin/pacemakerd -f
>> 3101 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-based
>> 3102 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-fenced
>> 3103 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-execd
>> 3104 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-attrd
>> 3105 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-
>> schedulerd
>> 25371 Tue Dec 31 17:38:53 2019 /usr/libexec/pacemaker/pacemaker-
>> controld
>>  
>>  
>> In system message logs :
>>  
>> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update 4419 
>> failed: Timer expired (-62)
>> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update 4420 
>> failed: Timer expired (-62)
> 
> This means that the controller is not getting a response back from the
> CIB manager (pacemaker-based) within a reasonable time. If the DC can't
> record the status of nodes, it can't make correct decisions, so it has
> no choice but to exit (which should lead another node to fence it).

I am not sure if it would be feasible in this other, mutual daemon
relationship, but my first idea was that it might have something to
do with a deadlock-prone arrangement of priorities, akin to what
was resolved between pacemaker-fenced and pacemaker-based
(perhaps -based would be bombarding -controld with updates rather
than responding to some of its prior queries?) not too long ago:
https://github.com/ClusterLabs/pacemaker/commit/3401f25994e8cc059898550082f9b75f2d07f103

Satish, you haven't included any metrics of your cluster (node count,
resource count, load of the affected machine/all machines around the
problem occurrence), nor have you provided wider excerpts of the log.

All in all, I'd start with updating libqb to 1.9.0, which supposedly
contains https://github.com/ClusterLabs/libqb/pull/352, a fix for
the glitch concerning event priorities, just in case.

> The default timeout is the number of active nodes in the cluster times
> 10 seconds, with a minimum of 30 seconds. That's a lot of time, so I
> would be concerned if the CIB isn't responsive for that long.
> 
> The logs from pacemaker-based before this point might be helpful,
> although if it's not getting scheduled any CPU time there wouldn't be
> any indication of that.
> 
> It is possible to set the timeout explicitly using the PCMK_cib_timeout
> environment variable, but the underlying problem would be likely to
> cause other issues.
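
(For reference, on distributions where the pacemaker units source
/etc/sysconfig/pacemaker -- /etc/default/pacemaker on Debian-like
systems -- setting that would be a sketch along the lines of:

  # value presumably in seconds; a band-aid only, the root cause
  # should still be hunted down
  echo 'PCMK_cib_timeout=120' >> /etc/sysconfig/pacemaker

followed by restarting the cluster services on that node.)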

-- 
Poki



[ClusterLabs] Making xt_cluster IP load-sharing work with IPv6 (Was: Concept of a Shared ipaddress/resource for generic applicatons)[

2020-01-02 Thread Jan Pokorný
On 27/12/19 15:04 +0100, Valentin Vidić wrote:
> On Wed, Dec 04, 2019 at 02:44:49PM +0100, Jan Pokorný wrote:
>> For the record, based on my feedback, iptables-extensions man page is
>> headed to (finally) align with the actual in-kernel deprecation
>> message:
>> https://lore.kernel.org/netfilter-devel/20191204130921.2914-1-p...@nwl.cc/
> 
> From a quick run of xt_cluster it seems to be working as expected
> for IPv4

FTR, when the "netfilter"/nftables backend is available, you can either
make use of the iptables-translate conversion utility, or deduce a similar
takeaway from
https://git.netfilter.org/iptables/tree/extensions/libxt_cluster.txlate?h=v1.8.4
possibly allowing you to ditch any dependency on iptables-* tooling, and on
xt_cluster.ko just as well!

As mentioned in a newer incarnation asking about xt_cluster:
https://lists.clusterlabs.org/pipermail/users/2020-January/026718.html

for the envisioned agent, it would be a way to (optionally) allow for
a rather lightweight operation in the future (where iptables may not
get installed by default with some Linux distros at all; well, even
firewalld-as-a-middleware variant controlled just via DBus calls might
be thinkable, meaning that "nft" tool wouldn't be required, too).

> It requires iptables rules and ARP reply rewrite like:
> 
> arptables -A OUTPUT -o eth1 --h-length 6 -j mangle --mangle-mac-s 
> 01:00:5e:00:01:01

pardon my ignorance but you currently appear to be the greatest expert
with practical experience on this list regarding the topic.

* * *

1. Is this based solely on experience with xt_cluster extension that
   led you to this ARP-level rewrite unique to using netfilter backend,
   or would the same actually be needed with true CLUSTERIP target?

Actually, I took a look at the code of the CLUSTERIP extension, and it
is in fact used to do the very same ARP-level mangling, though it is
slightly more precise, akin to (with stray in-line comments):

  arptables -A OUTPUT \
--h-type 1 \  # Ethernet
--proto-type 0x800 \  # IPv4
--h-length 6 \  # perhaps redundant to --h-type?
\ # cannot express limitation on the size of network address
\ # but that would perhaps be redundant to --proto-type
--opcode 2 \  # this time for Reply
-j mangle --mangle-mac-s CLUSTERMAC

  # see also
  # 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4095ebf1e641b0f37ee1cd04c903bb85cf4ed25b
  arptables -A OUTPUT \
--h-type 1 \  # Ethernet
--proto-type 0x800 \  # IPv4
--h-length 6 \  # perhaps redundant to --h-type?
\ # cannot express limitation on the size of network address
\ # but that would perhaps be redundant to --proto-type
--opcode 1 \  # this time for Request
-j mangle --mangle-mac-s CLUSTERMAC

What you've used appears to be akin to what this chunk of manpage
suggests (amongst others):
https://git.netfilter.org/iptables/tree/extensions/libxt_cluster.man

which is (yet another) indicator to me that xt_cluster extension
doesn't carry that functionality on its own (like CLUSTERIP target
did, as mentioned).

*
* Anyway, I'd like to understand why is this necessary in the first
* place, getting to my second question.
*

2. Is the following explanation (viable as far as I can tell) correct?

That arrangement is to prevent otherwise unexpectedly leaky specific
associations (I'd call them "fixations") of the interface's true (hence
non-multicast) MAC address with the meant-to-be-shared IP address at
hand, which would cancel the effect of the link-multicasted frames (to
which at most a single recipient would respond per the firewall matching
rules), and therefore botch the "shared IP" concept altogether from
the perspective of network members that would undesirably learn
a non-multicast address association for the particular
meant-to-be-shared IP leaked like this.

*
* But it doesn't explain the suggested destination MAC renormalization
* on INPUT, which is currently yet to be heard of for our purpose...
*

3. Are, perhaps, the following plausible explanations sound?

- this is so as not to spoil the local ARP cache/dependent interactions,
  such as when actual ARP request is sent with link layer address
  identical with the particular multicast MAC in use -- likely
  feasible with the other host on the network configured the same
  way and with already mentioned source-MAC rewriting on egress
  in place (so this would actually neutralize harmful network
  effect it could otherwise be causing)

- or any other reasons?

*
* Finally, when referring to the suggestive example above, there is
* one more question to ask...
*

4. Shall not even existing IPaddr2 (whether in CLUSTERIP-based mode
   or not) actually verify that
   /proc/sys/net/netfilter/nf_conntrack_tcp_loose
   gets cleared, at least until told not to through configuration?

- looks like a good idea not to allow any after-cut packets
  interaction (would only apply to any

[ClusterLabs] [RFC] LoadSharingIP agent idea, xt_cluster/IPv6/nftables (Was: Support for xt_cluster)

2020-01-02 Thread Jan Pokorný
On 19/12/19 10:18 -0600, Ken Gaillot wrote:
> On Thu, 2019-12-19 at 15:01 +, Marcus Vinicius wrote:
>> Is there any intention to abandon CLUSTERIP
> 
> yes
> 
>>  in favor of xt_cluster.ko? 
> 
> no
> 
> :)
> 
> A recent thread about this:
> https://lists.clusterlabs.org/pipermail/users/2019-December/026663.html
> 
> resulted in a change to allow IPaddr2 clones to continue working on
> newer systems if "iptables-legacy" is available:
> https://github.com/ClusterLabs/resource-agents/pull/1439

Note also a closer look later in that thread:
https://lists.clusterlabs.org/pipermail/users/2019-December/026671.html

that eventually, on the side axis, led to an updated man page
(coming to iptables v1.8.5 or so):
https://git.netfilter.org/iptables/commit/?id=37ff0d667d346dd455e08f52a785685185f01b15

so there are no doubts about where this is headed, and shall call
for action where the risk is imminent, as also mentioned:

> tl;dr Cloned IPaddr2 is supported only on platforms that support
> CLUSTERIP, and can be considered deprecated since CLUSTERIP itself is
> deprecated. A pull request with an xt_cluster implementation would be
> very welcome, as it's a low priority for available developers.

Good news is that sinking the support for that in would be rather
non-intrusive from the operational perspective, since xt_cluster
(aliased also as ipt_cluster, for that matter) was around with
Linux kernels as old as 2.6.30[*1] -- of course, YMMV.

Another piece of good news, regarding a possible future move away from
iptables/xtables as such in the face of nftables -- a technology that
has itself been around in the Linux kernel for ages, since Linux 3.13 in
particular[*2], and is slowly getting preferred, I think -- is that you
can manually rewrite the respective firewall rules using the xt_cluster
extension statically (hence right in the agent) even if you lack
iptables-legacy on those very cluster machines.
All you need is a side (not necessarily local) installation of
iptables with the iptables-translate command working (it comes with the
iptables-nft package in a fairly recent Fedora), so that you can yield
the right incantation of the nft tool controlling said nftables;
see examples of that:
https://git.netfilter.org/iptables/tree/extensions/libxt_cluster.txlate?h=v1.8.4
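
To illustrate, feeding a rule along the lines of the libxt_cluster man
page example through that utility would look roughly like this (the
interface, mark and seed values are made up):

  iptables-translate -t mangle -A PREROUTING -i eth1 \
-m cluster --cluster-total-nodes 2 --cluster-local-node 1 \
--cluster-hash-seed 0xdeadbeef -j MARK --set-mark 0xffff

which only prints the equivalent "nft add rule ..." incantation on
stdout, without touching the live ruleset.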

Capturing generalizations of the intention-to-exact-recipe translation
logic in the agent itself would then be the best way to proceed, since
that way, no dependency on "transitional legacy stuff"
(read: unnecessary code cruft in the agent when supported in parallel)
would be needed, yay!

* * *

So, any takers? :-)

Of course, as discussed before, this overhaul would deserve a strict
separation of the function in question under a dedicated agent like
LoadSharingIP, letting the equivalent one in the IPaddr2 agent, based
on something practically deprecated on arrival, slowly die out.
The existing agent is pretty complex even without that semantically
overriding piggy-backing -- no more gravity boosting like
that, please (not to speak of its naming, which simply won't "click").

Also note there are still missing puzzle pieces regarding IPv6 support:
https://lists.clusterlabs.org/pipermail/users/2019-December/026710.html


[*1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0269ea4937343536ec7e85649932bc8c9686ea78
[*2] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=96518518cc417bb0a8c80b9fb736202e28acdf96

-- 
Jan (Poki)



Re: [ClusterLabs] Fuzzy/misleading references to "restart" of a resource

2019-12-05 Thread Jan Pokorný
On 05/12/19 10:41 +0300, Andrei Borzenkov wrote:
> On Thu, Dec 5, 2019 at 1:04 AM Jan Pokorný  wrote:
>> 
>> On 04/12/19 21:19 +0100, Jan Pokorný wrote:
>>> OTOH, this enforced split of state transitions is perhaps what makes
>>> the transaction (comprising perhaps countless other interdependent
>>> resources) serializable and thus feasible at all (think: you cannot
>>> nest any further handling -- so as to satisfy given constraints -- in
>>> between stop and start when that's an atom, otherwise), and that's
>>> exactly how, say, systemd approaches that, likely for that very reason:
>>> https://github.com/systemd/systemd/commit/6539dd7c42946d9ba5dc43028b8b5785eb2db3c5
>> 
>> Yet, systemd started to allow for certain stop-start ("restart")
>> optimizations at "stop" phase, I've just learnt:
>> https://github.com/systemd/systemd/pull/13696#discussion_r330186864
>> But it doesn't merge/atomicize the two discrete steps, still.
>> 
> 
> systemd development consists of series of ad hoc single use case
> extensions, each done completely isolated, without considering impact
> on other parts which is usually "fixed" by adding yet another ad hoc
> extension. I do not think that is the best example to follow.

Didn't mean to run into this debate; I noticed this was perhaps
mainly to satisfy their in-project services, but nonetheless, the
pragmatic value for a wide audience here is that discriminating
"why is it being stopped?" is now possible, lending itself to a
"restart optimization enabler" label should that be handy.

Re style of evolutionary additions that are perhaps too tunnel-visioned,
you'll find examples everywhere, incl. ClusterLabs/cluster projects :-)
Common problem appears to be a lack of formalized/documented enough (as
if it wasn't a proprietary knowledge but rather a fully baked
programming interface) intermediate representations (next to some
further confinements related to transitioning from one set of states to
another), easy to externalize for an immediate feedback ("state dump")
and to assess input-to-output transformation correctness (ad hoc or
unit testing) to assist thinking in both low-level isolated realms and
in the higher-level architectural perspective (how the primitive
"components" fit together).  Another way of thinking about this is
a directly observable "full state buffer", that would naturally tend
to prevent code-degrading on-the-fly and ad-hoc merging of what are
individual phases.  Without deeper knowledge admittedly, I consider
this something that, for instance, LLVM project got intriguingly and
intrinsically right, and that's perhaps where to take a better
example from.

>> OCF could possibly be amended to allow for a similar semantic
>> indication of "stop to be reversed shortly on this very node if
>> things go well" if there was a tangible use case, say using
>> "stop-with-start-pending" action instead of "stop"
>> (and the amendment possibly building on an idea of addon profiles
>> https://github.com/ClusterLabs/OCF-spec/issues/17 if there was
>> an actual infrastructure for that and not just a daydreaming).
>> 
> 
> I do not see how it is possible to shorthand resource restart. Cluster
> resource manager manages not isolated resources, but groups of
> interdependent resources. In general it is impossible to restart
> single resource without coordinate restart of multiple resources. And
> this should happen in defined order (you cannot "restart" mount point
> without stopping any user of it first).
> 
> Moreover, restart is expected to clean up resources and actually
> result in pristine state. This is implicit assumption.

I tend to agree, but I am far from being a creative author of
resource agents or a service-lifecycle-focused person.  That was more
to cater to hypothetical optimizations that were once considered; see
the referred scenario I linked up-thread:
https://github.com/ClusterLabs/OCF-spec/blob/start/resource_agent/API/02#L225
(I dare not to evaluate the value it would bring or not).

-- 
Jan (Poki)



Re: [ClusterLabs] Fuzzy/misleading references to "restart" of a resource

2019-12-04 Thread Jan Pokorný
On 04/12/19 21:19 +0100, Jan Pokorný wrote:
> OTOH, this enforced split of state transitions is perhaps what makes
> the transaction (comprising perhaps countless other interdependent
> resources) serializable and thus feasible at all (think: you cannot
> nest any further handling -- so as to satisfy given constraints -- in
> between stop and start when that's an atom, otherwise), and that's
> exactly how, say, systemd approaches that, likely for that very reason:
> https://github.com/systemd/systemd/commit/6539dd7c42946d9ba5dc43028b8b5785eb2db3c5

Yet, systemd started to allow for certain stop-start ("restart")
optimizations at "stop" phase, I've just learnt:
https://github.com/systemd/systemd/pull/13696#discussion_r330186864
But it doesn't merge/atomicize the two discrete steps, still.

OCF could possibly be amended to allow for a similar semantic
indication of "stop to be reversed shortly on this very node if
things go well" if there was a tangible use case, say using
"stop-with-start-pending" action instead of "stop"
(and the amendment possibly building on an idea of addon profiles
https://github.com/ClusterLabs/OCF-spec/issues/17 if there was
an actual infrastructure for that and not just a daydreaming).

-- 
Jan (Poki)



[ClusterLabs] Fuzzy/misleading references to "restart" of a resource (Was: When does pacemaker call 'restart'/'force-reload' operations on LSB resource?)

2019-12-04 Thread Jan Pokorný
On 04/12/19 14:53 +0900, Ondrej wrote:
> When adding 'LSB' script to pacemaker cluster I can see that
> pacemaker advertises 'restart' and 'force-reload' operations to be
> present - regardless if the LSB script supports it or not.  This
> seems to be coming from following piece of code.
> 
> https://github.com/ClusterLabs/pacemaker/blob/92b0c1d69ab1feb0b89e141b5007f8792e69655e/lib/services/services_lsb.c#L39-L40
> 
> Questions:
> 1.  When the 'restart' and 'force-reload' operations are called on
> the LSB script cluster resource?

[reordered]

> I would have expected that 'restart' operation would be called when
> using 'crm_resource --restart --resource myResource', but I can see
> that 'stop' and 'start' operations are used in that case instead.

This is due to how "crm_resource --restart" is arranged,
directly in the implementation of this CLI tool itself
(see tools/crm_resource_runtime.c:cli_resource_restart):

- first, the target-role meta-attribute for the resource is set to Stopped

- then, once the activity has settled, it is set back to the target-role
  it was originally at
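
Roughly, the manual equivalent of those two steps would be (a sketch
with a made-up resource name, assuming target-role was simply unset
originally):

  crm_resource --resource myResource --meta \
--set-parameter target-role --parameter-value Stopped
  crm_resource --wait   # let the transition settle
  crm_resource --resource myResource --meta \
--delete-parameter target-role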

Performed stepwise like this, there's no reasonably
implementable mapping back to a single step being the actual
composition (stop, start -> restart) when the plan is not shared
in full in advance (it is not) with the respective moving parts.
And there's plain common sense that would still preclude it (below).

Hence, it is in actuality a great discovery that the "restart"
triggering verb/action is in fact completely neglected and bogus when
it comes to handling by pacemaker.  If an agent implements any
optimizations (thanks to having intimate knowledge of the resource at
hand, plus knowing the before-after state combo and possibly how to
transition in one go), cluster resource management won't benefit from
that in any way.

Interestingly, such optimizations are exactly what the original
OCF draft had in mind :-)
https://github.com/ClusterLabs/OCF-spec/blob/start/resource_agent/API/02#L225
(even more interestingly, only to be reconsidered again some decades
later: https://github.com/ClusterLabs/OCF-spec/issues/10;
yeah, aren't we masters of following moving targets, to the extent that
they are sometimes contradictory?  I'd blame a desperate lack of written
[and easily obtainable] design decisions made in the past for that)

They are mandated by LSB as well, but hey, in the systemd era, we are
now _free_ to call LSB severely broken as it (shamefully, I'd say)
never even tried to accommodate proper dealing with dependency
chains (and actual serializability thereof!), as explained
in an example below.  Or, put in other words, LSB was never meant
to stand for holistic resource management, something both systemd
and pacemaker attempt to cover (single/multi-machine wide).

OTOH, this enforced split of state transitions is perhaps what makes
the transaction (comprising perhaps countless other interdependent
resources) serializable and thus feasible at all (think: you cannot
nest any further handling -- so as to satisfy given constraints -- in
between stop and start when that's an atom, otherwise), and that's
exactly how, say, systemd approaches that, likely for that very reason:
https://github.com/systemd/systemd/commit/6539dd7c42946d9ba5dc43028b8b5785eb2db3c5

So I see room for improvement here as our takeaway:

* resource agents:

  - some agents declare/implement "restart" action when there is
no practical reason to (AudibleAlarm, Xinetd, dhcpd, etc.)
[as a side note, there are nonsensical considerations, such as:
when the default "start" and "stop" timeouts for dhcpd are 20 seconds
each, how come, then, that "restart" defined as "stop; start"
would also make do with 20 seconds altogether, unless there is
some amortized work I fail to see :-)]

* pacemaker:

  - artificially generated meta-data mention "restart" action when
there is no good reason to (lib/services/services_lsb.c)

  - there are some correct clues in Pacemaker Explained, but perhaps
it is worth taking the time to emphasize that whenever "restart" is
referred to, it is never an atomic step, but always a sequence
of two steps that may be considered atomic on their own,
but possibly interleaved with other steps so as to retain
soundness wrt. the imposed constraints and/or changes made
in parallel

  - the same gist of "restart" shall be sketched in a help screen
of crm_resource
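
(To see the artificially generated advertisement in question for
yourself, something like the following should do -- the script name
is made up:

  crm_resource --show-metadata lsb:myscript

and, per the code referenced above, "restart" and "force-reload" show
up among the actions regardless of what the script actually supports.)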

> For 'force-reload' I have no idea on how to try trigger it looking
> at 'crm_resource --help' output.

Sorry, that's even more bogus, as there's no relevance whatsoever.
It needs to either be dropped from the artificially generated meta-data
as well, or investigated further whether there's any reason to make
such an operation triggerable by users, and if positive, how
much of impact spread to be expected when implemented (do the
dependent services need to be reloaded or "restarted" as well,
since the change might be non-local? any precedent there?
again, hard to 

Re: [ClusterLabs] Concept of a Shared ipaddress/resource for generic applicatons

2019-12-04 Thread Jan Pokorný
On 03/12/19 23:38 +0100, Valentin Vidić wrote:
> On Tue, Dec 03, 2019 at 11:14:41PM +0100, Jan Pokorný wrote:
>> The conclusion is hence that even with bleeding edge software
>> collection, there's no real problem in using ipt_CLUSTERIP
>> (when compiled in or alongside kernel) when a proper interface
>> is used, which may boil down to using an appropriate version of
>> iptables command.  The respective logic to select the proper one
>> could be easily extended in the IPaddr2 agent (sorry, I mis-cased
>> this name previously; in a nutshell: if there's iptables-legacy
>> command, prefer that instead), which looks far more attainable
>> than porting to xt_cluster any time soon unless there are
>> volunteers.
> 
> Indeed, I have tested with 2 nodes and TCP connections work as
> expected: packets arrive at both nodes but only one of them
> responds - sometimes the first node and sometimes the second.
> 
> For ARP both nodes respond with the same multicast MAC:
> 
> 22:33:14.231779 ARP, Request who-has 192.168.122.101 tell 192.168.122.1, 
> length 28
> 22:33:14.231833 ARP, Reply 192.168.122.101 is-at 21:53:69:51:3e:b1, length 28
> 22:33:14.231833 ARP, Reply 192.168.122.101 is-at 21:53:69:51:3e:b1, length 28
> 
>> Is there any iptables-legacy command equivalent in Debian?
> 
> Yes, iptables package in Debian installs both:
> 
>   /usr/sbin/iptables-legacy
>   /usr/sbin/iptables-nft
> 
> so the agent can be modified to prefer iptables-legacy over
> iptables.

Perfect, thanks for the affirmation, Valentin.

-> https://github.com/ClusterLabs/resource-agents/pull/1439

For the record, based on my feedback, iptables-extensions man page is
headed to (finally) align with the actual in-kernel deprecation
message:
https://lore.kernel.org/netfilter-devel/20191204130921.2914-1-p...@nwl.cc/
Woot!

-- 
Jan (Poki)



Re: [ClusterLabs] Concept of a Shared ipaddress/resource for generic applicatons

2019-12-03 Thread Jan Pokorný
On 03/12/19 23:19 +0100, Valentin Vidić wrote:
> Interesting enough, ipt_CLUSTERIP still seems to work when using
> iptables-legacy :)

5 minutes ago, in another part of this thread :)

-- 
Jan (Poki)



Re: [ClusterLabs] Concept of a Shared ipaddress/resource for generic applicatons

2019-12-03 Thread Jan Pokorný
On 03/12/19 20:38 +0100, Valentin Vidić wrote:
> On Tue, Dec 03, 2019 at 03:06:14PM +0100, Jan Pokorný wrote:
>> You likely refer to
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=43270b1bc5f1e33522dacf3d3b9175c29404c36c
>> 
>> however this extension is activelly maintained to this day, so don't
>> see any immediate risks other than something related to containers
>> as referred to from said commit -- that is good to know about in
>> such scenarios nonetheless.
>> 
>> My up2date Fedora Rawhide iptables installation, or rather its
>> iptables-extensions(8) man page does not mention any deprecation
>> at all (unlike with ULOG extension).
>> 
>> OTOH, what may be a true show stopper is a support for IPv4 only,
>> which xt_cluster seems to rectify.
> 
> The module might still work but the iptables command from the agent fails:
> 
> [  842.536916] ipt_CLUSTERIP: ClusterIP Version 0.8 loaded successfully
> [  842.539215] ipt_CLUSTERIP: cannot use CLUSTERIP target from nftables compat
> 
> # uname -a
> Linux sid 5.3.0-2-amd64 #1 SMP Debian 5.3.9-3 (2019-11-19) x86_64 GNU/Linux
> 
> # iptables --version
> iptables v1.8.3 (nf_tables)

Hm, you made me feel so far behind in this area :-/

Actually, thank you, since this gets to the meat of it --
a different story than looming proclamations :-)

TBH, there's so much going on in Fedora in the firewall area that
I momentarily thought I was completely drowned -- not covered by
firewall at all (per casual "iptables -nvL", haha):
https://fedoraproject.org/wiki/Changes/firewalld_default_to_nftables

So, in order to reproduce your situation, I had to install something
that comes as "iptables-nft" in Fedora Rawhide, with the command
being canonically named iptables-nft:

 # uname -srvmp
 > Linux 5.4.0-0.rc6.git2.1.fc32.x86_64 #1 SMP Thu Nov 7 16:31:36 UTC
 > 2019 x86_64 x86_64

 # modprobe ipt_CLUSTERIP
 dmesg <- "ipt_CLUSTERIP: ClusterIP Version 0.8 loaded successfully"

 # iptables-nft -I INPUT -i lo -d 127.0.0.200 -j CLUSTERIP --new \
   --hashmode sourceip-sourceport --clustermac 01:00:5E:00:00:c8 \
   --total-nodes 2 --local-node 1
 > iptables v1.8.3 (nf_tables):  RULE_INSERT failed (Operation not
 > supported): rule in chain INPUT
 dmesg <- "pt_CLUSTERIP: cannot use CLUSTERIP target from nftables
   compat"

 where --version matches your standard "iptables"

 # iptables-nft --version
 > iptables v1.8.3 (nf_tables)

However!

 # readlink -f /usr/sbin/iptables{,-legacy}
 > /usr/sbin/xtables-legacy-multi
 > /usr/sbin/xtables-legacy-multi

 # iptables --version
 > iptables v1.8.3 (legacy)

 # iptables -I INPUT -i lo -d 127.0.0.200 -j CLUSTERIP --new \
   --hashmode sourceip-sourceport --clustermac 01:00:5E:00:00:20 \
   --total-nodes 2 --local-node 1
 # echo $?
 > 0
 dmesg <- "ipt_CLUSTERIP: ipt_CLUSTERIP is deprecated and it will
   removed soon, use xt_cluster instead"

And it even works (hypothesis: there will be a roughly 50% probability
that the "virtual IP" 127.0.0.200 will be unavailable, as long as the
source port randomization on the client side is fair)!

 # mkdir /tmp/hello; touch hello.world; python3 -m http.server&
 CMD1> Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/)

 # curl -m1 127.0.0.200:8000
 CMD1> 127.0.0.1 - - [03/Dec/2019 22:55:46] "GET / HTTP/1.1" 200 -
 CMD2> [HTML response elided: the directory listing for /, containing hello.world]

 # curl -m1 127.0.0.200:8000
 CMD2> curl: (28) Connection timed out after 1001 milliseconds
 # fg
 > python3 -m http.server
 ^C
 > Keyboard interrupt received, exiting

Ok, it is relatively fair, so the hypothesis holds based on
these empirical data.

The conclusion is hence that even with a bleeding-edge software
collection, there's no real problem in using ipt_CLUSTERIP
(when compiled in or alongside the kernel) when a proper interface
is used, which may boil down to using an appropriate version of
the iptables command.  The respective logic to select the proper one
could easily be added to the IPaddr2 agent (sorry, I mis-cased
this name previously; in a nutshell: if there's an iptables-legacy
command, prefer that instead), which looks far more attainable
than porting to xt_cluster any time soon unless there are
volunteers.
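
(Sketched naively, that agent-side selection could be as simple as:

  if command -v iptables-legacy >/dev/null 2>&1; then
  IPTABLES=iptables-legacy
  else
  IPTABLES=iptables
  fi

with every subsequent invocation going through "$IPTABLES" -- untested,
just to show the idea.)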

Is there any iptables-legacy command equivalent in Debian?

What I want to demonstrate with this is that no bridge appears
to be burnt yet, regardless of any declarations and worries.

Once again, thanks for pushing for this in-depth analysis,
so we could gain more knowledge on the matter.

-- 
Jan (Poki)



Re: [ClusterLabs] Concept of a Shared ipaddress/resource for generic applicatons

2019-12-03 Thread Jan Pokorný
On 02/12/19 09:50 -0600, Ken Gaillot wrote:
> On Sat, 2019-11-30 at 18:58 +0300, Andrei Borzenkov wrote:
>> 29.11.2019 17:46, Jan Pokorný wrote:
>>> "Clone" feature for IPAddr2 is actually sort of an overloading that
>>> agent with an alternative functionality -- trivial low-level load
>>> balancing.  You can ignore that if you don't need any such.
>>> 
>> 
>> I would say IPaddr2 in clone mode does something similar to
>> SharedAddress.
> 
> Just a side note about something that came up recently:
> 
> IPaddr2 cloning utilizes the iptables "clusterip" feature, which has
> been deprecated in the Linux kernel since 2015. IPaddr2 cloning
> therefore must be considered deprecated as well. (Using it for a single
> floating IP is still fully supported.)
> 
> IPaddr2 could be modified to use a newer iptables capability called
> "xt_cluster", but someone would have to volunteer to do that as it's
> not a priority.

You likely refer to

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=43270b1bc5f1e33522dacf3d3b9175c29404c36c

however this extension is actively maintained to this day, so I don't
see any immediate risks other than something related to containers,
as referred to from said commit -- that is good to know about in
such scenarios nonetheless.

My up2date Fedora Rawhide iptables installation, or rather its
iptables-extensions(8) man page does not mention any deprecation
at all (unlike with ULOG extension).

OTOH, what may be a true show stopper is its support for IPv4 only,
which xt_cluster seems to rectify.

-- 
Jan (Poki)



Re: [ClusterLabs] Concept of a Shared ipaddress/resource for generic applicatons

2019-11-29 Thread Jan Pokorný
On 27/11/19 20:13 +, matt_murd...@amat.com wrote:
> I finally understand that there is a Apache Resource for Pacemaker
> that assigns a single virtual ipaddress that "floats" between two
> nodes as in webservers.
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_administration/ch-service-haaa
> 
> Can generic applications use this same resource type or is there an
> API to use to create a floating ip or  a generic resource to use?

Be assured generic applications can use the same "floating IP
address", effectively IPAddr2 resource (particular instances
thereof) to their own benefit.

The only crux that can be seen here stems from loose started/stopped
kind of inter-resource (putting resource-node relations aside)
integration facilitated by pacemaker, imposing these constraints
for a generic/convenient use case:

  1. the resource relying on the particular floating IP instance
 needs to start at a later moment than this IP instance
 (it could otherwise miss listening on the newly/at a later
 point appearing IP; see the constraint sketch after this list)

  2. the resource relying on the particular floating IP instance,
 in order to retain enough configuration flexibility, must
 _not_ be restricted regarding where to listen (bind the
 server side socket) at, for several reasons:

 - different names of the interfaces across the nodes
   to appoint the "externally provided service" network

 - fragile, possibly downright cluster-management hidden
   interdependencies whereby two parts of the overall
   configuration must exactly agree on the address to bind at,
   for given point in time

 apparently, this approach is suboptimal for constraining
 the allowed data paths (think, information security)
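
For the ordering part (item 1. above), the usual remedy is to encode it
explicitly; a minimal sketch with made-up resource names, using pcs
(putting both resources into a group achieves the same start-after
relationship):

  pcs constraint order start floatingIP then start myApp
  pcs constraint colocation add myApp with floatingIP INFINITY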

Btw. As a slight paradox, rgmanager used for HA resource clustering
in RHEL in the past allowed for natural resource "composability"
(combinability, stackability) with a straightforward propagation of
configuration values in the hierarchical composition of the resources
-- it would then be enough if a floating IP dependent resource
explicitly referred to IP to be borrowed from the cluster-tracked
configuration of its hierarchically preceding "virtual IP" resource
instance, avoiding thus hindrance stated in item 2. (item 1. remains
rather universal).  (Similar thing can, of course, be emulated with
explicit pacemaker configuration introspection in the agent itself,
however it feels rather dirty [resource agents would preferably avoid
back-channel introspection of their own runners as much as possible,
it'd break the rule of loose one-way coupling] in comparison to said
built-in mechanism.) [*]

> In other HA system, Oracle Solaris Cluster, HPUX Service Guard, IBM
> PowerHA, they provide a "SharedAddress" resource type for
> applications to use.

I suppose our ocf:heartbeat:IPAddr2 resource agent is a direct
equivalent, but don't have the knowledge of these other products.

> I am getting confused by the Clone feature, the virtualized feature,
> and now the Apache resource as to how they all differ.

"Clone" feature for IPAddr2 is actually sort of an overloading that
agent with an alternative functionality -- trivial low-level load
balancing.  You can ignore that if you don't need any such.

Regarding "virtualized", virtual and floating are being used
interchangably to refer to said "IP address" resource agent
instances.

Of course, you then have various other contexts of "virtualized",
you can have virtual machines (VMs) as resources managed by pacemaker,
your cluster can consist of a set of VMs rather than set of bare
metal machines, "remote" instances of pacemaker can be detached
in VMs, and so forth.

> If this isn't right group, let me know, and be kind, im just trying
> to get something working and make recommendations to my developers.

This venue is spot-on, welcome :-)


[*] I've touched this topic slightly in the past:

https://github.com/ClusterLabs/resource-agents/issues/1304#issuecomment-473525495
https://github.com/ClusterLabs/OCF-spec/issues/22

-- 
Jan (Poki)



Re: [ClusterLabs] ftime-imposed problems and compatibility going forward (Was: Final Pacemaker 2.0.3 release now available)

2019-11-26 Thread Jan Pokorný
On 26/11/19 20:30 +0100, Jan Pokorný wrote:
> And yet another, this time late, stay-compatible type of problem with
> pacemaker vs. concurrent changes in build dependencies has popped up,
> this time with Inkscape.  Whenever its 1.0 version is released,
> when with my ask https://github.com/ClusterLabs/pacemaker/pull/1953

sorry, https://gitlab.com/inkscape/inbox/issues/1244, apparently

> refused, the just released version (and equally the older ones) will
> choke when documentation gets prepared (corollary of the respective
> packages incl. Inkscape being around, to begin with).
> 
> Workaround is then to either disable that step of the build process:
> 
>   ./configure --with-brand=
>   ("equals nothing")
> 
> or to apply this patch:
> 
>   https://github.com/ClusterLabs/pacemaker/pull/1953

-- 
Jan (Poki)



Re: [ClusterLabs] ftime-imposed problems and compatibility going forward (Was: Final Pacemaker 2.0.3 release now available)

2019-11-26 Thread Jan Pokorný
On 26/11/19 16:08 +0100, Jan Pokorný wrote:
> On 25/11/19 20:32 -0600, Ken Gaillot wrote:
>> The final release of Pacemaker version 2.0.3 is now available at:
>> 
>> https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-2.0.3
> 
> For downstream and individual builders of the codebase
> ==
> 
> ... let it be known that we suffered some late moment difficulties
> imposed by the fact that upcoming glibc (allegedly targeting v2.31)
> will make it even more tough to use ftime(3) function (otherwise
> marked deprecated for ages, including in POSIX specifications)
> in the face of our purposefully imposed -Werror, turning the respective
> warning into an error and consequently failing the out-of-the-box
> build process.  Due to this timing, the effect of the fix was kept
> minimal for the time being, mostly allowing for a smooth opt-in
> continuity down the road.

And yet another, this time late, stay-compatible type of problem with
pacemaker vs. concurrent changes in build dependencies has popped up,
this time with Inkscape.  Whenever its 1.0 version is released,
and with my ask https://github.com/ClusterLabs/pacemaker/pull/1953
refused, the just released version (and equally the older ones) will
choke when documentation gets prepared (corollary of the respective
packages incl. Inkscape being around, to begin with).

The workaround is then either to disable that step of the build process:

  ./configure --with-brand=
  ("equals nothing")

or to apply this patch:

  https://github.com/ClusterLabs/pacemaker/pull/1953

-- 
Jan (Poki)


pgpN4pdXOXrFV.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] ftime-imposed problems and compatibility going forward (Was: Final Pacemaker 2.0.3 release now available)

2019-11-26 Thread Jan Pokorný
On 25/11/19 20:32 -0600, Ken Gaillot wrote:
> The final release of Pacemaker version 2.0.3 is now available at:
> 
> https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-2.0.3

For downstream and individual builders of the codebase
=======================================================

... let it be known that we suffered some late moment difficulties
imposed by the fact that upcoming glibc (allegedly targeting v2.31)
will make it even more tough to use ftime(3) function (otherwise
marked deprecated for ages, including in POSIX specifications)
in the face of our purposefully imposed -Werror, turning the respective
warning into an error and consequently failing the out-of-the-box
build process.  Due to this timing, the effect of the fix was kept
minimal for the time being, mostly allowing for a smooth opt-in
continuity down the road.

While this opens a broader topic (using anything but monotonic time
for measuring/supervising timeouts is, on its own, a killer of the
naturally expected HA robustness; we currently do not have the capacity
for a fully detailed review of that, with one of the possible outcomes
being either ceasing, for good, the support for platforms where such
time measuring is unattainable, or at least requiring the build
conductors to willingly accept the responsibility with something like
an extra HA_I_TOLERATE_RISKY_TIMERS=1 environment variable required
in the build environment [*1]), there are two things you may want to
pay attention to with said tagged v2.0.3 release:


   A. in case you run into said out-of-the-box build issues due to
  the ftime(3) invocation (currently known to happen, e.g.,
  with Fedora Rawhide), the workaround is along the lines of

  env CPPFLAGS="$CPPFLAGS -UPCMK_TIME_EMERGENCY_CGT" ./configure


   B. the same trick, possibly in a slightly extended form

  env CPPFLAGS="$CPPFLAGS -UPCMK_TIME_EMERGENCY_CGT" \
  ac_cv_header_sys_timeb_h=no ./configure

  will allow you to check _forward_ compatibility --- statically
  at build time and possibly even at run time when the respective
  log-related behaviours of pacemaker_remoted get tested
  afterwards --- with a switch to a state where ftime(3) is
  dropped (limited to cases whereby `configure` script will
  report: `checking whether CLOCK_MONOTONIC is declared... yes`);
  it is hence strongly recommended to test that in particular
  variations of OS & libc of interest _ahead of time_ to avoid
  bad surprises in 2.0.4+ (especially if release candidates
  chronically miss the interest of some minor combos that are
  possibly under compatibility risks the most[*2])
  

[*1] one more reason _not_ to make any compromises regarding fencing
 at this current potentially suboptimal status quo, see the other
 sub-thread by Digimer
  
[*2] speaking of that, it would be nice to receive some feedback on the
 usefulness of these officially marked release candidates for
 early testing, and thus obtain rough stats on how these get used
 at all, since I am pessimistic these would get as much attention
 as we would ideally hope for


Full context at the respective PR (especially last two comments):
https://github.com/ClusterLabs/pacemaker/pull/1943


Hope this helps to get ready for the next release; feedback,
if any, can influence the direction in this matter.

-- 
Jan (Poki)


pgpkVnXfA9Fo2.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Pacemaker 2.0.3-rc3 now available

2019-11-15 Thread Jan Pokorný
On 14/11/19 14:54 +0100, Jan Pokorný wrote:
> On 13/11/19 17:30 -0600, Ken Gaillot wrote:
>> A longstanding pain point in the logs has been improved. Whenever the
>> scheduler processes resource history, it logs a warning for any
>> failures it finds, regardless of whether they are new or old, which can
>> confuse anyone reading the logs. Now, the log will contain the time of
>> the failure, so it's obvious whether you're seeing the same event or
>> not.
> 
> Just curious, how sensitive is this to time shifts, e.g. timezone
> related?  If it is (human/machine can be unable to match the same event
> reported back then and now in a straightforward way, for say time zone
> transition in between), considering some sort of rather unique
> identifier would be a more systemic approach for an event matching
> in an invariant manner, but then would we need some notion of
> monotonous cluster-wide sequence ordering?
  ^^

Sorry, monotonic (i.e. in a mathematical sense)

-- 
Jan (Poki)


pgpbC2dR0AYgT.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] Documentation driven predictability of pacemaker commands (Was: Pacemaker 2.0.3-rc3 now available)

2019-11-15 Thread Jan Pokorný
On 14/11/19 10:16 -0600, Ken Gaillot wrote:
> On Thu, 2019-11-14 at 14:54 +0100, Jan Pokorný wrote:
>> On 13/11/19 17:30 -0600, Ken Gaillot wrote:
>>> This fixes some more minor regressions in crm_mon introduced in
>>> rc1.  Additionally, after feedback from this list, the new output
>>> format options were shortened. The help is now:
>>> 
>>> Output Options:
>>>   --output-as=FORMAT   Specify output format as one of: console 
>>> (default), html, text, xml
>>>   --output-to=DEST Specify file name for output (or "-" for stdout)
>>>   --html-cgi   Add CGI headers (requires --output-as=html)
>>>   --html-stylesheet=URI  Link to an external stylesheet (requires 
>>> --output-as=html)
>>>   --html-title=TITLE   Specify a page title (requires --output-as=html)
>>>   --text-fancy Use more highly formatted output
>>> (requires --output-as=text)
>> 
>> Wearing the random user's shoes, this leaves more questions behind
>> than it answers:
>> 
>> * what's the difference between "console" and "text", is any of these
>>   considered more stable in time than the other?
> 
> Console is crm_mon's interactive curses interface

With said possible explanation in terms of equivalence, I rather meant
whether the textual representation shall be assumed the same as with
"text" from get go, implying, amongst others, --text-fancy being
supported etc. (and if not, why not when ncurses just assumes full,
interim control over the terminal screen, as a recyclable canvas to
output to rather than outputting just one-way only raw lines with
"text"; granted, it can moreover respond interactively to key
presses, but for passive observation only).

I.e. the documentation shall state explicitly something like:

> "console" output is stream-lined alternative to "watch crm_mon -1"
> (hence reusing "text" output format) that makes do without excessive
> polling and allows the output to be refined interactively, with
> respective output manipulations mapped to these key presses:
> [keyboard control explaination]

Since I already touched that, on the other hand, it's not clear why,
for instance, "non-console" outputs could not ask for a password
interactively (given the interactive use is detected, see below).

Keeping user's hat on, there's more behavioral aspects that'd be
good to be able to learn about ahead of time:

- that the default mode doesn't adapt to whether there's indeed a terminal
  at the output, as opposed to today's habitual and de facto assumed
  conditionalizing per isatty(3) (see the sketch after this list), so one
  cannot just naively pipe "endless" crm_mon output to a file/further
  pipeline without full output qualification (it may be understood like
  that from the current help, although experience built with other SW
  will suggest it just doesn't state the full reality for brevity) --
  conversely, it could be modified to behave per this expectation

- what can we expect from the output using a demonstrative example
  right in the man page (see, e.g., how "systemctl status" output
  is documented)

- on top of the previous, what exactly do we gain from appending
  --text-fancy?  sadly, I observed no difference in a basic
  use case
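
To illustrate the isatty(3)-style expectation from the first item above,
here is a tiny wrapper sketch (nothing official; crm_mon's flags are real,
the wrapper itself is made up, merely showing the habitual pattern other
tools implement internally):

  if [ -t 1 ]; then
      crm_mon      # stdout is a terminal: interactive/curses behaviour
  else
      crm_mon -1   # piped/redirected: one-shot plain text, safe to capture
  fi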

>>   - do I understand it correctly that the password, when needed for
>> remote CIB access, will only be prompted for "console"?
>> 
>> * will --text-fancy work with --output-as=console?
>>   the wording seems to suggest it won't
>> 
>> Doubtless explorability seems to be put on the backburner in
>> favour of straightforward wiring front-end to convoluted logic in
>> the command's back-end rather than going backwards, from
>> simple-to-understand behaviours down to the logic itself.
>> 
>> E.g., it may be easier to follow when there is a documented
>> equivalence of "--output-as=console" and "--output-as=text
>> --text-password-prompt-allowed", assuming a new text-output
>> specific switch that is not enabled by default otherwise.

-- 
Jan (Poki)


pgpddN_4_KzQF.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Pacemaker 2.0.3-rc3 now available

2019-11-14 Thread Jan Pokorný
On 13/11/19 17:30 -0600, Ken Gaillot wrote:
> This fixes some more minor regressions in crm_mon introduced in rc1.
> Additionally, after feedback from this list, the new output format
> options were shortened. The help is now:
> 
> Output Options:
>   --output-as=FORMAT   Specify output format as one of: console 
> (default), html, text, xml
>   --output-to=DEST Specify file name for output (or "-" for stdout)
>   --html-cgi   Add CGI headers (requires --output-as=html)
>   --html-stylesheet=URI  Link to an external stylesheet (requires 
> --output-as=html)
>   --html-title=TITLE   Specify a page title (requires --output-as=html)
>   --text-fancy Use more highly formatted output (requires 
> --output-as=text)

Wearing the random user's shoes, this leaves more questions behind
than it answers:

* what's the difference between "console" and "text", is any of these
  considered more stable in time than the other?

  - do I understand it correctly that the password, when needed for
remote CIB access, will only be prompted for "console"?

* will --text-fancy work with --output-as=console?
  the wording seems to suggest it won't

Doubtless explorability seems to be put on the backburner in
favour of straightforward wiring front-end to convoluted logic
in the command's back-end rather than going backwards, from
simple-to-understand behaviours down to the logic itself.

E.g., it may be easier to follow when there is
a documented equivalence of "--output-as=console" and
"--output-as=text --text-password-prompt-allowed",
assuming a new text-output specific switch that is
not enabled by default otherwise.

> A longstanding display issue in crm_mon has been fixed. The disabled
> and blocked resources count was previously incorrect. The new, accurate
> count changes the text from "resources" to "resource instances",
> because individual instances of a cloned or bundled resource can be
> blocked. For example, if you have one regular resource, and a cloned
> resource running on three nodes, it would count as 4 resource
> instances.
> 
> A longstanding pain point in the logs has been improved. Whenever the
> scheduler processes resource history, it logs a warning for any
> failures it finds, regardless of whether they are new or old, which can
> confuse anyone reading the logs. Now, the log will contain the time of
> the failure, so it's obvious whether you're seeing the same event or
> not.

Just curious, how sensitive is this to time shifts, e.g. timezone
related?  If it is (human/machine can be unable to match the same event
reported back then and now in a straightforward way, for say time zone
transition in between), considering some sort of rather unique
identifier would be a more systemic approach for an event matching
in an invariant manner, but then would we need some notion of
monotonous cluster-wide sequence ordering?

> The log will also contain the exit reason if one was provided by the
> resource agent, for easier troubleshooting.

-- 
Jan (Poki)


pgpoEZgUpnkkn.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] -INFINITY location constraint not honored?

2019-10-18 Thread Jan Pokorný
On 18/10/19 17:59 +0200, Jan Pokorný wrote:
> On 18/10/19 11:21 +0300, Andrei Borzenkov wrote:
>> According to it, you have symmetric cluster (and apparently made
>> typo trying to change it)
>> 
>>   <nvpair id="..." name="symmetric-cluster" value="true"/>
>>   <nvpair id="..." name="symmectric-cluster" value="true"/>
> 
> Great spot, demonstrating how multi-faceted any question/case can
> become.  We shall rather make something about it, this is _not_
> sustainable.

(e.g. make a comment about it so as to do something about it...
happy Friday :-)

-- 
Jan (Poki)


pgpZsCLKKs1kt.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] -INFINITY location constraint not honored?

2019-10-18 Thread Jan Pokorný
On 18/10/19 11:21 +0300, Andrei Borzenkov wrote:
> According to it, you have symmetric cluster (and apparently made
> typo trying to change it)
> 
>   <nvpair id="..." name="symmetric-cluster" value="true"/>
>   <nvpair id="..." name="symmectric-cluster" value="true"/>

Great spot, demonstrating how multi-faceted any question/case can
become.  We shall rather make something about it, this is _not_
sustainable.

-- 
Jan (Poki)


pgpwr9nIRCjp5.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] -INFINITY location constraint not honored?

2019-10-17 Thread Jan Pokorný
On 17/10/19 08:22 +0200, Raffaele Pantaleoni wrote:
> I'm rather new to Pacemaker, I'm performing early tests on a set of
> three virtual machines.
> 
> I am configuring the cluster in the following way:
> 
> 3 nodes configured
> 4 resources configured
> 
> Online: [ SRVDRSW01 SRVDRSW02 SRVDRSW03 ]
> 
>  ClusterIP  (ocf::heartbeat:IPaddr2):   Started SRVDRSW01
>  CouchIP(ocf::heartbeat:IPaddr2):   Started SRVDRSW03
>  FrontEnd   (ocf::heartbeat:nginx): Started SRVDRSW01
>  ITATESTSERVER-DIP  (ocf::nodejs:pm2):  Started SRVDRSW01
> 
> with the following constraints:
> 
> Location Constraints:
>   Resource: ClusterIP
> Enabled on: SRVRDSW01 (score:200)
> Enabled on: SRVRDSW02 (score:100)
>   Resource: CouchIP
> Enabled on: SRVRDSW02 (score:10)
> Enabled on: SRVRDSW03 (score:100)
> Disabled on: SRVRDSW01 (score:-INFINITY)
>   Resource: FrontEnd
> Enabled on: SRVRDSW01 (score:200)
> Enabled on: SRVRDSW02 (score:100)
>   Resource: ITATESTSERVER-DIP
> Enabled on: SRVRDSW01 (score:200)
> Enabled on: SRVRDSW02 (score:100)
> Ordering Constraints:
>   start ClusterIP then start FrontEnd (kind:Mandatory)
>   start ClusterIP then start ITATESTSERVER-DIP (kind:Mandatory)
> Colocation Constraints:
>   FrontEnd with ClusterIP (score:INFINITY)
>   FrontEnd with ITATESTSERVER-DIP (score:INFINITY)
> 
> I've configured the cluster with an opt in strategy using the following
> commands:
> 
> crm_attribute --name symmectric-cluster --update false
> 
> pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=172.16.10.126
> cidr_netmask=16 op monitor interval=30s
> pcs resource create CouchIP ocf:heartbeat:IPaddr2 ip=172.16.10.128
> cidr_netmask=16 op monitor interval=30s
> pcs resource create FrontEnd ocf:heartbeat:nginx
> configfile=/etc/nginx/nginx.conf
> pcs resource create ITATESTSERVER-DIP ocf:nodejs:pm2 user=ITATESTSERVER
> --force
> 
> pcs constraint colocation add FrontEnd with ClusterIP INFINITY
> pcs constraint colocation add FrontEnd with ITATESTSERVER-DIP INFINITY
> 
> pcs constraint order ClusterIP then FrontEnd
> pcs constraint order ClusterIP then ITATESTSERVER-DIP
> 
> pcs constraint location ClusterIP prefers SRVRDSW01=200
> pcs constraint location ClusterIP prefers SRVRDSW02=100
> 
> pcs constraint location CouchIP prefers SRVRDSW02=10
> pcs constraint location CouchIP prefers SRVRDSW03=100
> 
> pcs constraint location FrontEnd prefers SRVRDSW01=200
> pcs constraint location FrontEnd prefers SRVRDSW02=100
> 
> pcs constraint location ITATESTSERVER-DIP prefers SRVRDSW01=200
> pcs constraint location ITATESTSERVER-DIP prefers SRVRDSW02=100
> 
> Everything seems to be ok but when I put vm 02 and 03 in standby I'd expect
> CouchIP not to be assigned to vm 01 because of the constraint.
> 
> The IPaddr2 resource gets assigned to vm 01 no matter what.
> 
> Node SRVDRSW02: standby
> Node SRVDRSW03: standby
> Online: [ SRVDRSW01 ]
> 
> Full list of resources:
> 
>  ClusterIP  (ocf::heartbeat:IPaddr2):   Started SRVDRSW01
>  CouchIP(ocf::heartbeat:IPaddr2):   Started SRVDRSW01
>  FrontEnd   (ocf::heartbeat:nginx): Started SRVDRSW01
>  ITATESTSERVER-DIP  (ocf::nodejs:pm2):  Started SRVDRSW01
> 
> crm_simulate -sL returns the following
> 
> ---cut---
> 
> native_color: CouchIP allocation score on SRVDRSW01: 0
> native_color: CouchIP allocation score on SRVDRSW02: 0
> native_color: CouchIP allocation score on SRVDRSW03: 0
> 
> ---cut---
> Why is that? I have explicitly assigned -INFINITY to CouchIP resource
> related to node SRVDRSW01 (as stated by pcs constraint: Disabled on:
> SRVRDSW01 (score:-INFINITY) ).
> What am I missing or doing wrong?

I am not that deep into these relationships, proper design
documentation with guided examples is non-existent[*].

But it occurs to me that the situation might be the inverse of what's
been confusing for typical opt-out clusters:

https://lists.clusterlabs.org/pipermail/users/2017-April/005463.html

Have you tried avoiding:

>   Resource: CouchIP
> Disabled on: SRVRDSW01 (score:-INFINITY)

altogether, since when such an explicit edge is missing, the implicit
"cannot" (for an opt-in cluster) would apply anyway, and perhaps even
more reliably, then?
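
If you go that route, a sketch of the cleanup could look like this
(note your original crm_attribute invocation had the property name
misspelled, hence the cluster effectively stayed symmetric):

  crm_attribute --name symmetric-cluster --update false
  pcs constraint --full        # note the id of the "Disabled on" entry
  pcs constraint remove <constraint-id-from-previous-output>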


[*] as far as I know, except for
https://wiki.clusterlabs.org/w/images/a/ae/Ordering_Explained_-_White.pdf
https://wiki.clusterlabs.org/w/images/8/8a/Colocation_Explained_-_White.pdf
probably not revised for giving a truthful model in all details for years,
and not demonstrating the effect of symmetry requested or avoided within
the cluster on those, amongst others
(shameless plug: there will be such coverage for upcoming group based
access control addition [those facilities haven't been terminated in
back-end so far] as a first foray in this area -- also the correct
understanding is rather important here)

-- 
Jan (Poki)


pgpbiSVttiBML.pgp
Description: PGP signature

Re: [ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-09 Thread Jan Pokorný
On 09/10/19 09:58 +0200, Kadlecsik József wrote:
> The nodes in our cluster have got backend and frontend interfaces: the 
> former ones are for the storage and cluster (corosync) traffic and the 
> latter ones are for the public services of KVM guests only.
> 
> One of the nodes has got a failure ("watchdog: BUG: soft lockup - CPU#7 
> stuck for 23s"), which resulted that the node could process traffic on the 
> backend interface but not on the fronted one. Thus the services became 
> unavailable but the cluster thought the node is all right and did not 
> stonith it. 
> 
> How could we protect the cluster against such failures?
> 
> We could configure a second corosync ring, but that would be a redundancy 
> ring only.
> 
> We could setup a second, independent corosync configuration for a second 
> pacemaker just with stonith agents. Is it enough to specify the cluster 
> name in the corosync config to pair pacemaker to corosync? What about the 
> pairing of pacemaker to this corosync instance, how can we tell pacemaker 
> to connect to this corosync instance?

Such pairing happens on a system-wide singleton basis via a Unix socket.
IOW, two instances of corosync on the same machine would
apparently conflict -- only a single daemon can run at a time.

> Which is the best way to solve the problem? 

Looks like heuristics of corosync-qdevice that would ping/attest your
frontend interface could be a way to go.  You'd need an additional
host in your setup, though.
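
Roughly, and purely as an untested sketch (host and address below are
placeholders; corosync-qdevice(8) documents the authoritative syntax),
the corosync.conf side of it could look like:

  quorum {
      provider: corosync_votequorum
      device {
          model: net
          net {
              host: qnetd-host.example.com
          }
          heuristics {
              mode: on
              exec_check_frontend: /bin/ping -c 1 -w 1 192.0.2.1
          }
      }
  }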

-- 
Poki


pgpZKhjeAe4it.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Ocassionally IPaddr2 resource fails to start

2019-10-07 Thread Jan Pokorný
Donat,

On 07/10/19 09:24 -0500, Ken Gaillot wrote:
> If this always happens when the VM is being snapshotted, you can put
> the cluster in maintenance mode (or even unmanage just the IP
> resource) while the snapshotting is happening. I don't know of any
> reason why snapshotting would affect only an IP, though.

it might be interesting if you could share the details to grow the
shared knowledge and experience in case there are some instances of
these problems reported in the future.

In particular, it'd be interesting to hear:

- hypervisor

- VM OS + if plain oblivious to running virtualized,
  or "the optimal arrangement" (e.g., specialized drivers, virtio,
  "guest additions", etc.)

(I think IPaddr2 is iproute2-only, hence in turn, VM OS must be Linux)

Of course, there might be more specific things to look at if anyone
here is an expert with particular hypervisor technology and the way
the networking works with it (no, not me at all).

-- 
Poki


pgphleQBrgswu.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Apache doesn't start under corosync with systemd

2019-10-07 Thread Jan Pokorný
On 07/10/19 10:27 -0500, Ken Gaillot wrote:
> Additionally, your pacemaker configuration is using the apache OCF
> script, so the cluster won't use /etc/init.d/apache2 at all (it invokes
> the httpd binary directly).

See my parallel response.

> Keep in mind that the httpd monitor action requires the status
> module to be enabled -- I assume that's already in place.

(For SUSE, it looks like it used to be injected.)

On-topic, spotted a small discrepancy around that part:
https://github.com/ClusterLabs/resource-agents/pull/1414

-- 
Poki


pgpiMoju8oq6D.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Apache doesn't start under corosync with systemd

2019-10-07 Thread Jan Pokorný
On 04/10/19 14:10 +, Reynolds, John F - San Mateo, CA - Contractor wrote:
> I've just upgraded a two-node active-passive cluster from SLES11 to
> SLES12.  This means that I've gone from /etc/init.d scripts to
> systemd services.
> 
> On the SLES11 server, this worked:
> 
>  type="apache">
>   
>  id="ncoa_apache-instance_attributes-configfile"/>
>   
>   
>  id="ncoa_apache-monitor-40s"/>
>   
> 
> 
> I had to tweak /etc/init.d/apache2 to make sure it only started on the active 
> node, but that's OK.
> 
> On the SLES12 server, the resource is the same:
> 
>  type="apache">
>   
>  id="ncoa_apache-instance_attributes-configfile"/>
>   
>   
>  id="ncoa_apache-monitor-40s"/>
>   
> 
> 
> and the cluster believes the resource is started:
> 
> 
> eagnmnmep19c1:/var/lib/pacemaker/cib # crm status
> Stack: corosync
> Current DC: eagnmnmep19c0 (version 1.1.16-4.8-77ea74d) - partition with quorum
> Last updated: Fri Oct  4 09:02:52 2019
> Last change: Thu Oct  3 10:55:03 2019 by root via crm_resource on 
> eagnmnmep19c0
> 
> 2 nodes configured
> 16 resources configured
> 
> Online: [ eagnmnmep19c0 eagnmnmep19c1 ]
> 
> Full list of resources:
> 
> Resource Group: grp_ncoa
>   (edited out for brevity)
>  ncoa_a05shared (ocf::heartbeat:Filesystem):Started eagnmnmep19c1
>  IP_56.201.217.146  (ocf::heartbeat:IPaddr2):   Started eagnmnmep19c1
>  ncoa_apache(ocf::heartbeat:apache):Started eagnmnmep19c1
> 
> eagnmnmep19c1:/var/lib/pacemaker/cib #
> 
> 
> But the httpd daemons aren't started.  I can start them by hand, but
> that's not what I need.
> 
> I have gone through the ClusterLabs and SLES docs for setting up
> apache resources, and through this list's archive; haven't found my
> answer.   I'm missing something in corosync, apache, or systemd.
> Please advise.

I think you've accidentally arranged a trap for yourself to fall in.

What makes me believe so is that you've explicitly mentioned you've
modified /etc/init.d/apache2.  In an RPM driven ecosystem
(incl. SUSE), it means that on removal/upgrade of the respective
package, that file won't get removed.  This is what likely happened
on your major system upgrade.

Then, you can follow the logic in the agent itself, see in particular:
https://github.com/ClusterLabs/resource-agents/blob/v4.3.0/heartbeat/apache#L187-L188
It makes the agent assume it is not in systemd realms and switches
to use that provided initscript instead.  Apparently, that is _not_
supposed to necessarily work, since at that point, you have the
differing versions of httpd executables vs. the initscript (the former
is presumably newer than the latter), so any evolutionary differences
are not accounted for in the initscript itself (it's stale at that
point).

To resolve this, try simply renaming /etc/init.d/apache2 to something
else (or moving it somewhere else, simply to retain a backup of your
modifications), then unmanage and manage the resource again; hopefully
the systemd route to get httpd running will turn out well then.
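
In concrete steps, something along these lines (resource name taken from
your status output; crmsh assumed, this being SLES):

  mv /etc/init.d/apache2 /root/apache2.init.backup
  crm resource unmanage ncoa_apache
  crm resource manage ncoa_apache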


-- 
Poki


pgpYg3yvHJaBE.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Fence_sbd script in Fedora30?

2019-09-24 Thread Jan Pokorný
On 24/09/19 08:08 +0200, Klaus Wenninger wrote:
> On 9/23/19 10:23 PM, Vitaly Zolotusky wrote:
>> I am trying to upgrade to Fedora 30. The platform is two node
>> cluster with pacemaker.
>> It Fedora 28 we were using old fence_sbd script from 2013:
>> 
>> # This STONITH script drives the shared-storage stonith plugin.
>> # Copyright (C) 2013 Lars Marowsky-Bree 
>> 
>> We were overwriting the distribution script in custom built RPM
>> with the one from 2013.
>> It looks like there is no fence_sbd script any more in the agents
>> source and some apis changed so that the old script would not work.
> 
> What about fence_sbd from fence-agents-sbd-package from fc30?
> That should work with pacemaker & sbd from fc30.

From current Rawhide ~ Fedora 32:

  # dnf repoquery -f '*/fence_sbd' 2>/dev/null
  fence-agents-sbd-0:4.4.0-2.fc31.noarch

>> Do you have any documentation / suggestions on how to move from old
>> fence_sbd script to the latest?

-- 
Jan (Poki)


pgpaDNXbUjBnR.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] kronosnet v1.12 released

2019-09-20 Thread Jan Pokorný
On 20/09/19 05:22 +0200, Fabio M. Di Nitto wrote:
> We are pleased to announce the general availability of kronosnet v1.12
> (bug fix release)
> 
> [...]
> 
> * Add support for musl libc

Congrats, and the above is great news, since I've been toying with
an idea of putting together a truly minimalistic and vendor neutral
try-out image based on Alpine Linux, which uses musl as libc of choice
(for its bloatlessness, just as there's no systemd, etc.).

-- 
Jan (Poki)


pgpTFAy_WzW43.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] nfs-daemon will not start

2019-09-19 Thread Jan Pokorný
On 19/09/19 16:43 +0200, Oyvind Albrigtsen wrote:
> Try upgrading resource-agents and maybe nfs-utils (if there's newer
> version for CentOS 7).
> 
> I recall some issue with how the nfs config was generated, which might
> be causing this issue.

This is what I'd start with, otherwise, see below.

> On 18/09/19 20:49 +, Jones, Keven wrote:
>> I have 2 centos7.6  VM’s setup. Was able to successfully create
>> cluster, setup LVM, NFSHARE but not able to get the nfs-daemon
>> (ocf::heartbeat:nfsserver): to start successfully.
>> 
>> 
>> [root@cs-nfs1 ~]# pcs status
>> 
>> [...]
>> 
>> Failed Actions:
>> * nfs-daemon_start_0 on cs-nfs1.saas.local 'unknown error' (1): call=25, 
>> status=Timed Out, exitreason='',
>>   last-rc-change='Wed Sep 18 16:31:48 2019', queued=0ms, exec=40001ms
>> * nfs-daemon_start_0 on cs-nfs2.saas.local 'unknown error' (1): call=22, 
>> status=Timed Out, exitreason='',
>>   last-rc-change='Wed Sep 18 16:31:06 2019', queued=0ms, exec=40002ms
>> 
>> [...]
>> 
>> [root@cs-nfs1 ~]# pcs resource debug-start nfs-daemon
>> Operation start for nfs-daemon (ocf:heartbeat:nfsserver) failed: 'Timed Out' 
>> (2)
>>> stdout: STATDARG="--no-notify"
>>> stdout: * rpc-statd.service - NFS status monitor for NFSv2/3 locking.
>>> stdout:Loaded: loaded (/usr/lib/systemd/system/rpc-statd.service; 
>>> static; vendor preset: disabled)
>>> stdout:Active: inactive (dead) since Wed 2019-09-18 16:32:28 EDT; 13min 
>>> ago
>>> stdout:   Process: 7054 ExecStart=/usr/sbin/rpc.statd $STATDARGS 
>>> (code=exited, status=0/SUCCESS)
>>> stdout:  Main PID: 7055 (code=exited, status=0/SUCCESS)
>>> stdout:
>>> stdout: Sep 18 16:31:48 cs-nfs1 systemd[1]: Starting NFS status monitor for 
>>> NFSv2/3 locking
>>> stdout: Sep 18 16:31:48 cs-nfs1 rpc.statd[7055]: Version 1.3.0 starting
>>> stdout: Sep 18 16:31:48 cs-nfs1 rpc.statd[7055]: Flags: TI-RPC
>>> stdout: Sep 18 16:31:48 cs-nfs1 systemd[1]: Started NFS status monitor for 
>>> NFSv2/3 locking..
>>> stdout: Sep 18 16:32:28 cs-nfs1 systemd[1]: Stopping NFS status monitor for 
>>> NFSv2/3 locking
>>> stdout: Sep 18 16:32:28 cs-nfs1 systemd[1]: Stopped NFS status monitor for 
>>> NFSv2/3 locking..
>>> stderr: Sep 18 16:46:08 INFO: Starting NFS server ...
>>> stderr: Sep 18 16:46:08 INFO: Start: rpcbind i: 1
>>> stderr: Sep 18 16:46:08 INFO: Start: nfs-mountd i: 1
>>> stderr: Job for nfs-idmapd.service failed because the control
>>> process exited with error code. See "systemctl status
>>> nfs-idmapd.service" and "journalctl -xe" for details.
>>> 
>>> [...]
>>> 
>> 
>> [root@cs-nfs1 ~]# systemctl status nfs-idmapd.service
>> ● nfs-idmapd.service - NFSv4 ID-name mapping service
>>  Loaded: loaded (/usr/lib/systemd/system/nfs-idmapd.service; static; vendor 
>> preset: disabled)
>>  Active: failed (Result: exit-code) since Wed 2019-09-18 16:46:08 EDT; 1min 
>> 25s ago
>> Process: 8699 ExecStart=/usr/sbin/rpc.idmapd $RPCIDMAPDARGS (code=exited, 
>> status=1/FAILURE)
>> Main PID: 5334 (code=killed, signal=TERM)
>> 
>> Sep 18 16:46:08 cs-nfs1 systemd[1]: Starting NFSv4 ID-name mapping service...
>> Sep 18 16:46:08 cs-nfs1 systemd[1]: nfs-idmapd.service: control process 
>> exited, code=exited status=1
>> Sep 18 16:46:08 cs-nfs1 systemd[1]: Failed to start NFSv4 ID-name mapping 
>> service.
>> Sep 18 16:46:08 cs-nfs1 systemd[1]: Unit nfs-idmapd.service entered failed 
>> state.
>> Sep 18 16:46:08 cs-nfs1 systemd[1]: nfs-idmapd.service failed
>> 
>> [root@cs-nfs1 ~]# journalctl -xe
>> -- The start-up result is done.
>> Sep 18 16:46:08 cs-nfs1 rpc.idmapd[8711]: main: 
>> open(/var/lib/nfs/rpc_pipefs//nfs): No such file or directory

^ this is apparently the bottom-most diagnostics provided

Why the "rpc_pipefs on /var/lib/nfs/rpc_pipefs" mount (of sunrpc type,
perhaps; at least that's what I observe on my system) would appear to be
missing at the moment it is expected to be present, no idea.

But you haven't shown your exact nfs-daemon resource configuration,
after all -- did you tamper with the "rpcpipefs_dir" parameter, for
instance?

Also, that would actually smell like a neglected integration with
systemd, since when this parameter gets changed, there is no
propagation of that change towards the actual systemd unit files that
then get blindly managed, naively assuming coherency, AFAICT...
Then, the agent shall be fixed to disallow such deliberate
modifications in the systemd scenarios, or something.
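
As a quick sanity check on the affected node (just a sketch; resource
name taken from your status output):

  findmnt /var/lib/nfs/rpc_pipefs    # is the expected mount there at all?
  pcs resource show nfs-daemon       # any rpcpipefs_dir override in place?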

-- 
Jan (Poki)


pgpTXGNOpruvJ.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Pacemaker 1.1.12 does not compile with CMAN Stack.

2019-09-16 Thread Jan Pokorný
On 16/09/19 12:01 +, Somanath Jeeva wrote:
> Thanks for your extended support and advise. I would like to give
> some background information about my exercise, which will give some
> fair idea to you folks. 
> 
> We are planning to use pacemaker + Corosync + CMAN stack in RHEL 6.5
> OS. For the same, we are trying to collect the source from
> Clusterlab.

If my memory serves me correctly and if the actual hard limitation
is the usage of RHEL 6.5, the pacemaker option would still be available
for you to consider...

> Initially we got the Pacemaker and Corosync Source successfully from
> Clusterlab and we were unable to fetch the source for CMAN from
> Cluster lab.

ClusterLabs is currently a loose "makers association".

I am not even sure what you call "getting corosync source from
ClusterLabs", since the corosync project is even more independent;
see for instance its separate top-level GitHub association name
(https://github.com/corosync/corosync).

> As you suggested, We already tried to use the latest set of
> Pacemaker with its prerequisites (Corosync) on top of RHEL 6.5, but
> it fails due to dependencies of RHEL 7.x Components (Systemd).

This may be a matter of poor understanding of how to pre-arrange
the actual compilation of the components.  It might deserve more
room in README/INSTALL kind of top-level documents, but the
habitual approach working across any SW with build system backed
with autotools (autoconf, automake, ...) is to run these exploratory
steps:

   1. ./autogen.sh (sometimes also ./bootstrap.sh, etc., sometimes,
  for instance, mere "autoreconf -if" is suggested instead)
   
   2. ./configure --help

The latter will overwhelm you on the first contact, but then, when
you don't give up and try to understand your options with the project
at hand, you'll realize that you can change quite a lot about the
final products of the compilation, and indirectly about the
prerequisites needed to get there.

In case of pacemaker, you can, for instance, spot:

> --enable-systemd  Enable support for managing resources via systemd [try]

It follows that to disable this support (plus its dependencies) by
force, you'll just need to invoke the configure script as:

   ./configure --disable-systemd

> So we are requesting you to provide the below information.
> 
> Is it possible for me to get the source code for CMAN from Cluster
> lab ??? (since we are planning to use it for our production purpose,
> that's why we don't want to go with Red hat sites (even for Publicly
> available)).
> 
Somanath, it's essential to understand some facts first, since blind
fixation on the ClusterLabs group and, conversely, blanket refusal of
Red Hat as a vendor (sigh) won't lead anywhere, I am afraid:

   1. CMAN pre-exists ClusterLabs establishment by multiple years

   2. CMAN used to be, more or less, a single-vendor show, and this
  vendor was, surprise surprise, Red Hat

   3. from 2., it follows that the only authoritative sources of CMAN
  ever were associated with this vendor -- you want it? better
  to grab it from there ... or from a still-associated and
  trustworthy enough mirror (currently hosted at pagure.io,
  see below)

So, take it or leave it, the links I provided for you (below) still
stand, don't expect any ClusterLabs rebranding-by-force of what
practically amounts to a dead project now.

Thanks for understanding.

And keep in mind, if I were you, I'd skip CMAN and RHEL 6 today.

> -Original Message-
> From: Jan Pokorný  
> Sent: Friday, August 30, 2019 20:15
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Pacemaker 1.1.12 does not compile with CMAN Stack.
> 
> [...]
> 
> For CMAN in particular, look here:
> 
> https://pagure.io/linux-cluster/cluster/blob/STABLE32/f/cman
> https://ftp.redhat.com/pub/redhat/linux/enterprise/6Server/en/RHS/SRPMS/cluster-3.0.12.1-73.el6.src.rpm
> 
> 
> [...]
> 

-- 
Jan (Poki)


pgpjkHb093_qS.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Why is last-lrm-refresh part of the CIB config?

2019-09-12 Thread Jan Pokorný
On 12/09/19 09:27 +0200, Ulrich Windl wrote:
>>>> Jan Pokorný  schrieb am 10.09.2019 um 17:38
>>>> in Nachricht <20190910153832.gj29...@redhat.com>:
>> 
>> Just looking at how to provisionally satisfy the needs here, I can
>> suggest this "one-liner" as a filter to let the to-archive-copies
>> through (depending on the aposteriori diff):
>> 
>>   { xsltproc - <(cibadmin -Q) | cat; } <>   > xmlns:xsl="http://www.w3.org/1999/XSL/Transform;>
>> 
>>   
>> 
>>   
>> 
>> 
>>   
>>   EOF
>> 
> 
> [...]
>> plain POSIX shell.  The additional complexity stems from the desire
>> so as not to keep blank lines where anything got filtered out.
> 
> Actually it "eats up" the line break and identation before
>  also, like here:
> 
> id="cib-bootstrap-options-stonith-enabled"/>
> id="cib-bootstrap-options-placement-strategy"/>
> -   name="last-lrm-refresh" value="1567945827"/>
> -   id="cib-bootstrap-options-maintenance-mode"/>
> -
> +   id="cib-bootstrap-options-maintenance-mode"/>
>
>

Good spot :-)

Untested fix (you'll get the idea, check just the first match element
on the preceding-sibling axis):

  { xsltproc - <(cibadmin -Q) | cat; } <http://www.w3.org/1999/XSL/Transform;>

  

  


  
  EOF
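
Spelled out, a minimal sketch of the idea (assuming the goal is to drop
the last-lrm-refresh nvpair together with just the whitespace-only text
node directly following it) could read:

{ xsltproc - <(cibadmin -Q) | cat; } <<'EOF'
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- identity transform: copy everything through unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- drop the volatile property itself -->
  <xsl:template match="nvpair[@name='last-lrm-refresh']"/>
  <!-- and the whitespace whose nearest preceding sibling element is
       that nvpair, so no blank line is left behind -->
  <xsl:template match="text()[preceding-sibling::*[1]
                              [self::nvpair[@name='last-lrm-refresh']]]"/>
</xsl:stylesheet>
EOF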

Some wiggling around this idea would certainly lead to the desired results.
OTOH, if you feed that output back to crm (to get the reinstating
sequence), whitespace is pretty irrelevant, allowing for vast
simplification.

And when you are after an actual XML diff, a text-based diff is
a *bad choice* in general anyway.  Identical XML content can be
expressed with a countless number of variations in the serialized
form (plain text projection), and you are really interested in the
semantic identity only.  That text-based diffs more or less work
is rather an implementation detail not to be relied upon across
SW updates, etc. (libxml2-specific environment variables could affect
the output as well!).  There are multiple "XML diff" tools, perhaps
even crm_diff would do (but specialities like comments, processing
instructions etc. may not work well with it; admittedly they probably
won't be present to begin with).
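
For instance, to compare two saved CIB snapshots semantically rather
than textually (file names made up):

  crm_diff --original cib-before.xml --new cib-after.xml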

-- 
Poki


pgptzqob81yLk.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Why is last-lrm-refresh part of the CIB config?

2019-09-10 Thread Jan Pokorný
On 10/09/19 08:06 +0200, Ulrich Windl wrote:
 Ken Gaillot  schrieb am 09.09.2019 um 17:14
 in Nachricht
 <9e51c562c74e52c3b9e5f85576210bf83144fae7.ca...@redhat.com>:
>> On Mon, 2019‑09‑09 at 11:06 +0200, Ulrich Windl wrote:
>>> In recent pacemaker I see that last‑lrm‑refresh is included in the
>>> CIB config (crm_config/cluster_property_set), so CIBs are "different"
>>> when they are actually the same.
>>> 
>>> Example diff:
>>> -  <nvpair id="..." name="last-lrm-refresh" value="1566194010"/>
>>> +  <nvpair id="..." name="last-lrm-refresh" value="1567945827"/>
>>> 
>>> I don't see a reason for having that. Can someone explain?
>> 
>> New transitions (re‑calculation of cluster status) are triggered by
>> changes in the CIB. last‑lrm‑refresh isn't really special in any way,
>> it's just a value that can be changed arbitrarily to trigger a new
>> transition when nothing "real" is changing.
> 
> But wouldn't it be more appropriate in the status section of the CIB then?
> Also isn't pacemaker working with CIB digests anyway? And there is the
> three-number CIB versioning, too.
> 
>> I'm not sure what would actually be setting it these days; its use has
>> almost vanished in recent code. I think it was used more commonly for
>> clean‑ups in the past.
> 
> My guess is that a crm configure did such a change.
> 
> Why I care: I'm archiving CIB configuration (but not status) changes
> (that is the CIBs and the diffs); so I don't like such dummy changes
> ;-)

Just looking at how to provisionally satisfy the needs here, I can
suggest this "one-liner" as a filter to let the to-archive-copies
through (depending on the a posteriori diff):

  { xsltproc - <(cibadmin -Q) | cat; } 

  

  


  
  EOF

Above, "cat" is only for demonstration of how to further extend your
pipeline (you can directly ground those flying data into a file
instead).  It relies on bash extension (process substitution) over
plain POSIX shell.  The additional complexity stems from the desire
so as not to keep blank lines where anything got filtered out.

Hope this helps for now.

-- 
Jan (Poki)


pgp7NH4QgsiJF.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15

2019-09-04 Thread Jan Pokorný
On 03/09/19 20:15 +0300, Andrei Borzenkov wrote:
> 03.09.2019 11:09, Marco Marino wrote:
>> Hi, I have a problem with fencing on a two node cluster. It seems that
>> randomly the cluster cannot complete monitor operation for fence devices.
>> In log I see:
>> crmd[8206]:   error: Result of monitor operation for fence-node2 on
>> ld2.mydomain.it: Timed Out
> 
> Can you actually access IP addresses of your IPMI ports?

[
Tangentially, an interesting aspect beyond that, applicable to any
non-IP cross-host referential needs and one I haven't seen mentioned
anywhere so far, is the risk of DNS resolution (when /etc/hosts comes
up short) running into trouble (stale records, blocked ports, DNS
server overload [DNSSEC, etc.], parallel IPv4/IPv6 records that the SW
cannot handle gracefully, etc.).  In any case, just a single DNS
server would apparently be an undesired SPOF, and it would be unfortunate
to be unable to fence a node because of that.

I think the most robust approach is to use IP addresses whenever
possible, and unambiguous records in /etc/hosts when practical.
]

>> As attachment there is
>> - /var/log/messages for node1 (only the important part)
>> - /var/log/messages for node2 (only the important part) <-- Problem starts
>> here
>> - pcs status
>> - pcs stonith show (for both fence devices)
>> 
>> I think it could be a timeout problem, so how can I see timeout value for
>> monitor operation in stonith devices?
>> Please, someone can help me with this problem?
>> Furthermore, how can I fix the state of fence devices without downtime?

-- 
Jan (Poki)


pgpL97hDs1Edl.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Q: Recommened directory for RA auxillary files?

2019-09-02 Thread Jan Pokorný
On 02/09/19 15:23 +0200, Ulrich Windl wrote:
> Are there any recommendations where to place (fixed content) files
> an RA uses?
> Usually my RAs use a separate XML file for the metadata, just to
> allow editing it in XML mode automatically.  Traditionally I put the
> file in the same directory as the RA itself (like "cat $0.xml" for
> meta-data).  Are there any expectations that every file in the RA
> directory is an RA?

Pacemaker itself only scans the relevant directories (in OCF
context, subdirectories of $OCF_ROOT/resource.d dedicated to particular
resource agent providers; note $OCF_ROOT boils down to /usr/lib/ocf in
this case), and will initially exclude any file that is not executable
(for at least one level of standard Unix permissions) or those
starting with a dot (~pseudo-hidden in Unix tradition).
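
As a rough illustration, something like the following approximates that
filter for a given provider directory (provider name made up, GNU find
assumed):

  find /usr/lib/ocf/resource.d/myprovider -maxdepth 1 -type f \
       -perm /111 ! -name '.*'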

This should provide some hints regarding possible co-location within
the same directory, at least when only pacemaker is considered as the
discoverer/exerciser/manager of the agents.

You (like anyone else) have the opportunity to get any presumably
pacemaker-implementation-non-contradicting conventions burnt even
firmer into the future revision of OCF, taking

https://github.com/ClusterLabs/OCF-spec/blob/master/ra/next/resource-agent-api.md#the-resource-agent-directory

as a starting point.  This would preempt portability glitches should
there be anything other than pacemaker to run your agents.

(E.g., there could be parallel implementations that would only
require the actual executability, e.g. as assigned with filesystem
backed extended access control, or a relaxed resource runner that
makes do just with a non-executable file with a legitimately looking
shebang like Perl does.)

> (Currently I'm extending an RA, and I'd like to provide some
> additional user-modifiable template file, and I wonder which path to
> use)

Intuitively, I would toy with an idea akin to an
/etc/ocf/conf.d/PROVIDER/AGENT.EXT scheme.  Similarly
/etc/ocf/conf-local.d/PROVIDER/AGENT.EXT for a localhost-only
override, see below.
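
Purely as an illustration of that hypothetical scheme (nothing of the
sort exists today), an agent could then do something like:

  conf=/etc/ocf/conf.d/myprovider/myagent.conf
  conf_local=/etc/ocf/conf-local.d/myprovider/myagent.conf
  [ -r "$conf" ] && . "$conf"
  [ -r "$conf_local" ] && . "$conf_local"   # host-local override wins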

Also, for as homogeneous an environment across the nodes as possible (and
if possible), synchronizing everything but notable exceptions such as
/etc/passwd, /etc/shadow/, /etc/group, /etc/hostname, /etc/cron.* and
similar asymmetry-wanted configurations from /etc hierarchy sounds
like a good idea.  Off the top of my head, csync2 with its hooks
system serves exactly this purpose, but there are likely more
that could fill the bill.

-- 
Jan (Poki)


pgpxCBlh8emSE.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Pacemaker 1.1.12 does not compile with CMAN Stack.

2019-08-30 Thread Jan Pokorný
On 30/08/19 13:03 +, Somanath Jeeva wrote:
> In Pacemaker 1.1.12 version try to compile with CMAN Stack,

mildly put, it's like trying to run with dinosaurs; that version
of pacemaker together with that effectively superseded bundle
of other components will hardly receive any attention in 2019.

But let's assume there's a reason.

> but we are unable to achieve that .
> 
> Source taken path : 
> https://github.com/ClusterLabs/pacemaker/tree/Pacemaker-1.1.12
> 
> After Extracting, we installed required dependencies as per
> README.markdown,
> 
> ## Installing from source
> 
> $ ./autogen.sh
> $ ./configure
> $ make
> $ sudo make install
> 
> After performing above task, we are unable to start pacemaker due
> to cman stack is unrecognized service.

Well, pacemaker alone won't magically bootstrap these other
prerequisites, you need to make the effort of grabbing them
as well.

For CMAN in particular, look here:

https://pagure.io/linux-cluster/cluster/blob/STABLE32/f/cman
https://ftp.redhat.com/pub/redhat/linux/enterprise/6Server/en/RHS/SRPMS/cluster-3.0.12.1-73.el6.src.rpm

> # service pacemaker status
> pacemakerd dead but pid file exists
> #service cman status
> cman: unrecognized service

Easy, no /etc/init.d/cman around, for the mentioned reason.

Still, I am not sure how far you'll get, sounds like an uphill battle.
Settling with the rather recent state of development may avoid
significant chunks of troubles, incl. those that were only fixed in
later versions of pacemaker.

> Please find the ./configure screenshot of the system:
> 
> pacemaker configuration:
>   Version  = 1.1.12 (Build: 561c4cfda1)
>   Features = libqb-logging libqb-ipc lha-fencing nagios  
> corosync-plugin cman acls
> 
>   Prefix   = /usr
>   Executables  = /usr/sbin
>   Man pages= /usr/share/man
>   Libraries= /usr/lib64
>   Header files = /usr/include
>   Arch-independent files   = /usr/share
>   State information= /var
>   System configuration = /etc
>   Corosync Plugins = /usr/libexec/lcrso
> 
>   Use system LTDL  = yes
> 
>   HA group name= haclient
>   HA user name = hacluster
> 
>   CFLAGS   = -g -O2 -I/usr/include -I/usr/include/heartbeat   
>-ggdb  -fgnu89-inline -fstack-protector-all -Wall -Waggregate-return 
> -Wbad-function-cast -Wcast-align -Wdeclaration-after-statement -Wendif-labels 
> -Wfloat-equal -Wformat=2 -Wmissing-prototypes -Wmissing-declarations 
> -Wnested-externs -Wno-long-long -Wno-strict-aliasing 
> -Wunused-but-set-variable -Wpointer-arith -Wwrite-strings -Werror
>   Libraries= -lgnutls -lqb -lplumb -lpils -lqb -lbz2 -lxslt 
> -lxml2 -lc -luuid -lpam -lrt -ldl  -lglib-2.0   -lltdl -lqb -ldl -lrt 
> -lpthread
>   Stack Libraries  =   -lcoroipcc   -lcpg   -lcfg   -lconfdb   -lcman 
>   -lfenced
> 
> 
> Is there anything I am missing in the configure, to get the cman
> stack.

I think having "cman" enumerated amongst features might be enough,
but once you get past the "cman: unrecognized service" phase,
you shall see.

Hope this helps.

-- 
Jan (Poki)


pgpk4q77cisT7.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Command to show location constraints?

2019-08-29 Thread Jan Pokorný
On 28/08/19 10:27 +0200, Ulrich Windl wrote:
>>>> Jan Pokorný  schrieb am 28.08.2019 um 10:03
>>>> in Nachricht <20190828080347.ga9...@redhat.com>:
>> On 27/08/19 09:24 ‑0600, Casey & Gina wrote:
>>> Hi, I'm looking for a way to show just location constraints, if they
>>> exist, for a cluster.  I'm looking for the same data shown in the
>>> output of `pcs config` under the "Location Constraints:" header, but
>>> without all the rest, so that I can write a script that checks if
>>> there are any set.
>>> 
>>> The situation is that sometimes people will perform a failover with
>>> `pcs resource move ‑‑master `, but then forget to follow
>>> it up with `pcs resource clear `, and then it causes
>>> unnecessary failbacks later.  As we never want to have any
>>> particular node in the cluster preferred for this resource, I'd like
>>> to write a script that can automatically check for any location
>>> constraints being set and either alert or clear it automatically.
>> 
>> One could also think of "crm_resource ‑‑clear ‑‑dry‑run" that doesn't
>> exist yet.  Please, file a RFE with https://bugs.clusterlabs.org if
>> this would be useful.
> 
> 
> Thinking about it: Location constraints can have an expire
> timestamp, but they don't have a creation timestamp (AFAIK).  Thus
> the first step in automatically cleaning up location constraints
> that don't have an expiration time set would be adding a creation
> time stamp.  The next step would be a definition of a "maximum
> location constraint lifetime", and finally actual code that removes
> location constraints that exceeded their lifetime.

It feels like any such automatism (even if driven by user configuration)
would moreover need to prove (with an internal what-if simulation)
that no change in resource assignment will immediately occur in return,
since you likely don't want spurious, unnecessary shuffles in your
cluster (they are only "positively deterministic" if you really review
the plans periodically and are fine with them).

> Thinking about it: There may be a more pleasing solution for the
> current code base: Define a maximum lifetime for any migration
> constraint, and use that if the user did not specify one.  (The
> constraints would still linger around, but would not have an effect
> any more.)

All in all, I think the impact of cancelling shall be factored into
decisions as mentioned, to stand the "least surprise" test.

-- 
Jan (Poki)


pgpDU2trmkDv2.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Command to show location constraints?

2019-08-28 Thread Jan Pokorný
On 27/08/19 09:24 -0600, Casey & Gina wrote:
> Hi, I'm looking for a way to show just location constraints, if they
> exist, for a cluster.  I'm looking for the same data shown in the
> output of `pcs config` under the "Location Constraints:" header, but
> without all the rest, so that I can write a script that checks if
> there are any set.
> 
> The situation is that sometimes people will perform a failover with
> `pcs resource move --master `, but then forget to follow
> it up with `pcs resource clear `, and then it causes
> unnecessary failbacks later.  As we never want to have any
> particular node in the cluster preferred for this resource, I'd like
> to write a script that can automatically check for any location
> constraints being set and either alert or clear it automatically.

One could also think of "crm_resource --clear --dry-run" that doesn't
exist yet.  Please, file a RFE with https://bugs.clusterlabs.org if
this would be useful.
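
In the meantime, a minimal sketch of such a check, relying only on the
stock CLI plus xmllint (a starting point rather than a polished script):

  if cibadmin --query --scope constraints \
       | xmllint --xpath '//rsc_location' - >/dev/null 2>&1; then
      echo "WARNING: location constraint(s) present," \
           "consider 'pcs resource clear'" >&2
  fi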

-- 
Jan (Poki)


pgpS7k1f23mve.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] pacemaker resources under systemd

2019-08-27 Thread Jan Pokorný
On 27/08/19 15:27 +0200, Ulrich Windl wrote:
> Systemd think he's the boss, doing what he wants: Today I noticed that all
> resources are run inside control group "pacemaker.service" like this:
>   ├─pacemaker.service
>   │ ├─ 26582 isredir-ML1: listening on 172.20.17.238/12503 (2/1)
>   │ ├─ 26601 /usr/bin/perl -w /usr/sbin/ldirectord /etc/ldirectord/mail.conf 
> start
>   │ ├─ 26628 ldirectord tcp:172.20.17.238:25
>   │ ├─ 28963 isredir-DS1: handling 172.20.16.33/10475 -- 172.20.17.200/389
>   │ ├─ 40548 /usr/sbin/pacemakerd -f
>   │ ├─ 40550 /usr/lib/pacemaker/cib
>   │ ├─ 40551 /usr/lib/pacemaker/stonithd
>   │ ├─ 40552 /usr/lib/pacemaker/lrmd
>   │ ├─ 40553 /usr/lib/pacemaker/attrd
>   │ ├─ 40554 /usr/lib/pacemaker/pengine
>   │ ├─ 40555 /usr/lib/pacemaker/crmd
>   │ ├─ 53948 isredir-DS2: handling 172.20.16.33/10570 -- 172.20.17.201/389
>   │ ├─ 92472 isredir-DS1: listening on 172.20.17.204/12511 (13049/3)
> ...
> 
> (that "isredir" stuff is my own resource that forks processes and creates
> threads on demand, thus modifying process (and thread) titles to help
> understanding what's going on...)
> 
> My resources are started via OCF RA (shell script), not a systemd unit.
> 
> Wouldn't it make much more sense if each resource would run in its
> own control group?

While a listing like the above may be confusing, the main problem perhaps
is that all the resource restrictions you specify in the pacemaker service
file will be accounted against the mix of stack-native and stack-managed
resources (unless they are of the systemd class), hence making all those
containment features and supervision of systemd rather unusable, since
there's no tight (as opposed to rather open-ended) blackbox to reason about.

There have been some thoughts in the past, however, that pacemaker could
become the delegated controller of its own cgroup subtrees.

There is a nice document detailing the various possibilities, though it
also looks pretty overwhelming at first sight:
https://systemd.io/CGROUP_DELEGATION
Naively, the i-like-continents integration option there looks most
appealing to me at this point.
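
For illustration only (this is plain systemd functionality, not anything
pacemaker does today), the "delegation" that document talks about boils
down to a drop-in such as

  # /etc/systemd/system/pacemaker.service.d/delegate.conf
  [Service]
  Delegate=yes

which hands the pacemaker.service cgroup subtree over to the service
itself, so it could, in theory, create and manage per-resource child
cgroups on its own.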

If anyone has insights into cgroups and how it pairs with systemd
and could pair with pacemaker, please do speak up, it could be
a great help in sketching the design in this area.

> I mean: If systemd thinks everything MUST run in some control group,
> why not pick the "correct " one? Having the pacemaker infrastructure
> in the same control group as all the resources seems to be a bad
> idea IMHO.

No doubt it is suboptimal.

> The other "discussable feature" are "high PIDs" like "92472". While port
> numbers are still 16 bit (in IPv4 at least), I see little sense in having
> millions of processes or threads.

I have seen your questioning of this on the systemd ML, but I wouldn't
expect any kind of inconvenience in that regard, modulo pre-existing real
bugs.  It actually slightly helps to unbreak the firm-guarantees-lacking
design based on PID liveness (the risk of process ID recycling is still
better than downright crazy "process grep'ing", which is totally unsuitable
when chroots, PID namespaces or containers rooted on that very host
get into the picture, and not much better otherwise[1]!).

[1] https://lists.clusterlabs.org/pipermail/users/2019-July/025978.html
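
(Should anyone want to inspect or bound that ceiling on a particular
machine, it is just the standard sysctl -- shown merely for
illustration, not as a recommendation to lower it:

  # current upper bound for PIDs/TIDs handed out by the kernel
  sysctl kernel.pid_max
)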

-- 
Jan (Poki)


pgpv8zIc3yPjG.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] SLES12 SP4: Same error logged in two different formats

2019-08-26 Thread Jan Pokorný
On 26/08/19 08:16 +0200, Ulrich Windl wrote:
> While inspecting the logs to improve my own RA, I noticed that one
> error is logged with two different formats: with a literal "\n" and
> with a space:
> lrmd[7278]:   notice: prm_isr_ds3_monitor_0:108007:stderr [ mkdir: cannot 
> create directory '/run/isredir': File exists ]
> crmd[7281]:   notice: h11-prm_isr_ds3_monitor_0:964 [ mkdir: cannot create 
> directory '/run/isredir': File exists\n ]
> 
> (The RA suffers from a race condition when multiple monitors are
> launched in parallel, and each one tries to create the missing
> directory; only the first one succeeds)

/me acting as a bot, adding synapses should they be helpful:
https://lists.clusterlabs.org/pipermail/users/2019-February/025454.html
[never codified reentrancy of the agent-instances]

> The log message in question is created automatically from a failed
> mkdir in the shell script.
> 
> My guess is that the trailing "\n" should be removed from the
> message.
> 
> As the pacemaker on the machine is a bit outdated
> (pacemaker-1.1.19+20180928.0d2680780-1.8.x86_64), the problem might
> have been fixed already.

-- 
Jan (Poki)


pgp_27uo5NLke.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Q: "pengine[7280]: error: Characters left over after parsing '10#012': '#012'"

2019-08-22 Thread Jan Pokorný
On 22/08/19 08:07 +0200, Ulrich Windl wrote:
> When a second node joined a two-node cluster, I noticed the
> following error message that leaves me kind of clueless:
>  pengine[7280]:error: Characters left over after parsing '10#012': '#012'
> 
> Where should I look for these characters?

Given it's pengine-related, one idea is that it's connected to:

https://github.com/ClusterLabs/pacemaker/commit/9cf01f5f987b5cbe387c4e040ff5bfd6872eb0ad

Therefore it'd be nothing to try to tackle in the user-facing
configuration, but rather some kind of internal confusion, perhaps
stemming from mixing pacemaker versions within the cluster?

By any chance, do you have an interval of 12 seconds configured
at any operation for any resource?

(The only other and unlikely possibility I can immediately see is
having one of pe-*-series-max cluster options misconfigured.)
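
(If you want to hunt for the offending value yourself, a crude sketch
-- "#012" is typically how (r)syslog encodes an embedded newline:

  # series-max options and operation intervals in the live CIB
  cibadmin --query | grep -E 'pe-(error|warn|input)-series-max|interval='
)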

> The message was written after an announced resource move to the new
> node.

-- 
Poki


pgpj0Z_tRBZUy.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Master/slave failover does not work as expected

2019-08-21 Thread Jan Pokorný
On 21/08/19 14:48 +0200, Jan Pokorný wrote:
> On 20/08/19 20:55 +0200, Jan Pokorný wrote:
>> On 15/08/19 17:03 +, Michael Powell wrote:
>>> First, thanks to all for their responses.  With your help, I'm
>>> steadily gaining competence WRT HA, albeit slowly.
>>> 
>>> I've basically followed Harvey's workaround suggestion, and the
>>> failover I hoped for takes effect quite quickly.  I nevertheless
>>> remain puzzled about why our legacy code, based upon Pacemaker
>>> 1.0/Heartbeat, works satisfactorily w/o such changes, 
>>> 
>>> Here's what I've done.  First, based upon responses to my post, I've
>>> implemented the following commands when setting up the cluster:
>>>> crm_resource  -r SS16201289RN00023 --meta -p resource-stickiness -v 0# 
>>>> (someone asserted that this was unnecessary)
>>>> crm_resource  --meta -r SS16201289RN00023 -p migration-threshold -v 1
>>>> crm_resource  -r SS16201289RN00023 --meta -p failure-timeout -v 10
>>>> crm_resource  -r SS16201289RN00023 --meta -p cluster-recheck-interval -v 15
>>> 
>>> In addition, I've added "-l reboot" to those instances where
>>> 'crm_master' is invoked by the RA to change resource scores.  I also
>>> found a location constraint in our setup that I couldn't understand
>>> the need for, and removed it.
>>> 
>>> After doing this, in my initial tests, I found that after 'kill -9
>>> ' was issued to the master, the slave instance on the other
>>> node was promoted to master within a few seconds.  However, it took
>>> 60 seconds before the killed resource was restarted.  In examining
>>> the cib.xml file, I found an "rsc-options-failure-timeout", which
>>> was set to "1min".  Thinking "aha!", I added the following line
>>>> crm_attribute --type rsc_defaults --name failure-timeout --update 15
>>> 
>>> Sadly, this does not appear to have had any impact.
>> 
>> It won't have any impact on your SS16201289RN00023 resource since it
>> has its own failure timeout (closer, hence overriding such an outermost
>> fallback value, barring deferral to the built-in default when that
>> default is not set explicitly) set to a value of 10 seconds per the
>> above.
>> 
>> Admittedly, the documentation could use a precise formalization
>> of the precedence rules.
>> 
>> Anyway, this whole seems moot to me, see below.
>> 
>>> So, while the good news is that the failover occurs as I'd hoped
>>> for, the time required to restart the failed resource seems
>>> excessive.  Apparently setting the failure-timeout values isn't
>>> sufficient.  As an experiment, I issued a "crm_resource -r
>>>  --cleanup" command shortly after the failover took
>>> effect and found that the resource was quickly restarted.  Is that
>>> the recommended procedure?  If so, is changing the "failure-timeout"
>>> and "cluster-recheck-interval" really necessary?
>>> 
>>> Finally, for a period of about 20 seconds while the resource is
>>> being restarted (it takes about 30s to start up the resource's
>>> application), it appears that both nodes are masters.  E.g. here's
>>> the 'crm_mon' output.
>>> 
>>> [root@mgraid-16201289RN00023-0 bin]# date;crm_mon -1
>>> Thu Aug 15 09:44:59 PDT 2019
>>> Stack: corosync
>>> Current DC: mgraid-16201289RN00023-0 (version 1.1.19-8.el7-c3c624ea3d) - 
>>> partition with quorum
>>> Last updated: Thu Aug 15 09:44:59 2019
>>> Last change: Thu Aug 15 06:26:39 2019 by root via crm_attribute on 
>>> mgraid-16201289RN00023-0
>>> 
>>> 2 nodes configured
>>> 4 resources configured
>>> 
>>> Online: [ mgraid-16201289RN00023-0 mgraid-16201289RN00023-1 ]
>>> 
>>> Active resources:
>>> 
>>>  Clone Set: mgraid-stonith-clone [mgraid-stonith]
>>>  Started: [ mgraid-16201289RN00023-0 mgraid-16201289RN00023-1 ]
>>>  Master/Slave Set: ms-SS16201289RN00023 [SS16201289RN00023]
>>>  Masters: [ mgraid-16201289RN00023-0 mgraid-16201289RN00023-1 ]
>>> ^^
>>> This seems odd.  Eventually the just-started resource reverts to a slave, 
>>> but this doesn't make sense to me.
>>> 
>>> For the curious, I've attached the 'cibadmin --query' output from
>>> the DC node, taken just prior to issuing the 'kill -9' command, as
>>> well as the corosync.log file from the DC following the 'kill -9'
>>> command.

Re: [ClusterLabs] Master/slave failover does not work as expected

2019-08-21 Thread Jan Pokorný
On 20/08/19 20:55 +0200, Jan Pokorný wrote:
> On 15/08/19 17:03 +, Michael Powell wrote:
>> First, thanks to all for their responses.  With your help, I'm
>> steadily gaining competence WRT HA, albeit slowly.
>> 
>> I've basically followed Harvey's workaround suggestion, and the
>> failover I hoped for takes effect quite quickly.  I nevertheless
>> remain puzzled about why our legacy code, based upon Pacemaker
>> 1.0/Heartbeat, works satisfactorily w/o such changes, 
>> 
>> Here's what I've done.  First, based upon responses to my post, I've
>> implemented the following commands when setting up the cluster:
>>> crm_resource  -r SS16201289RN00023 --meta -p resource-stickiness -v 0# 
>>> (someone asserted that this was unnecessary)
>>> crm_resource  --meta -r SS16201289RN00023 -p migration-threshold -v 1
>>> crm_resource  -r SS16201289RN00023 --meta -p failure-timeout -v 10
>>> crm_resource  -r SS16201289RN00023 --meta -p cluster-recheck-interval -v 15
>> 
>> In addition, I've added "-l reboot" to those instances where
>> 'crm_master' is invoked by the RA to change resource scores.  I also
>> found a location constraint in our setup that I couldn't understand
>> the need for, and removed it.
>> 
>> After doing this, in my initial tests, I found that after 'kill -9
>> ' was issued to the master, the slave instance on the other
>> node was promoted to master within a few seconds.  However, it took
>> 60 seconds before the killed resource was restarted.  In examining
>> the cib.xml file, I found an "rsc-options-failure-timeout", which
>> was set to "1min".  Thinking "aha!", I added the following line
>>> crm_attribute --type rsc_defaults --name failure-timeout --update 15
>> 
>> Sadly, this does not appear to have had any impact.
> 
> It won't have any impact on your SS16201289RN00023 resource since it
> has its own failure timeout (closer, hence overriding such an outermost
> fallback value, barring deferral to the built-in default when that
> default is not set explicitly) set to a value of 10 seconds per the
> above.
> 
> Admittedly, the documentation could use a precise formalization
> of the precedence rules.
> 
> Anyway, this whole seems moot to me, see below.
> 
>> So, while the good news is that the failover occurs as I'd hoped
>> for, the time required to restart the failed resource seems
>> excessive.  Apparently setting the failure-timeout values isn't
>> sufficient.  As an experiment, I issued a "crm_resource -r
>>  --cleanup" command shortly after the failover took
>> effect and found that the resource was quickly restarted.  Is that
>> the recommended procedure?  If so, is changing the "failure-timeout"
>> and "cluster-recheck-interval" really necessary?
>> 
>> Finally, for a period of about 20 seconds while the resource is
>> being restarted (it takes about 30s to start up the resource's
>> application), it appears that both nodes are masters.  E.g. here's
>> the 'crm_mon' output.
>> 
>> [root@mgraid-16201289RN00023-0 bin]# date;crm_mon -1
>> Thu Aug 15 09:44:59 PDT 2019
>> Stack: corosync
>> Current DC: mgraid-16201289RN00023-0 (version 1.1.19-8.el7-c3c624ea3d) - 
>> partition with quorum
>> Last updated: Thu Aug 15 09:44:59 2019
>> Last change: Thu Aug 15 06:26:39 2019 by root via crm_attribute on 
>> mgraid-16201289RN00023-0
>> 
>> 2 nodes configured
>> 4 resources configured
>> 
>> Online: [ mgraid-16201289RN00023-0 mgraid-16201289RN00023-1 ]
>> 
>> Active resources:
>> 
>>  Clone Set: mgraid-stonith-clone [mgraid-stonith]
>>  Started: [ mgraid-16201289RN00023-0 mgraid-16201289RN00023-1 ]
>>  Master/Slave Set: ms-SS16201289RN00023 [SS16201289RN00023]
>>  Masters: [ mgraid-16201289RN00023-0 mgraid-16201289RN00023-1 ]
>> ^^
>> This seems odd.  Eventually the just-started resource reverts to a slave, 
>> but this doesn't make sense to me.
>> 
>> For the curious, I've attached the 'cibadmin --query' output from
>> the DC node, taken just prior to issuing the 'kill -9' command, as
>> well as the corosync.log file from the DC following the 'kill -9'
>> command.
> 
> Thanks for attaching both, since your above described situation starts
> to make a bit more sense to me.
> 
> I think you've observed (at least that's what I gather from the
> provided log) the resource role continually flapping due to
> a non-intermittent/deterministic obstacle on one of the nodes.

Re: [ClusterLabs] booth (manual tickets) site/peer notifications

2019-08-21 Thread Jan Pokorný
On 21/08/19 14:49 +0530, Rohit Saini wrote:
> I am using booth (two clusters) using manual ticket. Is there any
> way for cluster-1 to know if peer cluster-2 booth-ip is not
> reachable. Here I am looking to know if there both the sites are
> connected and reachable via each other's booth-ip.

Just so that we can come to common terms: do you mean the effective
reachability from the other site, as opposed to the plain locally known
fact that the respective assigned virtual IP is up, and perhaps
responding (not necessarily meaning that the address would be reachable
from the outside)?

I think having a detached arbitrator on a 3rd site (that is, without
reachability biases amongst other issues, should it be located with
either of the production sites) is the best way to address such
a concern, at least unless manual ticket mode of operation is in place.
TBH, haven't thought of consequences in that very case.

> pcs alerts gives notification about cluster resources/nodes whenever
> they start or join the cluster but there seems to be no option for
> booth notifications.

Naturally, it's above pacemaker's ("local cluster only") level of
reasoning, and only pacemaker talks such "hook execution language" at
the moment (putting before-acquire-handler of booth aside as not
relevant here universally, AFAICT).

You can always crank up your own down-to-socket-communication
monitoring and hook it into the "beyond pacemaker/booth alone"
decision engine you seem to be building anyway.  Just be cautious
about the mentioned network-proximity biases if you want (and you
should) as objective a picture as feasible.
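
As a trivial illustration of the latter -- a sketch only, with
a placeholder address, and with the port/transport left to whatever
your booth.conf says:

  # run periodically (e.g. from cron) against the peer site's booth IP
  if ! ping -c 3 -W 2 192.0.2.10 >/dev/null 2>&1; then
      logger -t booth-watch "peer site's booth IP not reachable"
  fi

Such a probe naturally still suffers from the network-proximity bias
mentioned above.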

-- 
Jan (Poki)


pgpqtMQMYyu5T.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Master/slave failover does not work as expected

2019-08-20 Thread Jan Pokorný
On 15/08/19 17:03 +, Michael Powell wrote:
> First, thanks to all for their responses.  With your help, I'm
> steadily gaining competence WRT HA, albeit slowly.
> 
> I've basically followed Harvey's workaround suggestion, and the
> failover I hoped for takes effect quite quickly.  I nevertheless
> remain puzzled about why our legacy code, based upon Pacemaker
> 1.0/Heartbeat, works satisfactorily w/o such changes, 
> 
> Here's what I've done.  First, based upon responses to my post, I've
> implemented the following commands when setting up the cluster:
>> crm_resource  -r SS16201289RN00023 --meta -p resource-stickiness -v 0# 
>> (someone asserted that this was unnecessary)
>> crm_resource  --meta -r SS16201289RN00023 -p migration-threshold -v 1
>> crm_resource  -r SS16201289RN00023 --meta -p failure-timeout -v 10
>> crm_resource  -r SS16201289RN00023 --meta -p cluster-recheck-interval -v 15
> 
> In addition, I've added "-l reboot" to those instances where
> 'crm_master' is invoked by the RA to change resource scores.  I also
> found a location constraint in our setup that I couldn't understand
> the need for, and removed it.
> 
> After doing this, in my initial tests, I found that after 'kill -9
> ' was issued to the master, the slave instance on the other
> node was promoted to master within a few seconds.  However, it took
> 60 seconds before the killed resource was restarted.  In examining
> the cib.xml file, I found an "rsc-options-failure-timeout", which
> was set to "1min".  Thinking "aha!", I added the following line
>> crm_attribute --type rsc_defaults --name failure-timeout --update 15
> 
> Sadly, this does not appear to have had any impact.

It won't have any impact on your SS16201289RN00023 resource since it
has its own failure timeout (closer, hence overriding such an outermost
fallback value, barring deferral to the built-in default when that
default is not set explicitly) set to a value of 10 seconds per the
above.

Admittedly, the documentation could use a precise formalization
of the precedence rules.

Anyway, this whole seems moot to me, see below.
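
For reference, a quick way to check which level a value actually comes
from -- a sketch, merely reusing the resource name from the quoted
commands:

  # per-resource meta attribute (closest, wins when set)
  crm_resource -r SS16201289RN00023 --meta -g failure-timeout
  # cluster-wide fallback from rsc_defaults
  crm_attribute --type rsc_defaults --name failure-timeout --query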

> So, while the good news is that the failover occurs as I'd hoped
> for, the time required to restart the failed resource seems
> excessive.  Apparently setting the failure-timeout values isn't
> sufficient.  As an experiment, I issued a "crm_resource -r
>  --cleanup" command shortly after the failover took
> effect and found that the resource was quickly restarted.  Is that
> the recommended procedure?  If so, is changing the "failure-timeout"
> and "cluster-recheck-interval" really necessary?
> 
> Finally, for a period of about 20 seconds while the resource is
> being restarted (it takes about 30s to start up the resource's
> application), it appears that both nodes are masters.  E.g. here's
> the 'crm_mon' output.
> 
> [root@mgraid-16201289RN00023-0 bin]# date;crm_mon -1
> Thu Aug 15 09:44:59 PDT 2019
> Stack: corosync
> Current DC: mgraid-16201289RN00023-0 (version 1.1.19-8.el7-c3c624ea3d) - 
> partition with quorum
> Last updated: Thu Aug 15 09:44:59 2019
> Last change: Thu Aug 15 06:26:39 2019 by root via crm_attribute on 
> mgraid-16201289RN00023-0
> 
> 2 nodes configured
> 4 resources configured
> 
> Online: [ mgraid-16201289RN00023-0 mgraid-16201289RN00023-1 ]
> 
> Active resources:
> 
>  Clone Set: mgraid-stonith-clone [mgraid-stonith]
>  Started: [ mgraid-16201289RN00023-0 mgraid-16201289RN00023-1 ]
>  Master/Slave Set: ms-SS16201289RN00023 [SS16201289RN00023]
>  Masters: [ mgraid-16201289RN00023-0 mgraid-16201289RN00023-1 ]
> ^^
> This seems odd.  Eventually the just-started resource reverts to a slave, but 
> this doesn't make sense to me.
> 
> For the curious, I've attached the 'cibadmin --query' output from
> the DC node, taken just prior to issuing the 'kill -9' command, as
> well as the corosync.log file from the DC following the 'kill -9'
> command.

Thanks for attaching both, since your above described situation starts
to make a bit more sense to me.

I think you've observed (at least that's what I gather from the
provided log) the resource role continually flapping due to
a non-intermittent/deterministic obstacle on one of the nodes:

  nodes for simplicity: A, B
  only the M/S resource in question is considered
  notation: [node, state of the resource at hand]
  
  1. [A, master], [B, slave]

  2. for some reason, A fails (spotted with monitor)

  3. an attempt to demote A also fails, likely for the same reason;
 note that the agent itself notices something is pretty fishy
 there(!):
 > WARNING: ss_demote() trying to demote a resource that was
 > not started
 while B gets promoted

  4. [A, stopped], [B, master]

  5. looks like at this point, there's a sort of a deadlock,
 since the missing instance of the 

Re: [ClusterLabs] [Announce] clufter v0.77.2 released

2019-08-15 Thread Jan Pokorný
On 14/08/19 16:26 +0200, Jan Pokorný wrote:
> I am happy to announce that clufter, a tool/library for transforming
> and analyzing cluster configuration formats, got its version 0.77.2
> tagged and released (incl. signature using my 60BCBB4F5CD7F9EF key):
> <https://pagure.io/releases/clufter/clufter-0.77.2.tar.gz>
> <https://pagure.io/releases/clufter/clufter-0.77.2.tar.gz.asc>
> or alternative (original) location:
> <https://people.redhat.com/jpokorny/pkgs/clufter/clufter-0.77.2.tar.gz>
> <https://people.redhat.com/jpokorny/pkgs/clufter/clufter-0.77.2.tar.gz.asc>
> 
> The updated test suite for this version is also provided:
> <https://pagure.io/releases/clufter/clufter-0.77.2-tests.tar.xz>
> or alternatively:
> <https://people.redhat.com/jpokorny/pkgs/clufter/clufter-0.77.2-tests.tar.xz>

Later on, I've found out that Fedora now mandates packagers to run
the OpenPGP signature verification as an initial step in the build
process whenever upstream publishes such signatures.  Preferably in
this scenario, upstream would also publish the key(s) as a publicly
available key ring (this is something immutable and easily shareable
in a truly distributed manner, also independent of the fragile
-- as we could learn recently -- key server infrastructure).

In addition, I've realized that using my general signing key
associated with my work life identity (and used to sign this very
message, for instance) is not the best choice for any sort of
long-term contingency.

Therefore, I am now killing two birds with one stone with the following:

1. I am hereby publishing such a keyring as:
   <https://releases.pagure.org/clufter/clufter-2019-08-15-5CD7F9EF.keyring>
   
<https://people.redhat.com/jpokorny/pkgs/clufter/clufter-2019-08-15-5CD7F9EF.keyring>
   with following digests:
   SHA256 (clufter-2019-08-15-5CD7F9EF.keyring) = 
b2917412ed6d15206235ad98a14f4b51cbe15c66f534965bd31a13c01984305b
   SHA512 (clufter-2019-08-15-5CD7F9EF.keyring) = 
beedc493a04e3bd13c88792b0d9c0b9667a3a182416524c755ef1a2d60328c049d79a9ee88648eaf3f5965f8819221667fbb6d7fe2b5c92cd38b96226d44ea31

2. I am hereby and ahead-of-time detailing how the rollover is to be
   conducted should I decide to swap the signing key/related signing
   identity, barring a disaster (the purpose is to render any rough
   tampering attempts self-evident, since they would not follow the
   protocol set forth here):

   the first clufter release following such a change will be signed
   with both the above specified key and the new/non-identical one
   (concatenated detached signatures in one file as usual), which
   will likewise be present at the distribution site/mirror in the form
   "clufter-<date of change>-<new key ID>.keyring",
   where <date of change> is to correspond to the closest
   date prior to such release (in case there are more);
   once this "transaction is committed", the default assumption for
   any subsequent one remains along the same lines, just with
   the newly established key as the origin

(Rationale for encoding the date hints in the file names: it shall be
trivial to pick the correct keyring[s] related to the timestamp
of the tarball -- or of the newest file within the tarball.)
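
For downstream consumers, the verification then boils down to roughly
this sketch (file names per the links above; the printed digest is to
be compared against the published one by eye or by script):

  # 1. check the keyring file itself against the published digest
  sha256sum clufter-2019-08-15-5CD7F9EF.keyring
  # 2. import it and verify the release tarball signature
  gpg --import clufter-2019-08-15-5CD7F9EF.keyring
  gpg --verify clufter-0.77.2.tar.gz.asc clufter-0.77.2.tar.gz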

-- 
Jan (Poki)


pgpf4BgI3Mi3s.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] cloned ethmonitor - upon failure of all nodes

2019-08-15 Thread Jan Pokorný
On 15/08/19 10:59 +0100, solarmon wrote:
> I have a two node cluster setup where each node is multi-homed over two
> separate external interfaces - net4 and net5 - that can have traffic load
> balanced between them.
> 
> I have created multiple virtual ip resources (grouped together) that should
> only be active on only one of the two nodes.
> 
> I have created ethmonitor resources for net4 and net5 and have created a
> constraint against the virtual ip resources group.
> 
> When one of the net4/net5 interfaces is taken

clarification request: taken _down_ ?

(and if so, note that there's a persistent misconception that
ifdown equals pulling the cable out or cutting it apart physically,
and I am not sure whether ethmonitor is, by any chance, as sensitive
to this difference as corosync used to be for years while people
kept getting it wrong)

> on the active node (where the virtual IPs are), the virtual ip
> resource group switches to the other node.  This is working as
> expected.
> 
> However, when either of the net4/net5 interfaces are down on BOTH nodes -
> for example, if net4 is down on BOTH nodes - the cluster seems to get
> itself in to a flapping state where there virtual IP resources keeps
> becoming available then unavailable. Or the virtual IP resources group
> isn't running on any node.
> 
> Since net4 and net5 interfaces can have traffic load-balanced across them,
> it is acceptable for the virtual IP resources to be running any of the
> node, even if the same interface (for example, net4) is down on both nodes,
> since the other interface (for example, net5) is still available on both
> nodes.
> 
> What is the recommended way to configure the ethmonitor and constraint
> resources for this type of multi-homed setup?

-- 
Jan (Poki)


pgpkVbJsN1vOV.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] [Announce] clufter v0.77.2 released

2019-08-14 Thread Jan Pokorný
I am happy to announce that clufter, a tool/library for transforming
and analyzing cluster configuration formats, got its version 0.77.2
tagged and released (incl. signature using my 60BCBB4F5CD7F9EF key):
<https://pagure.io/releases/clufter/clufter-0.77.2.tar.gz>
<https://pagure.io/releases/clufter/clufter-0.77.2.tar.gz.asc>
or alternative (original) location:
<https://people.redhat.com/jpokorny/pkgs/clufter/clufter-0.77.2.tar.gz>
<https://people.redhat.com/jpokorny/pkgs/clufter/clufter-0.77.2.tar.gz.asc>

The updated test suite for this version is also provided:
<https://pagure.io/releases/clufter/clufter-0.77.2-tests.tar.xz>
or alternatively:
<https://people.redhat.com/jpokorny/pkgs/clufter/clufter-0.77.2-tests.tar.xz>

I am not so happy that this is limited to the bare minimum to get
clufter working with the upcoming Python 3.8 (which appears rather
aggressive about compatibility, even if some of that is just
enforcement of previous deprecations) plus some small accrued changes
over time, but there was no room to deliver more since the last release.
Quite some catching up with recent developments, as also asked about on
this list[1], is pending; hopefully this will get rectified soon.

[1] https://lists.clusterlabs.org/pipermail/users/2019-July/026057.html


Changelog highlights for v0.77.2 (also available as a tag message):

- Python 3 (3.8 in particular) compatibility improving release

- enhancements:
  . knowledge about mapping various platforms to particular cluster
component sets got updated so as to target these more reliably
-- note however that current capacity for package maintenance
does not allow for adding support for new evolutions of such
components, even when they are nominally recognized like that
(mostly a concern regarding corosync3/kronosnet, and new and
backward incompatible changes in pcs)
  . specfile received more care regarding using precisely qualified
Python interpreters and process of Python byte-compilation was
taken fully under the explicit control of where familiarity
for that is established
- internal enhancements:
  . previously introduced text/data separation to align with Python 3
regression in the test suite was rectified
  . multiple newly identified issues with Python 3.8 were fixed
(deprecated and dropped standard library objects swapped for
the straightforward replacements, newly imposed metaclass
and relative module import related constraints were reflected)
  . (automatically reported) resource (open file descriptor) leak
was resolved

* * *

The public repository (notably master and next branches) is currently at

(rather than ).

Official, signed releases can be found at
 or, alternatively, at

(also beware, automatic git archives preserve a "dev structure").

Natively packaged in Fedora (python3-clufter, clufter-cli, ...).

Issues & suggestions can be reported at either of (regardless if Fedora)
,
.


Happy clustering/high-availing :)

-- 
Jan (Poki)


pgpKgJJYweR5p.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Master/slave failover does not work as expected

2019-08-13 Thread Jan Pokorný
On 13/08/19 09:44 +0200, Ulrich Windl wrote:
> Harvey Shepherd wrote on 12.08.2019 at 23:38 in message:
>> I've been experiencing exactly the same issue. Pacemaker prioritises 
>> restarting the failed resource over maintaining a master instance. In my 
>> case 
>> I used crm_simulate to analyse the actions planned and taken by pacemaker 
>> during resource recovery. It showed that the system did plan to failover the 
>> master instance, but it was near the bottom of the action list. Higher 
>> priority was given to restarting the failed instance, consequently when that 
>> had occurred, it was easier just to promote the same instance rather than 
>> failing over.
> 
> That's interesting: Maybe usually it's actually faster to restart a
> failed (master) process rather than promoting a slave to master,
> possibly demoting the old master to slave, etc.
> 
> But most obviously while there is a (possible) resource utilization
> for resources, there is none for operations (AFAIK): If one could
> configure "operation costs" (maybe as rules), the cluster could
> prefer the transition with least costs. Unfortunately it will make
> things more complicated.
> 
> I could even imagine if you set the cost for "stop" to infinity, the
> cluster will not even try to stop the resource, but will fence the
> node instead...

Very courageous and highly nontrivial if you think about the
scalability impact (while at it, not that this wouldn't be mitigable
to some extent by switching the single brain/DC into a segmented
multi-leader approach combined with hierarchical scheduling -- there
are usually some clusters [pun intended] of resources rather than each
one coinciding with all the others when the total count goes up).

Anyway, thanks for sharing the ideas, Ulrich, not just now :-)

-- 
Jan (Poki)


pgpiEPvKday33.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Restoring network connection breaks cluster services

2019-08-12 Thread Jan Pokorný
On 07/08/19 16:06 +0200, Momcilo Medic wrote:
> On Wed, Aug 7, 2019 at 1:00 PM Klaus Wenninger  wrote:
> 
>> On 8/7/19 12:26 PM, Momcilo Medic wrote:
>> 
>>> We have three node cluster that is setup to stop resources on lost
>>> quorum.  Failure (network going down) handling is done properly,
>>> but recovery doesn't seem to work.
>> 
>> What do you mean by 'network going down'?
>> Loss of link? Does the IP persist on the interface
>> in that case?
> 
> Yes, we simulate faulty cable by turning switch ports down and up.
> In such a case, the IP does not persist on the interface.
> 
>> That there are issues reconnecting the CPG-API sounds strange to me.
>> Already the fact that something has to be reconnected. I got it
>> that your nodes were persistently up during the
>> network-disconnection. Although I would have expected fencing to
>> kick in at least on those which are part of the non-quorate
>> cluster-partition.  Maybe a few words more on your scenario
>> (fencing setup, e.g.) would help to understand what is going on.
> 
> We don't use any fencing mechanisms, we rely on quorum to run the
> services.  In more detail, we run three node Linbit LINSTOR storage
> that is hyperconverged.  Meaning, we run clustered storage on the
> virtualization hypervisors.
> 
> We use pcs in order to have linstor-controller service in high
> availabilty mode.  Policy for no quorum is to stop the resources.
> 
> In such hyperconverged setup, we can't fence a node without impact.
> It may happen that network instability causes primary node to no
> longer be primary.  In that case, we don't want running VMs to go
> down with the ship, as there was no impact for them.
> 
> However, we would like to have high-availability of that service
> upon network restoration, without manual actions.

This spurred a train of thought that is admittedly not immediately
helpful in this case:

* * *

1. the word "converged" is a fitting word for how we'd like
   the cluster stack to appear (from the outside), but what we have
   is that some circumstances are not clearly articulated across the
   components meaning that there's no way for users to express the
   preferences in simple terms and in non-conflicting and
   unambiguous ways when 2+ components' realms combine together
   -- high level tools like pcs may attempt to rectify that to some
   extent, but they fall short when there are no surfaces to glue (at
   least unambiguously, see also parallel thread about shutting the
   cluster down in the presence of sbd)

   it seems to me that the very circumstance that was hit here is
   exactly where corosync authors decided that it's rare and obnoxious
   to indicate up the chain for a detached destiny reasoning (which
   pacemaker normaly performs) enough that they rather stop right
   there (and in a well-behaved cluster configuration hence ask to be
   fenced)

   all is actually sound, until one starts to make compromises like
   was done here, ditching the fencing (think: sanity
   assurance) layer and relying fully on no-quorum-policy=stop, naively
   thinking that one is 100% covered; but with a purely pacemaker hat
   on, we -- the pacemaker devs -- can't really give you such
   a guarantee, because we have no visibility into said "bail out"
   shortcuts that corosyncs makes for such rare circumstances -- you
   shall refer to corosync documentation, but it's not covered there
   (man pages) AFAIK (if it was _all_ indicated to pacemaker, just
   standard response on quorum loss could be carried out, not
   resorting to anything more drastic like here)


2. based on said missing explicit and clear inter-component signalling
   (1.) and the logs provided, it's fair to bring an argument that
   pacemaker had an opportunity to see, barring said explicit API
   signalling, that corosync died, but then, the major assumed case is:

   - corosync crashed or was explicitly killed (perhaps to test the
 claimed HA resiliency towards the outer world)

   - broken pacemaker-corosync communication consistency
 (did some messages fall through the cracks?)

   i.e., cluster endangering scenarios, not something to keep alive
   at all costs; better to try to stabilize the environment first,
   not to speak of the chances with a "miracles awaiting" strategy


3. despite 2., there was a decision on systemd-enabled systems to
   actually pursue said "at all costs" (although implicitly
   mitigated when the restart cycles would be happening at a rapid
   pace)

   - it's all then in the hands of slightly non-deterministic timing
 (token loss timeout window hit/miss, although perhaps not in
 this very case if the state within the protocol would be a clear
 indicator for other corosync peers)
   
   - I'd actually assume the pacemaker would be restarted in said
 scenario (unless one fiddled with the pacemaker service file,
 that is), and just prior to that, corosync would be forcibly
 started anew as well

   - is the 

Re: [ClusterLabs] how to connect to the cluster from a docker container

2019-08-06 Thread Jan Pokorný
On 06/08/19 13:36 +0200, Jan Pokorný wrote:
> On 06/08/19 10:37 +0200, Dejan Muhamedagic wrote:
>> Hawk runs in a docker container on one of the cluster nodes (the
>> nodes run Debian and apparently it's rather difficult to install
>> hawk on a non-SUSE distribution, hence docker). Now, how to
>> connect to the cluster? Hawk uses the pacemaker command line
>> tools such as cibadmin. I have a vague recollection that there is
>> a way to connect over tcp/ip, but, if that is so, I cannot find
>> any documentation about it.
> 
> I think that what you are after is one of:
> 
> 1. have docker runtime for the particular container have the abstract
>Unix sockets shared from the host (--network=host? don't remember
>exactly)
> 
>- apparently, this weak style of compartmentalization comes with
>  many drawbacks, so you may be facing hefty work of cutting any
>  other interferences stemming from pre-chrooting assumptions of
>  what is a singleton on the system, incl. sockets etc.
> 
> 2. use modern enough libqb (v1.0.2+) and use
> 
>  touch /etc/libqb/force-filesystem-sockets
> 
>on both host and within the container (assuming those two locations
>are fully disjoint, i.e., not an overlay-based reuse), you should
>then be able to share the respective reified sockets simply by
>sharing the pertaining directory (normally /var/run it seems)
> 
>- if indeed a directory as generic as /var/run is involved,
>  it may also lead to unexpected interferences, so the more
>  minimalistic the container is, the better I think
>  (or you can recompile libqb and play with path mapping
>  in container configuration to achieve smoother plug-in)

Oh, and there's additional prerequisite for both to at least
theoretically work -- 1:1 sharing of /dev/shm (which may also
be problematic in a sense).
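
Putting option 2. and the above prerequisite together, a rough sketch
only (the image name is hypothetical and the wholesale path sharing is
crude, not a tested recipe):

  # on the host and inside the container image alike (libqb >= 1.0.2)
  touch /etc/libqb/force-filesystem-sockets
  # then share the socket directory and /dev/shm with the container
  # (a narrower path mapping than all of /var/run would be safer)
  docker run -d --name hawk \
    -v /var/run:/var/run \
    -v /dev/shm:/dev/shm \
    hawk-image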

> Then, pacemaker utilities would hopefully work across the container
> boundaries just as if they were fully native, hence hawk shall as
> well.
> 
> Let us know how far you'll get and where we can collectively join you
> in your attempts; I don't think we have had such experience disseminated
> here.  I know for sure I haven't ever tried this in practice, though
> someone else here could have.  Also, there may be a lot of fun with various
> Linux Security Modules like SELinux.

-- 
Jan (Poki)


pgpqEk3Xjsvnq.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] how to connect to the cluster from a docker container

2019-08-06 Thread Jan Pokorný
Hello Dejan,

nice to see you around,

On 06/08/19 10:37 +0200, Dejan Muhamedagic wrote:
> Hawk runs in a docker container on one of the cluster nodes (the
> nodes run Debian and apparently it's rather difficult to install
> hawk on a non-SUSE distribution, hence docker). Now, how to
> connect to the cluster? Hawk uses the pacemaker command line
> tools such as cibadmin. I have a vague recollection that there is
> a way to connect over tcp/ip, but, if that is so, I cannot find
> any documentation about it.

I think that what you are after is one of:

1. have docker runtime for the particular container have the abstract
   Unix sockets shared from the host (--network=host? don't remember
   exactly)

   - apparently, this weak style of compartmentalization comes with
 many drawbacks, so you may be facing hefty work of cutting any
 other interferences stemming from pre-chrooting assumptions of
 what is a singleton on the system, incl. sockets etc.

2. use modern enough libqb (v1.0.2+) and use

 touch /etc/libqb/force-filesystem-sockets

   on both host and within the container (assuming those two locations
   are fully disjoint, i.e., not an overlay-based reuse), you should
   then be able to share the respective reified sockets simply by
   sharing the pertaining directory (normally /var/run it seems)

   - if indeed a directory as generic as /var/run is involved,
 it may also lead to unexpected interferences, so the more
 minimalistic the container is, the better I think
 (or you can recompile libqb and play with path mapping
 in container configuration to achieve smoother plug-in)

Then, pacemaker utilities would hopefully work across the container
boundaries just as if they were fully native, hence hawk shall as
well.

Let us know how far you'll get and where we can collectively join you
in your attempts; I don't think we have had such experience disseminated
here.  I know for sure I haven't ever tried this in practice, though
someone else here could have.  Also, there may be a lot of fun with various
Linux Security Modules like SELinux.

-- 
Jan (Poki)


pgpLs2h1wuNFz.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Feedback wanted: Node reaction to fabric fencing

2019-07-25 Thread Jan Pokorný
On 24/07/19 12:33 -0500, Ken Gaillot wrote:
> A recent bugfix (clbz#5386) brings up a question.
> 
> A node may receive notification of its own fencing when fencing is
> misconfigured (for example, an APC switch with the wrong plug number)
> or when fabric fencing is used that doesn't cut the cluster network
> (for example, fence_scsi).

One related idea that'd be better to think through at its own pace:
whether it would make sense to maximize the benefit of knowing
which kind of behaviour to expect from a particular abstracted
fencing device.  Is it an absolute cut-off of the whole node's acting,
or is it just a partial isolation where it presumably matters
the most (access to disk, access to network resources, ...)?
Then, a dichotomy in failure modes could be introduced, since
these are effectively _different_ disaster limiting scenarios
with different pros and cons (consider also debug-ability).
I always had mixed feelings about putting total/partial fencing
into the same bucket.  Apparently, that information would need
to be propagated via the metadata of the agents, meaning pulling
more complexity on that level.

Broader picture might even be that compositions of the resources
could as well point out which kinds of shared resources are in
danger of amplifying the failure/causing split brain etc. and
hence offer the feedback which of these are yet to be covered
if the absolute cut-off is not preferred/available for whatever
reason.

/me gets back from daydreaming

-- 
Poki


pgpoaxij46hZj.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Which shell definitions to include?

2019-07-23 Thread Jan Pokorný
On 23/07/19 09:07 -0500, Ken Gaillot wrote:
> On Tue, 2019-07-23 at 08:48 +0200, Ulrich Windl wrote:
>> It's not that I like all the changes systemd requires, but systemd
>> complains about not being able to unmount /var while /var/run or
>> /var/lock is being used...
> 
> Agreed, it should be a build-time option whether to use /run or
> /var/run

Sounds more like a systemic distro inconsistency: when /var/run
and /var/lock are just symlinks to /run-rooted counterparts (is that
your case?), it hints that /var shall never be unmounted
prior to /run itself (at which point any still surviving daemons will
be in trouble just the same, so the ordering doesn't sit well for
these for some reason -- I am assuming the only use case here is
a system shutdown).

I think I'd pay attention to other parts of the system or customized
configuration if you are seeing problems like mentioned above;
feels orthogonal to which access path for the same effective location
resource-agents project uses.
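
A quick check of whether that merged layout is in place on a given
system -- just a sketch:

  # symlinks pointing into /run indicate the /var/run -> /run aliasing
  ls -ld /var/run /var/lock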

/me wonders, was it a deliberate design step from systemd?

-- 
Jan (Poki)


pgpS0GNCA4tjQ.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] [Question] About clufter's Corosync 3 support

2019-07-18 Thread Jan Pokorný
On 18/07/19 18:08 +0900, 井上和徳 wrote:
> 'pcs config export' fails in RHEL 8.0 environment because clufter
> does not support Corosync 3 (Camelback).
> How is the state of progress? https://pagure.io/clufter/issue/4

As much as I'd want to, time budget hasn't been allocated for this,
pacemaker and other things keep me busy.  I need to review all the
configuration changes and work on direct and inter (one-way only,
actually) conversion procedures.

I hope to get to that later this year.

Any consumable form of said analysis would ease that task for me.

> [root@r80-1 ~]# cat /etc/redhat-release
> Red Hat Enterprise Linux release 8.0 (Ootpa)
> 
> [root@r80-1 ~]# dnf list | grep -E "corosync|pcs|clufter" | grep @
> clufter-bin.x86_640.77.1-5.el8@rhel-ha
> clufter-cli.noarch0.77.1-5.el8@rhel-ha
> clufter-common.noarch 0.77.1-5.el8@rhel-ha
> corosync.x86_64   3.0.0-2.el8 @rhel-ha
> corosynclib.x86_643.0.0-2.el8 @rhel-appstream
> pcs.x86_640.10.1-4.el8@rhel-ha
> python3-clufter.noarch0.77.1-5.el8@rhel-ha
> 
> [root@r80-1 ~]# pcs config export pcs-commands
> Error: unable to export cluster configuration: 'pcs2pcscmd-camelback'
> Try using --interactive to solve the issues manually, --debug to get more
> information, --force to override.
> 
> [root@r80-1 ~]# clufter -l | grep pcs2pcscmd
>   pcs2pcscmd-flatiron  (Corosync/CMAN,Pacemaker) cluster cfg. -> reinstating
>   pcs2pcscmd-needle(Corosync v2,Pacemaker) cluster cfg. -> reinstating
>   pcs2pcscmd   alias for pcs2pcscmd-needle

-- 
Jan (Poki)


pgpsFIwB2Ucky.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] pacemaker alerts list

2019-07-18 Thread Jan Pokorný
On 17/07/19 19:07 +, Gershman, Vladimir wrote:
> This would be for the Pacemaker.
> 
> Seems like the alerts in the link you sent, refer to numeric codes,
> so where would I see all the codes and their meanings ?  This would
> allow a way to select what I need to monitor. 

Unfortunately, we currently won't make do without direct code
references, but keep in mind we are in the expert area already
when the alerts usefulness is to be maxed out (and one is then
also responsible to update such a tight wrapping accordingly if
or when incompatible changes arrive in new versions -- presumably
any slightly more disruptive releases would designate that in the
versioning scheme [non-minor component being incremented] and
perhaps it'd be noted in the release notes as well).

That being said, more detailed documentation and perhaps accompanied
with firmer assurances as to the details of so far vaguely specified
informative data items attached to the "unit" of alert may arrive
in the future, and I bet contributions to make it happen faster
are warmly welcome, especially when driven by the real production
needs.

> For example:

[intentionally reordered]

> CRM_alert_status:
>   A numerical code used by Pacemaker to represent the operation
>   result (resource alerts only)  

See
https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.0.2/include/crm/services.h#L118-L129

> CRM_alert_desc:
>   Detail about event. For node alerts, this is the node's current
>   state (member or lost).

That's literal, see
https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.0.2/include/crm/cluster.h#L30-L31

>   For fencing alerts, this is a summary of the requested fencing
>   operation, including origin, target, and fencing operation error
>   code, if any.

This would indeed require extensive parsing of the generated string
for fields that are not present as standalone variables (here, node
to be fenced that is also available separately via CRM_alert_node):

https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.0.2/daemons/controld/controld_execd_state.c#L805-L809

>   For resource alerts, this is a readable string equivalent of
>   CRM_alert_status.  

See the first link above, translation from numeric codes is rather
symbolic, though:

https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.0.2/include/crm/services.h#L331-L340

(but may denote that some codes from the full enumeration are strictly
internal, based on a simple reasoning about the coverage, not sure)

Plus there's an exception for operations already known finished, for
which exit status from the actual agent's execution is reproduced here
in words, and luckily, that's actually documented:

https://github.com/ClusterLabs/resource-agents/blob/v4.3.0/doc/dev-guides/ra-dev-guide.asc#return-codes

> CRM_alert_target_rc:
>   The expected numerical return code of the operation (resource
>   alerts only)  

This appears to be primarily bound to OCF codes referred just above.

* * *

Hopefully that's enough to get you started with your own exploration.
Initially, I'd also suggest attaching your own dump-all alert handler
to get the real hands-on with the data at your disposal that can be
leveraged in your true handler.
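
Such a dump-all handler can be as small as this sketch (the log path is
arbitrary; install it on every node and hook it up as a regular alert
script, e.g. via pcs alert create or directly in the CIB's alerts
section):

  #!/bin/sh
  # append every CRM_alert_* variable of each delivered alert
  {
    env | grep '^CRM_alert_' | sort
    echo '---'
  } >> /var/log/pacemaker-alert-dump.log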

-- 
Jan (Poki)


pgpJ_ABsu74l5.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] Question about associating with ClusterLabs wrt. a local community (Was: I have a question.)

2019-07-09 Thread Jan Pokorný
Hello Kim,

On 08/07/19 13:11 +0900, 김동현 wrote:
> I'm Donghyun Kim.
> 
> I work as a system engineer in Korea.
> 
> In the meantime, I was very interested in the cluster and want
> to promote it in Korea.
> There are many high-availability cases in Linux systems.
> 
> The reason why I am sending this mail is whether I can use the name
> "ClusterLabs Korea" in my Facebook community.

I have no authority to decide this matter, and generally speaking,
the moniker in question, which we reached consensus on to represent "us"
as a community (and to be the label for an easy association with the SW
components of the cluster stack "we" evolve/deploy, also to avoid
further ambiguities or unnecessary verbosity when referring to "us"),
is still an informal convention rather than something we
could claim exclusivity over.

Nonetheless, it's very kind of you that you seek approval for
label reference/association first.  It's good to be clear in
these collective entity matters, like for instance about the
license/reusability of ClusterLabs logo (originally devised by
Krig, license subsequently clarified on the developers aimed
sibling list[1]).

> It's only for the Facebook community.
> 
> Then I'll be looking forward to your reply.
> have a nice day

I am leaving this question open for more impactful community members,
but if I could have a single question, it would be about the reasons
for having a rather detached group.  It may be some cultural context
I may be missing -- does it have, for instance, something to do with
language barrier, with face to face opportunities to meet up locally,
or with a heavy preference of Facebook over technically oriented forums
(like Reddit for general discussion, StackOverflow for Q&A, etc.; of
course, only if the blessed ClusterLabs communication channels[2] are not
a fit for whatever reason, otherwise they would ideally be the first
choice to avoid community fragmentation) where it would possibly be
more accessible for some people, also saving them from imminent leakage
of personal data, which is hardly (in a philosophical sense) justifiable
for the purpose of "having a technical discussion about particular
FOSS software"?

Just curious about this, looking forward to hearing from you.


P.S. speaking of community subdivisions, it's rather interesting to
me personally that the Japanese also advanced their local group (actually
long before I got to pacemaker), and as I just checked, it seems
still rather active: http://linux-ha.osdn.jp/wp/
(that's beside contributions to, say, Pacemaker coming in from people
of that country, and some stuff like https://github.com/linux-ha-japan
I never got to explore -- as mentioned, less fragmentation and we
could all be benefitting from getting to know the wider landscape
in a natural way, I believe).


[1] https://lists.clusterlabs.org/pipermail/developers/2019-April/002184.html
[2] https://clusterlabs.org/faq.html
+ https://wiki.clusterlabs.org/wiki/FAQ#Where_should_I_ask_questions.3F

-- 
Jan (Poki)


pgp1skYKDA_JZ.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Strange monitor return code log for LSB resource

2019-07-08 Thread Jan Pokorný
On 05/07/19 03:50 +, Harvey Shepherd wrote:
> I was able to resolve an issue which caused these logs to disappear.
> The problem was that the LSB script was named "logging" and the
> daemon that it controlled was also called "logging". The init script
> uses "start-stop-daemon --name" to start and stop the logging
> daemon, but "start-stop-daemon --stop --name logging" was causing
> both the init script and daemon to be terminated, and hence an error
> code would be returned for the stop operation.

Oh, that's nasty, indeed.  Incidentally, I was just recently pointing
out on the developers' counterpart list that /proc-based traversals are
rather treacherous, for multiple reasons:
https://lists.clusterlabs.org/pipermail/developers/2019-July/002207.html

You've provided another example of how relatively easy it is for
people to shoot themselves in the foot when a process is specified
imprecisely/vaguely/in alias-prone ways.  It's something you
want to avoid where practically feasible.

> I have no idea why crm_resource logs were produced, or why the logs
> seemed to indicate that that it was treating an LSB resource as OCF
> (expected return code of 7 - not running), but the logs were no
> longer generated once I renamed the init script.

This deserves a little correction here: "not running" is not exclusive
to the OCF class of resources -- it's defined also for LSB init scripts as
such, although admittedly not (officially) for the "status" operation
(e.g.
https://refspecs.linuxfoundation.org/LSB_3.1.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html
) so the fact pacemaker tolerantly interprets it at all may be solely
so as to get along with buggy (non-compliant) LSB init scripts, such as
the one that "crm_resource" was trying to monitor in your logs :-p

Anyway, good that you got rid of said phantom executions.
Should it recidivate, you can use the described trick to nail it down.

> I've not seen them since. There were definitely no external
> monitoring tools installed or in use (it's a very isolated system
> with very few resources installed), although I was using crm_monitor
> at the time. Does that run crm_resource under the hood?

AFAIK, there's no connection like that.

-- 
Jan (Poki)


pgpbx4jZosAsA.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Strange monitor return code log for LSB resource

2019-07-04 Thread Jan Pokorný
Harvey,

On 25/06/19 21:26 +, Harvey Shepherd wrote:
> Thanks for your reply Andrei. There is no external monitoring
> software running. The logs that I posted are from the pacemaker log
> with debug enabled.

that's exactly what's questionable here -- how can a crm_resource
invocation be at all correlated with the standard operation of the
pacemaker cluster if you claim you don't have a clue what else (such as
external monitoring built upon crm_resource or, by extension, upon
crm/pcs) could be triggering it?  That's where Andrei was zeroing in.

Anyway, to spare you a possibly heavy-handed inspection of circumstances
in your cluster, I'd suggest you install systemtap and run this command
as root:

  stap /usr/share/systemtap/examples/process/pstrace_exec.stp /logging

where pstrace_exec.stp can also be obtained from
https://sourceware.org/systemtap/examples/process/pstrace_exec.stp 
if your installation (systemtap-client package in my case) doesn't
carry it.

Over time, you'll likely observe occurrences where the "logging"
script is run directly by pacemaker-execd, plus some process chains
finishing with crm_resource; the latter shall suggest the cause of
why that is being run at all (where it likely should not -- see also
the remark on re-entrancy below).

> The Linux hostname of the node is "ctr_qemu", but the Pacemaker name
> is "primary". It is actually the same node.
> 
> I've just run the same test again and interestingly this time the
> source of the monitor logs seems to be shared between "crm_resource"
> and "pacemaker-schedulerd"
> 
> Jun 25 21:18:31 ctr_qemu pacemaker-execd [1234] (recurring_action_timer)  
>   debug: Scheduling another invocation of logging_status_1
> Jun 25 21:18:31 ctr_qemu pacemaker-execd [1234] (operation_finished)  
>   debug: logging_status_1:1181 - exited with rc=0
> Jun 25 21:18:31 ctr_qemu pacemaker-execd [1234] (operation_finished)  
>   debug: logging_status_1:1181:stdout [ Remote syslog service is running ]
> Jun 25 21:18:31 ctr_qemu pacemaker-execd [1234] (log_finished)  
> debug: finished - rsc:logging action:monitor call_id:127 pid:1181 exit-code:0 
> exec-time:0ms queue-time:0ms
> Jun 25 21:18:35 ctr_qemu crm_resource[1268] (determine_op_status) 
>   debug: logging_monitor_1 on primary returned 'not running' (7) instead 
> of the expected value: 'ok' (0)
> Jun 25 21:18:35 ctr_qemu crm_resource[1268] (unpack_rsc_op_failure)   
>   warning: Processing failed monitor of logging on primary: not running | rc=7
> Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (determine_op_status) 
>   debug: logging_monitor_1 on primary returned 'not running' (7) instead 
> of the expected value: 'ok' (0)
> Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (unpack_rsc_op_failure)   
>   warning: Processing failed monitor of logging on primary: not running | rc=7
> Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (common_print)  info: 
> logging   (lsb:logging):  Started primary
> Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (common_apply_stickiness) 
>   debug: Resource logging: preferring current location (node=primary, 
> weight=1)
> Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (native_assign_node)  
>   debug: Assigning primary to logging
> Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (LogActions)info: 
> Leave   logging   (Started primary)

another thing, actually:

> 25.06.2019 16:53, Harvey Shepherd пишет:
>> Pacemaker is reporting failed resource actions, but fail-count is not 
>> incremented for the resource:
>> 
>> Migration Summary:
>> * Node primary:
>> * Node secondary:
>> 
>> Failed Resource Actions:
>> * logging_monitor_1 on primary 'not running' (7): call=119, 
>> status=complete, exitreason='',
>> last-rc-change='Tue Jun 25 13:13:12 2019', queued=0ms, exec=0ms

I've got to wonder whether this can only be caused by what
pacemaker-execd _exclusively_ observed (but that could be a result of
interference with a crm_resource-initiated operation ... with the
--force-check parameter, or what else?); sadly, the log didn't contain
the referenced time of "last-rc-change".

Finally, speaking of interferences, this topic has been brought up
recently, regarding re-entrancy in particular:

https://lists.clusterlabs.org/pipermail/users/2019-February/025454.html

Apparently, the axioms of agent/resource execution that apply with
pacemaker alone won't hold when there are multiple actors, as seems to
be your case (pacemaker vs. an isolated crm_resource-based trigger).


Hope this helps

-- 
Jan (Poki)


pgpWwEAgXpABx.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Antw: Re: Two node cluster goes into split brain scenario during CPU intensive tasks

2019-07-01 Thread Jan Pokorný
On 01/07/19 13:26 +0200, Ulrich Windl wrote:
>>>> Jan Pokorný  schrieb am 27.06.2019 um 12:02
>>>> in Nachricht <20190627100209.gf31...@redhat.com>:
>> On 25/06/19 12:20 ‑0500, Ken Gaillot wrote:
>>> On Tue, 2019‑06‑25 at 11:06 +, Somanath Jeeva wrote:
>>> Addressing the root cause, I'd first make sure corosync is running at
>>> real‑time priority (I forget the ps option, hopefully someone else can
>>> chime in).
>> 
>> In a standard Linux environment, I find this ultimately convenient:
>> 
>>   # chrt ‑p $(pidof corosync)
>>   pid 6789's current scheduling policy: SCHED_RR
>>   pid 6789's current scheduling priority: 99
> 
> To me this is like pushing a car that already has a running engine! If
> corosync does crazy things, this will make things worse (i.e. enhance
> crazyness).

Note that said invocation is a read-only fetch of the process
characteristic.  So I'm not sure about your parable; it should rather
be compared to not paying full attention to driving itself for a bit
when checking the speedometer (the check alone won't presumably run
with such an ultimate scheduling priority, so the interferences shall
be fairly minimal, just as with `ps` or whatever programmatic fetch
of such data).

>> (requires util‑linux, procps‑ng)

-- 
Jan (Poki)


pgpZqBIWXgRnX.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Two node cluster goes into split brain scenario during CPU intensive tasks

2019-06-27 Thread Jan Pokorný
On 25/06/19 12:20 -0500, Ken Gaillot wrote:
> On Tue, 2019-06-25 at 11:06 +, Somanath Jeeva wrote:
> Addressing the root cause, I'd first make sure corosync is running at
> real-time priority (I forget the ps option, hopefully someone else can
> chime in).

In a standard Linux environment, I find this ultimately convenient:

  # chrt -p $(pidof corosync)
  pid 6789's current scheduling policy: SCHED_RR
  pid 6789's current scheduling priority: 99

(requires util-linux, procps-ng)
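
If someone prefers ps for the same read-only check (per Ken's note
above about forgetting the option), a rough equivalent might be
(format keywords per procps-ng, just a sketch):

  # CLS should read RR (SCHED_RR) and RTPRIO 99 for a realtime corosync
  ps -o pid,cls,rtprio,comm -C corosync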

> Another possibility would be to raise the corosync token
> timeout to allow for a greater time before a split is declared.

This is the unavoidable trade-off between limiting false positives
(negligible glitches triggering the riot) vs. detecting the actual
node/interconnect failures in a timely manner.  Just meant to note
it's not a one-way street; deliberation given the circumstances
is needed.
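
If it helps the deliberation, the values currently in effect can be
read back at runtime, and the knobs live in the totem section of
corosync.conf (a sketch from memory, key names may differ slightly
across corosync versions):

  # effective totem timing as seen in the runtime cmap
  corosync-cmapctl | grep -E 'runtime\.config\.totem\.(token|token_retransmits)'

  # the corresponding knobs go to the "totem" section of
  # /etc/corosync/corosync.conf, e.g.:
  #   totem {
  #     token: 3000
  #     token_retransmits_before_loss_const: 10
  #   }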

-- 
Jan (Poki)


pgpzLoaxYZiqd.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Fence agent definition under Centos7.6

2019-06-20 Thread Jan Pokorný
On 18/06/19 13:08 +, Michael Powell wrote:
> * The document does not specify where the agent should be installed.
>   (On /usr/sbin.)

This could give a precise answer:
https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.0.2/lib/fencing/st_rhcs.c#L33

> * The document does not mention that, if the agent requires root
> permissions (a not-unlikely scenario, I think), it needs the setuid
> bit set (e.g. chmod u+s /usr/sbin/fence-agent), or some similar
> mechanism, since crmd and stonith-ng apparently do not execute the
> fence agent with root privileges.

The executing daemons (pacemaker-fenced [formerly stonith-ng] or possibly
pacemaker-execd [formerly lrmd] as used for general resources) run
as root, so you can rely on having the root perms from the get-go.

Do you observe the contrary?
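
A quick way to verify could be (process names differ between the
pacemaker 1.1 and 2.x naming, hence both listed; just a sketch):

  # the USER column should show root for the fencer/executor daemons
  ps -o user,pid,comm -C pacemaker-fenced,pacemaker-execd,stonithd,lrmd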

> * The document does not specify support for "action=metadata". 
> I got this from
> https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc
> and by looking at the output of the fence-ipmilan agent.

This should be covered with the very last line in FenceAgentAPI.md:

  Output of fence_agent -o metadata should validate against RelaxNG
  schema available at lib/metadata.rng
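
For illustration, a validation run along those lines might look like
this (assuming xmllint from libxml2 and a fence-agents source checkout
providing lib/metadata.rng; the agent name is just an example):

  fence_ipmilan -o metadata | xmllint --relaxng lib/metadata.rng -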

Oyvind could provide further guidance or corrections, but these
fence agents are not unified with resource agents in this regard
(yet, if that's still a final goal).

> Another issue I found is that I needed to use "pcs create stonith"
> rather than "crm configure primitive" to create the agent.  Our old
> code was based upon crmsh, and my on-line reading suggested that the
> choice of crmsh vs pcs was up to the user.  This is apparently not
> true.  For the most part, it looks as though transitioning our
> scripts from crmsh to pcs will be pretty straightforward, though.

Crm shell representatives can comment on this.
As far as pcs is concerned, it's rather natural for it to support
that family of fence-agents, since it's inherited from an older
cluster stack arrangement...

> Finally, I have two questions:
> * The FenceAgentsAPI.md doc describes the cluster.conf file (though,
> again, it does not indicate where it should be installed.)  AFAICT,
> this isn't really necessary for the fence agent to reboot another
> node.  Am I missing something?

...sometimes referred to per its legacy name Red Hat Cluster Suite
(that explains "rhcs" abbreviations in the source file name linked
above).  That stack used to use single point of configuration, and
it was -- you've guessed it -- cluster.conf file.  That it is still
referred from that API document is just a legacy thing, but the
important parts remain practically unchanged from those times.

> * In some cases, the fence agent receives "action=reboot" followed
> by "nodename=".  I.e. it's asked
> to reboot itself.  This doesn't make sense to me.  Again, what am I
> missing?

Might be a case of misconfiguration.  Showing your configuration
would provide a basis for further dissection.  Also, "node running the
agent" is ambiguous; it may mean the node running periodic "monitor"
operations, which doesn't have any direct bearing on the node being
selected as a target to be fenced.

-- 
Jan (Poki)


pgpHGr5B3ViXo.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] CVE-2019-12779 assignment for libqb (Was: [Announce] libqb 1.0.4/1.0.5 release)

2019-06-10 Thread Jan Pokorný
On 15/04/19 14:56 +0100, Christine Caulfield wrote:
> We are pleased to announce the release of libqb 1.0.4
> 
> Source code is available at:
> https://github.com/ClusterLabs/libqb/releases/download/v1.0.4/libqb-1.0.4.tar.xz
> 
> Please use the signed .tar.gz or .tar.xz files with the version number
> in rather than the github-generated "Source Code" ones.
> 
> This is a security update to 1.0.3. Files are now opened with O_EXCL and
> are placed in directories created by mkdtemp().

For the record, this was (finally, after some initial hesitation about
the process in a situation like this) assigned CVE-2019-12779.

The summary at the MITRE site makes a strict cut in proposing only v1.0.5
as not vulnerable, which is not quite the interpretation proposed here, but
hopefully my other response in this thread makes it clear that v1.0.4
is not the right "peace of mind" target for other reasons.

-- 
Jan (Poki)


pgpfW4EvV9mB2.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] VirtualDomain and Resource_is_Too_Active ?? - problem/error

2019-06-04 Thread Jan Pokorný
On 03/06/19 13:39 +0200, Jan Pokorný wrote:
> Yes, there are at least two issues in ocf:heartbeat:VirtualDomain:
> 
> 1/ dealing with user input derived value, in an unchecked manner, while
>such value can be an empty string or may contain spaces
>(for the latter, see also I've raised back then:
>https://lists.clusterlabs.org/pipermail/users/2015-May/007629.html 
>https://lists.clusterlabs.org/pipermail/developers/2015-May/000620.html
>)
> 
> 2/ agent doesn't try to figure out whether it tries to parse
>    a reasonably familiar file, in this case, it means it can
>    be grep'ing a file spanning up to terabytes of data
> 
> In your case, you mistakenly pointed the agent (via "config" parameter
> as highlighted above) not to the expected configuration, but rather to
> the disk image itself -- that's not how to talk to libvirt -- only such
> a guest configuration XML shall point to where the disk image itself
> is located instead.  See ocf_heartbeat_VirtualDomain(7) or the output
> you get when invoking the agent with "meta-data" argument.
> 
> Such a configuration issue could be indicated reliably by passing
> "validate-all" as an action for the configured set of agent parameters,
> if 1/ and/or 2/ did not exist in the agent's implementation.
> Please, file issues against VirtualDomain agent to that effect at
> https://github.com/ClusterLabs/fence-agents/issues

sorry, apparently resource-agents, hence

https://github.com/ClusterLabs/resource-agents/issues

(I don't think GitHub allows after-the-fact issue retargeting, sadly)

-- 
Jan (Poki)


pgpnoFDROu6cE.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] VirtualDomain and Resource_is_Too_Active ?? - problem/error

2019-06-03 Thread Jan Pokorný
On 29/05/19 09:29 -0500, Ken Gaillot wrote:
> On Wed, 2019-05-29 at 11:42 +0100, lejeczek wrote:
>> I doing something which I believe is fairly simple, namely:
>> 
>> $ pcs resource create HA-work9-win10-kvm VirtualDomain \
>>   hypervisor="qemu:///system" \
>>   config="/0-ALL.SYSDATA/QEMU_VMs/HA-work9-win10.qcow2" \
 
>>   migration_transport=ssh --disable
>> 
>> virt guest is good, runs in libvirth okey, yet pacemaker fails:
>> 
>> ...
>>   notice: Calculated transition 1864, saving inputs in 
>> /var/lib/pacemaker/pengine/pe-input-2022.bz2
>>   notice: Configuration ERRORs found during PE processing.  Please run 
>> "crm_verify -L" to identify issues.
>>   notice: Initiating monitor operation HA-work9-win10-kvm_monitor_0 locally 
>> on whale.private
>>   notice: Initiating monitor operation HA-work9-win10-kvm_monitor_0 on 
>> swir.private
>>   notice: Initiating monitor operation HA-work9-win10-kvm_monitor_0 on 
>> rider.private
>>  warning: HA-work9-win10-kvm_monitor_0 process (PID 2103512) timed out
>>  warning: HA-work9-win10-kvm_monitor_0:2103512 - timed out after 3ms
>>   notice: HA-work9-win10-kvm_monitor_0:2103512:stderr [ 
>> /usr/lib/ocf/resource.d/heartbeat/VirtualDomain: line 981: [: too
>> many arguments ]
> 
> This looks like a bug in the resource agent, probably due to some
> unexpected configuration value. Double-check your resource
> configuration for what values the various parameters can have. (Or it
> may just be a side effect of the interval issue above, so try fixing
> that first.)

Yes, there are at least two issues in ocf:heartbeat:VirtualDomain:

1/ dealing with user input derived value, in an unchecked manner, while
   such value can be an empty string or may contain spaces
   (for the latter, see also I've raised back then:
   https://lists.clusterlabs.org/pipermail/users/2015-May/007629.html 
   https://lists.clusterlabs.org/pipermail/developers/2015-May/000620.html
   )

2/ agent doesn't try to figure out whether it tries to parse
   a reasonably familiar file, in this case, it means it can
   be grep'ing a file spanning up to terabytes of data

In your case, you mistakenly pointed the agent (via "config" parameter
as highlighted above) not to the expected configuration, but rather to
the disk image itself -- that's not how to talk to libvirt -- only such
a guest configuration XML shall point to where the disk image itself
is located instead.  See ocf_heartbeat_VirtualDomain(7) or the output
you get when invoking the agent with "meta-data" argument.
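
To illustrate, a corrected definition in the spirit of your original
command might look as follows (the XML path is merely an assumption
about where your guest definition lives):

  pcs resource create HA-work9-win10-kvm VirtualDomain \
    hypervisor="qemu:///system" \
    config="/etc/libvirt/qemu/HA-work9-win10.xml" \
    migration_transport=ssh --disable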

Such a configuration issue could be indicated reliably by passing
"validate-all" as an action for the configured set of agent parameters,
if 1/ and/or 2/ did not exist in the agent's implementation.
Please, file issues against VirtualDomain agent to that effect at
https://github.com/ClusterLabs/fence-agents/issues
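
For completeness, the agent can be exercised with that action directly,
outside pacemaker, roughly like this (parameters are passed as
environment variables with the usual OCF_RESKEY_ prefix; paths are
illustrative):

  OCF_ROOT=/usr/lib/ocf \
  OCF_RESKEY_hypervisor="qemu:///system" \
  OCF_RESKEY_config="/etc/libvirt/qemu/HA-work9-win10.xml" \
  /usr/lib/ocf/resource.d/heartbeat/VirtualDomain validate-all; echo $?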

In addition, there may be some other configuration discrepancies as
pointed out by Ken.  Let us know if any issues persist once all these
are resolved.

-- 
Jan (Poki)


pgpE18eLaF8PX.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] Inconclusive recap for bonding (balance-rr) vs. HA (Was: why is node fenced ?)

2019-05-30 Thread Jan Pokorný
On 20/05/19 14:35 +0200, Jan Pokorný wrote:
> On 20/05/19 08:28 +0200, Ulrich Windl wrote:
>>> One network interface is gone for a short period. But it's in a
>>> bonding device (round-robin), so the connection shouldn't be lost.
>>> Both nodes are connected directly, there is no switch in between.
>> 
>> I think you misunderstood: a round-robin bonding device is not
>> fault-safe IMHO, but it depends a lot on your cabling details. Also
>> you did not show the logs on the other nodes.
> 
> That was sort of my point.  I think that in this case, the
> fault tolerance together with TCP's "best effort" makes the case
> effectively fault-recoverable (except for some pathological scenarios
> perhaps) -- so the whole "value proposition" for that mode can
> make false impressions even for when it's not the case ... like
> with corosync (since it intentionally opts for unreliable
> transport for performance/scalability).
> 
> (saying that as someone who has just about a single hands-on
> experiment with bonding behind...)

Ok, I've tried to read up on the subject a bit -- still no more
hands-on so feel free to correct or amend my conclusions below.
This discusses a Linux setup.

First of all, I think that claiming a further unspecified balance-rr
bonding mode to be a SPOF-solving solution is a myth -- and relying on
that unconditionally is a way to fail hard.

It needs in-depth considerations, I think:

1. some bonding modes (incl. mentioned balance-rr and active-backup)
   vitally depend on ability to actually detect a link failure

2. configuration of the bonding therefore needs to specify what's the
   most optimal way to detect such failures for the given selection of
   network adapters (it seems the configuration for a resulting bond
   instance is shared across the enslaved devices -- it would then
   mean that preferably the same models shall be used, since this
   optimality is then shared inherently), that is, either

   - miimon, or
   - arp_interval and arp_ip_target

   parameters need to be specified to the kernel bonding module
   (a quick way to inspect the values in effect is sketched after
   this list)

   when this is not done, a presumed non-SPOF interconnect still
   remains a SPOF (!)

3. then, it is being discussed that there's hardly a notion of
   real-time detection of the link failure -- since all such
   indications are basically being polled for, and moreover,
   drivers for particular adapters can add to the propagation
   delay, meaning the detection happens on the order of hundreds+
   milliseconds after the fact -- which makes me think that
   such a faulty link behaves essentially as a blackhole for such
   a period of time

4. finally, to get back to what I meant by the diametral differences
   between casual TCP vs. corosync (considering UDP unicast) traffic,
   and which may play a role here as well: the mentioned TCP
   "best effort" with built-in confirmations will not normally give
   up within tens of seconds or more (please correct me),
   but in the corosync case, with the default "token" parameter of 1 second,
   multiplied by the retransmit attempts (4 by default, see
   token_retransmits_before_loss_const), we operate within the order
   of lower seconds (again, please correct me)

   therefore, specifying the above parameters for bonding in
   an excessive manner compared to the corosync configuration (like
   miimon=1) could under further circumstances mean the
   same SPOF effect, e.g. when

   - packets_per_slave parameter is specified in a way that it will
     contain all those possibly repeated attempts in the
     corosync exchange (the selected link may, out of bad luck,
     be the faulty one while the failure hasn't been detected
     yet)

   - (unsure if it can happen) when a logical message corosync
     exchanges doesn't fit a single UDP datagram (low MTU?), and
     packets_per_slave is 1 (default), the complete message is
     never successfully transmitted, since its part will always
     be carried over the faulty link (again, while its failure
     hasn't been detected yet), IIUIC
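
As referenced from 2. above, the parameters actually in effect (and
the per-slave link states the detection relies on) can be inspected
like this (bond0 being just a placeholder name):

  # summary incl. detected link failures per enslaved interface
  cat /proc/net/bonding/bond0

  # individual knobs via sysfs (names match the bonding module parameters)
  cat /sys/class/net/bond0/bonding/mode
  cat /sys/class/net/bond0/bonding/miimon
  cat /sys/class/net/bond0/bonding/arp_interval
  cat /sys/class/net/bond0/bonding/arp_ip_target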


Looks quite tricky, overall.  Myself, I'd likely opt to live with
an admitted SPOF rather than with something that's possibly a heisen-SPOF
(your mileage may vary, don't take it as FUD, just a reminder
that discretion is always needed in the HA world).

I didn't investigate more, but if there are any bonding modes with
actual redundancy, it could be a more reliable SPOF-in-interconnect
avoidance (but then, you also need more switches...).  I don't know
enough to comment on RRP (allegedly discouraged[1]) or kronosnet
for such a task.

Would be glad if someone more knowledgeable could chime in to share
their insights and show where the limits of bonding are, especially
in HA settings.


Some references:

https://github.com/torvalds/linux/blob/master/Documentation/networking/bonding.txt
https://wiki.linuxfoundation.org/networking/bonding
https://wiki.debian.org/Bonding

[ClusterLabs] Pacemaker detecting existing processes question (Was: indirectly related - pacemaker service)

2019-05-30 Thread Jan Pokorný
[forwarding to respective upstream list, this has little to do with
systemd, I suggest following up only there, detaching from systemd ML]

On 29/05/19 17:23 +0100, lejeczek wrote:
> something I was hoping one expert could shed bit more light onto - I
> have a pacemaker cluster composed of three nodes. One one always has a
> problem with pacemaker - it's tools would say thing like:
> 
> $ crm_mon --one-shot
> Connection to cluster failed: Transport endpoint is not connected
> $ pcs status --all
> Error: cluster is not currently running on this node
> 
> but systemd reports relevant demons as up and running with on tiny
> exceptions! On "working" nodes it's:
> 
> $ systemctl status -l pacemaker
> ● pacemaker.service - Pacemaker High Availability Cluster Manager
>    Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; disabled;
> vendor preset: disabled)
>    Active: active (running) since Fri 2019-05-10 15:39:40 BST; 2 weeks 5
> days ago
>  Docs: man:pacemakerd
>   
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html
> 
>  Main PID: 28664 (pacemakerd)
>    CGroup: /system.slice/pacemaker.service
>    ├─  28664 /usr/sbin/pacemakerd -f
>    ├─  28670 /usr/libexec/pacemaker/cib
>    ├─  28671 /usr/libexec/pacemaker/stonithd
>    ├─  28672 /usr/libexec/pacemaker/lrmd
>    ├─  28673 /usr/libexec/pacemaker/attrd
>    ├─  28674 /usr/libexec/pacemaker/pengine
>    ├─  28676 /usr/libexec/pacemaker/crmd
>    ├─1503698 /bin/sh /usr/lib/ocf/resource.d/heartbeat/LVM monitor
>    ├─1503717 /bin/sh /usr/lib/ocf/resource.d/heartbeat/LVM monitor
>    ├─1503718 vgs -o tags --noheadings equalLogic-2.2
>    └─1503719 tr -d  
> 
> but on that one single failing node:
> 
> $ systemctl status -l pacemaker.service 
> ● pacemaker.service - Pacemaker High Availability Cluster Manager
>    Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled;
> vendor preset: disabled)
>    Active: active (running) since Wed 2019-05-29 17:08:40 BST; 2min 19s ago
>  Docs: man:pacemakerd
>   
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html
> 
>  Main PID: 48729 (pacemakerd)
>     Tasks: 1
>    Memory: 3.3M
>    CGroup: /system.slice/pacemaker.service
>    └─48729 /usr/sbin/pacemakerd -f
>  
> May 29 17:08:41 rider.private pacemakerd[48729]:   notice: Tracking
> existing cib process (pid=39234)
> May 29 17:08:41 rider.private pacemakerd[48729]:   notice: Tracking
> existing stonithd process (pid=39235)
> May 29 17:08:41 rider.private pacemakerd[48729]:   notice: Tracking
> existing lrmd process (pid=39236)
> May 29 17:08:41 rider.private pacemakerd[48729]:   notice: Tracking
> existing attrd process (pid=39238)
> May 29 17:08:41 rider.private pacemakerd[48729]:   notice: Tracking
> existing pengine process (pid=39240)
> May 29 17:08:41 rider.private pacemakerd[48729]:   notice: Tracking
> existing crmd process (pid=39241)
> May 29 17:08:41 rider.private pacemakerd[48729]:   notice: Quorum acquired
> 
> You can clearly see the difference, right? Systems are virtually
> identical, same Dell's server model, same Centos 7.6 and packages from
> same default repos.
> 
> Does that difference between systemds status for pacemaker signify anything?

As those notices say, you attempted to start the pacemaker service in an
unexpected situation where all the subdaemons were already running for some
hard-to-guess reason (a manually invoked pacemakerd crashing and leaving
these orphans behind could be one such reason; I presume you haven't run
the pacemaker daemons on your own).  Another thing: you mention a baremetal
deployment, but non-well-isolated containers with such processes
running within could cause a similar problem.

Do you use CentOS provided packages or did you build from source?

In any case, let's continue at ClusterLabs ML, systemd folks cannot
help here, systemctl status just summarizes what can be derived
from the log messages (that also happen to be amongst the more recent
logged stuff as of that time, hence listed as well above) already.

-- 
Jan (Poki)


pgpOKp7Uu1v51.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] info: mcp_cpg_deliver: Ignoring process list sent by peer for local node

2019-05-30 Thread Jan Pokorný
On 30/05/19 11:01 +0100, lejeczek wrote:
> On 29/05/2019 21:04, Ken Gaillot wrote:
>> On Wed, 2019-05-29 at 17:28 +0100, lejeczek wrote:
>>> and:
>>> $ systemctl status -l pacemaker.service 
>>> ● pacemaker.service - Pacemaker High Availability Cluster Manager
>>>Loaded: loaded (/usr/lib/systemd/system/pacemaker.service;
>>> disabled; vendor preset: disabled)
>>>Active: active (running) since Wed 2019-05-29 17:21:45 BST; 7s ago
>>>  Docs: man:pacemakerd
>>>
>>> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html
>>>  Main PID: 51617 (pacemakerd)
>>> Tasks: 1
>>>Memory: 3.3M
>>>CGroup: /system.slice/pacemaker.service
>>>└─51617 /usr/sbin/pacemakerd -f
>>> 
>>> May 29 17:21:45 rider.private pacemakerd[51617]:   notice: Tracking 
>>> existing pengine process (pid=51528)
>>> May 29 17:21:45 rider.private pacemakerd[51617]:   notice: Tracking 
>>> existing lrmd process (pid=51542)
>>> May 29 17:21:45 rider.private pacemakerd[51617]:   notice: Tracking 
>>> existing stonithd process (pid=51558)
>>> May 29 17:21:45 rider.private pacemakerd[51617]:   notice: Tracking 
>>> existing attrd process (pid=51559)
>>> May 29 17:21:45 rider.private pacemakerd[51617]:   notice: Tracking 
>>> existing cib process (pid=51560)
>>> May 29 17:21:45 rider.private pacemakerd[51617]:   notice: Tracking 
>>> existing crmd process (pid=51566)
>>> May 29 17:21:45 rider.private pacemakerd[51617]:   notice: Quorum acquired
>>> May 29 17:21:45 rider.private pacemakerd[51617]:   notice: Node 
>>> whale.private state is now member
>>> May 29 17:21:45 rider.private pacemakerd[51617]:   notice: Node 
>>> swir.private state is now member
>>> May 29 17:21:45 rider.private pacemakerd[51617]:   notice: Node 
>>> rider.private state is now member

I grok that you've, in parallel, started asking about this part also
on the systemd ML, and I redirected that thread here (but my message
still didn't hit here for being stuck in the moderation queue, since
I use different addresses on these two lists -- you can still respond
right away to that as readily available via said systemd list, just
make sure you only target users@cl.o, it was really unrelated to
systemd).

In a nutshell, we want to know how you got into such a situation that
entirely detached subdaemons would be floating around in your environment,
prior to starting pacemaker.service (or after stopping it).
That's rather unexpected.
If you can dig up traces of any pacemaker-associated processes
(search pattern: pacemaker*|attrd|cib|crmd|lrmd|stonithd|pengine)
dying (plus the messages logged immediately before that, if any),
it could help us diagnose your situation.
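
A rough way to collect that could be (just a sketch, adjust the
pattern as needed):

  # list pacemaker-associated processes with their parents and start
  # times; subdaemons reparented to PID 1 would confirm they got orphaned
  ps -eo pid,ppid,lstart,args \
    | grep -E 'pacemaker|attrd|cib|crmd|lrmd|stonithd|pengine' \
    | grep -v grep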

-- 
Jan (Poki)


pgp1OMjyyHDGq.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Antw: why is node fenced ?

2019-05-20 Thread Jan Pokorný
On 20/05/19 08:28 +0200, Ulrich Windl wrote:
>> One network interface is gone for a short period. But it's in a
>> bonding device (round-robin), so the connection shouldn't be lost.
>> Both nodes are connected directly, there is no switch in between.
> 
> I think you misunderstood: a round-robin bonding device is not
> fault-safe IMHO, but it depends a lot on your cabling details. Also
> you did not show the logs on the other nodes.

That was sort of my point.  I think that in this case, the
fault tolerance together with TCP's "best effort" makes the case
effectively fault-recoverable (except for some pathological scenarios
perhaps) -- so the whole "value proposition" for that mode can
make false impressions even for when it's not the case ... like
with corosync (since it intentionally opts for unreliable
transport for performance/scalability).

(saying that as someone who has just about a single hands-on
experiment with bonding behind...)

-- 
Jan (Poki)


pgpHjgc9VtFaL.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] why is node fenced ?

2019-05-17 Thread Jan Pokorný
On 16/05/19 17:10 +0200, Lentes, Bernd wrote:
> my HA-Cluster with two nodes fenced one on 14th of may.
> ha-idg-1 has been the DC, ha-idg-2 was fenced.
> It happened around 11:30 am.
> The log from the fenced one isn't really informative:
> 
> [...]
> 
> Node restarts at 11:44 am.
> The DC is more informative:
> 
> =
> 2019-05-14T11:24:05.105739+02:00 ha-idg-1 PackageKit: daemon quit
> 2019-05-14T11:24:05.106284+02:00 ha-idg-1 packagekitd[11617]: 
> (packagekitd:11617): GLib-CRITICAL **: Source ID 15 was not found when 
> attempting to remove it
> 2019-05-14T11:27:23.276813+02:00 ha-idg-1 liblogging-stdlog: -- MARK --
> 2019-05-14T11:30:01.248803+02:00 ha-idg-1 cron[24140]: 
> pam_unix(crond:session): session opened for user root by (uid=0)
> 2019-05-14T11:30:01.253150+02:00 ha-idg-1 systemd[1]: Started Session 17988 
> of user root.
> 2019-05-14T11:30:01.301674+02:00 ha-idg-1 CRON[24140]: 
> pam_unix(crond:session): session closed for user root
> 2019-05-14T11:30:03.710784+02:00 ha-idg-1 kernel: [1015426.947016] tg3 
> :02:00.3 eth3: Link is down
> 2019-05-14T11:30:03.792500+02:00 ha-idg-1 kernel: [1015427.024779] bond1: 
> link status definitely down for interface eth3, disabling it
> 2019-05-14T11:30:04.849892+02:00 ha-idg-1 hp-ams[2559]: CRITICAL: Network 
> Adapter Link Down (Slot 0, Port 4)
> 2019-05-14T11:30:05.261968+02:00 ha-idg-1 kernel: [1015428.498127] tg3 
> :02:00.3 eth3: Link is up at 100 Mbps, full duplex
> 2019-05-14T11:30:05.261985+02:00 ha-idg-1 kernel: [1015428.498138] tg3 
> :02:00.3 eth3: Flow control is on for TX and on for RX
> 2019-05-14T11:30:05.261986+02:00 ha-idg-1 kernel: [1015428.498143] tg3 
> :02:00.3 eth3: EEE is disabled
> 2019-05-14T11:30:05.352500+02:00 ha-idg-1 kernel: [1015428.584725] bond1: 
> link status definitely up for interface eth3, 100 Mbps full duplex
> 2019-05-14T11:30:05.983387+02:00 ha-idg-1 hp-ams[2559]: NOTICE: Network 
> Adapter Link Down (Slot 0, Port 4) has been repaired
> 2019-05-14T11:30:10.520149+02:00 ha-idg-1 corosync[6957]:   [TOTEM ] A 
> processor failed, forming new configuration.
> 2019-05-14T11:30:16.524341+02:00 ha-idg-1 corosync[6957]:   [TOTEM ] A new 
> membership (192.168.100.10:1120) was formed. Members left: 1084777492
> 2019-05-14T11:30:16.524799+02:00 ha-idg-1 corosync[6957]:   [TOTEM ] Failed 
> to receive the leave message. failed: 1084777492
> 2019-05-14T11:30:16.525199+02:00 ha-idg-1 lvm[12430]: confchg callback. 0 
> joined, 1 left, 1 members
> 2019-05-14T11:30:16.525706+02:00 ha-idg-1 attrd[6967]:   notice: Node 
> ha-idg-2 state is now lost
> 2019-05-14T11:30:16.526143+02:00 ha-idg-1 cib[6964]:   notice: Node ha-idg-2 
> state is now lost
> 2019-05-14T11:30:16.526480+02:00 ha-idg-1 attrd[6967]:   notice: Removing all 
> ha-idg-2 attributes for peer loss
> 2019-05-14T11:30:16.526742+02:00 ha-idg-1 cib[6964]:   notice: Purged 1 peer 
> with id=1084777492 and/or uname=ha-idg-2 from the membership cache
> 2019-05-14T11:30:16.527283+02:00 ha-idg-1 stonith-ng[6965]:   notice: Node 
> ha-idg-2 state is now lost
> 2019-05-14T11:30:16.527618+02:00 ha-idg-1 attrd[6967]:   notice: Purged 1 
> peer with id=1084777492 and/or uname=ha-idg-2 from the membership cache
> 2019-05-14T11:30:16.527884+02:00 ha-idg-1 stonith-ng[6965]:   notice: Purged 
> 1 peer with id=1084777492 and/or uname=ha-idg-2 from the membership cache
> 2019-05-14T11:30:16.528156+02:00 ha-idg-1 corosync[6957]:   [QUORUM] 
> Members[1]: 1084777482
> 2019-05-14T11:30:16.528435+02:00 ha-idg-1 corosync[6957]:   [MAIN  ] 
> Completed service synchronization, ready to provide service.
> 2019-05-14T11:30:16.548481+02:00 ha-idg-1 kernel: [1015439.782587] dlm: 
> closing connection to node 1084777492
> 2019-05-14T11:30:16.555995+02:00 ha-idg-1 dlm_controld[12279]: 1015492 fence 
> request 1084777492 pid 24568 nodedown time 1557826216 fence_all dlm_stonith
> 2019-05-14T11:30:16.626285+02:00 ha-idg-1 crmd[6969]:  warning: 
> Stonith/shutdown of node ha-idg-2 was not expected
> 2019-05-14T11:30:16.626534+02:00 ha-idg-1 dlm_stonith: stonith_api_time: 
> Found 1 entries for 1084777492/(null): 0 in progress, 1 completed
> 2019-05-14T11:30:16.626731+02:00 ha-idg-1 dlm_stonith: stonith_api_time: Node 
> 1084777492/(null) last kicked at: 1556884018
> 2019-05-14T11:30:16.626875+02:00 ha-idg-1 stonith-ng[6965]:   notice: Client 
> stonith-api.24568.6a9fa406 wants to fence (reboot) '1084777492' with device 
> '(any)'
> 2019-05-14T11:30:16.627026+02:00 ha-idg-1 crmd[6969]:   notice: State 
> transition S_IDLE -> S_POLICY_ENGINE
> 2019-05-14T11:30:16.627165+02:00 ha-idg-1 crmd[6969]:   notice: Node ha-idg-2 
> state is now lost
> 2019-05-14T11:30:16.627302+02:00 ha-idg-1 crmd[6969]:  warning: 
> Stonith/shutdown of node ha-idg-2 was not expected
> 2019-05-14T11:30:16.627439+02:00 ha-idg-1 stonith-ng[6965]:   notice: 
> Requesting peer fencing (reboot) of ha-idg-2
> 2019-05-14T11:30:16.627578+02:00 ha-idg-1 pacemakerd[6963]:   notice: Node 
> 

[ClusterLabs] Recurring troubles with the weak grip on processes (Was: Timeout stopping corosync-qdevice service)

2019-05-02 Thread Jan Pokorný
On 30/04/19 20:39 +0300, Andrei Borzenkov wrote:
> 30.04.2019 9:51, Jan Friesse пишет:
>> 
>>> Now, corosync-qdevice gets SIGTERM as "signal to terminate", but it
>>> installs SIGTERM handler that does not exit and only closes some socket.
>>> May be this should trigger termination of main loop, but somehow it does
>>> not.
>> 
>> Yep, this is exactly how qdevice daemon shutdown works. Signal just
>> closes socket (should be signal safe) and poll in main loop do its job
>> so main loop is terminated.
>> 
> 
> That is bug in corosync 2.4.4 which is still used in TW. stop is using
> pidof, I have two corosync-qdevice processes so corosync-qdevice never
> gets signal in the first place.
> 
> 
> ++ pidof corosync-qdevice
> + kill -TERM '1812 1811'

Needless to say, half of the cluster stack, especially the agents,
still makes decisions based on overly naive assumptions and an
unreliable grip on processes (and on singletons thereof, which may not
apply, as demonstrated above) per their name/PID, something that may
clash even totally accidentally (the typical default of the process ID
namespace serving just 2^15 slots leads to possibly quick wraparounds;
consider someone invoking a pacemaker daemon just to on-off fetch the
metadata provided in this way), with containers making the situation
just worse from the host perspective[1].
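
To tie it back to the quoted trap above, even the plain killing
illustrates the fragility; a couple of less fragile (though still
name-based, so all the caveats above apply) sketches:

  # broken with more than one matching PID (the quoted '1812 1811' case):
  kill -TERM "$(pidof corosync-qdevice)"

  # deliver the signal to each PID individually instead:
  pidof corosync-qdevice | xargs -r kill -TERM
  # or match on the exact process name:
  pkill -TERM --exact corosync-qdevice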

Luckily, we've fixed some of these troublemakers in pacemaker with
the recent security updates, and there are some interesting synergies
possible in the outlook, see "pidfd" from the newest developments
in Linux.

[1] e.g.
https://lists.clusterlabs.org/pipermail/developers/2017-July/001875.html

-- 
Jan (Poki)


pgpKnvWsl9gDO.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] Multiple processes appending to the same log file questions (Was: Pacemaker detail log directory permissions)

2019-04-30 Thread Jan Pokorný
[let's move this to developers@cl.o, please drop users on response
unless you are only subscribed there, I tend to only respond to the
lists]

On 30/04/19 13:55 +0200, Jan Pokorný wrote:
> On 30/04/19 07:55 +0200, Ulrich Windl wrote:
>>>>> Jan Pokorný  schrieb am 29.04.2019 um 17:22
>>>>> in Nachricht <20190429152200.ga19...@redhat.com>:
>>> On 29/04/19 14:58 +0200, Jan Pokorný wrote:
>>>> On 29/04/19 08:20 +0200, Ulrich Windl wrote:
>> I agree that multiple threads in one thread have no problem using
>> printf(), but (at least in the buffered case) if multiple processes
>> write to the same file, that type of locking doesn't help much IMHO.
> 
> Oops, you are right, I made a logical shortcut connecting flockfile(3)
> and flock(1), which was entirely unbacked.  You are correct it would
> matter only amongst the threads, not otherwise unsynchronized processes.
> Sorry about the noise :-/
> 
> Shamefully, this rather important (nobody wants garbled log messages)
> aspect is in no way documented in libqb's context (it does not do any
> explicit locking on its own), especially since the length of the logged
> messages can go above the default of 512 B (in the upcoming libqb 2,
> IIUIC) ... and luckily, I was steering the direction to still stay
> modest and cap that on 4 kiB, even if for other reasons:
> 
> https://github.com/ClusterLabs/libqb/pull/292#issuecomment-361745575
> 
> which still might be within Linux + typical FSs (ext4) boundaries
> to guarantee atomicity of an append (or maybe not even that, it all
> seems a gray area of any guarantees provided by the underlying system,
> inputs from the experts welcome).  Anyway, ISTM that we should at the
> very least increase the buffer size for block buffering to the
> configured one (up to 4 kiB as mentioned) if BUFSIZ would be less,
> to prevent tainting this expected atomicity from the get-go.

This post seems to indicate that while the equivalent of BUFSIZ is 8 kiB
with glibc (confirmed on Fedora/x86-64), it might possibly be 1 kiB
only on BSDs (unless the underlying FS provides a hint otherwise?),
so the opt-in maxed-out message (of 4 kiB currently, but generic
guards are in order in case anybody decides to bump that even further)
might readily cause log corruption on some systems with multiple
processes appending to the same file:

https://github.com/the-tcpdump-group/libpcap/issues/792

Are there any BSD people especially who could advise here about what
"atomic append" on their system means, and which conditions need to be
reliably met?

-- 
Jan (Poki)


pgpldnZTbHDVX.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Pacemaker detail log directory permissions

2019-04-30 Thread Jan Pokorný
On 30/04/19 07:55 +0200, Ulrich Windl wrote:
>>>> Jan Pokorný  schrieb am 29.04.2019 um 17:22
>>>> in Nachricht <20190429152200.ga19...@redhat.com>:
>> On 29/04/19 14:58 +0200, Jan Pokorný wrote:
>>> On 29/04/19 08:20 +0200, Ulrich Windl wrote:
>>>>>>> Jan Pokorný  schrieb am 25.04.2019 um 18:49
>>>>>>> in Nachricht <20190425164946.gf23...@redhat.com>:
>>>>> I think the prime and foremost use case is that half of the actual
>>>>> pacemaker daemons run as hacluster:haclient themselves, and it's
>>>>> preferred for them to be not completely muted about what they do,
>>>>> correct? :‑)
>>>> 
>>>> I interesting aspect: So the commands write directly to one file
>>>> concurrently?  How's synchronization done. It's especially
>>>> interesting with multiple threads.
>>> 
>>> You don't believe your libc (fprintf in particular) to abide the
>>> POSIX constraints, do you? :-)
>>> 
>>>>   All functions that reference (FILE *) objects, except those with
>>>>   names ending in _unlocked, shall behave as if they use flockfile()
>>>>   and funlockfile() internally to obtain ownership of these (FILE *)
>>>>   objects.
>>> 
>>> 
> [https://pubs.opengroup.org/onlinepubs/9699919799/functions/flockfile.html]
>>> 
>>>> Wouldn't it be much easier to use some socket-bases syslog like
>>>> service that serializes all the messages to one file. Then the
>>>> clients just need write access to the socket, not to the files.
> 
> I agree that multiple threads in one thread have no problem using
> printf(), but (at least in the buffered case) if multiple processes
> write to the same file, that type of locking doesn't help much IMHO.

Oops, you are right, I made a logical shortcut connecting flockfile(3)
and flock(1), which was entirely unbacked.  You are correct it would
matter only amongst the threads, not otherwise unsynchronized processes.
Sorry about the noise :-/

Shamefully, this rather important (nobody wants garbled log messages)
aspect is in no way documented in libqb's context (it does not do any
explicit locking on its own), especially since the length of the logged
messages can go above the default of 512 B (in the upcoming libqb 2,
IIUIC) ... and luckily, I was steering the direction to still stay
modest and cap that on 4 kiB, even if for other reasons:

https://github.com/ClusterLabs/libqb/pull/292#issuecomment-361745575

which still might be within Linux + typical FSs (ext4) boundaries
to guarantee atomicity of an append (or maybe not even that, it all
seems a gray area of any guarantees provided by the underlying system,
inputs from the experts welcome).  Anyway, ISTM that we should at the
very least increase the buffer size for block buffering to the
configured one (up to 4 kiB as mentioned) if BUFSIZ would be less,
to prevent tainting this expected atomicity from the get-go.

Anyway, correctness of the message durability is still dependent
on the syslog implementation to behave correctly if syslog logging
sink is enabled (holds for both corosync and pacemaker daemons),
so file-based logging can still be considered optional, and e.g.
for pacemaker, it should be possible to turn it off altogether in
the future as mentioned (already possible with corosync, IIRC).

Lowest-level I/O is a very complex and difficult area to reason about;
sadly, we didn't do our homework at the very beginning, so problems
are likely to snowball (incl. brain farts like the one I made above
... it might have been avoided if the constraints of libqb logging
usage were documented properly).

> (In clasical syslog there's _one_ process that receives the messages
> and writes them to files. I can only remember HP-UX kernel messages
> that were synchronized byte-wise, so if multiple cores emitted the
> same message concurrently you go some funny messages like
> "OpOpeeniingng"... ;_)

Truly nasty to not even have somewhat atomic blocks (see block-based
buffering, which is the default for fprintf et al. unless the
output stream equals stdout) that then get interleaved.

>> FWIW, have just realized that by virtue of this, we have actually
>> silently alleviated a hidden lock contention (against the realtime
>> corosync process! but allegedly, it is not too verbose if nothing
>> interesting happens unless set to be more verbose) since
>> Pacemaker-2.0.0:
>> 
>> [...]
>> 

Already disproved, thanks to Ulrich, sorry.

>> But we can do better for those who would prefer to serialize
>> multiple (per daemon) file outputs by hand (if ever at all, complete
>> serialization can be found in the syslog sink, perhaps) as briefly
>> touched on earlier in this thread.

Re: [ClusterLabs] Pacemaker detail log directory permissions

2019-04-29 Thread Jan Pokorný
On 29/04/19 14:58 +0200, Jan Pokorný wrote:
> On 29/04/19 08:20 +0200, Ulrich Windl wrote:
>>>>> Jan Pokorný  schrieb am 25.04.2019 um 18:49
>>>>> in Nachricht <20190425164946.gf23...@redhat.com>:
>>> I think the prime and foremost use case is that half of the actual
>>> pacemaker daemons run as hacluster:haclient themselves, and it's
>>> preferred for them to be not completely muted about what they do,
>>> correct? :‑)
>> 
>> I interesting aspect: So the commands write directly to one file
>> concurrently?  How's synchronization done. It's especially
>> interesting with multiple threads.
> 
> You don't believe your libc (fprintf in particular) to abide the
> POSIX constraints, do you? :-)
> 
>>   All functions that reference (FILE *) objects, except those with
>>   names ending in _unlocked, shall behave as if they use flockfile()
>>   and funlockfile() internally to obtain ownership of these (FILE *)
>>   objects.
> 
> [https://pubs.opengroup.org/onlinepubs/9699919799/functions/flockfile.html]
> 
>> Wouldn't it be much easier to use some socket-bases syslog like
>> service that serializes all the messages to one file. Then the
>> clients just need write access to the socket, not to the files.

FWIW, have just realized that by virtue of this, we have actually
silently alleviated a hidden lock contention (against the realtime
corosync process! but allegedly, it is not too verbose if nothing
interesting happens unless set to be more verbose) since
Pacemaker-2.0.0:

https://github.com/ClusterLabs/pacemaker/commit/b8075c86d35f3d37b0cbac86a8c90f1ac1091c33

Great!

But we can do better for those who would prefer to serialize
multiple (per daemon) file outputs by hand (if ever at all, complete
serialization can be found in the syslog sink, perhaps) as briefly
touched on earlier in this thread[1].

[1] https://lists.clusterlabs.org/pipermail/users/2019-April/025709.html

-- 
Jan (Poki)


pgpz1GguYMzqk.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Pacemaker detail log directory permissions

2019-04-29 Thread Jan Pokorný
On 29/04/19 08:20 +0200, Ulrich Windl wrote:
>>>> Jan Pokorný  schrieb am 25.04.2019 um 18:49
>>>> in Nachricht <20190425164946.gf23...@redhat.com>:
>> On 24/04/19 09:32 ‑0500, Ken Gaillot wrote:
>>> On Wed, 2019‑04‑24 at 16:08 +0200, wf...@niif.hu wrote:
>>>> Make install creates /var/log/pacemaker with mode 0770, owned by
>>>> hacluster:haclient.  However, if I create the directory as root:root
>>>> instead, pacemaker.log appears as hacluster:haclient all the
>>>> same.  What breaks in this setup besides log rotation (which can be
>>>> fixed by removing the su directive)?  Why is it a good idea to let
>>>> the haclient group write the logs?
>>> 
>>> Cluster administrators are added to the haclient group. It's a
>>> minor use case, but the group write permission allows such users
>>> to run commands that log to the detail log. An example would be
>>> running "crm_resource ‑‑force‑start" for a resource agent that
>>> writes debug information to the log.
>> 
>> I think the prime and foremost use case is that half of the actual
>> pacemaker daemons run as hacluster:haclient themselves, and it's
>> preferred for them to be not completely muted about what they do,
>> correct? :‑)
> 
> I interesting aspect: So the commands write directly to one file
> concurrently?  How's synchronization done. It's especially
> interesting with multiple threads.

You don't believe your libc (fprintf in particular) to abide the
POSIX constraints, do you? :-)

>   All functions that reference (FILE *) objects, except those with
>   names ending in _unlocked, shall behave as if they use flockfile()
>   and funlockfile() internally to obtain ownership of these (FILE *)
>   objects.

[https://pubs.opengroup.org/onlinepubs/9699919799/functions/flockfile.html]

> Wouldn't it be much easier to use some socket-bases syslog like
> service that serializes all the messages to one file. Then the
> clients just need write access to the socket, not to the files.

Two things:

- pacemaker _already_ uses the syslog interface (in daemon context,
  anyway; it can be disabled, but I think there should at the very
  least be a toggle to optionally disable file-based logging
  altogether)

- it doesn't make the serialization problem any better
  (yes, it could perform some sort of buffered time-stamp based
  reordering on-the-fly, but that wouldn't make the situation any
  better than an explicit post-processing on the files)

>> Indeed, users can configure whatever log routing they desire
>> (I was actually toying with an idea to make it a lot more flexible,
>> log‑per‑type‑of‑daemon and perhaps even distinguished by PID,
>> configurable log formats since currently it's arguably a heavy
>> overkill to keep the hostname stated repeatedly over and over
>> without actually bothering to recheck it from time to time, etc.).
> 
> Reminds me of a project I had started more than 10 years ago: I
> wanted to start with a log mechanism, but the idea was rejected.
> At some later time out local mail system was brought to a halt
> by >30 mail messages being created within a few minutes. Most
> GUI clients behave very badly once you have that many messages
> But that's a different story in the subject...

Agree in a sense that explicit message spooling is not a substitute
for proper logging (as opposed to true alerts etc., but you can
usually configure mechanisms so as to forward very _select_ subset
of log messages like that in the syslog daemon of your choice).

>> Also note, relying on almighty root privileges (like with the
>> pristine deployment) is a silent misconception that cannot be taken
>> for fully granted, so again arguably, even the root daemons should
>> take a haclient group's coat on top of their own just in case [*].
> 
> See above.
> 
>> 
>>> If ACLs are not in use, such users already have full read/write
>>> access to the CIB, so being able to read and write the log is not an
>>> additional concern.
>>> 
>>> With ACLs, I could see wanting to change the permissions, and that idea
>>> has come up already. One approach might be to add a PCMK_log_mode
>>> option that would default to 0660, and users could make it more strict
>>> if desired.
> 
> With 0660 you actually mean "ug+w" (why care about read
> permissions)? Octal absolute modes are obsolet for at least
> 20 years...

Not sure about this opinion :-)
Luckily, chmod(1) still documents both.

>> It looks reasonable to prevent read‑backs by anyone but root, that
>> could be applied without any further toggles, assuming the pacemaker
>> code won't flip once purposefully allowed read bits for group back
>> automatically and unconditionally.

Re: [ClusterLabs] [Announce] libqb 1.0.5 release

2019-04-26 Thread Jan Pokorný
On 26/04/19 07:58 +0100, Christine Caulfield wrote:
> We are pleased to announce the release of libqb 1.0.5
> 
> Source code is available at:
> https://github.com/ClusterLabs/libqb/releases/download/v1.0.5/libqb-1.0.5.tar.xz
> 
> Please used the signed .tar.gz or .tar.xz files with the version number
> in rather than the github-generated "Source Code" ones.
> 
> This release is an update to fix a regression in 1.0.4,

one unrevealed "detail" here is that libqb v1.0.4 makes pacemaker
unusable, so cluster stack users are hence strongly adviced to keep
their distance from that botched version.

The announced v1.0.5 appears to work from a cursory test, but...

> huge thanks to wferi for all the help with this

...that was a Ferenc's (and Valentin's) primary objective here,
so thanks as well for the scrutiny and effort carried out!

-- 
Jan (Poki)


pgpddYyr1bGls.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Pacemaker detail log directory permissions

2019-04-25 Thread Jan Pokorný
On 24/04/19 09:32 -0500, Ken Gaillot wrote:
> On Wed, 2019-04-24 at 16:08 +0200, wf...@niif.hu wrote:
>> Make install creates /var/log/pacemaker with mode 0770, owned by
>> hacluster:haclient.  However, if I create the directory as root:root
>> instead, pacemaker.log appears as hacluster:haclient all the
>> same.  What breaks in this setup besides log rotation (which can be
>> fixed by removing the su directive)?  Why is it a good idea to let
>> the haclient group write the logs?
> 
> Cluster administrators are added to the haclient group. It's a minor
> use case, but the group write permission allows such users to run
> commands that log to the detail log. An example would be running
> "crm_resource --force-start" for a resource agent that writes debug
> information to the log.

I think the prime and foremost use case is that half of the actual
pacemaker daemons run as hacluster:haclient themselves, and it's
preferred for them to be not completely muted about what they do,
correct? :-)

Indeed, users can configure whatever log routing they desire
(I was actually toying with an idea to make it a lot more flexible,
log-per-type-of-daemon and perhaps even distinguished by PID,
configurable log formats since currently it's arguably a heavy
overkill to keep the hostname stated repeatedly over and over
without actually bothering to recheck it from time to time, etc.).

Also note, relying on almighty root privileges (like with the pristine
deployment) is a silent misconception that cannot be taken for fully
granted, so again arguably, even the root daemons should take
a haclient group's coat on top of their own just in case [*].

> If ACLs are not in use, such users already have full read/write
> access to the CIB, so being able to read and write the log is not an
> additional concern.
> 
> With ACLs, I could see wanting to change the permissions, and that idea
> has come up already. One approach might be to add a PCMK_log_mode
> option that would default to 0660, and users could make it more strict
> if desired.

It looks reasonable to prevent read-backs by anyone but root, that
could be applied without any further toggles, assuming the pacemaker
code won't flip once purposefully allowed read bits for group back
automatically and unconditionally.
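
Until something like that PCMK_log_mode knob exists, an approximation
can be applied by hand (a sketch only; path per the default layout,
and note pacemaker may recreate the file with its usual mode):

  # keep group write for the hacluster:haclient daemons/tools,
  # drop read-back for the group and everything for others
  chmod u=rw,g=w,o= /var/log/pacemaker/pacemaker.log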


[*] for instance when SELinux hits hard (which is currently not the
case for Fedora/EL family), even though the executor(s) would need
to be exempted if process inheritance taints the tree once forever:
https://danwalsh.livejournal.com/69478.html

-- 
Jan (Poki)


pgp3dAgFvEUfh.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Coming in 2.0.2: check whether a date-based rule is expired

2019-04-24 Thread Jan Pokorný
On 24/04/19 08:22 +0200, Ulrich Windl wrote:
> I know that April 1st is gone, but maybe should be have "user-friendly
> durations" also? Maybe like:
> "a deep breath", meaning "30 seconds"
> "a guru meditation", meaning "5 minutes"
> "a coffee break", meaning "15 minutes"
> "a lunch break", meaning "half an hour"
> ...
> 
> Typical maintenance tasks can be finished within guro meditation or coffee
> breaks; only few require lunch breaks or more... ;-)

Hehe, and be aware that pacemaker already started the tradition of
pioneering some innovative time expressions, so don't say it too
loudly:

$ iso8601 -d epoch
> Date: 1970-01-01 00:00:00Z

Mostly as a follow-up for the above joke, please don't _ever_ refer
to "epoch" as a date/time expression, it may go away any time without
notice (or on relatively short notice, for being undue here).

>>>> Jan Pokorný  schrieb am 24.04.2019 um 02:01
>>>> in Nachricht <20190424000144.ga23...@redhat.com>:
>> On 16/04/19 12:38 ‑0500, Ken Gaillot wrote:
>>> With just ‑r, it will tell you whether the specified rule from the
>>> configuration is currently in effect. If you give ‑d, it will check as
>>> of that date and time (ISO 8601 format).
>> 
>> Uh, the date‑time data representations (encodings of the singular
>> information) shall be used with some sort of considerations towards the
>> use cases:
>> 
>> 1. _data‑exchange friendly_, point‑of‑use‑context‑agnostic
>>(yet timezone‑respecting if need be) representation
>>‑ this is something you want to have serialized in data
>>  to outlive the code (extrapolated: for exchange between
>>  various revisions of the same code)
>>‑ ISO 8601 fills the bill
>> 
>> 2. _user‑friendly_, point‑of‑use‑context‑respecting representation
>>‑ this is something you want user to work with, be it the
>>  management tools or helpers like crm_rule
>>‑ ISO 8601 _barely_ fills the bill, fails in basic attampts of
>>  integration with surrounding system:
>> 
>> $ CIB_file=cts/scheduler/date‑1.xml ./tools/crm_rule ‑c \
>> ‑r rule.auto‑2 ‑d "next Monday 12:00"
>>> (crm_abort) error: crm_time_check: Triggered assert at 
>>> iso8601.c:1116: dt‑>days > 0
>>> (crm_abort) error: parse_date: Triggered assert at iso8601.c:757: 
>>> crm_time_check(dt)
>> 
>>  no good, let's try old good coreutils' `date` as the "chewer"
>> 
>> $ CIB_file=cts/scheduler/date‑1.xml ./tools/crm_rule ‑c \
>> ‑r rule.auto‑2 ‑d "$(date ‑d "next Monday 12:00")"
>>> (crm_abort) error: crm_time_check: Triggered assert at 
>>> iso8601.c:1116: dt‑>days > 0
>>> (crm_abort) error: parse_date: Triggered assert at iso8601.c:757: 
>>> crm_time_check(dt)
> 
>> crm_time_check(dt)
>> 
>>  still no good, so after few more iterations:
>> 
>> $ CIB_file=cts/scheduler/date‑1.xml ./tools/crm_rule ‑c \
>> ‑r rule.auto‑2 ‑d "$(date ‑Iminutes ‑d "next Monday 12:00")"
>>> Rule rule.auto‑2 is still in effect
>> 
>>  that could be much more intuitive + locale‑driven (assuming users
>>  have the locales set per what's natural to them/what they are
>>  used to), couldn't it?
>> 
>> I mean, at least allowing `‑d` switch in `crm_rule` to support
>> LANG‑native date/time specification makes a lot of sense to me:
>> https://github.com/ClusterLabs/pacemaker/pull/1756 

FWIW. I now think we should leverage gnulib's parse-datetime module
directly:
https://github.com/ClusterLabs/pacemaker/pull/1756#issuecomment-486108864

-- 
Jan (Poki)


pgpPlgorxPO44.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Coming in 2.0.2: check whether a date-based rule is expired

2019-04-23 Thread Jan Pokorný
On 16/04/19 12:38 -0500, Ken Gaillot wrote:
> We are adding a "crm_rule" command

Wouldn't `pcmk-rule` be a more sensible command name -- I mean, why not
benefit from not suffering the historical burden in this case, given
that `crm` in the broadest "almost anything that can be associated with
our cluster SW" sense is an anachronism, whereas the term metamorphosed
into the invoking name of the original management shell project
(heck, we don't have `crmd` as a daemon name anymore)?

> that has the ability to check whether a particular date-based rule is
> currently in effect.
> 
> The motivation is a perennial user complaint: expired constraints
> remain in the configuration, which can be confusing.
> 
> [...]
> 
> The new command gives users (and high-level tools) a way to determine
> whether a rule is in effect, so they can remove it themselves, whether
> manually or in an automated way such as a cron.
> 
> You can use it like:
> 
> crm_rule -r  [-d ] [-X ]
> 
> With just -r, it will tell you whether the specified rule from the
> configuration is currently in effect. If you give -d, it will check as
> of that date and time (ISO 8601 format).

Uh, the date-time data representations (encodings of the same single
piece of information) shall be chosen with some consideration of the
use cases:

1. _data-exchange friendly_, point-of-use-context-agnostic
   (yet timezone-respecting if need be) representation
   - this is something you want to have serialized in data
     to outlive the code (extrapolated: for exchange between
     various revisions of the same code)
   - ISO 8601 fills the bill

2. _user-friendly_, point-of-use-context-respecting representation
   - this is something you want the user to work with, be it the
     management tools or helpers like crm_rule
   - ISO 8601 _barely_ fills the bill, fails in basic attempts at
     integration with the surrounding system:

$ CIB_file=cts/scheduler/date-1.xml ./tools/crm_rule -c \
-r rule.auto-2 -d "next Monday 12:00"
> (crm_abort)   error: crm_time_check: Triggered assert at iso8601.c:1116 : 
> dt->days > 0
> (crm_abort)   error: parse_date: Triggered assert at iso8601.c:757 : 
> crm_time_check(dt)

 no good; let's try good old coreutils' `date` as the "chewer"

$ CIB_file=cts/scheduler/date-1.xml ./tools/crm_rule -c \
-r rule.auto-2 -d "$(date -d "next Monday 12:00")"
> (crm_abort)   error: crm_time_check: Triggered assert at iso8601.c:1116 : 
> dt->days > 0
> (crm_abort)   error: parse_date: Triggered assert at iso8601.c:757 : 
> crm_time_check(dt)

 still no good, so after a few more iterations:

$ CIB_file=cts/scheduler/date-1.xml ./tools/crm_rule -c \
-r rule.auto-2 -d "$(date -Iminutes -d "next Monday 12:00")"
> Rule rule.auto-2 is still in effect

 that could be much more intuitive + locale-driven (assuming users
 have the locales set per what's natural to them/what they are
 used to), couldn't it?

I mean, at least allowing `-d` switch in `crm_rule` to support
LANG-native date/time specification makes a lot of sense to me:
https://github.com/ClusterLabs/pacemaker/pull/1756

Perhaps iso8601 (and more?) would deserve the same, even though it
smells of dragging some compatibility/interoperability concerns into the
game?  (At least `crm_rule` is brand new, and moreover marked
experimental, anyway; let's discuss this part on the developers ML
if need be -- one more thing to possibly put up for debate, actually:
this user interface sanitization could be performed merely in the
opaque management shell wrappings, but if nothing else, it amounts
to duplication of work and makes bare-bones use a bit of a PITA.)

-- 
Jan (Poki)


pgpmHD73aIt_q.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Pacemaker security issues discovered and patched

2019-04-17 Thread Jan Pokorný
On 17/04/19 12:09 -0500, Ken Gaillot wrote:
> Without the patches, a mitigation is to prevent local user access to
> cluster nodes except for cluster administrators (which is the
> recommended and most common deployment model).

Not trying to artificially amplify the risk in response to the above,
but I think it's important to perceive threat models in the wider
context:

- mitigating factor: machines (and interconnects) usually isolated
  and controlled to a significant extent (for instance to make fencing
  feasible to start with) as mentioned

- provoking factor: cluster is usually predestined to deliver
  service(s) not necessarily bullet-proof themselves to a wide range
  of users, not necessarily to those with all-good intents
  (so the whole chain throughout may consist of many small steps,
  low hanging fruit is usually long harvested)

It would be hypocritical to close our eyes to the latter; mileage
can vary for each deployment, just like the precautions taken, etc.
Not being even a passive enabler shall be a general goal across
the industry (note that the most severe case was nothing that
the chosen implementation language could be blamed for -- with
the 2019-marked one, well, perhaps).

* * *

As an extra note, thanks in advance to whoever will put the effort
to keep an eye on the after-patch behaviour and report back any
shenanigans observed!  Let's restate the upstream issue tracker for
pacemaker, since it appears to be gone from the list footer since
around March 19: https://bugs.clusterlabs.org

And as far as disclosing the possibly sensitive problems with SW
that some in this community happen to maintain and contribute to is
concerned, the recommended and most vendor-neutral (these are the
main drivers, let's admit) option at this time is this list, per
its rules: https://oss-security.openwall.org/wiki/mailing-lists/distros
(That is, unless there's an active interest to build something
unified collectively for what can be associated with ClusterLabs.)

Private issues would also do where possible, but at the end of
the day, any report is preferred to no report when at least
semi-reasonably routed.

Thanks!

-- 
Jan (Poki)


pgp7Z0i3Gtqjy.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Resource not starting correctly IV

2019-04-16 Thread Jan Pokorný
[letter-casing wise:
 it's either "Pacemaker" or down-to-the-terminal "pacemaker"]

On 16/04/19 10:21 -0600, JCA wrote:
> 2. It would seem that what Pacemaker is doing is the following:
>a. Check out whether the app is running.
>b. If it is not, launch it.
>c. Check out again
>d. If running, exit.
>e. Otherwise, stop it.
> f. Launch it.
>g. Go to a.
> 
> [...]
> 
> 4. If the above is correct, and if I am getting the picture correctly, it
> would seem that the problem is that my monitoring function does not detect
> immediately that my app is up and running. That's clearly my problem.
> However, is there any way to get Pacemaker to introduce a delay between
> steps b and c in section 2 above?

Ah, it should have occurred to me!

The typical solution, I think, is to have a sleep loop following the
daemon launch within the "start" action that runs (a subset of) what
"monitor" normally does, so as to synchronize on the "service ready"
moment.  The default timeout for "start" within the agent's metadata
should then reflect the common time to get to the point where "monitor"
is happy, plus some reserve.
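
To make that concrete, below is a minimal sketch of such a "start"
action for a shell-based OCF agent -- purely illustrative, not the
original poster's agent: the myapp_monitor helper, the binary path
and the 30-second budget are made-up placeholders, and the OCF_*
return codes are assumed to come from the usual ocf-shellfuncs include:

  myapp_start() {
      myapp_monitor && return $OCF_SUCCESS      # already running, nothing to do
      /usr/local/bin/myapp --daemonize          # hypothetical daemon launch
      tries=30                                  # keep this below the "start" timeout
      while [ $tries -gt 0 ]; do
          myapp_monitor && return $OCF_SUCCESS  # "service ready" confirmed
          sleep 1
          tries=$((tries - 1))
      done
      return $OCF_ERR_GENERIC                   # never became ready in time
  }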

Some agents may do more elaborate things, like precisely limiting such
waiting with respect to the time they were actually given by the
resource manager/pacemaker (if I don't misremember, that value is
provided through environment variables as a sort of introspection).

Resource agent experts could advise here.

(Truth be told, "daemon readiness" used to be a very marginalized
problem, putting barriers to practical [= race-free] dependency ordering
etc.; luckily, clever people realized that the most precise tracking
can only be in the hands of the actual daemon implementors if an
event-driven paradigm is to be applied.  For instance, if you can
influence my_app, and it's a standard forking daemon, it would be best
if the parent exited only when the daemon is truly ready to provide
service -- this usually requires some, typically signal-based,
synchronization amongst the daemon processes.  With systemd, the
situation is much simpler since no forking is necessary, just a call to
sd_notify(3) -- in that case, though, your agent would need to mimic the
server side of the sd_notify protocol since nothing would do it for you.)

> 5. Following up on 4: if my script sleeps for a few seconds immediately
> after launching my app (it's a daemon) in myapp_start then everything works
> fine. Indeed, the call sequence in node one now becomes:
> 
>  monitor:
> 
> Status: NOT_RUNNING
> Exit: NOT_RUNNING
> 
>   start:
> 
> Validate: SUCCESS
> Status: NOT_RUNNING
> Start: SUCCESS
> Exit: SUCCESS
> 
>   monitor:
> 
> Status: SUCCESS
> Exit: SUCCESS

That's easier but less effective and reliable (more opportunistic than
fact-based) than polling the "monitor" outcomes privately within "start"
as sketched above.

-- 
Jan (Poki)


pgpC486CyR_af.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Resource not starting correctly III

2019-04-16 Thread Jan Pokorný
On 15/04/19 16:01 -0600, JCA wrote:
> This is weird. Further experiments, consisting of creating and deleting the
> resource, reveal that, on creating the resource, myapp-script may be
> invoked multiple times - sometimes four, sometimes twenty or so, sometimes
> returning OCF_SUCCESS, some other times returning OCF_NOT_RUNNING. And
> whether or not it succeeds, as per pcs status, this seems to be something
> completely random.

Please don't forget that the agent also gets invoked so as to extract
its metadata (the action is "meta-data" in that case).  You would figure
this out if you followed Ulrich's advice.

Apologies if this possibility has already been explicitly ruled out in your experiments.

-- 
Jan (Poki)


pgpoTl6uJqDKV.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] RRP is broken ... in what ways? (Was: corosync caused network breakdown)

2019-04-10 Thread Jan Pokorný
On 08/04/19 19:08 +0200, Jan Friesse wrote:
> Anyway. RRP is broken,

Since this is repeatedly rehashed on this list without summarizing
*what* is that much broken[1] while having served (rather well, I think)
for the last 8+ years, perhaps it would deserve some wider dissemination,
incl. *why* that is.

Closest thing to come to mind is this, though I don't think
it's a well known reference:
https://corosync.github.io/corosync/doc/presentations/2017-Kronosnet-The-new-face-of-corosync-communications.pdf#page=3

In addition, there are minor notes here and there to be found, e.g.,
on the topic of the networks being presumably of comparable
characteristics with RRP [2,3]:

> It should probably be documented somewhere but may not be, with
> redundant ring, both rings must be of approximately the same
> performance.

> The protocol is designed to limit to the speed of the slowest ring
> - perhaps this is not working as intended.

[1] https://lists.clusterlabs.org/pipermail/users/2019-March/025491.html
[2] 
https://discuss.corosync.narkive.com/jUynoGxv/totem-implementation-is-unreliable-ring-faulty-and-retransmit-list#post2
[3] https://lists.clusterlabs.org/pipermail/pacemaker/2011-August/058536.html

Can you perhaps give a definitive answer here to the arising "what" and "why",
also to provide a credible incentive to upgrade away from something
just sketchily referred to as "broken", please?

-- 
Jan (Poki)


pgpEpILkEvBcV.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] Community's communication sustainability (Was: recommendations for corosync totem timeout for CentOS 7 + VMware?)

2019-03-22 Thread Jan Pokorný
On 22/03/19 15:02 +0300, Andrei Borzenkov wrote:
> On Fri, Mar 22, 2019 at 1:08 PM Jan Pokorný  wrote:
>> Also a Friday's idea:
>> Perhaps we should crank up "how to ask" manual for this list
> 
> Yest another one?
> 
> http://www.catb.org/~esr/faqs/smart-questions.html

That's the essence that can be linked along (I had that document
somewhere back on my mind when proposing that), but I also
had some very specific points to raise:

- decision diagram on what information is usually requested for
  particular community assisted debugging of particular
  behaviours of particular parts of the stack, possibly
  generalized around how to find out which of the component
  needs this level of investigation in the first place

- link to community-managed index of notable posts previously
  circulated on the lists
  (I am not very good at, and possibly also biased in, maintaining
   https://wiki.clusterlabs.org/wiki/Lists%27_Digest; anyone
   is welcome to step in and grow the insightful usefulness
   of some, sometimes very elaborate, posts to the MLs -- it
   could certainly use some more love)

- centralized summary of where the issue reports for particular
  components are expected, and how the existing ones can
  be searched

- (whatever will help to cut the collateral time spent on
  progress towards the satisfaction of the original poster,
  be it a providing appropriate pieces of information off
  the bat, getting to understand the shared terminology to
  avoid communication problems like in this very case,
  issue routing or whatever else;
  still someone not getting the importance of fencing?)

This is not to discourage rather generic purposing of the
lists, but primarily to improve the signal-to-noise ratio and
limit redundancy at least a bit.  It's fully acknowledged that many
mysterious behaviours are just a sum of the behaviours of
the underlying components, and sometimes the web search
will fail for some, so it's entirely OK to bring such points
up, but it's even better if done in a somewhat considerate
way.

User friendliness of the HA components is also relevant:
the better the diagnostics and documentation are at conveying
the mechanisms involved and their causal principles,
the fewer question marks are to arise, sure, and that's
why we need to work on that front as well; it's admittedly
not a one-way street (and I don't want to sound harsh
towards the audience, especially since I can also be
blamed for contributing to the current state there -- I'd
just prefer it if the lists could work on some productive
basis, since this is not exactly a hot-line service).

-- 
Jan (Poki)


pgpAmOkqNretN.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] recommendations for corosync totem timeout for CentOS 7 + VMware?

2019-03-22 Thread Jan Pokorný
On 21/03/19 12:21 -0400, Brian Reichert wrote:
> I've followed several tutorials about setting up a simple three-node
> cluster, with no resources (yet), under CentOS 7.
> 
> I've discovered the cluster won't restart upon rebooting a node.
> 
> The other two nodes, however, do claim the cluster is up, as shown
> with 'pcs status cluster'.

Please excuse the lack of understanding perhaps owing to the Friday
mental power phenomena, but:

1. what do you mean with "cluster restart"?
   local instance of cluster services being started anew once
   the node at hand finishes booting?

2. why shall a single malfunctioning node (out of three) irrefutably
   result in the dismantling of an otherwise healthy cluster?
   (if that's indeed what you presume)

Do you, in fact, expect the failed node to be rebooted automatically?
Do you observe a non-running cluster on the previously failed node
once it came up again?

Frankly, I'm lost at deciphering your situation and hence at least
part of your concerns.

* * *

Also a Friday's idea:
Perhaps we should crank up a "how to ask" manual for this list
(to be linked in the header) to attempt a smoother, less
time-consuming (time being a precious commodity) Q -> A flow,
to the benefit of the community.

-- 
Jan (Poki)


pgpRqBJusOjAI.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] FYI: clusterlabs.org server maintenance

2019-03-11 Thread Jan Pokorný
On 11/03/19 14:08 -0500, Ken Gaillot wrote:
> Both issues turned out to be SELinux-related. I believe everything is
> working properly again.

Thanks for looking into that (besides the long-term maintenance,
important nonetheless); will keep an eye on any recidivism of those.

-- 
Jan (Poki)


pgpxq01KkQ49S.pgp
Description: PGP signature
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] FYI: clusterlabs.org server maintenance

2019-03-11 Thread Jan Pokorný
On 08/03/19 15:24 -0600, Ken Gaillot wrote:
> I plan to move the bugzilla and wiki sites over the weekend, and
> everything else sometime in the week after that.

To provide feedback:

* wiki: seems really slow now

* bugzilla: at standard responsive speed, but when adding a comment:

> An unexpected error occurred. This could be a temporary problem, or some code
> is behaving incorrectly, [...]
> 
> URL: https://bugs.clusterlabs.org/process_bug.cgi
> 
> There was an error sending mail from 'nobody' to '': error
> when closing pipe to /usr/lib/sendmail: Temp failure (EX_TEMPFAIL)
> 
> Traceback:
>  at Bugzilla/Mailer.pm line 162.
>   Bugzilla::Mailer::MessageToMTA(...) called at Bugzilla/BugMail.pm line 354
>   Bugzilla::BugMail::sendMail(...) called at Bugzilla/BugMail.pm line 248
>   Bugzilla::BugMail::Send(...) called at Bugzilla/Bug.pm line 1199
>   Bugzilla::Bug::_send_bugmail(...) called at Bugzilla/Bug.pm line 1138
>   Bugzilla::Bug::send_changes(...) called at 
> /var/www/bugzilla/process_bug.cgi line 378

-- 
Jan (Poki)


pgpqX2ECVpI2C.pgp
Description: PGP signature
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] can't start pacemaker resources

2019-02-22 Thread Jan Pokorný
On 21/02/19 18:15 -0800, solarflow99 wrote:
> I have tried to create NFS shares using the ceph backend, but I can't seem
> to get the resources to start.  It doesn't show me much as to why, does
> anyone have an idea?
> 
> [...]
> 
> Feb 21 15:11:42 cephmgr101.corp.mydomain.com rbd.in(p_rbd_map_1)[10962]:
   ^^
I'd recommend checking out and build-configuring the Ceph
project properly (with do_cmake.sh?), rather than using the raw
(to-be-parameter-interpolated) rbd.in with ad-hoc hand edits
(otherwise it might not serve you well, I guess).

If you consume Ceph through downstream packages, perhaps they
ship this file in a usable state like that already.

> ERROR: error: unknown option 'device map rbd/nfs1' usage: rbd  ...
^^^
Looks like "rbd" executable got swapped for "rbdmap":

https://github.com/ceph/ceph/commit/6a57358add1157629a6d1f72ae48777e134c99e6#diff-2e4ba39659222c9c6cf778e2a1b77180

and that change was just accidentally missed at src/ocf/rbd.in
in that very commit.

-- 
Hope this helps,
Poki


pgpqtpB7rLgA0.pgp
Description: PGP signature
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Jan Pokorný
On 20/02/19 21:16 +0100, Klaus Wenninger wrote:
> On 02/20/2019 08:51 PM, Jan Pokorný wrote:
>> On 20/02/19 17:37 +, Edwin Török wrote:
>>> strace for the situation described below (corosync 95%, 1 vCPU):
>>> https://clbin.com/hZL5z
>> I might have missed that earlier or this may be just some sort
>> of insignificant/misleading clue:
>> 
>>> strace: Process 4923 attached with 2 threads
>>> strace: [ Process PID=4923 runs in x32 mode. ]
> 
> A lot of reports can be found that strace seems to get this wrong
> somehow.

Haven't ever seen that (incl. today's attempts), but could be:
https://lkml.org/lkml/2016/4/7/823

It's no fun when debugging tools generate more questions than
they are meant to answer.

>> but do you indeed compile corosync using x32 ABI?  (moreover
>> while pristine x86_64 libqb is used, i.e., using 64bit pointers?)
>> 
>> Perhaps it would be good to state any such major divergences if
>> so, unless I missed that.

-- 
Jan (Poki)


pgpHXWMJsanRa.pgp
Description: PGP signature
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Jan Pokorný
On 20/02/19 21:25 +0100, Klaus Wenninger wrote:
> Hmm maybe the thing that should be scheduled is running at
> SCHED_RR as well but with just a lower prio. So it wouldn't
> profit from the sched_yield and it wouldn't get anything of
> the 5% either.

Actually, it would possibly make the situation even worse in
that case, as explained in sched_yield(2):

> since doing so will result in unnecessary context
> switches, which will degrade system performance

(not sure into which bucket this context-switched time would
get accounted, if at all, but the physical-clock time is ticking
in the interim...)

I am curious whether a well-tuned SCHED_DEADLINE, as mentioned, might
be a more comprehensive solution here, also automatically
flipping still-alive-without-progress buggy scenarios into
a purposefully exaggerated and hence possibly actionable
condition (like with token loss -> fencing).

-- 
Jan (Poki)


pgproQPcFiqTk.pgp
Description: PGP signature
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Jan Pokorný
On 20/02/19 17:37 +, Edwin Török wrote:
> strace for the situation described below (corosync 95%, 1 vCPU):
> https://clbin.com/hZL5z

I might have missed that earlier or this may be just some sort
of insignificant/misleading clue:

> strace: Process 4923 attached with 2 threads
> strace: [ Process PID=4923 runs in x32 mode. ]

but do you indeed compile corosync using x32 ABI?  (moreover
while pristine x86_64 libqb is used, i.e., using 64bit pointers?)

Perhaps it would be good to state any such major divergences if
so, unless I missed that.

-- 
Jan (Poki)


pgpoZQEeLOZfX.pgp
Description: PGP signature
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Jan Pokorný
On 19/02/19 16:41 +, Edwin Török wrote:
> Also noticed this:
> [ 5390.361861] crmd[12620]: segfault at 0 ip 7f221c5e03b1 sp
> 7ffcf9cf9d88 error 4 in libc-2.17.so[7f221c554000+1c2000]
> [ 5390.361918] Code: b8 00 00 00 04 00 00 00 74 07 48 8d 05 f8 f2 0d 00
> c3 0f 1f 80 00 00 00 00 48 31 c0 89 f9 83 e1 3f 66 0f ef c0 83 f9 30 77
> 19  0f 6f 0f 66 0f 74 c1 66 0f d7 d0 85 d2 75 7a 48 89 f8 48 83 e0

By any chance, is this an unmodified pacemaker package as obtainable
from some public repo together with debug symbols?

-- 
Jan (Poki)


pgpvY5q1QOKcY.pgp
Description: PGP signature
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-18 Thread Jan Pokorný
On 15/02/19 08:48 +0100, Jan Friesse wrote:
> Ulrich Windl napsal(a):
>> IMHO any process running at real-time priorities must make sure
>> that it consumes the CPU only for short moment that are really
>> critical to be performed in time.

Pardon me, Ulrich, but something is off about this, especially
if meant in general.

Even if the infrastructure of the OS were entirely happy with
switching scheduling parameters constantly and at a furious rate
(I assume there may be quite a penalty when doing so, in the overhead
caused by reconfiguration of the schedulers involved), the
time-to-proceed-critical sections do not appear to be easily divisible
(if at all) in the overall code flow of a singlethreaded program like
corosync, since everything is time-critical in a sense (token and
other timeouts are ticking), and offloading some side/non-critical
tasks for asynchronous processing is likely not on the roadmap for
corosync, given the historical move from multithreading (only retained
for logging for which an extra precaution is needed so as to prevent
priority inversion, which will generally always be a threat when
unequal priority processes do interface, even if transitively).

The step around multithreading is to have another worker process
with IPC of some sorts, but with that, you only add more overhead
and complexity around such additionally managed queues into the
game (+ possibly priority inversion yet again).

BTW, regarding the "must make sure" part, barring self-supervision
of any sort (new complexity + overhead), that's a problem of
fixed-priority scheduling assignment.  I've recently been raising
awareness of the (Linux-specific) *deadline scheduler* [1,2], which:

- has even higher hierarchical priority compared to SCHED_RR
  policy (making the latter possibly ineffective, which would
  not be very desirable, I guess)

- may better express not only the actual requirements, but also some
  "under normal circumstances, using reasonably scoped HW for
  the task" (speaking of hypothetical defaults now, possibly
  user-configurable and/or influenced by the actual configured
  timeouts at corosync level) upper boundary on how much
  CPU run-time shall be allowed for the process in absolute
  terms, possibly preventing said livelock scenarios (the process
  being throttled when exceeding it, presumably speeding up the
  loss of the token and subsequent fencing)

Note that in systemd deployments, it would be customary for
the service launcher (unit file executor) to actually expose
this stuff as yet another user-customizable wrapping around
the actual run, but support for this very scheduling policy
is currently missing[3].
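
To illustrate the shape of those absolute terms (and only that --
corosync itself requests SCHED_RR on its own, so this is not a
supported corosync setup, and the command and figures below are
placeholders), a sufficiently recent util-linux chrt can launch
a process under SCHED_DEADLINE from the shell:

  # allow at most 10 ms of CPU run-time in every 100 ms period (values in
  # nanoseconds); the priority argument must be 0 for the deadline policy
  chrt --deadline \
       --sched-runtime  10000000 \
       --sched-deadline 100000000 \
       --sched-period   100000000 \
       0 ./some-daemon --foreground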

>> Specifically having some code that performs poorly (for various
>> reasons) is absolutely _not_ a candidate to be run with real-time
>> priorities to fix the bad performance!

You've managed to flip (more or less -- I have no contrary evidence)
an isolated occurrence of evidently buggy behaviour into a generalized
description of the performance of the involved pieces of SW.
If it were that bad, we would hear all the time that there's not
enough room left for the actual clustered resources, but I am not
aware of that.

By buggy behaviour I mean that the logs from https://clbin.com/9kOUM
and the https://github.com/ClusterLabs/libqb/commit/2a06ffecd bug fix
from the past seem to have something in common, like high load
as a surrounding circumstance, and a missed event/job (on,
presumably, a socket, fd=15 in the log ... since that never gets
handled even when there's no other input event).  Guess
another look is needed at the _poll_and_add_to_jobs_ function
(not sure why it's without the leading/trailing underscore in the
provided gdb backtrace [snipped]:
>>> Thread 1 (Thread 0x7f6fd43c7b80 (LWP 16242)):
>>> #0 0x7f6fd31c5183 in epoll_wait () from /lib64/libc.so.6
>>> #1 0x7f6fd3b3dea8 in poll_and_add_to_jobs () from /lib64/libqb.so.0
>>> #2 0x7f6fd3b2ed93 in qb_loop_run () from /lib64/libqb.so.0
>>> #3 0x55592d62ff78 in main ()
) and its use.

>> So if corosync is using 100% CPU in real-time, this says something
>> about the code quality in corosync IMHO.

... or in any other library that's involved (primary suspect: libqb),
down to the kernel level, and, keep in mind, no piece of nontrivial SW is
bug-free, especially if the reproducer requires a rather specific
environment that is not prioritized by anyone, incl. those tasked
with quality assurance.

>> Also SCHED_RR is even more cooperative than SCHED_FIFO, and another
>> interesting topic is which of the 100 real-time priorities to
>> assign to which process.  (I've written some C code that allows to
>> select the scheduling mechanism and the priority via command-line
>> argument, so the user and not the program is responsible if the
>> system locks up. Maybe corosync should thing about something
>> similar.
> 
> And this is exactly why corosync option -p (-P) exists (in 3.x these
> were moved to corosync.conf as a sched_rr/priority).
> 
>> Personally I also think that a 

Re: [ClusterLabs] Pacemaker log showing time mismatch after

2019-02-11 Thread Jan Pokorný
On 11/02/19 15:03 -0600, Ken Gaillot wrote:
> On Fri, 2019-02-01 at 08:10 +0100, Jan Pokorný wrote:
>> On 28/01/19 09:47 -0600, Ken Gaillot wrote:
>>> On Mon, 2019-01-28 at 18:04 +0530, Dileep V Nair wrote:
>>> Pacemaker can handle the clock jumping forward, but not backward.
>> 
>> I am rather surprised, are we not using monotonic time only, then?
>> If so, why?
> 
> The scheduler runs on a single node (the DC) but must take as input the
> resource history (including timestamps) on all nodes. We need wall
> clock time to compare against time-based rules.

Yep, was aware of the troubles with this.

> Also, if we get two resource history entries from a node, we don't
> know if it rebooted in between, so a monotonic timestamp alone
> wouldn't be sufficient.

Ah, that's along the lines of Ulrich's response, I see it now, thanks.

I am not sure if there could be a way around that using boot IDs when
provided by the platform (like the hashes in the case of systemd, assuming
just an equality test is all that's needed, since both histories
cannot arrive at the same time and presumably the FIFO ordering gets
preserved -- it shall under normal circumstances, incl. no time
travels and a sane, non-byzantine process, which might even be partially
detected and acted upon [two differing histories without any
recollection that the node was fenced by the cluster? fence it for sure
and for good measure!]).
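
(Just to sketch that equality test -- nothing pacemaker records today,
and the storage path below is hypothetical: Linux exposes a per-boot
random UUID that stays constant within a boot and changes on every
reboot, which is also what systemd reports as its boot ID.)

  boot_id_now=$(cat /proc/sys/kernel/random/boot_id)
  boot_id_then=$(cat /var/lib/myapp/last_boot_id 2>/dev/null)  # hypothetical store
  if [ -n "$boot_id_then" ] && [ "$boot_id_then" != "$boot_id_now" ]; then
      echo "node rebooted between the two recorded history entries"
  fi
  printf '%s\n' "$boot_id_now" > /var/lib/myapp/last_boot_id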

> However, it might be possible to store both time representations in the
> history (and possibly maintain some sort of cluster knowledge about
> monotonic clocks to compare them within and across nodes), and use one
> or the other depending on the context. I haven't tried to determine how
> feasible that would be, but it would be a major project.

Just thinking aloud: the DC node, once it has won the election, will cause
the other nodes (incl. newcomers during its rule) to (re)set the
offsets of the DC's vs. their own internal wall-clock time (for
time-based rules, should they ever become DC on their own), all nodes
will (re)establish a monotonic to wall-clock time conversion based on
these one-off inputs, and from that point on, they can operate fully
detached from the wall-clock time (so that changing the wall clock
will have no immediate effect on the cluster unless (re)harmonizing
is desired via some configured event handler that could reschedule the
plan appropriately, or delay the sync for a more suitable moment).
This indeed counts on pacemaker used in a full-fledged cluster mode,
i.e., requiring quorum (not to speak about fencing).

As a corollary, with such a scheme, time-based rules would only be
allowed when quorum is fully honoured (after all, it makes more sense
to employ cron or systemd timer units otherwise, since they all
use local time only, which is then the right fit, as opposed to
a distributed system).

>> We shall not need any explicit time synchronization across the nodes
>> since we are already backed by extended virtual synchrony from
>> corosync, even though it could introduce strangeness when
>> time-based rules kick in.
> 
> Pacemaker determines the state of a resource by replaying its resource
> history in the CIB. A history entry can be replaced only by a newer
> event. Thus if there's a start event in the history, and a stop result
> comes in, we have to know which one is newer to determine whether the
> resource is started or stopped.

But for that, boot ID might be a sufficient determinant, since
a result (keyed with boot ID ABC) for an action never triggered with
boot ID XYZ deemed "current" ATM means that something is off, and
fencing is perhaps the best choice.

> Something along those lines is likely the cause of:
> 
> https://bugs.clusterlabs.org/show_bug.cgi?id=5246

I believe the "detached" scheme sketched above could solve this.
Problem is, the devil is in the details, so it can be unpredictably hard
to get it right in a distributed environment.

-- 
Jan (Poki)


pgpzb2j8BIGQK.pgp
Description: PGP signature
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [pacemaker] Discretion with glib v2.59.0+ recommended

2019-02-11 Thread Jan Pokorný
On 20/01/19 12:44 +0100, Jan Pokorný wrote:
> On 18/01/19 20:32 +0100, Jan Pokorný wrote:
>> It was discovered that this release of the glib project slightly changed
>> some parameters of how the distribution of values within hash table
>> structures works, undermining pacemaker's hard (alas unfeasible) attempt
>> to turn this data type into a fully predictable entity.
>> 
>> The current impact is unknown besides some internal regression tests
>> failing due to this, so that, e.g., in the environment variables passed
>> in the notification messages, the order of the active nodes (being
>> a space-separated list) may appear shuffled in comparison with the
>> long-standing (and perhaps falsely determinism-suggesting)
>> behaviour witnessed with older versions of glib in the game.
> 
> Our immediate response is to, at the very least, make the
> cts-scheduler regression suite (the only localhost one that was
> rendered broken with 52 tests out of 733 failed) skip those tests
> where reliance on the exact order of hash-table-driven items was
> sported, so it won't fail as a whole:
> 
> https://github.com/ClusterLabs/pacemaker/pull/1677/commits/15ace890ef0b987db035ee2d71994e37f7eaff96
> [above edit: updated with the newer version of the patch]

Shout-out to Ken for fixing the immediate fallout (deterministic
output breakages in some cts-scheduler tests, making the above
change superfluous) for the upcoming 2.0.1 release!

>> Variations like these are expected, and you may take it as an
>> opportunity to fix incorrect order-wise (like in the stated case)
>> assumptions.
> 
> [intentionally CC'd developers@, should have done it since beginning]
> 
> At this point, testing with glib v2.59.0+, preferably using 2.0.1-rc3
> due to the release cycle timing, is VERY DESIRED if you are considering
> providing some volunteer capacity to pacemaker project, especially if
> you have your own agents and scripts that rely on the exact (and
> previously likely stable) order of "set data made linear, hence
> artificially ordered", like with OCF_RESKEY_CRM_meta_notify_active_uname
> environment variable in clone notifications (as was already suggested;
> complete list is also unknown at this point, unfortunately, for a lack
> of systemic and precise data items tracking in general).

While some of these, if not all, are now ordered, I'd call using the
"stable ordered list" approach to these variables, as opposed to the
"plain unordered set" one, from within agents, as continuously
frowned-upon unless explicitly lifted.  For predictable
backward/forward pacemaker+glib version compatibility, if
for no other reason.

Ken, do you agree?

(If so, we shall keep that in mind for future documentation tweaks
[possibly also including OCF updates], so that no false assumptions will
be cast for new agent implementations going forward.)
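
In agent terms, that frowned-upon dependency is easy to avoid -- here
is a hedged sketch of a clone notification handler imposing its own
deterministic order on the space-separated
OCF_RESKEY_CRM_meta_notify_active_uname list (the variable mentioned
above) instead of trusting whatever order pacemaker/glib produced:

  # normalize locally; never rely on the order the variable arrived in
  active_nodes=$(printf '%s\n' $OCF_RESKEY_CRM_meta_notify_active_uname | sort)
  for node in $active_nodes; do
      : # act on "$node" here; the order is now ours, not glib's
  done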

>> More serious troubles stemming from this expectation-reality mismatch
>> regarding said data type cannot be denied at this point, subject of
>> further investigation.  When in doubt, staying with glib up to and
>> including v2.58.2 (said tests are passing with it, though any later
>> v2.58.* may keep working "as always") is likely a good idea for the
>> time being.

I think this still partially holds and will only be proven fully
settled with time?  I mean, for anything truly reproducible (as in
crm_simulate), pacemaker prior to 2.0.1 needs to be uniformly combined
with glib either pre- or equal-or-post-2.59.0 (reproducers need to
follow the original) to get the same results, and with pacemaker 2.0.1+,
identical results (but possibly differing against either of the former
combos) will _likely_ be obtained regardless of the particular run-time
linked glib version, but the strength of this "likely" will only be
established with future experience, I suppose (though it shall hold
universally with the same glib class per the stated division, so no
change in this already positive regard).

I've just scratched the surface, so I'll gladly be corrected.

-- 
Jan (Poki)


pgpkiGGrQQBN2.pgp
Description: PGP signature
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker log showing time mismatch after

2019-01-31 Thread Jan Pokorný
On 28/01/19 09:47 -0600, Ken Gaillot wrote:
> On Mon, 2019-01-28 at 18:04 +0530, Dileep V Nair wrote:
> Pacemaker can handle the clock jumping forward, but not backward.

I am rather surprised, are we not using monotonic time only, then?
If so, why?

We shall not need any explicit time synchronization across the nodes
since we are already backed by extended virtual synchrony from
corosync, even though it could introduce strangeness when
time-based rules kick in.
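
(For a quick, purely illustrative shell-level view of the two clock
classes being discussed: the first value can jump when the wall clock
is stepped, the second keeps ticking regardless, yet is meaningless
for comparison across nodes or reboots.)

  date +%s                    # wall-clock (REALTIME) seconds since the epoch
  cut -d' ' -f1 /proc/uptime  # seconds since boot, monotonic-style, per-node only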

-- 
Nazdar,
Jan (Poki)


pgpLqnsQYhMnE.pgp
Description: PGP signature
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

