Re: [ClusterLabs] The Linux-HA site is down.

2023-06-01 Thread Lars Ellenberg
On Wed, May 03, 2023 at 11:07:20AM -0400, Madison Kelly wrote:
> On 2023-05-03 05:26, 黃暄皓 wrote:
> 
> As the title said, is it still in maintenance?
> 
> I'm not sure who even owns or maintains that old domain.

We (Linbit) were still hosting that site,
though all of it was supposed to be "read-only", as an "archive".

There was a spam attack on the old wiki software there,
somehow bypassing the "supposedly read-only" settings.

We are still looking into bringing it back as an "archive",
in some form or other; it used to have "good karma".

But:

> I don't think it's been used or maintained for a long time.

Exactly.
Basically it is/was unmaintained content
about software that was old ten years ago.

It really is interesting only for "history lessons";
I think the old content has not had any practical relevance
for quite some time now.

So besides being busy with more pressing things than bringing back
what I think of as an old uninteresting archive,
I also wanted to see if anyone would actually notice that it is gone.

Lars

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] DC marks itself as OFFLINE, continues orchestrating the other nodes

2022-09-14 Thread Lars Ellenberg
On Thu, Sep 08, 2022 at 10:11:46AM -0500, Ken Gaillot wrote:
> On Thu, 2022-09-08 at 15:01 +0200, Lars Ellenberg wrote:
> > Scenario:
> > three nodes, no fencing (I know)
> > break network, isolating nodes
> > unbreak network, see how cluster partitions rejoin and resume service
> 
> I'm guessing the CIB changed during the break, with more changes in one
> of the other partitions than mqhavm24 ...

quite likely.

> Reconciling CIB differences in different partitions is inherently
> lossy. Basically we gotta pick one side to win, and the current
> algorithm just looks at the number of changes. (An "admin epoch" can
> also be bumped manually to override that.)

Yes.
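
(For reference, bumping the admin epoch by hand is roughly this; untested
here, and the value 42 is arbitrary -- it just has to be higher than the
current one:

    # make this partition's CIB win any future reconciliation
    cibadmin --modify --xml-text '<cib admin_epoch="42"/>'
)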

> > I have full crm_reports and some context knowledge about the setup.
> > 
> > For now I'd like to know: has anyone seen this before,
> > is that a known bug in corner cases/races during re-join,
> > has it even been fixed meanwhile?
> 
> No, yes, no

Thank you.
That's what I thought :-|

> It does seem we could handle the specific case of the local node's
> state being overwritten a little better. We can't just override the
> join state if the other nodes think it is different, but we could
> release DC and restart the join process. How did it handle the
> situation in this case?

I think these are the most interesting lines:

-
Aug 11 12:32:45 mqhavm24 corosync[13296]:  [QUORUM] Members[1]: 1
   stopping stuff

Aug 11 12:33:36 mqhavm24 corosync[13296]:  [QUORUM] Members[3]: 1 3 2

Aug 11 12:33:36 [13310] mqhavm24   crmd:  warning: crmd_ha_msg_filter:  
Another DC detected: mqhavm37 (op=noop)
Aug 11 12:33:36 [13310] mqhavm24   crmd: info: update_dc:   Set DC 
to mqhavm24 (3.0.14)

Aug 11 12:33:36 [13308] mqhavm24  attrd:   notice: 
attrd_check_for_new_writer:  Detected another attribute writer (mqhavm37), 
starting new election
Aug 11 12:33:36 [13308] mqhavm24  attrd:   notice: attrd_declare_winner:
Recorded local node as attribute writer (was unset)

plan to start stuff on all three nodes
Aug 11 12:33:36 [13309] mqhavm24 pengine:   notice: process_pe_message:  
Calculated transition 161, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-688.bz2

but then
Aug 11 12:33:36 [13305] mqhavm24 cib: info: cib_perform_op:  +  
/cib/status/node_state[@id='1']:  @crm-debug-origin=do_cib_replaced, @join=down

and we now keep stuff stopped locally, but continue to manage the other two 
nodes.
-


Commented log of the most interesting node below,
starting at the point where communication goes down.
Maybe you see something that gives you an idea how to handle this better.

If it helps, I have the full crm_report of all nodes,
should you feel the urge to have a look.

Aug 11 12:32:45 mqhavm24 corosync[13296]:  [TOTEM ] Failed to receive the leave 
message. failed: 3 2
Aug 11 12:32:45 mqhavm24 corosync[13296]:  [QUORUM] This node is within the 
non-primary component and will NOT provide any services.
Aug 11 12:32:45 mqhavm24 corosync[13296]:  [QUORUM] Members[1]: 1
Aug 11 12:32:45 mqhavm24 corosync[13296]:  [MAIN  ] Completed service 
synchronization, ready to provide service.
[stripping most info level for now]
Aug 11 12:32:45 [13306] mqhavm24 stonith-ng:   notice: 
crm_update_peer_state_iter:  Node mqhavm37 state is now lost | nodeid=2 
previous=member source=crm_update_peer_proc
Aug 11 12:32:45 [13306] mqhavm24 stonith-ng:   notice: reap_crm_member: Purged 
1 peer with id=2 and/or uname=mqhavm37 from the membership cache
Aug 11 12:32:45 [13306] mqhavm24 stonith-ng:   notice: 
crm_update_peer_state_iter:  Node mqhavm34 state is now lost | nodeid=3 
previous=member source=crm_update_peer_proc
Aug 11 12:32:45 [13306] mqhavm24 stonith-ng:   notice: reap_crm_member: Purged 
1 peer with id=3 and/or uname=mqhavm34 from the membership cache
Aug 11 12:32:45 [13303] mqhavm24 pacemakerd:  warning: 
pcmk_quorum_notification:Quorum lost | membership=3112546 members=1
Aug 11 12:32:45 [13303] mqhavm24 pacemakerd:   notice: 
crm_update_peer_state_iter:  Node mqhavm34 state is now lost | nodeid=3 
previous=member source=crm_reap_unseen_nodes
Aug 11 12:32:45 [13303] mqhavm24 pacemakerd:   notice: 
crm_update_peer_state_iter:  Node mqhavm37 state is now lost | nodeid=2 
previous=member source=crm_reap_unseen_nodes
Aug 11 12:32:45 [13310] mqhavm24   crmd:  warning: 
pcmk_quorum_notification:Quorum lost | membership=3112546 members=1
Aug 11 12:32:45 [13310] mqhavm24   crmd:   notice: 
crm_update_peer_state_iter:  Node mqhavm34 state is now lost | nodeid=3 
previous=member source=crm_reap_unseen_nodes
Aug 11 12:32:45 [13308] mqhavm24  attrd:   notice: 
crm_update_peer_state_iter:  Node mqhavm37 state is now lost | nodeid=2 
previous=member source=crm_update_peer_proc
Aug 11 12:32:45 [13305] mqhavm24 cib:  

[ClusterLabs] DC marks itself as OFFLINE, continues orchestrating the other nodes

2022-09-08 Thread Lars Ellenberg


Scenario:
three nodes, no fencing (I know)
break network, isolating nodes
unbreak network, see how cluster partitions rejoin and resume service


Funny outcome:
/usr/sbin/crm_mon  -x pe-input-689.bz2
Cluster Summary:
  * Stack: corosync
  * Current DC: mqhavm24 (version 1.1.24.linbit-2.0.el7-8f22be2ae) - partition 
with quorum
  * Last updated: Thu Sep  8 14:39:54 2022
  * Last change:  Thu Aug 11 12:33:02 2022 by root via crm_resource on mqhavm24
  * 3 nodes configured
  * 16 resource instances configured (2 DISABLED)

Node List:
  * Online: [ mqhavm34 mqhavm37 ]
  * OFFLINE: [ mqhavm24 ]


Note how the current DC considers itself as OFFLINE!

It accepted an apparently outdated cib replacement from one of the non-DCs
from a previous membership while already authoritative itself,
overwriting its own "join" status in the cib.

I have full crm_reports and some context knowledge about the setup.

For now I'd like to know: has anyone seen this before,
is that a known bug in corner cases/races during re-join,
has it even been fixed meanwhile?

Thanks,
Lars

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] no-quorum-policy=stop never executed, pacemaker stuck in election/integration, corosync running in "new membership" cycles with itself

2021-06-02 Thread Lars Ellenberg
lge> > I would have expected corosync to come back with a "stable
lge> > non‑quorate membership" of just itself within a very short
lge> > period of time, and pacemaker winning the
lge> > "election"/"integration" with just itself, and then trying
lge> > to call "stop" on everything it knows about.
ken> 
ken> That's what I'd expect, too. I'm guessing the corosync cycling is
ken> what's causing the pacemaker cycling, so I'd focus on corosync first.

Any Corosync folks around with some input?
What may cause corosync on an isolated (with iptables DROP rules)
node to keep creating "new membership" with only itself?

Is it a problem with the test setup maybe?
Does an isolated corosync node need to be able
to send the token to itself?
Do the "iptables DROP" rules on the outgoing interfaces prevent that?

On Tue, Jun 01, 2021 at 10:31:21AM -0500, kgail...@redhat.com wrote:
> On Tue, 2021-06-01 at 13:18 +0200, Ulrich Windl wrote:
> > Hi!
> > 
> > I can't answer, but I doubt the usefulness of
> > "no-quorum-policy=stop": If nodes loose quorum, they try to
> > stop all resources, but "remain" in the cluster (will respond
> > to network queries (if any arrive).  If one of those "stop"s
> > fails, the other part of the cluster never knows.  So what can
> > be done? Should the "other(left)" part of the cluster start
> > resources, assuming the "other(right)" part of the cluster had
> > stopped resources successfully?
> 
> no-quorum-policy only affects what the non-quorate partition will do.
> The quorate partition will still fence the non-quorate part if it is
> able, regardless of no-quorum-policy, and won't recover resources until
> fencing succeeds.

The context in this case is: "fencing by storage".
DRBD 9 has a "drbd quorum" feature, where you can ask it
to throw IO errors (or freeze) if DRBD quorum is lost,
so data integrity on network partition is protected,
even without fencing on the pacemaker level.
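
For context, that looks roughly like this in drbd.conf (DRBD 9; the
resource name is made up):

    resource r0 {
        options {
            quorum majority;
            on-no-quorum io-error;   # or suspend-io, to freeze instead
        }
        ...
    }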

It is rather a "convenience" that the non-quorate
pacemaker on the isolated node should stop everything
that still "survived", especially the umount is necessary
for DRBD on that node to become secondary again,
which is necessary to be able to re-integrate later
when connectivity is restored.

Yes, fencing on the node level is still necessary for other
scenarios.  But with certain scenarios, avoiding a node level
fence while still being able to also avoid "trouble" once
connectivity is restored would be nice.

And it would work nicely here, if the corosync membership
of the isolated node were stable enough for pacemaker
to finalize "integration" with itself and then (try to) stop
everything, so we have a truly "idle" node when connectivity is
restored.

"trouble":
spurious restart of services ("resource too active ..."),
problems with re-connecting DRBD ("two primaries not allowed")

> > > pcmk 2.0.5, corosync 3.1.0, knet, rhel8
> > > I know fencing "solves" this just fine.
> > > 
> > > what I'd like to understand though is: what exactly is
> > > corosync or pacemaker waiting for here, why does it not
> > > manage to get to the stage where it would even attempt to
> > > "stop" stuff?
> > > 
> > > two "rings" aka knet interfaces.
> > > node isolation test with iptables,
> > > INPUT/OUTPUT ‑j DROP on one interface,
> > > shortly after on the second as well.
> > > node loses quorum (obviously).
> > > 
> > > pacemaker is expected to no‑quorum‑policy=stop,
> > > but is "stuck" in Election ‑> Integration,
> > > while corosync "cycles" bewteen "new membership" (with only
> > > itself, obviously) and "token has not been received in ...",
> > > "sync members ...", "new membership has formed ..."
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] no-quorum-policy=stop never executed, pacemaker stuck in election/integration, corosync running in "new membership" cycles with itself

2021-06-01 Thread Lars Ellenberg
pcmk 2.0.5, corosync 3.1.0, knet, rhel8
I know fencing "solves" this just fine.

what I'd like to understand though is: what exactly is corosync or
pacemaker waiting for here,
why does it not manage to get to the stage where it would even attempt
to "stop" stuff?

two "rings" aka knet interfaces.
node isolation test with iptables,
INPUT/OUTPUT -j DROP on one interface, shortly after on the second as well.
 node loses quorum (obviously).

pacemaker is expected to no-quorum-policy=stop,
but is "stuck" in Election -> Integration,
while corosync "cycles" bewteen "new membership" (with only itself, obviously)
and "token has not been received in ...", "sync members ...", "new
membership has formed ..."

I would have expected corosync to come back with a "stable non-quorate
membership" of just itself
within a very short period of time, and pacemaker winning the
"election"/"integration" with just itself,
and then trying to call "stop" on everything it knows about.
I'm asking for hints what to look for in the logs, or how to drill
down further as to why that is not the case.

Lars
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] attrd/cib out of sync, master scores not updated in CIB after crmd "Respawn" after internal error [NOT cluster partition/rejoin]

2020-09-17 Thread Lars Ellenberg
On Fri, Sep 11, 2020 at 11:42:46AM +0200, Lars Ellenberg wrote:
> On Thu, Sep 10, 2020 at 11:18:58AM -0500, Ken Gaillot wrote:
> > > But for some unrelated reason (stress on the cib, IPC timeout),
> > > crmd on the DC was doing an error exit and was respawned:
> > > 
> > >   cib: info: cib_process_ping:  Reporting our current digest
> > >   crmd:error: do_pe_invoke_callback: Could not retrieve the
> > > Cluster Information Base: Timer expired
> > >   ...
> > >   pacemakerd:error: pcmk_child_exit:   The crmd process (17178)
> > > exited: Generic Pacemaker error (201)
> > >   pacemakerd:   notice: pcmk_process_exit: Respawning failed child
> > > process: crmd
> > > 
> > > The new DC now causes:
> > >   cib: info: cib_perform_op:Diff: --- 0.971.201 2
> > >   cib: info: cib_perform_op:Diff: +++ 0.971.202 (null)
> > >   cib: info: cib_perform_op:--
> > > /cib/status/node_state[@id='2']/transient_attributes[@id='2']
> > >
> > > But the attrd apparently does not notice that transient attributes it
> > > had cached are now gone.
> > 
> > This is a known issue. There was some work done on it in stages that
> > never went anywhere:
> > 
> > https://github.com/ClusterLabs/pacemaker/pull/1695
> > 
> > https://github.com/ClusterLabs/pacemaker/pull/1699
> > 
> > https://github.com/ClusterLabs/pacemaker/pull/2020
> > 
> > The basic idea is that the controller should ask pacemaker-attrd to
> > clear a node's transient attributes rather than doing so directly, so
> > attrd and the CIB stay in sync. Backward compatibility would be tricky.
> > 
> > The fix would only be in Pacemaker 2, since this would require a
> > feature set bump, which can't be backported.
> 
> Thank you for that quick response and all the context above.
> 
> You mention below
> 
> > the controller
> > should request node attribute erasure only if the node leaves the
> > corosync membership, not just the controller CPG.
> 
> Would that be a change that could go into the 1.1.x series?

Suggestion to mitigate the issue:

periodically, for example from a monitor action of a simple resource
agent script, do:

   if attrd_updater -n attrd-canary --update 1; then
       crm_attribute --lifetime reboot --name attrd-canary --query || attrd_updater --refresh
   fi
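
(If a dedicated resource agent feels like overkill, the same check could
also run from cron on every node, e.g. one line in /etc/cron.d/attrd-canary;
the file name is made up, untested:

    * * * * * root attrd_updater -n attrd-canary --update 1 && { crm_attribute --lifetime reboot --name attrd-canary --query >/dev/null || attrd_updater --refresh ; }
)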

Do you see any possible issues with that approach?

Lars

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] attrd/cib out of sync, master scores not updated in CIB after crmd "Respawn" after internal error [NOT cluster partition/rejoin]

2020-09-10 Thread Lars Ellenberg
Now with "reproducer" ... see below

On Thu, Sep 10, 2020 at 11:55:20AM +0200, Lars Ellenberg wrote:
> Hi there.
> 
> I've seen a scenario where a network "hiccup" isolated the current DC in a 3
> node cluster for a short time; other partition elected a new DC obviously, and
> all node attributes of the former DC are "cleared" together with the rest of
> its state.

I have to correct myself here.
Network and membership remained stable, even the CIB CPG did not notice 
anything.

But for some unrelated reason (stress on the cib, IPC timeout),
crmd on the DC was doing an error exit and was respawned:

  cib: info: cib_process_ping:  Reporting our current digest
  crmd:error: do_pe_invoke_callback: Could not retrieve the Cluster 
Information Base: Timer expired
  ...
  pacemakerd:error: pcmk_child_exit:   The crmd process (17178) exited: 
Generic Pacemaker error (201)
  pacemakerd:   notice: pcmk_process_exit: Respawning failed child process: crmd

The new DC now causes:
  cib: info: cib_perform_op:Diff: --- 0.971.201 2
  cib: info: cib_perform_op:Diff: +++ 0.971.202 (null)
  cib: info: cib_perform_op:-- 
/cib/status/node_state[@id='2']/transient_attributes[@id='2']

But the attrd apparently does not notice that transient attributes it had 
cached are now gone.

Reprobes are going on, and all give the expected results.
But unchanged (from the perspective of the attrd on the former DC,
the one with the crmd Respawn) master scores will not be re-populated
to the CIB, preventing a later switchover of the Master role
(that is when it became apparent that something was wrong).

A "reproducer" in the sense of "reproduces approximate behavior",
even if not the exact scenario (crmd emergency respawn and DC re-election):

 * have a healthy cluster with some master scores set
 * delete transient node attributes:
   cibadmin -D --xpath "/cib/status/node_state[@id='2']/transient_attributes[@id='2']"
(or whatever your node id is; the resource should not be promoted on
that node at that time, or this will result in resource "recovery"
actions, which will change the master score, and we have a different effect)

Any cached node attributes (master scores) on that node
will "never" make it to the CIB (until they eventually change their value).

How can this be fixed?
   * for the "cibadmin -D" case? (do we even want to?)
   * for the "DC re-election" and one crmd "temporarily not available"
 case as in the scenario described here?
 (I think we should)

> All nodes rejoin, "all happy again", BUT ...
> the attrd of the former DC apparently had some cached node attribute values,
> which are now no longer present in the cib.
> Specifically, some master scores.
> So the master scores for the former DC (that was lost, then rejoined) are now
> "only" in its attrd, but (as long as they don't change) will never be flushed
> to the CIB.
> 
> The policy engine therefore no longer considers this node as a possible
> promotion candidate.
> 
> Again: the master score did not change, not from the perspective of the attrd
> on the node which was isolated for a short time, anyways.
> 
> But since that node "left", the two-node partition deleted the node state of
> the "lost" node (including master scores).
> Then that node rejoined.
> 
> Now, I have a cib without that master score, an attrd with that master score
> value still "cached", and some periodic monitor that will just reset this same
> (already cached in attrd) master score.
> But that apparently will never reach the CIB.
> 
> So.
> Question is: anyone seen anything like that before?
> Could that be fixed already?
> Version in that scenario was: 1.1.20+ (almost .21).
> 
> Obviously "stonith" would have fixed it,
> then that node would not have just rejoined, but rebooted, then rejoined,
> and its attrd would not have any cached values anymore ;-)
> 
> I suppose attrd attributes should sync with the last CIB on re-join?
> I'd hope it does something like that already?
> If it does nothing yet, then maybe that's the obvious fix.
> If it does something, then maybe this boils down to some funky timing issue?
> 
> How would I go about trying to create a reproducer?
> 
Thanks,
 
 Lars

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Removing DRBD w/out Data Loss?

2020-09-10 Thread Lars Ellenberg
On Tue, Sep 08, 2020 at 02:33:37PM +, Eric Robinson wrote:
> I checked the DRBD manual for this, but didn't see an answer. We need
> to convert a DRBD cluster node into standalone server and remove DRBD
> without losing the data. Is that possible? I asked on the DRBD list
> but it didn't get much of a response.
> 
> Given:
> 
> The backing device is logical volume: /dev/vg1/lv1
> 
> The drbd volume is: drbd0
> 
> The filesystem is ext4 on /dev/drbd0
> 
> Since the filesystem is built on /dev/drbd0, not on /dev/vg1/lv1, if we 
> remove drbd from the stack, how do we get access to the data?

As long as we still have a "full copy" of all data on each node
[okay, not on "diskless clients", obviously],
you can just pretend DRBD was not there, and mount directly: DRBD is
"transparent", and just for "internal metadata" reserves some blocks at
the very end of your backing device.

"transparent" was an intentional design choice, so you can easily add
DRBD to an existing non-replicated data set, once you recognize that you
want that.

[That will change when we introduce the "erasure coding" data layouts,
where we no longer have a full copy of the data on each node,
but need several (but not all) nodes to reconstruct the full data set]

Some libblkid (what is used by mount to "guess" the file system type)
versions know how to recognize DRBD meta data, would "guess" a file
system type of "drbd" (for "internal" drbd meta data), and fail.

But you can explicitly specify the file system type:
"mount -t ext4 /dev/vg/lv /mnt/point"

You could also remove the DRBD meta data magic,
drbdmeta wipe-md should do it, or wipefs to just remove the DRBD
specific "magic" used by libblkid to identify it.


The "without Data Loss" part depends on whether the local copy was
"Consistent" (or better yet: UpToDate) before you decided to remove DRBD.

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] attrd/cib out of sync, master scores not updated in CIB after cluster partition/rejoin

2020-09-10 Thread Lars Ellenberg
Hi there.

I've seen a scenario where a network "hiccup" isolated the current DC in a 3
node cluster for a short time; other partition elected a new DC obviously, and
all node attributes of the former DC are "cleared" together with the rest of
its state.

All nodes rejoin, "all happy again", BUT ...
the attrd of the former DC apparently had some cached node attribute values,
which are now no longer present in the cib.
Specifically, some master scores.
So the master scores for the former DC (that was lost, then rejoined) are now
"only" in its attrd, but (as long as they don't change) will never be flushed
to the CIB.

The policy engine therefore no longer considers this node as a possible
promotion candidate.

Again: the master score did not change, not from the perspective of the attrd
on the node which was isolated for a short time, anyways.

But since that node "left", the two-node partition deleted the node state of
the "lost" node (including master scores).
Then that node rejoined.

Now, I have a cib without that master score, an attrd with that master score
value still "cached", and some periodic monitor that will just reset this same
(already cached in attrd) master score.
But that apparently will never reach the CIB.

So.
Question is: anyone seen anything like that before?
Could that be fixed already?
Version in that scenario was: 1.1.20+ (almost .21).

Obviously "stonith" would have fixed it,
then that node would not have just rejoined, but rebooted, then rejoined,
and its attrd would not have any cached values anymore ;-)

I suppose attrd attributes should sync with the last CIB on re-join?
I'd hope it does something like that already?
If it does nothing yet, then maybe that's the obvious fix.
If it does something, then maybe this boils down to some funky timing issue?

How would I go about trying to create a reproducer?

Thanks,

Lars

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] [EXTERNAL] Re: "node is unclean" leads to gratuitous reboot

2019-07-11 Thread Lars Ellenberg
On Wed, Jul 10, 2019 at 06:15:56PM +, Michael Powell wrote:
> Thanks to you and Andrei for your responses.  In our particular
> situation, we want to be able to operate with either node in
> stand-alone mode, or with both nodes protected by HA.  I did not
> mention this, but I am working on upgrading our product
> from a version which used Pacemaker version 1.0.13 and Heartbeat
> to run under CentOS 7.6 (later 8.0).
> The older version did not exhibit this behavior, hence my concern.

Heartbeat by default has much less aggressive timeout settings,
and clearly distinguishes between "deadtime", and "initdead",
basically a "wait_for_all" with timeout: how long to wait for other
nodes during startup before declaring them dead and proceeding in
the startup sequence, ultimately fencing unseen nodes anyways.

Pacemaker itself has "dc-deadtime", documented as
"How long to wait for a response from other nodes during startup.",
but the 20s default of that in current Pacemaker is much likely
shorter than what you had as initdead in your "old" setup.

So maybe if you set dc-deadtime to two minutes or something,
that would give you the "expected" behavior?

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Trying to Understanding crm-fence-peer.sh

2019-01-16 Thread Lars Ellenberg
On Wed, Jan 16, 2019 at 04:27:18PM +0100, Valentin Vidic wrote:
> On Wed, Jan 16, 2019 at 04:20:03PM +0100, Valentin Vidic wrote:
> > I think drbd always calls crm-fence-peer.sh when it becomes disconnected
> > primary.  In this case storage1 has closed the DRBD connection and
> > storage2 has become a disconnected primary.
> > 
> > Maybe the problem is the order that the services are stopped during
> > reboot. It would seem that drbd is shutdown before pacemaker. You
> > can try to run manually:
> > 
> >   pacemaker stop
> >   corosync stop
> >   drbd stop
> > 
> > and see what happens in this case.
> 
> Some more info here:
> 
> https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_drbd_fencing.html
> 
> So storage2 does not know why the other end disappeared and tries to use
> pacemaker to prevent storage1 from ever becoming a primary.  Only when
> it comes back online and gets in sync it is allowed to start again as a
> pacemaker resource by a second script:
> 
>   after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";

Though that should be "unfence-peer" nowadays, and no longer overload
the after-resync-target handler, which actually has a different purpose.

To clarify: crm-fence-peer.sh is an *example implementation*
(even though an elaborate one) of a DRBD fencing policy handler,
which uses pacemaker location constraints on the Master role
if DRBD is not sure about the up-to-date-ness of that instance,
to ban nodes from taking over the Master role.

It does NOT trigger node level fencing.
But it has to wait for, and rely on, pacemaker node level fencing.

That script is heavily commented, btw,
so you should be able to follow
what it tries to do, and even why.

Other implementations of drbd fencing policy handlers may directly
escalate to node level fencing. If that is what you want, use one of
those, and effectively map every DRBD replication link hiccup to a hard
reset of the peer.

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Corosync.log Growing at 1Gb in 15min

2018-06-20 Thread Lars Ellenberg
On Wed, Jun 20, 2018 at 01:58:18PM +0200, Mark Prinsloo wrote:
> Good Day,
> 
> We had a node failure and we had to rebuild the whole server from scratch.
> After rebuilding Asterisk01 and rejoining it to the cluster something isn't
> working 100%.
> 
> It does fail over but the corosync.log file on the second node is growing
> at 1Gb every 15min.
> I have attached a piece of the log file and you can see the logs get filled
> with the same set of messages.
> 
> Something isn't working properly but I can't figure out what exactly.

cib_perform_op:  Discarding update with feature set '3.0.14' greater than 
our own '3.0.10' 

What that tries to say is:
"The other node runs a sofware version that is too recent,
 I fear I might not understand it, so I reject it."

Your options, as far as I can see them:
- upgrade the older pacemaker to the same version as on the other node,

- or tell your CIB on the newer version to only use the "feature set"
  up to 3.0.10, and only validate with whatever the older version of
  pacemaker can handle.

To do that, unless you are actually using some incompatible recent
features from the feature set denoted by 3.0.14, which I think is unlikely,
you should be able to dump the cib (cibadmin -Q > tmp.xml),
edit the "crm_feature_set=3.0.14" and the "validate-with" to something
the older version can understand (e.g. 3.0.10 and pacemaker-2.0),
and "reimport" that (cibadmin -R -x tmp.xml).
Or something like that.
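
Spelled out, roughly (untested; sanity-check before replacing):

    cibadmin -Q > tmp.xml
    sed -i -e 's/crm_feature_set="3\.0\.14"/crm_feature_set="3.0.10"/' \
           -e 's/validate-with="[^"]*"/validate-with="pacemaker-2.0"/' tmp.xml
    crm_verify --xml-file tmp.xml   # does it still validate?
    cibadmin -R -x tmp.xml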

> Should I rather rebuild the cluster?

Maybe. If so, try to not mix pacemaker versions,
or at least have all nodes be present and connected first,
before you start configuring resources; if that is not possible,
activate the *oldest* pacemaker version first.


Currently you are in an "election / integration" loop.
Pacemaker should handle that situation in a more graceful way,
but unfortunately it does not, logging this in a busy loop
(info level stripped away):

> Jun 20 13:09:45  kpasterisk02   crmd:   notice: do_dc_join_finalize:  
> join-17178295: Syncing the CIB from kpasterisk01-ha to the rest of the cluster
> Jun 20 13:09:45  kpasterisk02   crmd:   notice: do_dc_join_finalize:  
> Requested version <cib crm_feature_set="3.0.14" validate-with="pacemaker-2.3" epoch="46" num_updates="1" admin_epoch="0" 
> cib-last-written="Tue Jun 19 21:02:12 2018" update-origin="kpasterisk01-ha" 
> update-client="crmd" update-user="hacluster" have-quorum="1"/>
> Jun 20 13:09:45  kpasterisk02cib:error: cib_perform_op:   
> Discarding update with feature set '3.0.14' greater than our own '3.0.10'
> Jun 20 13:09:45  kpasterisk02cib:error: cib_process_request:  
> Completed cib_replace operation for section 'all': Protocol not supported 
> (rc=-93, origin=kpasterisk01-ha/crmd/###, version=0.45.44)
> Jun 20 13:09:45  kpasterisk02   crmd:error: finalize_sync_callback:   
> Sync from kpasterisk01-ha failed: Protocol not supported
> Jun 20 13:09:45  kpasterisk02   crmd:  warning: do_log:   FSA: Input 
> I_ELECTION_DC from finalize_sync_callback() received in state S_FINALIZE_JOIN


-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] pengine bug? Recovery after monitor failure: Restart of DRBD does not restart Filesystem -- unless explicit order start before promote on DRBD

2018-01-22 Thread Lars Ellenberg
On Fri, Jan 19, 2018 at 04:52:40PM -0600, Ken Gaillot wrote:

> Your constraints are:
> 
>   place IP then place drbd instance(s) with it
>   start IP then start drbd instance(s)
> 
>   place drbd master then place fs with it
>   promote drbd master then start fs
> 
> I'm guessing you meant to colocate the drbd *master* with the IP, and
> "start IP then promote drbd" -- otherwise you can never have more than
> one drbd instance. That doesn't have any relevance to the problem,
> though.

In the real config, it would be a "stacked" DRBD setup,
which only has one instance (per "site").

> I also see you have clone-max="1". Interestingly, if we set this to
> "2", it now restarts the fs, but it only promotes drbd (which is
> already master).
> 
> > Is (was?) this a pengine bug?
> 
> Definitely. :-(
> 
> I confirmed the behavior on Pacemaker 1.1.12 as well, so it's not
> something new. This will require further investigation.

 :-(

Let me know if I can help somehow.

Workaround available, though it's non-obvious
(add the "stupid" constraint).

I think it also works ok with the "equivalent" resource set constraints.

Cheers,

Lars


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] pengine bug? Recovery after monitor failure: Restart of DRBD does not restart Filesystem -- unless explicit order start before promote on DRBD

2018-01-11 Thread Lars Ellenberg

To understand some weird behavior we observed,
I dumbed down a production config to three dummy resources,
while keeping some descriptive resource ids (ip, drbd, fs).

For some reason, the constraints are:
stuff, more stuff, IP -> DRBD -> FS -> other stuff.
(In the actual real-world config, it makes somewhat more sense,
but it reproduces with just these three resources)

All is running just fine.

Online: [ ava emma ]
 virtual_ip (ocf::pacemaker:Dummy): Started ava
 Master/Slave Set: ms_drbd_r0 [p_drbd_r0]
 Masters: [ ava ]
 p_fs_drbd1 (ocf::pacemaker:Dummy): Started ava

If I simulate a monitor failure on IP:
# crm_simulate -L -i virtual_ip_monitor_3@ava=1

Transition Summary:
 * Recover virtual_ip   (Started ava)
 * Restart p_drbd_r0:0  (Master ava)

Which in real life will obviously fail,
because we cannot "restart" (demote) a DRBD
while it is still in use (mounted, in this case).

Only if I add a stupid intra-resource order constraint that explicitly
states to first start, then promote on the DRBD itself,
I get the result I would have expected:

Transition Summary:
 * Recover virtual_ip   (Started ava)
 * Restart p_drbd_r0:0  (Master ava)
 * Restart p_fs_drbd1   (Started ava)

Interestingly enough, if I simulate a monitor failure on "DRBD" directly,
it is in both cases the expected:

Transition Summary:
 * Recover p_drbd_r0:0  (Master ava)
 * Restart p_fs_drbd1   (Started ava)


What am I missing?

Do we have to "annotate" somewhere that you must not demote something
if it is still "in use" by something else?

Did I just screw up the constraints somehow?
How would the constraints need to look like to get the expected result,
without explicitly adding the first-start-then-promote constraint?

Is (was?) this a pengine bug?



How to reproduce:
=

crm shell style dummy config:
--
node 1: ava
node 2: emma
primitive p_drbd_r0 ocf:pacemaker:Stateful \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave
primitive p_fs_drbd1 ocf:pacemaker:Dummy \
        op monitor interval=20 timeout=40
primitive virtual_ip ocf:pacemaker:Dummy \
        op monitor interval=30s
ms ms_drbd_r0 p_drbd_r0 \
        meta master-max=1 master-node-max=1 clone-max=1 clone-node-max=1
colocation c1 inf: ms_drbd_r0 virtual_ip
colocation c2 inf: p_fs_drbd1:Started ms_drbd_r0:Master
order o1 inf: virtual_ip:start ms_drbd_r0:start
order o2 inf: ms_drbd_r0:promote p_fs_drbd1:start
--

crm_simulate -x bad.xml -i virtual_ip_monitor_3@ava=1

 trying to demote DRBD before umount :-((

adding stupid constraint:

order first-start-then-promote inf: ms_drbd_r0:start ms_drbd_r0:promote

crm_simulate -x good.xml -i virtual_ip_monitor_3@ava=1

  yay, first umount, then demote...

(tested with 1.1.15 and 1.1.16, not yet with more recent code base)


Full good.xml and bad.xml are both attached.

Manipulating constraint in live cib using cibadmin only:
add: cibadmin -C -o constraints -X '<rsc_order id="first-start-then-promote" score="INFINITY" first="ms_drbd_r0" first-action="start" then="ms_drbd_r0" then-action="promote"/>'
del: cibadmin -D -X '<rsc_order id="first-start-then-promote"/>'

Thanks,

Lars



bad.xml.bz2
Description: Binary data


good.xml.bz2
Description: Binary data
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Regression in Filesystem RA

2017-10-16 Thread Lars Ellenberg
On Mon, Oct 16, 2017 at 08:09:21PM +0200, Dejan Muhamedagic wrote:
> Hi,
> 
> On Thu, Oct 12, 2017 at 03:30:30PM +0900, Christian Balzer wrote:
> > 
> > Hello,
> > 
> > 2nd post in 10 years, lets see if this one gets an answer unlike the first
> > one...

Do you want to make me check for the old one? ;-)

> > One of the main use cases for pacemaker here are DRBD replicated
> > active/active mailbox servers (dovecot/exim) on Debian machines. 
> > We've been doing this for a loong time, as evidenced by the oldest pair
> > still running Wheezy with heartbeat and pacemaker 1.1.7.
> > 
> > The majority of cluster pairs is on Jessie with corosync and backported
> > pacemaker 1.1.16.
> > 
> > Yesterday we had a hiccup, resulting in half the machines loosing
> > their upstream router for 50 seconds which in turn caused the pingd RA to
> > trigger a fail-over of the DRBD RA and associated resource group
> > (filesystem/IP) to the other node. 
> > 
> > The old cluster performed flawlessly, the newer clusters all wound up with
> > DRBD and FS resource being BLOCKED as the processes holding open the
> > filesystem didn't get killed fast enough.
> > 
> > Comparing the 2 RAs (no versioning T_T) reveals a large change in the
> > "signal_processes" routine.
> > 
> > So with the old Filesystem RA using fuser we get something like this and
> > thousands of processes killed per second:
> > ---
> > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: 
> > (res_Filesystem_mb07:stop:stdout)   3478  3593   ...
> > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: 
> > (res_Filesystem_mb07:stop:stderr) 
> > cmccmccmccmcmcmcmcmccmccmcmcmcmcmcmcmcmcmcmcmcmccmcm
> > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: 
> > (res_Filesystem_mb07:stop:stdout)   4032  4058   ...
> > ---
> > 
> > Whereas the new RA (newer isn't better) that goes around killing processes
> > individually with beautiful logging was a total fail at about 4 processes
> > per second killed...
> > ---
> > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: 
> > sending signal TERM to: mail42264909  0 09:43 ?S  
> > 0:00 dovecot/imap 
> > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: 
> > sending signal TERM to: mail42294909  0 09:43 ?S  
> > 0:00 dovecot/imap [idling]
> > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: 
> > sending signal TERM to: mail42384909  0 09:43 ?S  
> > 0:00 dovecot/imap 
> > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: 
> > sending signal TERM to: mail42394909  0 09:43 ?S  
> > 0:00 dovecot/imap 
> > ---
> > 
> > So my questions are:
> > 
> > 1. Am I the only one with more than a handful of processes per FS who
> > can't afford to wait hours the new routine to finish?
> 
> The change was introduced about five years ago.

Also, usually there should be no process left anymore,
because whatever is using the Filesystem should have its own RA,
which should have appropriate constraints,
which means it should have been called and "stop"ped first,
before the Filesystem stop and umount. Only the "accidental,
stray, abandoned, idle-since-three-weeks operator shell session
that happened to cd into that file system" is supposed to be around
*unexpectedly* and in need of killing, and not "thousands of service
processes, expectedly".

So arguably your setup is broken,
relying on a fall-back workaround
which used to "perform" better.

The bug is not that this fall-back workaround now
has pretty printing and is much slower (and eventually times out),
the bug is that you don't properly kill the service first.
[and that you don't have fencing].
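
In crm shell terms that means something like this, where p_dovecot stands
in for however the mail service is actually managed:

    colocation col_dovecot_on_fs inf: p_dovecot res_Filesystem_mb10
    order ord_fs_before_dovecot inf: res_Filesystem_mb10:start p_dovecot:start

With that in place, a fail-over stops dovecot before the umount is even
attempted, and the Filesystem agent has (almost) nothing left to kill.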

> > 2. Can we have the old FUSER (kill) mode back?
> 
> Yes. I'll make a pull request.

Still, that's a sane thing to do,
thanks, dejanm.

Maybe we can even come up with a way
to both "pretty print" and kill fast?

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Warning: Data Corruption Issue Discovered in DRBD 8.4 and 9.0

2017-10-16 Thread Lars Ellenberg
On Tue, Sep 26, 2017 at 07:17:15AM +, Eric Robinson wrote:
> > I don't know the tool, but isn't the expectation a bit high that the tool 
> > will trim
> > the correct blocks through drbd->LVM/mdadm->device? Why not use the tool
> > on the affected devices directly?
> > 
> 
> I did, and the corruption did not occur. It only happened when writing
> through the DRBD layer. Also, I disabled the TRIM function of the
> tools and merely used it as a drive burn-in without triggering any
> trim commands. Same results.

Just to put this out here too,
and not only in the corresponding thread on the DRBD list:

"the tool" is broken.

Below is very slightly edited for typo and context
from my original post on drbd-user yesterday.

2017-10-12 11:14:55 +0200, Robert Altnoeder @ linbit on drbd-user:
> ... this program does not appear to be very trustworthy, because ...
> of incorrect datatypes for the purpose - apparently, it is attempting to
> memory-map quite large files (~ 70 GiB) and check using a byte-indexed
> offset declared as type 'unsigned', which is commonly only 32 bits wide,
> and therefore inadequate for the byte-wise indexing of anything that is
> larger than 4 GiB.

The test program has other issues as well,
like off-by-one (and thus stack corruption) when initializing the
"buffer" in its "writeAtomically",
unlinking known non-existent files, and other things.
Probably harmless.

But this 32bit index vs >= 4 GiByte file content
is the real bug here, thank you Robert for pointing that out.


Why it does not trigger if DRBD is not in the stack I cannot tell,
maybe the timing is just strangely skewed, and somehow your disk fills
up and everything terminates before the "DetectCorruption" thread tries
to check a >= 4GiB file for the first time.

Or the 8 writer threads starve out the single mmap reader so violently,
that is simply checks so slow it did not get around to the >= 4GiB files.

Anyways: what happens is:

https://github.com/algolia/trimtester/blob/48c44d5beb88/trimtester.cpp#L213

void _checkFile(const std::string &path, const char *file, std::string &filename) {
filename.resize(0);
filename.append(path);
filename.push_back('/');
filename.append(file);
MMapedFile mmap(filename.c_str());
if (mmap.loaded()) {
bool corrupted = false;
// Detect all 512-bytes page inside the file filled by 0 -> can be caused by a buggy Trim
for (unsigned i = 0; !corrupted && i < mmap.len(); i += 512) {

// after some number of iterations,
// i = 4294966784, i.e. 2 ** 32 - 512;
// mmap.len() however is *larger*.
// in the "i < mmap.len()", the 32bit integer i is "upscaled",
// size-extended, before the comparison, so that remains true.

if (mmap.len() - i > 4) { // only check page > 4-bytes to avoid false positive

// again, size-extension to 64bit, condition is true

bool pagecorrupted = true; 

// *assume* that the "page" was corrupted,

for (unsigned j = i; j < mmap.len() && j < (i + 512); ++j) {

// j = i, which is j = 4294966784, (i < mmap.len()) is again true because
// of the size-extension of i to 64bit in that term,
// but for the (j < i + 512) term, neither j nor i is size-extended,
// i + 512 wraps to 0, j < 0 is false,
// loop will not execute even once,
// which means no single byte is checked

if (mmap.content()[j] != 0)
pagecorrupted = false;
}
if (pagecorrupted)
corrupted = true;

// any we "won" a "corrupted" flag by simply "assuming"
// no bytes are bad bytes.
// "So sad." ;-)

}
}
if (corrupted) {
std::cerr << "Corrupted file found: " << filename << std::endl;
exit(1);
}

}
}


Just change "unsigned" to "uint64_t" there, and be happy.


Don't believe it?
Create any file of 4 GiB or larger,
make sure it does not contain 512 (aligned) consecutive zeros,
and "check" it for "corruption" with that logic of trimtester.
It will report that file as corrupted each time.


rm -rf trimtester-is-broken/
mkdir trimtester-is-broken
o=trimtester-is-broken/x1
echo X > $o
l=$o
for i in `seq 2 32`; do
    o=trimtester-is-broken/x$i;
    cat $l $l > $o ;
    rm -f $l;
    l=$o;
done
./TrimTester trimtester-is-broken

Wahwahwa Corrupted file found: trimtester-is-broken/x32 mimimimi


Thanks,
that was a nice exercise in proofreading C++ code.



-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosy

Re: [ClusterLabs] big trouble with a DRBD resource

2017-08-10 Thread Lars Ellenberg
On Wed, Aug 09, 2017 at 06:48:01PM +0200, Lentes, Bernd wrote:
> 
> 
> - Am 8. Aug 2017 um 15:36 schrieb Lars Ellenberg 
> lars.ellenb...@linbit.com:
>  
> > crm shell in "auto-commit"?
> > never seen that.
> 
> i googled for "crmsh autocommit pacemaker" and found that: 
> https://github.com/ClusterLabs/crmsh/blob/master/ChangeLog
> See line 650. Don't know what that means.
> > 
> > You are sure you did not forget this necessary piece?
> > ms WebDataClone WebData \
> >meta master-max="1" master-node-max="1" clone-max="2"
> >clone-node-max="1" notify="true"
> 
> I didn't come so far. I followed that guide 
> (http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Clusters_from_Scratch/index.html#_configure_the_cluster_for_drbd),
> but didn't use the shadow cib.

if you use crmsh "interactively",
crmsh does implicitly use a shadow cib,
and will only commit changes once you "commit",
see "crm configure help commit"

At least that's my experience with crmsh for the last nine years or so.
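
For a non-trivial change, using an explicit shadow CIB makes that even more
obvious (prompts from memory, approximate):

    # crm
    crm(live)# cib new sandbox
    crm(sandbox)# configure
      ... edit; "commit" here only touches the shadow ...
    crm(sandbox)configure# up
    crm(sandbox)# cib diff            # review against the live CIB
    crm(sandbox)# cib commit sandbox  # push to the live cluster when happy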

> The cluster is in testing, not in production, so i thought "nothing
> severe can happen". Misjudged. My error.
> After configuring the primitive without the ms clone my resource
> ClusterMon reacted promptly and sent 2 snmp traps to my management
> station in 193 seconds, which triggered 2 e-Mails ...
> I understand now that the cluster missed the ms clone configuration.
> But so much traps in such a short period. Is that intended ? Or a bug ?

If you configure a resource to fail immediately,
but in a way that pacemaker thinks can be "recovered" from
by stopping and restarting, then pacemaker will do so.
If that results in 2 "actions" within 192 seconds,
that's 100 actions per second, then that seems "quick",
but not a bug per se.
if every single such action triggers a trap,
because you configured the system to send traps for every action,
that's yet a different thing.

So what now?
Where exactly is the "big trouble with DRBD"?
Someone was "almost" following some tutorial, and got in trouble.

How could we keep that from happening to the next person?
Any suggestions which component or behavior we should improve, and how?

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] big trouble with a DRBD resource

2017-08-08 Thread Lars Ellenberg
On Fri, Aug 04, 2017 at 06:20:22PM +0200, Lentes, Bernd wrote:
> Hi,
> 
> first: is there a tutorial or s.th. else which helps in understanding what 
> pacemaker logs in syslog and /var/log/cluster/corosync.log ?
> I try hard to find out what's going wrong, but they are difficult to 
> understand, also because of the amount of information.
> Or should i deal more with "crm histroy" or hb_report ?
> 
> What happened:
> I tried to configure a simple drbd resource following 
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Clusters_from_Scratch/index.html#idm140457860751296
> I used this simple snip from the doc:
> configure primitive WebData ocf:linbit:drbd params drbd_resource=wwwdata \
> op monitor interval=60s
> 
> I did it on live cluster, which is in testing currently. I will never do this 
> again. Shadow will be my friend.
> 
> The cluster reacted promptly:
> crm(live)# configure primitive prim_drbd_idcc_devel ocf:linbit:drbd params 
> drbd_resource=idcc-devel \
>> op monitor interval=60
> WARNING: prim_drbd_idcc_devel: default timeout 20s for start is smaller than 
> the advised 240
> WARNING: prim_drbd_idcc_devel: default timeout 20s for stop is smaller than 
> the advised 100
> WARNING: prim_drbd_idcc_devel: action monitor not advertised in meta-data, it 
> may not be supported by the RA


crm shell in "auto-commit"?
never seen that.

You are sure you did not forget this necessary piece?
ms WebDataClone WebData \
meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true"

anyways: you somehow managed to configure drbd as primitive only,
it does not like that.

If you ever are stuck in a situation like that,
I suggest you put your cluster in "maintenance mode",
then fix up your configuration
(remove the primitive, or add the ms definition),
do cleanups for "everything",
simulate the "maintenance mode off",
and if that looks plausible, commit the maintenance mode off.
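
Roughly, with crmsh (from memory, untested):

    # crm configure
    crm(live)configure# property maintenance-mode=true
    crm(live)configure# commit
      ... fix the configuration: delete the lone primitive, or add the
          missing ms definition; then clean up failed actions ...
    crm(live)configure# property maintenance-mode=false
    crm(live)configure# ptest     # does the resulting transition look sane?
    crm(live)configure# commit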


Also, even though that has nothing to do with your issue there:
just because you *can* do dual-primary DRBD + GFS2 does not mean that it
is a good idea. That "Cluster from scratch" is a prove of concept,
NOT a "best practices how to set up a web server on pacemaker and DRBD"

If you don't have a *very* good reason to use a cluster file
system, then for things like web servers, mail servers, file servers,
... most services, actually, a "classic" file system such as xfs or
ext4 in a failover configuration will usually easily outperform a
two-node GFS2 setup, while being less complex at the same time.

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] ocf_take_lock is NOT actually safe to use

2017-06-21 Thread Lars Ellenberg

Repost to a wider audience, to raise awareness for this.
ocf_take_lock may or may not be better than nothing.

It at least "annotates" that the auther would like to protect something
that is considered a "critical region" of the resource agent.

At the same time, it does NOT deliver what the name seems to imply.

I think I brought this up a few times over the years, but was not noisy
enough about it, because it seemed not important enough: no-one was
actually using this anyways.

But since new usage has been recently added with
[ClusterLabs/resource-agents] targetcli lockfile (#917)
here goes:

On Wed, Jun 07, 2017 at 02:49:41PM -0700, Dejan Muhamedagic wrote:
> On Wed, Jun 07, 2017 at 05:52:33AM -0700, Lars Ellenberg wrote:
> > Note: ocf_take_lock is NOT actually safe to use.
> > 
> > As implemented, it uses "echo $pid > lockfile" to create the lockfile,
> > which means if several such "ocf_take_lock" happen at the same time,
> > they all "succeed", only the last one will be the "visible" one to future 
> > waiters.
> 
> Ugh.

Exactly.

Reproducer:
#
#!/bin/bash
export OCF_ROOT=/usr/lib/ocf/ ;
.  /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs ;

x() (
    ocf_take_lock dummy-lock ;
    ocf_release_lock_on_exit dummy-lock ;
    set -C;
    echo x > protected && sleep 0.15 && rm -f protected || touch BROKEN;
);

mkdir -p /run/ocf_take_lock_demo
cd /run/ocf_take_lock_demo
rm -f BROKEN; i=0;
time while ! test -e BROKEN; do
x &  x &
wait;
i=$(( i+1 ));
done ;
test -e BROKEN && echo "reproduced race in $i iterations"
#

x() above takes, and, because of the () subshell and
ocf_release_lock_on_exit, releases the "dummy-lock",
and within the protected region of code,
creates and removes a file "protected".

If ocf_take_lock was good, there could never be two instances
inside the lock, so echo x > protected should never fail.

With the current implementation of ocf_take_lock,
it takes "just a few" iterations here to reproduce the race.
(usually within a minute).

The races I see in ocf_take_lock:
"creation race":
test -e $lock
# someone else may create it here
echo $$ > $lock
# but we override it with ours anyways

"still empty race":
test -e $lock   # maybe it already exists (open O_CREAT|O_TRUNC)
# but does not yet contain target pid,
pid=`cat $lock` # this one is empty,
kill -0 $pid# and this one fails
and thus a "just being created" one is considered stale

There are other problems around "stale pid file detection",
but let's not go into that minefield right now.

> > Maybe we should change it to 
> > ```
> > while ! ( set -C; echo $pid > lockfile ); do
> > if test -e lockfile ; then
> > : error handling for existing lockfile, stale lockfile detection
> > else
> > : error handling for not being able to create lockfile
> > fi
> > done
> > : only reached if lockfile was successfully created
> > ```
> > 
> > (or use flock or other tools designed for that purpose)
> 
> flock would probably be the easiest. mkdir would do too, but for
> upgrade issues.

and, being part of util-linux, flock should be available "everywhere".

but because writing "wrappers" around flock similar to the intended
semantics of ocf_take_lock and ocf_release_lock_on_exit is not easy
either, usually you'd be better off using flock directly in the RA.

so, still trying to do this with shell:

"set -C" (respectively set -o noclober):
If set, disallow existing regular files to be overwritten
by redirection of output.

normal '>' means: O_WRONLY|O_CREAT|O_TRUNC,
set -C '>' means: O_WRONLY|O_CREAT|O_EXCL

using "set -C ; echo $$ > $lock" instead of 
"test -e $lock || echo $$ > $lock"
gets rid of the "creation race".

To get rid of the "still empty race",
we'd need to play games with hardlinks:

(
    set -C
    success=false
    if echo $$ > .$lock ; then
        ln .$lock $lock && success=true
        rm -f .$lock
    fi
    $success
)

That should be "good enough",
and much better than what we have now.



back to a possible "flock" wrapper,
maybe something like this:

ocf_synchronize() {
    local lockfile=$1
    shift
    (
        flock -x 8 || exit 1
        ( "$@" ) 8>&-
    ) 8> "$lockfile"
}
# and then
ocf_synchronize my_exclusive_shell_function with some args


Re: [ClusterLabs] Pacemaker 1.1.17-rc1 now available

2017-05-09 Thread Lars Ellenberg
Yay!

On Mon, May 08, 2017 at 07:50:49PM -0500, Ken Gaillot wrote:
> "crm_attribute --pattern" to update or delete all node
> attributes matching a regular expression

Just a nit, but "pattern" usually is associated with "glob pattern".
If it's not a "pattern" but a "regex",
"--regex" would be more appropriate.

 :-)

Cheers,

Lars


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Coming in Pacemaker 1.1.17: start a node in standby

2017-04-27 Thread Lars Ellenberg
On Thu, Apr 27, 2017 at 09:19:55AM +0200, Jehan-Guillaume de Rorthais wrote:
> > > > I seem to remember that at some deployment,
> > > > we set the node instance attribute standby=on, always,
> > > > and took it out of standby using the node_state transient_attribute :-)
> > > > 
> > > > As in
> > > > # crm node standby ava

> > > > # crm node status-attr ava set standby off

> > Well, you want the "persistent" setting "on",
> > and override it with a "transient" setting "off".

> Quick questions:
> 
>   * is it what happen in the CIB when you call crm_standby?

crm_standby --node emma --lifetime reboot --update off
crm_standby --node emma --lifetime forever --update on

(or -n -l -v,
default node is current node,
default lifetime is forever)

>   * is it possible to do the opposite? persistent setting "off" and override 
> it
> with the transient setting?

see above, also man crm_standby,
which again is only a wrapper around crm_attribute.

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Coming in Pacemaker 1.1.17: start a node in standby

2017-04-25 Thread Lars Ellenberg
On Tue, Apr 25, 2017 at 10:27:43AM +0200, Jehan-Guillaume de Rorthais wrote:
> On Tue, 25 Apr 2017 10:02:21 +0200
> Lars Ellenberg <lars.ellenb...@linbit.com> wrote:
> 
> > On Mon, Apr 24, 2017 at 03:08:55PM -0500, Ken Gaillot wrote:
> > > Hi all,
> > > 
> > > Pacemaker 1.1.17 will have a feature that people have occasionally asked
> > > for in the past: the ability to start a node in standby mode.  
> > 
> > 
> > I seem to remember that at some deployment,
> > we set the node instance attribute standby=on, always,
> > and took it out of standby using the node_state transient_attribute :-)
> > 
> > As in
> > # crm node standby ava
> >   [CIB XML elided in the archive: the node's persistent
> >    instance_attributes now carry the nvpair standby="on"]
> 
> This solution seems much more elegant and obvious to me. A cli
> (crm_standby?) interface would be ideal.
> 
> It feels weird to mix setup interfaces (through crm_standby or through the
> config file) to manipulate the same node attribute. Isn't it possible to set
> the standby instance attribute of a node **before** it is added to the 
> cluster?
> 
> > # crm node status-attr ava set standby off
> >   [CIB XML elided in the archive: the node_state element
> >    (crm-debug-origin="do_update_resource" join="member" expected="member")
> >    whose transient_attributes now carry the nvpair standby="off"]
> 
> It is not really straight forward to understand why you need to edit a second
> different nvpair to exit the standby mode... :/

Well, you want the "persistent" setting "on",
and override it with a "transient" setting "off".

That's how to do it in pacemaker.

But yes, what exactly has ever been "obvious" in pacemaker,
before you knew?  :-) (or HA in general, to be fair)

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Coming in Pacemaker 1.1.17: start a node in standby

2017-04-25 Thread Lars Ellenberg
On Mon, Apr 24, 2017 at 03:08:55PM -0500, Ken Gaillot wrote:
> Hi all,
> 
> Pacemaker 1.1.17 will have a feature that people have occasionally asked
> for in the past: the ability to start a node in standby mode.


I seem to remember that at some deployment,
we set the node instance attribute standby=on, always,
and took it out of standby using the node_state transient_attribute :-)

As in
# crm node standby ava
  [CIB XML elided in the archive: the node's persistent instance_attributes
   now carry the nvpair standby="on"]

# crm node status-attr ava set standby off
  [CIB XML elided in the archive: the node_state's transient_attributes
   now carry the nvpair standby="off", overriding the persistent "on"]

Seemed to work good enough at the time.

Though that may have been an un-intentional side-effect
of checking both sets of attributes?

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [ClusterLabs Developers] checking all procs on system enough during stop action?

2017-04-24 Thread Lars Ellenberg
On Mon, Apr 24, 2017 at 04:34:07PM +0200, Jehan-Guillaume de Rorthais wrote:
> Hi all,
> 
> In the PostgreSQL Automatic Failover (PAF) project, one of most frequent
> negative feedback we got is how difficult it is to experience with it because 
> of
> fencing occurring way too frequently. I am currently hunting this kind of
> useless fencing to make life easier.
> 
> It occurs to me, a frequent reason of fencing is because during the stop
> action, we check the status of the PostgreSQL instance using our monitor
> function before trying to stop the resource. If the function does not return
> OCF_NOT_RUNNING, OCF_SUCCESS or OCF_RUNNING_MASTER, we just raise an error,
> leading to a fencing. See:
> https://github.com/dalibo/PAF/blob/d50d0d783cfdf5566c3b7c8bd7ef70b11e4d1043/script/pgsqlms#L1291-L1301
> 
> I am considering adding a check to define if the instance is stopped even if 
> the
> monitor action returns an error. The idea would be to parse **all** the local
> processes looking for at least one pair of "/proc//{comm,cwd}" related to
> the PostgreSQL instance we want to stop. If none are found, we consider the
> instance is not running. Gracefully or not, we just know it is down and we can
> return OCF_SUCCESS.
> 
> Just for completeness, the piece of code would be:
> 
>my @pids;
>foreach my $f (glob "/proc/[0-9]*") {
>push @pids => basename($f)
>if -r $f
>and basename( readlink( "$f/exe" ) ) eq "postgres"
>and readlink( "$f/cwd" ) eq $pgdata;
>}
> 
> I feels safe enough to me. The only risk I could think of is in a shared disk
> cluster with multiple nodes accessing the same data in RW (such setup can
> fail in so many ways :)). However, PAF is not supposed to work in such 
> context,
> so I can live with this.
> 
> Do you guys have some advices? Do you see some drawbacks? Hazards?

Isn't that the wrong place to "fix" it?
Why did your _monitor  return something "weird"?
What did it return?
Should you not fix it there?

Just thinking out loud.

Cheers,
Lars


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Syncing data and reducing CPU utilization of cib process

2017-04-03 Thread Lars Ellenberg
On Mon, Apr 03, 2017 at 03:44:21PM +0530, Nikhil Utane wrote:
> Here's the snapshot. As seen below, the messages are coming at more than a
> second frequency.
> I checked that the cib.xml file was not updated (no change to timestamp of
> file)
> Then i took tcpdump and did not see any message other than keep-alives.
> Is the cib process looping incorrectly?
> Can share strace output if required.
> 
> Apr 03 14:48:28 [6372] 0005B932ED72cib: info:
> crm_compress_string:  Compressed 427943 bytes into 13559 (ratio 31:1ms
> Apr 03 14:48:29 [6372] 0005B932ED72cib: info:
> crm_compress_string:  Compressed 427943 bytes into 13536 (ratio 31:1ms
> Apr 03 14:48:29 [6372] 0005B932ED72cib: info:
> crm_compress_string:  Compressed 427943 bytes into 13551 (ratio 31:1ms
> Apr 03 14:48:30 [6372] 0005B932ED72cib: info:
> crm_compress_string:  Compressed 427943 bytes into 13552 (ratio 31:1ms
> Apr 03 14:48:31 [6372] 0005B932ED72cib: info:
> crm_compress_string:  Compressed 427943 bytes into 13537 (ratio 31:1ms
> Apr 03 14:48:32 [6372] 0005B932ED72cib: info:
> crm_compress_string:  Compressed 427943 bytes into 13534 (ratio 31:1ms
> Apr 03 14:48:32 [6372] 0005B932ED72cib: info:
> crm_compress_string:  Compressed 427943 bytes into 13546 (ratio 31:1ms

Each and every "cibadmin -Q" (or equivalent) will trigger that,
also for local IPC.

Stop polling the cib several times per second.

If you have to, "subscribe" to cib updates, using the API.

And stop pushing that much data into the cib.
Maybe, as a stop gap, compress it yourself,
before you stuff it into the cib.
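
If the bulk data really has to go through the cib for now, a minimal
sketch of that stop gap (attribute name and payload path are made up
for illustration):

	# write: compress + base64, store as a node attribute
	blob=$(gzip -c /path/to/payload | base64 -w0)
	crm_attribute --node "$(uname -n)" --name my-blob --update "$blob"

	# read it back
	crm_attribute --node "$(uname -n)" --name my-blob --query --quiet \
		| base64 -d | gunzip > /path/to/payload.out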


-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] PCMK_OCF_DEGRADED (_MASTER): exit codes are mapped to PCMK_OCF_UNKNOWN_ERROR

2017-03-06 Thread Lars Ellenberg
On Thu, Mar 02, 2017 at 05:31:33PM -0600, Ken Gaillot wrote:
> On 03/01/2017 05:28 PM, Andrew Beekhof wrote:
> > On Tue, Feb 28, 2017 at 12:06 AM, Lars Ellenberg
> > <lars.ellenb...@linbit.com> wrote:
> >> When I recently tried to make use of the DEGRADED monitoring results,
> >> I found out that it does still not work.
> >>
> >> Because LRMD chooses to filter them in ocf2uniform_rc(),
> >> and maps them to PCMK_OCF_UNKNOWN_ERROR.
> >>
> >> See patch suggestion below.
> >>
> >> It also filters away the other "special" rc values.
> >> Do we really not want to see them in crmd/pengine?
> > 
> > I would think we do.

> >> Note: I did build it, but did not use this yet,
> >> so I have no idea if the rest of the implementation of the DEGRADED
> >> stuff works as intended or if there are other things missing as well.
> > 
> > failcount might be the other place that needs some massaging.
> > specifically, not incrementing it when a degraded rc comes through
> 
> I think that's already taken care of.
> 
> >> Thoughts?
> > 
> > looks good to me
> > 
> >>
> >> diff --git a/lrmd/lrmd.c b/lrmd/lrmd.c
> >> index 724edb7..39a7dd1 100644
> >> --- a/lrmd/lrmd.c
> >> +++ b/lrmd/lrmd.c
> >> @@ -800,11 +800,40 @@ hb2uniform_rc(const char *action, int rc, const char 
> >> *stdout_data)
> >>  static int
> >>  ocf2uniform_rc(int rc)
> >>  {
> >> -if (rc < 0 || rc > PCMK_OCF_FAILED_MASTER) {
> >> -return PCMK_OCF_UNKNOWN_ERROR;
> 
> Let's simply use > PCMK_OCF_OTHER_ERROR here, since that's guaranteed to
> be the high end.
> 
> Lars, do you want to test that?

Why would we want to filter at all, then?

I get it that we may want to map non-ocf agent exit codes
into the "ocf" range,
but why mask exit codes from "ocf" agents at all (in lrmd)?

Lars


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] PCMK_OCF_DEGRADED (_MASTER): exit codes are mapped to PCMK_OCF_UNKNOWN_ERROR

2017-02-27 Thread Lars Ellenberg
When I recently tried to make use of the DEGRADED monitoring results,
I found out that it does still not work.

Because LRMD chooses to filter them in ocf2uniform_rc(),
and maps them to PCMK_OCF_UNKNOWN_ERROR.

See patch suggestion below.

It also filters away the other "special" rc values.
Do we really not want to see them in crmd/pengine?
Why does LRMD think it needs to outsmart the pengine?

Note: I did build it, but did not use this yet,
so I have no idea if the rest of the implementation of the DEGRADED
stuff works as intended or if there are other things missing as well.

Thoughts?

diff --git a/lrmd/lrmd.c b/lrmd/lrmd.c
index 724edb7..39a7dd1 100644
--- a/lrmd/lrmd.c
+++ b/lrmd/lrmd.c
@@ -800,11 +800,40 @@ hb2uniform_rc(const char *action, int rc, const char 
*stdout_data)
 static int
 ocf2uniform_rc(int rc)
 {
-if (rc < 0 || rc > PCMK_OCF_FAILED_MASTER) {
-return PCMK_OCF_UNKNOWN_ERROR;
+switch (rc) {
+default:
+   return PCMK_OCF_UNKNOWN_ERROR;
+
+case PCMK_OCF_OK:
+case PCMK_OCF_UNKNOWN_ERROR:
+case PCMK_OCF_INVALID_PARAM:
+case PCMK_OCF_UNIMPLEMENT_FEATURE:
+case PCMK_OCF_INSUFFICIENT_PRIV:
+case PCMK_OCF_NOT_INSTALLED:
+case PCMK_OCF_NOT_CONFIGURED:
+case PCMK_OCF_NOT_RUNNING:
+case PCMK_OCF_RUNNING_MASTER:
+case PCMK_OCF_FAILED_MASTER:
+
+case PCMK_OCF_DEGRADED:
+case PCMK_OCF_DEGRADED_MASTER:
+   return rc;
+
+#if 0
+   /* What about these?? */
+/* 150-199 reserved for application use */
+PCMK_OCF_CONNECTION_DIED = 189, /* Operation failure implied by 
disconnection of the LRM API to a local or remote node */
+
+PCMK_OCF_EXEC_ERROR= 192, /* Generic problem invoking the agent */
+PCMK_OCF_UNKNOWN   = 193, /* State of the service is unknown - used 
for recording in-flight operations */
+PCMK_OCF_SIGNAL= 194,
+PCMK_OCF_NOT_SUPPORTED = 195,
+PCMK_OCF_PENDING   = 196,
+PCMK_OCF_CANCELLED = 197,
+PCMK_OCF_TIMEOUT   = 198,
+PCMK_OCF_OTHER_ERROR   = 199, /* Keep the same codes as PCMK_LSB */
+#endif
 }
-
-return rc;
 }
 
 static int

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-22 Thread Lars Ellenberg
On Thu, Sep 22, 2016 at 08:01:44AM +0200, Klaus Wenninger wrote:
> On 09/22/2016 06:34 AM, renayama19661...@ybb.ne.jp wrote:
> > Hi Klaus,
> >
> > Thank you for comment.
> >
> > Okay!
> >
> > Will it mean that improvement is considered in community in future?
> 
> Speaking for me I'd like to have some feedback if we might
> have overseen something so that it is rather a config issue.
> 
> One of my current projects is to introduce improved
> observation of pacemaker_remoted by sbd. (Saying improved
> here because there is already something when you enable
> pacemaker-watcher on remote-nodes but it creates unneeded
> watchdog-reboots in a couple of cases ...)
> Looks as if some additional (direct) communication (heartbeat -
> the principle not the communication & membership for
> clusters) between pacemaker_remoted (very similar to lrmd)
> and sbd would come handy for that.
> 
> So in this light it might make sense
> to consider expanding that for crmd as well ...
> 
> If we are finally facing an issue I'd herewith like to ask for
> input.

In a somewhat extended context, there used to be "apphbd",
which itself would register with some watchdog to "monitor" itself,
and which "applications" would register with to negotiate their own
"application heartbeat".
Not necessarily only components of the cluster manager,
but cluster aware "resources" as well.

If they fail to feed their app hb, apphbd would then "trigger a
notification", and some other entity would react on that based on
yet another configuration.  And plugins.
Didn't old heartbeat like the concept of plugins...

Anyways, you get the idea.

Currently, we have SBD chosen as such a "watchdog proxy",
maybe we can generalize it?

All of that would require cooperation within the node itself, though.

In this scenario, the cluster is not trusting the "sanity"
of the "commander in chief".

So maybe in addition of this "in-node application heartbeat",
all non-DCs should periodically actively challenge the sanity
of the DC from the outside, and trigger re-election if they have
"reasonable doubt"?


-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

2016-09-20 Thread Lars Ellenberg
On Tue, Sep 20, 2016 at 11:44:58AM +, Auer, Jens wrote:
> Hi,
> 
> I've decided to create two answers for the two problems. The cluster
> still fails to relocate the resource after unloading the modules even
> with resource-agents 3.9.7

From the point of view of the resource agent,
you configured it to use a non-existing network.
Which it considers to be a configuration error,
which is treated by pacemaker as
"don't try to restart anywhere
but let someone else configure it properly, first".

I think the OCF_ERR_CONFIGURED is good, though, otherwise 
configuration errors might go unnoticed for quite some time.
A network interface is not supposed to "vanish".

You may disagree with that choice,
in which case you could edit the resource agent to treat it not as
configuration error, but as "required component not installed"
(OCF_ERR_CONFIGURED vs OCF_ERR_INSTALLED), and pacemaker will
"try to find some other node with required components available",
before giving up completely.

Still, I have yet to see what scenario you are trying to test here.
To me, this still looks like "scenario evil admin".  If so, I'd not even
try, at least not on the pacemaker configuration level.

> CONFIDENTIALITY NOTICE:

Oh please :-/
This is a public mailing list.

> There seems to be some difference because the device is not RUNNING;

> Also the netmask and the ip address are wrong. I have configured the
> device to 192.168.120.10 with netmask 192.168.120.10. How does IpAddr2
> get the wrong configuration? I have no idea.

A netmask of "192.168.120.10" is nonsense.
That is the address, not a mask.

Also, according to some posts back,
you have configured it in pacemaker with
cidr_netmask=32, which is not particularly useful either.

You should use the netmask of whatever subnet is supposedly actually
reachable via that address and interface. Typical masks are e.g.
/24, /20, /16 resp. 255.255.255.0, 255.255.240.0, 255.255.0.0
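
For example (crm shell syntax; the /24 here is an assumption about the
actual subnet, adjust it to whatever is really routed on bond0):

	primitive p_vip ocf:heartbeat:IPaddr2 \
		params ip=192.168.120.10 cidr_netmask=24 nic=bond0 \
		op monitor interval=10s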

Apparently the RA is "nice" enough (or maybe buggy enough)
to let that slip, and guess the netmask from the routing tables,
or fall back to whatever builtin defaults there are on the various
layers of tools involved.

Again: the IPaddr2 resource agent is supposed to control the assignment
of an IP address, hence the name.

It is not supposed to create or destroy network interfaces,
or configure bonding, or bridges, or anything like that.

In fact, it is not even supposed to bring up or down the interfaces,
even though for "convenience" it seems to do "ip link set up".

That is not a bug, but limited scope.

If you wanted to test the reaction of the cluster to a vanishing
IP address, the correct test would be an
  "ip addr del 192.168.120.10 dev bond0"

And the expectation is that it will notice, and just re-add the address.
That is the scope of the IPaddr2 resource agent.

Monitoring connectivity, or dealing with removed interface drivers,
or unplugged devices, or whatnot, has to be dealt with elsewhere.

What you did is: down the bond, remove all slave assignments, even
remove the driver, and expect the resource agent to "heal" things that
it does not know about. It can not.

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

2016-09-19 Thread Lars Ellenberg
On Mon, Sep 19, 2016 at 02:57:57PM +0200, Jan Pokorný wrote:
> On 19/09/16 09:15 +, Auer, Jens wrote:
> > After the restart ifconfig still shows the device bond0 to be not RUNNING:
> > MDA1PFP-S01 09:07:54 2127 0 ~ # ifconfig
> > bond0: flags=5123<UP,BROADCAST,MASTER,MULTICAST>  mtu 1500
> > inet 192.168.120.20  netmask 255.255.255.255  broadcast 0.0.0.0
> > ether a6:17:2c:2a:72:fc  txqueuelen 3  (Ethernet)
> > RX packets 2034  bytes 286728 (280.0 KiB)
> > RX errors 0  dropped 29  overruns 0  frame 0
> > TX packets 2284  bytes 355975 (347.6 KiB)
> > TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> 
> This seems to suggest bond0 interface is up and address-assigned
> (well, the netmask is strange).  So there would be nothing
> contradictory to what I said on the address of IPaddr2.
> 
> Anyway, you should rather be using "ip" command from iproute suite
> than various if* tools that come short in some cases:
> http://inai.de/2008/02/19
> This would also be consistent with IPaddr2 uses under the hood.

The resource agent only controls and checks
the presence of a certain IP on a certain NIC
(and some parameters).

What you likely ended up with after the "restart"
is an "empty" bonding device with that IP assigned,
but without any "slave" devices, or at least
with the slave devices still set to link down.

If you really wanted the RA to also know about the slaves,
and be able to properly and fully configure a bonding,
you'd have to enhance that resource agent.

If you want the IP to move to some other node,
if it has connectivity problems, use a "ping" and/or
"ethmonitor" resource in addition to the IP.

If you wanted to test-drive cluster response against a
failing network device, your test was wrong.

If you wanted to test-drive cluster response against
a "fat fingered" (or even evil) operator or admin:
give up right there...
You'll never be able to cover it all :-)


-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] ocf:linbit:drbd Deprecated? Not.

2016-09-16 Thread Lars Ellenberg
On Thu, Sep 15, 2016 at 11:41:12PM +, Jason A Ramsey wrote:
> I note from http://linux-ha.org/doc/man-pages/re-ra-drbd.html that this 
> resource agent is deprecated…? What’s the alternative?

Nope.  There had been, a long time ago,
an "ocf:HEARTBEAT:drbd" resource agent.
and it was considered not good enough.

That's why we provide the ocf:LINBIT:drbd resource agent.
Which you are supposed to use with DRBD 8.4 on Pacemaker.


-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] ocf scripts shell and local variables

2016-08-31 Thread Lars Ellenberg
On Wed, Aug 31, 2016 at 12:29:59PM +0200, Dejan Muhamedagic wrote:
> > Also remember that sometimes we set a "local" variable in a function
> > and expect it to be visible in nested functions, but also set a new
> > value in a nested function and expect that value to be reflected
> > in the outer scope (up to the last "local").
> 
> I hope that this wasn't (ab)used much, it doesn't sound like it
> would be easy to follow.
> 
> > diff --git a/heartbeat/ocf-shellfuncs.in b/heartbeat/ocf-shellfuncs.in
> > index 6d9669d..4151630 100644
> > --- a/heartbeat/ocf-shellfuncs.in
> > +++ b/heartbeat/ocf-shellfuncs.in
> > @@ -920,3 +920,37 @@ ocf_is_true "$OCF_TRACE_RA" && ocf_start_trace
> >  if ocf_is_true "$HA_use_logd"; then
> > : ${HA_LOGD:=yes}
> >  fi
> > +
> > +# We use a lot of function local variables with the "local" keyword.
> > +# Which works fine with dash and bash,
> > +# but does not work with e.g. ksh.
> > +# Fail cleanly with a sane error message,
> > +# if the current shell does not feel compatible.
> > +
> > +__ocf_check_for_incompatible_shell_l2()
> > +{
> > +   [ $__ocf_check_for_incompatible_shell_k = v1 ] || return 1
> > +   local __ocf_check_for_incompatible_shell_k=v2
> > +   [ $__ocf_check_for_incompatible_shell_k = v2 ] || return 1
> > +   return 0
> > +}
> > +
> > +__ocf_check_for_incompatible_shell_l1()
> > +{
> > +   [ $__ocf_check_for_incompatible_shell_k = v0 ] || return 1
> 
> If there's no "local" and that in the function below fails, won't
> this produce a syntax error (due to __ocf_..._k being undefined)?

Which is ok with me, still return 1 ;-)

> > +   local __ocf_check_for_incompatible_shell_k=v1
> > +   __ocf_check_for_incompatible_shell_l2

This ^^ needs to be:
> > +   __ocf_check_for_incompatible_shell_l2 || return 1

> > +   [ $__ocf_check_for_incompatible_shell_k = v1 ] || return 1
> > +   return 0
> > +}
> > +
> > +__ocf_check_for_incompatible_shell()
> > +{
> > +   local __ocf_check_for_incompatible_shell_k=v0

Similarly, this:
> > +   __ocf_check_for_incompatible_shell_l1

should be
> > +   __ocf_check_for_incompatible_shell_l1 || return 1

> > +   [ $__ocf_check_for_incompatible_shell_k = v0 ] && return 0
> > +   ocf_exit_reason "Current shell seems to be incompatible. We suggest 
> > dash or bash (compatible)."
> > +   exit $OCF_ERR_GENERIC
> > +}
> > +
> > +__ocf_check_for_incompatible_shell
> 
> Looks good otherwise. If somebody's willing to test it on
> solaris...

There is a ksh93 for linux as well, and it appears to be very similar to
the one apparently shipped with solaris. But yes, you are right ;-)

Lars


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] ocf scripts shell and local variables

2016-08-30 Thread Lars Ellenberg
On Tue, Aug 30, 2016 at 06:15:49PM +0200, Dejan Muhamedagic wrote:
> On Tue, Aug 30, 2016 at 10:08:00AM -0500, Dmitri Maziuk wrote:
> > On 2016-08-30 03:44, Dejan Muhamedagic wrote:
> > 
> > >The kernel reads the shebang line and it is what defines the
> > >interpreter which is to be invoked to run the script.
> > 
> > Yes, and does the kernel read when the script is source'd or executed via
> > any of the mechanisms that have the executable specified in the call,
> > explicitly or implicitly?
> 
> I suppose that it is explained in enough detail here:
> 
> https://en.wikipedia.org/wiki/Shebang_(Unix)
> 
> In particular:
> 
> https://en.wikipedia.org/wiki/Shebang_(Unix)#Magic_number
> 
> > >None of /bin/sh RA requires bash.
> > 
> > Yeah, only "local".
> 
> As already mentioned elsewhere in the thread, local is supported
> in most shell implementations and without it we otherwise
> wouldn't to be able to maintain software. Not sure where local
> originates, but wouldn't bet that it's bash.

Let's just agree that as currently implemented,
our collection of /bin/sh scripts won't run on ksh as shipped with
solaris (while there likely are ksh derivatives in *BSD somewhere
that would be mostly fine with them).

And before this turns even more into a "yes, I'm that old, too" thread,
may I suggest to document that we expect a
"dash compatible" /bin/sh, and that we expect scripts
to have a bash shebang (or as appropriate) if they go beyond that.

Then check for incompatible shells in ocf-shellfuncs,
and just exit early if we detect incompatibilities.

For a suggestion on checking for a proper "local" see below.
(Add more checks later, if someone feels like it.)

Though, if someone rewrites not the current agents, but the "lib/ocf*"
help stuff to be sourced by shell based agents in a way that would
support RAs in all bash, dash, ash, ksh, whatever,
and the result turns out not too much worse than what we have now,
I'd have no problem with that...

Cheers,

Lars


And for the "typeset" crowd,
if you think s/local/typeset/ was all that was necessary
to support function local variables in ksh, think again:

ksh -c '
function a {
echo "start of a: x=$x"
typeset x=a
echo "before b: x=$x"
b
echo "end of a: x=$x"
}
function b {
echo "start of b: x=$x ### HAHA guess this one was unexpected 
to all but ksh users"
typeset x=b
echo "end of b: x=$x"
}
x=x
echo "before a: x=$x"
a
echo "after a: x=$x"
'
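
If I remember the ksh93 scoping rules correctly (treat the exact output
as an assumption), the run above prints roughly:

	before a: x=x
	start of a: x=x
	before b: x=a
	start of b: x=x ### HAHA ...
	end of b: x=b
	end of a: x=a
	after a: x=x

i.e. the nested function does not see the caller's typeset variable,
while bash, with its dynamic scoping, prints "start of b: x=a" there.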

Try the same with bash.
Also remember that sometimes we set a "local" variable in a function
and expect it to be visible in nested functions, but also set a new
value in a nested function and expect that value to be reflected
in the outer scope (up to the last "local").




diff --git a/heartbeat/ocf-shellfuncs.in b/heartbeat/ocf-shellfuncs.in
index 6d9669d..4151630 100644
--- a/heartbeat/ocf-shellfuncs.in
+++ b/heartbeat/ocf-shellfuncs.in
@@ -920,3 +920,37 @@ ocf_is_true "$OCF_TRACE_RA" && ocf_start_trace
 if ocf_is_true "$HA_use_logd"; then
: ${HA_LOGD:=yes}
 fi
+
+# We use a lot of function local variables with the "local" keyword.
+# Which works fine with dash and bash,
+# but does not work with e.g. ksh.
+# Fail cleanly with a sane error message,
+# if the current shell does not feel compatible.
+
+__ocf_check_for_incompatible_shell_l2()
+{
+   [ $__ocf_check_for_incompatible_shell_k = v1 ] || return 1
+   local __ocf_check_for_incompatible_shell_k=v2
+   [ $__ocf_check_for_incompatible_shell_k = v2 ] || return 1
+   return 0
+}
+
+__ocf_check_for_incompatible_shell_l1()
+{
+   [ $__ocf_check_for_incompatible_shell_k = v0 ] || return 1
+   local __ocf_check_for_incompatible_shell_k=v1
+   __ocf_check_for_incompatible_shell_l2
+   [ $__ocf_check_for_incompatible_shell_k = v1 ] || return 1
+   return 0
+}
+
+__ocf_check_for_incompatible_shell()
+{
+   local __ocf_check_for_incompatible_shell_k=v0
+   __ocf_check_for_incompatible_shell_l1
+   [ $__ocf_check_for_incompatible_shell_k = v0 ] && return 0
+   ocf_exit_reason "Current shell seems to be incompatible. We suggest 
dash or bash (compatible)."
+   exit $OCF_ERR_GENERIC
+}
+
+__ocf_check_for_incompatible_shell


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] ocf scripts shell and local variables

2016-08-29 Thread Lars Ellenberg
On Mon, Aug 29, 2016 at 04:37:00PM +0200, Dejan Muhamedagic wrote:
> Hi,
> 
> On Mon, Aug 29, 2016 at 02:58:11PM +0200, Gabriele Bulfon wrote:
> > I think the main issue is the usage of the "local" operator in ocf*
> > I'm not an expert on this operator (never used!), don't know how hard it is 
> > to replace it with a standard version.
> 
> Unfortunately, there's no command defined in POSIX which serves
> the purpose of local, i.e. setting variables' scope. "local" is,
> however, supported in almost all shells (including most versions
> of ksh, but apparently not the one you run) and hence we
> tolerated that in /bin/sh resource agents.

local variables in shell:

  dash (which we probably need to support) knows about "local",
  and as far as I know, nothing else.

  Some versions of dash treat "local a=A b=B"
  different from "local a=A; local b=B;"

  bash knows about typeset (which it considers obsolete),
  declare (which is the replacement for typeset)
  and local (which is mostly, but not completely, identical to declare).

  ksh can do function local variables with "typeset",
  but only in functions defined with the function keyword,
  NOT in functions that are defined with the "name()" syntax.

function definitions in shell:

  ksh treats "function x {}" and "x() {}" differently (see above)
  bash knows both "function name {}" syntax, and "name() { }" syntax,
  and treats them identically,
  but dash only knows "name() {}" syntax. (at least in my version...)

that's all broken.  always was.

The result is that it is not possible to write shell scripts
using functions with local variables that run in
dash, bash and ksh.

And no, I strongly do not think that we should "fall back" to the
"art" of shell syntax and idioms that was force on you by the original"
choose-your-brand-and-year-and-version shell, just because some
"production systems" still have /bin/sh point to whatever it was
their oldest ancestor system shipped with in the 19sixties...

Maybe we should simply put some sanity check into
one of the first typically sourced helper "include" scripts,
and bail out early with a sane message if it looks like it won't work?

And also package all shell scripts with a shebang of
/opt/bin/bash (or whatever) for non-linux systems?

Lars


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Fwd: FW: heartbeat can monitor virtual IP alive or not .

2016-04-28 Thread Lars Ellenberg
On Thu, Apr 21, 2016 at 01:18:13AM +0800, fu ml wrote:
> Ha.cf:
>
> The question is we want heartbeat monitor virtual IP,
> 
> If this virtual IP on Linux01 can’t ping or respond ,
> 
> We want Linux02 auto take over this service IP Regardless of Linux01’s
> Admin IP is alive or not,
> 
> 
> 
> We try modify ha.cf as following (ex. Linux01):
> 
> 1)ucast eth0 10.88.222.53
> 2)ucast eth0:0 10.88.222.53
> 3)ucast eth0 10.88.222.51 & ucast eth0 10.88.222.53
> 4)ucast eth0 10.88.222.51 & ucast eth0:0 10.88.222.53

> We test the four type option but all failed,

Just to clarify:
in ha.cf, you tell heartbeat which infrastructure to use
for cluster communications.
That means, IPs and NICs you mention there must already exist.

In haresources,
you'd put the resources the cluster is supposed to manage.
That could be an IP address.

But no, *heartbeat* in haresources mode
does NOT do resource monitoring.
It does node alive checks based on heartbeats,
it re-acts on node-dead events only.
For resource monitoring, you'd have to combine it with pacemaker.
(or, like in the old days, mon, or similar stuff). But don't.

If you need more than "node-dead" detection,
what you should do for a new system is:
==> use pacemaker on corosync.

Or, if all you are going to manage is a bunch of IP addresses,
maybe you should choose a different tool, VRRP with keepalived
may be better for your needs.
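
For comparison, a minimal keepalived sketch for the same kind of
floating address could be as small as the following (the address,
interface, router id and priority are made up for illustration):

	vrrp_instance VI_1 {
	    state MASTER
	    interface eth0
	    virtual_router_id 51
	    priority 100
	    advert_int 1
	    virtual_ipaddress {
	        10.88.222.100
	    }
	}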


-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Coming in 1.1.15: Event-driven alerts

2016-04-25 Thread Lars Ellenberg
On Thu, Apr 21, 2016 at 12:50:43PM -0500, Ken Gaillot wrote:
> Hello everybody,
> 
> The release cycle for 1.1.15 will be started soon (hopefully tomorrow)!
> 
> The most prominent feature will be Klaus Wenninger's new implementation
> of event-driven alerts -- the ability to call scripts whenever
> interesting events occur (nodes joining/leaving, resources
> starting/stopping, etc.).

What exactly is "etc." here?
What is the comprehensive list
of which "events" will trigger "alerts"?

My guess would be
 DC election/change
   which does not necessarily imply membership change
 change in membership
   which includes change in quorum
 fencing events
   (even failed fencing?)
 resource start/stop/promote/demote
  (probably) monitor failure?
   maybe only if some fail-count changes to/from infinity?
   or above a certain threshold?

 change of maintenance-mode?
 node standby/online (maybe)?
 maybe "resource cannot be run anywhere"?

would it be useful to pass in the "transaction ID"
or other pointer to the recorded cib input at the time
the "alert" was triggered?

can an alert "observer" (alert script) "register"
for only a subset of the "alerts"?

if so, can this filter be per alert script,
or per "recipient", or both?

Thanks,

Lars


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] getting "Totem is unable to form a cluster" error

2016-04-12 Thread Lars Ellenberg
On Mon, Apr 11, 2016 at 08:23:03AM +0200, Jan Friesse wrote:

...
>>> bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
>>>   link/ether 74:e6:e2:73:e5:61 brd ff:ff:ff:ff:ff:ff
>>>   inet 10.150.20.91/24 brd 10.150.20.55 scope global bond0
>>>   inet 192.168.150.12/22 brd 192.168.151.255 scope global bond0:cluster
>>>   inet6 fe80::76e6:e2ff:fe73:e561/64 scope link
>>>  valid_lft forever preferred_lft forever
>>
>> This is ifconfig output? I'm just wondering how you were able to set
>> two ipv4 addresses (in this format, I would expect another interface
>> like bond0:1 or nothing at all)?
...

No, it is "ip addr show" output.

> RHEL 6:
> 
> # tunctl -p
> Set 'tap0' persistent and owned by uid 0
> 
> # ip addr add 192.168.7.1/24 dev tap0
> # ip addr add 192.168.8.1/24 dev tap0
> # ifconfig tap0
> tap0  Link encap:Ethernet  HWaddr 22:95:B1:85:67:3F
>   inet addr:192.168.7.1  Bcast:0.0.0.0  Mask:255.255.255.0
>   BROADCAST MULTICAST  MTU:1500  Metric:1
>   RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:500
>   RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)


# ip addr add 192.168.7.1/24 dev tap0
# ip addr add 192.168.8.1/24 dev tap0 label tap0:jan
# ip addr show dev tap0

And as long as you actually use those "label"s,
you then can even see these with "ifconfig tap0:jan".


-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Set "start-failure-is-fatal=false" on only one resource?

2016-03-25 Thread Lars Ellenberg
On Fri, Mar 25, 2016 at 04:08:48PM +, Sam Gardner wrote:
> On 3/25/16, 10:26 AM, "Lars Ellenberg" <lars.ellenb...@linbit.com> wrote:
> 
> 
> >On Thu, Mar 24, 2016 at 09:01:18PM +, Sam Gardner wrote:
> >> I'm having some trouble on a few of my clusters in which the DRBD Slave
> >>resource does not want to come up after a reboot until I manually run
> >>resource cleanup.
> >
> >Logs?
> 
> syslog has some relevant info, but I can't tease anything out of the
> pacemaker logs that looks more useful than the following:
> 
> Mar 25 15:58:49 ha-d2 Filesystem(DRBDFS)[29570]: WARNING: Couldn't find
> device [/dev/drbd/by-res/wwwdata/0]. Expected /dev/??? to exist
> Mar 25 15:58:49 ha-d2 drbd(DRBDSlave)[29689]: ERROR: wwwdata: Called
> drbdadm -c /etc/drbd.conf syncer wwwdata

My best guess is that your drbd-utils version
(which provides the DRBD ocf agent)
is unfortunately a broken one.

please upgrade, and complain to your distribution/vendor/package provider.

For details, see
http://git.linbit.com/drbd-utils.git/shortlog
specifically
http://git.linbit.com/drbd-utils.git/commitdiff/b766142
http://git.linbit.com/drbd-utils.git/commitdiff/449fafe

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Set "start-failure-is-fatal=false" on only one resource?

2016-03-25 Thread Lars Ellenberg
On Thu, Mar 24, 2016 at 09:01:18PM +, Sam Gardner wrote:
> I'm having some trouble on a few of my clusters in which the DRBD Slave 
> resource does not want to come up after a reboot until I manually run 
> resource cleanup.

Logs?

I mean, to get a failure count,
you have to have some operation fail.
And you should figure out which, when, and why.

Is it the start that fails?
Why does it fail?

Cheers,

Lars

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker startup-fencing

2016-03-19 Thread Lars Ellenberg
On Wed, Mar 16, 2016 at 01:47:52PM +0100, Ferenc Wágner wrote:
> >> And some more about fencing:
> >>
> >> 3. What's the difference in cluster behavior between
> >>- stonith-enabled=FALSE (9.3.2: how often will the stop operation be 
> >> retried?)
> >>- having no configured STONITH devices (resources won't be started, 
> >> right?)
> >>- failing to STONITH with some error (on every node)
> >>- timing out the STONITH operation
> >>- manual fencing
> >
> > I do not think there is much difference. Without fencing pacemaker
> > cannot make decision to relocate resources so cluster will be stuck.
> 
> Then I wonder why I hear the "must have working fencing if you value
> your data" mantra so often (and always without explanation).  After all,
> it does not risk the data, only the automatic cluster recovery, right?

stonith-enabled=false
means:
if some node becomes unresponsive,
it is immediately *assumed* it was "clean" dead.
no fencing takes place,
resource takeover happens without further protection.

That very much risks at least data divergence (replicas evolving
independently), if not data corruption (shared disks and the like).
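
For completeness, enabling fencing is just the cluster property plus at
least one working fence device; a sketch in crm shell syntax, with
purely illustrative device parameters:

	property stonith-enabled=true
	primitive fence_nodeA stonith:fence_ipmilan \
		params ip=10.0.0.1 username=admin password=secret \
		       pcmk_host_list=nodeA \
		op monitor interval=60s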

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R, Integration, Ops, Consulting, Support

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Linux-HA] Anyone successfully install PAcemaker/Corosync on Freebsd?

2016-02-10 Thread Lars Ellenberg

Moving to users@clusterlabs.org.

On Sat, Dec 19, 2015 at 06:47:54PM -0400, mike wrote:
> Hi All,
> 
> just curious if anyone has had any luck at one point installing
> Pacemaker and Corosync on FreeBSD.

According to pacemaker changelog, at least
David Shane Holden <dpejesh@...> and Ruben Kerkhof <ruben@...>
have been submitting pull requests recently with freebsd compat fixes,
maybe they can help?

   Lars

> I've run into an issue when running ./configure while trying to
> install Corosync. The process craps out at nss with this error:
> checking for nss... configure: error: in `/root/heartbeat/corosync-2.3.3':
> configure: error: The pkg-config script could not be found or is too
> old. Make sure it
> is in your PATH or set the PKG_CONFIG environment variable to the full
> path to pkg-config.​
> Alternatively, you may set the environment variables nss_CFLAGS
> and nss_LIBS to avoid the need to call pkg-config.
> See the pkg-config man page for more details.
> 
> I've looked unsuccessfully for a package called pkg-config and nss
> appears to be installed as you can see from this output:
> root@wellesley:~/heartbeat/corosync-2.3.3 # pkg install nss
> Updating FreeBSD repository catalogue...
> FreeBSD repository is up-to-date.
> All repositories are up-to-date.
> Checking integrity... done (0 conflicting)
> The most recent version of packages are already installed
> 
> Anyway - just looking for any suggestions. Hoping that perhaps
> someone has successfully done this.
> 
> thanks in advance
> -mgb

-- 
: Lars Ellenberg
: http://www.LINBIT.com

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Help needed getting DRBD cluster working

2015-11-30 Thread Lars Ellenberg
On Tue, Oct 06, 2015 at 10:13:00AM -0500, Ken Gaillot wrote:
> > ms ms_drbd0 drbd_disc0 \
> > meta master-max="1" master-node-max="1" clone-max="2" 
> > clone-node-max="1" notify="true" target-role="Started"
> 
> You want to omit target-role, or set it to "Master". Otherwise both
> nodes will start as slaves.

That is incorrect.  "Started" != "Slave"

target-role "Started" actually means "default for the resource being
handled" (the same as if you just removed that target-role attribute),
which in this case means "start up to clone-max instances,
then of those promote up to master-max instances"

target-role Slave would in fact prohibit promotion.

and target-role Master would, back in the day, trigger a pacemaker bug
where it would try to fulfill target-role, and happened to ignore
master-max, trying to promote all instances everywhere ;-)

not set: default behaviour
started: same as not set
slave:   do not promote
master:  nowadays for ms resources same as "Started" or not set,
 but used to trigger some nasty "promote everywhere" bug
 (a few years back)
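
So, in practice, the ms definition from the original post would simply
drop the target-role meta attribute (sketch):

	ms ms_drbd0 drbd_disc0 \
		meta master-max=1 master-node-max=1 \
		     clone-max=2 clone-node-max=1 notify=true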

-- 
: Lars Ellenberg
: http://www.LINBIT.com | Your Way to High Availability
: DRBD, Linux-HA  and  Pacemaker support and consulting

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] gfs2 crashes when i, e.g., dd to a lvm volume

2015-10-09 Thread Lars Ellenberg
On Thu, Oct 08, 2015 at 01:50:50PM +0200, J. Echter wrote:
> Hi,
> 
> i have a strange issue on CentOS 6.5
> 
> If i install a new vm on node1 it works well.
> 
> If i install a new vm on node2 it gets stuck.
> 
> Same if i do a dd if=/dev/zero of=/dev/DATEN/vm-test (on node2)
> 
> On node1 it works:
> 
> dd if=/dev/zero of=vm-test
> dd: writing to "vm-test": No space left on device
> 83886081+0 records in
> 83886080+0 records out
> 42949672960 bytes (43 GB) copied, 2338.15 s, 18.4 MB/s
> 
> 
> dmesg shows the following (while dd'ing on node2):
> 
> INFO: task flush-253:18:9820 blocked for more than 120 seconds.
>   Not tainted 2.6.32-573.7.1.el6.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> flush-253:18  D  0  9820  2 0x0080
>  8805f9bc3490 0046 8805f9bc3458 8805f9bc3454
>  880620768c00 88062fc23080 006fc80f5a1e 8800282959c0
>  02aa 00010002be11 8806241d65f8 8805f9bc3fd8
> Call Trace:
>  [] drbd_al_begin_io+0x1a5/0x240 [drbd]

What DRBD version is this?
What does the IO stack look like?

DRBD seems to block and wait for something in its make request function.
So maybe its backend is blocked for some reason?

You'd see this for example on a thin provisioned backend that is
configured to block when out of physical space...

>  [] ? bio_alloc_bioset+0x5b/0xf0
>  [] ? autoremove_wake_function+0x0/0x40
>  [] drbd_make_request_common+0xf5c/0x14a0 [drbd]
>  [] ? mempool_alloc+0x63/0x140
>  [] ? bio_alloc_bioset+0x5b/0xf0
>  [] ? __map_bio+0xad/0x140 [dm_mod]
>  [] drbd_make_request+0x531/0x870 [drbd]
>  [] ? throtl_find_tg+0x46/0x60
>  [] ? blk_throtl_bio+0x1ea/0x5f0
>  [] ? blk_queue_bio+0x494/0x610
>  [] ? dm_make_request+0x122/0x180 [dm_mod]
>  [] generic_make_request+0x240/0x5a0
>  [] ? mempool_alloc_slab+0x15/0x20
>  [] ? mempool_alloc+0x63/0x140
>  [] ? apic_timer_interrupt+0xe/0x20
>  [] submit_bio+0x70/0x120
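
To answer the two questions above on the affected node, something like
this would already help (the volume group name is taken from the dd
example; adjust as needed):

	cat /proc/drbd     # DRBD version, connection and disk states
	lsblk              # how DRBD, LVM and the VMs stack on each other
	lvs -o lv_name,pool_lv,data_percent,metadata_percent DATEN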

-- 
: Lars Ellenberg
: http://www.LINBIT.com | Your Way to High Availability
: DRBD, Linux-HA  and  Pacemaker support and consulting

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: crm_report consumes all available RAM

2015-10-07 Thread Lars Ellenberg
Something like the below, maybe.
Untested direct-to-email PoC code.

if echo . | grep -q -I . 2>/dev/null; then
	have_grep_dash_I=true
else
	have_grep_dash_I=false
fi
# similar checks can be made for other decompressors

mygrep()
{
	(
	# sub shell for ulimit

	# ulimit -v ... but maybe someone wants to mmap a huge file,
	# and limiting the virtual size cripples mmap unnecessarily,
	# so let's limit resident size instead.  Let's be generous, when
	# decompressing stuff that was compressed with xz -9, we may
	# need ~65 MB according to my man page, and if it was generated
	# by something else, the decompressor may need even more.
	# Grep itself should not use much more than single digit MB,
	# so if the pipeline below needs more than 200 MB resident,
	# we probably are not interested in that file in any case.
	#
	ulimit -m 20

	# Actually no need for "local" anymore,
	# this is a subshell already. Just a habbit.

	local file=$1
	case $file in
	*.bz2) bzgrep "$file";; # or bzip2 -dc  | grep, if you prefer
	*.gz)  zgrep "$file";;
	*.xz)  xzgrep "$file";;
	# ...
	*)
		local file_type=$(file "$file")
		case $file_type in
		*text*)
			grep "$file" ;;
		*)
			# try anyways, let grep use its own heuristic
			$have_grep_dash_I && grep --binary-files=without-match "$file" ;;
		esac ;;
	esac
	)
}


-- 
: Lars Ellenberg
: http://www.LINBIT.com | Your Way to High Availability
: DRBD, Linux-HA  and  Pacemaker support and consulting

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: crm_report consumes all available RAM

2015-10-07 Thread Lars Ellenberg
On Wed, Oct 07, 2015 at 05:39:01PM +0200, Lars Ellenberg wrote:
> Something like the below, maybe.
> Untested direct-to-email PoC code.
> 
> if echo . | grep -q -I . 2>/dev/null; then
>   have_grep_dash_I=true
> else
>   have_grep_dash_I=false
> fi
> # similar checks can be made for other decompressors
> 
> mygrep()
> {
>   (
>   # sub shell for ulimit
> 
>   # ulimit -v ... but maybe someone wants to mmap a huge file,
>   # and limiting the virtual size cripples mmap unnecessarily,
>   # so let's limit resident size instead.  Let's be generous, when
>   # decompressing stuff that was compressed with xz -9, we may
>   # need ~65 MB according to my man page, and if it was generated
>   # by something else, the decompressor may need even more.
>   # Grep itself should not use much more than single digit MB,
>   # so if the pipeline below needs more than 200 MB resident,
>   # we probably are not interested in that file in any case.
>   #
>   ulimit -m 20

Bah. scratch that.
RLIMIT_RSS No longer has any effect on linux 2.6.
so we are back to
ulimit -v 20
> 
>   # Actually no need for "local" anymore,
>   # this is a subshell already. Just a habbit.
> 
>   local file=$1
>   case $file in
>   *.bz2) bzgrep "$file";; # or bzip2 -dc  | grep, if you prefer
>   *.gz)  zgrep "$file";;
>   *.xz)  xzgrep "$file";;
>   # ...
>   *)
>   local file_type=$(file "$file")
>   case $file_type in
>   *text*)
>   grep "$file" ;;
>   *)
>   # try anyways, let grep use its own heuristic
>   $have_grep_dash_I && grep --binary-files=without-match 
> "$file" ;;
>   esac ;;
>   esac
>   )
> }

-- 
: Lars Ellenberg
: http://www.LINBIT.com | Your Way to High Availability
: DRBD, Linux-HA  and  Pacemaker support and consulting

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Small bug in RA heartbeat/syslog-ng

2015-09-22 Thread Lars Ellenberg
On Tue, Sep 22, 2015 at 07:47:22AM +0200, Dejan Muhamedagic wrote:
> Hi,
> 
> On Mon, Sep 21, 2015 at 09:01:07AM +0200, Ulrich Windl wrote:
> > Hi!
> > 
> > Just a small notice: While having a look at the syslog-ng RA, I found this 
> > bug (in SLES11 SP3, resource-agents-3.9.5-0.37.38.19):
> > SYSLOG_NG_EXE="${OCF_RESKEY_syslog_ng_binary-/sbin/syslog-ng}" ### line 237 
> > of /usr/lib/ocf/resource.d/heartbeat/syslog-ng
> > 
> > I tried it in BASH, but if {OCF_RESKEY_syslog_ng_binary is unset, the 
> > default won't be substituted. It's because the correct syntax is:
> > SYSLOG_NG_EXE="${OCF_RESKEY_syslog_ng_binary:-/sbin/syslog-ng}"

That is incorrect.

if OCF_RESKEY_syslog_ng_binary is set to the empty string, the default
won't be substituted.

if it is *unset*, the default will be substituted.

X="V" bash -c 'echo "colon-dash: X=\"${X:-default}\""; echo "dash-only: 
X=\"${X-default}\"";'
colon-dash: X="V"
dash-only: X="V"


X=""  bash -c 'echo "colon-dash: X=\"${X:-default}\""; echo "dash-only: 
X=\"${X-default}\"";'
colon-dash: X="default"
dash-only: X=""

unset X;  bash -c 'echo "colon-dash: X=\"${X:-default}\""; echo "dash-only: 
X=\"${X-default}\"";'
colon-dash: X="default"
dash-only: X="default"


So, unless you happen to have OCF_RESKEY_syslog_ng_binary explicitly set
to the empty string in your environment, things work just fine.
And if you do, then that's the bug.
Which could be worked around by:

> Yes. Interestingly, there's some code to handle that case (but
> commented out):
> 
> # why not default to /sbin/syslog-ng?
> #if [[ -z "$SYSLOG_NG_EXE" ]]; then
> #   ocf_log err "Undefined parameter:syslog_ng_binary"
> #   exit $OCF_ERR_CONFIGURED
> #fi

-- 
: Lars Ellenberg
: http://www.LINBIT.com | Your Way to High Availability
: DRBD, Linux-HA  and  Pacemaker support and consulting

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org