Re: [ClusterLabs] Antw: crm_report consumes all available RAM

2015-10-07 Thread Lars Ellenberg
mit -v ... but maybe someone wants to mmap a huge file, # and limiting the virtual size cripples mmap unnecessarily, # so let's limit resident size instead. Let's be generous, when # decompressing stuff that was compressed with xz -9, we may # need ~65 MB according

Re: [ClusterLabs] Antw: crm_report consumes all available RAM

2015-10-07 Thread Lars Ellenberg
On Wed, Oct 07, 2015 at 05:39:01PM +0200, Lars Ellenberg wrote: > Something like the below, maybe. > Untested direct-to-email PoC code. > > if echo . | grep -q -I . 2>/dev/null; then > have_grep_dash_I=true > else > have_grep_dash_I=false > fi > # simila

Re: [ClusterLabs] gfs2 crashes when i, e.g., dd to a lvm volume

2015-10-09 Thread Lars Ellenberg
est+0x531/0x870 [drbd] > [] ? throtl_find_tg+0x46/0x60 > [] ? blk_throtl_bio+0x1ea/0x5f0 > [] ? blk_queue_bio+0x494/0x610 > [] ? dm_make_request+0x122/0x180 [dm_mod] > [] generic_make_request+0x240/0x5a0 > [] ? mempool_alloc_slab+0x15/0x20 > [] ? mempool_alloc+0x63/0x14

Re: [ClusterLabs] Small bug in RA heartbeat/syslog-ng

2015-09-22 Thread Lars Ellenberg
"default" So, unless you happen to have OCF_RESKEY_syslog_ng_binary explicitly set to the empty string in your environment, things work just fine. And if you do, then that's the bug. Which could be worked around by: > Yes. Interestingly, there's some code to handle that case

Re: [ClusterLabs] Help needed getting DRBD cluster working

2015-11-30 Thread Lars Ellenberg
to fulfill target-role, and happened to ignore master-max, trying to promote all instances everywhere ;-) not set: default behaviour; started: same as not set; slave: do not promote; master: nowadays for ms resources same as "Started" or not set, but used to trigger s

Re: [ClusterLabs] [Linux-HA] Anyone successfully install PAcemaker/Corosync on Freebsd?

2016-02-10 Thread Lars Ellenberg
- just looking for any suggestions. Hoping that perhaps > someone has successfully done this. > > thanks in advance > -mgb -- : Lars Ellenberg : http://www.LINBIT.com

Re: [ClusterLabs] getting "Totem is unable to form a cluster" error

2016-04-12 Thread Lars Ellenberg
dropped:0 overruns:0 frame:0 > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:500 > RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) # ip addr add 192.168.7.1/24 dev tap0 # ip addr add 192.168.8.1/24 dev tap0 label tap0:jan # ip addr

Re: [ClusterLabs] Pacemaker startup-fencing

2016-03-19 Thread Lars Ellenberg
ll, > it does not risk the data, only the automatic cluster recovery, right? stonith-enabled=false means: if some node becomes unresponsive, it is immediately *assumed* it was "clean" dead. no fencing takes place, resource takeover happens without further protection. That very much risks a

Re: [ClusterLabs] Set "start-failure-is-fatal=false" on only one resource?

2016-03-25 Thread Lars Ellenberg
have some operation fail. And you should figure out which, when, and why. Is it the start that fails? Why does it fail? Cheers, Lars -- : Lars Ellenberg : LINBIT | Keeping the Digital World Running : DRBD -- Heartbeat -- Corosync -- Pacemaker : R&D, Integration, Ops, Consulting, Support DRBD® a

Re: [ClusterLabs] Set "start-failure-is-fatal=false" on only one resource?

2016-03-25 Thread Lars Ellenberg
On Fri, Mar 25, 2016 at 04:08:48PM +, Sam Gardner wrote: > On 3/25/16, 10:26 AM, "Lars Ellenberg" <lars.ellenb...@linbit.com> wrote: > > > >On Thu, Mar 24, 2016 at 09:01:18PM +, Sam Gardner wrote: > >> I'm having some trouble on a few of my clust

Re: [ClusterLabs] Fwd: FW: heartbeat can monitor virtual IP alive or not .

2016-04-28 Thread Lars Ellenberg
If you need more than "node-dead" detection, what you should do for a new system is: ==> use pacemaker on corosync. Or, if all you are going to manage is a bunch of IP addresses, maybe you should choose a different tool, VRRP with keepalived may be better for your needs.

Re: [ClusterLabs] Coming in 1.1.15: Event-driven alerts

2016-04-25 Thread Lars Ellenberg
On Thu, Apr 21, 2016 at 12:50:43PM -0500, Ken Gaillot wrote: > Hello everybody, > > The release cycle for 1.1.15 will be started soon (hopefully tomorrow)! > > The most prominent feature will be Klaus Wenninger's new implementation > of event-driven alerts -- the ability to call scripts whenever

[ClusterLabs] PCMK_OCF_DEGRADED (_MASTER): exit codes are mapped to PCMK_OCF_UNKNOWN_ERROR

2017-02-27 Thread Lars Ellenberg
When I recently tried to make use of the DEGRADED monitoring results, I found out that it still does not work. Because LRMD chooses to filter them in ocf2uniform_rc(), and maps them to PCMK_OCF_UNKNOWN_ERROR. See patch suggestion below. It also filters away the other "special" rc values. Do we

Re: [ClusterLabs] ocf scripts shell and local variables

2016-08-30 Thread Lars Ellenberg
On Tue, Aug 30, 2016 at 06:15:49PM +0200, Dejan Muhamedagic wrote: > On Tue, Aug 30, 2016 at 10:08:00AM -0500, Dmitri Maziuk wrote: > > On 2016-08-30 03:44, Dejan Muhamedagic wrote: > > > > >The kernel reads the shebang line and it is what defines the > > >interpreter which is to be invoked to

Re: [ClusterLabs] ocf scripts shell and local variables

2016-08-29 Thread Lars Ellenberg
On Mon, Aug 29, 2016 at 04:37:00PM +0200, Dejan Muhamedagic wrote: > Hi, > > On Mon, Aug 29, 2016 at 02:58:11PM +0200, Gabriele Bulfon wrote: > > I think the main issue is the usage of the "local" operator in ocf* > > I'm not an expert on this operator (never used!), don't know how hard it is >

Re: [ClusterLabs] ocf scripts shell and local variables

2016-08-31 Thread Lars Ellenberg
On Wed, Aug 31, 2016 at 12:29:59PM +0200, Dejan Muhamedagic wrote: > > Also remember that sometimes we set a "local" variable in a function > > and expect it to be visible in nested functions, but also set a new > > value in a nested function and expect that value to be reflected > > in the outer

Re: [ClusterLabs] ocf:linbit:drbd Deprecated? Not.

2016-09-16 Thread Lars Ellenberg
resource agent. and it was considered not good enough. That's why we provide the ocf:linbit:drbd resource agent. Which you are supposed to use with DRBD 8.4 on Pacemaker.

Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

2016-09-19 Thread Lars Ellenberg
"ethmonitor" resource in addition to the IP. If you wanted to test-drive cluster response against a failing network device, your test was wrong. If you wanted to test-drive cluster response against a "fat fingered" (or even evil) operator or admin: give up right there... You'll nev

Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

2016-09-20 Thread Lars Ellenberg
ing with removed interface drivers, or unplugged devices, or whatnot, has to be dealt with elsewhere. What you did is: down the bond, remove all slave assignments, even remove the driver, and expect the resource agent to "heal" things that it does not know about. It can not.

Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-22 Thread Lars Ellenberg
the idea. Currently, we have SBD chosen as such a "watchdog proxy", maybe we can generalize it? All of that would require cooperation within the node itself, though. In this scenario, the cluster is not trusting the "sanity" of the "commander in chief". So maybe in addition of t

Re: [ClusterLabs] Syncing data and reducing CPU utilization of cib process

2017-04-03 Thread Lars Ellenberg
Stop polling the cib several times per second. If you have to, "subscribe" to cib updates, using the API. And stop pushing that much data into the cib. Maybe, as a stop gap, compress it yourself, before you stuff it into the cib.
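The "compress it yourself" stop gap could look roughly like the sketch below. It assumes `gzip` and a `base64` that accepts `-d` (GNU coreutils or busybox); `pack_payload` and `unpack_payload` are hypothetical helper names, and how the resulting string is then written to the cib is deliberately left out.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Read a payload on stdin, write a single-line, text-safe string on stdout
# (gzip for size, base64 so the result is safe to store as an attribute value).
pack_payload() {
    gzip -c | base64 | tr -d '\n'
}

# Reverse of pack_payload: read the packed string on stdin,
# write the original payload on stdout.
unpack_payload() {
    base64 -d | gunzip -c
}
```

Whether this helps depends on how compressible the payload is; base64 adds about a third on top of the compressed size, so small payloads may not shrink at all.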

Re: [ClusterLabs] PCMK_OCF_DEGRADED (_MASTER): exit codes are mapped to PCMK_OCF_UNKNOWN_ERROR

2017-03-06 Thread Lars Ellenberg
On Thu, Mar 02, 2017 at 05:31:33PM -0600, Ken Gaillot wrote: > On 03/01/2017 05:28 PM, Andrew Beekhof wrote: > > On Tue, Feb 28, 2017 at 12:06 AM, Lars Ellenberg > > <lars.ellenb...@linbit.com> wrote: > >> When I recently tried to make use of the DEGRADED monito

Re: [ClusterLabs] big trouble with a DRBD resource

2017-08-10 Thread Lars Ellenberg
On Wed, Aug 09, 2017 at 06:48:01PM +0200, Lentes, Bernd wrote: > > > - Am 8. Aug 2017 um 15:36 schrieb Lars Ellenberg > lars.ellenb...@linbit.com: > > > crm shell in "auto-commit"? > > never seen that. > > i googled for "crmsh autocommit

Re: [ClusterLabs] Coming in Pacemaker 1.1.17: start a node in standby

2017-04-27 Thread Lars Ellenberg
ver) > * is it possible to do the opposite? persistent setting "off" and override > it > with the transient setting? see above, also man crm_standby, which again is only a wrapper around crm_attribute.

Re: [ClusterLabs] Coming in Pacemaker 1.1.17: start a node in standby

2017-04-25 Thread Lars Ellenberg
at the time. Though that may have been an un-intentional side-effect of checking both sets of attributes?

Re: [ClusterLabs] Coming in Pacemaker 1.1.17: start a node in standby

2017-04-25 Thread Lars Ellenberg
On Tue, Apr 25, 2017 at 10:27:43AM +0200, Jehan-Guillaume de Rorthais wrote: > On Tue, 25 Apr 2017 10:02:21 +0200 > Lars Ellenberg <lars.ellenb...@linbit.com> wrote: > > > On Mon, Apr 24, 2017 at 03:08:55PM -0500, Ken Gaillot wrote: > > > Hi all, > > > >

Re: [ClusterLabs] [ClusterLabs Developers] checking all procs on system enough during stop action?

2017-04-24 Thread Lars Ellenberg
On Mon, Apr 24, 2017 at 04:34:07PM +0200, Jehan-Guillaume de Rorthais wrote: > Hi all, > > In the PostgreSQL Automatic Failover (PAF) project, one of most frequent > negative feedback we got is how difficult it is to experience with it because > of > fencing occurring way too frequently. I am

Re: [ClusterLabs] big trouble with a DRBD resource

2017-08-08 Thread Lars Ellenberg
ctices how to set up a web server on pacemaker and DRBD" If you don't have a *very* good reason to use a cluster file system, for things like web servers, mail servers, file servers, ... most services actually, a "classic" file system as xfs or ext4 in failover configuration will usu

Re: [ClusterLabs] Pacemaker 1.1.17-rc1 now available

2017-05-09 Thread Lars Ellenberg
Yay! On Mon, May 08, 2017 at 07:50:49PM -0500, Ken Gaillot wrote: > "crm_attribute --pattern" to update or delete all node > attributes matching a regular expression Just a nit, but "pattern" usually is associated with "glob pattern". If it's not a "pattern" but a "regex", "--regex" would be
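To illustrate why the naming matters, here is a minimal sketch showing the same string read once as a glob and once as a regular expression. It uses plain `case` and `grep -E`, not `crm_attribute` itself; `glob_match` and `regex_match` are hypothetical helper names, and the attribute names are made up.

```shell
#!/bin/sh
# Illustration only: "master-*" interpreted as a glob and as a regex.

# Glob match via case: '*' means "any characters", and the pattern
# has to cover the whole name.
glob_match() {
    # $1 = pattern, $2 = name; the pattern is left unquoted on purpose
    # so that its wildcard characters stay active.
    case $2 in
        $1) return 0 ;;
        *)  return 1 ;;
    esac
}

# Regex match via grep -E: '*' repeats the previous atom (here the '-'),
# and the match is unanchored unless the expression says otherwise.
regex_match() {
    printf '%s\n' "$2" | grep -Eq -- "$1"
}

glob_match  'master-*' 'master-drbd0' && echo "glob:  master-drbd0 matches"
glob_match  'master-*' 'old-master'   || echo "glob:  old-master does not match"
regex_match 'master-*' 'old-master'   && echo "regex: old-master matches"
```

Someone who types `master-*` expecting glob behaviour would, with a regex, also hit every name that merely contains "master".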

[ClusterLabs] ocf_take_lock is NOT actually safe to use

2017-06-21 Thread Lars Ellenberg
lockfile (#917) here goes: On Wed, Jun 07, 2017 at 02:49:41PM -0700, Dejan Muhamedagic wrote: > On Wed, Jun 07, 2017 at 05:52:33AM -0700, Lars Ellenberg wrote: > > Note: ocf_take_lock is NOT actually safe to use. > > > > As implemented, it uses "echo $pid > lockfile"
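The problem with `echo $pid > lockfile` is that a plain redirect happily overwrites an existing file, so two processes can both believe they took the lock. One common alternative is sketched below, assuming a POSIX shell with `set -C` (noclobber). `take_lock` and `release_lock` are hypothetical names, not part of the OCF shell functions, and this is not necessarily the fix discussed in the thread.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Try to create the lock file; succeed only if it did not exist before.
# With noclobber (set -C), the '>' redirect fails when the target is an
# existing regular file instead of truncating it.
take_lock() {
    # The subshell keeps noclobber from leaking into the caller.
    # Note that $$ is still the PID of the main shell, not of the subshell.
    ( set -C; echo "$$" > "$1" ) 2>/dev/null
}

# Drop the lock again. No ownership check is done here.
release_lock() {
    rm -f "$1"
}
```

This sketch does not handle stale locks left behind by a crashed process; a real implementation would need some way to detect and clean those up.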

Re: [ClusterLabs] Warning: Data Corruption Issue Discovered in DRBD 8.4 and 9.0

2017-10-16 Thread Lars Ellenberg
ch time. rm -rf trimtester-is-broken/ mkdir trimtester-is-broken o=trimtester-is-broken/x1 echo X > $o l=$o for i in `seq 2 32`; do o=trimtester-is-broken/x$i; cat $l $l > $o ; rm -f $l; l=$o; done ./TrimTester trimtester-is-broken Wahwahwa Corrupted file

Re: [ClusterLabs] Regression in Filesystem RA

2017-10-16 Thread Lars Ellenberg
all-back workaround which used to "perform" better. The bug is not that this fall-back workaround now has pretty printing and is much slower (and eventually times out), the bug is that you don't properly kill the service first. [and that you don't have fencing]. >

[ClusterLabs] pengine bug? Recovery after monitor failure: Restart of DRBD does not restart Filesystem -- unless explicit order start before promote on DRBD

2018-01-11 Thread Lars Ellenberg
To understand some weird behavior we observed, I dumbed down a production config to three dummy resources, while keeping some descriptive resource ids (ip, drbd, fs). For some reason, the constraints are: stuff, more stuff, IP -> DRBD -> FS -> other stuff. (In the actual real-world config, it

Re: [ClusterLabs] Corosync.log Growing at 1Gb in 15min

2018-06-20 Thread Lars Ellenberg
sion=0.45.44) > Jun 20 13:09:45 kpasterisk02 crmd:error: finalize_sync_callback: > Sync from kpasterisk01-ha failed: Protocol not supported > Jun 20 13:09:45 kpasterisk02 crmd: warning: do_log: FSA: Input > I_ELECTION_DC from finalize_sync_callback() rece

Re: [ClusterLabs] pengine bug? Recovery after monitor failure: Restart of DRBD does not restart Filesystem -- unless explicit order start before promote on DRBD

2018-01-22 Thread Lars Ellenberg
On Fri, Jan 19, 2018 at 04:52:40PM -0600, Ken Gaillot wrote: > Your constraints are: > > place IP then place drbd instance(s) with it > start IP then start drbd instance(s) > > place drbd master then place fs with it > promote drbd master then start fs > > I'm guessing you meant to

Re: [ClusterLabs] Trying to Understanding crm-fence-peer.sh

2019-01-16 Thread Lars Ellenberg
able to follow what it tries to do, and even why. Other implementations of drbd fencing policy handlers may directly escalate to node level fencing. If that is what you want, use one of those, and effectively map every DRBD replication link hiccup to a hard reset of the peer.

Re: [ClusterLabs] [EXTERNAL] Re: "node is unclean" leads to gratuitous reboot

2019-07-11 Thread Lars Ellenberg
has "dc-deadtime", documented as "How long to wait for a response from other nodes during startup.", but the 20s default of that in current Pacemaker is very likely shorter than what you had as initdead in your "old" setup. So maybe if you set dc-deadtime to two minutes or somet

Re: [ClusterLabs] Removing DRBD w/out Data Loss?

2020-09-10 Thread Lars Ellenberg
to just remove the DRBD specific "magic" used by libblkid to identify it. The "without Data Loss" part depends on whether the local copy was "Consistent" (or better yet: UpToDate) before you decided to remove DRBD.

[ClusterLabs] attrd/cib out of sync, master scores not updated in CIB after cluster partition/rejoin

2020-09-10 Thread Lars Ellenberg
Hi there. I've seen a scenario where a network "hiccup" isolated the current DC in a 3 node cluster for a short time; the other partition elected a new DC, obviously, and all node attributes of the former DC are "cleared" together with the rest of its state. All nodes rejoin, "all happy again", BUT

Re: [ClusterLabs] attrd/cib out of sync, master scores not updated in CIB after crmd "Respawn" after internal error [NOT cluster partition/rejoin]

2020-09-10 Thread Lars Ellenberg
Now with "reproducer" ... see below On Thu, Sep 10, 2020 at 11:55:20AM +0200, Lars Ellenberg wrote: > Hi there. > > I've seen a scenario where a network "hiccup" isolated the current DC in a 3 > node cluster for a short time; other partition elected a new DC obvi

Re: [ClusterLabs] attrd/cib out of sync, master scores not updated in CIB after crmd "Respawn" after internal error [NOT cluster partition/rejoin]

2020-09-17 Thread Lars Ellenberg
On Fri, Sep 11, 2020 at 11:42:46AM +0200, Lars Ellenberg wrote: > On Thu, Sep 10, 2020 at 11:18:58AM -0500, Ken Gaillot wrote: > > > But for some unrelated reason (stress on the cib, IPC timeout), > > > crmd on the DC was doing an error exit and was respawned: >

Re: [ClusterLabs] Antw: [EXT] no-quorum-policy=stop never executed, pacemaker stuck in election/integration, corosync running in "new membership" cycles with itself

2021-06-02 Thread Lars Ellenberg
lge> > I would have expected corosync to come back with a "stable lge> > non‑quorate membership" of just itself within a very short lge> > period of time, and pacemaker winning the lge> > "election"/"integration" with just itself, and then trying lge> > to call "stop" on everything it knows about.

[ClusterLabs] no-quorum-policy=stop never executed, pacemaker stuck in election/integration, corosync running in "new membership" cycles with itself

2021-06-01 Thread Lars Ellenberg
pcmk 2.0.5, corosync 3.1.0, knet, rhel8 I know fencing "solves" this just fine. what I'd like to understand though is: what exactly is corosync or pacemaker waiting for here, why does it not manage to get to the stage where it would even attempt to "stop" stuff? two "rings" aka knet interfaces.

Re: [ClusterLabs] DC marks itself as OFFLINE, continues orchestrating the other nodes

2022-09-14 Thread Lars Ellenberg
On Thu, Sep 08, 2022 at 10:11:46AM -0500, Ken Gaillot wrote: > On Thu, 2022-09-08 at 15:01 +0200, Lars Ellenberg wrote: > > Scenario: > > three nodes, no fencing (I know) > > break network, isolating nodes > > unbreak network, see how cluster partitions rejoin and resum

[ClusterLabs] DC marks itself as OFFLINE, continues orchestrating the other nodes

2022-09-08 Thread Lars Ellenberg
Scenario: three nodes, no fencing (I know) break network, isolating nodes unbreak network, see how cluster partitions rejoin and resume service Funny outcome: /usr/sbin/crm_mon -x pe-input-689.bz2 Cluster Summary: * Stack: corosync * Current DC: mqhavm24 (version

Re: [ClusterLabs] The Linux-HA site is down.

2023-06-01 Thread Lars Ellenberg
On Wed, May 03, 2023 at 11:07:20AM -0400, Madison Kelly wrote: > On 2023-05-03 05:26, 黃暄皓 wrote: > > As the title said,is it still in maintenance? > > I'm not sure who even owns or maintains that old domain. We (Linbit) did host that site still, though all of it was supposed to be