ulimit -v ... but maybe someone wants to mmap a huge file,
# and limiting the virtual size cripples mmap unnecessarily,
# so let's limit resident size instead. Let's be generous, when
# decompressing stuff that was compressed with xz -9, we may
# need ~65 MB according
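The intent of the comment above can be sketched as follows (a best-effort sketch: `ulimit -m` sets RLIMIT_RSS, which modern Linux kernels do not actually enforce, and the 80 MB value is an assumption derived from the ~65 MB figure):

```shell
# Cap resident size (not virtual size) in a subshell, so the limit does
# not leak into the rest of the script; ulimit -m takes KiB.
(
    ulimit -m $((80 * 1024)) 2>/dev/null   # be generous: ~80 MB for xz -9
    ulimit -m                              # show the limit now in effect
)
```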
On Wed, Oct 07, 2015 at 05:39:01PM +0200, Lars Ellenberg wrote:
> Something like the below, maybe.
> Untested direct-to-email PoC code.
>
> if echo . | grep -q -I . 2>/dev/null; then
> have_grep_dash_I=true
> else
> have_grep_dash_I=false
> fi
> # simila
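A minimal use of the detected capability might look like this (a sketch restating the quoted probe; the wrapper function name is made up for illustration):

```shell
# Capability probe as in the quoted PoC: does this grep know -I
# (treat binary files as non-matching)?
if echo . | grep -q -I . 2>/dev/null; then
    have_grep_dash_I=true
else
    have_grep_dash_I=false
fi

# Use -I only when available, fall back to plain grep otherwise.
if $have_grep_dash_I; then
    grep_maybe_binary() { grep -I "$@"; }
else
    grep_maybe_binary() { grep "$@"; }
fi
echo "have_grep_dash_I=$have_grep_dash_I"
```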
est+0x531/0x870 [drbd]
> [] ? throtl_find_tg+0x46/0x60
> [] ? blk_throtl_bio+0x1ea/0x5f0
> [] ? blk_queue_bio+0x494/0x610
> [] ? dm_make_request+0x122/0x180 [dm_mod]
> [] generic_make_request+0x240/0x5a0
> [] ? mempool_alloc_slab+0x15/0x20
> [] ? mempool_alloc+0x63/0x14
"default"
So, unless you happen to have an explicitly set to the empty string
OCF_RESKEY_syslog_ng_binary in your environment, things work just fine.
And if you do, then that's the bug.
Which could be worked around by:
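The suggested workaround itself is truncated in this archive; one common shape for it is the colon form of shell parameter expansion, which treats a set-but-empty variable like an unset one (a sketch; the binary path is illustrative):

```shell
# "-" substitutes the default only when the variable is unset;
# ":-" substitutes it also when the variable is set but empty.
OCF_RESKEY_syslog_ng_binary=""
echo "without colon: '${OCF_RESKEY_syslog_ng_binary-/sbin/syslog-ng}'"   # prints ''
echo "with colon:    '${OCF_RESKEY_syslog_ng_binary:-/sbin/syslog-ng}'" # prints '/sbin/syslog-ng'
```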
> Yes. Interestingly, there's some code to handle that case
to fulfill target-role, and happened to ignore
master-max, trying to promote all instances everywhere ;-)
not set: default behaviour
started: same as not set
slave: do not promote
master: nowadays for ms resources same as "Started" or not set,
but used to trigger s
- just looking for any suggestions. Hoping that perhaps
> someone has successfully done this.
>
> thanks in advance
> -mgb
--
: Lars Ellenberg
: http://www.LINBIT.com
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listi
dropped:0 overruns:0 frame:0
> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:500
> RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
# ip addr add 192.168.7.1/24 dev tap0
# ip addr add 192.168.8.1/24 dev tap0 label tap0:jan
# ip addr
ll,
> it does not risk the data, only the automatic cluster recovery, right?
stonith-enabled=false
means:
if some node becomes unresponsive,
it is immediately *assumed* it was "clean" dead.
no fencing takes place,
resource takeover happens without further protection.
That very much risks a
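For reference, re-enabling fencing is a one-line property in the cluster configuration (crm shell configure syntax; a working stonith resource must also be configured for it to do anything):

```
property stonith-enabled=true
```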
have some operation fail.
And you should figure out which, when, and why.
Is it the start that fails?
Why does it fail?
Cheers,
Lars
On Fri, Mar 25, 2016 at 04:08:48PM +, Sam Gardner wrote:
> On 3/25/16, 10:26 AM, "Lars Ellenberg" <lars.ellenb...@linbit.com> wrote:
>
>
> >On Thu, Mar 24, 2016 at 09:01:18PM +, Sam Gardner wrote:
> >> I'm having some trouble on a few of my clust
If you need more than "node-dead" detection,
what you should do for a new system is:
==> use pacemaker on corosync.
Or, if all you are going to manage is a bunch of IP addresses,
maybe you should choose a different tool; VRRP with keepalived
may be better for your needs.
On Thu, Apr 21, 2016 at 12:50:43PM -0500, Ken Gaillot wrote:
> Hello everybody,
>
> The release cycle for 1.1.15 will be started soon (hopefully tomorrow)!
>
> The most prominent feature will be Klaus Wenninger's new implementation
> of event-driven alerts -- the ability to call scripts whenever
When I recently tried to make use of the DEGRADED monitoring results,
I found out that it still does not work,
because LRMD chooses to filter them in ocf2uniform_rc(),
and maps them to PCMK_OCF_UNKNOWN_ERROR.
See patch suggestion below.
It also filters away the other "special" rc values.
Do we
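For context, a monitor action that wants to report a degraded-but-still-running service would return one of those special rc values. A sketch with stubbed-out health checks (the helper names are made up; 190 is pacemaker's PCMK_OCF_DEGRADED):

```shell
# Stub health checks (assumptions for illustration): pretend the service
# is running but impaired.
service_fully_healthy() { false; }
service_running_but_impaired() { true; }

monitor() {
    if service_fully_healthy; then
        return 0      # OCF_SUCCESS
    elif service_running_but_impaired; then
        return 190    # PCMK_OCF_DEGRADED: running, but degraded
    else
        return 7      # OCF_NOT_RUNNING
    fi
}

monitor
echo "monitor rc: $?"   # prints: monitor rc: 190
```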
On Tue, Aug 30, 2016 at 06:15:49PM +0200, Dejan Muhamedagic wrote:
> On Tue, Aug 30, 2016 at 10:08:00AM -0500, Dmitri Maziuk wrote:
> > On 2016-08-30 03:44, Dejan Muhamedagic wrote:
> >
> > >The kernel reads the shebang line and it is what defines the
> > >interpreter which is to be invoked to
On Mon, Aug 29, 2016 at 04:37:00PM +0200, Dejan Muhamedagic wrote:
> Hi,
>
> On Mon, Aug 29, 2016 at 02:58:11PM +0200, Gabriele Bulfon wrote:
> > I think the main issue is the usage of the "local" operator in ocf*
> > I'm not an expert on this operator (never used!), don't know how hard it is
>
On Wed, Aug 31, 2016 at 12:29:59PM +0200, Dejan Muhamedagic wrote:
> > Also remember that sometimes we set a "local" variable in a function
> > and expect it to be visible in nested functions, but also set a new
> > value in a nested function and expect that value to be reflected
> > in the outer
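The quoted expectation holds because `local` in shells gives dynamic scope: a local variable is visible to, and writable by, every function called from its scope. A self-contained demonstration:

```shell
inner() {
    echo "inner sees: $v"    # reads the caller's local v
    v="changed-by-inner"     # writes the caller's local v
}
outer() {
    local v="set-by-outer"
    inner
    echo "outer now has: $v" # reflects inner's assignment
}
outer
# prints:
#   inner sees: set-by-outer
#   outer now has: changed-by-inner
```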
; resource agent.
and it was considered not good enough.
That's why we provide the ocf:LINBIT:drbd resource agent.
Which you are supposed to use with DRBD 8.4 on Pacemaker.
"ethmonitor" resource in addition to the IP.
If you wanted to test-drive cluster response against a
failing network device, your test was wrong.
If you wanted to test-drive cluster response against
a "fat fingered" (or even evil) operator or admin:
give up right there...
You'll nev
ing with removed interface drivers,
or unplugged devices, or whatnot, has to be dealt with elsewhere.
What you did is: down the bond, remove all slave assignments, even
remove the driver, and expect the resource agent to "heal" things that
it does not know about. It can not.
the idea.
Currently, we have SBD chosen as such a "watchdog proxy",
maybe we can generalize it?
All of that would require cooperation within the node itself, though.
In this scenario, the cluster is not trusting the "sanity"
of the "commander in chief".
So maybe in addition of t
Stop polling the cib several times per second.
If you have to, "subscribe" to cib updates, using the API.
And stop pushing that much data into the cib.
Maybe, as a stop gap, compress it yourself,
before you stuff it into the cib.
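A sketch of that stop-gap: compress and base64-encode the blob so it survives embedding in XML attribute values (the value would then be stored with e.g. crm_attribute; the data here is illustrative):

```shell
data="some repetitive cluster state, repeated: state state state state"
# compress, then encode so the result is plain ASCII
encoded=$(printf '%s' "$data" | gzip -c | base64 | tr -d '\n')
# ... store $encoded in the cib ...
# the consumer decodes it again:
decoded=$(printf '%s' "$encoded" | base64 -d | gzip -dc)
[ "$data" = "$decoded" ] && echo "roundtrip ok"
```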
On Thu, Mar 02, 2017 at 05:31:33PM -0600, Ken Gaillot wrote:
> On 03/01/2017 05:28 PM, Andrew Beekhof wrote:
> > On Tue, Feb 28, 2017 at 12:06 AM, Lars Ellenberg
> > <lars.ellenb...@linbit.com> wrote:
> >> When I recently tried to make use of the DEGRADED monito
On Wed, Aug 09, 2017 at 06:48:01PM +0200, Lentes, Bernd wrote:
>
>
> - Am 8. Aug 2017 um 15:36 schrieb Lars Ellenberg
> lars.ellenb...@linbit.com:
>
> > crm shell in "auto-commit"?
> > never seen that.
>
> i googled for "crmsh autocommit
ver)
> * is it possible to do the opposite? persistent setting "off" and override
> it
> with the transient setting?
see above, also man crm_standby,
which again is only a wrapper around crm_attribute.
at the time.
Though that may have been an unintentional side-effect
of checking both sets of attributes?
On Tue, Apr 25, 2017 at 10:27:43AM +0200, Jehan-Guillaume de Rorthais wrote:
> On Tue, 25 Apr 2017 10:02:21 +0200
> Lars Ellenberg <lars.ellenb...@linbit.com> wrote:
>
> > On Mon, Apr 24, 2017 at 03:08:55PM -0500, Ken Gaillot wrote:
> > > Hi all,
> > >
>
On Mon, Apr 24, 2017 at 04:34:07PM +0200, Jehan-Guillaume de Rorthais wrote:
> Hi all,
>
> In the PostgreSQL Automatic Failover (PAF) project, one of the most frequent
> pieces of negative feedback we got is how difficult it is to experiment with it because
> of
> fencing occurring way too frequently. I am
ctices how to set up a web server on pacemaker and DRBD"
If you don't have a *very* good reason to use a cluster file
system, for things like web servers, mail servers, file servers,
... most services actually, a "classic" file system such as xfs or
ext4 in failover configuration will usu
Yay!
On Mon, May 08, 2017 at 07:50:49PM -0500, Ken Gaillot wrote:
> "crm_attribute --pattern" to update or delete all node
> attributes matching a regular expression
Just a nit, but "pattern" is usually associated with "glob pattern".
If it's not a "pattern" but a "regex",
"--regex" would be
lockfile (#917)
here goes:
On Wed, Jun 07, 2017 at 02:49:41PM -0700, Dejan Muhamedagic wrote:
> On Wed, Jun 07, 2017 at 05:52:33AM -0700, Lars Ellenberg wrote:
> > Note: ocf_take_lock is NOT actually safe to use.
> >
> > As implemented, it uses "echo $pid > lockfile"
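A commonly used safer shape relies on an atomic filesystem operation such as mkdir, which either creates the directory or fails, with no race window in between (a sketch; the path is illustrative, and there is still no stale-lock recovery here):

```shell
lockdir="/tmp/demo.lock.$$"
if mkdir "$lockdir" 2>/dev/null; then
    echo "lock acquired"
    # ... critical section ...
    rmdir "$lockdir"
else
    echo "lock busy"
fi
```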
ch time.
rm -rf trimtester-is-broken/
mkdir trimtester-is-broken
o=trimtester-is-broken/x1
echo X > $o
l=$o
# each step concatenates the previous file with itself,
# doubling its size, up to x32
for i in `seq 2 32`; do
    o=trimtester-is-broken/x$i
    cat $l $l > $o
    rm -f $l
    l=$o
done
./TrimTester trimtester-is-broken
Wahwahwa Corrupted file
fall-back workaround
which used to "perform" better.
The bug is not that this fall-back workaround now
has pretty printing and is much slower (and eventually times out),
the bug is that you don't properly kill the service first.
[and that you don't have fencing].
>
To understand some weird behavior we observed,
I dumbed down a production config to three dummy resources,
while keeping some descriptive resource ids (ip, drbd, fs).
For some reason, the constraints are:
stuff, more stuff, IP -> DRBD -> FS -> other stuff.
(In the actual real-world config, it
sion=0.45.44)
> Jun 20 13:09:45 kpasterisk02 crmd:error: finalize_sync_callback:
> Sync from kpasterisk01-ha failed: Protocol not supported
> Jun 20 13:09:45 kpasterisk02 crmd: warning: do_log: FSA: Input
> I_ELECTION_DC from finalize_sync_callback() rece
On Fri, Jan 19, 2018 at 04:52:40PM -0600, Ken Gaillot wrote:
> Your constraints are:
>
> place IP then place drbd instance(s) with it
> start IP then start drbd instance(s)
>
> place drbd master then place fs with it
> promote drbd master then start fs
>
> I'm guessing you meant to
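For reference, the quoted constraint pairs written out in crm shell configure syntax (resource ids taken from the dumbed-down config above; a sketch, not verified against the actual cib):

```
colocation col_drbd_with_ip inf: ms_drbd ip
order ord_ip_then_drbd Mandatory: ip ms_drbd
colocation col_fs_with_drbd_master inf: fs ms_drbd:Master
order ord_promote_then_fs Mandatory: ms_drbd:promote fs:start
```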
able to follow
what it tries to do, and even why.
Other implementations of drbd fencing policy handlers may directly
escalate to node level fencing. If that is what you want, use one of
those, and effectively map every DRBD replication link hiccup to a hard
reset of the peer.
has "dc-deadtime", documented as
"How long to wait for a response from other nodes during startup.",
but the 20s default of that in current Pacemaker is very likely much
shorter than what you had as initdead in your "old" setup.
So maybe if you set dc-deadtime to two minutes or somet
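E.g., as a cluster property (crm shell configure syntax; the value follows the suggestion above):

```
property dc-deadtime=2min
```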
to just remove the DRBD
specific "magic" used by libblkid to identify it.
The "without Data Loss" part depends on whether the local copy was
"Consistent" (or better yet: UpToDate) before you decided to remove DRBD.
Hi there.
I've seen a scenario where a network "hiccup" isolated the current DC in a 3
node cluster for a short time; other partition elected a new DC obviously, and
all node attributes of the former DC are "cleared" together with the rest of
its state.
All nodes rejoin, "all happy again", BUT
Now with "reproducer" ... see below
On Thu, Sep 10, 2020 at 11:55:20AM +0200, Lars Ellenberg wrote:
> Hi there.
>
> I've seen a scenario where a network "hiccup" isolated the current DC in a 3
> node cluster for a short time; other partition elected a new DC obvi
On Fri, Sep 11, 2020 at 11:42:46AM +0200, Lars Ellenberg wrote:
> On Thu, Sep 10, 2020 at 11:18:58AM -0500, Ken Gaillot wrote:
> > > But for some unrelated reason (stress on the cib, IPC timeout),
> > > crmd on the DC was doing an error exit and was respawned:
> &g
lge> > I would have expected corosync to come back with a "stable
lge> > non‑quorate membership" of just itself within a very short
lge> > period of time, and pacemaker winning the
lge> > "election"/"integration" with just itself, and then trying
lge> > to call "stop" on everything it knows about.
pcmk 2.0.5, corosync 3.1.0, knet, rhel8
I know fencing "solves" this just fine.
what I'd like to understand though is: what exactly is corosync or
pacemaker waiting for here,
why does it not manage to get to the stage where it would even attempt
to "stop" stuff?
two "rings" aka knet interfaces.
On Thu, Sep 08, 2022 at 10:11:46AM -0500, Ken Gaillot wrote:
> On Thu, 2022-09-08 at 15:01 +0200, Lars Ellenberg wrote:
> > Scenario:
> > three nodes, no fencing (I know)
> > break network, isolating nodes
> > unbreak network, see how cluster partitions rejoin and resum
Scenario:
three nodes, no fencing (I know)
break network, isolating nodes
unbreak network, see how cluster partitions rejoin and resume service
Funny outcome:
/usr/sbin/crm_mon -x pe-input-689.bz2
Cluster Summary:
* Stack: corosync
* Current DC: mqhavm24 (version
On Wed, May 03, 2023 at 11:07:20AM -0400, Madison Kelly wrote:
> On 2023-05-03 05:26, 黃暄皓 wrote:
>
> As the title said, is it still in maintenance?
>
> I'm not sure who even owns or maintains that old domain.
We (LINBIT) did still host that site,
though all of it was supposed to be