Re: [ClusterLabs] I have a question.

2019-07-09 Thread Ken Gaillot
On Mon, 2019-07-08 at 13:11 +0900, 김동현 wrote:
> Hello.
> I'm Donghyun Kim.
> 
> I work as a system engineer in Korea.
> 
> In the meantime, I was very interested in the cluster and want to
> promote it in Korea.
> There are many high-availability cases in Linux systems.
> 
> The reason why I am sending this mail is whether I can use the name
> "ClusterLabs Korea" in my Facebook community.
> 
> It's only for the Facebook community.
> 
> Then I'll be looking forward to your reply.
> have a nice day

Hi Donghyun,

That's exciting, it's great to see more ways for people to get
involved!

The closest existing comparison I can think of is the Japanese user
group, which pre-existed the ClusterLabs effort, and so is still named
Linux-HA Japan.

Personally I'd be fine with ClusterLabs Korea, although maybe
ClusterLabs Users Korea would make it more obvious that the group is
for discussion by users and not just announcements.

If there's any way I can help, let me know!

Hopefully we can get more feedback from others. Does anyone know of
other local user groups, or have opinions about names?
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] PAF fails to promote slave: Can not get current node LSN location

2019-07-09 Thread Tiemen Ruiten
On Tue, Jul 9, 2019 at 4:21 PM Jehan-Guillaume de Rorthais 
wrote:

> On Tue, 9 Jul 2019 13:22:06 +0200
> Tiemen Ruiten  wrote:
>
> > On Mon, Jul 8, 2019 at 10:01 PM Jehan-Guillaume de Rorthais <
> j...@dalibo.com>
> ...
> > > I dug into xlog.c today. Maybe I can write a small extension to get the
> > > timeline from shared memory directly and make pgsqlms use it if it
> > > detects it. So people can decide if they feel like it is too invasive
> > > or really needed for their use case. Maybe in the next release. What do
> > > you think? Would it be useful to you?
> > >
> >
> > Yes, that would be a really useful addition IMO. I would definitely use it.
> > If we can avoid taking a checkpoint, that will save precious minutes during
> > a failover and the risk of timeouts would be drastically reduced. Would be
> > happy to test it if you want!
>
> OK, thanks. Not sure when I'll have time to work on this. But I'll stay in
> touch with you then.
>

Great!


>
> I have to work on the v12 support as well :/
>
> > > > I managed to improve the average time checkpoints take already from
> > > > what I mentioned in that thread, mainly by decreasing checkpoint_timeout
> > > > and setting full_page_writes = off; ostensibly not necessary on ZFS.
> > >
> > > The "full_page_writes" helps lower the amount of WAL produced. Not the
> > > amount of writes to sync during the checkpoint. But I am sure it helps
> > > for your performance :)
> >
> > If I'm saturating the IO capacity of my system during a forced checkpoint
> > and full_page_writes = off reduces IO by reducing the amount of WAL, then
> > it should help in an indirect way?
>
> The master is supposed to be gone during a failover, with no reads or
> writes.


OK, I didn't consider this.


> The checkpoint occurs on each standby to force a sync of their
> controldata. The checkpoint itself does not write to WALs or read them.
> Am I forgetting something obvious?
>
> Maybe you can have some writes if the standby needs to sync the last
> received WALs, and some reads if the standby was lagging on replay...
> But it shouldn't be much...
>

I double-checked monitoring data: there was approximately one minute of
replication lag on one slave and two minutes of replication lag on the
other slave when the original issue occurred. By the way, I'm still seeing
worrying amounts of replication lag on both slaves at times (usually not on
both at the same time), so that's really puzzling: all hardware and
configuration is identical. Anyway, that's something for another
thread/mailing list, I suppose :)


>
> --
> Jehan-Guillaume de Rorthais
> Dalibo
>

Re: [ClusterLabs] "node is unclean" leads to gratuitous reboot

2019-07-09 Thread Ken Gaillot
On Tue, 2019-07-09 at 12:54 +, Michael Powell wrote:
> I have a two-node cluster with a problem.  If I start 

Not so much a problem as a configuration choice :)

There are trade-offs in any case.

- wait_for_all in corosync.conf: If set, this will make each starting
node wait until it sees the other before gaining quorum for the first
time. The downside is that both nodes must be up for the cluster to
start; the upside is a clean starting point and no fencing.

- startup-fencing in pacemaker properties: If disabled, either node can
start without fencing the other. This is unsafe; if the other node is
actually active and running resources, but unreachable from the newly
up node, the newly up node may start the same resources, causing split-
brain. (Easier than you might think: consider taking a node down for
hardware maintenance, bringing it back up without a network, then
plugging it back into the network -- by that point it may have brought
up resources and starts causing havoc.)

- Start corosync on both nodes, then start pacemaker. This avoids
start-up fencing since when pacemaker starts on either node, it already
sees the other node present, even if that node's pacemaker isn't up
yet.
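In practice that ordering looks like the following (a sketch, assuming the stock systemd unit names):

```shell
# Step 1: on BOTH nodes, bring up the membership layer only
systemctl start corosync

# Step 2: once corosync is running on both nodes, start pacemaker on each;
# each node already sees its peer in the membership at that point, so no
# start-up fencing is scheduled
systemctl start pacemaker
```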

Personally I'd go for wait_for_all in normal operation. You can always
disable it if there are special circumstances where a node is expected
to be out of the cluster for a long time.
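For reference, a wait_for_all-enabled quorum section could look like this (a corosync.conf sketch; note that two_node: 1 already enables wait_for_all by default unless it is explicitly set to 0):

```
quorum {
    provider: corosync_votequorum
    two_node: 1
    # Each starting node waits until it has seen the other node at least
    # once before the cluster gains quorum, avoiding start-up fencing:
    wait_for_all: 1
}
```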

> Corosync/Pacemaker on one node, and then delay startup on the 2nd
> node (which is otherwise up and running), the 2nd node will be
> rebooted very soon after STONITH is enabled on the first node.  This
> reboot seems to be gratuitous and could under some circumstances be
> problematic.  While, at present,  I “manually” start
> Corosync/Pacemaker by invoking a script from an ssh session, in a
> production environment, this script would be started by a systemd
> service.  It’s not hard to imagine that if both nodes were started at
> approximately the same time (each node runs on a separate motherboard
> in the same chassis), this behavior could cause one of the nodes to
> be rebooted while it’s in the process of booting up.
>  
> The two nodes’ host names are mgraid-16201289RN00023-0 and mgraid-
> 16201289RN00023-1.  Both hosts are running, but Pacemaker has been
> started on neither.  If Pacemaker is started on mgraid-
> 16201289RN00023-0, within a few seconds after STONITH is enabled, the
> following messages will appear in the system log file, and soon
> thereafter STONITH will be invoked to reboot the other node, on which
> Pacemaker has not yet been started.  (NB: The fence agent is a
> process named mgpstonith which uses the ipmi interface to reboot the
> other node.  For debugging, it prints the data it receives from
> stdin. )
>  
> 2019-07-08T13:11:14.907668-07:00 mgraid-16201289RN00023-0
> HA_STARTSTOP: Configure mgraid-stonith# This message indicates
> that STONITH is about to be configured and enabled
> …
> 2019-07-08T13:11:15.018131-07:00 mgraid-16201289RN00023-0
> MGPSTONITH[16299]: info:... action=metadata#012
> 2019-07-08T13:11:15.050817-07:00 mgraid-16201289RN00023-0
> MGPSTONITH[16301]: info:... action=metadata#012
> …
> 2019-07-08T13:11:21.085092-07:00 mgraid-16201289RN00023-0
> pengine[16216]:  warning: Scheduling Node mgraid-16201289RN00023-1
> for STONITH
> 2019-07-08T13:11:21.085615-07:00 mgraid-16201289RN00023-0
> pengine[16216]:   notice:  * Fence (reboot) mgraid-16201289RN00023-1
> 'node is unclean'
> 2019-07-08T13:11:21.085663-07:00 mgraid-16201289RN00023-0
> pengine[16216]:   notice:  * PromoteSS16201289RN00023:0 (
> Stopped -> Master mgraid-16201289RN00023-0 ) 
> 2019-07-08T13:11:21.085704-07:00 mgraid-16201289RN00023-0
> pengine[16216]:   notice:  * Start  mgraid-stonith:0   
> (   mgraid-16201289RN00023-0 ) 
> 2019-07-08T13:11:21.091673-07:00 mgraid-16201289RN00023-0
> pengine[16216]:  warning: Calculated transition 0 (with warnings),
> saving inputs in /var/lib/pacemaker/pengine/pe-warn-3.bz2
> 2019-07-08T13:11:21.093155-07:00 mgraid-16201289RN00023-0
> crmd[16218]:   notice: Initiating monitor operation
> SS16201289RN00023:0_monitor_0 locally on mgraid-16201289RN00023-0
> 2019-07-08T13:11:21.124403-07:00 mgraid-16201289RN00023-0
> crmd[16218]:   notice: Initiating monitor operation mgraid-
> stonith:0_monitor_0 locally on mgraid-16201289RN00023-0
> …
> 2019-07-08T13:11:21.132994-07:00 mgraid-16201289RN00023-0
> MGPSTONITH[16361]: info:... action=metadata#012
> …
> 2019-07-08T13:11:22.128139-07:00 mgraid-16201289RN00023-0
> crmd[16218]:   notice: Requesting fencing (reboot) of node mgraid-
> 16201289RN00023-1
> 2019-07-08T13:11:22.129150-07:00 mgraid-16201289RN00023-0
> crmd[16218]:   notice: Result of probe operation for
> SS16201289RN00023 on mgraid-16201289RN00023-0: 7 (not running)
> 2019-07-08T13:11:22.129191-07:00 mgraid-16201289RN00023-0
> crmd[16218]:   notice: mgraid-16201289RN00023-0-
> SS16201289RN00023_monitor_0:6 [ \n\n ]
> 

Re: [ClusterLabs] colocation - but do not stop resources on failure

2019-07-09 Thread Ken Gaillot
On Tue, 2019-07-09 at 11:21 +0100, lejeczek wrote:
> hi guys,
> 
> how can I, if possible, create a colocation constraint which would not
> stop dependent resources if the target resource (that would be a systemd
> agent) fails on all nodes?
> 
> many thanks, L.

Sure, just use a finite score. Colocation is mandatory if the score is
INFINITY (or -INFINITY for anti-colocation), otherwise it's a
preference rather than a requirement.
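With pcs, for example, the two forms look like this (resource names A and B are hypothetical; the two commands are alternatives, not a sequence):

```shell
# Mandatory: B must run where A runs; if A stops everywhere, B stops too
pcs constraint colocation add B with A INFINITY

# Advisory: B prefers A's node, but keeps running if A fails on all nodes
pcs constraint colocation add B with A 1000
```

The second, finite-score form is what is wanted here.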
-- 
Ken Gaillot 



Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-09 Thread Andrei Borzenkov
09.07.2019 13:08, Danka Ivanović wrote:
> Hi, I didn't manage to start the master with postgres, even after I increased
> the start timeout. I checked the executable paths and start options.
> When the cluster is running with a manually started master and the slave
> started over pacemaker, everything works ok. Today we had a failover again.
> I cannot find the reason from the logs; can you help me with debugging? Thanks.
> 

> Jul 09 09:16:32 [2679] postgres1   lrmd:debug: child_kill_helper: 
> Kill pid 12735's group
> Jul 09 09:16:34 [2679] postgres1   lrmd:  warning: 
> child_timeout_callback:PGSQL_monitor_15000 process (PID 12735) timed 
> out

You probably want to enable debug output in the resource agent. As far as I
can tell, this requires HA_debug=1 in the environment of the resource agent,
but for the life of me I cannot find where it is possible to set it.

Probably setting it directly in the resource agent for debugging is the
simplest way.
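One avenue worth testing (an assumption on my side: variables set in pacemaker's EnvironmentFile should be inherited by the resource agents it spawns — verify before relying on it):

```shell
# /etc/sysconfig/pacemaker (Debian-based: /etc/default/pacemaker) is read
# as an EnvironmentFile by the pacemaker systemd unit; a variable set here
# should be inherited by lrmd and, in turn, by the agents it executes:
echo 'HA_debug=1' >> /etc/sysconfig/pacemaker
systemctl restart pacemaker
```

Remember to remove the line again once done debugging.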

P.S. crm_resource is called by the resource agent (pgsqlms). And it shows the
result of the original resource probe, which makes it confusing. At least it
explains where these log entries come from.

Re: [ClusterLabs] PAF fails to promote slave: Can not get current node LSN location

2019-07-09 Thread Jehan-Guillaume de Rorthais
On Tue, 9 Jul 2019 13:22:06 +0200
Tiemen Ruiten  wrote:

> On Mon, Jul 8, 2019 at 10:01 PM Jehan-Guillaume de Rorthais 
...
> > I dug into xlog.c today. Maybe I can write a small extension to get the
> > timeline from shared memory directly and make pgsqlms use it if it
> > detects it. So people can decide if they feel like it is too invasive or
> > really needed for their use case. Maybe in the next release. What do you
> > think? Would it be useful to you?
> >  
> 
> Yes, that would be a really useful addition IMO. I would definitely use it.
> If we can avoid taking a checkpoint, that will save precious minutes during
> a failover and the risk of timeouts would be drastically reduced. Would be
> happy to test it if you want!

OK, thanks. Not sure when I'll have time to work on this. But I'll stay in
touch with you then.

I have to work on the v12 support as well :/

> > > I managed to improve the average time checkpoints take already from
> > > what I mentioned in that thread, mainly by decreasing checkpoint_timeout
> > > and setting full_page_writes = off; ostensibly not necessary on ZFS.
> >
> > The "full_page_writes" helps lower the amount of WAL produced. Not the
> > amount of writes to sync during the checkpoint. But I am sure it helps for
> > your performance :)
> 
> If I'm saturating the IO capacity of my system during a forced checkpoint
> and full_page_writes = off reduces IO by reducing the amount of WAL, then
> it should help in an indirect way?

The master is supposed to be gone during a failover, with no reads or
writes. The checkpoint occurs on each standby to force a sync of their
controldata. The checkpoint itself does not write to WALs or read them. Am I
forgetting something obvious?

Maybe you can have some writes if the standby needs to sync the last received
WALs, and some reads if the standby was lagging on replay... But it shouldn't
be much...

-- 
Jehan-Guillaume de Rorthais
Dalibo


Re: [ClusterLabs] "node is unclean" leads to gratuitous reboot

2019-07-09 Thread Andrei Borzenkov
On Tue, Jul 9, 2019 at 3:54 PM Michael Powell <
michael.pow...@harmonicinc.com> wrote:

> I have a two-node cluster with a problem.  If I start Corosync/Pacemaker
> on one node, and then delay startup on the 2nd node (which is otherwise
> up and running), the 2nd node will be rebooted very soon after STONITH is
> enabled on the first node.
>


Well, this is what you explicitly requested.


>
> quorum {
> provider: corosync_votequorum
> two_node: 1
> wait_for_all: 0
> }
>
> I’d appreciate any insight you can offer into this behavior, and any
> suggestions you may have.
>
>
I am not sure what you expect. Pacemaker cannot (re-)start resources until
the state of all nodes is known. On startup this means either waiting for all
nodes to appear or bringing missing nodes into a defined state.

Maybe there is an "initial timeout" for how long pacemaker should wait
after startup before assuming other nodes are not present, but I am not
aware of one off the top of my head.

[ClusterLabs] "node is unclean" leads to gratuitous reboot

2019-07-09 Thread Michael Powell
I have a two-node cluster with a problem.  If I start Corosync/Pacemaker on one 
node, and then delay startup on the 2nd node (which is otherwise up and 
running), the 2nd node will be rebooted very soon after STONITH is enabled on 
the first node.  This reboot seems to be gratuitous and could under some 
circumstances be problematic.  While, at present,  I "manually" start 
Corosync/Pacemaker by invoking a script from an ssh session, in a production 
environment, this script would be started by a systemd service.  It's not hard 
to imagine that if both nodes were started at approximately the same time (each 
node runs on a separate motherboard in the same chassis), this behavior could 
cause one of the nodes to be rebooted while it's in the process of booting up.

The two nodes' host names are mgraid-16201289RN00023-0 and 
mgraid-16201289RN00023-1.  Both hosts are running, but Pacemaker has been 
started on neither.  If Pacemaker is started on mgraid-16201289RN00023-0, 
within a few seconds after STONITH is enabled, the following messages will 
appear in the system log file, and soon thereafter STONITH will be invoked to 
reboot the other node, on which Pacemaker has not yet been started.  (NB: The 
fence agent is a process named mgpstonith which uses the ipmi interface to 
reboot the other node.  For debugging, it prints the data it receives from 
stdin. )

2019-07-08T13:11:14.907668-07:00 mgraid-16201289RN00023-0 HA_STARTSTOP: 
Configure mgraid-stonith# This message indicates that STONITH is about to 
be configured and enabled
...
2019-07-08T13:11:15.018131-07:00 mgraid-16201289RN00023-0 MGPSTONITH[16299]: 
info:... action=metadata#012
2019-07-08T13:11:15.050817-07:00 mgraid-16201289RN00023-0 MGPSTONITH[16301]: 
info:... action=metadata#012
...
2019-07-08T13:11:21.085092-07:00 mgraid-16201289RN00023-0 pengine[16216]:  
warning: Scheduling Node mgraid-16201289RN00023-1 for STONITH
2019-07-08T13:11:21.085615-07:00 mgraid-16201289RN00023-0 pengine[16216]:   
notice:  * Fence (reboot) mgraid-16201289RN00023-1 'node is unclean'
2019-07-08T13:11:21.085663-07:00 mgraid-16201289RN00023-0 pengine[16216]:   
notice:  * PromoteSS16201289RN00023:0 ( Stopped -> Master 
mgraid-16201289RN00023-0 )
2019-07-08T13:11:21.085704-07:00 mgraid-16201289RN00023-0 pengine[16216]:   
notice:  * Start  mgraid-stonith:0(   
mgraid-16201289RN00023-0 )
2019-07-08T13:11:21.091673-07:00 mgraid-16201289RN00023-0 pengine[16216]:  
warning: Calculated transition 0 (with warnings), saving inputs in 
/var/lib/pacemaker/pengine/pe-warn-3.bz2
2019-07-08T13:11:21.093155-07:00 mgraid-16201289RN00023-0 crmd[16218]:   
notice: Initiating monitor operation SS16201289RN00023:0_monitor_0 locally on 
mgraid-16201289RN00023-0
2019-07-08T13:11:21.124403-07:00 mgraid-16201289RN00023-0 crmd[16218]:   
notice: Initiating monitor operation mgraid-stonith:0_monitor_0 locally on 
mgraid-16201289RN00023-0
...
2019-07-08T13:11:21.132994-07:00 mgraid-16201289RN00023-0 MGPSTONITH[16361]: 
info:... action=metadata#012
...
2019-07-08T13:11:22.128139-07:00 mgraid-16201289RN00023-0 crmd[16218]:   
notice: Requesting fencing (reboot) of node mgraid-16201289RN00023-1
2019-07-08T13:11:22.129150-07:00 mgraid-16201289RN00023-0 crmd[16218]:   
notice: Result of probe operation for SS16201289RN00023 on 
mgraid-16201289RN00023-0: 7 (not running)
2019-07-08T13:11:22.129191-07:00 mgraid-16201289RN00023-0 crmd[16218]:   
notice: mgraid-16201289RN00023-0-SS16201289RN00023_monitor_0:6 [ \n\n ]
2019-07-08T13:11:22.133846-07:00 mgraid-16201289RN00023-0 stonith-ng[16213]:   
notice: Client crmd.16218.a7e3cbae wants to fence (reboot) 
'mgraid-16201289RN00023-1' with device '(any)'
2019-07-08T13:11:22.133997-07:00 mgraid-16201289RN00023-0 stonith-ng[16213]:   
notice: Requesting peer fencing (reboot) of mgraid-16201289RN00023-1
2019-07-08T13:11:22.136287-07:00 mgraid-16201289RN00023-0 crmd[16218]:   
notice: Result of probe operation for mgraid-stonith on 
mgraid-16201289RN00023-0: 7 (not running)
2019-07-08T13:11:22.141393-07:00 mgraid-16201289RN00023-0 MGPSTONITH[16444]: 
info:... action=status#012   # Status requests always return 0.
2019-07-08T13:11:22.141418-07:00 mgraid-16201289RN00023-0 MGPSTONITH[16444]: 
info:... nodename=mgraid-16201289RN00023-1#012
2019-07-08T13:11:22.141432-07:00 mgraid-16201289RN00023-0 MGPSTONITH[16444]: 
info:... port=mgraid-16201289RN00023-1#012
2019-07-08T13:11:22.141444-07:00 mgraid-16201289RN00023-0 MGPSTONITH[16444]: 
info:Ignoring: port
...
2019-07-08T13:11:22.148973-07:00 mgraid-16201289RN00023-0 MGPSTONITH[16445]: 
info:... action=status#012
2019-07-08T13:11:22.148997-07:00 mgraid-16201289RN00023-0 MGPSTONITH[16445]: 
info:... nodename=mgraid-16201289RN00023-1#012
2019-07-08T13:11:22.149009-07:00 mgraid-16201289RN00023-0 MGPSTONITH[16445]: 
info:... port=mgraid-16201289RN00023-1#012
2019-07-08T13:11:22.149019-07:00 mgraid-16201289RN00023-0 MGPSTONITH[16445]: 
info:Ignoring: port
...

[ClusterLabs] Question about associating with ClusterLabs wrt. a local community (Was: I have a question.)

2019-07-09 Thread Jan Pokorný
Hello Kim,

On 08/07/19 13:11 +0900, 김동현 wrote:
> I'm Donghyun Kim.
> 
> I work as a system engineer in Korea.
> 
> In the meantime, I was very interested in the cluster and want
> to promote it in Korea.
> There are many high-availability cases in Linux systems.
> 
> The reason why I am sending this mail is whether I can use the name
> "ClusterLabs Korea" in my Facebook community.

I am in no position of authority to decide this matter, and generally
speaking, the moniker in question -- which we reached consensus on as
representing "us" as a community (and as a label for easy association
with the SW components of the cluster stack "we" evolve/deploy, also
avoiding further ambiguities or unnecessary verbosity when referring
to "us") -- is still rather an informal convention than something we
could claim exclusivity about.

Nonetheless, it's very kind of you to seek approval for
label reference/association first.  It's good to be clear in
these collective entity matters, like for instance about the
license/reusability of the ClusterLabs logo (originally devised by
Krig, license subsequently clarified on the developer-aimed
sibling list[1]).

> It's only for the Facebook community.
> 
> Then I'll be looking forward to your reply.
> have a nice day

I am leaving this question open for more impactful community members,
but if I could ask a single question, it would be about the reasons
for having a rather detached group.  There may be some cultural context
I am missing -- does it have, for instance, something to do with the
language barrier, with face-to-face opportunities to meet up locally,
or with a heavy preference for Facebook over technically oriented
forums (like Reddit for general discussion, StackOverflow for Q&A,
etc.; of course, only if the blessed ClusterLabs communication
channels[2] are not a fit for whatever reason, otherwise they would
ideally be the first choice to avoid community fragmentation)?  Such
forums would possibly be more accessible for some people, also saving
them from the imminent leakage of personal data, which is hardly (in
a philosophical sense) justifiable for the purpose of "having
a technical discussion about particular FOSS software".

Just curious about this, looking forward to hearing from you.


P.S. Speaking of community subdivisions, it's rather interesting to
me personally that the Japanese also advanced their local group (actually
long before I got to Pacemaker), and as I just checked, it seems
still rather active: http://linux-ha.osdn.jp/wp/
(that's beside contributions to, say, Pacemaker coming in from people
of that country, and some stuff like https://github.com/linux-ha-japan
I never got to explore -- as mentioned, less fragmentation and we
could all benefit from getting to know the wider landscape
in a natural way, I believe).


[1] https://lists.clusterlabs.org/pipermail/developers/2019-April/002184.html
[2] https://clusterlabs.org/faq.html
+ https://wiki.clusterlabs.org/wiki/FAQ#Where_should_I_ask_questions.3F

-- 
Jan (Poki)



Re: [ClusterLabs] PAF fails to promote slave: Can not get current node LSN location

2019-07-09 Thread Tiemen Ruiten
On Mon, Jul 8, 2019 at 10:01 PM Jehan-Guillaume de Rorthais 
wrote:

> I should have stepped up to this thread, sorry :)
>

Really appreciate all the assistance so far.


> The real problem is not how many xacts you will lose during failover, but
> how we can choose the best standby to elect. This election needs the
> timeline and LSN location of all standbys. And today, to fetch the
> timeline, we must issue a CHECKPOINT, then read the controldata file.
>
> I dug into xlog.c today. Maybe I can write a small extension to get the
> timeline from shared memory directly and make pgsqlms use it if it
> detects it. So people can decide if they feel like it is too invasive or
> really needed for their use case. Maybe in the next release. What do you
> think? Would it be useful to you?
>

Yes, that would be a really useful addition IMO. I would definitely use it.
If we can avoid taking a checkpoint, that will save precious minutes during
a failover and the risk of timeouts would be drastically reduced. Would be
happy to test it if you want!


>
> >
> > I managed to improve the average time checkpoints take already from
> > what I mentioned in that thread, mainly by decreasing checkpoint_timeout
> > and setting full_page_writes = off; ostensibly not necessary on ZFS.
>
> The "full_page_writes" helps lower the amount of WAL produced. Not the
> amount of writes to sync during the checkpoint. But I am sure it helps for
> your performance :)
>

If I'm saturating the IO capacity of my system during a forced checkpoint
and full_page_writes = off reduces IO by reducing the amount of WAL, then
it should help in an indirect way?


>
> Lowering "checkpoint_timeout" probably helps. As checkpoints occur more
> frequently, there is statistically less data to sync when a forced
> checkpoint happens during a failover.
>
> Regards,
>
>
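Summing up the tuning discussed in this thread, the relevant postgresql.conf knobs might look like this (values are illustrative, not recommendations):

```
# Checkpoint more often, so a forced checkpoint at failover time has
# statistically less dirty data to sync (the default is 5min):
checkpoint_timeout = 2min

# On a filesystem with atomic/copy-on-write semantics such as ZFS, torn
# pages should not occur, so full-page images in WAL are ostensibly
# unnecessary -- verify this holds for your storage before disabling:
full_page_writes = off
```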

[ClusterLabs] colocation - but do not stop resources on failure

2019-07-09 Thread lejeczek
hi guys,

how can I, if possible, create a colocation constraint which would not stop
dependent resources if the target resource (that would be a systemd agent)
fails on all nodes?

many thanks, L.


