Thank you for the clear explanation and advice. Hardware is adequate: 8x SSD and 20 cores per node, but I should note that the filesystem is ZFS (stripe of mirrors) and there seems to be evidence that the way the WAL writer allocates space and ZFS' Copy-on-Write nature don't play nice. A patch that adds several GUCs to improve the situation (at least it's worth trying, there was some disagreement on the pgsql-general list over whether it would be helpful in my situation) has recently been merged: https://postgrespro.com/list/thread-id/2393057 but it won't be available in the 11.x release . So while I'm waiting until I can upgrade to PostgreSQL 12, I'll increase the notify timeout. A larger RTO is much more preferrable over manual actions in the middle of the night!
Thanks again! On Mon, Jul 8, 2019 at 11:56 AM Jehan-Guillaume de Rorthais <[email protected]> wrote: > On Sat, 6 Jul 2019 12:30:34 +0200 > Tiemen Ruiten <[email protected]> wrote: > > > On Fri, Jul 5, 2019 at 5:09 PM Jehan-Guillaume de Rorthais < > [email protected]> > > wrote: > > > > > It seems to me the problem comes from here: > > > > > > Jul 03 19:31:38 [30151] ph-sql-03.prod.ams.i.rdmedia.com crmd: > > > notice: > > > te_rsc_command: Initiating notify operation > > > pgsqld_pre_notify_promote_0 on ph-sql-05 | action 67 > > > Jul 03 19:32:38 [30148] ph-sql-03.prod.ams.i.rdmedia.com lrmd: > > > warning: > > > operation_finished: pgsqld_notify_0:30939 - timed out after 60000ms > > > > > > and here: > > > > > > Jul 03 19:31:38 [11914] ph-sql-05.prod.ams.i.rdmedia.com lrmd: > > > info: > > > log_execute: executing - rsc:pgsqld action:notify call_id:38 > > > pgsqlms(pgsqld)[20881]: > > > Jul 03 19:32:38 [11914] ph-sql-05.prod.ams.i.rdmedia.com lrmd: > > > warning: operation_finished: pgsqld_notify_0:20881 - timed > out > > > after > > > 60000ms > > > > > > The pgsql election occurs during the pre-promote action where all > remaining > > > nodes set their LSN location. During the promote action, the designed > > > primary > > > checks its LSN location is the highest one. If it is not, it just > cancel > > > the > > > promotion so the next round will elect the best one. > > > > > > Back to your issue. According to the logs, both standby nodes timed out > > > during > > > the pre-promote action. No LSN location has been set accros the whole > > > cluster. > > > I couldn't see any messages from the attrd daemon related to the > > > lsn_location > > > attribute or other cleanup actions. > > > > > > I couldn't even find the INFO message from pgsqlms giving its current > > > status > > > before actually setting it ("Current node TL#LSN: %s"). But this > message > > > appears soon after a...checkpoint. See: > > > https://github.com/ClusterLabs/PAF/blob/master/script/pgsqlms#L2017 > > > > > > Maybe you have very long checkpoints on both nodes that timed out the > > > pre-promote action? Do you have some PostgreSQL logs showing some > useful > > > info > > > around this timelapse? > > > > > > > Yes I have logs that show at what time checkpoints were taken. They're > > attached. Currently checkpoint_timeout is set to 15 minutes. I already > > lowered it from the previous value of 60 minutes. I could experiment with > > even lower values for checkpoint_timeout as it seems the problem is that > > forced checkpoints are taking too long (more than 60s). > > So it seems the theory is right. The "immediate wait" restartpoint is the > CHECKPOINT issued by pgsqlms during he pre-promote action. Timestamps > match. > > 2019-07-03 19:31:38.647 CEST [1214]: LOG: restartpoint starting: > immediate > wait > > One minute later, we reach the LRM timeout and about one more second > later, the > CRM decided to shutdown PgSQL, canceling the CHECKPOINT statement from > session > 30963: > > 2019-07-03 19:32:39.693 CEST [30963]: FATAL: terminating connection due > to > administrator command > 2019-07-03 19:32:39.693 CEST [30963]: STATEMENT: CHECKPOINT > > > Perhaps I should also increase the notify timeout? > > The checkpoints issued from pgsqlms are supposed to be "forced", which > means > they are not throttled and should sync dirty buffers as fast as possible. > Worst > case scenario is to sync the whole shared buffers. This shouldn't take a > full > minute on modern servers, unless: > > * you have a huge shared buffers > * your disk bandwidth or IOPS are saturated or very poor > * you have disks issues > > A side note about checkpoints: > > 15 minutes for checkpoint_timeout seems reasonable to me, but it really > depend > on your workload. You should just make sure most of your checkpoint are > timed triggered, but not occurring too frequently to improve write cache > and > avoid full page writes. According to your logs, last timed checkpoint > synced > about 33% of your shared buffers, 2.7GB, in 811s (checkpoint throttled at > 3.2MB/s). This is perfectly fine. > > The amount of data to sync during a failover really depend on your > workload. > Syncing 2.7GB in 60s would require a bandwidth of 45MB/s. Syncing 8GB in > 60s > would require 136MB/s. Both are quite easy to achieve on a decent server > setup. > > Back to the advice and conclusion: > > Lowering the checkpoint_timeout will not help. You can always find a > workload > that will fill your whole shared_buffers with dirty pages in less than > this GUC > value. Moreover, syncing 8GB of data (worst case scenario) should not take > ages... > > However, can raise the notify action timeout to something way larger. The > timeout should be set depending on your RTO (warning, you will have to sum > various actions timeouts). You can even raise the notify action to 300s or > larger if it makes sense to you. The point is to give enough time to the > CHECKPOINT to finish, no matter what. > > Unfortunately, we still need this checkpoint to be able to compare safely > the > TL of nodes :( Even the "pg_control_checkpoint()" function appearing in 9.6 > doesn't help as it reads the controldata file from disk where data can be > outdated with the one in memory :/ > The only way to avoid this might be to rely on a dedicated C extension, > which > seems too invasive in my opinion. > > If you need something faster, you might have to setup your hardware to > match > your performance requirement. > > Regards, >
_______________________________________________ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
