Re: Syncrep and improving latency due to WAL throttling

2023-12-03 Thread Tomas Vondra
Hi,

Since the last patch version I've done a number of experiments with this
throttling idea, so let me share some of the ideas and results, and see
where that gets us.

The patch versions so far tied everything to syncrep - commit latency
with sync replica was the original motivation, so this makes sense. But
while thinking about this and discussing it with a couple of people, I've
been wondering why we should limit this to just that particular case. There
are a couple of other places in the WAL write path where we might do a
similar thing (i.e. wait) or be a bit more aggressive (and do a write/flush),
depending on circumstances.

If I simplify this a bit, there are three WAL positions that come to mind:

- write LSN (how far we've written WAL out, not necessarily flushed)
- flush LSN (how far we've flushed WAL to local disk)
- syncrep LSN (how far the sync replica has confirmed WAL)

So, why couldn't there be a similar "throttling threshold" for these
events too? Imagine we have three GUCs, with values satisfying this:

  wal_write_after < wal_flush_after_local < wal_flush_after_remote

and this meaning:

  wal_write_after - if a backend generates this amount of WAL, it will
                    write the completed WAL (but only whole pages)

  wal_flush_after_local - if a backend generates this amount of WAL, it
                          will not only write the WAL, but also issue a
                          flush (if still needed)

  wal_flush_after_remote - if this amount of WAL is generated, it will
                           wait for syncrep to confirm the flushed LSN

The attached PoC patch does this, mostly the same way as the earlier
patches. XLogInsertRecord is where we decide whether throttling may be
needed, and HandleXLogDelayPending then does the actual work (writing
WAL, flushing it, waiting for syncrep).
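
To make that concrete, here is a rough C sketch of the decision logic as
I read it - this is not the patch code: XLogFlush() and SyncRepWaitForLSN()
are existing functions and backendWalInserted is the patch's per-backend
counter, but XLogWriteUpTo() is a simplified stand-in and the bookkeeping
is elided:

/*
 * Sketch of what HandleXLogDelayPending() would do, outside any critical
 * section. 'lastPageBoundary' is assumed to be the end of the last
 * completed WAL page produced by this backend - we never act on a
 * partially filled page. We only get here if at least wal_write_after
 * was exceeded (XLogInsertRecord set the flag).
 */
static void
HandleXLogDelayPendingSketch(XLogRecPtr lastPageBoundary)
{
    if (backendWalInserted >= wal_flush_after_remote * 1024)
    {
        /* flush locally first, then wait for the sync replica */
        XLogFlush(lastPageBoundary);
        SyncRepWaitForLSN(lastPageBoundary, false);
    }
    else if (backendWalInserted >= wal_flush_after_local * 1024)
    {
        /* write and fsync completed pages, but don't wait for syncrep */
        XLogFlush(lastPageBoundary);
    }
    else
    {
        /* wal_write_after: only write completed pages, no fsync */
        XLogWriteUpTo(lastPageBoundary);    /* simplified stand-in */
    }

    /*
     * A real implementation would remember separate "last written",
     * "last flushed" and "last confirmed" positions rather than a
     * single counter.
     */
}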

The one new thing HandleXLogDelayPending also does is auto-tuning the
values a bit. The idea is that with a per-backend threshold, it's hard to
enforce any sort of global limit, because it depends on the number of
active backends. If you set 1MB of WAL per backend, the total might be
1MB or 1000MB, if there are 1000 backends. Who knows. So this tries to
reduce the threshold (if the backend generated only a tiny fraction of
the WAL), or increase the threshold (if it generated most of it). I'm
not entirely sure this behaves sanely under all circumstances, but for a
PoC patch it seems OK.
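
My rough reading of that auto-tuning, as a sketch (entirely hypothetical -
the shared WAL counters, the scaling by MaxBackends and the clamping
bounds are all made up to illustrate the idea):

/*
 * Scale the per-backend threshold by this backend's share of recently
 * generated WAL. A backend producing most of the WAL gets a larger
 * threshold, one producing a tiny fraction gets a smaller one, so the
 * total stays roughly near the configured value.
 */
static int64
adjust_throttle_threshold(int64 base_threshold,
                          uint64 walGeneratedByMe,
                          uint64 walGeneratedTotal)
{
    double      share;
    int64       threshold;

    if (walGeneratedTotal == 0)
        return base_threshold;

    share = (double) walGeneratedByMe / (double) walGeneratedTotal;

    /* a backend with exactly a "fair" share keeps the base threshold */
    threshold = (int64) (base_threshold * share * MaxBackends);

    /* clamp to (made-up) sanity bounds */
    threshold = Max(threshold, base_threshold / 8);
    threshold = Min(threshold, base_threshold * 8);

    return threshold;
}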

The first two GUCs remind me of what walwriter is doing, and I've been
asking myself if maybe making walwriter more aggressive would have the
same effect. But I don't think so, because a big part of this throttling
patch is ... well, throttling. Making the backends sleep for a bit (or
wait for something), to slow them down. And walwriter doesn't really do
that, I think.


In a recent off-list discussion, someone asked if maybe this might be
useful to prevent emergencies due to the archiver not keeping up and WAL
filling the disk. A bit like enforcing a stricter limit on WAL than the
current max_wal_size GUC. I'm not sure about that - it's certainly a very
different use case than minimizing the impact on OLTP latency. But it
seems like "archived LSN" might be another "event" the backends could
wait for, just like they wait for syncrep to confirm an LSN. Ideally it'd
never happen, ofc, and it seems a bit like a great footgun (an outage on
the archiver may kill PROD), but if you're at risk of ENOSPC on pg_wal,
not doing anything may be risky too ...

FWIW I wonder if maybe we should frame this as a QoS feature, where
instead of "minimize impact of bulk loads" we'd try to "guarantee" or
"reserve" some part of the capacity for certain backends/...


Now, let's look at results from some of the experiments. I wanted to see
how effective this approach could be in minimizing the impact of large
bulk loads on small OLTP transactions in different setups. Thanks to the
two new GUCs this is not strictly about syncrep, so I decided to try
three cases:

1) local, i.e. single-node instance

2) syncrep on the same switch, with 0.1ms latency (1Gbit)

3) syncrep with 10ms latency (also 1Gbit)

And for each configuration I ran pgbench (30 minutes), either on its own
or concurrently with a bulk COPY of 1GB of data. The load was done either
by a single backend (so one backend loading 1GB of data), or with the
file split into 10 files of 100MB each, loaded by 10 concurrent backends.

And I did this test with three configurations:

(a) master - unpatched, current behavior

(b) throttle-1: patched with limits set like this:

   wal_write_after = '8kB'
   wal_flush_after_local = '16kB'
   wal_flush_after_remote = '32kB'

(c) throttle-2: patched with throttling limits set to 4x of (b), i.e.

   wal_write_after = '32kB'
   wal_flush_after_local = '64kB'
   wal_flush_after_remote = '128kB'

And I did this for the traditional three scales (small, medium, large),
to hit different bottlenecks. And of course, I 

Re: Syncrep and improving latency due to WAL throttling

2023-11-08 Thread Andres Freund
Hi,

On 2023-11-08 19:29:38 +0100, Tomas Vondra wrote:
> >>> I haven't checked, but I'd assume that 100 bytes back and forth should easily
> >>> fit a new message to update LSNs and the existing feedback response. Even just
> >>> the difference between sending 100 bytes and sending 10k (a bit more than a
> >>> single WAL page) is pretty significant on a 1gbit network.
> >>>
> >>
> >> I'm on decaf so I may be a bit slow, but it's not very clear to me what
> >> conclusion to draw from these numbers. What is the takeaway?
> >>
> >> My understanding is that in both cases the latency is initially fairly
> >> stable, independent of the request size. This applies to requests up to
> >> ~1000B. And then the latency starts increasing fairly quickly, even
> >> though it shouldn't hit the bandwidth (except maybe the 1MB requests).
> >
> > Except for the smallest end, these are bandwidth related, I think. Converting
> > 1gbit/s to bytes/us is 125 bytes / us - before tcp/ip overhead. Even leaving
> > the overhead aside, 10kB/100kB outstanding take ~80us/800us to send on
> > 1gbit. If you subtract the minimum latency of about 130us, that's nearly all
> > of the latency.
> >
>
> Maybe I don't understand what you mean by "bandwidth related", but surely
> the smaller requests are not limited by bandwidth. I mean, 100B and 1kB
> (and even 10kB) requests have almost the same transaction rate, yet
> there's an order of magnitude difference in bandwidth (sure, there's
> overhead, but this much?).

What I mean is that bandwidth is the biggest factor determining latency in the
numbers I showed (due to decent-sized packets and it being a local network). At
line rate it takes ~80us to send 10kB over 1gbit ethernet. So a roundtrip
cannot be faster than 80us, even if everything else added zero latency.

That's why my numbers show such a lower latency for the 10gbit network - it's
simply faster to put even small-ish amounts of data onto the wire.

That does not mean that the link is fully utilized over time - because we wait
for the other side to receive the data, wake up a user space process, send
back 100 bytes, wait for the data to be transmitted, and then wake up a
process, there are periods where the link in one direction is largely idle.
But in the case of a 10kB packet on the 1gbit network, yes, we are bandwidth
limited for ~80us (or perhaps more interestingly, we are bandwidth limited for
0.8ms when sending 100kB).
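
Spelling the arithmetic out (nothing new, just the figures above written
as formulas):

$$ 1\ \mathrm{Gbit/s} = \frac{10^9}{8}\ \mathrm{bytes/s} = 125\ \mathrm{bytes}/\mu s $$
$$ t_{10\,\mathrm{kB}} = \frac{10000}{125} = 80\ \mu s, \qquad t_{100\,\mathrm{kB}} = \frac{100000}{125} = 800\ \mu s $$

Adding the ~130us base round trip gives floors of roughly 210us and 930us,
which lines up with the measured 1gbit means of ~235us (10kB) and ~1022us
(100kB) in the corrected tables below.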

Greetings,

Andres Freund




Re: Syncrep and improving latency due to WAL throttling

2023-11-08 Thread Tomas Vondra



On 11/8/23 18:11, Andres Freund wrote:
> Hi,
> 
> On 2023-11-08 13:59:55 +0100, Tomas Vondra wrote:
>>> I used netperf's tcp_rr between my workstation and my laptop on a local 10Gbit
>>> network (albeit with a crappy external card for my laptop), to put some
>>> numbers to this. I used -r $s,100 to test sending variable-sized data to the
>>> other side, with the other side always responding with 100 bytes (assuming
>>> that'd more than fit a feedback response).
>>>
>>> Command:
>>> fields="request_size,response_size,min_latency,mean_latency,max_latency,p99_latency,transaction_rate"; echo $fields
>>> for s in 10 100 1000 10000 100000 1000000; do netperf -P0 -t TCP_RR -l 3 -H alap5 -- -r $s,100 -o "$fields"; done
>>>
>>> 10gbe:
>>>
>>> request_size  response_size  min_latency  mean_latency  max_latency  p99_latency  transaction_rate
>>> 10            100            43           64.30         390          96           15526.084
>>> 100           100            57           75.12         428          122          13286.602
>>> 1000          100            47           74.41         270          108          13412.125
>>> 10000         100            89           114.63        712          152          8700.643
>>> 100000        100            167          255.90        584          312          3903.516
>>> 1000000       100            891          1015.99       2470         1143         983.708
>>>
>>>
>>> Same hosts, but with my workstation forced to use a 1gbit connection:
>>>
>>> request_size  response_size  min_latency  mean_latency  max_latency  p99_latency  transaction_rate
>>> 10            100            78           131.18        2425         257          7613.416
>>> 100           100            81           129.25        425          255          7727.473
>>> 1000          100            100          162.12        1444         266          6161.388
>>> 10000         100            310          686.19        1797         927          1456.204
>>> 100000        100            1006         1114.20       1472         1199         896.770
>>> 1000000       100            8338         8420.96       8827         8498         118.410
> 
> Looks like the 1gbit numbers were somewhat bogus-ified due to having configured
> jumbo frames and some network component doing something odd with that
> (handling them in software maybe?).
> 
> 10gbe:
> request_size  response_size  min_latency  mean_latency  max_latency  p99_latency  transaction_rate
> 10            100            56           68.56         483          87           14562.476
> 100           100            57           75.68         353          123          13185.485
> 1000          100            60           71.97         391          94           13870.659
> 10000         100            58           92.42         489          140          10798.444
> 100000        100            184          260.48        1141         338          3834.504
> 1000000       100            926          1071.46       2012         1466         933.009
> 
> 1gbe
> request_size  response_size  min_latency  mean_latency  max_latency  p99_latency  transaction_rate
> 10            100            77           132.19        1097         257          7555.420
> 100           100            79           127.85        534          249          7810.862
> 1000          100            98           155.91        966          265          6406.818
> 10000         100            176          235.37        1451         314          4245.304
> 100000        100            944          1022.00       1380         1148         977.930
> 1000000       100            8649         8768.42       9018         8895         113.703
> 
> 
>>> I haven't checked, but I'd assume that 100 bytes back and forth should easily
>>> fit a new message to update LSNs and the existing feedback response. Even just
>>> the difference between sending 100 bytes and sending 10k (a bit more than a
>>> single WAL page) is pretty significant on a 1gbit network.
>>>
>>
>> I'm on decaf so I may be a bit slow, but it's not very clear to me what
>> conclusion to draw from these numbers. What is the takeaway?
>>
>> My understanding is that in both cases the latency is initially fairly
>> stable, independent of the request size. This applies to requests up to
>> ~1000B. And then the latency starts increasing fairly quickly, even
>> though it shouldn't hit the bandwidth (except maybe the 1MB requests).
> 

Re: Syncrep and improving latency due to WAL throttling

2023-11-08 Thread Andres Freund
Hi,

On 2023-11-08 13:59:55 +0100, Tomas Vondra wrote:
> > I used netperf's tcp_rr between my workstation and my laptop on a local 10Gbit
> > network (albeit with a crappy external card for my laptop), to put some
> > numbers to this. I used -r $s,100 to test sending variable-sized data to the
> > other side, with the other side always responding with 100 bytes (assuming
> > that'd more than fit a feedback response).
> >
> > Command:
> > fields="request_size,response_size,min_latency,mean_latency,max_latency,p99_latency,transaction_rate"; echo $fields
> > for s in 10 100 1000 10000 100000 1000000; do netperf -P0 -t TCP_RR -l 3 -H alap5 -- -r $s,100 -o "$fields"; done
> >
> > 10gbe:
> >
> > request_size  response_size  min_latency  mean_latency  max_latency  p99_latency  transaction_rate
> > 10            100            43           64.30         390          96           15526.084
> > 100           100            57           75.12         428          122          13286.602
> > 1000          100            47           74.41         270          108          13412.125
> > 10000         100            89           114.63        712          152          8700.643
> > 100000        100            167          255.90        584          312          3903.516
> > 1000000       100            891          1015.99       2470         1143         983.708
> >
> >
> > Same hosts, but with my workstation forced to use a 1gbit connection:
> >
> > request_size  response_size  min_latency  mean_latency  max_latency  p99_latency  transaction_rate
> > 10            100            78           131.18        2425         257          7613.416
> > 100           100            81           129.25        425          255          7727.473
> > 1000          100            100          162.12        1444         266          6161.388
> > 10000         100            310          686.19        1797         927          1456.204
> > 100000        100            1006         1114.20       1472         1199         896.770
> > 1000000       100            8338         8420.96       8827         8498         118.410

Looks like the 1gbit numbers were somewhat bogus-ified due to having configured
jumbo frames and some network component doing something odd with that
(handling them in software maybe?).

10gbe:
request_size  response_size  min_latency  mean_latency  max_latency  p99_latency  transaction_rate
10            100            56           68.56         483          87           14562.476
100           100            57           75.68         353          123          13185.485
1000          100            60           71.97         391          94           13870.659
10000         100            58           92.42         489          140          10798.444
100000        100            184          260.48        1141         338          3834.504
1000000       100            926          1071.46       2012         1466         933.009

1gbe
request_size  response_size  min_latency  mean_latency  max_latency  p99_latency  transaction_rate
10            100            77           132.19        1097         257          7555.420
100           100            79           127.85        534          249          7810.862
1000          100            98           155.91        966          265          6406.818
10000         100            176          235.37        1451         314          4245.304
100000        100            944          1022.00       1380         1148         977.930
1000000       100            8649         8768.42       9018         8895         113.703


> > I haven't checked, but I'd assume that 100 bytes back and forth should easily
> > fit a new message to update LSNs and the existing feedback response. Even just
> > the difference between sending 100 bytes and sending 10k (a bit more than a
> > single WAL page) is pretty significant on a 1gbit network.
> >
>
> I'm on decaf so I may be a bit slow, but it's not very clear to me what
> conclusion to draw from these numbers. What is the takeaway?
>
> My understanding is that in both cases the latency is initially fairly
> stable, independent of the request size. This applies to requests up to
> ~1000B. And then the latency starts increasing fairly quickly, even
> though it shouldn't hit the bandwidth (except maybe the 1MB requests).

Except for the smallest end, these are bandwidth related, I think. Converting
1gbit/s to bytes/us is 125 bytes / us - before tcp/ip overhead. Even leaving
the overhead aside, 10kB/100kB outstanding take ~80us/800us to send on
1gbit. If you subtract the minimum latency of about 130us, that's nearly all
of the latency.

Re: Syncrep and improving latency due to WAL throttling

2023-11-08 Thread Tomas Vondra
On 11/8/23 07:40, Andres Freund wrote:
> Hi,
> 
> On 2023-11-04 20:00:46 +0100, Tomas Vondra wrote:
>> scope
>> -
>> Now, let's talk about scope - what the patch does not aim to do. The
>> patch is explicitly intended for syncrep clusters, not async. There have
>> been proposals to also support throttling for async replicas, logical
>> replication etc. I suppose all of that could be implemented, and I do
>> see the benefit of defining some sort of maximum lag even for async
>> replicas. But the agreement was to focus on the syncrep case, where it's
>> particularly painful, and perhaps extend it in the future.
> 
> Perhaps we should take care to make the configuration extensible in that
> direction in the future?
> 

Yes, if we can come up with a suitable configuration, that would work
for the other use cases. I don't have a very good idea what to do about
replicas that may not be connected, or have connected but need to catch
up. IMHO it would be silly to turn this into "almost a sync rep".

> 
> Hm - is this feature really tied to replication, at all? Pretty much the same
> situation exists without. On an ok-ish local nvme I ran pgbench with 1 client
> and -P1. Guess where I started a VACUUM (on a fully cached table, so no
> continuous WAL flushes):
> 
> progress: 64.0 s, 634.0 tps, lat 1.578 ms stddev 0.477, 0 failed
> progress: 65.0 s, 634.0 tps, lat 1.577 ms stddev 0.546, 0 failed
> progress: 66.0 s, 639.0 tps, lat 1.566 ms stddev 0.656, 0 failed
> progress: 67.0 s, 642.0 tps, lat 1.557 ms stddev 0.273, 0 failed
> progress: 68.0 s, 556.0 tps, lat 1.793 ms stddev 0.690, 0 failed
> progress: 69.0 s, 281.0 tps, lat 3.568 ms stddev 1.050, 0 failed
> progress: 70.0 s, 282.0 tps, lat 3.539 ms stddev 1.072, 0 failed
> progress: 71.0 s, 273.0 tps, lat 3.663 ms stddev 2.602, 0 failed
> progress: 72.0 s, 261.0 tps, lat 3.832 ms stddev 1.889, 0 failed
> progress: 73.0 s, 268.0 tps, lat 3.738 ms stddev 0.934, 0 failed
> 
> At 32 clients we go from ~10k to 2.5k, with a full 2s of 0.
> 
> Subtracting pg_current_wal_flush_lsn() from pg_current_wal_insert_lsn() the
> "good times" show a delay of ~8kB (note that this includes WAL records that
> are still being inserted). Once the VACUUM runs, it's ~2-3MB.
> 
> The picture with more clients is similar.
> 
> If I instead severely limit the amount of outstanding (but not the amount of
> unflushed) WAL by setting wal_buffers to 128, latency dips quite a bit less
> (down to ~400 instead of ~260 at 1 client, ~10k to ~5k at 32).  Of course
> that's ridiculous and will completely trash performance in many other cases,
> but it shows that limiting the amount of outstanding WAL could help without
> replication as well.  With remote storage, that'd likely be a bigger
> difference.
> 

Yeah, that's an interesting idea. I think the idea of enforcing a "maximum
lag" is fairly general; the difference is which LSN the lag is measured
against. For the syncrep case it was the LSN confirmed by the replica;
what you described would measure it against either the flush LSN or the
write LSN (which would be the "outstanding" case, I think).

I guess the remote storage is somewhat similar to the syncrep case, in
that the lag includes some network communication.

> 
>> problems
>> 
>> Now let's talk about some problems - both conceptual and technical
>> (essentially review comments for the patch).
>>
>> 1) The goal of the patch is to limit the impact on latency, but the
>> relationship between WAL amounts and latency may not be linear. But we
>> don't have a good way to predict latency, and WAL lag is the only thing
>> we have, so there's that. Ultimately, it's a best effort.
> 
> It's indeed probably not linear. Realistically, to do better, we probably need
> statistics for the specific system in question - the latency impact will
> differ hugely between different storage/network.
> 

True. I can imagine two ways to measure that.

We could have a standalone tool similar to pg_test_fsync that would
mimic how we write/flush WAL, and measure the latency for different
amounts of flushed data. The DBA would then be responsible for somehow
using this to configure the database (perhaps the tool could calculate
some "optimal" value to flush).

Alternatively, we could collect timing in XLogFlush, so that we'd track
the amount of data to flush + the timing, and then use that to calculate
the expected latency (e.g. by binning by data size and using the average
latency for each bin). And then use that, somehow.

So you could say - maximum commit latency is 10ms, and the system would
be able to estimate that the maximum amount of unflushed WAL is 256kB,
and it'd enforce this distance.

Still only best effort, no guarantees, of course.
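
A minimal sketch of that second (binning) idea - purely hypothetical,
nothing like this exists in the patch; the bin layout and running-average
scheme are made up:

#define FLUSH_LAT_BINS 16

/* hypothetical stats: average XLogFlush latency per flush-size bin */
static double flush_avg_us[FLUSH_LAT_BINS];
static uint64 flush_count[FLUSH_LAT_BINS];

/* power-of-two bins: bin 0 is <= 8kB, bin 1 <= 16kB, and so on */
static int
flush_size_bin(uint64 bytes)
{
    int         bin = 0;

    while (bytes > 8192 && bin < FLUSH_LAT_BINS - 1)
    {
        bytes >>= 1;
        bin++;
    }
    return bin;
}

/* record one observation: flushed 'bytes' bytes in 'elapsed_us' */
static void
flush_latency_record(uint64 bytes, double elapsed_us)
{
    int         bin = flush_size_bin(bytes);

    flush_count[bin]++;
    /* incremental running average */
    flush_avg_us[bin] += (elapsed_us - flush_avg_us[bin]) / flush_count[bin];
}

/* largest flush size whose average latency stays under the target */
static uint64
max_unflushed_for_latency(double target_us)
{
    uint64      size = 8192;
    uint64      best = 0;

    for (int bin = 0; bin < FLUSH_LAT_BINS; bin++, size <<= 1)
    {
        if (flush_count[bin] > 0 && flush_avg_us[bin] <= target_us)
            best = size;
    }
    return best;
}

With a 10ms budget, max_unflushed_for_latency(10000.0) might then come out
around the 256kB from the example above, and that's the distance the
throttling would enforce.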

> 
>> 2) The throttling is per backend. That makes it simple, but it means
>> that it's hard to enforce a global lag limit. Imagine the limit is 8MB,
>> and with a single backend that works fine - the lag should not exceed
>> the 8MB value. But if there are N backends, the lag could be up to

Re: Syncrep and improving latency due to WAL throttling

2023-11-07 Thread Andres Freund
Hi,

On 2023-11-04 20:00:46 +0100, Tomas Vondra wrote:
> scope
> -
> Now, let's talk about scope - what the patch does not aim to do. The
> patch is explicitly intended for syncrep clusters, not async. There have
> been proposals to also support throttling for async replicas, logical
> replication etc. I suppose all of that could be implemented, and I do
> see the benefit of defining some sort of maximum lag even for async
> replicas. But the agreement was to focus on the syncrep case, where it's
> particularly painful, and perhaps extend it in the future.

Perhaps we should take care to make the configuration extensible in that
direction in the future?


Hm - is this feature really tied to replication, at all? Pretty much the same
situation exists without. On an ok-ish local nvme I ran pgbench with 1 client
and -P1. Guess where I started a VACUUM (on a fully cached table, so no
continuous WAL flushes):

progress: 64.0 s, 634.0 tps, lat 1.578 ms stddev 0.477, 0 failed
progress: 65.0 s, 634.0 tps, lat 1.577 ms stddev 0.546, 0 failed
progress: 66.0 s, 639.0 tps, lat 1.566 ms stddev 0.656, 0 failed
progress: 67.0 s, 642.0 tps, lat 1.557 ms stddev 0.273, 0 failed
progress: 68.0 s, 556.0 tps, lat 1.793 ms stddev 0.690, 0 failed
progress: 69.0 s, 281.0 tps, lat 3.568 ms stddev 1.050, 0 failed
progress: 70.0 s, 282.0 tps, lat 3.539 ms stddev 1.072, 0 failed
progress: 71.0 s, 273.0 tps, lat 3.663 ms stddev 2.602, 0 failed
progress: 72.0 s, 261.0 tps, lat 3.832 ms stddev 1.889, 0 failed
progress: 73.0 s, 268.0 tps, lat 3.738 ms stddev 0.934, 0 failed

At 32 clients we go from ~10k to 2.5k, with a full 2s of 0.

Subtracting pg_current_wal_flush_lsn() from pg_current_wal_insert_lsn() the
"good times" show a delay of ~8kB (note that this includes WAL records that
are still being inserted). Once the VACUUM runs, it's ~2-3MB.

The picture with more clients is similar.

If I instead severely limit the amount of outstanding (but not the amount of
unflushed) WAL by setting wal_buffers to 128, latency dips quite a bit less
(down to ~400 instead of ~260 at 1 client, ~10k to ~5k at 32).  Of course
that's ridiculous and will completely trash performance in many other cases,
but it shows that limiting the amount of outstanding WAL could help without
replication as well.  With remote storage, that'd likely be a bigger
difference.




> problems
> 
> Now let's talk about some problems - both conceptual and technical
> (essentially review comments for the patch).
>
> 1) The goal of the patch is to limit the impact on latency, but the
> relationship between WAL amounts and latency may not be linear. But we
> don't have a good way to predict latency, and WAL lag is the only thing
> we have, so there's that. Ultimately, it's a best effort.

It's indeed probably not linear. Realistically, to do better, we probably need
statistics for the specific system in question - the latency impact will
differ hugely between different storage/network.


> 2) The throttling is per backend. That makes it simple, but it means
> that it's hard to enforce a global lag limit. Imagine the limit is 8MB,
> and with a single backend that works fine - the lag should not exceed
> the 8MB value. But if there are N backends, the lag could be up to
> N-times 8MB, I believe. That's a bit annoying, but I guess the only
> solution would be to have some autovacuum-like cost balancing, with all
> backends (or at least those running large stuff) doing the checks more
> often. I'm not sure we want to do that.

Hm. The average case is likely fine - the throttling of the different backends
will intersperse and flush more frequently - but the worst case is presumably
part of the issue here. I wonder if we could deal with this by somehow
offsetting the points at which the different backends flush.

I doubt we want to go for something like the autovacuum balancing - that doesn't
seem to work well - but I think we could take the amount of actually unflushed
WAL into account when deciding whether to throttle. We have the necessary
state in local memory IIRC. We'd have to be careful to not throttle every
backend at the same time, or we'll introduce latency penalties that way. But
what if we scaled synchronous_commit_wal_throttle_threshold depending on the
amount of unflushed WAL? By still taking backendWalInserted into account, we'd
avoid throttling everyone at the same time, but still would make throttling
more aggressive depending on the amount of unflushed/unreplicated WAL.
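
For illustration, a sketch of that scaling (my interpretation, not actual
patch code - GetInsertRecPtr() and GetFlushRecPtr() are existing cheap,
approximate accessors, while the scaling formula itself is made up):

/*
 * Make throttling more aggressive as the globally unflushed WAL grows,
 * while still triggering on each backend's own counter
 * (backendWalInserted), so the backends don't all stall at once.
 */
static bool
wal_throttling_needed(void)
{
    uint64      threshold = (uint64) synchronous_commit_wal_throttle_threshold * 1024;
    uint64      unflushed;
    uint64      scaled;

    if (threshold == 0)
        return false;           /* feature disabled */

    unflushed = GetInsertRecPtr() - GetFlushRecPtr(NULL);

    /* shrink the effective threshold as unflushed WAL piles up */
    scaled = threshold / (1 + unflushed / threshold);

    return (uint64) backendWalInserted > scaled;
}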


> 3) The actual throttling (flush and wait for syncrep) happens in
> ProcessInterrupts(), which mostly works but it has two drawbacks:
>
>  * It may not happen "early enough" if the backend inserts a lot of
> XLOG records without processing interrupts in between.

Does such code exist? And if so, is there a reason not to fix said code?


>  * It may happen "too early" if the backend inserts enough WAL to need
> throttling (i.e. sets XLogDelayPending), but then after processing
> interrupts it 

Re: Syncrep and improving latency due to WAL throttling

2023-11-04 Thread Tomas Vondra
Hi,

I keep getting occasional complaints about the impact of large/bulk
transactions on latency of small OLTP transactions, so I'd like to
revive this thread a bit and move it forward.

Attached is a rebased v3, followed by a 0002 patch with some review
comments, missing comments and minor tweaks. More about that later ...

It's been a couple months, and there's been a fair amount of discussion
and changes earlier, so I guess it makes sense to post a summary,
stating the purpose (and scope), and then go through the various open
questions etc.


goals
-
The goal is to limit the impact of large transactions (producing a lot
of WAL) on small OLTP transactions, in a cluster with a sync replica.
Imagine a backend executing single-row inserts, or something like that.
The commit will wait for the replica to confirm the WAL, which may be
expensive, but it's well determined by the network roundtrip.

But then a large transaction comes, and inserts a lot of WAL (imagine a
COPY which inserts 100MB of data, VACUUM, CREATE INDEX and so on). A
small transaction may insert a COMMIT record right after this WAL chunk,
and locally that's (mostly) fine. But with the sync replica it's much
worse - we don't send WAL until it's flushed locally, and then we need
to wait for the WAL to be sent, applied and confirmed by the replica.
This takes time (depending on the bandwidth), and it may not happen
until the small transaction does COMMIT (because we may not flush WAL
from in-progress transactions very often).

Jakub Wartak presented some examples of the impact when he started this
thread, and it can be rather bad. Particularly for latency-sensitive
applications. I plan to do more experiments with the current patch, but
I don't have the results yet.


scope
-
Now, let's talk about scope - what the patch does not aim to do. The
patch is explicitly intended for syncrep clusters, not async. There have
been proposals to also support throttling for async replicas, logical
replication etc. I suppose all of that could be implemented, and I do
see the benefit of defining some sort of maximum lag even for async
replicas. But the agreement was to focus on the syncrep case, where it's
particularly painful, and perhaps extend it in the future.

I believe adding throttling for physical async replication should not be
difficult - in principle we need to determine how far the replica got,
and compare it to the local LSN. But there's likely complexity with
defining which async replicas to look at, inventing a sensible way to
configure this, etc. It'd be helpful if people interested in that
feature took a look at this patch and tried extending it.

It's not clear to me what to do about disconnected replicas, though. We
may not even know about them, if there's no slot (and how would we know
what the slot is for?). So maybe this would need a new GUC listing the
interesting replicas - but requiring all of them to be connected would
turn this into an availability issue.

I'm not sure about logical replication, but I guess we could treat it
similarly to async.

But what I think would need to be different is handling of small
transactions. For syncrep we automatically wait for those at commit,
which means automatic throttling. But for async (and logical), it's
trivial to cause ever-increasing lag with only tiny transactions, thanks
to the single-process replay, so maybe we'd need to throttle those too.
(The recovery prefetching improved this for async quite a bit, ofc.)


implementation
--
The implementation is fairly straightforward, and happens in two places.
XLogInsertRecord() decides if throttling might be needed for this
backend, and then HandleXLogDelayPending() does the wait.

XLogInsertRecord() checks if the backend produced a certain amount of WAL
(might be 1MB, for example). We do this because we don't want to do the
expensive stuff in HandleXLogDelayPending() too often (e.g. after every
XLOG record).

HandleXLogDelayPending() picks a suitable LSN, flushes it and then also
waits for the sync replica, as if it was a commit. This limits the lag,
i.e. the amount of WAL that the small transaction will need to wait for
to be replicated and confirmed by the replica.

There was a fair amount of discussion about how to pick the LSN. I think
the agreement is we certainly can't pick the current LSN (because that
would lead to write amplification for the partially filled page), and we
probably even want to back off a bit more, to make it more likely the LSN
is already flushed. So for example with the threshold set to 8MB we
might go back 1MB, or something like that. That'd still limit the lag.
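
Putting those two rules together, the LSN selection might look something
like this sketch (throttle_backoff_bytes is a made-up knob, e.g. 1MB for
an 8MB threshold; the page rounding mirrors the LastFullyWrittenXLogPage
computation quoted later in the thread):

/*
 * Pick the LSN to flush / wait for: back off from the end of the last
 * record this backend inserted, then round down to a page boundary so
 * we never flush (and later rewrite) a partially filled WAL page.
 */
static XLogRecPtr
pick_throttle_lsn(XLogRecPtr lastRecEnd)
{
    XLogRecPtr  target;

    if (lastRecEnd <= throttle_backoff_bytes)
        return InvalidXLogRecPtr;   /* too little WAL to bother */

    target = lastRecEnd - throttle_backoff_bytes;

    /* round down to the last completed page boundary */
    target -= target % XLOG_BLCKSZ;

    return target;
}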


problems

Now let's talk about some problems - both conceptual and technical
(essentially review comments for the patch).

1) The goal of the patch is to limit the impact on latency, but the
relationship between WAL amounts and latency may not be linear. But we
don't have a good way to predict latency, and 

Re: Syncrep and improving latency due to WAL throttling

2023-02-02 Thread Jakub Wartak
On Thu, Feb 2, 2023 at 11:03 AM Tomas Vondra
 wrote:

> > I agree that some other concurrent backend's
> > COMMIT could fsync it, but I was wondering if that's sensible
> > optimization to perform (so that issue_fsync() would be called for
> > only commit/rollback records). I can imagine a scenario with 10 such
> > concurrent backends running - all of them with this $thread-GUC set -
> > but that would cause 20k unnecessary fsyncs (?) -- (assuming single
> > HDD with IOlat=20ms and standby capable of sync-ack < 0.1ms , that
> > would be wasted close to 400s just due to local fsyncs?). I don't have
> > a strong opinion or in-depth knowledge on this, but that smells like IO waste.
> >
>
> Not sure what optimization you mean,

Let me clarify, let's say something like below (on top of the v3) just
to save IOPS:

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2340,6 +2340,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 	if (sync_method != SYNC_METHOD_OPEN &&
 		sync_method != SYNC_METHOD_OPEN_DSYNC)
 	{
+		bool		openedLogFile = false;
 		if (openLogFile >= 0 &&
 			!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
 							 wal_segment_size))
@@ -2351,9 +2352,15 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			openLogTLI = tli;
 			openLogFile = XLogFileOpen(openLogSegNo, tli);
 			ReserveExternalFD();
+			openedLogFile = true;
 		}
 
-		issue_xlog_fsync(openLogFile, openLogSegNo, tli);
+		/*
+		 * Can we bypass fsyncing the XLOG from the backend if we have
+		 * been called without a commit request?  Usually the feature
+		 * will be off here (XLogDelayPending == false).
+		 */
+		if (openedLogFile == true || XLogDelayPending == false)
+			issue_xlog_fsync(openLogFile, openLogSegNo, tli);
 	}

+ maybe some additional logic to ensure that this micro-optimization
for saving IOPS would not be enabled if the backend is calling
XLogFlush/Write() for an actual COMMIT record

> But I think the backends still have to sleep at some point, so that they
> don't queue too much unflushed WAL - that's kinda the whole point, no?

Yes, but it can be flushed to the standby, and flushed locally but not
fsynced locally (?) - provided that it was not a COMMIT - I'm just
wondering whether that makes sense (Question 1)

> The issue is more about triggering the throttling too early, before we
> hit the bandwidth limit. Which happens simply because we don't have a
> very good way to decide whether the latency is growing, so the patch
> just throttles everything.

The maximum TCP bandwidth seems to fluctuate in the real world, I
suppose, so it couldn't be a hard limit. On the other hand I can
imagine operators setting
"throttle-those-backends-if-global-WAL-latency-or-rate>XXX"
(an administrative decision). That would be cool to have, but yes, it
would require WAL latency and rate measurement first (on its own that
would make a very nice addition to pg_stat_replication). But one thing
to note is that there could be many potential latencies (& WAL
throughput rates) to consider (e.g. a quorum of 3 sync standbys having
different latencies) - which one to choose?

(Question 2) I think we have simply reached a decision point on
whether the WIP/PoC is good enough as it is (like Andres wanted, and
you +1'd that), whether it should work as you propose, or whether to
keep it as an idea for the future?

-J.




Re: Syncrep and improving latency due to WAL throttling

2023-02-02 Thread Tomas Vondra
On 2/1/23 14:40, Jakub Wartak wrote:
> On Wed, Feb 1, 2023 at 2:14 PM Tomas Vondra
>  wrote:
> 
>>> Maybe we should avoid calling fsyncs for WAL throttling? (by teaching
>>> HandleXLogDelayPending()->XLogFlush()->XLogWrite() NOT to sync when
>>> we are flushing just because of WAL throttling?) Would that still be
>>> safe?
>>
>> It's not clear to me how could this work and still be safe. I mean, we
>> *must* flush the local WAL first, otherwise the replica could get ahead
>> (if we send unflushed WAL to replica and then crash). Which would be
>> really bad, obviously.
> 
> Well it was just a thought: in this particular test - with no other
> concurrent activity happening - we are fsyncing() uncommitted
> Heap/INSERT data much earlier than the final Transaction/COMMIT WAL
> record comes into play.

Right. I see it as testing (more or less) a worst-case scenario,
measuring the impact on commands generating a lot of WAL. I'm not sure
the slowdown comes from the extra fsyncs, though - I'd bet it's more
about the extra waits for confirmations from the replica.

> I agree that some other concurrent backend's
> COMMIT could fsync it, but I was wondering if that's sensible
> optimization to perform (so that issue_fsync() would be called for
> only commit/rollback records). I can imagine a scenario with 10 such
> concurrent backends running - all of them with this $thread-GUC set -
> but that would cause 20k unnecessary fsyncs (?) -- (assuming single
> HDD with IOlat=20ms and standby capable of sync-ack < 0.1ms , that
> would be wasted close to 400s just due to local fsyncs?). I don't have
> a strong opinion or in-depth knowledge on this, but that smells like IO waste.
> 

Not sure what optimization you mean, but triggering the WAL flushes from
a separate process would be beneficial. But we already do that, more or
less - that's what WAL writer is about, right? Maybe it's not aggressive
enough or something, not sure.

But I think the backends still have to sleep at some point, so that they
don't queue too much unflushed WAL - that's kinda the whole point, no?
The issue is more about triggering the throttling too early, before we
hit the bandwidth limit. Which happens simply because we don't have a
very good way to decide whether the latency is growing, so the patch
just throttles everything.

Consider a replica on a network link with 10ms round trip. Then commit
latency can't really be better than 10ms, and throttling at that point
can't really improve anything, it just makes it slower. Instead, we
should measure the latency somehow, and only throttle when it increases.
And probably make it proportional to the delta (so the further it is above
the "minimal" latency, the more we'd throttle).

I'd imagine we'd measure the latency (or the wait for the sync replica) over
reasonably short time windows (1/10 of a second?), and use that to
drive the throttling. If the latency is below some acceptable value,
don't throttle at all. If it increases, start throttling.
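
That could be as simple as the following sketch (all hypothetical - the
window bookkeeping, the 1.1 tolerance factor and the scaling constant are
just one way to read the above):

/*
 * Decide how long to sleep, based on how far the sync-rep wait latency
 * (averaged over a short window, say ~100ms) is above the observed
 * baseline ("minimal") latency of the link.
 */
static long
throttle_sleep_us(double window_avg_us, double baseline_us)
{
    double      delta;

    /* at or near the floor the wait is just the network RTT - don't throttle */
    if (window_avg_us <= baseline_us * 1.1)
        return 0;

    delta = window_avg_us - baseline_us;

    /* sleep proportionally to the excess, capped at 100ms */
    return (long) Min(delta * 2.0, 100000.0);
}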


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Syncrep and improving latency due to WAL throttling

2023-02-01 Thread Jakub Wartak
On Wed, Feb 1, 2023 at 2:14 PM Tomas Vondra
 wrote:

> > Maybe we should avoid calling fsyncs for WAL throttling? (by teaching
> > HandleXLogDelayPending()->XLogFlush()->XLogWrite() NOT to sync when
> > we are flushing just because of WAL throttling?) Would that still be
> > safe?
>
> It's not clear to me how could this work and still be safe. I mean, we
> *must* flush the local WAL first, otherwise the replica could get ahead
> (if we send unflushed WAL to replica and then crash). Which would be
> really bad, obviously.

Well it was just a thought: in this particular test - with no other
concurrent activity happening - we are fsyncing() uncommitted
Heap/INSERT data much earlier than the final Transaction/COMMIT WAL
record comes into play. I agree that some other concurrent backend's
COMMIT could fsync it, but I was wondering if that's sensible
optimization to perform (so that issue_fsync() would be called for
only commit/rollback records). I can imagine a scenario with 10 such
concurrent backends running - all of them with this $thread-GUC set -
but that would cause 20k unnecessary fsyncs (?) -- (assuming single
HDD with IOlat=20ms and standby capable of sync-ack < 0.1ms , that
would be wasted close to 400s just due to local fsyncs?). I don't have
a strong opinion or in-depth knowledge on this, but that smells like IO waste.

-J.




Re: Syncrep and improving latency due to WAL throttling

2023-02-01 Thread Tomas Vondra



On 2/1/23 11:04, Jakub Wartak wrote:
> On Mon, Jan 30, 2023 at 9:16 AM Bharath Rupireddy
>  wrote:
> 
> Hi Bharath, thanks for reviewing.
> 
>> I think measuring the number of WAL flushes postgres generates with and
>> without this feature is a great way to know the feature's effect on
>> IOPS. Probably it's even better with variations in
>> synchronous_commit_flush_wal_after or the throttling limits.
> 
> It's the same as point 19, so I covered it there.
> 
>> Although v2 is a WIP patch, I have some comments:
>> 1. Coding style doesn't conform to the Postgres standard:
> 
> Fixed all those that you mentioned, and I've removed elog() and the code
> for timing. BTW: is there a way to pgindent only my changes (on a git
> diff)?
> 
>> 2. It'd be better to add a TAP test hitting the WAL throttling.
> 
> TODO, any hints on what that test should look like are welcome
> (checking pg_stat_wal?). I could spot only 1 place for it --
> src/test/recovery/007_sync_rep.pl
> 
>> 3. We generally don't need timings to be calculated explicitly when we
>> emit before and after log messages. If needed one can calculate the
>> wait time from timestamps of the log messages. If it still needs an
>> explicit time difference which I don't think is required, because we
>> have a special event and before/after log messages, I think it needs
>> to be under track_wal_io_timing to keep things simple.
> 
> Removed, it was just debugging aid to demonstrate the effect in the
> session waiting.
> 
>> 4. XLogInsertRecord() can be a hot path for high-write workload, so
>> effects of adding anything in it needs to be measured. So, it's better
>> to run benchmarks with this feature enabled and disabled.
> 
> When the GUC is off (0 / default), in my tests the impact is none (it's
> just a set of simple IFs); however, if the feature is enabled then the
> INSERT is slowed down (I've repeated the initial test from the 1st post,
> and the single-statement INSERT for 50MB of WAL result is the same 4-5s,
> and ~23s +/- 1s when the feature is activated with an RTT of 10.1ms, but
> that's intentional).
> 
>> 5. Missing documentation of this feature and the GUC. I think we can
>> describe extensively why we'd need a feature of this kind in the
>> documentation for better adoption or usability.
> 
> TODO, I'm planning on adding documentation when we'll be a little bit
> closer to adding to CF.
> 
>> 6. Shouldn't the max limit be MAX_KILOBYTES?
>> +&synchronous_commit_flush_wal_after,
>> +0, 0, 1024L*1024L,
> 
> Fixed.
> 
>> 7. Can this GUC name be something like
>> synchronous_commit_wal_throttle_threshold to better reflect what it
>> does for a backend?
>> +{"synchronous_commit_flush_wal_after", PGC_USERSET,
>> REPLICATION_SENDING,
> 
> Done.
> 
>> 8. A typo - s/confiration/confirmation
> [..]
>> 9. This
>> "Sets the maximum logged WAL in kbytes, after which wait for sync
>> commit confiration even without commit "
>> better be something like below?
>> "Sets the maximum amount of WAL in kilobytes a backend generates after
>> which it waits for synchronous commit confiration even without commit
>> "
> 
> Fixed as you have suggested.
> 
>> 10. I think there's nothing wrong in adding some assertions in
>> HandleXLogDelayPending():
>> Assert(synchronous_commit_flush_wal_after > 0);
>> Assert(backendWalInserted > synchronous_commit_flush_wal_after * 1024L);
>> Assert(XactLastThrottledRecEnd is not InvalidXLogRecPtr);
> 
> Sure, added.
> 
>> 11. Specify the reason in the comments as to why we had to do the
>> following things:
>> Here:
>> +/* flush only up to the last fully filled page */
>> +XLogRecPtr LastFullyWrittenXLogPage = XactLastThrottledRecEnd - (XactLastThrottledRecEnd % XLOG_BLCKSZ);
>> Talk about how this avoids multiple-flushes for the same page.
>>
>> Here:
>> + * Called from ProcessMessageInterrupts() to avoid waiting while
>> being in critical section
>> + */
>> +void HandleXLogDelayPending()
>> Talk about how waiting in a critical section, that is in
>> XLogInsertRecord() causes locks to be held longer durations and other
>> effects.
> 
> Added.
> 
>> Here:
>> +/* WAL throttling */
>> +backendWalInserted += rechdr->xl_tot_len;
>> Be a bit more verbose about why we try to throttle WAL and why only
>> for sync replication, the advantages, effects etc.
> 
> Added.
> 
>> 12. This better be a DEBUG1 or 2 message instead of WARNINGs to not
>> clutter server logs.
>> +/* XXX Debug for now */
>> +elog(WARNING, "throttling WAL down on this session (backendWalInserted=%d, LSN=%X/%X flushingTo=%X/%X)",
>> +backendWalInserted,
>> +LSN_FORMAT_ARGS(XactLastThrottledRecEnd),
>> +LSN_FORMAT_ARGS(LastFullyWrittenXLogPage));
>>
>> +elog(WARNING, "throttling WAL down on this session - end (%f ms)", timediff);
> 
> OK, that's entirely removed.
> 
>> 13. BTW, we don't need to hold interrupts while waiting for sync
>> replication ack as it may block query cancels or proc die pendings.
>> You can 

Re: Syncrep and improving latency due to WAL throttling

2023-02-01 Thread Jakub Wartak
On Mon, Jan 30, 2023 at 9:16 AM Bharath Rupireddy
 wrote:

Hi Bharath, thanks for reviewing.

> I think measuring the number of WAL flushes postgres generates with and
> without this feature is a great way to know the feature's effect on
> IOPS. Probably it's even better with variations in
> synchronous_commit_flush_wal_after or the throttling limits.

It's the same as point 19, so I covered it there.

> Although v2 is a WIP patch, I have some comments:
> 1. Coding style doesn't conform to the Postgres standard:

Fixed all those that you mentioned, and I've removed elog() and the code
for timing. BTW: is there a way to pgindent only my changes (on a git
diff)?

> 2. It'd be better to add a TAP test hitting the WAL throttling.

TODO, any hints on what that test should look like are welcome
(checking pg_stat_wal?). I could spot only 1 place for it --
src/test/recovery/007_sync_rep.pl

> 3. We generally don't need timings to be calculated explicitly when we
> emit before and after log messages. If needed one can calculate the
> wait time from timestamps of the log messages. If it still needs an
> explicit time difference which I don't think is required, because we
> have a special event and before/after log messages, I think it needs
> to be under track_wal_io_timing to keep things simple.

Removed, it was just debugging aid to demonstrate the effect in the
session waiting.

> 4. XLogInsertRecord() can be a hot path for high-write workload, so
> effects of adding anything in it needs to be measured. So, it's better
> to run benchmarks with this feature enabled and disabled.

When the GUC is off (0 / default), in my tests the impact is none (it's
just a set of simple IFs); however, if the feature is enabled then the
INSERT is slowed down (I've repeated the initial test from the 1st post,
and the single-statement INSERT for 50MB of WAL result is the same 4-5s,
and ~23s +/- 1s when the feature is activated with an RTT of 10.1ms, but
that's intentional).

> 5. Missing documentation of this feature and the GUC. I think we can
> describe extensively why we'd need a feature of this kind in the
> documentation for better adoption or usability.

TODO, I'm planning on adding documentation when we'll be a little bit
closer to adding to CF.

> 6. Shouldn't the max limit be MAX_KILOBYTES?
> +&synchronous_commit_flush_wal_after,
> +0, 0, 1024L*1024L,

Fixed.

> 7. Can this GUC name be something like
> synchronous_commit_wal_throttle_threshold to better reflect what it
> does for a backend?
> +{"synchronous_commit_flush_wal_after", PGC_USERSET,
> REPLICATION_SENDING,

Done.

> 8. A typo - s/confiration/confirmation
[..]
> 9. This
> "Sets the maximum logged WAL in kbytes, after which wait for sync
> commit confiration even without commit "
> better be something like below?
> "Sets the maximum amount of WAL in kilobytes a backend generates after
> which it waits for synchronous commit confiration even without commit
> "

Fixed as you have suggested.

> 10. I think there's nothing wrong in adding some assertions in
> HandleXLogDelayPending():
> Assert(synchronous_commit_flush_wal_after > 0);
> Assert(backendWalInserted > synchronous_commit_flush_wal_after * 1024L);
> Assert(XactLastThrottledRecEnd is not InvalidXLogRecPtr);

Sure, added.

> 11. Specify the reason in the comments as to why we had to do the
> following things:
> Here:
> +/* flush only up to the last fully filled page */
> +XLogRecPtr LastFullyWrittenXLogPage = XactLastThrottledRecEnd - (XactLastThrottledRecEnd % XLOG_BLCKSZ);
> Talk about how this avoids multiple-flushes for the same page.
>
> Here:
> + * Called from ProcessMessageInterrupts() to avoid waiting while
> being in critical section
> + */
> +void HandleXLogDelayPending()
> Talk about how waiting in a critical section, that is in
> XLogInsertRecord() causes locks to be held longer durations and other
> effects.

Added.

> Here:
> +/* WAL throttling */
> +backendWalInserted += rechdr->xl_tot_len;
> Be a bit more verbose about why we try to throttle WAL and why only
> for sync replication, the advantages, effects etc.

Added.

> 12. This better be a DEBUG1 or 2 message instead of WARNINGs to not
> clutter server logs.
> +/* XXX Debug for now */
> +elog(WARNING, "throttling WAL down on this session (backendWalInserted=%d, LSN=%X/%X flushingTo=%X/%X)",
> +backendWalInserted,
> +LSN_FORMAT_ARGS(XactLastThrottledRecEnd),
> +LSN_FORMAT_ARGS(LastFullyWrittenXLogPage));
>
> +elog(WARNING, "throttling WAL down on this session - end (%f ms)", timediff);

OK, that's entirely removed.

> 13. BTW, we don't need to hold interrupts while waiting for sync
> replication ack as it may block query cancels or proc die pendings.
> You can remove these, unless there's a strong reason.
> +//HOLD_INTERRUPTS();
> +//RESUME_INTERRUPTS();

Sure, removed. However, one problem I've discovered is that we were
hitting Assert(InterruptHoldoffCount > 0) in 

Re: Syncrep and improving latency due to WAL throttling

2023-01-30 Thread Bharath Rupireddy
On Sat, Jan 28, 2023 at 6:06 AM Tomas Vondra
 wrote:
>
> >
> > That's not the sole goal, from my end: I'd like to avoid writing out +
> > flushing the WAL in too small chunks.  Imagine a few concurrent vacuums or
> > COPYs or such - if we're unlucky they'd each end up exceeding their "private"
> > limit close to each other, leading to a number of small writes of the
> > WAL. Which could end up increasing local commit latency / iops.
> >
> > If we instead decide to only ever flush up to something like
> >   last_page_boundary - 1/8 * throttle_pages * XLOG_BLCKSZ
> >
> > we'd make sure that the throttling mechanism won't cause a lot of small
> > writes.
> >
>
> I'm not saying we shouldn't do this, but I still don't see how this
> could make a measurable difference. At least assuming a sensible value
> of the throttling limit (say, more than 256kB per backend), and OLTP
> workload running concurrently. That means ~64 extra flushes/writes per
> 16MB segment (at most). Yeah, a particular page might get unlucky and be
> flushed by multiple backends, but the average still holds. Meanwhile,
> the OLTP transactions will generate (at least) an order of magnitude
> more flushes.

I think measuring the number of WAL flushes postgres generates with and
without this feature is a great way to know the feature's effect on
IOPS. Probably it's even better with variations in
synchronous_commit_flush_wal_after or the throttling limits.

On Sat, Jan 28, 2023 at 6:08 AM Tomas Vondra
 wrote:
>
> On 1/27/23 22:19, Andres Freund wrote:
> > Hi,
> >
> > On 2023-01-27 12:06:49 +0100, Jakub Wartak wrote:
> >> On Thu, Jan 26, 2023 at 4:49 PM Andres Freund  wrote:
> >>
> >>> Huh? Why did you remove the GUC?
> >>
> >> After reading previous threads, my optimism about it ever getting into
> >> a shape that would be widely accepted degraded significantly (mainly due
> >> to the discussion of the wider category of 'WAL I/O throttling', especially
> >> the async case, RPO targets in the async case, and potentially calculating
> >> global bandwidth).
> >
> > I think it's quite reasonable to limit this to a smaller scope. Particularly
> > because those other goals are pretty vague but ambitious. IMO the
> > problem with a lot of the threads is precisely that they aimed at a level
> > of generality that isn't achievable in one step.
> >
>
> +1 to that

Okay, I agree to limit the scope first to synchronous replication.

Although v2 is a WIP patch, I have some comments:
1. Coding style doesn't conform to the Postgres standard:
+/*
+ * Called from ProcessMessageInterrupts() to avoid waiting while being in critical section
+ */
80-char line limit exceeded

+void HandleXLogDelayPending()
A new line missing between function return type and functin name

+elog(WARNING, "throttling WAL down on this session (backendWalInserted=%d, LSN=%X/%X flushingTo=%X/%X)",
+backendWalInserted,
+LSN_FORMAT_ARGS(XactLastThrottledRecEnd),
+LSN_FORMAT_ARGS(LastFullyWrittenXLogPage));
Indentation issue - spaces needed on the continuation lines after elog(WARNING, ...

+elog(WARNING, "throttling WAL down on this session - end (%f ms)", timediff);
80-char line limit exceeded; timediff) should be on a new line.

+//RESUME_INTERRUPTS();
+//HOLD_INTERRUPTS();
Multi-line comments are used elsewhere in the code.

Better to run pgindent on the patch.

2. It'd be better to add a TAP test hitting the WAL throttling.

3. We generally don't need timings to be calculated explicitly when we
emit before and after log messages. If needed one can calculate the
wait time from timestamps of the log messages. If it still needs an
explicit time difference which I don't think is required, because we
have a special event and before/after log messages, I think it needs
to be under track_wal_io_timing to keep things simple.

4. XLogInsertRecord() can be a hot path for high-write workload, so
effects of adding anything in it needs to be measured. So, it's better
to run benchmarks with this feature enabled and disabled.

5. Missing documentation of this feature and the GUC. I think we can
describe extensively why we'd need a feature of this kind in the
documentation for better adoption or usability.

6. Shouldn't the max limit be MAX_KILOBYTES?
+&synchronous_commit_flush_wal_after,
+0, 0, 1024L*1024L,

7. Can this GUC name be something like
synchronous_commit_wal_throttle_threshold to better reflect what it
does for a backend?
+{"synchronous_commit_flush_wal_after", PGC_USERSET,
REPLICATION_SENDING,

8. A typo - s/confiration/confirmation
+gettext_noop("Sets the maximum logged WAL in kbytes, after which wait for sync commit confiration even without commit "),

9. This
"Sets the maximum logged WAL in kbytes, after which wait for sync
commit confiration even without commit "
better be something like below?
"Sets the maximum amount of WAL in kilobytes a backend generates after
which it waits for synchronous commit confiration even without commit
"

10. I think 

Re: Syncrep and improving latency due to WAL throttling

2023-01-27 Thread Tomas Vondra
On 1/27/23 22:19, Andres Freund wrote:
> Hi,
> 
> On 2023-01-27 12:06:49 +0100, Jakub Wartak wrote:
>> On Thu, Jan 26, 2023 at 4:49 PM Andres Freund  wrote:
>>
>>> Huh? Why did you remove the GUC?
>>
>> After reading previous threads, my optimism about it ever getting into
>> a shape that would be widely accepted degraded significantly (mainly due
>> to the discussion of the wider category of 'WAL I/O throttling', especially
>> the async case, RPO targets in the async case, and potentially calculating
>> global bandwidth).
> 
> I think it's quite reasonable to limit this to a smaller scope. Particularly
> because those other goals are pretty vague but ambitious. IMO the
> problem with a lot of the threads is precisely that they aimed at a level
> of generality that isn't achievable in one step.
> 

+1 to that

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Syncrep and improving latency due to WAL throttling

2023-01-27 Thread Tomas Vondra



On 1/27/23 22:33, Andres Freund wrote:
> Hi,
> 
> On 2023-01-27 21:45:16 +0100, Tomas Vondra wrote:
>> On 1/27/23 08:18, Bharath Rupireddy wrote:
>>>> I think my idea of only forcing to flush/wait an LSN some distance in the
>>>> past would automatically achieve that?
>>>
>>> I'm sorry, I couldn't get your point, can you please explain it a bit more?
>>>
>>
>> The idea is that we would not flush the exact current LSN, because
>> that's likely somewhere in the page, and we always write the whole page
>> which leads to write amplification.
>>
>> But if we backed off a bit, and wrote e.g. to the last page boundary,
>> that wouldn't have this issue (either the page was already flushed -
>> noop, or we'd have to flush it anyway).
> 
> Yep.
> 
> 
>> We could even back off a bit more, to increase the probability it was
>> actually flushed / sent to standby.
> 
> That's not the sole goal, from my end: I'd like to avoid writing out +
> flushing the WAL in too small chunks.  Imagine a few concurrent vacuums or
> COPYs or such - if we're unlucky they'd each end up exceeding their "private"
> limit close to each other, leading to a number of small writes of the
> WAL. Which could end up increasing local commit latency / iops.
> 
> If we instead decide to only ever flush up to something like
>   last_page_boundary - 1/8 * throttle_pages * XLOG_BLCKSZ
> 
> we'd make sure that the throttling mechanism won't cause a lot of small
> writes.
> 

I'm not saying we shouldn't do this, but I still don't see how this
could make a measurable difference. At least assuming a sensible value
of the throttling limit (say, more than 256kB per backend), and OLTP
workload running concurrently. That means ~64 extra flushes/writes per
16MB segment (at most). Yeah, a particular page might get unlucky and be
flushed by multiple backends, but the average still holds. Meanwhile,
the OLTP transactions will generate (at least) an order of magnitude
more flushes.

> 
>>> Keeping replication lag under check enables one to provide a better
>>> RPO guarantee as discussed in the other thread
>>> https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.
>>>
>>
>> Isn't that a bit over-complicated? RPO generally only cares about xacts
>> that committed (because that's what you want to not lose), so why not
>> simply introduce a "sync mode" that uses a bit older LSN when
>> waiting for the replica? Seems much simpler and similar to what we
>> already do.
> 
> I don't think that really helps you that much. If there's e.g. a huge VACUUM /
> COPY emitting loads of WAL you'll suddenly see commit latency of a
> concurrently committing transactions spike into oblivion. Whereas a general
> WAL throttling mechanism would throttle the VACUUM, without impacting the
> commit latency of normal transactions.
> 

True, but it solves the RPO goal which is what the other thread was about.

IMHO it's useful to look at this as a resource scheduling problem:
limited WAL bandwidth consumed by backends, with the bandwidth
distributed using some sort of policy.

The patch discussed in this thread uses a fundamentally unfair policy,
with throttling applied only to backends that produce a lot of WAL,
trying to leave the OLTP traffic as unaffected as possible.

The RPO thread seems to be aiming for a "fair" policy, providing the
same fraction of bandwidth to all processes. This will affect all xacts
the same way (sleeps proportional to amount of WAL generated by the xact).

Perhaps we want such alternative scheduling policies, but it'll probably
require something like the autovacuum throttling.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Syncrep and improving latency due to WAL throttling

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 21:45:16 +0100, Tomas Vondra wrote:
> On 1/27/23 08:18, Bharath Rupireddy wrote:
> >> I think my idea of only forcing to flush/wait an LSN some distance in the 
> >> past
> >> would automatically achieve that?
> > 
> > I'm sorry, I couldn't get your point, can you please explain it a bit more?
> > 
> 
> The idea is that we would not flush the exact current LSN, because
> that's likely somewhere in the page, and we always write the whole page
> which leads to write amplification.
> 
> But if we backed off a bit, and wrote e.g. to the last page boundary,
> that wouldn't have this issue (either the page was already flushed -
> noop, or we'd have to flush it anyway).

Yep.


> We could even back off a bit more, to increase the probability it was
> actually flushed / sent to standby.

That's not the sole goal, from my end: I'd like to avoid writing out +
flushing the WAL in too small chunks.  Imagine a few concurrent vacuums or
COPYs or such - if we're unlucky they'd each end up exceeding their "private"
limit close to each other, leading to a number of small writes of the
WAL, which could end up increasing local commit latency / iops.

If we instead decide to only ever flush up to something like
  last_page_boundary - 1/8 * throttle_pages * XLOG_BLCKSZ

we'd make sure that the throttling mechanism won't cause a lot of small
writes.
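
To make that concrete, the flush target could be computed roughly like
this (just a sketch; currentLSN / last_flush_target are illustrative
names, and throttle_pages is the per-backend threshold expressed in
pages):

	/* round down to the last complete page ... */
	XLogRecPtr	page_boundary = currentLSN - (currentLSN % XLOG_BLCKSZ);

	/* ... then step back a fixed fraction of the throttle window, so
	 * concurrent bulk writers tend to converge on the same, older
	 * boundary (underflow guard omitted for brevity) */
	XLogRecPtr	flush_target = page_boundary - (throttle_pages / 8) * XLOG_BLCKSZ;

	if (flush_target > last_flush_target)
	{
		XLogFlush(flush_target);
		last_flush_target = flush_target;
	}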


> > Keeping replication lag under check enables one to provide a better
> > RPO guarantee as discussed in the other thread
> > https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.
> > 
> 
> Isn't that a bit over-complicated? RPO generally only cares about xacts
> that committed (because that's what you want to not lose), so why not
> simply introduce a "sync mode" that uses a bit older LSN when waiting
> for the replica? Seems much simpler and similar to what we already do.

I don't think that really helps you that much. If there's e.g. a huge VACUUM /
COPY emitting loads of WAL you'll suddenly see the commit latency of
concurrently committing transactions spike into oblivion. Whereas a general
WAL throttling mechanism would throttle the VACUUM, without impacting the
commit latency of normal transactions.

Greetings,

Andres Freund




Re: Syncrep and improving latency due to WAL throttling

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 12:48:43 +0530, Bharath Rupireddy wrote:
> Looking at the patch, the feature, in its current shape, focuses on
> improving replication lag (by throttling WAL on the primary) only when
> synchronous replication is enabled. Why is that? Why can't we design
> it for replication in general (async, sync, and logical replication)?
> 
> Keeping replication lag under check enables one to provide a better
> RPO guarantee as discussed in the other thread
> https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.

I think something narrower and easier to achieve is a good thing. We've
already had loads of discussion for the more general problem, without a lot of
actual progress.

Greetings,

Andres Freund




Re: Syncrep and improving latency due to WAL throttling

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 12:06:49 +0100, Jakub Wartak wrote:
> On Thu, Jan 26, 2023 at 4:49 PM Andres Freund  wrote:
> 
> > Huh? Why did you remove the GUC?
> 
> After reading the previous threads, my optimism about ever getting it
> into a shape that would be widely accepted degraded significantly
> (mainly due to the discussion of the wider category of 'WAL I/O
> throttling', especially in the async case, RPO targets in the async
> case, and potentially calculating global bandwidth).

I think it's quite reasonable to limit this to a smaller scope. Particularly
because those other goals are pretty vague but ambitious goals. IMO the
problem with a lot of the threads is precisely that they aimed at a level
of generality that isn't achievable in one step.


> I've assumed that it is a working sketch, and as such not having a GUC
> name right now (just for the sync case) would still allow covering
> various other async cases in the future without breaking compatibility
> with potential GUC name changes (see my previous
> "wal_throttle_larger_transactions=" proposal).

It's harder to benchmark something like this without a GUC, so I think it's
worth having, even if it's not the final name.


> > SyncRepWaitForLSN() has this comment:
> >  * 'lsn' represents the LSN to wait for.  'commit' indicates whether this 
> > LSN
> >  * represents a commit record.  If it doesn't, then we wait only for the WAL
> >  * to be flushed if synchronous_commit is set to the higher level of
> >  * remote_apply, because only commit records provide apply feedback.
> 
> Hm, not sure if I understand: are you saying that we should (in the
> throttled scenario) have some special feedback msgs or not --
> irrespective of the setting? To be honest, the throttling shouldn't
> need to honor the full standby setting, it's just about the slowdown
> (so IMHO it would be fine even in the remote_write/remote_apply
> scenario if the remote walreceiver just received the data, without
> necessarily writing it to a file or waiting for it to be applied).
> Just this waiting for a round-trip ack about LSN progress would be
> enough to slow down the writer (?). I've added some timing log into
> the draft and it shows a more or less constantly solid RTT even as it
> stands:

My problem was that the function header for SyncRepWaitForLSN() seems to say
that we don't wait at all if commit=false and synchronous_commit <
remote_apply. But I think that might just be badly worded.

   [...] 'commit' indicates whether this LSN
 * represents a commit record.  If it doesn't, then we wait only for the WAL
 * to be flushed if synchronous_commit is set to the higher level of
 * remote_apply, because only commit records provide apply feedback.

because the code does something entirely different afaics:

	/* Cap the level for anything other than commit to remote flush only. */
	if (commit)
		mode = SyncRepWaitMode;
	else
		mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);


Greetings,

Andres Freund




Re: Syncrep and improving latency due to WAL throttling

2023-01-27 Thread Tomas Vondra
On 1/27/23 08:18, Bharath Rupireddy wrote:
> On Thu, Jan 26, 2023 at 9:21 PM Andres Freund  wrote:
>>
>>> 7. I think we need to not let backends throttle too frequently even
>>> though they have crossed wal_throttle_threshold bytes. The best way is
>>> to rely on replication lag, after all the goal of this feature is to
>>> keep replication lag under check - say, throttle only when
>>> wal_distance > wal_throttle_threshold && replication_lag >
>>> wal_throttle_replication_lag_threshold.
>>
>> I think my idea of only forcing to flush/wait an LSN some distance in the 
>> past
>> would automatically achieve that?
> 
> I'm sorry, I couldn't get your point, can you please explain it a bit more?
> 

The idea is that we would not flush the exact current LSN, because
that's likely somewhere in the page, and we always write the whole page
which leads to write amplification.

But if we backed off a bit, and wrote e.g. to the last page boundary,
that wouldn't have this issue (either the page was already flushed -
noop, or we'd have to flush it anyway).

We could even back off a bit more, to increase the probability it was
actually flushed / sent to standby. That would still work, because the
whole point is not to allow one process to generate too much unflushed
WAL, forcing the other (small) xacts to wait at commit.

Imagine we have the limit set to 8MB, i.e. the backend flushes after
accumulating 8MB of unflushed WAL. If we flush to the exact current LSN,
the other backends will wait for ~4MB on average. If we back off by 1MB
(flushing only up to 1MB behind the current LSN), the unflushed amount
cycles between 1MB and 8MB, so the average wait increases to ~4.5MB.
(This is simplified, as it ignores WAL from the small xacts. But those
flush regularly, which limits the amount. It also ignores that there
might be multiple large xacts.)

> Looking at the patch, the feature, in its current shape, focuses on
> improving replication lag (by throttling WAL on the primary) only when
> synchronous replication is enabled. Why is that? Why can't we design
> it for replication in general (async, sync, and logical replication)?
> 

This focuses on sync rep, because that's where the commit latency comes
from. Async doesn't have that issue, because it doesn't wait for the
standby.

In particular, the trick is in penalizing the backends generating a lot
of WAL, while leaving the small xacts alone.

> Keeping replication lag under check enables one to provide a better
> RPO guarantee as discussed in the other thread
> https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.
> 

Isn't that a bit over-complicated? RPO generally only cares about xacts
that committed (because that's what you want to not lose), so why not
simply introduce a "sync mode" that uses a bit older LSN when waiting
for the replica? Seems much simpler and similar to what we already do.

Yeah, if someone generates a lot of WAL in an uncommitted transaction,
all of that may be lost. But who cares (from the RPO point of view)?
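
A minimal sketch of what such a "sync mode" might look like at commit
time (rpo_slack_bytes is a hypothetical GUC bounding the acceptable
data loss; XactLastRecEnd and SyncRepWaitForLSN() are the existing
ones):

	/* wait for the replica to confirm a slightly older LSN */
	XLogRecPtr	wait_for = XactLastRecEnd;

	if (wait_for > rpo_slack_bytes)
		wait_for -= rpo_slack_bytes;

	SyncRepWaitForLSN(wait_for, true);

The commit then returns once the replica confirms everything up to
wait_for, so roughly rpo_slack_bytes bounds how much committed WAL
could be lost.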


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Syncrep and improving latency due to WAL throttling

2023-01-27 Thread Jakub Wartak
Hi Bharath,

On Fri, Jan 27, 2023 at 12:04 PM Bharath Rupireddy
 wrote:
>
> On Fri, Jan 27, 2023 at 2:03 PM Alvaro Herrera  
> wrote:
> >
> > On 2023-Jan-27, Bharath Rupireddy wrote:
> >
> > > Looking at the patch, the feature, in its current shape, focuses on
> > > improving replication lag (by throttling WAL on the primary) only when
> > > synchronous replication is enabled. Why is that? Why can't we design
> > > it for replication in general (async, sync, and logical replication)?
> > >
> > > Keeping replication lag under check enables one to provide a better
> > > RPO guarantee

Sorry for not answering earlier; although the title of the thread
suggests SyncRep only, I think everyone would be open to also covering
the more advanced async scenarios that Satyanarayana proposed in those
earlier threads (as Simon did much earlier). I was proposing
wal_throttle_larger_transactions=<..> (for lack of a better name),
however today's v2 of the patch again contains a reference to syncrep
(it could be changed of course). It's just the code that is missing,
which could be added on top of v2, so we could combine our efforts.
It's the competency and time that I lack for implementing such
async-scenario code paths (maybe Tomas V. has something in mind, per
his own words [1]), so any feedback from senior hackers is more than
welcome ...

-J.

[1] - "The solution I've imagined is something like autovacuum
throttling - do some accounting of how much "WAL bandwidth" each
process consumed, and then do the delay/sleep in a suitable place. "




Re: Syncrep and improving latency due to WAL throttling

2023-01-27 Thread Jakub Wartak
Hi,

v2 is attached.

On Thu, Jan 26, 2023 at 4:49 PM Andres Freund  wrote:

> Huh? Why did you remove the GUC?

After reading the previous threads, my optimism about ever getting it
into a shape that would be widely accepted degraded significantly
(mainly due to the discussion of the wider category of 'WAL I/O
throttling', especially in the async case, RPO targets in the async
case, and potentially calculating global bandwidth). I've assumed that
it is a working sketch, and as such not having a GUC name right now
(just for the sync case) would still allow covering various other async
cases in the future without breaking compatibility with potential GUC
name changes (see my previous "wal_throttle_larger_transactions="
proposal). Although I've restored the synchronous_commit_flush_wal_after
GUC into the v2 patch, sticking to such a name closes off using the code
for async WAL throttling in the future.

> Clearly this isn't something we can default to on..

Yes, I agree. It's never meant to be on by default.

> I don't like the resets around like this. Why not just move it into
> XLogFlush()?

Fixed.

> Hm - wonder if it'd be better to determine the LSN to wait for in
> XLogInsertRecord(). There'll be plenty cases where we'll insert multiple (but
> bounded number of) WAL records before processing interrupts. No need to flush
> more aggressively than necessary...

Fixed.

> SyncRepWaitForLSN() has this comment:
>  * 'lsn' represents the LSN to wait for.  'commit' indicates whether this LSN
>  * represents a commit record.  If it doesn't, then we wait only for the WAL
>  * to be flushed if synchronous_commit is set to the higher level of
>  * remote_apply, because only commit records provide apply feedback.

Hm, not sure if I understand: are you saying that we should (in the
throttled scenario) have some special feedback msgs or not --
irrespective of the setting? To be honest, the throttling shouldn't
need to honor the full standby setting, it's just about the slowdown
(so IMHO it would be fine even in the remote_write/remote_apply
scenario if the remote walreceiver just received the data, without
necessarily writing it to a file or waiting for it to be applied).
Just this waiting for a round-trip ack about LSN progress would be
enough to slow down the writer (?). I've added some timing log into
the draft and it shows a more or less constantly solid RTT even as it
stands:

psql:insertBIG.sql:2: WARNING:  throttling WAL down on this session
(backendWalInserted=262632, LSN=2/A6CB70B8 flushingTo=2/A6CB6000)
psql:insertBIG.sql:2: WARNING:  throttling WAL down on this session -
end (10.500052 ms)
psql:insertBIG.sql:2: WARNING:  throttling WAL down on this session
(backendWalInserted=262632, LSN=2/A6CF7C08 flushingTo=2/A6CF6000)
psql:insertBIG.sql:2: WARNING:  throttling WAL down on this session -
end (10.655370 ms)
psql:insertBIG.sql:2: WARNING:  throttling WAL down on this session
(backendWalInserted=262632, LSN=2/A6D385E0 flushingTo=2/A6D38000)
psql:insertBIG.sql:2: WARNING:  throttling WAL down on this session -
end (10.627334 ms)
psql:insertBIG.sql:2: WARNING:  throttling WAL down on this session
(backendWalInserted=262632, LSN=2/A6D78FA0 flushingTo=2/A6D78000)
[..]

> I think we'd want a distinct wait event for this.

Added and tested that it shows up.

> > +volatile sig_atomic_t XLogDelayPending = false;
> I don't think the new variable needs to be volatile, or even
> sig_atomic_t. We'll just manipulate it from outside signal handlers.

Changed to bool; previously I wanted it to "fit" the ones above.

-J.


v2-0001-WIP-Syncrep-and-improving-latency-due-to-WAL-thro.patch
Description: Binary data


Re: Syncrep and improving latency due to WAL throttling

2023-01-27 Thread Bharath Rupireddy
On Fri, Jan 27, 2023 at 2:03 PM Alvaro Herrera  wrote:
>
> On 2023-Jan-27, Bharath Rupireddy wrote:
>
> > Looking at the patch, the feature, in its current shape, focuses on
> > improving replication lag (by throttling WAL on the primary) only when
> > synchronous replication is enabled. Why is that? Why can't we design
> > it for replication in general (async, sync, and logical replication)?
> >
> > Keeping replication lag under check enables one to provide a better
> > RPO guarantee
>
> Hmm, but you can already do that by tuning walwriter, no?

IIUC, one can increase wal_writer_delay or wal_writer_flush_after to
control the amount of WAL the walwriter writes and flushes, so that the
walsenders get a bit slower, thus improving replication lag. If my
understanding is correct, what about other backends doing WAL writes
and flushes? How do we control those?

I think tuning walwriter alone may not help to keep replication lag
under check. Even if it does, it requires manual intervention.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Syncrep and improving latency due to WAL throttling

2023-01-27 Thread Alvaro Herrera
On 2023-Jan-27, Bharath Rupireddy wrote:

> Looking at the patch, the feature, in its current shape, focuses on
> improving replication lag (by throttling WAL on the primary) only when
> synchronous replication is enabled. Why is that? Why can't we design
> it for replication in general (async, sync, and logical replication)?
> 
> Keeping replication lag under check enables one to provide a better
> RPO guarantee

Hmm, but you can already do that by tuning walwriter, no?

-- 
Álvaro Herrera




Re: Syncrep and improving latency due to WAL throttling

2023-01-26 Thread Bharath Rupireddy
On Thu, Jan 26, 2023 at 9:21 PM Andres Freund  wrote:
>
> > 7. I think we need to not let backends throttle too frequently even
> > though they have crossed wal_throttle_threshold bytes. The best way is
> > to rely on replication lag, after all the goal of this feature is to
> > keep replication lag under check - say, throttle only when
> > wal_distance > wal_throttle_threshold && replication_lag >
> > wal_throttle_replication_lag_threshold.
>
> I think my idea of only forcing to flush/wait an LSN some distance in the past
> would automatically achieve that?

I'm sorry, I couldn't get your point, can you please explain it a bit more?

Looking at the patch, the feature, in its current shape, focuses on
improving replication lag (by throttling WAL on the primary) only when
synchronous replication is enabled. Why is that? Why can't we design
it for replication in general (async, sync, and logical replication)?

Keeping replication lag under check enables one to provide a better
RPO guarantee as discussed in the other thread
https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Syncrep and improving latency due to WAL throttling

2023-01-26 Thread Tomas Vondra



On 1/26/23 16:40, Andres Freund wrote:
> Hi,
> 
> On 2023-01-26 12:08:16 +0100, Tomas Vondra wrote:
>> It's not clear to me how it could cause deadlocks, as we're not waiting
>> for a lock/resource locked by someone else, but it's certainly an issue
>> for uninterruptible hangs.
> 
> Maybe not. But I wouldn't want to bet on it. It's a violation of all kinds of
> lock-ordering rules.
> 

Not sure what lock ordering issues you have in mind, but I agree it's
not the right place for the sleep, no argument here.

> 
>>> My best idea for how to implement this in a somewhat safe way would be for
>>> XLogInsertRecord() to set a flag indicating that we should throttle, and set
>>> InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed 
>>> to
>>> proceed (i.e. we'll not be in a critical / interrupts off section) can
>>> actually perform the delay.  That should fix the hard deadlock danger and
>>> remove most of the increase in lock contention.
>>>
>>
>> The solution I've imagined is something like autovacuum throttling - do
>> some accounting of how much "WAL bandwidth" each process consumed, and
>> then do the delay/sleep in a suitable place.
> 
> Where would such a point be, except for ProcessInterrupts()? Iteratively
> adding a separate set of "wal_delay()" points all over the executor,
> commands/, ... seems like a bad plan.
> 

I haven't thought about where to do the work, TBH. ProcessInterrupts()
may very well be a good place.

I should have been clearer, but the main benefit of autovacuum-like
throttling is IMHO that it involves all processes and a global limit,
while the current approach is per-backend.

> 
>>> I don't think doing an XLogFlush() of a record that we just wrote is a good
>>> idea - that'll sometimes significantly increase the number of fdatasyncs/sec
>>> we're doing. To make matters worse, this will often cause partially filled 
>>> WAL
>>> pages to be flushed out - rewriting a WAL page multiple times can
>>> significantly increase overhead on the storage level. At the very least 
>>> this'd
>>> have to flush only up to the last fully filled page.
>>>
>>
>> I don't see why that would significantly increase the number of flushes.
>> The goal is to do this only every ~1MB of a WAL or so, and chances are
>> there were many (perhaps hundreds or more) commits in between. At least
>> in a workload with a fair number of OLTP transactions (which is kinda
>> the point of all this).
> 
> Because the accounting is done separately in each process. Even if you just
> add a few additional flushes for each connection, in aggregate that'll be a
> lot.
> 

How come? Imagine the backend flushes only after accumulating e.g. 1MB
of WAL. Small transactions won't do any additional flushes at all
(because commit resets the accounting). Large transactions will do an
extra flush every 1MB, so 16x per WAL segment. But in between there will
be many commits from the small transactions. If we back off to the last
complete page, that should eliminate even most of these flushes.

So where would the additional flushes come from?
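
To spell out the accounting this argument assumes (a sketch with
illustrative names; THROTTLE_BYTES stands in for the GUC):

	/* bulk writer: one extra flush per THROTTLE_BYTES of WAL, rounded
	 * down to a page boundary - for 1MB that's at most ~16 extra
	 * flushes per 16MB segment */
	if (backendWalInserted > THROTTLE_BYTES)
	{
		XLogFlush(XactLastRecEnd - (XactLastRecEnd % XLOG_BLCKSZ));
		backendWalInserted = 0;
	}

	/* small OLTP xact: commit flushes anyway, so the counter is simply
	 * reset there (or in XLogFlush() itself, as suggested elsewhere in
	 * this thread), and the branch above never fires */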

> 
>> And the "small" OLTP transactions don't really do any extra fsyncs,
>> because the accounting resets at commit (or places that flush WAL).
>> [...]
>>> Just counting the number of bytes inserted by a backend will make the 
>>> overhead
>>> even worse, as the flush will be triggered even for OLTP sessions doing tiny
>>> transactions, even though they don't contribute to the problem you're trying
>>> to address.  How about counting how many bytes of WAL a backend has inserted
>>> since the last time that backend did an XLogFlush()?
>>>
>>
>> No, we should reset the counter at commit, so small OLTP transactions
>> should not really trigger this. That's kinda the point, to only
>> delay "large" transactions producing a lot of WAL.
> 
> I might have missed something, but I don't think the first version of patch
> resets the accounting at commit?
> 

Yeah, it doesn't. It's mostly a minimal PoC patch, sufficient to test
the behavior / demonstrate the effect.

> 
>> Same for the flushes of partially flushed pages - if there's enough of
>> small OLTP transactions, we're already having this issue. I don't see
>> why this would make it measurably worse.
> 
> Yes, we already have that problem, and it hurts. *Badly*. I don't see how v1
> could *not* make it worse. Suddenly you get a bunch of additional XLogFlush()
> calls to partial boundaries by every autovacuum, by every process doing a bit
> more bulky work. Because you're flushing the LSN after a record you just
> inserted, you're commonly not going to be "joining" the work of an already
> in-process XLogFlush().
> 

Right. We do ~16 additional flushes per 16MB segment (or something like
that, depending on the GUC value). Considering we do thousands of commits
per segment, each of which does a flush, I don't see how this would make
it measurably worse.

> 
>> But if needed, we can simply back off to the last page boundary, 

Re: Syncrep and improving latency due to WAL throttling

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 13:33:27 +0530, Bharath Rupireddy wrote:
> 6. Backends can ignore throttling for WAL records marked as unimportant, no?

Why would that be a good idea? Not that it matters today, but those records
still need to be flushed in case of a commit by another transaction.


> 7. I think we need to not let backends throttle too frequently even
> though they have crossed wal_throttle_threshold bytes. The best way is
> to rely on replication lag, after all the goal of this feature is to
> keep replication lag under check - say, throttle only when
> wal_distance > wal_throttle_threshold && replication_lag >
> wal_throttle_replication_lag_threshold.

I think my idea of only forcing to flush/wait an LSN some distance in the past
would automatically achieve that?

Greetings,

Andres Freund




Re: Syncrep and improving latency due to WAL throttling

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 14:40:56 +0100, Jakub Wartak wrote:
> In summary:  Attached is a slightly reworked version of this patch.
> 1. Moved logic outside XLogInsertRecord() under ProcessInterrupts()
> 2. Flushes up to the last page boundary, still uses SyncRepWaitForLSN()
> 3. Removed GUC for now (always on->256kB); potentially to be restored

Huh? Why did you remove the GUC? Clearly this isn't something we can default
to on.


> diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
> index d85e313908..05d56d65f9 100644
> --- a/src/backend/access/transam/xact.c
> +++ b/src/backend/access/transam/xact.c
> @@ -2395,6 +2395,7 @@ CommitTransaction(void)
>  
>   XactTopFullTransactionId = InvalidFullTransactionId;
>   nParallelCurrentXids = 0;
> + backendWalInserted = 0;
>  
>   /*
>* done with commit processing, set current transaction state back to

I don't like the resets around like this. Why not just move it into
XLogFlush()?



> +/*
> + * Called from ProcessInterrupts() to avoid waiting while in a critical section
> + */
> +void HandleXLogDelayPending()
> +{
> +	/* flush only up to the last fully filled page */
> +	XLogRecPtr	LastFullyWrittenXLogPage = XactLastRecEnd - (XactLastRecEnd % XLOG_BLCKSZ);
> +	XLogDelayPending = false;

Hm - wonder if it'd be better to determine the LSN to wait for in
XLogInsertRecord(). There'll be plenty cases where we'll insert multiple (but
bounded number of) WAL records before processing interrupts. No need to flush
more aggressively than necessary...


> +	//HOLD_INTERRUPTS();
> +
> +	/* XXX Debug for now */
> +	elog(WARNING, "throttling WAL down on this session (backendWalInserted=%d, LSN=%X/%X flushingTo=%X/%X)",
> +		 backendWalInserted,
> +		 LSN_FORMAT_ARGS(XactLastRecEnd),
> +		 LSN_FORMAT_ARGS(LastFullyWrittenXLogPage));
> +
> +	/* XXX: refactor SyncRepWaitForLSN() to have a different waitevent than the default WAIT_EVENT_SYNC_REP */
> +	/* maybe a new WAIT_EVENT_SYNC_REP_BIG or something like that */
> +	XLogFlush(LastFullyWrittenXLogPage);
> +	SyncRepWaitForLSN(LastFullyWrittenXLogPage, false);

SyncRepWaitForLSN() has this comment:
 * 'lsn' represents the LSN to wait for.  'commit' indicates whether this LSN
 * represents a commit record.  If it doesn't, then we wait only for the WAL
 * to be flushed if synchronous_commit is set to the higher level of
 * remote_apply, because only commit records provide apply feedback.


> +	elog(WARNING, "throttling WAL down on this session - end");
> +	backendWalInserted = 0;
> +
> +	//RESUME_INTERRUPTS();
> +}

I think we'd want a distinct wait event for this.



> diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
> index 1b1d814254..8ed66b2eae 100644
> --- a/src/backend/utils/init/globals.c
> +++ b/src/backend/utils/init/globals.c
> @@ -37,6 +37,7 @@ volatile sig_atomic_t IdleSessionTimeoutPending = false;
>  volatile sig_atomic_t ProcSignalBarrierPending = false;
>  volatile sig_atomic_t LogMemoryContextPending = false;
>  volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
> +volatile sig_atomic_t XLogDelayPending = false;
>  volatile uint32 InterruptHoldoffCount = 0;
>  volatile uint32 QueryCancelHoldoffCount = 0;
>  volatile uint32 CritSectionCount = 0;

I don't think the new variable needs to be volatile, or even
sig_atomic_t. We'll just manipulate it from outside signal handlers.


Greetings,

Andres Freund




Re: Syncrep and improving latency due to WAL throttling

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 12:08:16 +0100, Tomas Vondra wrote:
> It's not clear to me how it could cause deadlocks, as we're not waiting
> for a lock/resource locked by someone else, but it's certainly an issue
> for uninterruptible hangs.

Maybe not. But I wouldn't want to bet on it. It's a violation of all kinds of
lock-ordering rules.


> > My best idea for how to implement this in a somewhat safe way would be for
> > XLogInsertRecord() to set a flag indicating that we should throttle, and set
> > InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed 
> > to
> > proceed (i.e. we'll not be in a critical / interrupts off section) can
> > actually perform the delay.  That should fix the hard deadlock danger and
> > remove most of the increase in lock contention.
> > 
> 
> The solution I've imagined is something like autovacuum throttling - do
> some accounting of how much "WAL bandwidth" each process consumed, and
> then do the delay/sleep in a suitable place.

Where would such a point be, except for ProcessInterrupts()? Iteratively
adding a separate set of "wal_delay()" points all over the executor,
commands/, ... seems like a bad plan.


> > I don't think doing an XLogFlush() of a record that we just wrote is a good
> > idea - that'll sometimes significantly increase the number of fdatasyncs/sec
> > we're doing. To make matters worse, this will often cause partially filled 
> > WAL
> > pages to be flushed out - rewriting a WAL page multiple times can
> > significantly increase overhead on the storage level. At the very least 
> > this'd
> > have to flush only up to the last fully filled page.
> > 
> 
> I don't see why that would significantly increase the number of flushes.
> The goal is to do this only every ~1MB of a WAL or so, and chances are
> there were many (perhaps hundreds or more) commits in between. At least
> in a workload with a fair number of OLTP transactions (which is kinda
> the point of all this).

Because the accounting is done separately in each process. Even if you just
add a few additional flushes for each connection, in aggregate that'll be a
lot.


> And the "small" OLTP transactions don't really do any extra fsyncs,
> because the accounting resets at commit (or places that flush WAL).
> [...]
> > Just counting the number of bytes inserted by a backend will make the 
> > overhead
> > even worse, as the flush will be triggered even for OLTP sessions doing tiny
> > transactions, even though they don't contribute to the problem you're trying
> > to address.  How about counting how many bytes of WAL a backend has inserted
> > since the last time that backend did an XLogFlush()?
> > 
> 
> No, we should reset the counter at commit, so small OLTP transactions
> should not really trigger this. That's kinda the point, to only
> delay "large" transactions producing a lot of WAL.

I might have missed something, but I don't think the first version of patch
resets the accounting at commit?


> Same for the flushes of partially flushed pages - if there's enough of
> small OLTP transactions, we're already having this issue. I don't see
> why this would make it measurably worse.

Yes, we already have that problem, and it hurts. *Badly*. I don't see how v1
could *not* make it worse. Suddenly you get a bunch of additional XLogFlush()
calls to partial boundaries by every autovacuum, by every process doing a bit
more bulky work. Because you're flushing the LSN after a record you just
inserted, you're commonly not going to be "joining" the work of an already
in-process XLogFlush().


> But if needed, we can simply back off to the last page boundary, so that we
> only flush the complete page. That would work too - the goal is not to flush
> everything, but to reduce how much of the lag is due to the current process
> (i.e. to wait for sync replica to catch up).

Yes, as I proposed...


> > I also suspect the overhead will be more manageable if you were to force a
> > flush only up to a point further back than the last fully filled page. We
> > don't want to end up flushing WAL for every page, but if you just have a
> > backend-local accounting mechanism, I think that's inevitably what you'd end
> > up with when you have a large number of sessions. But if you'd limit the
> > flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, 
> > and
> > only ever to a prior boundary, the amount of unnecessary WAL flushes would 
> > be
> > proportional to synchronous_commit_flush_wal_after.
> > 
> 
> True, that's kinda what I suggested above as a solution for partially
> filled WAL pages.

I think flushing only up to a point further in the past than the last page
boundary is somewhat important to prevent unnecessary flushes.


Greetings,

Andres Freund




Re: Syncrep and improving latency due to WAL throttling

2023-01-26 Thread Jakub Wartak
> On 1/25/23 20:05, Andres Freund wrote:
> > Hi,
> >
> > Such a feature could be useful - but I don't think the current place of
> > throttling has any hope of working reliably:
[..]
> > You're blocking in the middle of an XLOG insertion.
[..]
> Yeah, I agree the sleep would have to happen elsewhere.

Fixed.

> > My best idea for how to implement this in a somewhat safe way would be for
> > XLogInsertRecord() to set a flag indicating that we should throttle, and set
> > InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed 
> > to
> > proceed (i.e. we'll not be in a critical / interrupts off section) can
> > actually perform the delay.  That should fix the hard deadlock danger and
> > remove most of the increase in lock contention.
> >
>
> The solution I've imagined is something like autovacuum throttling - do
> some accounting of how much "WAL bandwidth" each process consumed, and
> then do the delay/sleep in a suitable place.
>

By the time you replied, I had already tried what Andres recommended.

[..]
>> At the very least this'd
> > have to flush only up to the last fully filled page.
> >
> Same for the flushes of partially flushed pages - if there's enough of
> small OLTP transactions, we're already having this issue. I don't see
> why this would make it measurably worse. But if needed, we can simply
> back off to the last page boundary, so that we only flush the complete
> page. That would work too - the goal is not to flush everything, but to
> reduce how much of the lag is due to the current process (i.e. to wait
> for sync replica to catch up).

I've introduced the rounding to the last written page (hopefully).

> > Just counting the number of bytes inserted by a backend will make the 
> > overhead
> > even worse, as the flush will be triggered even for OLTP sessions doing tiny
> > transactions, even though they don't contribute to the problem you're trying
> > to address.  How about counting how many bytes of WAL a backend has inserted
> > since the last time that backend did an XLogFlush()?
> >
>
> No, we should reset the counter at commit, so small OLTP transactions
> should not really trigger this. That's kinda the point, to only
> delay "large" transactions producing a lot of WAL.

Fixed.

> > I also suspect the overhead will be more manageable if you were to force a
> > flush only up to a point further back than the last fully filled page. We
> > don't want to end up flushing WAL for every page, but if you just have a
> > backend-local accounting mechanism, I think that's inevitably what you'd end
> > up with when you have a large number of sessions. But if you'd limit the
> > flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, 
> > and
> > only ever to a prior boundary, the amount of unnecessary WAL flushes would 
> > be
> > proportional to synchronous_commit_flush_wal_after.
> >
>
> True, that's kinda what I suggested above as a solution for partially
> filled WAL pages.
>
> I agree this is crude and we'd probably need some sort of "balancing"
> logic that distributes WAL bandwidth between backends, and throttles
> backends producing a lot of WAL.

OK - that's not included (yet?), as it would make this much more complex.

In summary:  Attached is a slightly reworked version of this patch.
1. Moved logic outside XLogInsertRecord() under ProcessInterrupts()
2. Flushes up to the last page boundary, still uses SyncRepWaitForLSN()
3. Removed GUC for now (always on->256kB); potentially to be restored
4. Resets backendWal counter on commits

It's still crude, but first tests indicate that it behaves similarly
(to the initial one with GUC = 256kB).

Also, from the Bharath email I've found another patch proposal by
Simon [1], and I would like to avoid opening that Pandora's box again,
but to stress this: this feature is specifically aimed at solving OLTP
latency on *sync*rep (some of the code could be generalized later on
and the feature extended to the async case, but that then opens the
question of managing the possible WAL throughput budget / back
throttling - this was also raised in the first thread here [2] by
Konstantin).

IMHO it could implement various strategies under a user-settable GUC
"wal_throttle_larger_transactions=(sync,256kB)" or
"wal_throttle_larger_transactions=off", and someday later someone
could teach the code the async case (let's say under
"wal_throttle_larger_transactions=(asyncMaxRPO, maxAllowedLag=8MB,
256kB)"). Thoughts?

[1] - 
https://www.postgresql.org/message-id/flat/CA%2BU5nMLfxBgHQ1VLSeBHYEMjHXz_OHSkuFdU6_1quiGM0RNKEg%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/71f3e6fb-2fca-a798-856a-f23c8ede2333%40garret.ru


0001-WIP-Syncrep-and-improving-latency-due-to-WAL-throttl.patch
Description: Binary data


Re: Syncrep and improving latency due to WAL throttling

2023-01-26 Thread Tomas Vondra



On 1/25/23 20:05, Andres Freund wrote:
> Hi,
> 
> On 2023-01-25 14:32:51 +0100, Jakub Wartak wrote:
>> In other words, it allows slowing down any backend activity. Any feedback on
>> such a feature is welcome, including better GUC name proposals ;) and
>> conditions in which such a feature should be disabled even if it were
>> enabled globally (right now only anti-wraparound VACUUM comes to mind;
>> it's not in the patch).
> 
> Such a feature could be useful - but I don't think the current place of
> throttling has any hope of working reliably:
> 
>> @@ -1021,6 +1025,21 @@ XLogInsertRecord(XLogRecData *rdata,
>>  pgWalUsage.wal_bytes += rechdr->xl_tot_len;
>>  pgWalUsage.wal_records++;
>>  pgWalUsage.wal_fpi += num_fpi;
>> +
>> +	backendWalInserted += rechdr->xl_tot_len;
>> +
>> +	if ((synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY ||
>> +		 synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_WRITE) &&
>> +		synchronous_commit_flush_wal_after > 0 &&
>> +		backendWalInserted > synchronous_commit_flush_wal_after * 1024L)
>> +	{
>> +		elog(DEBUG3, "throttling WAL down on this session (backendWalInserted=%d)", backendWalInserted);
>> +		XLogFlush(EndPos);
>> +		/* XXX: refactor SyncRepWaitForLSN() to have a different waitevent than the default WAIT_EVENT_SYNC_REP */
>> +		/* maybe a new WAIT_EVENT_SYNC_REP_BIG or something like that */
>> +		SyncRepWaitForLSN(EndPos, false);
>> +		elog(DEBUG3, "throttling WAL down on this session - end");
>> +		backendWalInserted = 0;
>> +	}
>>  }
> 
> You're blocking in the middle of an XLOG insertion. We will commonly hold
> important buffer lwlocks, it'll often be in a critical section (no cancelling
> / terminating the session!). This'd entail loads of undetectable deadlocks
> (i.e. hard hangs).  And even leaving that aside, doing an unnecessary
> XLogFlush() with important locks held will seriously increase contention.
> 

Yeah, I agree the sleep would have to happen elsewhere.

It's not clear to me how it could cause deadlocks, as we're not waiting
for a lock/resource locked by someone else, but it's certainly an issue
for uninterruptible hangs.

> 
> My best idea for how to implement this in a somewhat safe way would be for
> XLogInsertRecord() to set a flag indicating that we should throttle, and set
> InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to
> proceed (i.e. we'll not be in a critical / interrupts off section) can
> actually perform the delay.  That should fix the hard deadlock danger and
> remove most of the increase in lock contention.
> 

The solution I've imagined is something like autovacuum throttling - do
some accounting of how much "WAL bandwidth" each process consumed, and
then do the delay/sleep in a suitable place.
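
In pseudo-C, that would look something like the vacuum cost machinery
(a sketch only; wal_cost_limit / wal_cost_delay are hypothetical knobs
mirroring vacuum_cost_limit / vacuum_cost_delay):

	/* whenever a backend generates WAL, account for it */
	wal_cost_balance += rechdr->xl_tot_len;

	/* at a safe point (e.g. ProcessInterrupts()), sleep off the debt */
	if (wal_cost_balance > wal_cost_limit)
	{
		pg_usleep(wal_cost_delay * 1000L);	/* delay in ms */
		wal_cost_balance = 0;
	}

A global limit would additionally require keeping the per-process
balances in shared memory, similar to how autovacuum workers rebalance
their cost limit among themselves.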

> 
> I don't think doing an XLogFlush() of a record that we just wrote is a good
> idea - that'll sometimes significantly increase the number of fdatasyncs/sec
> we're doing. To make matters worse, this will often cause partially filled WAL
> pages to be flushed out - rewriting a WAL page multiple times can
> significantly increase overhead on the storage level. At the very least this'd
> have to flush only up to the last fully filled page.
> 

I don't see why that would significantly increase the number of flushes.
The goal is to do this only every ~1MB of a WAL or so, and chances are
there were many (perhaps hundreds or more) commits in between. At least
in a workload with a fair number of OLTP transactions (which is kinda
the point of all this).

And the "small" OLTP transactions don't really do any extra fsyncs,
because the accounting resets at commit (or places that flush WAL).

Same for the flushes of partially flushed pages - if there's enough of
small OLTP transactions, we're already having this issue. I don't see
why this would make it measurably worse. But if needed, we can simply
back off to the last page boundary, so that we only flush the complete
page. That would work too - the goal is not to flush everything, but to
reduce how much of the lag is due to the current process (i.e. to wait
for sync replica to catch up).

> 
> Just counting the number of bytes inserted by a backend will make the overhead
> even worse, as the flush will be triggered even for OLTP sessions doing tiny
> transactions, even though they don't contribute to the problem you're trying
> to address.  How about counting how many bytes of WAL a backend has inserted
> since the last time that backend did an XLogFlush()?
> 

No, we should reset the counter at commit, so small OLTP transactions
should not really trigger this. That's kinda the point, to only
delay "large" transactions producing a lot of WAL.

> A bulk writer won't do a lot of XLogFlush()es, so the 

Re: Syncrep and improving latency due to WAL throttling

2023-01-26 Thread Bharath Rupireddy
On Thu, Jan 26, 2023 at 12:35 AM Andres Freund  wrote:
>
> Hi,
>
> On 2023-01-25 14:32:51 +0100, Jakub Wartak wrote:
> > In other words, it allows slowing down any backend activity. Any feedback on
> > such a feature is welcome, including better GUC name proposals ;) and
> > conditions in which such a feature should be disabled even if it were
> > enabled globally (right now only anti-wraparound VACUUM comes to mind;
> > it's not in the patch).
>
> Such a feature could be useful - but I don't think the current place of
> throttling has any hope of working reliably:

+1 for the feature as it keeps replication lag in check and provides a
way for defining RPO (recovery point objective).

> You're blocking in the middle of an XLOG insertion. We will commonly hold
> important buffer lwlocks, it'll often be in a critical section (no cancelling
> / terminating the session!). This'd entail loads of undetectable deadlocks
> (i.e. hard hangs).  And even leaving that aside, doing an unnecessary
> XLogFlush() with important locks held will seriously increase contention.
>
> My best idea for how to implement this in a somewhat safe way would be for
> XLogInsertRecord() to set a flag indicating that we should throttle, and set
> InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to
> proceed (i.e. we'll not be in a critical / interrupts off section) can
> actually perform the delay.  That should fix the hard deadlock danger and
> remove most of the increase in lock contention.

We've discussed this feature quite extensively in a recent thread -
https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.
Folks were agreeing to Andres' suggestion there -
https://www.postgresql.org/message-id/20220105174643.lozdd3radxv4tlmx%40alap3.anarazel.de.
However, that thread got lost from my radar. It's good that someone
else started working on it and I'm happy to help with this feature.

> I don't think doing an XLogFlush() of a record that we just wrote is a good
> idea - that'll sometimes significantly increase the number of fdatasyncs/sec
> we're doing. To make matters worse, this will often cause partially filled WAL
> pages to be flushed out - rewriting a WAL page multiple times can
> significantly increase overhead on the storage level. At the very least this'd
> have to flush only up to the last fully filled page.
>
> Just counting the number of bytes inserted by a backend will make the overhead
> even worse, as the flush will be triggered even for OLTP sessions doing tiny
> transactions, even though they don't contribute to the problem you're trying
> to address.

Right.

> How about counting how many bytes of WAL a backend has inserted
> since the last time that backend did an XLogFlush()?

Seems reasonable.

> A bulk writer won't do a lot of XLogFlush()es, so the time/bytes since the
> last XLogFlush() will increase quickly, triggering a flush at the next
> opportunity. But short OLTP transactions will do XLogFlush()es at least at
> every commit.

Right.

> I also suspect the overhead will be more manageable if you were to force a
> flush only up to a point further back than the last fully filled page. We
> don't want to end up flushing WAL for every page, but if you just have a
> backend-local accounting mechanism, I think that's inevitably what you'd end
> up with when you have a large number of sessions. But if you'd limit the
> flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and
> only ever to a prior boundary, the amount of unnecessary WAL flushes would be
> proportional to synchronous_commit_flush_wal_after.

When the WAL records are large, the backend inserting them would
throttle frequently, generating more flushes, more waits for the sync
replication ack, and more contention on WALWriteLock. Not sure if this
is good unless the impact is measured.

Few more thoughts:

1. This feature keeps replication lag in check at the cost of
throttling write traffic. I'd be curious to know improvement in
replication lag vs drop in transaction throughput, say pgbench with a
custom insert script and one or more async/sync standbys.

2. So, heavy WAL throttling can turn into a lot of writes and fsyncs.
Eventually, each backend that generates enough WAL gets throttled,
irrespective of whether there's any replication lag or not. How about
we let backends throttle themselves not just based on
wal_distance_from_last_flush but also depending on the replication lag
at the moment, say, if replication lag crosses
wal_throttle_replication_lag_threshold bytes?

3. Should the backends wait indefinitely for the sync rep ack when they
cross wal_throttle_threshold? Well, I don't think so; they must be
given a chance to do their work instead of favouring other backends by
throttling themselves.

4. I'd prefer adding a TAP test for this feature to check if the WAL
throttle is picked up by a backend.

5. Can we also extend this 

Re: Syncrep and improving latency due to WAL throttling

2023-01-25 Thread Andres Freund
Hi,

On 2023-01-25 14:32:51 +0100, Jakub Wartak wrote:
> In other words, it allows slowing down any backend activity. Any feedback on
> such a feature is welcome, including better GUC name proposals ;) and
> conditions in which such a feature should be disabled even if it were
> enabled globally (right now only anti-wraparound VACUUM comes to mind;
> it's not in the patch).

Such a feature could be useful - but I don't think the current place of
throttling has any hope of working reliably:

> @@ -1021,6 +1025,21 @@ XLogInsertRecord(XLogRecData *rdata,
>  pgWalUsage.wal_bytes += rechdr->xl_tot_len;
>  pgWalUsage.wal_records++;
>  pgWalUsage.wal_fpi += num_fpi;
> +
> +	backendWalInserted += rechdr->xl_tot_len;
> +
> +	if ((synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY ||
> +		 synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_WRITE) &&
> +		synchronous_commit_flush_wal_after > 0 &&
> +		backendWalInserted > synchronous_commit_flush_wal_after * 1024L)
> +	{
> +		elog(DEBUG3, "throttling WAL down on this session (backendWalInserted=%d)", backendWalInserted);
> +		XLogFlush(EndPos);
> +		/* XXX: refactor SyncRepWaitForLSN() to have a different waitevent than the default WAIT_EVENT_SYNC_REP */
> +		/* maybe a new WAIT_EVENT_SYNC_REP_BIG or something like that */
> +		SyncRepWaitForLSN(EndPos, false);
> +		elog(DEBUG3, "throttling WAL down on this session - end");
> +		backendWalInserted = 0;
> +	}
>  }

You're blocking in the middle of an XLOG insertion. We will commonly hold
important buffer lwlocks, it'll often be in a critical section (no cancelling
/ terminating the session!). This'd entail loads of undetectable deadlocks
(i.e. hard hangs).  And even leaving that aside, doing an unnecessary
XLogFlush() with important locks held will seriously increase contention.


My best idea for how to implement this in a somewhat safe way would be for
XLogInsertRecord() to set a flag indicating that we should throttle, and set
InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to
proceed (i.e. we'll not be in a critical / interrupts off section) can
actually perform the delay.  That should fix the hard deadlock danger and
remove most of the increase in lock contention.
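
In code, that mechanism would be roughly (a sketch; throttle_threshold
and HandleXLogDelayPending() are placeholders):

	/* in XLogInsertRecord(): cheap check only, no waiting here */
	backendWalInserted += rechdr->xl_tot_len;
	if (backendWalInserted > throttle_threshold)
	{
		XLogDelayPending = true;
		InterruptPending = true;	/* makes CHECK_FOR_INTERRUPTS() fire */
	}

	/* in ProcessInterrupts(): runs only outside critical sections */
	if (XLogDelayPending)
	{
		XLogDelayPending = false;
		HandleXLogDelayPending();	/* XLogFlush() + SyncRepWaitForLSN() */
	}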


I don't think doing an XLogFlush() of a record that we just wrote is a good
idea - that'll sometimes significantly increase the number of fdatasyncs/sec
we're doing. To make matters worse, this will often cause partially filled WAL
pages to be flushed out - rewriting a WAL page multiple times can
significantly increase overhead on the storage level. At the very least this'd
have to flush only up to the last fully filled page.


Just counting the number of bytes inserted by a backend will make the overhead
even worse, as the flush will be triggered even for OLTP sessions doing tiny
transactions, even though they don't contribute to the problem you're trying
to address.  How about counting how many bytes of WAL a backend has inserted
since the last time that backend did an XLogFlush()?

A bulk writer won't do a lot of XLogFlush()es, so the time/bytes since the
last XLogFlush() will increase quickly, triggering a flush at the next
opportunity. But short OLTP transactions will do XLogFlush()es at least at
every commit.


I also suspect the overhead will be more manageable if you were to force a
flush only up to a point further back than the last fully filled page. We
don't want to end up flushing WAL for every page, but if you just have a
backend-local accounting mechanism, I think that's inevitably what you'd end
up with when you have a large number of sessions. But if you'd limit the
flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and
only ever to a prior boundary, the amount of unnecessary WAL flushes would be
proportional to synchronous_commit_flush_wal_after.
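
A sketch of that boundary scheme (illustrative names; the GUC is in kB
as elsewhere in the patch):

	uint64		step = synchronous_commit_flush_wal_after * 1024L / 2;
	XLogRecPtr	boundary = (currentLSN / step) * step;	/* prior boundary only */

	if (boundary > last_flush_target)
	{
		XLogFlush(boundary);
		last_flush_target = boundary;
	}

Since every backend rounds down to the same shared boundaries, at most
one flush per boundary is typically needed, no matter how many sessions
are active.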


Greetings,

Andres Freund




Syncrep and improving latency due to WAL throttling

2023-01-25 Thread Jakub Wartak
Hi,

attached is a proposal/idea by Tomas (in CC) for protecting and
prioritizing OLTP latency on syncrep over other WAL-heavy sessions.
This is the result of internal testing and research related to the
syncrep behavior, by Tomas, Alvaro and me. The main objective of this
work-in-progress/crude patch idea (with the GUC disabled by default)
is to prevent build-up of lag against the synchronous standby. It
allows the DBA to maintain stable latency for many backends when some
WAL-heavy activity is happening in parallel (full table rewrite,
VACUUM, MV build, archiving, etc.). In other words, it allows slowing
down any backend activity. Any feedback on such a feature is welcome,
including better GUC name proposals ;) and conditions in which such a
feature should be disabled even if it were enabled globally (right now
only anti-wraparound VACUUM comes to mind; it's not in the patch).

Demo; Given:
- two DBs in syncrep configuration and artificial RTT 10ms latency
between them (introduced via tc qdisc netem)
- insertBIG.sql = "insert into bandwidthhog select repeat('c', 1000)
from generate_series(1, 50);" (50MB of WAL data)
- pgbench (8c) and 1x INSERT session

There are clearly visible drops in pgbench (OLTP) throughput, with
latency spikes, when the WAL socket is saturated:

with 16devel/master and synchronous_commit_flush_wal_after=0
(disabled, default/baseline):
postgres@host1:~$ pgbench -n -R 50 -c 8 -T 15 -P 1
pgbench (16devel)
progress: 1.0 s, 59.0 tps, lat 18.840 ms stddev 11.251, 0 failed, lag 0.059 ms
progress: 2.0 s, 48.0 tps, lat 14.332 ms stddev 4.272, 0 failed, lag 0.063 ms
progress: 3.0 s, 56.0 tps, lat 15.383 ms stddev 6.270, 0 failed, lag 0.061 ms
progress: 4.0 s, 51.0 tps, lat 15.104 ms stddev 5.850, 0 failed, lag 0.061 ms
progress: 5.0 s, 47.0 tps, lat 15.184 ms stddev 5.472, 0 failed, lag 0.063 ms
progress: 6.0 s, 23.0 tps, lat 88.495 ms stddev 141.845, 0 failed, lag 0.064 ms
progress: 7.0 s, 1.0 tps, lat 999.053 ms stddev 0.000, 0 failed, lag 0.077 ms
progress: 8.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed, lag 0.000 ms
progress: 9.0 s, 1.0 tps, lat 2748.142 ms stddev NaN, 0 failed, lag 0.072 ms
progress: 10.0 s, 68.1 tps, lat 3368.267 ms stddev 282.842, 0 failed,
lag 2911.857 ms
progress: 11.0 s, 97.0 tps, lat 2560.750 ms stddev 216.844, 0 failed,
lag 2478.261 ms
progress: 12.0 s, 96.0 tps, lat 1463.754 ms stddev 376.276, 0 failed,
lag 1383.873 ms
progress: 13.0 s, 94.0 tps, lat 616.243 ms stddev 230.673, 0 failed,
lag 527.241 ms
progress: 14.0 s, 59.0 tps, lat 48.265 ms stddev 72.533, 0 failed, lag 15.181 ms
progress: 15.0 s, 39.0 tps, lat 14.237 ms stddev 6.073, 0 failed, lag 0.063 ms
transaction type: 
[..]
latency average = 931.383 ms
latency stddev = 1188.530 ms
rate limit schedule lag: avg 840.170 (max 3605.569) ms

session2 output:
postgres=# \i insertBIG.sql
Timing is on.
INSERT 0 50
Time: 4119.485 ms (00:04.119)

This new GUC makes it possible for the OLTP traffic to be less
affected (latency-wise) when the heavy bulk traffic hits. With
synchronous_commit_flush_wal_after=1024 (kB) it's way better, but
latency rises up to 45ms:
postgres@host1:~$ pgbench -n -R 50 -c 8 -T 15 -P 1
pgbench (16devel)
progress: 1.0 s, 52.0 tps, lat 17.300 ms stddev 10.178, 0 failed, lag 0.061 ms
progress: 2.0 s, 51.0 tps, lat 19.490 ms stddev 12.626, 0 failed, lag 0.061 ms
progress: 3.0 s, 48.0 tps, lat 14.839 ms stddev 5.429, 0 failed, lag 0.061 ms
progress: 4.0 s, 53.0 tps, lat 24.635 ms stddev 13.449, 0 failed, lag 0.062 ms
progress: 5.0 s, 48.0 tps, lat 17.999 ms stddev 9.291, 0 failed, lag 0.062 ms
progress: 6.0 s, 57.0 tps, lat 21.513 ms stddev 17.011, 0 failed, lag 0.058 ms
progress: 7.0 s, 50.0 tps, lat 28.071 ms stddev 21.622, 0 failed, lag 0.061 ms
progress: 8.0 s, 45.0 tps, lat 27.244 ms stddev 11.975, 0 failed, lag 0.064 ms
progress: 9.0 s, 57.0 tps, lat 35.988 ms stddev 25.752, 0 failed, lag 0.057 ms
progress: 10.0 s, 56.0 tps, lat 45.478 ms stddev 39.831, 0 failed, lag 0.534 ms
progress: 11.0 s, 62.0 tps, lat 45.146 ms stddev 32.881, 0 failed, lag 0.058 ms
progress: 12.0 s, 51.0 tps, lat 24.250 ms stddev 12.405, 0 failed, lag 0.063 ms
progress: 13.0 s, 57.0 tps, lat 18.554 ms stddev 8.677, 0 failed, lag 0.060 ms
progress: 14.0 s, 44.0 tps, lat 15.923 ms stddev 6.958, 0 failed, lag 0.065 ms
progress: 15.0 s, 54.0 tps, lat 19.773 ms stddev 10.024, 0 failed, lag 0.063 ms
transaction type: 
[..]
latency average = 25.575 ms
latency stddev = 21.540 ms

session2 output:
postgres=# set synchronous_commit_flush_wal_after = 1024;
SET
postgres=# \i insertBIG.sql
INSERT 0 50
Time: 8889.318 ms (00:08.889)


With 16devel/master and synchronous_commit_flush_wal_after=256 (kB)
all is smooth:
postgres@host1:~$ pgbench -n -R 50 -c 8 -T 15 -P 1
pgbench (16devel)
progress: 1.0 s, 49.0 tps, lat 14.345 ms stddev 4.700, 0 failed, lag 0.062 ms
progress: 2.0 s, 45.0 tps, lat 14.812 ms stddev 5.816, 0 failed, lag 0.064 ms
progress: 3.0 s, 49.0 tps, lat 13.145 ms stddev 4.320, 0