Re: walwriter interacts quite badly with synchronous_commit=off

2023-11-27 Thread Andres Freund
On 2023-11-27 17:55:34 +0200, Heikki Linnakangas wrote:
> On 25/10/2023 21:59, Andres Freund wrote:
> > > See attached. It's the same logic as in your patch, just formulated more
> > > clearly IMHO.
> > Yep, makes sense!
> 
> Pushed this. Thanks for the investigation!

Thanks!




Re: walwriter interacts quite badly with synchronous_commit=off

2023-11-27 Thread Heikki Linnakangas

On 25/10/2023 21:59, Andres Freund wrote:

See attached. It's the same logic as in your patch, just formulated more
clearly IMHO.

Yep, makes sense!


Pushed this. Thanks for the investigation!

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: walwriter interacts quite badly with synchronous_commit=off

2023-10-25 Thread Heikki Linnakangas

On 25/10/2023 21:59, Andres Freund wrote:

On 2023-10-25 12:17:03 +0300, Heikki Linnakangas wrote:

On 25/10/2023 02:09, Andres Freund wrote:

Because of the inherent delay between the checks of XLogCtl->WalWriterSleeping
and Latch->is_set, we also sometimes end up with multiple processes signalling
walwriter, which can be bad, because it increases the likelihood that some of
the signals may be received when we are already holding WALWriteLock, delaying
its release...


That can only happen when walwriter has just come out of "hibernation", ie.
when the system has been idle for a while. So probably not a big deal in
practice.


Maybe I am missing something here - why can this only happen when hibernating?
Even outside of that, two backends can decide that they need to wake up
walwriter?


Ah sure, multiple backends can decide to wake up walwriter at the same 
time. I thought you meant that the window for that was somehow wider 
when XLogCtl->WalWriterSleeping is set.



We could prevent that, by updating state when requesting walwriter to be woken
up. But with the changes we're discussing below, that should be rare.


One small easy thing we could do to reduce the redundant wakeups: only 
wake up walwriter if asyncXactLSN points to a different page than 
prevAsyncXactLSN.
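
A minimal sketch of that idea (prevAsyncXactLSN is a hypothetical field here,
purely for illustration; where exactly the check would live is still open):

    /*
     * Sketch only: skip the wakeup when the new async commit LSN is still
     * on the same WAL page as the last one we woke walwriter for.
     */
    if (asyncXactLSN / XLOG_BLCKSZ == prevAsyncXactLSN / XLOG_BLCKSZ)
        return;
    prevAsyncXactLSN = asyncXactLSN;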


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: walwriter interacts quite badly with synchronous_commit=off

2023-10-25 Thread Andres Freund
Hi,

On 2023-10-25 12:17:03 +0300, Heikki Linnakangas wrote:
> On 25/10/2023 02:09, Andres Freund wrote:
> > Because of the inherent delay between the checks of XLogCtl->WalWriterSleeping
> > and Latch->is_set, we also sometimes end up with multiple processes signalling
> > walwriter, which can be bad, because it increases the likelihood that some of
> > the signals may be received when we are already holding WALWriteLock, delaying
> > its release...
>
> That can only happen when walwriter has just come out of "hibernation", ie.
> when the system has been idle for a while. So probably not a big deal in
> practice.

Maybe I am missing something here - why can this only happen when hibernating?
Even outside of that, two backends can decide that they need to wake up
walwriter?

We could prevent that, by updating state when requesting walwriter to be woken
up. But with the changes we're discussing below, that should be rare.


> > I think the most important optimization we need is to have
> > XLogSetAsyncXactLSN() only wake up if there is a certain amount of unflushed
> > WAL. Unless walwriter is hibernating, walwriter will wake up on its own after
> > wal_writer_delay.  I think we can just reuse WalWriterFlushAfter for this.
> >
> > E.g. a condition like
> >     if (WriteRqstPtr <= LogwrtResult.Write + WalWriterFlushAfter * XLOG_BLCKSZ)
> >         return;
> > drastically cuts down on the amount of wakeups, without - I think - losing
> > guarantees around synchronous_commit=off.
>
> In the patch, you actually did:
>
> > +   if (WriteRqstPtr <= LogwrtResult.Flush + WalWriterFlushAfter * XLOG_BLCKSZ)
> > +       return;
>
> It means that you never wake up the walwriter to merely *write* the WAL. You
> only wake it up if it's going to also fsync() it. I think that's correct and
> appropriate, but it took me a while to reach that conclusion:

Yea, after writing the email I got worried that just looking at Write would
perhaps lead to not flushing data soon enough...


> It might be beneficial to wake up the walwriter just to perform a write(),
> to offload that work from the backend. And walwriter will actually also
> perform an fsync() after finishing the current segment, so it would make
> sense to also wake it up when 'asyncXactLSN' crosses a segment boundary.
> However, if those extra wakeups make sense, they would also make sense when
> there are no asynchronous commits involved. Therefore those extra wakeups
> should be done elsewhere, perhaps somewhere around AdvanceXLInsertBuffer().
> The condition you have in the patch is appropriate for
> XLogSetAsyncXactLSN().

Yea. I agree we should wake up walwriter in other situations too...


> Another reason to write the WAL aggressively, even if you don't flush it,
> would be to reduce the number of lost transactions on a segfault. But we
> don't give any guarantees on that, and even with the current aggressive
> logic, we only write when a page is full so you're anyway going to lose the
> last partial page.

Wal writer does end up writing the trailing partially filled page during the
next wal_writer_delay cycle.


> It also took me a while to convince myself that this calculation matches the
> calculation that XLogBackgroundFlush() uses to determine whether it needs to
> flush or not. XLogBackgroundFlush() first divides the request and result
> by XLOG_BLCKSZ and then compares the number of blocks, whereas here you
> perform the calculation in bytes. I think the result is the same, but to
> make it more clear, let's do it the same way in both places.
>
> See attached. It's the same logic as in your patch, just formulated more
> clearly IMHO.

Yep, makes sense!


> > Because of the frequent wakeups, we do something else that's not smart: We
> > write out 8k blocks individually, many times a second. Often thousands of
> > 8k pwrites a second.
>
> Even with this patch, when I bumped up wal_writer_delay to 2 so that the wal
> writer gets woken up by the async commits rather than the timeout, the write
> pattern is a bit silly:
>
> $ strace -p 1099926 # walwriter
> strace: Process 1099926 attached
> epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, u64=94261056289248}}], 1, 1991) = 1
> read(3, "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128
> pwrite64(5, "\24\321\5\0\1\0\0\0\0\300\0\373\5\0\0\0+\0\0\0\0\0\0\0\0\n\0\0n\276\242\305"..., 1007616, 49152) = 1007616
> fdatasync(5) = 0
> pwrite64(5, "\24\321\5\0\1\0\0\0\0 \20\373\5\0\0\0003\0\0\0\0\0\0\0\320\37\20\373\5\0\0\0"..., 16384, 1056768) = 16384
> epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, u64=94261056289248}}], 1, 2000) = 1
> read(3, "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128
> pwrite64(5, "\24\321\5\0\1\0\0\0\0`\20\373\5\0\0\0+\0\0\0\0\0\0\0\0\n\0\0\5~\23\261"..., 1040384, 1073152) 

Re: walwriter interacts quite badly with synchronous_commit=off

2023-10-25 Thread Heikki Linnakangas

On 25/10/2023 02:09, Andres Freund wrote:

Because of the inherent delay between the checks of XLogCtl->WalWriterSleeping
and Latch->is_set, we also sometimes end up with multiple processes signalling
walwriter, which can be bad, because it increases the likelihood that some of
the signals may be received when we are already holding WALWriteLock, delaying
its release...


That can only happen when walwriter has just come out of "hibernation", 
ie. when the system has been idle for a while. So probably not a big 
deal in practice.



I think the most important optimization we need is to have
XLogSetAsyncXactLSN() only wake up if there is a certain amount of unflushed
WAL. Unless walwriter is hibernating, walwriter will wake up on its own after
wal_writer_delay.  I think we can just reuse WalWriterFlushAfter for this.

E.g. a condition like
    if (WriteRqstPtr <= LogwrtResult.Write + WalWriterFlushAfter * XLOG_BLCKSZ)
        return;
drastically cuts down on the amount of wakeups, without - I think - losing
guarantees around synchronous_commit=off.


In the patch, you actually did:


+   if (WriteRqstPtr <= LogwrtResult.Flush + WalWriterFlushAfter * XLOG_BLCKSZ)
+       return;


It means that you never wake up the walwriter to merely *write* the WAL. 
You only wake it up if it's going to also fsync() it. I think that's 
correct and appropriate, but it took me a while to reach that conclusion:


It might be beneficial to wake up the walwriter just to perform a 
write(), to offload that work from the backend. And walwriter will 
actually also perform an fsync() after finishing the current segment, so 
it would make sense to also wake it up when 'asyncXactLSN' crosses a 
segment boundary. However, if those extra wakeups make sense, they would 
also make sense when there are no asynchronous commits involved. 
Therefore those extra wakeups should be done elsewhere, perhaps 
somewhere around AdvanceXLInsertBuffer(). The condition you have in the 
patch is appropriate for XLogSetAsyncXactLSN().
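
For illustration only, a segment-boundary wakeup of the kind described above
could look roughly like this (prevAsyncXactLSN is hypothetical, and as noted,
the right home for such a check is probably not XLogSetAsyncXactLSN()):

    /*
     * Sketch: walwriter fsyncs when it finishes a segment, so a wakeup when
     * the async commit LSN crosses into a new WAL segment could make sense.
     */
    if (asyncXactLSN / wal_segment_size != prevAsyncXactLSN / wal_segment_size)
    {
        if (ProcGlobal->walwriterLatch)
            SetLatch(ProcGlobal->walwriterLatch);
    }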


Another reason to write the WAL aggressively, even if you don't flush 
it, would be to reduce the number of lost transactions on a segfault. 
But we don't give any guarantees on that, and even with the current 
aggressive logic, we only write when a page is full so you're anyway 
going to lose the last partial page.


It also took me a while to convince myself that this calculation matches 
the calculation that XLogBackgroundFlush() uses to determine whether it 
needs to flush or not. XLogBackgroundFlush() first divides the request 
and result by XLOG_BLCKSZ and then compares the number of blocks, 
whereas here you perform the calculation in bytes. I think the result is 
the same, but to make it more clear, let's do it the same way in both 
places.
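
Something along these lines, as a sketch of the block-based form rather than
the exact patch text:

    int     flushblocks;

    /*
     * Sketch only: count the complete WAL blocks between the current flush
     * position and the requested write position, and only wake walwriter
     * once at least WalWriterFlushAfter blocks are pending.
     */
    flushblocks = WriteRqstPtr / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
    if (flushblocks < WalWriterFlushAfter)
        return;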


See attached. It's the same logic as in your patch, just formulated more 
clearly IMHO.



Because of the frequent wakeups, we do something else that's not smart: We
write out 8k blocks individually, many times a second. Often thousands of
8k pwrites a second.


Even with this patch, when I bumped up wal_writer_delay to 2 so that the 
wal writer gets woken up by the async commits rather than the timeout, 
the write pattern is a bit silly:


$ strace -p 1099926 # walwriter
strace: Process 1099926 attached
epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, u64=94261056289248}}], 1, 1991) = 1
read(3, "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128
pwrite64(5, "\24\321\5\0\1\0\0\0\0\300\0\373\5\0\0\0+\0\0\0\0\0\0\0\0\n\0\0n\276\242\305"..., 1007616, 49152) = 1007616
fdatasync(5) = 0
pwrite64(5, "\24\321\5\0\1\0\0\0\0 \20\373\5\0\0\0003\0\0\0\0\0\0\0\320\37\20\373\5\0\0\0"..., 16384, 1056768) = 16384
epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, u64=94261056289248}}], 1, 2000) = 1
read(3, "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128
pwrite64(5, "\24\321\5\0\1\0\0\0\0`\20\373\5\0\0\0+\0\0\0\0\0\0\0\0\n\0\0\5~\23\261"..., 1040384, 1073152) = 1040384
fdatasync(5) = 0
pwrite64(5, "\24\321\4\0\1\0\0\0\0@ \373\5\0\0\0\0\0\0\0\0\0\0\0;\0\0\0\264'\246\3"..., 16384, 2113536) = 16384
epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, u64=94261056289248}}], 1, 2000) = 1
read(3, "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128
pwrite64(5, "\24\321\5\0\1\0\0\0\0\200 \373\5\0\0\0003\0\0\0\0\0\0\0\320\177 \373\5\0\0\0"..., 1040384, 2129920) = 1040384
fdatasync(5) = 0

In each cycle, the wal writer writes a full 1 MB chunk 
(wal_writer_flush_after = '1MB'), flushes it, and then performs a smaller 
write before going to sleep.


Those smaller writes seem a bit silly. But I think it's fine.


More throughput for less CPU, seems neat :)


Indeed, impressive speedup from such a small patch!


I'm not 

walwriter interacts quite badly with synchronous_commit=off

2023-10-24 Thread Andres Freund
Hi,

I recently mentioned to Heikki that I was seeing latch related wakeups being
frequent enough to prevent walwriter from doing a whole lot of work. He asked
me to write that set of concerns up, which seems quite fair...


Here's a profile of walwriter while the following pgbench run was ongoing:

c=1;psql -Xq -c 'drop table if exists testtable_logged; CREATE TABLE testtable_logged(v int not null default 0);' && PGOPTIONS='-c synchronous_commit=off' pgbench -n -c$c -j$c -Mprepared -T150 -f <(echo 'INSERT INTO testtable_logged DEFAULT VALUES;') -P1

Looking at top, walwriter is around 15-20% busy with this
workload. Unfortunately a profile quickly shows that little of that work is
useful:

perf record --call-graph dwarf -m16M -p $(pgrep -f 'walwriter') sleep 3

-   94.42%  0.00%  postgres  postgres  [.] AuxiliaryProcessMain
      AuxiliaryProcessMain
    - WalWriterMain
       + 78.26% WaitLatch
       + 14.01% XLogBackgroundFlush
       + 0.51% pgstat_report_wal
         0.29% ResetLatch
         0.13% pgstat_flush_io
       + 0.02% asm_sysvec_apic_timer_interrupt
         0.01% HandleWalWriterInterrupts (inlined)


Confirmed by the distribution of syscalls being made:

perf trace -m128M --summary -p $(pgrep -f 'walwriter') sleep 5
   syscall            calls  errors      total      min      avg      max   stddev
                                         (msec)   (msec)   (msec)   (msec)      (%)
   ---------------  -------  ------  ---------  -------  -------  -------  -------
   epoll_wait        216610       0   3744.984    0.000    0.017    0.113    0.03%
   read              216602       0    333.905    0.001    0.002    0.029    0.03%
   fdatasync             27       0     94.703    1.939    3.508   11.279    8.83%
   pwrite64            2998       0     15.646    0.004    0.005    0.027    0.45%
   openat                 2       0      0.019    0.006    0.010    0.013   34.84%
   close                  2       0      0.004    0.002    0.002    0.003   25.76%

We're doing far more latch-related work than actual work.

The walwriter wakes up many, many times without having to do anything.

And if you increase the number of clients to e.g. c=8, it gets worse in some
ways:

perf trace:
   epoll_wait        291512       0   2364.067    0.001    0.008    0.693    0.10%
   read              290938       0    479.837    0.001    0.002    0.020    0.05%
   fdatasync            146       0    410.043    2.508    2.809    7.006    1.90%
   futex              56384   43982    183.896    0.001    0.003    2.791    1.65%
   pwrite64           17058       0    105.625    0.004    0.006    4.015    4.61%
   clock_nanosleep        1       0      1.063    1.063    1.063    1.063    0.00%
   openat                 9       0      0.072    0.006    0.008    0.014   14.35%
   close                  9       0      0.018    0.002    0.002    0.003    5.55%

Note that we have 5x more lock waits (the futex calls) than writes!


I think the problem is mainly that XLogSetAsyncXactLSN() wakes up walwriter
whenever it is sleeping, regardless of whether the modified asyncXactLSN will
lead to a write.  We even wake up walwriter when we haven't changed
asyncXactLSN, because our LSN is older than some other backend's!

So often we'll just wake up walwriter, which finds no work, immediately goes
to sleep, just to be woken again.
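
Roughly, the current logic looks like this (a condensed paraphrase from memory,
not the literal source):

    SpinLockAcquire(&XLogCtl->info_lck);
    LogwrtResult = XLogCtl->LogwrtResult;
    sleeping = XLogCtl->WalWriterSleeping;
    if (XLogCtl->asyncXactLSN < asyncXactLSN)
        XLogCtl->asyncXactLSN = asyncXactLSN;
    SpinLockRelease(&XLogCtl->info_lck);

    if (!sleeping)
    {
        /* back off to the last completed page boundary */
        WriteRqstPtr -= WriteRqstPtr % XLOG_BLCKSZ;

        /* if we have already flushed that far, we're done */
        if (WriteRqstPtr <= LogwrtResult.Flush)
            return;
    }

    /* if walwriter was marked sleeping, we always end up here */
    if (ProcGlobal->walwriterLatch)
        SetLatch(ProcGlobal->walwriterLatch);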

Because of the inherent delay between the checks of XLogCtl->WalWriterSleeping
and Latch->is_set, we also sometimes end up with multiple processes signalling
walwriter, which can be bad, because it increases the likelihood that some of
the signals may be received when we are already holding WALWriteLock, delaying
its release...

Because of the frequent wakeups, we do something else that's not smart: We
write out 8k blocks individually, many times a second. Often thousands of
8k pwrites a second.

We also acquire WALWriteLock and call WaitXLogInsertionsToFinish(), even if we
could already know we're not going to flush! Not cheap, when you do it this
many times a second.


There is an absolutely basic optimization, helping a bit at higher client
counts: Don't wake if the new asyncXactLSN is <= the old one. But it doesn't
help that much.
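
As a sketch, with the locking around asyncXactLSN elided for brevity:

    /*
     * If some backend already advertised an equal or newer async LSN, its
     * wakeup covers ours too, so don't signal walwriter again.
     */
    if (asyncXactLSN <= XLogCtl->asyncXactLSN)
        return;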

I think the most important optimization we need is to have
XLogSetAsyncXactLSN() only wake up if there is a certain amount of unflushed
WAL. Unless walwriter is hibernating, walwriter will wake up on its own after
wal_writer_delay.  I think we can just reuse WalWriterFlushAfter for this.

E.g. a condition like
    if (WriteRqstPtr <= LogwrtResult.Write + WalWriterFlushAfter * XLOG_BLCKSZ)
        return;
drastically cuts down on the amount of wakeups, without - I think - losing
guarantees around synchronous_commit=off.
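
Put into context, the check would sit near the top of XLogSetAsyncXactLSN(),
roughly like this (a sketch only, assuming LogwrtResult was just refreshed from
shared memory; the hibernation case would still need an unconditional kick):

    XLogRecPtr  WriteRqstPtr = asyncXactLSN;

    /*
     * Sketch: while fewer than WalWriterFlushAfter blocks of WAL are
     * unwritten, skip the wakeup; walwriter will pick the work up on its
     * own within wal_writer_delay anyway.
     */
    if (WriteRqstPtr <= LogwrtResult.Write + WalWriterFlushAfter * XLOG_BLCKSZ)
        return;

    if (ProcGlobal->walwriterLatch)
        SetLatch(ProcGlobal->walwriterLatch);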

1 client:

before:
tps = 42926.288765 (without initial connection time)

   syscall            calls  errors      total      min      avg      max   stddev