Re: Use fadvise in wal replay

2023-04-08 Thread Gregory Stark (as CFM)
On Thu, 19 Jan 2023 at 18:19, Andres Freund  wrote:
>
> On 2023-01-19 22:19:10 +0100, Tomas Vondra wrote:
>
> > So I'm a bit unsure about this patch. It doesn't seem like it can perform
> > better than read-ahead (although perhaps it does, on a different storage
> > system).
>
> I really don't see the point of the patch as-is.
...
> I don't disagree fundamentally. But that doesn't make this patch a useful
> starting point.

It sounds like this patch has gotten off on the wrong foot and is not
worth moving forward to the next commitfest. Hopefully starting over
with a different approach might target I/O that is more amenable to
fadvise. I'll mark it RwF.

-- 
Gregory Stark
As Commitfest Manager




Re: Use fadvise in wal replay

2023-01-19 Thread Andres Freund
Hi,

On 2023-01-19 22:19:10 +0100, Tomas Vondra wrote:
> So I'm a bit unsure about this patch. It doesn't seem like it can perform
> better than read-ahead (although perhaps it does, on a different storage
> system).

I really don't see the point of the patch as-is. It's not going to help OSs
without useful readahead, because those don't have posix_fadvise either. And
a hardcoded readahead distance isn't useful - on e.g. cloud storage, 128kB
won't be long enough to overcome network latency.

I could see a *tad* more point in using posix_fadvise(fd, 0, 0,
POSIX_FADV_SEQUENTIAL) when opening WAL files, as that'll increase the kernel
readahead window over what's configured.
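
A minimal sketch of that idea, assuming it runs right after a WAL segment is
opened for reading (the placement and descriptor name are assumptions here,
not part of the posted patch):

#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_SEQUENTIAL)
	/*
	 * Hint that this WAL segment will be read sequentially; on Linux this
	 * doubles the kernel's readahead window for the descriptor.  len = 0
	 * means "to the end of the file", and the return value is ignored
	 * because the hint is purely advisory.
	 */
	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
#endif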


> With disabled read-ahead it helps (at least a bit), although I'm not
> really convinced there are good reasons to run without read-ahead. The
> reason for doing that was described like this:

Agreed. Postgres currently totally crashes and burns without OS
readahead. Buffered IO without readahead makes no sense. So I just don't see
the point of this patch.


> > Because the database should know better than the OS which data needs to
> > be prefetched and which does not. Big OS readahead affects index scan
> > performance.

I don't disagree fundamentally. But that doesn't make this patch a useful
starting point.

Greetings,

Andres Freund




Re: Use fadvise in wal replay

2023-01-19 Thread Tomas Vondra
Hi,

I looked at this patch today. The change is fairly simple, so I decided
to do a benchmark. To prepare, I created a cluster with a 1GB database,
created a backup, and ran 1h UPDATE workload with WAL archiving. Then,
the actual benchmark does this:

1. restore the datadir backup
2. copy the WAL from archive
3. drop caches
4. start the cluster, measure time until end of recovery

I did this with master/patched, and with/without Linux readahead. And I
did this on two different machines - both have SSD storage, one (i5) has
a RAID of SATA SSD devices, the other one (xeon) has a single NVMe SSD.

The results (in seconds) look like this (the amount of WAL is different
on each machine, so the timings are not comparable between hosts):

  host    ra    master    patched
  -------------------------------
  i5       0       615        513
         256       392        396
  -------------------------------
  xeon     0      2113       1436
         256      1487       1460

On i5 (smaller machine with RAID of 6 x SATA SSD), with read-ahead
enabled it takes ~390 seconds, and the patch makes no difference.
Without read-ahead, it takes ~615 seconds, and the patch does help a
bit, but it's hardly competitive with read-ahead.

Note: 256 is the read-ahead per physical device; the read-ahead value for
the whole RAID is 6x that, i.e. 1536. I was speculating that maybe the
hard-coded 128kB RACHUNK is not sufficient, so I tried with 512kB. But
that actually made it worse, and the timing deteriorated to ~640s (that
is, slower than master without read-ahead).

On the xeon (with NVMe SSD), it's different - the patch seems about as
effective as regular read-ahead. So that's good.


So I'm a bit unsure about this patch. It doesn't seem like it can perform
better than read-ahead (although perhaps it does, on a different storage
system).

With disabled read-ahead it helps (at least a bit), although I'm not
really convinced there are good reasons to run without read-ahead. The
reason for doing that was described like this:

> Because the database should know better than the OS which data needs to be
> prefetched and which does not. Big OS readahead affects index scan
> performance.

I don't recall seeing such an issue, and I can't find anything like that in
our mailing list archives either. Sure, that doesn't mean it can't
happen, and read-ahead is a heuristic, so it can do weird stuff. But in
my experience it tends to work fairly well. The issues I've seen are
generally in the opposite direction, i.e. read-ahead not kicking in.


Anyway, it's not my intent to prevent this patch from getting committed, if
someone wishes to do that. But I'm not quite convinced it actually helps
with a practical issue.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Use fadvise in wal replay

2022-11-27 Thread Andrey Borodin
On Fri, Nov 25, 2022 at 1:12 PM Pavel Borisov  wrote:
>
> As I've written upthread, we cannot gain much from this
> optimization. Jakub's results show around a 2% difference:
>
> >baseline, master, default Linux readahead (128kb):
> >33.979, 0.478
> >35.137, 0.504
> >34.649, 0.518
>
> >master+patched, readahead disabled:
> >34.338, 0.528
> >34.568, 0.575
> >34.007, 1.136
>
> >master+patched, readahead enabled (as default):
> >33.935, 0.523
> >34.109, 0.501
> >33.408, 0.557
>

The performance benefit shows up only when readahead is disabled. And
on many workloads readahead brings unneeded data into the page cache, so
disabling it is a preferred configuration.
In this particular case, the time to apply WAL decreases from 53s to 33s.

Thanks!

Best Regards, Andrey Borodin.




Re: Use fadvise in wal replay

2022-11-25 Thread Pavel Borisov
On Sat, 26 Nov 2022 at 01:10, Pavel Borisov  wrote:
>
> Hi, hackers!
>
> On Sun, 13 Nov 2022 at 02:02, Andrey Borodin  wrote:
> >
> > On Sun, Aug 7, 2022 at 9:41 AM Andrey Borodin  wrote:
> > >
> >
> > Hi everyone. The patch is 16 lines, looks harmless, and has proven
> > benefits. I'm moving this into RfC.
>
> As I've written upthread, we cannot gain much from this
> optimization. Jakub's results show around a 2% difference:
>
> >baseline, master, default Linux readahead (128kb):
> >33.979, 0.478
> >35.137, 0.504
> >34.649, 0.518
>
> >master+patched, readahead disabled:
> >34.338, 0.528
> >34.568, 0.575
> >34.007, 1.136
>
> >master+patched, readahead enabled (as default):
> >33.935, 0.523
> >34.109, 0.501
> >33.408, 0.557
>
> On the other hand, the patch indeed is tiny and I don't think the
> proposed advise can ever make things bad. So, I've looked through the
> patch again and I agree it can be committed in the current state.

My mailer corrected my previous message. The right one is:
"On the other hand, the patch indeed is tiny and I don't think the
proposed _fadvise_ can ever make things bad".

Pavel.




Re: Use fadvise in wal replay

2022-11-25 Thread Pavel Borisov
Hi, hackers!

On Sun, 13 Nov 2022 at 02:02, Andrey Borodin  wrote:
>
> On Sun, Aug 7, 2022 at 9:41 AM Andrey Borodin  wrote:
> >
>
> Hi everyone. The patch is 16 lines, looks harmless, and has proven
> benefits. I'm moving this into RfC.

As I've written upthread, we cannot gain much from this
optimization. Jakub's results show around a 2% difference:

>baseline, master, default Linux readahead (128kb):
>33.979, 0.478
>35.137, 0.504
>34.649, 0.518

>master+patched, readahead disabled:
>34.338, 0.528
>34.568, 0.575
>34.007, 1.136

>master+patched, readahead enabled (as default):
>33.935, 0.523
>34.109, 0.501
>33.408, 0.557

On the other hand, the patch indeed is tiny and I don't think the
proposed advise can ever make things bad. So, I've looked through the
patch again and I agree it can be committed in the current state.

Kind regards,
Pavel Borisov,
Supabase




Re: Use fadvise in wal replay

2022-11-12 Thread Andrey Borodin
On Sun, Aug 7, 2022 at 9:41 AM Andrey Borodin  wrote:
>

Hi everyone. The patch is 16 lines, looks harmless, and has proven
benefits. I'm moving this into RfC.

Thanks!

Best regards, Andrey Borodin.




Re: Use fadvise in wal replay

2022-08-07 Thread Andrey Borodin



> On 7 Aug 2022, at 06:39, Bharath Rupireddy 
>  wrote:
> 
> Agree. Why can't we just prefetch the entire WAL file once whenever it
> is opened for the first time? Does the OS have any limitations on max
> size to prefetch at once? It may sound aggressive, but it avoids
> repeated fadvise() system calls; this will be especially useful if there
> are many WAL files to recover (crash, PITR or standby recovery), where
> eventually we would want the whole WAL file to be prefetched.
> 
> If prefetching the entire WAL file is okay, we could further do this:
> 1) prefetch in XLogFileOpen() and all of the segment_open callbacks, 2)
> release in XLogFileClose (it's being done right now) and all of the
> segment_close callbacks - do this perhaps optionally.
> 
> Also, can't we use an existing function FilePrefetch()? That way,
> there is no need for a new wait event type.
> 
> Thoughts?

Thomas expressed this idea upthread. Benchmarks done by Jakub showed that this
approach had no significant improvement over the existing master code.
The same benchmarks showed an almost 1.5x improvement from readahead in 8kB or
128kB chunks.

Best regards, Andrey Borodin.



Re: Use fadvise in wal replay

2022-08-06 Thread Bharath Rupireddy
On Sat, Aug 6, 2022 at 10:53 AM Andrey Borodin  wrote:
>
> Hi Bharath,
>
> thank you for the suggestion.
>
> > On 5 Aug 2022, at 16:02, Bharath Rupireddy 
> >  wrote:
> >
> > On Thu, Aug 4, 2022 at 9:48 PM Andrey Borodin  wrote:
> >>
> >>> On 18 Jul 2022, at 22:55, Robert Haas  wrote:
> >>>
> >>> On Thu, Jun 23, 2022 at 5:49 AM Jakub Wartak  
> >>> wrote:
> >
> > I have a fundamental question on the overall idea - How beneficial it
> > will be if the process that's reading the current WAL page only does
> > (at least attempts) the prefetching of future WAL pages? Won't the
> > benefit be higher if "some" other background process does prefetching?
>
> IMO prefetching from another process would have a negative effect.
> The fadvise() call is non-blocking; the startup process won't do IO. It just
> informs the kernel to schedule an asynchronous page read.
> On the other hand, synchronization with another process might cost more than
> fadvise().

Hm, the POSIX_FADV_WILLNEED flag makes fadvise() non-blocking.

> Anyway, the cost of calling fadvise() once per 16 page reads is negligible.

Agree. Why can't we just prefetch the entire WAL file once whenever it
is opened for the first time? Does the OS have any limitations on max
size to prefetch at once? It may sound aggressive, but it avoids
repeated fadvise() system calls; this will be especially useful if there
are many WAL files to recover (crash, PITR or standby recovery), where
eventually we would want the whole WAL file to be prefetched.

If prefetching the entire WAL file is okay, we could further do this:
1) prefetch in XLogFileOpen() and all of the segment_open callbacks, 2)
release in XLogFileClose (it's being done right now) and all of the
segment_close callbacks - do this perhaps optionally.

Also, can't we use an existing function FilePrefetch()? That way,
there is no need for a new wait event type.

Thoughts?

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/




Re: Use fadvise in wal replay

2022-08-05 Thread Andrey Borodin
Hi Bharath,

thank you for the suggestion.

> On 5 Aug 2022, at 16:02, Bharath Rupireddy 
>  wrote:
> 
> On Thu, Aug 4, 2022 at 9:48 PM Andrey Borodin  wrote:
>> 
>>> On 18 Jul 2022, at 22:55, Robert Haas  wrote:
>>> 
>>> On Thu, Jun 23, 2022 at 5:49 AM Jakub Wartak  
>>> wrote:
> 
> I have a fundamental question on the overall idea - How beneficial it
> will be if the process that's reading the current WAL page only does
> (at least attempts) the prefetching of future WAL pages? Won't the
> benefit be higher if "some" other background process does prefetching?

IMO prefetching from another process would have a negative effect.
The fadvise() call is non-blocking; the startup process won't do IO. It just
informs the kernel to schedule an asynchronous page read.
On the other hand, synchronization with another process might cost more than
fadvise().

Anyway, the cost of calling fadvise() once per 16 page reads is negligible.
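
For reference, a minimal sketch of that once-per-chunk pattern, using the
RACHUNK name from the patch (the exact placement in the WAL page read loop
is an assumption here):

#define RACHUNK (16 * XLOG_BLCKSZ)	/* 128kB with the default 8kB WAL pages */

#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
	/*
	 * Once per 16-page chunk, ask the kernel to start reading the next
	 * chunk in the background.  The call only queues readahead; it does
	 * not block the startup process on IO.
	 */
	if ((readOff % RACHUNK) == 0)
		(void) posix_fadvise(readFile, readOff + RACHUNK, RACHUNK,
							 POSIX_FADV_WILLNEED);
#endif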

Thank you!

Best regards, Andrey Borodin.



Re: Use fadvise in wal replay

2022-08-05 Thread Bharath Rupireddy
On Thu, Aug 4, 2022 at 9:48 PM Andrey Borodin  wrote:
>
> > On 18 Jul 2022, at 22:55, Robert Haas  wrote:
> >
> > On Thu, Jun 23, 2022 at 5:49 AM Jakub Wartak  
> > wrote:

I have a fundamental question on the overall idea - How beneficial it
will be if the process that's reading the current WAL page only does
(at least attempts) the prefetching of future WAL pages? Won't the
benefit be higher if "some" other background process does prefetching?

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/




Re: Use fadvise in wal replay

2022-08-04 Thread Andrey Borodin



> On 18 Jul 2022, at 22:55, Robert Haas  wrote:
> 
> On Thu, Jun 23, 2022 at 5:49 AM Jakub Wartak  wrote:
>> Cool. As for the GUC, I'm afraid there's going to be resistance to adding
>> yet another GUC (to avoid many knobs). Ideally it would be nice if we had
>> some advanced/deep/hidden parameters, but there is no such thing.
>> Maybe another option would be to use (N * maintenance_io_concurrency *
>> XLOG_BLCKSZ), so with N=1 that's 80kB and with N=2 160kB (pretty close to
>> the default value, and still tweakable by the end user). Let's wait for
>> what others say?
> 
> I don't think adding more parameters is a problem intrinsically. A
> good question to ask, though, is how the user is supposed to know what
> value they should configure. If we don't have any idea what value is
> likely to be optimal, odds are users won't either.
We know that 128kB is optimal on some representative configurations and that
changing the value won't really affect performance much. 128kB is marginally
better than 8kB and removes some theoretical concerns about extra syscalls.

> It's not very clear to me that we have any kind of agreement on what
> the basic approach should be here, though.

Actually, the only open question is the offset from the current read position:
should it be 1 block or a full readahead chunk? Again, this does not change
anything; it's just a matter of choice.


Thanks!

Best regards, Andrey Borodin.



Re: Use fadvise in wal replay

2022-07-18 Thread Robert Haas
On Thu, Jun 23, 2022 at 5:49 AM Jakub Wartak  wrote:
> Cool. As for the GUC, I'm afraid there's going to be resistance to adding
> yet another GUC (to avoid many knobs). Ideally it would be nice if we had
> some advanced/deep/hidden parameters, but there is no such thing.
> Maybe another option would be to use (N * maintenance_io_concurrency *
> XLOG_BLCKSZ), so with N=1 that's 80kB and with N=2 160kB (pretty close to
> the default value, and still tweakable by the end user). Let's wait for
> what others say?

I don't think adding more parameters is a problem intrinsically. A
good question to ask, though, is how the user is supposed to know what
value they should configure. If we don't have any idea what value is
likely to be optimal, odds are users won't either.

It's not very clear to me that we have any kind of agreement on what
the basic approach should be here, though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Use fadvise in wal replay

2022-07-18 Thread Andrey Borodin



> On 23 Jun 2022, at 12:50, Jakub Wartak  wrote:
> 
> Thoughts?

I've looked into the patch one more time. And I propose to change this line
+   posix_fadvise(readFile, readOff + RACHUNK, RACHUNK, POSIX_FADV_WILLNEED);
to
+   posix_fadvise(readFile, readOff + XLOG_BLCKSZ, RACHUNK, POSIX_FADV_WILLNEED);

Currently the first 128kB of the file are not prefetched. But I expect that
this change will produce similar performance results. I propose this change
only for consistency, so that we prefetch all the data that we have not
prefetched yet and are going to read. What do you think?
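
To spell out the difference at the start of a segment (readOff = 0,
XLOG_BLCKSZ = 8kB, RACHUNK = 128kB; this only illustrates the arithmetic and
is not code from the patch):

/*
 * offset = readOff + RACHUNK, len = RACHUNK:
 *     hints [128kB, 256kB)  -- bytes [8kB, 128kB) are never hinted.
 *
 * offset = readOff + XLOG_BLCKSZ, len = RACHUNK:
 *     hints [8kB, 136kB)    -- everything just past the page being read.
 */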

Best regards, Andrey Borodin.



Re: Use fadvise in wal replay

2022-06-23 Thread Justin Pryzby
On Thu, Jun 23, 2022 at 09:49:31AM +, Jakub Wartak wrote:
> it would be nice if we had some advanced/deep/hidden parameters, but there
> is no such thing.

There are DEVELOPER_OPTIONS GUCs, although I don't know if this is a good fit
for that.

-- 
Justin




RE: Use fadvise in wal replay

2022-06-23 Thread Jakub Wartak
Hey Andrey,

> > On 23 Jun 2022, at 13:50, Jakub Wartak wrote:
> >
> > Thoughts?
> The patch leaves the 1st 128kB chunk unprefetched. Is it worth adding an
> extra branch for 120kB after the 1st block, when readOff==0?
> Or maybe do
> + posix_fadvise(readFile, readOff + XLOG_BLCKSZ, RACHUNK, POSIX_FADV_WILLNEED);
> instead of
> + posix_fadvise(readFile, readOff + RACHUNK, RACHUNK, POSIX_FADV_WILLNEED);
> ?

> > Notes:
> > - no GUC, as the default/identical value seems to be the best
> I think adding this performance boost on most systems is definitely worth
> 1 syscall per 16 pages. And I believe 128kB to be optimal for most storage.
> And having no GUCs sounds great.
> 
> But storage systems might be different, far beyond benchmarks.
> All in all, I don't have a strong opinion on having 1 or 0 GUCs to
> configure this.
> 
> I've added the patch to the CF.

Cool. As for the GUC, I'm afraid there's going to be resistance to adding
yet another GUC (to avoid many knobs). Ideally it would be nice if we had
some advanced/deep/hidden parameters, but there is no such thing.
Maybe another option would be to use (N * maintenance_io_concurrency *
XLOG_BLCKSZ), so with N=1 that's 80kB and with N=2 160kB (pretty close to
the default value, and still tweakable by the end user). Let's wait for
what others say?
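
To make the arithmetic concrete, a sketch of deriving the distance that way
(the multiplier N is a hypothetical knob; with the defaults,
maintenance_io_concurrency = 10 and XLOG_BLCKSZ = 8192, N = 1 gives 80kB and
N = 2 gives 160kB):

	/* reuse an existing GUC instead of adding a new one */
	int		dist = N * maintenance_io_concurrency * XLOG_BLCKSZ;

#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
	if ((readOff % dist) == 0)
		(void) posix_fadvise(readFile, readOff + dist, dist,
							 POSIX_FADV_WILLNEED);
#endif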

-J.


Re: Use fadvise in wal replay

2022-06-23 Thread Andrey Borodin



> On 23 Jun 2022, at 13:50, Jakub Wartak wrote:
> 
> Thoughts?
The patch leaves the 1st 128kB chunk unprefetched. Is it worth adding an extra
branch for 120kB after the 1st block, when readOff==0?
Or maybe do
+   posix_fadvise(readFile, readOff + XLOG_BLCKSZ, RACHUNK, POSIX_FADV_WILLNEED);
instead of
+   posix_fadvise(readFile, readOff + RACHUNK, RACHUNK, POSIX_FADV_WILLNEED);
?

> Notes:
> - no GUC, as the default/identical value seems to be the best
I think adding this performance boost on most systems is definitely worth 1
syscall per 16 pages. And I believe 128kB to be optimal for most storage. And
having no GUCs sounds great.

But storage systems might be different, far beyond benchmarks.
All in all, I don't have a strong opinion on having 1 or 0 GUCs to configure
this.

I've added the patch to the CF.

Thanks!

Best regards, Andrey Borodin.



RE: Use fadvise in wal replay

2022-06-23 Thread Jakub Wartak
>> > On 21 Jun 2022, at 16:59, Jakub Wartak  wrote:
>> Oh, wow, your benchmarks show a really impressive improvement.
>> 
>> > I think that 1 additional syscall is not going to be cheap just for
>> > non-standard OS configurations
>> Also, we can reduce the number of syscalls with something like
>> 
>> #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
>> if ((readOff % (8 * XLOG_BLCKSZ)) == 0)
>> posix_fadvise(readFile, readOff + XLOG_BLCKSZ, XLOG_BLCKSZ * 8,
>> POSIX_FADV_WILLNEED);
>> #endif
>> 
>> and maybe define/reuse some GUC to control the number of prefetched pages
>> at once.

Hi, I was thinking the same, so I got the patch (attached) to the point where
it gets identical performance with and without readahead enabled:

baseline, master, default Linux readahead (128kb):
33.979, 0.478
35.137, 0.504
34.649, 0.518

master+patched, readahead disabled:
34.338, 0.528
34.568, 0.575
34.007, 1.136

master+patched, readahead enabled (as default):
33.935, 0.523
34.109, 0.501
33.408, 0.557

Thoughts?

Notes:
- no GUC, as the default/identical value seems to be the best
- POSIX_FADV_SEQUENTIAL is apparently much slower and doesn't seem to have any
effect from xlogreader.c at all, while _WILLNEED does (testing again
contradicts "common wisdom"?)

-J.


Attachment: 0001-Use-fadvise-to-prefetch-WAL-in-xlogrecovery.patch


Re: Use fadvise in wal replay

2022-06-22 Thread Andrey Borodin



> On 22 Jun 2022, at 13:26, Pavel Borisov  wrote:
> 
> Then I'd guess that your speedup is due to speeding up the first several
> MBs in many opened files
I think in this case Thomas's approach of prefetching the next WAL segment
would do better. But Jakub observed the opposite results.

Best regards, Andrey Borodin.



Re: Use fadvise in wal replay

2022-06-22 Thread Pavel Borisov
On Wed, Jun 22, 2022 at 2:07 PM Andrey Borodin  wrote:

>
>
> > On 21 Jun 2022, at 20:52, Pavel Borisov  wrote:
> >
> > > On 21 Jun 2022, at 16:59, Jakub Wartak 
> wrote:
> > Oh, wow, your benchmarks show a really impressive improvement.
> >
> > FWIW I was trying to speed up long sequential file reads in Postgres
> using fadvise hints. I've found no detectable improvements.
> > Then I've written 1Mb - 1Gb sequential read test with both fadvise
> POSIX_FADV_WILLNEED and POSIX_FADV_SEQUENTIAL in Linux.
> Did you drop caches?
>
Yes. I saw that nothing changes the speed of long file (50MB+) reads.

> > The only improvement I've found was
> >
> > 1. when the size of read was around several Mb and fadvise len also
> around several Mb.
> > 2. when before fadvise and the first read there was a delay (which was
> supposedly used by the OS for reading into a prefetch buffer)
> That's the case of the startup process: you read an xlog page, then redo
> records from this page.
>
Then I'd guess that your speedup is due to speeding up the first several
MBs in many opened files (and the delay for kernel prefetch is due to some
other reason). That differs from the case I tried to measure, and this
could be the cause of the speedup in your case.

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: Use fadvise in wal replay

2022-06-22 Thread Andrey Borodin



> On 21 Jun 2022, at 20:52, Pavel Borisov  wrote:
> 
> > On 21 Jun 2022, at 16:59, Jakub Wartak  wrote:
> Oh, wow, your benchmarks show a really impressive improvement.
> 
> FWIW I was trying to speed up long sequential file reads in Postgres using 
> fadvise hints. I've found no detectable improvements.
> Then I've written 1Mb - 1Gb sequential read test with both fadvise 
> POSIX_FADV_WILLNEED and POSIX_FADV_SEQUENTIAL in Linux.
Did you drop caches?

> The only improvement I've found was 
> 
> 1. when the size of read was around several Mb and fadvise len also around 
> several Mb. 
> 2. when before fadvise and the first read there was a delay (which was 
> supposedly used by the OS for reading into a prefetch buffer)
That's the case of the startup process: you read an xlog page, then redo
records from this page.
> 3. If I read sequential blocks, I saw a speedup only on the first ones. The
> overall read speed of, say, a 1GB file remained unchanged no matter what.
> 
> I became convinced that if I read something long, the OS does the necessary
> speedups automatically (which is also in agreement with the fadvise
> manual/code comments).
> Could you please elaborate on how you got results with that big a
> difference? (Though I'm not against fadvise usage; at worst it is expected
> to be useless).

FWIW Kirill and I observed drastically reduced lag on a production server
when running the patched version. Fadvise surely works :) The question is how
to use it optimally.

Best regards, Andrey Borodin.



Re: Use fadvise in wal replay

2022-06-21 Thread Pavel Borisov
>
> > On 21 Jun 2022, at 16:59, Jakub Wartak  wrote:
> Oh, wow, your benchmarks show a really impressive improvement.
>

FWIW I was trying to speed up long sequential file reads in Postgres using
fadvise hints. I've found no detectable improvements.
Then I've written 1Mb - 1Gb sequential read test with both fadvise
POSIX_FADV_WILLNEED
and POSIX_FADV_SEQUENTIAL in Linux. The only improvement I've found was

1. when the size of the read was around several MB and the fadvise len was
also around several MB.
2. when before fadvise and the first read there was a delay (which was
supposedly used by the OS for reading into a prefetch buffer)
3. If I read sequential blocks, I saw a speedup only on the first ones. The
overall read speed of, say, a 1GB file remained unchanged no matter what.

I became convinced that if I read something long, the OS does the necessary
speedups automatically (which is also in agreement with the fadvise
manual/code comments).
Could you please elaborate on how you got results with that big a
difference? (Though I'm not against fadvise usage; at worst it is expected
to be useless).

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: Use fadvise in wal replay

2022-06-21 Thread Andrey Borodin



> On 21 Jun 2022, at 16:59, Jakub Wartak  wrote:
Oh, wow, your benchmarks show a really impressive improvement.

> I think that 1 additional syscall is not going to be cheap just for 
> non-standard OS configurations
Also, we can reduce the number of syscalls with something like

#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
	/* issue one hint per 8 pages instead of one per page */
	if ((readOff % (8 * XLOG_BLCKSZ)) == 0)
		posix_fadvise(readFile, readOff + XLOG_BLCKSZ, XLOG_BLCKSZ * 8,
					  POSIX_FADV_WILLNEED);
#endif

and maybe define/reuse some GUC to control the number of prefetched pages at
once.

Best regards, Andrey Borodin.



RE: Use fadvise in wal replay

2022-06-21 Thread Jakub Wartak
> On Tue, Jun 21, 2022 at 10:33 PM Jakub Wartak 
> wrote:
> > > > Maybe the important question is why the readahead mechanism would be
> > > disabled in the first place via /sys | blockdev?
> > >
> > > Because the database should know better than the OS which data needs
> > > to be prefetched and which does not. Big OS readahead affects index
> > > scan performance.
> >
> > OK, fair point; however, the patch here is adding 1 syscall per
> > XLOG_BLCKSZ, which is not cheap either. The code is already hot, and
> > there is an example from the past where syscalls were limiting
> > performance [1]. Maybe it could prefetch in larger batches (128kB? 1MB?
> > 16MB?)?
> 
> I've always thought we'd want to tell it about the *next* segment file, to
> smooth the transition from one file to the next, something like the
> attached (not tested).

Hey Thomas! 

Apparently that theory is false. Redo-bench [1] results below (the first
column is total recovery time in seconds; 3.1GB pgdata, of which 2.6GB is
pg_wal across 166 files). Redo-bench was slightly hacked to always drop fs
caches after copying, so that there is nothing in the fs cache (neither
pgdata nor pg_wal; shared fs). maintenance_io_concurrency is at its default
(10), recovery_prefetch likewise (try; on by default).

master, default Linux readahead (128kb):
33.979, 0.478
35.137, 0.504
34.649, 0.518

master, blockdev --setra 0 /dev/nvme0n1: 
53.151, 0.603
58.329, 0.525
52.435, 0.536

master, with your patch (readahead disabled) -- double-checked, calls to
fadvise64(offset=0 len=0) were there
58.063, 0.593
51.369, 0.574
51.716, 0.59

master, with Kirill's original patch (readahead disabled)
38.25, 1.134
36.563, 0.582
37.711, 0.584

I've also noted that in both cases POSIX_FADV_SEQUENTIAL is being used instead
of WILLNEED (?).
I haven't quantified the tradeoff of master vs Kirill's patch with readahead,
but I think that 1 additional syscall is not going to be cheap just for
non-standard OS configurations (?)

-J.

[1] - https://github.com/macdice/redo-bench


Re: Use fadvise in wal replay

2022-06-21 Thread Amit Kapila
On Tue, Jun 21, 2022 at 5:41 PM Bharath Rupireddy
 wrote:
>
> On Tue, Jun 21, 2022 at 4:55 PM Amit Kapila  wrote:
> >
> > On Tue, Jun 21, 2022 at 3:18 PM Andrey Borodin  wrote:
> > >
> > > > On 21 Jun 2022, at 12:35, Amit Kapila  wrote:
> > > >
> > > > I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
> > > > help your case?
> > >
> > > AFAICS recovery_prefetch tries to prefetch the main fork, but does not
> > > try to prefetch the WAL itself before reading it. Kirill is trying to
> > > solve the problem of reading WAL segments that are out of the OS page
> > > cache.
> > >
> >
> > Okay, but normally the WAL written by walreceiver is read by the
> > startup process soon after it's written as indicated in code comments
> > (get_sync_bit()). So, what is causing the delay here which makes the
> > startup process perform physical reads?
>
> That's not always true. If there's a huge apply lag, and/or
> restartpoints are infrequent/frequent, or there are many reads on the
> standby - in all of these cases the OS can evict the WAL from its
> cache, causing the startup process to hit the disk for WAL reading.
>

It is possible that due to one or more of these reasons the startup process
has to physically read the WAL. I think it is better to find out what
is going on for the OP. AFAICS, there is no mention of any other kind
of reads on the problematic standby. As per the analysis shared in the
initial email, the replication lag is due to disk reads, so there
doesn't seem to be a very clear theory as to why the OP is seeing disk
reads.

-- 
With Regards,
Amit Kapila.




Re: Use fadvise in wal replay

2022-06-21 Thread Bharath Rupireddy
On Tue, Jun 21, 2022 at 4:55 PM Amit Kapila  wrote:
>
> On Tue, Jun 21, 2022 at 3:18 PM Andrey Borodin  wrote:
> >
> > > On 21 Jun 2022, at 12:35, Amit Kapila  wrote:
> > >
> > > I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
> > > help your case?
> >
> > AFAICS recovery_prefetch tries to prefetch the main fork, but does not
> > try to prefetch the WAL itself before reading it. Kirill is trying to
> > solve the problem of reading WAL segments that are out of the OS page
> > cache.
> >
>
> Okay, but normally the WAL written by walreceiver is read by the
> startup process soon after it's written as indicated in code comments
> (get_sync_bit()). So, what is causing the delay here which makes the
> startup process perform physical reads?

That's not always true. If there's a huge apply lag, and/or
restartpoints are infrequent/frequent, or there are many reads on the
standby - in all of these cases the OS can evict the WAL from its
cache, causing the startup process to hit the disk for WAL reading.

Regards,
Bharath Rupireddy.




Re: Use fadvise in wal replay

2022-06-21 Thread Amit Kapila
On Tue, Jun 21, 2022 at 3:18 PM Andrey Borodin  wrote:
>
> > On 21 Jun 2022, at 12:35, Amit Kapila  wrote:
> >
> > I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
> > help your case?
>
> AFAICS recovery_prefetch tries to prefetch the main fork, but does not try
> to prefetch the WAL itself before reading it. Kirill is trying to solve the
> problem of reading WAL segments that are out of the OS page cache.
>

Okay, but normally the WAL written by walreceiver is read by the
startup process soon after it's written as indicated in code comments
(get_sync_bit()). So, what is causing the delay here which makes the
startup process perform physical reads?

-- 
With Regards,
Amit Kapila.




Re: Use fadvise in wal replay

2022-06-21 Thread Bharath Rupireddy
On Tue, Jun 21, 2022 at 4:22 PM Thomas Munro  wrote:
>
> On Tue, Jun 21, 2022 at 10:33 PM Jakub Wartak  wrote:
> > > > Maybe the important question is why the readahead mechanism would be
> > > disabled in the first place via /sys | blockdev?
> > >
> > > Because the database should know better than the OS which data needs to
> > > be prefetched and which does not. Big OS readahead affects index scan
> > > performance.
> >
> > OK, fair point; however, the patch here is adding 1 syscall per
> > XLOG_BLCKSZ, which is not cheap either. The code is already hot, and there
> > is an example from the past where syscalls were limiting performance [1].
> > Maybe it could prefetch in larger batches (128kB? 1MB? 16MB?)?
>
> I've always thought we'd want to tell it about the *next* segment
> file, to smooth the transition from one file to the next, something
> like the attached (not tested).

Yes, it makes sense to prefetch the "future" WAL files that "may be"
needed for recovery (crash recovery/archive or PITR recovery/standby
recovery), not the current WAL file. Having said that, it's not a
great idea (IMO) to make the WAL readers do the prefetching; instead, WAL
prefetching can be delegated to a new background worker, or to the existing
bgwriter or checkpointer, which gets started during recovery.

Also, it's a good idea to measure the benefits with and without WAL
prefetching for all recovery types - crash recovery/archive or PITR
recovery/standby recovery. For standby recovery, the WAL files may be
in the OS cache if there wasn't a huge apply lag.

Regards,
Bharath Rupireddy.




Re: Use fadvise in wal replay

2022-06-21 Thread Thomas Munro
On Tue, Jun 21, 2022 at 10:33 PM Jakub Wartak  wrote:
> > > Maybe the important question is why the readahead mechanism would be
> > disabled in the first place via /sys | blockdev?
> >
> > Because the database should know better than the OS which data needs to
> > be prefetched and which does not. Big OS readahead affects index scan
> > performance.
>
> OK, fair point; however, the patch here is adding 1 syscall per XLOG_BLCKSZ,
> which is not cheap either. The code is already hot, and there is an example
> from the past where syscalls were limiting performance [1]. Maybe it could
> prefetch in larger batches (128kB? 1MB? 16MB?)?

I've always thought we'd want to tell it about the *next* segment
file, to smooth the transition from one file to the next, something
like the attached (not tested).
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6eba626420..e1dc37d3c2 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -4032,6 +4032,25 @@ rescanLatestTimeLine(TimeLineID replayTLI, XLogRecPtr replayLSN)
 }
 
 
+/*
+ * Tell the kernel to prefetch a logfile segment, if it exists.  Ignores
+ * errors, since this is only a hint.
+ */
+static void
+XLogFilePrefetch(XLogSegNo segno, TimeLineID tli)
+{
+	char		path[MAXPGPATH];
+	File		file;
+
+	XLogFilePath(path, tli, segno, wal_segment_size);
+	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	if (file >= 0)
+	{
+		FilePrefetch(file, 0, 0, WAIT_EVENT_WAL_PREFETCH);
+		FileClose(file);
+	}
+}
+
 /*
  * Open a logfile segment for reading (during recovery).
  *
@@ -4106,6 +4125,13 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 		if (source != XLOG_FROM_STREAM)
 			XLogReceiptTime = GetCurrentTimestamp();
 
+		/*
+		 * Every time we open a file from pg_wal, hint to the kernel that we'll
+		 * likely soon be reading the next segment.
+		 */
+		if (readSource == XLOG_FROM_PG_WAL)
+			XLogFilePrefetch(segno + 1, tli);
+
 		return fd;
 	}
 	if (errno != ENOENT || !notfoundOk) /* unexpected failure? */
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 87c15b9c6f..f45d32ceef 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -729,6 +729,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_INIT_WRITE:
 			event_name = "WALInitWrite";
 			break;
+		case WAIT_EVENT_WAL_PREFETCH:
+			event_name = "WALPrefetch";
+			break;
 		case WAIT_EVENT_WAL_READ:
 			event_name = "WALRead";
 			break;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index b578e2ec75..1473a4388f 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -226,6 +226,7 @@ typedef enum
 	WAIT_EVENT_WAL_COPY_WRITE,
 	WAIT_EVENT_WAL_INIT_SYNC,
 	WAIT_EVENT_WAL_INIT_WRITE,
+	WAIT_EVENT_WAL_PREFETCH,
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,


RE: Use fadvise in wal replay

2022-06-21 Thread Jakub Wartak
> > Maybe the important question is why the readahead mechanism would be
> disabled in the first place via /sys | blockdev?
> 
> Because the database should know better than the OS which data needs to be
> prefetched and which does not. Big OS readahead affects index scan
> performance.

OK, fair point; however, the patch here is adding 1 syscall per XLOG_BLCKSZ,
which is not cheap either. The code is already hot, and there is an example
from the past where syscalls were limiting performance [1]. Maybe it could
prefetch in larger batches (128kB? 1MB? 16MB?)?

-J.

[1] - https://commitfest.postgresql.org/28/2606/






Re: Use fadvise in wal replay

2022-06-21 Thread Andrey Borodin



> On 21 Jun 2022, at 13:20, Jakub Wartak  wrote:
> 
> Maybe the important question is why the readahead mechanism would be
> disabled in the first place via /sys | blockdev?

Because the database should know better than the OS which data needs to be
prefetched and which does not. Big OS readahead affects index scan
performance.
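
For context, a minimal sketch of the per-file knob a database could use to
take that control (this illustrates the general mechanism only and is not
part of the posted patch; fd stands for an already-opened data file):

#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_RANDOM)
	/*
	 * Declare random access on this file, which disables the kernel's
	 * readahead for the descriptor; the database can then issue its own
	 * POSIX_FADV_WILLNEED hints where it knows better.  len = 0 covers
	 * the whole file.
	 */
	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
#endif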

Best regards, Andrey Borodin.



RE: Use fadvise in wal replay

2022-06-21 Thread Jakub Wartak
>> > On 21 Jun 2022, at 12:35, Amit Kapila  wrote:
>> >
>> > I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
>> > help your case?
>> 
>> AFAICS recovery_prefetch tries to prefetch the main fork, but does not try
>> to prefetch the WAL itself before reading it. Kirill is trying to solve the
>> problem of reading WAL segments that are out of the OS page cache.

It seems that it is set to 128kB by default; another thing is that having
(default) 16MB WAL segments might also hinder the readahead heuristics
compared to a bigger configured WAL segment size.

Maybe the important question is why the readahead mechanism would be disabled
in the first place via /sys | blockdev?

-J.




Re: Use fadvise in wal replay

2022-06-21 Thread Andrey Borodin



> On 21 Jun 2022, at 12:35, Amit Kapila  wrote:
> 
> I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
> help your case?

AFAICS recovery_prefetch tries to prefetch the main fork, but does not try to
prefetch the WAL itself before reading it. Kirill is trying to solve the
problem of reading WAL segments that are out of the OS page cache.

Best regards, Andrey Borodin.



Re: Use fadvise in wal replay

2022-06-21 Thread Amit Kapila
On Tue, Jun 21, 2022 at 1:07 PM Kirill Reshke  wrote:
>
> Recently we faced a problem with one of our production clusters. We use a
> cascade replication setup in this cluster, that is: master, standby (r1),
> and cascade standby (r2). From time to time, the replication lag on r1 used
> to grow, while on r2 it did not. Analysis showed that the r1 startup process
> was spending a lot of time reading WAL from disk. Increasing
> /sys/block/md2/queue/read_ahead_kb to 16384 (from 0) helps in this case.
> Maybe we can add an fadvise call to the PostgreSQL startup process, so it
> would not be necessary to change settings on the hypervisor?
>

I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
help your case?

[1] - 
https://www.postgresql.org/docs/devel/runtime-config-wal.html#RUNTIME-CONFIG-WAL-RECOVERY

-- 
With Regards,
Amit Kapila.