Re: Purpose of wal_init_zero

2025-01-27 Thread Robert Pang
Our pg_stat_wal view currently doesn't expose the number of WAL segments
recycled, although this information is already logged by the checkpointer
in the database log. For example,

LOG:  checkpoint complete: wrote 317 buffers (1.9%); 0 WAL file(s) added, 0
removed, 3 recycled; write=0.003 s, sync=0.001 s, total=0.004 s; sync
files=8, longest=0.001 s, average=0.001 s; distance=56015 kB,
estimate=56015 kB; lsn=0/6FB8478, redo lsn=0/6FB8420

It might be beneficial to surface the number of WAL files added, removed,
and recycled in the pg_stat_wal view. As the wal_init_zero setting is
typically static, I don't think it's necessary to report the number of WAL
files zero-filled separately. It's important to note that the checkpoint
log's count of added WAL files doesn't include files added via
XLogFileInit(), which is a significant statistic to consider.

What do others think?


On Fri, Jan 24, 2025 at 7:22 AM Hannu Krosing  wrote:

> On Wed, Jan 22, 2025 at 10:18 PM Robert Pang 
> wrote:
> >
> > On Wed, Jan 15, 2025 at 12:05 PM Andres Freund 
> wrote:
> > >
> > > If you have wal_recycle=true, this overhead will only be paid the
> first time a
> > > WAL segment is used, of course, not after recycling.
> >
> > Today, our pg_stat_wal view [1] does not report the no. of WAL
> > segments recycled. How about if we add a column to report it so users
> > can track how many WAL segments are recycled vs. created over a period
> > of time?
>
> Maybe even add the full set of (recycled, created, deleted and
> zero-filled) so we have a full picture.
>
> And possibly duplicate the set for "in the background" and "inline /
> stalling the production work" for the operations where it makes sense
> (created, zero-filled)
>
> > [1]
> https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-WAL-VIEW
> >
> >
>


Re: Purpose of wal_init_zero

2025-01-24 Thread Hannu Krosing
On Wed, Jan 22, 2025 at 10:18 PM Robert Pang  wrote:
>
> On Wed, Jan 15, 2025 at 12:05 PM Andres Freund  wrote:
> >
> > If you have wal_recycle=true, this overhead will only be paid the first 
> > time a
> > WAL segment is used, of course, not after recycling.
>
> Today, our pg_stat_wal view [1] does not report the no. of WAL
> segments recycled. How about if we add a column to report it so users
> can track how many WAL segments are recycled vs. created over a period
> of time?

Maybe even add the full set of (recycled, created, deleted and
zero-filled) so we have a full picture.

And possibly duplicate the set for "in the background" and "inline /
stalling the production work" for the operations where it makes sense
(created, zero-filled).

> [1] 
> https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-WAL-VIEW
>
>




Re: Purpose of wal_init_zero

2025-01-22 Thread Robert Pang
On Wed, Jan 15, 2025 at 12:05 PM Andres Freund  wrote:
>
> If you have wal_recycle=true, this overhead will only be paid the first time a
> WAL segment is used, of course, not after recycling.

Today, our pg_stat_wal view [1] does not report the no. of WAL
segments recycled. How about if we add a column to report it so users
can track how many WAL segments are recycled vs. created over a period
of time?

[1] 
https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-WAL-VIEW




Re: Purpose of wal_init_zero

2025-01-21 Thread Ritu Bhandari
Hi @Andres Freund 

> I'm not sure I understand the specifics here - did the high WAL generation
> rate result in the recycling taking too long?  Or did checkpointer take too
> long to write out data, and because of that recycling didn't happen
> frequently enough?

If the WAL generation rate greatly exceeds max_wal_size within a
checkpoint interval, there aren't enough recycled WAL files available and
the system has to create a large number of new WAL files. This can
significantly increase the initialization time, especially if we've
increased the WAL segment size to 64 MB (4x the 16 MB default segment
size). Conversely, setting a very high max_wal_size to retain more
recycled WAL files can lead to longer recovery times, as the total WAL
size might become very large.

I'll talk to Theodore and confirm on the fallocate part.


> I think what we instead ought to do is to more aggressively initialize WAL
> files ahead of time, so it doesn't happen while holding crucial locks.  We
> know the recent rate of WAL generation, and we could easily track up to which
> LSN we have recycled WAL segments. Armed with that information walwriter (or
> something else) should try to ensure that there's always a fair amount of
> pre-allocated WAL.

I agree. Having preallocated WAL files ahead of time will be the ideal
scenario.


> I put some patches together for this a few years ago [0], but ended up
> abandoning them due to lack of interest.  I'm happy to revisit that effort
> if folks do become interested.

Great to know about this, and it aligns with our thinking. We can continue
the discussion on the other thread. I can also help wherever needed.



On Tue, 21 Jan 2025 at 06:39, Andy Fan  wrote:

>
> Hi,
>
> > On Fri, Jan 17, 2025 at 04:29:14PM -0500, Andres Freund wrote:
> >> I think what we instead ought to do is to more aggressively initialize WAL
> >> files ahead of time, so it doesn't happen while holding crucial locks.  We
> >> know the recent rate of WAL generation, and we could easily track up to which
> >> LSN we have recycled WAL segments. Armed with that information walwriter (or
> >> something else) should try to ensure that there's always a fair amount of
> >> pre-allocated WAL.
> >
> > I put some patches together for this a few years ago [0], but ended up
> > abandoning them due to lack of interest.  I'm happy to revisit that effort
> > if folks do become interested.
>
> Great to know this. I went through that thread and found the main
> considerations are pretty similar to what I was thinking while working
> out the PoC. I will go to [0] for further discussion on this topic.
>
> > [0] https://postgr.es/m/20220408203003.GA1630183%40nathanxps13
> --
> Best Regards
> Andy Fan
>
>


Re: Purpose of wal_init_zero

2025-01-20 Thread Andy Fan


Hi, 

> On Fri, Jan 17, 2025 at 04:29:14PM -0500, Andres Freund wrote:
>> I think what we instead ought to do is to more aggressively initialize WAL
>> files ahead of time, so it doesn't happen while holding crucial locks.  We
>> know the recent rate of WAL generation, and we could easily track up to which
>> LSN we have recycled WAL segments. Armed with that information walwriter (or
>> something else) should try to ensure that there's always a fair amount of
>> pre-allocated WAL.
>
> I put some patches together for this a few years ago [0], but ended up
> abandoning them due to lack of interest.  I'm happy to revisit that effort
> if folks do become interested.

Great to know this. I went through that thread and found the main
considerations are pretty similar to what I was thinking while working
out the PoC. I will go to [0] for further discussion on this topic.

> [0] https://postgr.es/m/20220408203003.GA1630183%40nathanxps13
-- 
Best Regards
Andy Fan





Re: Purpose of wal_init_zero

2025-01-20 Thread Nathan Bossart
On Fri, Jan 17, 2025 at 04:29:14PM -0500, Andres Freund wrote:
> I think what we instead ought to do is to more aggressively initialize WAL
> files ahead of time, so it doesn't happen while holding crucial locks.  We
> know the recent rate of WAL generation, and we could easily track up to which
> LSN we have recycled WAL segments. Armed with that information walwriter (or
> something else) should try to ensure that there's always a fair amount of
> pre-allocated WAL.

I put some patches together for this a few years ago [0], but ended up
abandoning them due to lack of interest.  I'm happy to revisit that effort
if folks do become interested.

[0] https://postgr.es/m/20220408203003.GA1630183%40nathanxps13

-- 
nathan




Re: Purpose of wal_init_zero

2025-01-20 Thread Hannu Krosing
Thinking back, I can see now why disabling WAL writes with
wal_level=minimal in COPY resulted in 3x better write performance
instead of the expected 2x:

With wal_level=minimal only the heap page writes were needed, whereas
with WAL writes the same page was written 3x (heap + WAL zero-fill +
WAL).

--
Hannu

On Mon, Jan 20, 2025 at 12:06 PM Hannu Krosing  wrote:
>
> On Fri, Jan 17, 2025 at 10:29 PM Andres Freund  wrote:
> ...
> > > I see, PG once had fallocate [1] (which was reverted by [2] due to some
> > > performance regression concern). The original OSS discussion was in [3].
> > > The perf regression was reported in [4]. Looks like this was due to how
> > > ext4 handled extents and uninitialized data[5] and that seems to be fixed
> > > in [6]. I'll check with Theodore Ts'o to confirm on [6].
> > >
> > > Could we consider adding back fallocate?
> >
> > Fallocate doesn't really help unfortunately. On common filesystems (like
> > ext4/xfs) it just allocates filespace without zeroing out the underlying
> > blocks.
>
> @Theodore Tso - can you confirm that ext4 (and xfs?) does not use the
> low-level WRITE ZEROS commands for initializing the newly allocated
> blocks?
>
> And that the new blocks will be written twice - once for zero-filling
> and then with the actual data .
>
> For WAL we really don't need to zero out anything - we already do WAL
> file recycling without zero-filling the recycled segments, so
> obviously it is all right to have random garbage in the pages.
>
> > To make that correct, those filesystems keep a bitmap indicating which
> > blocks in the range are not yet written. Unfortunately updating those blocks
> > is a metadata operation and thus requires journaling.
> >
> > I've seen some mild speedups by first using fallocate and then zeroing out 
> > the
> > file, particularly with larger segment sizes.
>
> Did you just write a single zero page per file page to avoid
> duplicating the work ?
>
> > I think mainly due to avoiding
> > delayed allocation in the filesystem, rather than actually reducing
> > fragmentation. But it really isn't a whole lot.
> >
> > I've in the past tried to get the linux filesytem developers to add an
> > fallocate mode that doesn't utilize the "unwritten extents" "optimization",
> > but didn't have luck with that.
>
> Are you saying that the first write to a newly allocated empty block
> currently will do two writes to the disk - first writing the zeros and
> then writing the actual data written ?
>
> Or just that the overhead from journalling the change to
> not-yet-written bitmap cancels out the win from not writing the page
> twice ?
>
> > The block layer in linux actually does have
> > support for zeroing out regions of blocks without having to do actually 
> > write
> > the data, but it's only used in some narrow cases (don't remember the
> > details).
>
> For WAL files we should be ok by either using the declarative no-write
> zero fill in the block layer, or just using the pages as-is without
> any zero-filling at all (though this is likely not possible because of
> required Linux filesystem semantics)
>
> > Greetings,
> >
> > Andres Freund
> >
> >




Re: Purpose of wal_init_zero

2025-01-20 Thread Hannu Krosing
On Fri, Jan 17, 2025 at 10:29 PM Andres Freund  wrote:
...
> > I see, PG once had fallocate [1] (which was reverted by [2] due to some
> > performance regression concern). The original OSS discussion was in [3].
> > The perf regression was reported in [4]. Looks like this was due to how
> > ext4 handled extents and uninitialized data[5] and that seems to be fixed
> > in [6]. I'll check with Theodore Ts'o to confirm on [6].
> >
> > Could we consider adding back fallocate?
>
> Fallocate doesn't really help unfortunately. On common filesystems (like
> ext4/xfs) it just allocates filespace without zeroing out the underlying
> blocks.

@Theodore Tso - can you confirm that ext4 (and xfs?) does not use the
low-level WRITE ZEROES commands for initializing the newly allocated
blocks?

And that the new blocks will be written twice - once for zero-filling
and then with the actual data.

For WAL we really don't need to zero out anything - we already do WAL
file recycling without zero-filling the recycled segments, so
obviously it is all right to have random garbage in the pages.

> To make that correct, those filesystems keep a bitmap indicating which
> blocks in the range are not yet written. Unfortunately updating those blocks
> is a metadata operation and thus requires journaling.
>
> I've seen some mild speedups by first using fallocate and then zeroing out the
> file, particularly with larger segment sizes.

Did you just write a single zero page per file page to avoid
duplicating the work?

> I think mainly due to avoiding
> delayed allocation in the filesystem, rather than actually reducing
> fragmentation. But it really isn't a whole lot.
>
> I've in the past tried to get the linux filesystem developers to add an
> fallocate mode that doesn't utilize the "unwritten extents" "optimization",
> but didn't have luck with that.

Are you saying that the first write to a newly allocated empty block
will currently do two writes to the disk - first writing the zeros and
then writing the actual data?

Or just that the overhead from journalling the change to the
not-yet-written bitmap cancels out the win from not writing the page
twice?

> The block layer in linux actually does have
> support for zeroing out regions of blocks without having to do actually write
> the data, but it's only used in some narrow cases (don't remember the
> details).

For WAL files we should be ok by either using the declarative no-write
zero fill in the block layer, or just using the pages as-is without
any zero-filling at all (though this is likely not possible because of
required Linux filesystem semantics)

> Greetings,
>
> Andres Freund
>
>




Re: Purpose of wal_init_zero

2025-01-17 Thread Andres Freund
Hi,

On 2025-01-16 14:50:57 +0530, Ritu Bhandari wrote:
> Adding to Andy Fan's point above:
>
> If we increase WAL segment size from 16MB to 64MB, initializing the 64MB
> WAL segment inline can cause several seconds of freeze on all write
> transactions when it happens. Writing out a newly zero-filled 64MB WAL
> segment takes several seconds for smaller disk sizes.
>
> Disk size (GB)   Throughput per GiB (MiBps)   Throughput (MiBps)   Time to write 64MB (s)
>         10                 0.48                        5                  13.33
>         32                 0.48                       15                   4.17
>         64                 0.48                       31                   2.08
>        128                 0.48                       61                   1.04
>        256                 0.48                      123                   0.52
>        500                 0.48                      240                   0.27
>        834                 0.48                      400                   0.16
>      1,000                 0.48                      480                   0.13
>
>
> Writing full 64MB zeroes every WAL file switch will not just cause general
> performance degradation, but more concerningly also makes the workload more
> "jittery", by stopping all WAL writes, so all write workloads, at every WAL
> switch for the time it takes to zero-fill.

I agree. But I don't think a ~2x throughput reduction in common cases is an
acceptable price for disabling WAL init by default.

I think what we instead ought to do is to more aggressively initialize WAL
files ahead of time, so it doesn't happen while holding crucial locks.  We
know the recent rate of WAL generation, and we could easily track up to which
LSN we have recycled WAL segments. Armed with that information walwriter (or
something else) should try to ensure that there's always a fair amount of
pre-allocated WAL.
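
As a rough, purely illustrative sketch (not the actual patch; apart from
GetXLogInsertRecPtr(), every helper name below is invented), the
walwriter-side logic could look roughly like this:

#include <stdint.h>

typedef uint64_t XLogRecPtr;                    /* stand-in for the real typedef */

/* Hypothetical helpers, named only for illustration. */
extern XLogRecPtr GetXLogInsertRecPtr(void);    /* this one exists today in xlog.h */
extern uint64_t   EstimateRecentWalRate(void);  /* hypothetical: recent bytes/sec */
extern XLogRecPtr GetPreInitializedUpTo(void);  /* hypothetical bookkeeping */
extern void       PreInitNextWalSegment(void);  /* hypothetical: create + zero + fsync */

#define PREINIT_HORIZON_SECS 5                  /* hypothetical cushion target */
#define WAL_SEGMENT_SIZE (16 * 1024 * 1024)

static void
MaybePreInitWalSegments(void)
{
    XLogRecPtr insert_lsn = GetXLogInsertRecPtr();
    uint64_t   cushion = EstimateRecentWalRate() * PREINIT_HORIZON_SECS;
    XLogRecPtr ready_upto = GetPreInitializedUpTo();

    /*
     * Zero-fill future segments until the cushion is big enough, so user
     * backends never pay for it while holding WALWriteLock.
     */
    while (ready_upto < insert_lsn + cushion)
    {
        PreInitNextWalSegment();
        ready_upto += WAL_SEGMENT_SIZE;
    }
}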


If your disk only has a sequential write speed of 4.8MB/s, I don't think any
nontrivial database workload is going to work well. And it obviously makes no
sense whatsoever to increase the WAL segment size on such systems.

I don't think we can really make the smallest disks in your list work well -
there's only so much we can do given the low limits, and we can probably invest
our time much more fruitfully by focusing on systems with disk speeds that
aren't slower than spinning rust from the 1990s.

That's not to say it's not worth working on preallocating WAL files. But
that's not going to help much if initializing a single WAL segment is going to
eat the entire bandwidth budget for 10+ seconds.


> Also about WAL recycle, during our performance benchmarking, we noticed
> that high volume of updates or inserts will tend to generate WAL faster
> than standard checkpoint processes can keep up resulting in increased WAL
> file creation (instead of rotation) and zero-filling, which significantly
> degrades performance.

I'm not sure I understand the specifics here - did the high WAL generation
rate result in the recycling taking too long?  Or did checkpointer take too
long to write out data, and because of that recycling didn't happen frequently
enough?


> I see, PG once had fallocate [1] (which was reverted by [2] due to some
> performance regression concern). The original OSS discussion was in [3].
> The perf regression was reported in [4]. Looks like this was due to how
> ext4 handled extents and uninitialized data[5] and that seems to be fixed
> in [6]. I'll check with Theodore Ts'o to confirm on [6].
>
> Could we consider adding back fallocate?

Fallocate doesn't really help unfortunately. On common filesystems (like
ext4/xfs) it just allocates filespace without zeroing out the underlying
blocks. To make that correct, those filesystems keep a bitmap indicating which
blocks in the range are not yet written. Unfortunately updating those blocks
is a metadata operation and thus requires journaling.

I've seen some mild speedups by first using fallocate and then zeroing out the
file, particularly with larger segment sizes. I think mainly due to avoiding
delayed allocation in the filesystem, rather than actually reducing
fragmentation. But it really isn't a whole lot.
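
For reference, a minimal sketch of that fallocate-then-zero pattern with
plain POSIX calls (error handling trimmed; this is not the actual PostgreSQL
code path, which looks different and goes through pg_pwrite_zeros() etc.):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Reserve the whole segment first, then overwrite it with zeros so the
 * filesystem no longer has "unwritten" extents to resolve later. */
static int
init_segment(int fd, off_t segsize)
{
    char  buf[8192];
    off_t off;

    memset(buf, 0, sizeof(buf));

    if (posix_fallocate(fd, 0, segsize) != 0)   /* allocate space up front */
        return -1;

    for (off = 0; off < segsize; off += sizeof(buf))
        if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t) sizeof(buf))
            return -1;

    return fsync(fd);   /* one flush for data and metadata */
}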

I've in the past tried to get the linux filesystem developers to add an
fallocate mode that doesn't utilize the "unwritten extents" "optimization",
but didn't have luck with that.  The block layer in linux actually does have
support for zeroing out regions of blocks without having to actually write
the data, but it's only used in some narrow cases (don't remember the
details).

Greetings,

Andres Freund




Re: Purpose of wal_init_zero

2025-01-16 Thread Hannu Krosing
On Thu, Jan 16, 2025 at 10:21 AM Ritu Bhandari
 wrote:

> Could we consider adding back fallocate?

Or if not adding it back for all, then maybe have a 3-valued wal_init_zero:

wal_init_zero = on;
wal_init_zero = off;
wal_init_zero = fallocate;

?
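
For what it's worth, the GUC side of that could be sketched roughly the way
other enum GUCs are declared (struct config_enum_entry is from utils/guc.h);
the enum values and option names below are just invented to show the shape,
not a worked-out proposal:

/* Hypothetical three-valued setting, for illustration only. */
typedef enum WalInitZeroMode
{
    WAL_INIT_ZERO_OFF,
    WAL_INIT_ZERO_ON,
    WAL_INIT_ZERO_FALLOCATE
} WalInitZeroMode;

static const struct config_enum_entry wal_init_zero_options[] = {
    {"off", WAL_INIT_ZERO_OFF, false},
    {"on", WAL_INIT_ZERO_ON, false},
    {"fallocate", WAL_INIT_ZERO_FALLOCATE, false},
    {NULL, 0, false}
};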




Re: Purpose of wal_init_zero

2025-01-16 Thread Ritu Bhandari
Hi,

Adding to Andy Fan's point above:

If we increase WAL segment size from 16MB to 64MB, initializing the 64MB
WAL segment inline can cause several seconds of freeze on all write
transactions when it happens. Writing out a newly zero-filled 64MB WAL
segment takes several seconds for smaller disk sizes.

Disk size (GB)   Throughput per GiB (MiBps)   Throughput (MiBps)   Time to write 64MB (s)
        10                 0.48                        5                  13.33
        32                 0.48                       15                   4.17
        64                 0.48                       31                   2.08
       128                 0.48                       61                   1.04
       256                 0.48                      123                   0.52
       500                 0.48                      240                   0.27
       834                 0.48                      400                   0.16
     1,000                 0.48                      480                   0.13
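
(The last column is just 64 MiB divided by the disk's throughput, i.e. disk
size in GB * 0.48 MiBps: 64 / (10 * 0.48) = ~13.3 s for the 10 GB disk, and
64 / 480 = ~0.13 s for the 1 TB one.)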


Writing full 64MB zeroes every WAL file switch will not just cause general
performance degradation, but more concerningly also makes the workload more
"jittery", by stopping all WAL writes, so all write workloads, at every WAL
switch for the time it takes to zero-fill.

Also, about WAL recycling: during our performance benchmarking, we noticed
that a high volume of updates or inserts tends to generate WAL faster
than the standard checkpoint process can keep up with, resulting in increased
WAL file creation (instead of rotation) and zero-filling, which significantly
degrades performance.

I see, PG once had fallocate [1] (which was reverted by [2] due to some
performance regression concern). The original OSS discussion was in [3].
The perf regression was reported in [4]. Looks like this was due to how
ext4 handled extents and uninitialized data[5] and that seems to be fixed
in [6]. I'll check with Theodore Ts'o to confirm on [6].

Could we consider adding back fallocate?

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=269e780
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5b571bb
[3]
https://www.postgresql.org/message-id/flat/CAKuK5J0raLwOiKfSh5d8SxtCY2snJAMsfo6RGTBMfcQYB%2B-faQ%40mail.gmail.com
[4]
https://www.postgresql.org/message-id/flat/CAA-aLv7tYHDzMGg4HtDZh0RQZjJc2v2weJ-Obm4yvkw6ePe9Qw%40mail.gmail.com
[5]
https://www.postgresql.org/message-id/CAKuK5J3R-oBh%2B9f23Ko-0-gt5Zi1REgg7ng-awQuUsgiY2B7GQ%40mail.gmail.com
[6]
https://github.com/torvalds/linux/commit/b71fc079b5d8f42b2a52743c8d2f1d35d655b1c5

Thanks,
-Ritu

On Thu, 16 Jan 2025 at 12:01, Andy Fan  wrote:

>
> Hi,
>
> >
> > c=1 && \
> >   psql -c checkpoint -c 'select pg_switch_wal()' && \
> >   pgbench -n -M prepared -c$c -j$c -f <(echo "SELECT
> pg_logical_emit_message(true, 'test', repeat('0', 8192));";) -P1 -t 1
> >
> > wal_init_zero = 1: 885 TPS
> > wal_init_zero = 0: 286 TPS.
>
> Your theory looks clear and the result is promising. I can reproduce a
> similar result in my setup.
>
> on: tps = 1588.538378 (without initial connection time)
> off: tps = 857.755343 (without initial connection time)
>
> > Of course I chose this case to be intentionally extreme - each transaction
> > fills a bit more than one page of WAL and immediately flushes it. That
> > guarantees that each commit needs a separate filesystem metadata flush and a
> > flush of the data for the fdatasync() at commit.
>
> However if I increase the clients from 1 to 64 (this may break this
> extreme case because of group commit) then we can see the wal_init_zero caused
> noticeable regression.
>
> c=64 && \
>psql -c checkpoint -c 'select pg_switch_wal()' && \
>pgbench -n -M prepared -c$c -j$c -f <(echo "SELECT
> pg_logical_emit_message(true, 'test', repeat('0', 8192));";) -P1 -t 1
>
> off:
> tps = 12135.110730 (without initial connection time)
> tps = 11964.016277 (without initial connection time)
> tps = 12078.458724 (without initial connection time)
>
> on:
> tps = 9392.374563 (without initial connection time)
> tps = 9391.916410 (without initial connection time)
> tps = 9390.503777 (without initial connection time)
>
> Now the wal_init_zero happens on the user backend and other backends also
> need to wait for it, this looks not good to me. I find walwriter doesn't
> do much things, I'd like to have a try if we can offload wal_init_zero
> to the walwriter.
>
> About the wal_recycle, IIUC, it can only recycle a wal file during
> Checkpoint, but checkpoint doesn't happens often.
>
> --
> Best Regards
> Andy Fan
>
>
>
>


Re: Purpose of wal_init_zero

2025-01-15 Thread Andy Fan


Hi,

>
> c=1 && \
>   psql -c checkpoint -c 'select pg_switch_wal()' && \
>   pgbench -n -M prepared -c$c -j$c -f <(echo "SELECT 
> pg_logical_emit_message(true, 'test', repeat('0', 8192));";) -P1 -t 1
>
> wal_init_zero = 1: 885 TPS
> wal_init_zero = 0: 286 TPS.

Your theory looks clear and the result is promising. I can reproduce a
similar result in my setup.

on: tps = 1588.538378 (without initial connection time)
off: tps = 857.755343 (without initial connection time)  

> Of course I chose this case to be intentionally extreme - each transaction
> fills a bit more than one page of WAL and immediately flushes it. That
> guarantees that each commit needs a separate filesystem metadata flush and a
> flush of the data for the fdatasync() at commit.

However, if I increase the clients from 1 to 64 (this may break this
extreme case because of group commit), then we can see that wal_init_zero
causes a noticeable regression.

c=64 && \
   psql -c checkpoint -c 'select pg_switch_wal()' && \
   pgbench -n -M prepared -c$c -j$c -f <(echo "SELECT 
pg_logical_emit_message(true, 'test', repeat('0', 8192));";) -P1 -t 1

off:
tps = 12135.110730 (without initial connection time)
tps = 11964.016277 (without initial connection time)
tps = 12078.458724 (without initial connection time)

on:
tps = 9392.374563 (without initial connection time)
tps = 9391.916410 (without initial connection time)
tps = 9390.503777 (without initial connection time)

Now the wal_init_zero work happens in the user backend, and other backends
also need to wait for it, which doesn't look good to me. I find the walwriter
doesn't do much, so I'd like to try whether we can offload wal_init_zero
to the walwriter.

About wal_recycle: IIUC, a WAL file can only be recycled during a
checkpoint, but checkpoints don't happen often.

-- 
Best Regards
Andy Fan





Re: Purpose of wal_init_zero

2025-01-15 Thread Andres Freund
Hi,

On 2025-01-15 09:12:17 +, Andy Fan wrote:
> It is unclear to me why do we need wal_init_zero. Per comments:
>
>   /*
>* Zero-fill the file.  With this setting, we do this the hard 
> way to
>* ensure that all the file space has really been allocated.  On
>* platforms that allow "holes" in files, just seeking to the 
> end
>* doesn't allocate intermediate space.  This way, we know that 
> we
>* have all the space and (after the fsync below) that all the
>* indirect blocks are down on disk.  Therefore, fdatasync(2) or
>* O_DSYNC will be sufficient to sync future writes to the log 
> file.
>*/
>
> I can understand that "the file space has really been allocated", but
> why do we care about this?

Performance.

If you create an empty segment by lseek'ing to the end (i.e. typically
resulting in a large "hole" in the file that's not yet backed by storage) or
you allocate it by calling fallocate() (allocates space, but doesn't write
it), durable writes need to do more work.

The reason for the additional work is that you don't just need to write the new
WAL contents and then flush the write cache, you will also (on most, but not
all, filesystems) incur a filesystem metadata write.

In case of the file-with-hole approach, the filesystem has to first allocate
space to the file, journal the relevant metadata change, probably flush that
change, then write the data, then for the fdatasync() at COMMIT another cache
flush is needed.
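
As a rough illustration (plain POSIX sketch, error handling omitted; this is
not the actual xlog.c code), the three ways of creating a segment-sized file
differ like this:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define SEGSIZE (16 * 1024 * 1024)

/* "Hole": nothing is allocated until first written, so later WAL writes
 * also trigger filesystem allocation plus journaled metadata updates. */
static void
create_sparse(int fd)
{
    char zero = 0;

    lseek(fd, SEGSIZE - 1, SEEK_SET);
    write(fd, &zero, 1);
}

/* fallocate: space is reserved, but extents stay "unwritten", so the first
 * write to each range still needs a journaled metadata update. */
static void
create_fallocated(int fd)
{
    posix_fallocate(fd, 0, SEGSIZE);
}

/* wal_init_zero=on: every block is written once up front, so later WAL
 * writes can get away with just fdatasync(), no metadata changes. */
static void
create_zero_filled(int fd)
{
    char buf[8192];

    memset(buf, 0, sizeof(buf));
    for (off_t off = 0; off < SEGSIZE; off += sizeof(buf))
        pwrite(fd, buf, sizeof(buf), off);
    fsync(fd);
}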

If your workload doesn't commit very often, compared to the rate of WAL
generation, that will often be fine. E.g. if you do bulk data load. The added
number of flushes aren't big.

However, if your workload includes a lot of small WAL writes & flushes, the
increased number of flushes can hurt rather badly.

If you have wal_recycle=true, this overhead will only be paid the first time a
WAL segment is used, of course, not after recycling.


Here's an example:

Storage is an older client oriented NVMe SSD (SAMSUNG MZVLB1T0HBLR-000L7).

To make it easier to ensure that the new WAL file case is tested, I turned
wal_recycle off.

To make the pattern of WAL easily repeatable, I'm using
pg_logical_emit_message() to emit a WAL record that then needs to be flushed
to disk, because I pass transactional = true.

c=1 && \
  psql -c checkpoint -c 'select pg_switch_wal()' && \
  pgbench -n -M prepared -c$c -j$c -f <(echo "SELECT 
pg_logical_emit_message(true, 'test', repeat('0', 8192));";) -P1 -t 1

wal_init_zero = 1: 885 TPS
wal_init_zero = 0: 286 TPS.


Of course I chose this case to be intentionally extreme - each transaction
fills a bit more than one page of WAL and immediately flushes it. That
guarantees that each commit needs a separate filesystem metadata flush and a
flush of the data for the fdatasync() at commit.

If I instead emit huge WAL record / flush the WAL rarely, e.g. by passing
16*1024*1024 to repeat in the command above, the difference completely
vanishes:

wal_init_zero = 1: 6.25
wal_init_zero = 0: 6.27

If anything the init_zero path now is slower, because it needs to do more
work.

The reason it doesn't hurt to have wal_init_zero disabled in this case is that
the workload leads to huge WAL writes, which means the additional number of
metadata flushes is very small.


Similarly, if the WAL writes/flushes are very small (say a single '0' in the
test from above), there also won't be a benefit from wal_init_zero=1, because
now most of the time we're just writing to the same WAL page as the previous
transaction, which won't require filesystem metadata changes.



Note that not all filesystems can benefit from wal_init_zero=1, e.g. ZFS or
BTRFS won't benefit, because they always allocate new disk space for each
write. With the associated overheads.


Greetings,

Andres Freund




Re: Purpose of wal_init_zero

2025-01-15 Thread Aleksander Alekseev
Hi Michael,

> My understanding was that if we have pre-allocated wal space (and
> re-cycle already used wal files), we can still write wal records into
> that pre-allocated space and still issue changes to data files as long
> as we don't need to enlarge any. So an out-of-space situation is less
> bad in that case than if we fail to write WAL with ENOSPC.

True, but if we are out of space we are going to crash anyway because
we can't allocate new heap pages, write regular (text) logfiles,
update CLOG and other SLRU-based structures, etc. So if it is "less
bad" the difference seems to be negligible - either way you crash.
Unless I'm missing something of course.

-- 
Best regards,
Aleksander Alekseev




Re: Purpose of wal_init_zero

2025-01-15 Thread Michael Banck
Hi,

On Wed, Jan 15, 2025 at 09:12:17AM +, Andy Fan wrote:
> I can understand that "the file space has really been allocated", but
> why do we care about this?
> 
> One reason I can think of is it has something with "out-of-disk-space"
> situation, even though what's the benefit of it since we can't do
> anything in such case anyway no matter the wal space is pre-allocated or
> not? 

My understanding was that if we have pre-allocated wal space (and
recycle already used wal files), we can still write wal records into
that pre-allocated space and still issue changes to data files as long
as we don't need to enlarge any. So an out-of-space situation is less
bad in that case than if we fail to write WAL with ENOSPC.


Michael




Re: Purpose of wal_init_zero

2025-01-15 Thread Aleksander Alekseev
Hi,

> Good catch. This comment is not 100% clear to me either.
> [...]

TWIMC zero-filling was added in 33cc5d8a4d0d (year 2001). The commit
doesn't reference the discussion behind this change and the comment
text hasn't changed since then.

-- 
Best regards,
Aleksander Alekseev




Re: Purpose of wal_init_zero

2025-01-15 Thread Aleksander Alekseev
Hi,

>> I can understand that "the file space has really been allocated", but
>> why do we care about this?
>>
>> [...]
>
> Can you report the benchmark difference with false (disabled)?
> Maybe It's worth considering leaving false as the default.

Good catch. This comment is not 100% clear to me either. I think that
the actual idea of zero-filling *might* be to give the filesystem a
hint to allocate sequential blocks for the file, which may be
beneficial if WAL is stored on HDD (which arguably few people do in
2025).

This GUC was added fairly recently, in 2019; see commit
475861b2615d and discussion [1]. I didn't read the entire thread
(yet?) but it looks like the actual purpose of the GUC is to *disable*
zero-filling on the filesystems with copy-on-write such as ZFS where
it's useless. This is also said in the documentation [2]. Zero-filling
was the default behaviour before the commit.

However the documentation doesn't say why we can't just always use
wal_init_zero = off, and perhaps get rid of this GUC altogether.

In either case the referenced comment is misleading.

[1]: 
https://postgr.es/m/CACPQ5Fo00QR7LNAcd1ZjgoBi4y97%2BK760YABs0vQHH5dLdkkMA%40mail.gmail.com
[2]: https://www.postgresql.org/docs/current/runtime-config-wal.html
-- 
Best regards,
Aleksander Alekseev




Re: Purpose of wal_init_zero

2025-01-15 Thread Ranier Vilela
Hi.

On Wed, Jan 15, 2025 at 06:12 AM, Andy Fan
wrote:

>
> Hi,
>
> It is unclear to me why do we need wal_init_zero. Per comments:
>
> /*
>  * Zero-fill the file.  With this setting, we do this the
> hard way to
>  * ensure that all the file space has really been
> allocated.  On
>  * platforms that allow "holes" in files, just seeking to
> the end
>  * doesn't allocate intermediate space.  This way, we know
> that we
>  * have all the space and (after the fsync below) that all
> the
>  * indirect blocks are down on disk.  Therefore,
> fdatasync(2) or
>  * O_DSYNC will be sufficient to sync future writes to the
> log file.
>  */
>
> I can understand that "the file space has really been allocated", but
> why do we care about this?
>
> One reason I can think of is it has something with "out-of-disk-space"
> situation, even though what's the benefit of it since we can't do
> anything in such case anyway no matter the wal space is pre-allocated or
> not?
>
> Another reason I can guess is it provides some performance gain in the
> future XLogWrite to that file. However in the normal case, the
> wal_init_zero is still under "LWLockAcquire(WALWriteLock,
> LW_EXCLUSIVE);" so it cost a lot at first. (more traffic to file system
> due to pg_pwrite_zeros and later fsync). I saw "Therefore, fdatasync(2)
> or O_DSYNC will be sufficient to sync future writes to the log file. ",
> but it is still unclear to me.
>
> I noticed this during a benchmark, where WALWriteLock is waited and the
> holder is running WAIT_EVENT_WAL_INIT_WRITE.
>
Can you report the benchmark difference with false (disabled)?
Maybe it's worth considering leaving false as the default.

best regards,
Ranier Vilela