Re: Purpose of wal_init_zero
Our pg_stat_wal view currently doesn't expose the number of WAL segments recycled, although this information is already logged by the checkpointer in the database log. For example:

LOG: checkpoint complete: wrote 317 buffers (1.9%); 0 WAL file(s) added, 0 removed, 3 recycled; write=0.003 s, sync=0.001 s, total=0.004 s; sync files=8, longest=0.001 s, average=0.001 s; distance=56015 kB, estimate=56015 kB; lsn=0/6FB8478, redo lsn=0/6FB8420

It might be beneficial to surface the number of WAL files added, removed, and recycled in the pg_stat_wal view. As the wal_init_zero setting is typically static, I don't think it's necessary to report the number of WAL files zero-filled separately. Note that the checkpoint log's count of added WAL files doesn't include files added via XLogFileInit(), which is a significant statistic to consider.

What do others think?

On Fri, Jan 24, 2025 at 7:22 AM Hannu Krosing wrote:
> On Wed, Jan 22, 2025 at 10:18 PM Robert Pang wrote:
> > On Wed, Jan 15, 2025 at 12:05 PM Andres Freund wrote:
> > > If you have wal_recycle=true, this overhead will only be paid the first time a
> > > WAL segment is used, of course, not after recycling.
> >
> > Today, our pg_stat_wal view [1] does not report the no. of WAL
> > segments recycled. How about if we add a column to report it so users
> > can track how many WAL segments are recycled vs. created over a period
> > of time?
>
> Maybe even add the full set of (recycled, created, deleted and
> zero-filled) so we have a full picture.
>
> And possibly duplicate the set for "in the background" and "inline /
> stalling the production work" for the operations where it makes sense
> (created, zero-filled).
>
> > [1] https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-WAL-VIEW
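To make the shape of that proposal concrete, here is a minimal sketch of the counters such a change might maintain; all names below are hypothetical and none of this is existing PostgreSQL code:

#include <stdint.h>

/*
 * Illustrative only: hypothetical counters that a patch along these lines
 * might accumulate and expose as new pg_stat_wal columns.  None of these
 * names exist in PostgreSQL today.
 */
typedef struct WalSegmentStats
{
    uint64_t    segs_added;     /* brand-new segments created, incl. XLogFileInit() paths */
    uint64_t    segs_removed;   /* old segments unlinked at checkpoint */
    uint64_t    segs_recycled;  /* old segments renamed into place for reuse */
} WalSegmentStats;

static WalSegmentStats wal_segment_stats;

/* Example: bumped from whatever code path recycles a segment. */
static inline void
count_wal_segment_recycled(void)
{
    wal_segment_stats.segs_recycled++;
}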
Re: Purpose of wal_init_zero
On Wed, Jan 22, 2025 at 10:18 PM Robert Pang wrote:
> On Wed, Jan 15, 2025 at 12:05 PM Andres Freund wrote:
> > If you have wal_recycle=true, this overhead will only be paid the first time a
> > WAL segment is used, of course, not after recycling.
>
> Today, our pg_stat_wal view [1] does not report the no. of WAL
> segments recycled. How about if we add a column to report it so users
> can track how many WAL segments are recycled vs. created over a period
> of time?

Maybe even add the full set of (recycled, created, deleted and zero-filled) so we have a full picture.

And possibly duplicate the set for "in the background" and "inline / stalling the production work" for the operations where it makes sense (created, zero-filled).

> [1] https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-WAL-VIEW
Re: Purpose of wal_init_zero
On Wed, Jan 15, 2025 at 12:05 PM Andres Freund wrote: > > If you have wal_recycle=true, this overhead will only be paid the first time a > WAL segment is used, of course, not after recycling. Today, our pg_stat_wal view [1] does not report the no. of WAL segments recycled. How about if we add a column to report it so users can track how many WAL segments are recycled vs. created over a period of time? [1] https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-WAL-VIEW
Re: Purpose of wal_init_zero
Hi @Andres Freund

> I'm not sure I understand the specifics here - did the high WAL generation
> rate result in the recycling taking too long? Or did checkpointer take too
> long to write out data, and because of that recycling didn't happen frequently
> enough?

If the WAL generation rate greatly exceeds max_wal_size within a checkpoint interval, there aren't enough recycled WAL files available and the system has to create a large number of new WAL files. This can significantly increase the initialization time, especially if we've increased the WAL segment size to 64 MB (4x the 16 MB default). Conversely, setting a very high max_wal_size to retain more recycled WAL files can lead to longer recovery times, as the total WAL size might become very large.

I'll talk to Theodore and confirm on the fallocate part.

> I think what we instead ought to do is to more aggressively initialize WAL
> files ahead of time, so it doesn't happen while holding crucial locks. We
> know the recent rate of WAL generation, and we could easily track up to which
> LSN we have recycled WAL segments. Armed with that information walwriter (or
> something else) should try to ensure that there's always a fair amount of
> pre-allocated WAL.

I agree. Having preallocated WAL files ahead of time would be the ideal scenario.

> I put some patches together for this a few years ago [0], but ended up
> abandoning them due to lack of interest. I'm happy to revisit that effort
> if folks do become interested.

Great to know about this, and it aligns with our thinking. We can continue the discussion on the other thread. I can also help wherever needed.

On Tue, 21 Jan 2025 at 06:39, Andy Fan wrote:
> Hi,
>
> > On Fri, Jan 17, 2025 at 04:29:14PM -0500, Andres Freund wrote:
> >> I think what we instead ought to do is to more aggressively initialize WAL
> >> files ahead of time, so it doesn't happen while holding crucial locks. We
> >> know the recent rate of WAL generation, and we could easily track up to which
> >> LSN we have recycled WAL segments. Armed with that information walwriter (or
> >> something else) should try to ensure that there's always a fair amount of
> >> pre-allocated WAL.
> >
> > I put some patches together for this a few years ago [0], but ended up
> > abandoning them due to lack of interest. I'm happy to revisit that effort
> > if folks do become interested.
>
> Great to know this, I went through that thread and found the main
> considerations are pretty similar to what I am thinking when working
> out the PoC. I will go to [0] for further discussion on this topic.
>
> [0] https://postgr.es/m/20220408203003.GA1630183%40nathanxps13
>
> --
> Best Regards
> Andy Fan
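As a very rough sketch of the "keep a cushion of pre-initialized segments" idea: the names, the policy, and the stubbed helper below are all hypothetical, and this is not PostgreSQL code, just an illustration of the decision logic a background process could run.

#include <stdint.h>

/* Create, zero-fill, fsync, and close one future segment (details not shown). */
static void
create_initialized_segment(uint64_t segno)
{
    (void) segno;
}

/*
 * Hypothetical background loop body (e.g. run by walwriter): keep enough
 * segments initialized ahead of the current insertion point to cover the
 * recent WAL generation rate for a few seconds, so foreground backends never
 * have to zero-fill a segment while holding WALWriteLock.
 */
static void
maintain_wal_cushion(uint64_t current_segno,        /* segment being written now */
                     uint64_t *highest_ready_segno, /* last pre-initialized segment */
                     double wal_bytes_per_sec,      /* recent generation rate */
                     double cushion_seconds,        /* e.g. 5-10 seconds */
                     uint64_t seg_size_bytes)
{
    /* How many segments the recent rate will consume within the cushion. */
    uint64_t needed = (uint64_t) (wal_bytes_per_sec * cushion_seconds
                                  / (double) seg_size_bytes) + 1;

    while (*highest_ready_segno < current_segno + needed)
        create_initialized_segment(++(*highest_ready_segno));
}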
Re: Purpose of wal_init_zero
Hi,

> On Fri, Jan 17, 2025 at 04:29:14PM -0500, Andres Freund wrote:
>> I think what we instead ought to do is to more aggressively initialize WAL
>> files ahead of time, so it doesn't happen while holding crucial locks. We
>> know the recent rate of WAL generation, and we could easily track up to which
>> LSN we have recycled WAL segments. Armed with that information walwriter (or
>> something else) should try to ensure that there's always a fair amount of
>> pre-allocated WAL.
>
> I put some patches together for this a few years ago [0], but ended up
> abandoning them due to lack of interest. I'm happy to revisit that effort
> if folks do become interested.

Great to know this. I went through that thread and found the main considerations are pretty similar to what I am thinking when working out the PoC. I will go to [0] for further discussion on this topic.

> [0] https://postgr.es/m/20220408203003.GA1630183%40nathanxps13

--
Best Regards
Andy Fan
Re: Purpose of wal_init_zero
On Fri, Jan 17, 2025 at 04:29:14PM -0500, Andres Freund wrote: > I think what we instead ought to do is to more aggressively initialize WAL > files ahead of time, so it doesn't happen while holding crucial locks. We > know the recent rate of WAL generation, and we could easily track up to which > LSN we have recycled WAL segments. Armed with that information walwriter (or > something else) should try to ensure that there's always a fair amount of > pre-allocated WAL. I put some patches together for this a few years ago [0], but ended up abandoning them due to lack of interest. I'm happy to revisit that effort if folks do become interested. [0] https://postgr.es/m/20220408203003.GA1630183%40nathanxps13 -- nathan
Re: Purpose of wal_init_zero
Thinking back I can see now why disabling WAL writes with wal_level=minimal in COPY resulted in 3X better write performance instead of expected 2x - With wal_level=minimal only the heap page writes were needed, whereas with WAL writes the same page was written 3x - (heap + WAL zero-fill + WAL). -- Hannu On Mon, Jan 20, 2025 at 12:06 PM Hannu Krosing wrote: > > On Fri, Jan 17, 2025 at 10:29 PM Andres Freund wrote: > ... > > > I see, PG once had fallocate [1] (which was reverted by [2] due to some > > > performance regression concern). The original OSS discussion was in [3]. > > > The perf regression was reported in [4]. Looks like this was due to how > > > ext4 handled extents and uninitialized data[5] and that seems to be fixed > > > in [6]. I'll check with Theodore Ts'o to confirm on [6]. > > > > > > Could we consider adding back fallocate? > > > > Fallocate doesn't really help unfortunately. On common filesystems (like > > ext4/xfs) it just allocates filespace without zeroing out the underlying > > blocks. > > @Theodore Tso - can you confirm that ext4 (and xfs?) does not use the > low-level WRITE ZEROS commands for initializing the newly allocated > blocks? > > And that the new blocks will be written twice - once for zero-filling > and then with the actual data . > > For WAL we really don't need to zero out anything - we already do WAL > file recycling without zero-filling the recycled segments, so > obviously it is all right to have random garbage in the pages. > > > To make that correct, those filesystems keep a bitmap indicating which > > blocks in the range are not yet written. Unfortunately updating those blocks > > is a metadata operation and thus requires journaling. > > > > I've seen some mild speedups by first using fallocate and then zeroing out > > the > > file, particularly with larger segment sizes. > > Did you just write a single zero page per file page to avoid > duplicating the work ? > > > I think mainly due to avoiding > > delayed allocation in the filesystem, rather than actually reducing > > fragmentation. But it really isn't a whole lot. > > > > I've in the past tried to get the linux filesytem developers to add an > > fallocate mode that doesn't utilize the "unwritten extents" "optimization", > > but didn't have luck with that. > > Are you saying that the first write to a newly allocated empty block > currently will do two writes to the disk - first writing the zeros and > then writing the actual data written ? > > Or just that the overhead from journalling the change to > not-yet-written bitmap cancels out the win from not writing the page > twice ? > > > The block layer in linux actually does have > > support for zeroing out regions of blocks without having to do actually > > write > > the data, but it's only used in some narrow cases (don't remember the > > details). > > For WAL files we should be ok by either using the declarative no-write > zero fill in the block layer, or just using the pages as-is without > any zero-filling at all (though this is likely not possible because of > required Linux filesystem semantics) > > > Greetings, > > > > Andres Freund > > > >
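Back-of-the-envelope, assuming the COPY emits roughly one full page of WAL per heap page loaded and that the WAL lands in freshly created (non-recycled) segments: each 8 kB heap page then costs about 8 kB for the heap write, 8 kB for the zero-fill of the corresponding WAL space, and 8 kB for the WAL record itself, i.e. roughly 3x the bytes written compared to the heap-only case under wal_level=minimal, rather than the 2x one would expect if the WAL were only written once.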
Re: Purpose of wal_init_zero
On Fri, Jan 17, 2025 at 10:29 PM Andres Freund wrote: ... > > I see, PG once had fallocate [1] (which was reverted by [2] due to some > > performance regression concern). The original OSS discussion was in [3]. > > The perf regression was reported in [4]. Looks like this was due to how > > ext4 handled extents and uninitialized data[5] and that seems to be fixed > > in [6]. I'll check with Theodore Ts'o to confirm on [6]. > > > > Could we consider adding back fallocate? > > Fallocate doesn't really help unfortunately. On common filesystems (like > ext4/xfs) it just allocates filespace without zeroing out the underlying > blocks. @Theodore Tso - can you confirm that ext4 (and xfs?) does not use the low-level WRITE ZEROS commands for initializing the newly allocated blocks? And that the new blocks will be written twice - once for zero-filling and then with the actual data . For WAL we really don't need to zero out anything - we already do WAL file recycling without zero-filling the recycled segments, so obviously it is all right to have random garbage in the pages. > To make that correct, those filesystems keep a bitmap indicating which > blocks in the range are not yet written. Unfortunately updating those blocks > is a metadata operation and thus requires journaling. > > I've seen some mild speedups by first using fallocate and then zeroing out the > file, particularly with larger segment sizes. Did you just write a single zero page per file page to avoid duplicating the work ? > I think mainly due to avoiding > delayed allocation in the filesystem, rather than actually reducing > fragmentation. But it really isn't a whole lot. > > I've in the past tried to get the linux filesytem developers to add an > fallocate mode that doesn't utilize the "unwritten extents" "optimization", > but didn't have luck with that. Are you saying that the first write to a newly allocated empty block currently will do two writes to the disk - first writing the zeros and then writing the actual data written ? Or just that the overhead from journalling the change to not-yet-written bitmap cancels out the win from not writing the page twice ? > The block layer in linux actually does have > support for zeroing out regions of blocks without having to do actually write > the data, but it's only used in some narrow cases (don't remember the > details). For WAL files we should be ok by either using the declarative no-write zero fill in the block layer, or just using the pages as-is without any zero-filling at all (though this is likely not possible because of required Linux filesystem semantics) > Greetings, > > Andres Freund > >
Re: Purpose of wal_init_zero
Hi,

On 2025-01-16 14:50:57 +0530, Ritu Bhandari wrote:
> Adding to Andy Fan's point above:
>
> If we increase WAL segment size from 16MB to 64MB, initializing the 64MB
> WAL segment inline can cause several seconds of freeze on all write
> transactions when it happens. Writing out a newly zero-filled 64MB WAL
> segment takes several seconds for smaller disk sizes.
>
> Disk size (GB)   throughput per GiB (MiBps)   throughput (MiBps)   Time to write 64MB (seconds)
> 10               0.48                         5                    13.33
> 32               0.48                         15                   4.17
> 64               0.48                         31                   2.08
> 128              0.48                         61                   1.04
> 256              0.48                         123                  0.52
> 500              0.48                         240                  0.27
> 834              0.48                         400                  0.16
> 1,000            0.48                         480                  0.13
>
> Writing full 64MB zeroes every WAL file switch will not just cause general
> performance degradation, but more concerningly also makes the workload more
> "jittery", by stopping all WAL writes, so all write workloads, at every WAL
> switch for the time it takes to zero-fill.

I agree. But I don't think a ~2x reduction in common cases is going to be an OK price to default to disabling wal init.

I think what we instead ought to do is to more aggressively initialize WAL files ahead of time, so it doesn't happen while holding crucial locks. We know the recent rate of WAL generation, and we could easily track up to which LSN we have recycled WAL segments. Armed with that information walwriter (or something else) should try to ensure that there's always a fair amount of pre-allocated WAL.

If your disk only has a sequential write speed of 4.8MB/s, I don't think any nontrivial database workload is going to work well. And it obviously makes no sense whatsoever to increase the WAL segment size on such systems. I don't think we really can make the smallest disks in your list work well - there's only so much we can do given the low limits, and we can probably invest our time much more fruitfully by focusing on systems with disk speeds that aren't slower than spinning rust from the 1990s.

That's not to say it's not worth working on preallocating WAL files. But that's not going to help much if initializing a single WAL segment is going to eat the entire bandwidth budget for 10+ seconds.

> Also about WAL recycle, during our performance benchmarking, we noticed
> that high volume of updates or inserts will tend to generate WAL faster
> than standard checkpoint processes can keep up resulting in increased WAL
> file creation (instead of rotation) and zero-filling, which significantly
> degrades performance.

I'm not sure I understand the specifics here - did the high WAL generation rate result in the recycling taking too long? Or did checkpointer take too long to write out data, and because of that recycling didn't happen frequently enough?

> I see, PG once had fallocate [1] (which was reverted by [2] due to some
> performance regression concern). The original OSS discussion was in [3].
> The perf regression was reported in [4]. Looks like this was due to how
> ext4 handled extents and uninitialized data [5] and that seems to be fixed
> in [6]. I'll check with Theodore Ts'o to confirm on [6].
>
> Could we consider adding back fallocate?

Fallocate doesn't really help unfortunately. On common filesystems (like ext4/xfs) it just allocates filespace without zeroing out the underlying blocks. To make that correct, those filesystems keep a bitmap indicating which blocks in the range are not yet written. Unfortunately updating those blocks is a metadata operation and thus requires journaling.
I've seen some mild speedups by first using fallocate and then zeroing out the file, particularly with larger segment sizes. I think that's mainly due to avoiding delayed allocation in the filesystem, rather than actually reducing fragmentation. But it really isn't a whole lot.

I've in the past tried to get the linux filesystem developers to add an fallocate mode that doesn't utilize the "unwritten extents" "optimization", but didn't have luck with that. The block layer in linux actually does have support for zeroing out regions of blocks without having to actually write the data, but it's only used in some narrow cases (don't remember the details).

Greetings,

Andres Freund
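A minimal sketch of that "fallocate first, then overwrite with zeroes" approach, using the plain Linux fallocate(2) call; this is simplified (no error reporting or retries) and is not the actual xlog.c code.

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Reserve the file space up front with fallocate(), then explicitly write
 * zeroes over it.  The fallocate() avoids delayed allocation, but because
 * ext4/xfs mark the preallocated range as "unwritten", the explicit zero
 * writes are still needed to avoid per-commit metadata journaling later.
 */
static int
init_wal_segment(int fd, off_t seg_size)
{
    char        zeros[8192];

    if (fallocate(fd, 0, 0, seg_size) != 0)
        return -1;              /* a real implementation would fall back to plain writes */

    memset(zeros, 0, sizeof(zeros));
    for (off_t off = 0; off < seg_size; off += (off_t) sizeof(zeros))
    {
        if (pwrite(fd, zeros, sizeof(zeros), off) != (ssize_t) sizeof(zeros))
            return -1;
    }
    return fsync(fd);
}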
Re: Purpose of wal_init_zero
On Thu, Jan 16, 2025 at 10:21 AM Ritu Bhandari wrote:
> Could we consider adding back fallocate?

Or if not adding it back for all, then maybe have a 3-valued wal_init_zero:

wal_init_zero = on;
wal_init_zero = off;
wal_init_zero = fallocate;

?
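Loosely sketching what such a three-valued setting could look like in C; the enum and option table below are hypothetical and not wired into the real GUC machinery.

/* Hypothetical three-way WAL initialization setting. */
typedef enum WalInitMethod
{
    WAL_INIT_OFF,               /* just extend the file, leaving a hole */
    WAL_INIT_ZERO,              /* write zeroes over the whole segment (today's "on") */
    WAL_INIT_FALLOCATE          /* reserve space with fallocate(), no zero writes */
} WalInitMethod;

/* Spellings a user might put in postgresql.conf, mapped to the enum. */
static const struct
{
    const char *name;
    WalInitMethod value;
} wal_init_options[] = {
    {"off", WAL_INIT_OFF},
    {"on", WAL_INIT_ZERO},
    {"fallocate", WAL_INIT_FALLOCATE},
};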
Re: Purpose of wal_init_zero
Hi,

Adding to Andy Fan's point above:

If we increase WAL segment size from 16MB to 64MB, initializing the 64MB WAL segment inline can cause several seconds of freeze on all write transactions when it happens. Writing out a newly zero-filled 64MB WAL segment takes several seconds for smaller disk sizes.

Disk size (GB)   throughput per GiB (MiBps)   throughput (MiBps)   Time to write 64MB (seconds)
10               0.48                         5                    13.33
32               0.48                         15                   4.17
64               0.48                         31                   2.08
128              0.48                         61                   1.04
256              0.48                         123                  0.52
500              0.48                         240                  0.27
834              0.48                         400                  0.16
1,000            0.48                         480                  0.13

Writing full 64MB zeroes every WAL file switch will not just cause general performance degradation, but more concerningly also makes the workload more "jittery", by stopping all WAL writes, so all write workloads, at every WAL switch for the time it takes to zero-fill.

Also about WAL recycle: during our performance benchmarking, we noticed that a high volume of updates or inserts will tend to generate WAL faster than standard checkpoint processes can keep up, resulting in increased WAL file creation (instead of rotation) and zero-filling, which significantly degrades performance.

I see, PG once had fallocate [1] (which was reverted by [2] due to some performance regression concern). The original OSS discussion was in [3]. The perf regression was reported in [4]. Looks like this was due to how ext4 handled extents and uninitialized data [5], and that seems to be fixed in [6]. I'll check with Theodore Ts'o to confirm on [6].

Could we consider adding back fallocate?

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=269e780
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5b571bb
[3] https://www.postgresql.org/message-id/flat/CAKuK5J0raLwOiKfSh5d8SxtCY2snJAMsfo6RGTBMfcQYB%2B-faQ%40mail.gmail.com
[4] https://www.postgresql.org/message-id/flat/CAA-aLv7tYHDzMGg4HtDZh0RQZjJc2v2weJ-Obm4yvkw6ePe9Qw%40mail.gmail.com
[5] https://www.postgresql.org/message-id/CAKuK5J3R-oBh%2B9f23Ko-0-gt5Zi1REgg7ng-awQuUsgiY2B7GQ%40mail.gmail.com
[6] https://github.com/torvalds/linux/commit/b71fc079b5d8f42b2a52743c8d2f1d35d655b1c5

Thanks,
-Ritu

On Thu, 16 Jan 2025 at 12:01, Andy Fan wrote:
> Hi,
>
> > c=1 && \
> > psql -c checkpoint -c 'select pg_switch_wal()' && \
> > pgbench -n -M prepared -c$c -j$c -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', 8192));";) -P1 -t 1
> >
> > wal_init_zero = 1: 885 TPS
> > wal_init_zero = 0: 286 TPS.
>
> Your theory looks clear and the result is promising. I can reproduce a
> similar result in my setup.
>
> on: tps = 1588.538378 (without initial connection time)
> off: tps = 857.755343 (without initial connection time)
>
> > Of course I chose this case to be intentionally extreme - each transaction
> > fills a bit more than one page of WAL and immediately flushes it. That
> > guarantees that each commit needs a separate filesystem metadata flush and a
> > flush of the data for the fdatasync() at commit.
>
> However if I increase the clients from 1 to 64 (this may break this extreme
> case because of group commit) then we can see the wal_init_zero caused a
> noticeable regression.
>
> c=64 && \
> psql -c checkpoint -c 'select pg_switch_wal()' && \
> pgbench -n -M prepared -c$c -j$c -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', 8192));";) -P1 -t 1
>
> off:
> tps = 12135.110730 (without initial connection time)
> tps = 11964.016277 (without initial connection time)
> tps = 12078.458724 (without initial connection time)
>
> on:
> tps = 9392.374563 (without initial connection time)
> tps = 9391.916410 (without initial connection time)
> tps = 9390.503777 (without initial connection time)
>
> Now the wal_init_zero happens on the user backend and other backends also
> need to wait for it, this looks not good to me. I find walwriter doesn't
> do much things, I'd like to have a try if we can offload wal_init_zero
> to the walwriter.
>
> About the wal_recycle, IIUC, it can only recycle a wal file during
> Checkpoint, but checkpoint doesn't happen often.
>
> --
> Best Regards
> Andy Fan
Re: Purpose of wal_init_zero
Hi,

> c=1 && \
> psql -c checkpoint -c 'select pg_switch_wal()' && \
> pgbench -n -M prepared -c$c -j$c -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', 8192));";) -P1 -t 1
>
> wal_init_zero = 1: 885 TPS
> wal_init_zero = 0: 286 TPS.

Your theory looks clear and the result is promising. I can reproduce a similar result in my setup.

on: tps = 1588.538378 (without initial connection time)
off: tps = 857.755343 (without initial connection time)

> Of course I chose this case to be intentionally extreme - each transaction
> fills a bit more than one page of WAL and immediately flushes it. That
> guarantees that each commit needs a separate filesystem metadata flush and a
> flush of the data for the fdatasync() at commit.

However, if I increase the clients from 1 to 64 (this may break this extreme case because of group commit), then we can see that wal_init_zero causes a noticeable regression.

c=64 && \
psql -c checkpoint -c 'select pg_switch_wal()' && \
pgbench -n -M prepared -c$c -j$c -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', 8192));";) -P1 -t 1

off:
tps = 12135.110730 (without initial connection time)
tps = 11964.016277 (without initial connection time)
tps = 12078.458724 (without initial connection time)

on:
tps = 9392.374563 (without initial connection time)
tps = 9391.916410 (without initial connection time)
tps = 9390.503777 (without initial connection time)

Now the wal_init_zero work happens in the user backend, and other backends also need to wait for it; this doesn't look good to me. I find that walwriter doesn't do much, so I'd like to try offloading wal_init_zero to the walwriter.

About wal_recycle: IIUC, it can only recycle a WAL file during a checkpoint, but checkpoints don't happen often.

--
Best Regards
Andy Fan
Re: Purpose of wal_init_zero
Hi,

On 2025-01-15 09:12:17, Andy Fan wrote:
> It is unclear to me why do we need wal_init_zero. Per comments:
>
> /*
>  * Zero-fill the file. With this setting, we do this the hard way to
>  * ensure that all the file space has really been allocated. On
>  * platforms that allow "holes" in files, just seeking to the end
>  * doesn't allocate intermediate space. This way, we know that we
>  * have all the space and (after the fsync below) that all the
>  * indirect blocks are down on disk. Therefore, fdatasync(2) or
>  * O_DSYNC will be sufficient to sync future writes to the log file.
>  */
>
> I can understand that "the file space has really been allocated", but
> why do we care about this?

Performance. If you create an empty segment by lseek'ing to the end (i.e. typically resulting in a large "hole" in the file that's not yet backed by storage) or you allocate it by calling fallocate() (allocates space, but doesn't write it), durable writes need to do more work.

The reason for the additional work is that you don't just need to write the new WAL contents and then flush the write cache, you will also (on most, but not all, filesystems) incur a filesystem metadata write. In case of the file-with-hole approach, the filesystem has to first allocate space to the file, journal the relevant metadata change, probably flush that change, then write the data, and then for the fdatasync() at COMMIT another cache flush is needed.

If your workload doesn't commit very often, compared to the rate of WAL generation, that will often be fine. E.g. if you do bulk data load. The added number of flushes isn't big. However, if your workload includes a lot of small WAL writes & flushes, the increased number of flushes can hurt rather badly.

If you have wal_recycle=true, this overhead will only be paid the first time a WAL segment is used, of course, not after recycling.

Here's an example: Storage is an older client-oriented NVMe SSD (SAMSUNG MZVLB1T0HBLR-000L7). To make it easier to ensure that the new WAL file case is tested, I turned wal_recycle off. To make the pattern of WAL easily repeatable, I'm using pg_logical_emit_message() to emit a WAL record that then needs to be flushed to disk, because I pass transactional true.

c=1 && \
psql -c checkpoint -c 'select pg_switch_wal()' && \
pgbench -n -M prepared -c$c -j$c -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', 8192));";) -P1 -t 1

wal_init_zero = 1: 885 TPS
wal_init_zero = 0: 286 TPS.

Of course I chose this case to be intentionally extreme - each transaction fills a bit more than one page of WAL and immediately flushes it. That guarantees that each commit needs a separate filesystem metadata flush and a flush of the data for the fdatasync() at commit.

If I instead emit huge WAL records / flush the WAL rarely, e.g. by passing 16*1024*1024 to repeat in the command above, the difference completely vanishes:

wal_init_zero = 1: 6.25
wal_init_zero = 0: 6.27

If anything the init_zero path is now slower, because the other path needs to do less work. The reason it doesn't hurt to have wal_init_zero disabled in this case is that the workload leads to huge WAL writes, which means the additional number of metadata flushes is very small.

Similarly, if the WAL writes/flushes are very small (say a single '0' in the test from above), there also won't be a benefit from wal_init_zero=1, because now most of the time we're just writing to the same WAL page as the previous transaction, which won't require filesystem metadata changes.
Note that not all filesystems can benefit from wal_init_zero=1, e.g. ZFS or BTRFS won't benefit, because they always allocate new disk space for each write. With the associated overheads. Greetings, Andres Freund
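To make the two configurations being compared above concrete, here is a simplified sketch of what each does to a new segment file, using plain POSIX calls; the real code in xlog.c is more involved.

#include <string.h>
#include <unistd.h>

/*
 * Roughly what wal_init_zero = off amounts to: make the file the right
 * length by writing a single byte at the end.  Filesystems that support
 * holes leave the intermediate blocks unallocated, so later WAL flushes
 * also have to allocate blocks and journal that metadata, not just flush
 * the data.
 */
static int
init_segment_with_hole(int fd, off_t seg_size)
{
    if (pwrite(fd, "\0", 1, seg_size - 1) != 1)
        return -1;
    return fsync(fd);
}

/*
 * Roughly what wal_init_zero = on amounts to: write zeroes over the entire
 * segment once, so all blocks are allocated and the metadata journaled up
 * front; later commits then only need fdatasync() of the data they wrote.
 */
static int
init_segment_with_zeroes(int fd, off_t seg_size)
{
    char        zeros[8192];

    memset(zeros, 0, sizeof(zeros));
    for (off_t off = 0; off < seg_size; off += (off_t) sizeof(zeros))
    {
        if (pwrite(fd, zeros, sizeof(zeros), off) != (ssize_t) sizeof(zeros))
            return -1;
    }
    return fsync(fd);
}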
Re: Purpose of wal_init_zero
Hi Michael, > My understanding was that if we have pre-allocated wal space (and > re-cycle already used wal files), we can still write wal records into > that pre-allocated space and still issue changes to data files as long > as we don't need to enlarge any. So an out-of-space situation is less > bad in that case than if we fail to write WAL with ENOSPC. True, but if we are out of space we are going to crash anyway because we can't allocate new heap pages, write regular (text) logfiles, update CLOG and other SLRU-based structures, etc. So if it is "less bad" the difference seems to be negligible - either way you crash. Unless I'm missing something of course. -- Best regards, Aleksander Alekseev
Re: Purpose of wal_init_zero
Hi,

On Wed, Jan 15, 2025 at 09:12:17AM, Andy Fan wrote:
> I can understand that "the file space has really been allocated", but
> why do we care about this?
>
> One reason I can think of is it has something with "out-of-disk-space"
> situation, even though what's the benefit of it since we can't do
> anything in such case anyway no matter the wal space is pre-allocated or
> not?

My understanding was that if we have pre-allocated wal space (and recycle already used wal files), we can still write wal records into that pre-allocated space and still issue changes to data files as long as we don't need to enlarge any. So an out-of-space situation is less bad in that case than if we fail to write WAL with ENOSPC.

Michael
Re: Purpose of wal_init_zero
Hi, > Good catch. This comment is not 100% clear to me either. > [...] TWIMC zero-filling was added in 33cc5d8a4d0d (year 2001). The commit doesn't reference the discussion behind this change and the comment text hasn't changed since then. -- Best regards, Aleksander Alekseev
Re: Purpose of wal_init_zero
Hi, >> I can understand that "the file space has really been allocated", but >> why do we care about this? >> >> [...] > > Can you report the benchmark difference with false (disabled)? > Maybe It's worth considering leaving false as the default. Good catch. This comment is not 100% clear to me either. I think that the actual idea of zero-filling *might* be to give the filesystem a hint to allocate sequential blocks for the file, which may be beneficial if WAL is stored on HDD (which arguably few people do in 2025). This GUC was added not a long time ago in 2019, see commit 475861b2615d and discussion [1]. I didn't read the entire thread (yet?) but it looks like the actual purpose of the GUC is to *disable* zero-filling on the filesystems with copy-on-write such as ZFS where it's useless. This is also said in the documentation [2]. Zero-filling was the default behaviour before the commit. However the documentation doesn't say why we can't just always use wal_init_zero = off, and perhaps get rid of this GUC altogether. In either case the referenced comment is misleading. [1]: https://postgr.es/m/CACPQ5Fo00QR7LNAcd1ZjgoBi4y97%2BK760YABs0vQHH5dLdkkMA%40mail.gmail.com [2]: https://www.postgresql.org/docs/current/runtime-config-wal.html -- Best regards, Aleksander Alekseev
Re: Purpose of wal_init_zero
Hi.

On Wed, Jan 15, 2025 at 06:12, Andy Fan wrote:
> Hi,
>
> It is unclear to me why do we need wal_init_zero. Per comments:
>
> /*
>  * Zero-fill the file. With this setting, we do this the hard way to
>  * ensure that all the file space has really been allocated. On
>  * platforms that allow "holes" in files, just seeking to the end
>  * doesn't allocate intermediate space. This way, we know that we
>  * have all the space and (after the fsync below) that all the
>  * indirect blocks are down on disk. Therefore, fdatasync(2) or
>  * O_DSYNC will be sufficient to sync future writes to the log file.
>  */
>
> I can understand that "the file space has really been allocated", but
> why do we care about this?
>
> One reason I can think of is it has something with "out-of-disk-space"
> situation, even though what's the benefit of it since we can't do
> anything in such case anyway no matter the wal space is pre-allocated or
> not?
>
> Another reason I can guess is it provides some performance gain in the
> future XLogWrite to that file. However in the normal case, the
> wal_init_zero is still under "LWLockAcquire(WALWriteLock,
> LW_EXCLUSIVE);" so it cost a lot at first. (more traffic to file system
> due to pg_pwrite_zeros and later fsync). I saw "Therefore, fdatasync(2)
> or O_DSYNC will be sufficient to sync future writes to the log file.",
> but it is still unclear to me.
>
> I noticed this during a benchmark, where WALWriteLock is waited and the
> holder is running WAIT_EVENT_WAL_INIT_WRITE.

Can you report the benchmark difference with false (disabled)? Maybe it's worth considering leaving false as the default.

best regards,
Ranier Vilela