Re: pg_combinebackup --copy-file-range
On 4/7/24 19:46, Tomas Vondra wrote:
> On 4/5/24 21:43, Tomas Vondra wrote:
>> Hi,
>>
>> ...
>>
>> 2) The prefetching is not a huge improvement, at least not for these
>> three filesystems (btrfs, ext4, xfs). From the color scale it might seem
>> like it helps, but those values are relative to the baseline, so when
>> the non-prefetching value is 5% and with prefetching 10%, that means the
>> prefetching makes it slower. And that's very often true.
>>
>> This is visible more clearly in prefetching.pdf, comparing the
>> non-prefetching and prefetching results for each patch, not to baseline.
>> That makes it quite clear there's a lot of "red" where prefetching
>> makes it slower. It certainly does help for larger increments (which
>> makes sense, because the modified blocks are distributed randomly, and
>> thus come from random files, making long streaks unlikely).
>>
>> I've imagined the prefetching could be made a bit smarter to ignore the
>> streaks (= sequential patterns), but once again - this only matters with
>> the batching, which we don't have. And without the batching it looks
>> like a net loss (that's the first column in the prefetching PDF).
>>
>> I did start thinking about prefetching because of ZFS, where it was
>> necessary to get decent performance. And that's still true. But (a) even
>> with the explicit prefetching it's still 2-3x slower than any of these
>> filesystems, so I assume performance-sensitive use cases won't use it.
>> And (b) the prefetching seems necessary in all cases, no matter how
>> large the increment is. Which goes directly against the idea of looking
>> at how random the blocks are and prefetching only the sufficiently
>> random patterns. That doesn't seem like a great thing.
>>
>
> I finally got more complete ZFS results, and I also decided to get
> some numbers without the ZFS tuning I did. And boy oh boy ...
> All the tests I did with ZFS were tuned the way I've seen recommended
> when using ZFS for PostgreSQL, that is
>
>   zfs set recordsize=8K logbias=throughput compression=none
>
> and this performed quite poorly - pg_combinebackup took 4-8x longer than
> with the traditional filesystems (btrfs, xfs, ext4) and the only thing
> that improved that measurably was prefetching.
>
> But once I reverted back to the default recordsize of 128kB the
> performance is way better - entirely comparable to ext4/xfs, while
> btrfs remains faster with --copy-file-range --no-manifest (by a factor
> of 2-3x).
>

I forgot to explicitly say that I think this confirms the decision not to push the patch adding the explicit prefetching to pg_combinebackup. It's not needed/beneficial even for ZFS, when using a suitable configuration.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: pg_combinebackup --copy-file-range
On 4/4/24 12:25, Jakub Wartak wrote:
> On Thu, Apr 4, 2024 at 12:56 AM Tomas Vondra wrote:
>>
>> Hi,
>>
>> Here's a much more polished and cleaned up version of the patches,
>> fixing all the issues I've been aware of, and with various parts merged
>> into much more cohesive parts (instead of keeping them separate to make
>> the changes/evolution more obvious).
>
> OK, so three runs of incrementalbackupstests - as stated earlier -
> also passed with OK for v20240403 (this time even with
> --enable-casserts)
>
> pg_combinebackup flags tested were:
> 1) --copy-file-range --manifest-checksums=CRC32C
> 2) --copy-file-range --manifest-checksums=NONE
> 3) default, no flags (no copy-file-range)
>

Thanks!

>> I changed how I think about this a bit - I don't really see the CoW copy
>> methods as necessarily faster than the regular copy (even though it can be
>> with (5)). I think the main benefits are the space savings, enabled by
>> patches (1)-(3). If (4) and (5) get in, that's a bonus, but even without
>> that I don't think the performance is an issue - everything has a cost.
>
> I take it differently: incremental backups without a CoW fs would be clearly:
> - inefficient in terms of RTO (SANs are often a key thing due to
> having the fast ability to "restore" the clone rather than copying the
> data from somewhere else)
> - pg_basebackup without that would be unusable without space savings
> (e.g. imagine daily backups @ 10+TB DWHs)
>

Right, although this very much depends on the backup scheme. If you only take incremental backups, and then also a full backup once in a while, the CoW stuff probably does not help much. The alignment (the only thing affecting basebackups) may allow deduplication, but that's all, I think.

If the scheme is more complex, and involves "merging" the increments into the full backup, then this does help a lot. It'd even be possible to cheaply clone instances this way, I think.
But I'm not sure how often people would do that on the same volume, to benefit from the CoW.

>> On 4/3/24 15:39, Jakub Wartak wrote:
>>> On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra wrote:
> [..]
>> Thanks. Would be great if you could run this on the attached version of
>> the patches, ideally for each of them independently, to make sure it
>> doesn't get broken+fixed somewhere on the way.
>
> Those are semi-manual test runs (~30 min? per run), the above results
> are for all of them applied at once. So my take is: if all of them work
> applied together, each one does individually too.
>

Cool, thanks.

> FWIW, I'm also testing your other offlist incremental backup
> corruption issue, but that doesn't seem to be related in any way to
> the copy_file_range() patches here.
>

Yes, that's entirely independent, happens with master too.

>>> 2) prefetch
>>> ---
> [..]
>>> I think this means we may need a "--prefetch" option, that'd force
>>> prefetching, probably both before pread and copy_file_range. Otherwise
>>> people on ZFS are doomed and will have poor performance.
>>>
>>> Right, we could optionally cover in the docs later-on various options
>>> to get the performance (on XFS use $this, but without $that and so
>>> on). It's kind of madness dealing with all those performance
>>> variations.
>>>
>>
>> Yeah, something like that. I'm not sure we want to talk too much about
>> individual filesystems in our docs, because those things evolve over
>> time too.
>
> Sounds like Wiki then.
>
> BTW, after a quick review: could we in 05 have something like a common
> value then (to keep those together via some .h?)
>
> #define BATCH_SIZE PREFETCH_TARGET ?
>

Yes, that's one of the things I'd like to refine a bit more. Making it more consistent / clearer that these things are interdependent.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: pg_combinebackup --copy-file-range
On Thu, Apr 4, 2024 at 12:56 AM Tomas Vondra wrote:
>
> Hi,
>
> Here's a much more polished and cleaned up version of the patches,
> fixing all the issues I've been aware of, and with various parts merged
> into much more cohesive parts (instead of keeping them separate to make
> the changes/evolution more obvious).

OK, so three runs of incrementalbackupstests - as stated earlier - also passed with OK for v20240403 (this time even with --enable-casserts)

pg_combinebackup flags tested were:
1) --copy-file-range --manifest-checksums=CRC32C
2) --copy-file-range --manifest-checksums=NONE
3) default, no flags (no copy-file-range)

> I changed how I think about this a bit - I don't really see the CoW copy
> methods as necessarily faster than the regular copy (even though it can be
> with (5)). I think the main benefits are the space savings, enabled by
> patches (1)-(3). If (4) and (5) get in, that's a bonus, but even without
> that I don't think the performance is an issue - everything has a cost.

I take it differently: incremental backups without a CoW fs would be clearly:
- inefficient in terms of RTO (SANs are often a key thing due to having the fast ability to "restore" the clone rather than copying the data from somewhere else)
- pg_basebackup without that would be unusable without space savings (e.g. imagine daily backups @ 10+TB DWHs)

> On 4/3/24 15:39, Jakub Wartak wrote:
> > On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra wrote:
[..]
> Thanks. Would be great if you could run this on the attached version of
> the patches, ideally for each of them independently, to make sure it
> doesn't get broken+fixed somewhere on the way.

Those are semi-manual test runs (~30 min? per run), the above results are for all of them applied at once. So my take is: if all of them work applied together, each one does individually too.

FWIW, I'm also testing your other offlist incremental backup corruption issue, but that doesn't seem to be related in any way to the copy_file_range() patches here.
> >> 2) prefetch
> >> ---
> > [..]
> >> I think this means we may need a "--prefetch" option, that'd force
> >> prefetching, probably both before pread and copy_file_range. Otherwise
> >> people on ZFS are doomed and will have poor performance.
> >
> > Right, we could optionally cover in the docs later-on various options
> > to get the performance (on XFS use $this, but without $that and so
> > on). It's kind of madness dealing with all those performance
> > variations.
> >
>
> Yeah, something like that. I'm not sure we want to talk too much about
> individual filesystems in our docs, because those things evolve over
> time too.

Sounds like Wiki then.

BTW, after a quick review: could we in 05 have something like a common value then (to keep those together via some .h?)

#define BATCH_SIZE PREFETCH_TARGET ?

-J.
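For illustration, tying the two constants together could look something like the sketch below (hypothetical header name and values; the actual definitions live in the 0005 patch):

```c
/* reconstruct_private.h - hypothetical shared header (illustrative only) */
#ifndef RECONSTRUCT_PRIVATE_H
#define RECONSTRUCT_PRIVATE_H

/* number of blocks prefetched ahead of the current read position */
#define PREFETCH_TARGET 128

/*
 * Number of blocks coalesced into a single write/copy_file_range call.
 * Defined in terms of PREFETCH_TARGET so the prefetch distance always
 * covers a full batch, keeping the two knobs interdependent by design.
 */
#define BATCH_SIZE PREFETCH_TARGET

#endif							/* RECONSTRUCT_PRIVATE_H */
```

Keeping both in one header makes the interdependence explicit, so one can't be tuned without seeing the other.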
Re: pg_combinebackup --copy-file-range
Hi,

Here's a much more polished and cleaned up version of the patches, fixing all the issues I've been aware of, and with various parts merged into much more cohesive parts (instead of keeping them separate to make the changes/evolution more obvious).

I decided to reorder the changes like this:

1) block alignment - As I said earlier, I think we should definitely do this, even if only to make future improvements possible. After chatting about this with Robert off-list a bit, he pointed out I actually forgot to not align the headers for files without any blocks, so this version fixes that.

2) add clone/copy_file_range for the case that copies whole files. This is pretty simple, but it also adds the --clone/copy-file-range options, and similar infrastructure. The one slightly annoying bit is that we now have the ifdef stuff in two places - when parsing the options, and then in the copy_file_XXX methods, and the pg_fatal() calls should not be reachable in practice. But that doesn't seem harmful, and might be a useful protection against someone calling a function that does nothing.

This also merges the original two parts, where the first one only did this cloning/CoW stuff when a checksum did not need to be calculated, and the second extended it to those cases too (by also reading the data, but still doing the copy the old way). I think this is the right way - if that's not desirable, it's easy to either add --no-manifest or not use the CoW options. Or just not change the checksum type. There's no other way.

3) add copy_file_range to write_reconstructed_file, by using roughly the same logic/reasoning as (2). If --copy-file-range is specified and a checksum should be calculated, the data is read for the checksum, but the copy is done using copy_file_range.
I did rework the write_reconstructed_file() flow a bit, because tracking what exactly needs to be read/written in each of the cases (which copy method, zeroed block, checksum calculated) made the flow really difficult to follow. Instead I introduced functions to read/write a block, and call them from different places. I think this is much better, and it also makes the following part dealing with batches of blocks a much easier / smaller change.

4) prefetching - This is mostly unrelated to the CoW stuff, but it has tremendous benefits, especially for ZFS. I haven't been able to tune ZFS to get decent performance, and ISTM it'd be rather unusable for backup purposes without this.

5) batching in write_reconstructed_file - This benefits especially the copy_file_range case, where I've seen it yield massive speedups (see the message from Monday for better data).

6) batching for prefetch - Similar logic to (5), but for fadvise. This is the only part where I'm not entirely sure whether it actually helps or not. Needs some more analysis, but I'm including it for completeness.

I do think the parts (1)-(4) are in pretty good shape, good enough to make them committable in a day or two. I see it mostly as a matter of testing and comment improvements rather than code changes.

(5) is in pretty good shape too, but I'd like to maybe simplify and refine the write_reconstructed_file changes a bit more. I don't think it's broken, but it feels a bit too cumbersome.

Not sure about (6) yet.

I changed how I think about this a bit - I don't really see the CoW copy methods as necessarily faster than the regular copy (even though it can be with (5)). I think the main benefits are the space savings, enabled by patches (1)-(3). If (4) and (5) get in, that's a bonus, but even without that I don't think the performance is an issue - everything has a cost.
On 4/3/24 15:39, Jakub Wartak wrote:
> On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra wrote:
>>
>> Hi,
>>
>> I've been running some benchmarks and experimenting with various stuff,
>> trying to improve the poor performance on ZFS, and the regression on XFS
>> when using copy_file_range. And oh boy, did I find interesting stuff ...
>
> [..]
>
> Congratulations on great results!
>
>> 4) after each pg_combinebackup run to pg_verifybackup, start the cluster
>> to finish recovery, run pg_checksums --check (to check the patches don't
>> produce something broken)
>
> I've performed some follow-up small testing on all patches mentioned
> here (1..7), with the earlier developed nano-incremental-backup-tests
> that helped detect some issues for Robert earlier during original
> development. They all went fine in both cases:
> - no special options when using pg_combinebackup
> - using pg_combinebackup --copy-file-range --manifest-checksums=NONE
>
> Those were:
> test_across_wallevelminimal.sh
> test_full_pri__incr_stby__restore_on_pri.sh
> test_full_pri__incr_stby__restore_on_stby.sh
> test_full_stby__incr_stby__restore_on_pri.sh
> test_full_stby__incr_stby__restore_on_stby.sh
> test_
Re: pg_combinebackup --copy-file-range
On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra wrote:
>
> Hi,
>
> I've been running some benchmarks and experimenting with various stuff,
> trying to improve the poor performance on ZFS, and the regression on XFS
> when using copy_file_range. And oh boy, did I find interesting stuff ...

[..]

Congratulations on great results!

> 4) after each pg_combinebackup run to pg_verifybackup, start the cluster
> to finish recovery, run pg_checksums --check (to check the patches don't
> produce something broken)

I've performed some follow-up small testing on all patches mentioned here (1..7), with the earlier developed nano-incremental-backup-tests that helped detect some issues for Robert earlier during original development. They all went fine in both cases:
- no special options when using pg_combinebackup
- using pg_combinebackup --copy-file-range --manifest-checksums=NONE

Those were:
test_across_wallevelminimal.sh
test_full_pri__incr_stby__restore_on_pri.sh
test_full_pri__incr_stby__restore_on_stby.sh
test_full_stby__incr_stby__restore_on_pri.sh
test_full_stby__incr_stby__restore_on_stby.sh
test_incr_after_timelineincrease.sh
test_incr_on_standby_after_promote.sh
test_many_incrementals_dbcreate_duplicateOID.sh
test_many_incrementals_dbcreate_filecopy_NOINCR.sh
test_many_incrementals_dbcreate_filecopy.sh
test_many_incrementals_dbcreate.sh
test_many_incrementals.sh
test_multixact.sh
test_pending_2pc.sh
test_reindex_and_vacuum_full.sh
test_repro_assert_RP.sh
test_repro_assert.sh
test_standby_incr_just_backup.sh
test_stuck_walsum.sh
test_truncaterollback.sh
test_unlogged_table.sh

> Now to the findings
>
>
> 1) block alignment

[..]

> And I think we probably want to do this now, because this affects all
> tools dealing with incremental backups - even if someone writes a custom
> version of pg_combinebackup, it will have to deal with misaligned data.
> Perhaps there might be something like pg_basebackup that "transforms"
> the data received from the server (and also the backup manifest), but
> that does not seem like a great direction.

If anything is on the table, then I think in the far future pg_refresh_standby_using_incremental_backup_from_primary would be the only other tool using the format?

> 2) prefetch
> ---
[..]
> I think this means we may need a "--prefetch" option, that'd force
> prefetching, probably both before pread and copy_file_range. Otherwise
> people on ZFS are doomed and will have poor performance.

Right, we could optionally cover in the docs later on various options to get the performance (on XFS use $this, but without $that and so on). It's kind of madness dealing with all those performance variations.

Another idea: remove that 128 posix_fadvise() hardcode in 0002 and add a getopt variant like:

  --prefetch[=HOWMANY]

with 128 as the default?

-J.
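A sketch of what that option could look like with getopt_long() and an optional_argument (entirely hypothetical - not the actual pg_combinebackup option table; note that with optional_argument the value has to be attached as --prefetch=N):

```c
#include <getopt.h>
#include <stdlib.h>

#define DEFAULT_PREFETCH_BLOCKS 128	/* the current hardcoded value in 0002 */

/*
 * Parse a hypothetical --prefetch[=HOWMANY] option. Returns 0 when the
 * option is absent (prefetching disabled), HOWMANY when a value is
 * given, and the default of 128 for a bare --prefetch.
 */
static int
parse_prefetch(int argc, char **argv)
{
	static struct option long_opts[] = {
		{"prefetch", optional_argument, NULL, 'p'},
		{NULL, 0, NULL, 0}
	};
	int			c;
	int			prefetch = 0;	/* disabled unless --prefetch is given */

	optind = 0;					/* glibc extension: force reinitialization */
	while ((c = getopt_long(argc, argv, "", long_opts, NULL)) != -1)
	{
		if (c == 'p')
			prefetch = optarg ? atoi(optarg) : DEFAULT_PREFETCH_BLOCKS;
	}
	return prefetch;
}
```

So `--prefetch` would enable the 128-block default, `--prefetch=64` would halve it, and omitting the option keeps prefetching off.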
Re: pg_combinebackup --copy-file-range
On 4/1/24 23:45, Thomas Munro wrote:
> ...
>>
>> I was very puzzled by the awful performance on ZFS. When every other fs
>> (EXT4/XFS/BTRFS) took 150-200 seconds to run pg_combinebackup, it took
>> 900-1000 seconds on ZFS, no matter what I did. I tried all the tuning
>> advice I could think of, with almost no effect.
>>
>> Ultimately I decided that it probably is the "no readahead" behavior
>> I've observed on ZFS. I assume it's because it doesn't use the page
>> cache where the regular readahead is detected etc. And there's no
>> prefetching in pg_combinebackup, so I decided to do an experiment and
>> added a trivial explicit prefetch when reconstructing the file - every
>> time we'd read data from a file, we do posix_fadvise for up to 128
>> blocks ahead (similar to what the bitmap heap scan code does). See 0002.
>>
>> And tadaaa - the duration dropped from 900-1000 seconds to only about
>> 250-300 seconds, so an improvement of a factor of 3-4x. I think this is
>> pretty massive.
>
> Interesting. ZFS certainly has its own prefetching heuristics with
> lots of logic and settings, but it could be that it's using
> strict-next-block detection of access pattern (ie what I called
> effective_io_readahead_window=0 in the streaming I/O thread) instead
> of a window (ie like the Linux block device level read ahead where,
> AFAIK, if you access anything in that sliding window it is triggered),
> and perhaps your test has a lot of non-contiguous but close-enough
> blocks? (Also reminds me of the similar discussion on the BHS thread
> about distinguishing sequential access from
> mostly-sequential-but-with-lots-of-holes-like-Swiss-cheese, and the
> fine line between them.)
>

I don't think the files have a lot of non-contiguous but close-enough blocks (it's rather that we'd skip blocks that need to come from a later incremental file). The backups are generated to have a certain fraction of modified blocks.
For example, the smallest backup has 1% of blocks modified, which means 99% of blocks come from the base backup and 1% comes from the increment. And indeed, the whole database is ~75GB and the backup is ~740MB. Which means that on average there will be runs of 99 blocks in the base backup, then a skip of 1 block (to come from the increment), and then again 99-1-99-1. So it's very sequential, almost no holes, and the increment is 100% sequential. And it still does not seem to prefetch anything.

> You could double-check this and related settings (for example I think
> it might disable itself automatically if you're on a VM with small RAM
> size):
>
> https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-prefetch-disable
>

I haven't touched that parameter at all, and it's "enabled" by default:

  # cat /sys/module/zfs/parameters/zfs_prefetch_disable
  0

While trying to make the built-in prefetch work I reviewed the other parameters with the "prefetch" tag, without success. And I haven't seen any advice on how to make it work ...

>> There's a couple more interesting ZFS details - the prefetching seems to
>> be necessary even when using copy_file_range() and we don't need to read
>> the data (to calculate checksums). This is why the "manifest=off" chart
>> has the strange group of high bars at the end - the copy cases are fast
>> because prefetch happens, but if we switch to copy_file_range() there
>> are no prefetches and it gets slow.
>
> Hmm, at a guess, it might be due to prefetching the dnode (root object
> for a file) and block pointers, ie the structure but not the data
> itself.
>

Yeah, that's possible. But the effects are the same - it doesn't matter what exactly is not prefetched. But perhaps we could prefetch just a tiny part of the record, enough to prefetch the dnode+pointers, not the whole record. Might save some space in ARC, perhaps?
>> This is a bit bizarre, especially because the manifest=on cases are
>> still fast, exactly because the pread + prefetching still happens. I'm
>> sure users would find this puzzling.
>>
>> Unfortunately, the prefetching is not beneficial for all filesystems.
>> For XFS it does not seem to make any difference, but on BTRFS it seems
>> to cause a regression.
>>
>> I think this means we may need a "--prefetch" option, that'd force
>> prefetching, probably both before pread and copy_file_range. Otherwise
>> people on ZFS are doomed and will have poor performance.
>
> Seems reasonable if you can't fix it by tuning ZFS. (Might also be an
> interesting research topic for a potential ZFS patch:
> prefetch_swiss_cheese_window_size. I will not be nerd-sniped into
> reading the relevant source today, but I'll figure it out soonish...)
>

It's entirely possible I'm just too stupid and it works just fine for everyone else. But maybe not, and I'd say an implementation that is this difficult to configure is almost as if it didn't exist at all. The Linux read-ahead works pretty well by default. So I don't see how to make this work without explicit prefetch ... Of
Re: pg_combinebackup --copy-file-range
On Tue, Apr 2, 2024 at 8:43 AM Tomas Vondra wrote:
> And I think he's right, and my tests confirm this. I did a trivial patch
> to align the blocks to 8K boundary, by forcing the header to be a
> multiple of 8K (I think 4K alignment would be enough). See the 0001
> patch that does this.
>
> And if I measure the disk space used by pg_combinebackup, and compare
> the results with results without the patch ([1] from a couple days
> back), I see this:
>
>     pct    not aligned    aligned
>   ---------------------------------
>      1%           689M        19M
>     10%          3172M        22M
>     20%         13797M        27M
>
> Yes, those numbers are correct. I didn't believe this at first, but the
> backups are valid/verified, checksums are OK, etc. BTRFS has similar
> numbers (e.g. drop from 20GB to 600MB).

Fantastic.

> I think we absolutely need to align the blocks in the incremental files,
> and I think we should do that now. I think 8K would work, but maybe we
> should add an alignment parameter to basebackup & manifest?
>
> The reason why I think maybe this should be a basebackup parameter is
> the recent discussion about large fs blocks - it seems to be in the
> works, so maybe better to be ready and not assume all fs have 4K.
>
> And I think we probably want to do this now, because this affects all
> tools dealing with incremental backups - even if someone writes a custom
> version of pg_combinebackup, it will have to deal with misaligned data.
> Perhaps there might be something like pg_basebackup that "transforms"
> the data received from the server (and also the backup manifest), but
> that does not seem like a great direction.

+1, and I think BLCKSZ is the right choice.

> I was very puzzled by the awful performance on ZFS. When every other fs
> (EXT4/XFS/BTRFS) took 150-200 seconds to run pg_combinebackup, it took
> 900-1000 seconds on ZFS, no matter what I did. I tried all the tuning
> advice I could think of, with almost no effect.
>
> Ultimately I decided that it probably is the "no readahead" behavior
> I've observed on ZFS.
> I assume it's because it doesn't use the page
> cache where the regular readahead is detected etc. And there's no
> prefetching in pg_combinebackup, so I decided to do an experiment and
> added a trivial explicit prefetch when reconstructing the file - every
> time we'd read data from a file, we do posix_fadvise for up to 128
> blocks ahead (similar to what the bitmap heap scan code does). See 0002.
>
> And tadaaa - the duration dropped from 900-1000 seconds to only about
> 250-300 seconds, so an improvement of a factor of 3-4x. I think this is
> pretty massive.

Interesting. ZFS certainly has its own prefetching heuristics with lots of logic and settings, but it could be that it's using strict-next-block detection of access pattern (ie what I called effective_io_readahead_window=0 in the streaming I/O thread) instead of a window (ie like the Linux block device level read ahead where, AFAIK, if you access anything in that sliding window it is triggered), and perhaps your test has a lot of non-contiguous but close-enough blocks? (Also reminds me of the similar discussion on the BHS thread about distinguishing sequential access from mostly-sequential-but-with-lots-of-holes-like-Swiss-cheese, and the fine line between them.)

You could double-check this and related settings (for example I think it might disable itself automatically if you're on a VM with small RAM size):

https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-prefetch-disable

> There's a couple more interesting ZFS details - the prefetching seems to
> be necessary even when using copy_file_range() and we don't need to read
> the data (to calculate checksums). This is why the "manifest=off" chart
> has the strange group of high bars at the end - the copy cases are fast
> because prefetch happens, but if we switch to copy_file_range() there
> are no prefetches and it gets slow.
Hmm, at a guess, it might be due to prefetching the dnode (root object for a file) and block pointers, ie the structure but not the data itself. > This is a bit bizarre, especially because the manifest=on cases are > still fast, exactly because the pread + prefetching still happens. I'm > sure users would find this puzzling. > > Unfortunately, the prefetching is not beneficial for all filesystems. > For XFS it does not seem to make any difference, but on BTRFS it seems > to cause a regression. > > I think this means we may need a "--prefetch" option, that'd force > prefetching, probably both before pread and copy_file_range. Otherwise > people on ZFS are doomed and will have poor performance. Seems reasonable if you can't fix it by tuning ZFS. (Might also be an interesting research topic for a potential ZFS patch: prefetch_swiss_cheese_window_size. I will not be nerd-sniped into reading the relevant source today, but I'll figure it out soonish...) > So I took a stab
Re: pg_combinebackup --copy-file-range
On 3/31/24 06:46, Thomas Munro wrote: > On Sun, Mar 31, 2024 at 5:33 PM Tomas Vondra > wrote: >> I'm on 2.2.2 (on Linux). But there's something wrong, because the >> pg_combinebackup that took ~150s on xfs/btrfs, takes ~900s on ZFS. >> >> I'm not sure it's a ZFS config issue, though, because it's not CPU or >> I/O bound, and I see this on both machines. And some simple dd tests >> show the zpool can do 10x the throughput. Could this be due to the file >> header / pool alignment? > > Could ZFS recordsize > 8kB be making it worse, repeatedly dealing with > the same 128kB record as you copy_file_range 16 x 8kB blocks? > (Guessing you might be using the default recordsize?) > No, I reduced the record size to 8kB. And the pgbench init takes about the same as on other filesystems on this hardware, I think. ~10 minutes for scale 5000. >> I admit I'm not very familiar with the format, but you're probably right >> there's a header, and header_length does not seem to consider alignment. >> make_incremental_rfile simply does this: >> >> /* Remember length of header. */ >> rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) + >> sizeof(rf->truncation_block_length) + >> sizeof(BlockNumber) * rf->num_blocks; >> >> and sendFile() does the same thing when creating incremental basebackup. >> I guess it wouldn't be too difficult to make sure to align this to >> BLCKSZ or something like this. I wonder if the file format is documented >> somewhere ... It'd certainly be nicer to tweak before v18, if necessary. >> >> Anyway, is that really a problem? I mean, in my tests the CoW stuff >> seemed to work quite fine - at least on the XFS/BTRFS. Although, maybe >> that's why it took longer on XFS ... > > Yeah I'm not sure, I assume it did more allocating and copying because > of that. It doesn't matter and it would be fine if a first version > weren't as good as possible, and fine if we tune the format later once > we know more, ie leaving improvements on the table. 
I just wanted to > share the observation. I wouldn't be surprised if the block-at-a-time > coding makes it slower and maybe makes the on disk data structures > worse, but I dunno I'm just guessing. > > It's also interesting but not required to figure out how to tune ZFS > well for this purpose right now... No idea. Any idea if there's some good ZFS statistics to check? -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pg_combinebackup --copy-file-range
On Sun, Mar 31, 2024 at 5:33 PM Tomas Vondra wrote: > I'm on 2.2.2 (on Linux). But there's something wrong, because the > pg_combinebackup that took ~150s on xfs/btrfs, takes ~900s on ZFS. > > I'm not sure it's a ZFS config issue, though, because it's not CPU or > I/O bound, and I see this on both machines. And some simple dd tests > show the zpool can do 10x the throughput. Could this be due to the file > header / pool alignment? Could ZFS recordsize > 8kB be making it worse, repeatedly dealing with the same 128kB record as you copy_file_range 16 x 8kB blocks? (Guessing you might be using the default recordsize?) > I admit I'm not very familiar with the format, but you're probably right > there's a header, and header_length does not seem to consider alignment. > make_incremental_rfile simply does this: > > /* Remember length of header. */ > rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) + > sizeof(rf->truncation_block_length) + > sizeof(BlockNumber) * rf->num_blocks; > > and sendFile() does the same thing when creating incremental basebackup. > I guess it wouldn't be too difficult to make sure to align this to > BLCKSZ or something like this. I wonder if the file format is documented > somewhere ... It'd certainly be nicer to tweak before v18, if necessary. > > Anyway, is that really a problem? I mean, in my tests the CoW stuff > seemed to work quite fine - at least on the XFS/BTRFS. Although, maybe > that's why it took longer on XFS ... Yeah I'm not sure, I assume it did more allocating and copying because of that. It doesn't matter and it would be fine if a first version weren't as good as possible, and fine if we tune the format later once we know more, ie leaving improvements on the table. I just wanted to share the observation. I wouldn't be surprised if the block-at-a-time coding makes it slower and maybe makes the on disk data structures worse, but I dunno I'm just guessing. 
It's also interesting but not required to figure out how to tune ZFS well for this purpose right now...
Re: pg_combinebackup --copy-file-range
On 3/31/24 03:03, Thomas Munro wrote: > On Sun, Mar 31, 2024 at 1:37 PM Tomas Vondra > wrote: >> So I decided to take a stab at Thomas' idea, i.e. reading the data to >> ... >> I'll see how this works on EXT4/ZFS next ... > > Wow, very cool! A couple of very quick thoughts/notes: > > ZFS: the open source version only gained per-file block cloning in > 2.2, so if you're on an older release I expect copy_file_range() to > work but not really do the magic thing. On the FreeBSD version you > also have to turn cloning on with a sysctl because people were worried > about bugs in early versions so by default you still get actual > copying, not sure if you need something like that on the Linux > version... (Obviously ZFS is always completely COW-based, but before > the new block cloning stuff it could only share blocks by its own > magic deduplication if enabled, or by cloning/snapshotting a whole > dataset/mountpoint; there wasn't a way to control it explicitly like > this.) > I'm on 2.2.2 (on Linux). But there's something wrong, because the pg_combinebackup that took ~150s on xfs/btrfs, takes ~900s on ZFS. I'm not sure it's a ZFS config issue, though, because it's not CPU or I/O bound, and I see this on both machines. And some simple dd tests show the zpool can do 10x the throughput. Could this be due to the file header / pool alignment? > Alignment: block sharing on any fs requires it. I haven't re-checked > recently but IIRC the incremental file format might have a > non-block-sized header? That means that if you copy_file_range() from > both the older backup and also the incremental backup, only the former > will share blocks, and the latter will silently be done by copying to > newly allocated blocks. If that's still true, I'm not sure how hard > it would be to tweak the format to inject some padding and to make > sure that there isn't any extra header before each block. 
I admit I'm not very familiar with the format, but you're probably right there's a header, and header_length does not seem to consider alignment. make_incremental_rfile simply does this:

    /* Remember length of header. */
    rf->header_length = sizeof(magic) +
        sizeof(rf->num_blocks) +
        sizeof(rf->truncation_block_length) +
        sizeof(BlockNumber) * rf->num_blocks;

and sendFile() does the same thing when creating an incremental basebackup. I guess it wouldn't be too difficult to make sure to align this to BLCKSZ or something like this. I wonder if the file format is documented somewhere ... It'd certainly be nicer to tweak before v18, if necessary.

Anyway, is that really a problem? I mean, in my tests the CoW stuff seemed to work quite fine - at least on XFS/BTRFS. Although, maybe that's why it took longer on XFS ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
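For illustration, the alignment fix discussed above could be as simple as rounding the computed header length up to the next BLCKSZ boundary and padding with zeroes. Here is a minimal, hedged sketch using standalone stand-ins for the real fields (BLCKSZ assumed to be 8192 and all header fields assumed 4 bytes wide, matching uint32/BlockNumber); the names are illustrative, not the actual PostgreSQL code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192

/*
 * Sketch: compute the incremental-file header length, padded up to the
 * next BLCKSZ boundary, so the block data that follows starts on a block
 * boundary and copy_file_range() can share whole blocks.  Mirrors the
 * fields the thread mentions (magic, num_blocks, truncation_block_length,
 * and one 4-byte block number per block).
 */
static size_t
aligned_header_length(uint32_t num_blocks)
{
    size_t  raw = sizeof(uint32_t)      /* magic */
                + sizeof(uint32_t)      /* num_blocks */
                + sizeof(uint32_t)      /* truncation_block_length */
                + sizeof(uint32_t) * (size_t) num_blocks;   /* block numbers */

    /* Round up to a multiple of BLCKSZ; the gap would be zero-padded. */
    return (raw + BLCKSZ - 1) / BLCKSZ * BLCKSZ;
}
```

With 4-byte fields, up to 2045 block numbers still fit in a single 8 kB header block; one more pushes the header to two blocks.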
Re: pg_combinebackup --copy-file-range
+ wb = copy_file_range(s->fd, &offsetmap[i], wfd, NULL, BLCKSZ, 0);

Can you collect adjacent blocks in one multi-block call? And then I think the contract is that you need to loop if it returns short.
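A rough sketch of that batching idea: scan the per-block source-offset map for runs of physically adjacent blocks, and issue one copy_file_range() of run * BLCKSZ bytes per run instead of one call per block. The offsetmap name and layout here are assumptions for illustration, not the actual pg_combinebackup code:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

#define BLCKSZ 8192

/*
 * Sketch only: count how many consecutive output blocks, starting at
 * "start", are also physically adjacent in the source file.  One
 * copy_file_range() of run * BLCKSZ bytes could then replace "run"
 * single-block calls (still looping on short returns, per the contract
 * mentioned above).
 */
static size_t
adjacent_run_length(const off_t *offsetmap, size_t nblocks, size_t start)
{
    size_t  run = 1;

    while (start + run < nblocks &&
           offsetmap[start + run] == offsetmap[start + run - 1] + BLCKSZ)
        run++;
    return run;
}

/* Example offset map: blocks 0-2 are adjacent, blocks 3-4 are adjacent. */
static const off_t demo_offsetmap[] = {0, 8192, 16384, 40960, 49152};
```

The caller would advance by the returned run length after each merged copy, so a fully sequential file collapses into a single syscall.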
Re: pg_combinebackup --copy-file-range
On Sun, Mar 31, 2024 at 1:37 PM Tomas Vondra wrote: > So I decided to take a stab at Thomas' idea, i.e. reading the data to > ... > I'll see how this works on EXT4/ZFS next ... Wow, very cool! A couple of very quick thoughts/notes: ZFS: the open source version only gained per-file block cloning in 2.2, so if you're on an older release I expect copy_file_range() to work but not really do the magic thing. On the FreeBSD version you also have to turn cloning on with a sysctl because people were worried about bugs in early versions so by default you still get actual copying, not sure if you need something like that on the Linux version... (Obviously ZFS is always completely COW-based, but before the new block cloning stuff it could only share blocks by its own magic deduplication if enabled, or by cloning/snapshotting a whole dataset/mountpoint; there wasn't a way to control it explicitly like this.) Alignment: block sharing on any fs requires it. I haven't re-checked recently but IIRC the incremental file format might have a non-block-sized header? That means that if you copy_file_range() from both the older backup and also the incremental backup, only the former will share blocks, and the latter will silently be done by copying to newly allocated blocks. If that's still true, I'm not sure how hard it would be to tweak the format to inject some padding and to make sure that there isn't any extra header before each block.
Re: pg_combinebackup --copy-file-range
On Tue, Mar 26, 2024 at 2:03 PM Tomas Vondra wrote:
> [ new patches ]

Tomas, thanks for pointing me to this email; as you speculated, gmail breaks threading if the subject line changes.

The documentation still needs work here:

- It refers to --link mode, which is not a thing.
- It should talk about the fact that in some cases block-by-block copying will still be needed, possibly mentioning that it specifically happens when the old backup manifest is not available or does not contain checksums or does not contain checksums of the right type, or maybe just being a bit vague.

In copy_file.c:

- You added an unnecessary blank line to the beginning of a comment block.
- You could keep the strategy_implementation variable here. I don't think that's 100% necessary, but now that you've got the code structured this way, there's no compelling reason to remove it.
- I don't know what the +/* XXX maybe this should do the check internally, same as the other functions? */ comment is talking about.
- Maybe these functions should have header comments.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: pg_combinebackup --copy-file-range
On Tue, Mar 26, 2024 at 7:03 PM Tomas Vondra wrote:
[..]
>
> That's really strange.

Hi Tomas, but it looks like it's fixed now :)

> > --manifest-checksums=NONE --copy-file-range without v20240323-2-0002:
> > 27m23.887s
> > --manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and
> > loop-fix: 5m1.986s but it creates corruption as it stands
> >
> Thanks. I plan to do more similar tests, once my machines get done with
> some other stuff.

Please do so, as I do not trust my fingers :-)

> > Issues:
> >
> > 1. https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327
> > complains about win32/mingw:
> > [..]
> >
> Yup, missing break.

Now it's https://cirrus-ci.com/task/4997185324974080?logs=headers_headerscheck#L10 , reproducible with "make -s headerscheck EXTRAFLAGS='-fmax-errors=10'":

/tmp/postgres/src/bin/pg_combinebackup/reconstruct.h:34:91: error: unknown type name ‘CopyMode’ (at "CopyMode copy_mode);")

To me it looks like reconstruct.h needs the definition of CopyMode, i.e. a #include "copy_file.h".

> > 2. I do not know what's the consensus between --clone and
> > --copy-file-range, but if we have #ifdef FICLONE clone_works() #elif
> > HAVE_COPY_FILE_RANGE copy_file_range_only_works() then we should also
> > apply the same logic to the --help so that --clone is not visible
> > there (for consistency?). Also the "error: file cloning not supported
> > on this platform" is technically incorrect, Linux does support
> > ioctl(FICLONE) and copy_file_range(), but we are just not choosing one
> > over another (so technically it is "available"). Nitpicking I know.
> >
> That's a good question, I'm not sure. But whatever we do, we should do
> the same thing in pg_upgrade. Maybe there's some sort of precedent?

Sigh, you are right... It's consistency hell.

> > 3. [v20240323-2-0002-write_reconstructed_file.patch]: The mentioned
> > ENOSPC spiral-of-death-bug symptoms are like that: [..]
>
> Yeah, that retry logic is wrong.
> I ended up copying the check from the "regular" copy branch, which
> simply bails out if copy_file_range returns anything but the expected
> 8192.
>
> I wonder if this should deal with partial writes, though. I mean,
> copy_file_range() is allowed to copy only some of the bytes - I don't
> know how often / in what situations that happens, though ... And if we
> want to handle that for copy_file_range(), pwrite() needs the same
> treatment.

Maybe that helps? https://github.com/coreutils/coreutils/blob/606f54d157c3d9d558bdbe41da8d108993d86aeb/src/copy.c#L1427 , it's harder than I anticipated (we can ignore the sparse logic though, I think).

> > 3. ... but on startup I've got this after trying psql login: invalid
> > page in block 0 of relation base/5/1259 . [..]
>
> Can you see if you can still reproduce this with the attached version?

Looks like it's fixed now and it works great (~3min, multiple times)!

BTW: I've also tried it over NFSv4 over loopback with XFS, as copy_file_range() probably does server-side optimization, but somehow it was so slow there that it was close to being unusable (~9GB out of 104GB reconstructed after 45 mins) - maybe it's due to NFS mount opts, I don't think we should worry too much. I think it's related to missing the below optimization, if that matters. I think it's too early to warn users about NFS (I've spent just 10 mins on it), but on the other hand people might complain it's broken...

> > Apparently there's no merging of adjacent I/Os, so pg_combinebackup
> > wastes lots of time issuing many small syscalls instead, when it could
> > do, say, a single pread/pwrite (or even copy_file_range()). I think
> > it was not evident in my earlier testing (200GB; 39min vs ~40s) as I
> > had much smaller modifications in my incremental (think of 99% of
> > static data).
> >
> Yes, I've been thinking about exactly this optimization, but I think
> we're way past proposing this for PG17.
> The changes that would require in reconstruct_from_incremental_file are
> way too significant. Has to wait for PG18 ;-)

Sure thing!

> I do think there's more on the table, as mentioned by Thomas a couple
> days ago - maybe we shouldn't approach clone/copy_file_range merely as
> an optimization to save time, it might be entirely reasonable to do this
> simply to allow the filesystem to do CoW magic and save space (even if
> we need to read the data and recalculate the checksum, which now
> disables these copy methods).

Sure! I think time will still be a priority though, as pg_combinebackup duration impacts RTO while disk space is relatively cheap. One could argue that reconstructing 50TB will be a challenge though.

Now my tests indicate space saving is already happening with the 0002 patch - the 100GB DB / full backup stats look like that (so I think we are good when using CoW, not so without CoW), or did I misunderstand something?:

root@jw-test-1:/backups# du -sm /backups/
214612  /backups/
root@jw-test-1:/backups# du -sm *
106831
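The partial-copy handling discussed in this exchange boils down to one invariant: advance by however many bytes the call actually transferred, never by the requested length (getting that wrong is the ENOSPC spiral described upthread). A hedged sketch, with a callback standing in for copy_file_range()/pwrite() so the loop logic is visible without real file descriptors; this is illustrative, not the pg_combinebackup implementation:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Stand-in for copy_file_range()/pwrite(): copies up to len bytes at
 * offset, returns the number actually copied, or -1 on error. */
typedef ssize_t (*copy_fn) (off_t offset, size_t len);

/*
 * Sketch of the retry loop partial copies need: keep calling until the
 * whole length is transferred, advancing by what was actually copied.
 */
static int
copy_fully(copy_fn copy, off_t offset, size_t len)
{
    while (len > 0)
    {
        ssize_t copied = copy(offset, len);

        if (copied < 0)
            return -1;          /* caller inspects errno */
        if (copied == 0)
            return -1;          /* unexpected EOF: avoid looping forever */
        offset += copied;
        len -= (size_t) copied;
    }
    return 0;
}

/* A stub that copies at most 4000 bytes per call, to exercise the loop. */
static size_t total_copied = 0;

static ssize_t
short_copy_stub(off_t offset, size_t len)
{
    size_t  chunk = len < 4000 ? len : 4000;

    (void) offset;
    total_copied += chunk;
    return (ssize_t) chunk;
}
```

Copying one 8192-byte block through the stub takes three calls (4000 + 4000 + 192) but still transfers exactly 8192 bytes.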
Re: pg_combinebackup --copy-file-range
On 3/26/24 15:09, Jakub Wartak wrote:
> On Sat, Mar 23, 2024 at 6:57 PM Tomas Vondra
> wrote:
>
>> On 3/23/24 14:47, Tomas Vondra wrote:
>>> On 3/23/24 13:38, Robert Haas wrote: On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro wrote: > [..]
>>> Yeah, that's in write_reconstructed_file() and the patch does not touch
>>> that at all. I agree it would be nice to use copy_file_range() in this
>>> part too, and it doesn't seem it'd be that hard to do, I think.
>>>
>>> It seems we'd just need a "fork" that either calls pread/pwrite or
>>> copy_file_range, depending on checksums and what was requested.
>>>
>>
>> Here's a patch to use copy_file_range in write_reconstructed_file too,
>> when requested/possible. One thing that I'm not sure about is whether to
>> do pg_fatal() if --copy-file-range but the platform does not support it.
> [..]
>
> Hi Tomas, so I gave a go to the below patches today:
> - v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch
> - v20240323-2-0002-write_reconstructed_file.patch
>
> My assessment:
>
> v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch -
> looks like more or less good to go

There are some issues with the --dry-run, pointed out by Robert. Should be fixed in the attached version.
> v20240323-2-0002-write_reconstructed_file.patch - needs work and
> without that clone/copy_file_range() good effects are unlikely
>
> Given Debian 12, ~100GB DB (pgbench -i -s 7000, and some additional
> tables with GiST and GIN indexes, just to see more WAL record types),
> and with backup sizes in MB like that:
>
> 106831  full
> 2823    incr.1  # captured after some time with pgbench -R 100
> 165     incr.2  # captured after some time with pgbench -R 100
>
> Test cmd: rm -rf outtest; sync; sync; sync; echo 3 | sudo tee
> /proc/sys/vm/drop_caches; time /usr/pgsql17/bin/pg_combinebackup -o
> outtest full incr.1 incr.2
>
> Test results of various copies on a small I/O-constrained XFS device:
> normal copy: 31m47.407s
> --clone copy: error: file cloning not supported on this platform (it's
> due to the #ifdef of having COPY_FILE_RANGE available)
> --copy-file-range: aborted, as it was taking too long, I was
> expecting it to accelerate, but it did not... obviously this is the
> transparent failover in case of calculating checksums...
> --manifest-checksums=NONE --copy-file-range: BUG, it keeps on appending
> to just one file e.g. outtest/base/5/16427.29 with 200GB+ ?? and ended
> up with ENOSPC [more on this later]

That's really strange.

> --manifest-checksums=NONE --copy-file-range without v20240323-2-0002:
> 27m23.887s
> --manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and
> loop-fix: 5m1.986s but it creates corruption as it stands

Thanks. I plan to do more similar tests, once my machines get done with some other stuff.

> Issues:
>
> 1.
https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327
> complains about win32/mingw:
>
> [15:47:27.184] In file included from copy_file.c:22:
> [15:47:27.184] copy_file.c: In function ‘copy_file’:
> [15:47:27.184] ../../../src/include/common/logging.h:134:6: error:
> this statement may fall through [-Werror=implicit-fallthrough=]
> [15:47:27.184]   134 |   if (unlikely(__pg_log_level <= PG_LOG_DEBUG)) \
> [15:47:27.184]       |   ^
> [15:47:27.184] copy_file.c:96:5: note: in expansion of macro ‘pg_log_debug’
> [15:47:27.184]    96 |   pg_log_debug("would copy \"%s\" to \"%s\"
> (copy_file_range)",
> [15:47:27.184]       |   ^~~~
> [15:47:27.184] copy_file.c:99:4: note: here
> [15:47:27.184]    99 |   case COPY_MODE_COPYFILE:
> [15:47:27.184]       |   ^~~~
> [15:47:27.184] cc1: all warnings being treated as errors
>

Yup, missing break.

> 2. I do not know what's the consensus between --clone and
> --copy-file-range, but if we have #ifdef FICLONE clone_works() #elif
> HAVE_COPY_FILE_RANGE copy_file_range_only_works() then we should also
> apply the same logic to the --help so that --clone is not visible
> there (for consistency?). Also the "error: file cloning not supported
> on this platform" is technically incorrect, Linux does support
> ioctl(FICLONE) and copy_file_range(), but we are just not choosing one
> over another (so technically it is "available"). Nitpicking I know.
>

That's a good question, I'm not sure. But whatever we do, we should do the same thing in pg_upgrade. Maybe there's some sort of precedent?

> 3.
> [v20240323-2-0002-write_reconstructed_file.patch]: The mentioned
> ENOSPC spiral-of-death-bug symptoms are like that:
>
> strace:
> copy_file_range(8, [697671680], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697679872], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697688064], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697696256], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697704448], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697712640], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697720832], 9, NULL, 8192, 0) = 8192
>
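As an aside, the -Werror=implicit-fallthrough failure quoted earlier in this message ("Yup, missing break") is easy to reproduce in miniature: a switch case that ends with a statement but no break falls into the next case. A hedged sketch with made-up names (not the actual copy_file.c dispatch):

```c
#include <assert.h>

/*
 * Illustrative copy-mode dispatch modeled on the warning in the CI log:
 * without the "break" after the first case, control would fall through
 * into COPY_MODE_COPYFILE, which is exactly what GCC's
 * -Werror=implicit-fallthrough flags.
 */
typedef enum
{
    COPY_MODE_COPY_FILE_RANGE,
    COPY_MODE_COPYFILE
} CopyMethod;

static int
dispatch_copy(CopyMethod method)
{
    int     which = -1;

    switch (method)
    {
        case COPY_MODE_COPY_FILE_RANGE:
            which = 0;
            break;              /* the missing break the thread fixes */
        case COPY_MODE_COPYFILE:
            which = 1;
            break;
    }
    return which;
}
```

With the break in place each mode dispatches exactly once, and the compiler warning goes away.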