Re: pg_combinebackup --copy-file-range

2024-04-07 Thread Tomas Vondra
On 4/7/24 19:46, Tomas Vondra wrote:
> On 4/5/24 21:43, Tomas Vondra wrote:
>> Hi,
>>
>> ...
>>
>> 2) The prefetching is not a huge improvement, at least not for these
>> three filesystems (btrfs, ext4, xfs). From the color scale it might seem
>> like it helps, but those values are relative to the baseline, so when
>> the non-prefetching value is 5% and with prefetching 10%, that means the
>> prefetching makes it slower. And that's very often true.
>>
>> This is visible more clearly in prefetching.pdf, comparing the
>> non-prefetching and prefetching results for each patch, not to baseline.
>> That makes it quite clear there's a lot of "red" where prefetching
>> makes it slower. It certainly does help for larger increments (which
>> makes sense, because the modified blocks are distributed randomly, and
>> thus come from random files, making long streaks unlikely).
>>
>> I've imagined the prefetching could be made a bit smarter to ignore the
>> streaks (=sequential patterns), but once again - this only matters with
>> the batching, which we don't have. And without the batching it looks
>> like a net loss (that's the first column in the prefetching PDF).
>>
>> I did start thinking about prefetching because of ZFS, where it was
>> necessary to get decent performance. And that's still true. But (a) even
>> with the explicit prefetching it's still 2-3x slower than any of these
>> filesystems, so I assume performance-sensitive use cases won't use it.
>> And (b) the prefetching seems necessary in all cases, no matter how
>> large the increment is. Which goes directly against the idea of looking
>> at how random the blocks are and prefetching only the sufficiently
>> random patterns. That doesn't seem like a great thing.
>>
> 
> I finally got more complete ZFS results, and I also decided to get
> some numbers without the ZFS tuning I did. And boy oh boy ...
> 
> All the tests I did with ZFS were tuned the way I've seen recommended
> when using ZFS for PostgreSQL, that is
> 
>   zfs set recordsize=8K logbias=throughput compression=none
> 
> and this performed quite poorly - pg_combinebackup took 4-8x longer than
> with the traditional filesystems (btrfs, xfs, ext4) and the only thing
> that improved that measurably was prefetching.
> 
> But once I reverted back to the default recordsize of 128kB the
> performance is way better - entirely comparable to ext4/xfs, while
> btrfs remains faster with --copy-file-range --no-manifest (by a factor
> of 2-3x).
> 

I forgot to explicitly say that I think this confirms the decision not to
push the patch adding the explicit prefetching to pg_combinebackup. It's
not needed/beneficial even for ZFS, when using a suitable configuration.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_combinebackup --copy-file-range

2024-04-04 Thread Tomas Vondra
On 4/4/24 12:25, Jakub Wartak wrote:
> On Thu, Apr 4, 2024 at 12:56 AM Tomas Vondra
>  wrote:
>>
>> Hi,
>>
>> Here's a much more polished and cleaned up version of the patches,
>> fixing all the issues I've been aware of, and with various parts merged
>> into much more cohesive parts (instead of keeping them separate to make
>> the changes/evolution more obvious).
> 
> OK, so three runs of incrementalbackupstests - as stated earlier -
> also passed with OK for v20240403 (this time even with
> --enable-cassert)
> 
> pg_combinebackup flags tested were:
> 1) --copy-file-range --manifest-checksums=CRC32C
> 2) --copy-file-range --manifest-checksums=NONE
> 3) default, no flags (no copy-file-range)
> 

Thanks!

>> I changed how I think about this a bit - I don't really see the CoW copy
>> methods as necessarily faster than the regular copy (even though it can be
>> with (5)). I think the main benefits are the space savings, enabled by
>> patches (1)-(3). If (4) and (5) get it, that's a bonus, but even without
>> that I don't think the performance is an issue - everything has a cost.
> 
> I take it differently: incremental backups without a CoW fs would clearly be:
> - inefficient in terms of RTO (SANs are often a key thing due to
> having fast ability to "restore" the clone rather than copying the
> data from somewhere else)
> - pg_basebackup without that would be unusable without space savings
> (e.g. imagine daily backups @ 10+TB DWHs)
> 

Right, although this very much depends on the backup scheme. If you only
take incremental backups, and then also a full backup once in a while,
the CoW stuff probably does not help much. The alignment (the only thing
affecting basebackups) may allow deduplication, but that's all I think.

If the scheme is more complex, and involves "merging" the increments
into the full backup, then this does help a lot. It'd even be possible
to cheaply clone instances this way, I think. But I'm not sure how often
people would do that on the same volume, to benefit from the CoW.

>> On 4/3/24 15:39, Jakub Wartak wrote:
>>> On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra
>>>  wrote:
> [..]
>> Thanks. Would be great if you could run this on the attached version of
>> the patches, ideally for each of them independently, to make sure it
>> doesn't get broken+fixed somewhere on the way.
> 
> Those are semi-manual test runs (~30 min? per run), the above results
> are for all of them applied at once. So my take is that since all of
> them work together, each one works individually too.
> 

Cool, thanks.

> FWIW, I'm also testing your other offlist incremental backup
> corruption issue, but that doesn't seem to be related in any way to
> copy_file_range() patches here.
> 

Yes, that's entirely independent, happens with master too.

>>>> 2) prefetch
>>>> ---
>>> [..]
>>>> I think this means we may need a "--prefetch" option, that'd force
>>>> prefetching, probably both before pread and copy_file_range. Otherwise
>>>> people on ZFS are doomed and will have poor performance.
>>>
>>> Right, we could optionally cover in the docs later on various options
>>> to get the performance (on XFS use $this, but without $that and so
>>> on). It's kind of madness dealing with all those performance
>>> variations.
>>>
>>
>> Yeah, something like that. I'm not sure we want to talk too much about
>> individual filesystems in our docs, because those things evolve over
>> time too.
> 
> Sounds like Wiki then.
> 
> BTW, after a quick review: could we in 05 have something like a common
> value then (to keep those together via some .h?)
> 
> #define BATCH_SIZE PREFETCH_TARGET ?
> 

Yes, that's one of the things I'd like to refine a bit more. Making it
more consistent / clearer that these things are interdependent.
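
Roughly something like this, I guess (just a sketch - the names and the
shared header location are only illustrative):

/* keep the two constants together, e.g. in reconstruct.h */
#define RECONSTRUCT_PREFETCH_TARGET    128
#define RECONSTRUCT_BATCH_SIZE         RECONSTRUCT_PREFETCH_TARGET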


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_combinebackup --copy-file-range

2024-04-04 Thread Jakub Wartak
On Thu, Apr 4, 2024 at 12:56 AM Tomas Vondra
 wrote:
>
> Hi,
>
> Here's a much more polished and cleaned up version of the patches,
> fixing all the issues I've been aware of, and with various parts merged
> into much more cohesive parts (instead of keeping them separate to make
> the changes/evolution more obvious).

OK, so three runs of incrementalbackupstests - as stated earlier -
also passed with OK for v20240403 (this time even with
--enable-cassert)

pg_combinebackup flags tested were:
1) --copy-file-range --manifest-checksums=CRC32C
2) --copy-file-range --manifest-checksums=NONE
3) default, no flags (no copy-file-range)

> I changed how I think about this a bit - I don't really see the CoW copy
> methods as necessarily faster than the regular copy (even though it can be
> with (5)). I think the main benefits are the space savings, enabled by
> patches (1)-(3). If (4) and (5) get it, that's a bonus, but even without
> that I don't think the performance is an issue - everything has a cost.

I take it differently: incremental backups without a CoW fs would clearly be:
- inefficient in terms of RTO (SANs are often a key thing due to
having fast ability to "restore" the clone rather than copying the
data from somewhere else)
- pg_basebackup without that would be unusable without space savings
(e.g. imagine daily backups @ 10+TB DWHs)

> On 4/3/24 15:39, Jakub Wartak wrote:
> > On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra
> >  wrote:
[..]
> Thanks. Would be great if you could run this on the attached version of
> the patches, ideally for each of them independently, to make sure it
> doesn't get broken+fixed somewhere on the way.

Those are semi-manual test runs (~30 min? per run), the above results
are for all of them applied at once. So my take is that since all of
them work together, each one works individually too.

FWIW, I'm also testing your other offlist incremental backup
corruption issue, but that doesn't seem to be related in any way to
copy_file_range() patches here.

> >> 2) prefetch
> >> ---
> > [..]
> >> I think this means we may need a "--prefetch" option, that'd force
> >> prefetching, probably both before pread and copy_file_range. Otherwise
> >> people on ZFS are doomed and will have poor performance.
> >
> > Right, we could optionally cover in the docs later on various options
> > to get the performance (on XFS use $this, but without $that and so
> > on). It's kind of madness dealing with all those performance
> > variations.
> >
>
> Yeah, something like that. I'm not sure we want to talk too much about
> individual filesystems in our docs, because those things evolve over
> time too.

Sounds like Wiki then.

BTW, after a quick review: could we in 05 have something like a common
value then (to keep those together via some .h?)

#define BATCH_SIZE PREFETCH_TARGET ?

-J.




Re: pg_combinebackup --copy-file-range

2024-04-03 Thread Tomas Vondra
Hi,

Here's a much more polished and cleaned up version of the patches,
fixing all the issues I've been aware of, and with various parts merged
into much more cohesive parts (instead of keeping them separate to make
the changes/evolution more obvious).

I decided to reorder the changes like this:

1) block alignment - As I said earlier, I think we should definitely do
this, even if only to make future improvements possible. After chatting
about this with Robert off-list a bit, he pointed out I actually forgot
to not align the headers for files without any blocks, so this version
fixes that.

2) add clone/copy_file_range for the case that copies whole files. This
is pretty simple, but it also adds the --clone/copy-file-range options,
and similar infrastructure. The one slightly annoying bit is that we now
have the ifdef stuff in two places - when parsing the options, and then
in the copy_file_XXX methods, and the pg_fatal() calls should not be
reachable in practice. But that doesn't seem harmful, and might be a
useful protection against someone calling a function that does nothing.
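
To illustrate what I mean, the shape is roughly this (a simplified
sketch, not the exact patch code):

static void
copy_file_copy_range(const char *src, const char *dst)
{
#ifdef HAVE_COPY_FILE_RANGE
    /* ... the actual copy_file_range() loop goes here ... */
#else
    /* should be unreachable, the option parsing already rejects this */
    pg_fatal("copy_file_range not supported on this platform");
#endif
}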

This also merges the original two parts, where the first one only did
this cloning/CoW stuff when checksum did not need to be calculated, and
the second extended it to those cases too (by also reading the data, but
still doing the copy the old way).

I think this is the right way - if that's not desirable, it's easy to
either add --no-manifest or not use the CoW options. Or just not change
the checksum type. There's no other way.

3) add copy_file_range to write_reconstructed_file, by using roughly the
same logic/reasoning as (2). If --copy-file-range is specified and a
checksum should be calculated, the data is read for the checksum, but
the copy is done using copy_file_range.
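
Per block the flow is roughly this (only a sketch - variable names like
s, offsetmap and checksum_ctx are just illustrative):

/* read the block only to feed the checksum ... */
if (checksum_ctx->type != CHECKSUM_TYPE_NONE)
{
    if (pread(s->fd, buffer, BLCKSZ, offsetmap[i]) != BLCKSZ)
        pg_fatal("could not read block %u", i);
    if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
        pg_fatal("could not update checksum");
}

/* ... but do the actual copy with copy_file_range() */
if (copy_file_range(s->fd, &offsetmap[i], wfd, NULL, BLCKSZ, 0) != BLCKSZ)
    pg_fatal("unexpected copy_file_range() result");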

I did rework the write_reconstructed_file() flow a bit, because
tracking what exactly needs to be read/written in each of the cases
(which copy method, zeroed block, checksum calculated) made the flow
really difficult to follow. Instead I introduced a function to
read/write a block, and call it from different places.

I think this is much better, and it also makes the following part
dealing with batches of blocks much easier / smaller change.

4) prefetching - This is mostly unrelated to the CoW stuff, but it has
tremendous benefits, especially for ZFS. I haven't been able to tune ZFS
to get decent performance, and ISTM it'd be rather unusable for backup
purposes without this.
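
The mechanism is simple - before reading a block, issue posix_fadvise()
for blocks a bit further ahead, similar to what the bitmap heap scan code
does. Roughly this (sketch only, the variable names are illustrative):

#define PREFETCH_TARGET    128

/* advise the kernel about blocks up to PREFETCH_TARGET ahead of i */
while (prefetch_done < i + PREFETCH_TARGET && prefetch_done < nblocks)
{
    rfile  *p = sourcemap[prefetch_done];

    posix_fadvise(p->fd, offsetmap[prefetch_done], BLCKSZ,
                  POSIX_FADV_WILLNEED);
    prefetch_done++;
}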

5) batching in write_reconstructed_file - This benefits especially the
copy_file_range case, where I've seen it yield massive speedups (see
the message from Monday for better data).
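
The basic idea is to detect runs of blocks that are adjacent in the same
source file, and then issue one larger copy for the whole run instead of
block-by-block calls. A rough sketch (names illustrative again):

/* count how many of the following blocks continue the same run */
unsigned    nadjacent = 1;

while (i + nadjacent < nblocks &&
       sourcemap[i + nadjacent] == sourcemap[i] &&
       offsetmap[i + nadjacent] == offsetmap[i] + nadjacent * BLCKSZ)
    nadjacent++;

/* one call for the whole run (still needs a loop for short returns) */
copy_file_range(s->fd, &offsetmap[i], wfd, NULL, nadjacent * BLCKSZ, 0);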

6) batching for prefetch - Similar logic to (5), but for fadvise. This
is the only part where I'm not entirely sure whether it actually helps
or not. Needs some more analysis, but I'm including it for completeness.


I do think the parts (1)-(4) are in pretty good shape, good enough to
make them committable in a day or two. I see it mostly as a matter of
testing and comment improvements rather than code changes.

(5) is in pretty good shape too, but I'd like to maybe simplify and
refine the write_reconstructed_file changes a bit more. I don't think
it's broken, but it feels a bit too cumbersome.

Not sure about (6) yet.

I changed how I think about this a bit - I don't really see the CoW copy
methods as necessarily faster than the regular copy (even though it can be
with (5)). I think the main benefits are the space savings, enabled by
patches (1)-(3). If (4) and (5) get it, that's a bonus, but even without
that I don't think the performance is an issue - everything has a cost.


On 4/3/24 15:39, Jakub Wartak wrote:
> On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra
>  wrote:
>>
>> Hi,
>>
>> I've been running some benchmarks and experimenting with various stuff,
>> trying to improve the poor performance on ZFS, and the regression on XFS
>> when using copy_file_range. And oh boy, did I find interesting stuff ...
> 
> [..]
> 
> Congratulations on great results!
> 
>> 4) after each pg_combinebackup, run pg_verifybackup, start the cluster
>> to finish recovery, and run pg_checksums --check (to check the patches
>> don't produce something broken)
> 
> I've performed some small follow-up testing on all patches mentioned
> here (1..7), with the nano-incremental-backup-tests developed earlier,
> which helped detect some issues for Robert during the original
> development. They all went fine in both cases:
> - no special options when using pg_combinebackup
> - using pg_combinebackup --copy-file-range --manifest-checksums=NONE
> 
> Those were:
> test_across_wallevelminimal.sh
> test_full_pri__incr_stby__restore_on_pri.sh
> test_full_pri__incr_stby__restore_on_stby.sh
> test_full_stby__incr_stby__restore_on_pri.sh
> test_full_stby__incr_stby__restore_on_stby.sh
> test_

Re: pg_combinebackup --copy-file-range

2024-04-03 Thread Jakub Wartak
On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra
 wrote:
>
> Hi,
>
> I've been running some benchmarks and experimenting with various stuff,
> trying to improve the poor performance on ZFS, and the regression on XFS
> when using copy_file_range. And oh boy, did I find interesting stuff ...

[..]

Congratulations on great results!

> 4) after each pg_combinebackup, run pg_verifybackup, start the cluster
> to finish recovery, and run pg_checksums --check (to check the patches
> don't produce something broken)

I've performed some small follow-up testing on all patches mentioned
here (1..7), with the nano-incremental-backup-tests developed earlier,
which helped detect some issues for Robert during the original
development. They all went fine in both cases:
- no special options when using pg_combinebackup
- using pg_combinebackup --copy-file-range --manifest-checksums=NONE

Those were:
test_across_wallevelminimal.sh
test_full_pri__incr_stby__restore_on_pri.sh
test_full_pri__incr_stby__restore_on_stby.sh
test_full_stby__incr_stby__restore_on_pri.sh
test_full_stby__incr_stby__restore_on_stby.sh
test_incr_after_timelineincrease.sh
test_incr_on_standby_after_promote.sh
test_many_incrementals_dbcreate_duplicateOID.sh
test_many_incrementals_dbcreate_filecopy_NOINCR.sh
test_many_incrementals_dbcreate_filecopy.sh
test_many_incrementals_dbcreate.sh
test_many_incrementals.sh
test_multixact.sh
test_pending_2pc.sh
test_reindex_and_vacuum_full.sh
test_repro_assert_RP.sh
test_repro_assert.sh
test_standby_incr_just_backup.sh
test_stuck_walsum.sh
test_truncaterollback.sh
test_unlogged_table.sh

> Now to the findings 
>
>
> 1) block alignment

[..]

> And I think we probably want to do this now, because this affects all
> tools dealing with incremental backups - even if someone writes a custom
> version of pg_combinebackup, it will have to deal with misaligned data.
> Perhaps there might be something like pg_basebackup that "transforms"
> the data received from the server (and also the backup manifest), but
> that does not seem like a great direction.

If anything is on the table, then I think in the far future
pg_refresh_standby_using_incremental_backup_from_primary would be the
only other tool using the format ?

> 2) prefetch
> ---
[..]
> I think this means we may need a "--prefetch" option, that'd force
> prefetching, probably both before pread and copy_file_range. Otherwise
> people on ZFS are doomed and will have poor performance.

Right, we could optionally cover in the docs later on various options
to get the performance (on XFS use $this, but without $that and so
on). It's kind of madness dealing with all those performance
variations.

Another idea: remove that 128 posix_fadvise() hardcode in 0002 and add a
getopt variant like --prefetch[=HOWMANY], with 128 as the default?
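
Something along these lines, I guess (only a sketch - the option value
and the opt.prefetch_target field are made up):

/* long options table */
{"prefetch", optional_argument, NULL, 5},

/* option-processing switch, falling back to 128 when no value is given */
case 5:
    opt.prefetch_target = optarg ? atoi(optarg) : 128;
    break;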

-J.




Re: pg_combinebackup --copy-file-range

2024-04-02 Thread Tomas Vondra
On 4/1/24 23:45, Thomas Munro wrote:
> ...
>>
>> I was very puzzled by the awful performance on ZFS. When every other fs
>> (EXT4/XFS/BTRFS) took 150-200 seconds to run pg_combinebackup, it took
>> 900-1000 seconds on ZFS, no matter what I did. I tried all the tuning
>> advice I could think of, with almost no effect.
>>
>> Ultimately I decided that it probably is the "no readahead" behavior
>> I've observed on ZFS. I assume it's because it doesn't use the page
>> cache where the regular readahead is detected etc. And there's no
>> prefetching in pg_combinebackup, so I decided to do an experiment and added
>> a trivial explicit prefetch when reconstructing the file - every time
>> we'd read data from a file, we do posix_fadvise for up to 128 blocks
>> ahead (similar to what bitmap heap scan code does). See 0002.
>>
>> And tadaaa - the duration dropped from 900-1000 seconds to only about
>> 250-300 seconds, so an improvement of a factor of 3-4x. I think this is
>> pretty massive.
> 
> Interesting.  ZFS certainly has its own prefetching heuristics with
> lots of logic and settings, but it could be that it's using
> strict-next-block detection of access pattern (ie what I called
> effective_io_readahead_window=0 in the streaming I/O thread) instead
> of a window (ie like the Linux block device level read ahead where,
> AFAIK, if you access anything in that sliding window it is triggered),
> and perhaps your test has a lot of non-contiguous but close-enough
> blocks?  (Also reminds me of the similar discussion on the BHS thread
> about distinguishing sequential access from
> mostly-sequential-but-with-lots-of-holes-like-Swiss-cheese, and the
> fine line between them.)
> 

I don't think the files have a lot of non-contiguous but close-enough
blocks (it's rather that we'd skip blocks that need to come from a later
incremental file). The backups are generated to have a certain fraction
of modified blocks.

For example the smallest backup has 1% of modified blocks, which means
99% of blocks come from the base backup and 1% from the increment. And
indeed, the whole database is ~75GB and the backup is ~740MB. Which means
that on average there will be runs of 99 blocks from the base backup,
then a skip of 1 block (coming from the increment), and then again
99-1-99-1. So it's very
sequential, almost no holes, and the increment is 100% sequential. And
it still does not seem to prefetch anything.


> You could double-check this and related settings (for example I think
> it might disable itself automatically if you're on a VM with small RAM
> size):
> 
> https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-prefetch-disable
> 

I haven't touched that parameter at all, and it's "enabled" by default:

# cat /sys/module/zfs/parameters/zfs_prefetch_disable
0

While trying to make the built-in prefetch work I reviewed the other
parameters with the "prefetch" tag, without success. And I haven't seen
any advice on how to make it work ...

>> There's a couple more interesting ZFS details - the prefetching seems to
>> be necessary even when using copy_file_range() and we don't need to read
>> the data (to calculate checksums). This is why the "manifest=off" chart
>> has the strange group of high bars at the end - the copy cases are fast
>> because prefetch happens, but if we switch to copy_file_range() there
>> are no prefetches and it gets slow.
> 
> Hmm, at a guess, it might be due to prefetching the dnode (root object
> for a file) and block pointers, ie the structure but not the data
> itself.
> 

Yeah, that's possible. But the effects are the same - it doesn't matter
what exactly is not prefetched. But perhaps we could prefetch just a
tiny part of the record, enough to prefetch the dnode+pointers, not the
whole record. Might save some space in ARC, perhaps?
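
I mean something like this (just a sketch of the idea, nothing tested):

/* advise only the first few kB of each (128kB) record, hoping that's
 * enough to get the dnode + block pointers cached, without pulling
 * the whole record into ARC */
posix_fadvise(s->fd, offsetmap[i], 4096, POSIX_FADV_WILLNEED);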

>> This is a bit bizarre, especially because the manifest=on cases are
>> still fast, exactly because the pread + prefetching still happens. I'm
>> sure users would find this puzzling.
>>
>> Unfortunately, the prefetching is not beneficial for all filesystems.
>> For XFS it does not seem to make any difference, but on BTRFS it seems
>> to cause a regression.
>>
>> I think this means we may need a "--prefetch" option, that'd force
>> prefetching, probably both before pread and copy_file_range. Otherwise
>> people on ZFS are doomed and will have poor performance.
> 
> Seems reasonable if you can't fix it by tuning ZFS.  (Might also be an
> interesting research topic for a potential ZFS patch:
> prefetch_swiss_cheese_window_size.  I will not be nerd-sniped into
> reading the relevant source today, but I'll figure it out soonish...)
> 

It's entirely possible I'm just too stupid and it works just fine for
everyone else. But maybe not, and I'd say an implementation that is this
difficult to configure is almost as if it didn't exist at all. The Linux
read-ahead works pretty well by default.

So I don't see how to make this work without explicit prefetch ... Of

Re: pg_combinebackup --copy-file-range

2024-04-01 Thread Thomas Munro
On Tue, Apr 2, 2024 at 8:43 AM Tomas Vondra
 wrote:
> And I think he's right, and my tests confirm this. I did a trivial patch
> to align the blocks to 8K boundary, by forcing the header to be a
> multiple of 8K (I think 4K alignment would be enough). See the 0001
> patch that does this.
>
> And if I measure the disk space used by pg_combinebackup, and compare
> the results with results without the patch ([1] from a couple days
> back), I see this:
>
>pct  not alignedaligned
>   -
> 1% 689M19M
>10%3172M22M
>20%   13797M27M
>
> Yes, those numbers are correct. I didn't believe this at first, but the
> backups are valid/verified, checksums are OK, etc. BTRFS has similar
> numbers (e.g. drop from 20GB to 600MB).

Fantastic.

> I think we absolutely need to align the blocks in the incremental files,
> and I think we should do that now. I think 8K would work, but maybe we
> should add alignment parameter to basebackup & manifest?
>
> The reason why I think maybe this should be a basebackup parameter is
> the recent discussion about large fs blocks - it seems to be in the
> works, so maybe better to be ready and not assume all fs have 4K.
>
> And I think we probably want to do this now, because this affects all
> tools dealing with incremental backups - even if someone writes a custom
> version of pg_combinebackup, it will have to deal with misaligned data.
> Perhaps there might be something like pg_basebackup that "transforms"
> the data received from the server (and also the backup manifest), but
> that does not seem like a great direction.

+1, and I think BLCKSZ is the right choice.

> I was very puzzled by the awful performance on ZFS. When every other fs
> (EXT4/XFS/BTRFS) took 150-200 seconds to run pg_combinebackup, it took
> 900-1000 seconds on ZFS, no matter what I did. I tried all the tuning
> advice I could think of, with almost no effect.
>
> Ultimately I decided that it probably is the "no readahead" behavior
> I've observed on ZFS. I assume it's because it doesn't use the page
> cache where the regular readahead is detected etc. And there's no
> prefetching in pg_combinebackup, so I decided to do an experiment and added
> a trivial explicit prefetch when reconstructing the file - every time
> we'd read data from a file, we do posix_fadvise for up to 128 blocks
> ahead (similar to what bitmap heap scan code does). See 0002.
>
> And tadaaa - the duration dropped from 900-1000 seconds to only about
> 250-300 seconds, so an improvement of a factor of 3-4x. I think this is
> pretty massive.

Interesting.  ZFS certainly has its own prefetching heuristics with
lots of logic and settings, but it could be that it's using
strict-next-block detection of access pattern (ie what I called
effective_io_readahead_window=0 in the streaming I/O thread) instead
of a window (ie like the Linux block device level read ahead where,
AFAIK, if you access anything in that sliding window it is triggered),
and perhaps your test has a lot of non-contiguous but close-enough
blocks?  (Also reminds me of the similar discussion on the BHS thread
about distinguishing sequential access from
mostly-sequential-but-with-lots-of-holes-like-Swiss-cheese, and the
fine line between them.)

You could double-check this and related settings (for example I think
it might disable itself automatically if you're on a VM with small RAM
size):

https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-prefetch-disable

> There's a couple more interesting ZFS details - the prefetching seems to
> be necessary even when using copy_file_range() and we don't need to read
> the data (to calculate checksums). This is why the "manifest=off" chart
> has the strange group of high bars at the end - the copy cases are fast
> because prefetch happens, but if we switch to copy_file_range() there
> are no prefetches and it gets slow.

Hmm, at a guess, it might be due to prefetching the dnode (root object
for a file) and block pointers, ie the structure but not the data
itself.

> This is a bit bizarre, especially because the manifest=on cases are
> still fast, exactly because the pread + prefetching still happens. I'm
> sure users would find this puzzling.
>
> Unfortunately, the prefetching is not beneficial for all filesystems.
> For XFS it does not seem to make any difference, but on BTRFS it seems
> to cause a regression.
>
> I think this means we may need a "--prefetch" option, that'd force
> prefetching, probably both before pread and copy_file_range. Otherwise
> people on ZFS are doomed and will have poor performance.

Seems reasonable if you can't fix it by tuning ZFS.  (Might also be an
interesting research topic for a potential ZFS patch:
prefetch_swiss_cheese_window_size.  I will not be nerd-sniped into
reading the relevant source today, but I'll figure it out soonish...)

> So I took a stab 

Re: pg_combinebackup --copy-file-range

2024-03-30 Thread Tomas Vondra



On 3/31/24 06:46, Thomas Munro wrote:
> On Sun, Mar 31, 2024 at 5:33 PM Tomas Vondra
>  wrote:
>> I'm on 2.2.2 (on Linux). But there's something wrong, because the
>> pg_combinebackup that took ~150s on xfs/btrfs, takes ~900s on ZFS.
>>
>> I'm not sure it's a ZFS config issue, though, because it's not CPU or
>> I/O bound, and I see this on both machines. And some simple dd tests
>> show the zpool can do 10x the throughput. Could this be due to the file
>> header / pool alignment?
> 
> Could ZFS recordsize > 8kB be making it worse, repeatedly dealing with
> the same 128kB record as you copy_file_range 16 x 8kB blocks?
> (Guessing you might be using the default recordsize?)
> 

No, I reduced the record size to 8kB. And the pgbench init takes about
the same as on other filesystems on this hardware, I think. ~10 minutes
for scale 5000.

>> I admit I'm not very familiar with the format, but you're probably right
>> there's a header, and header_length does not seem to consider alignment.
>> make_incremental_rfile simply does this:
>>
>> /* Remember length of header. */
>> rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
>> sizeof(rf->truncation_block_length) +
>> sizeof(BlockNumber) * rf->num_blocks;
>>
>> and sendFile() does the same thing when creating incremental basebackup.
>> I guess it wouldn't be too difficult to make sure to align this to
>> BLCKSZ or something like this. I wonder if the file format is documented
>> somewhere ... It'd certainly be nicer to tweak before v18, if necessary.
>>
>> Anyway, is that really a problem? I mean, in my tests the CoW stuff
>> seemed to work quite fine - at least on the XFS/BTRFS. Although, maybe
>> that's why it took longer on XFS ...
> 
> Yeah I'm not sure, I assume it did more allocating and copying because
> of that.  It doesn't matter and it would be fine if a first version
> weren't as good as possible, and fine if we tune the format later once
> we know more, ie leaving improvements on the table.  I just wanted to
> share the observation.  I wouldn't be surprised if the block-at-a-time
> coding makes it slower and maybe makes the on disk data structures
> worse, but I dunno I'm just guessing.
> 
> It's also interesting but not required to figure out how to tune ZFS
> well for this purpose right now...

No idea. Any idea if there's some good ZFS statistics to check?


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_combinebackup --copy-file-range

2024-03-30 Thread Thomas Munro
On Sun, Mar 31, 2024 at 5:33 PM Tomas Vondra
 wrote:
> I'm on 2.2.2 (on Linux). But there's something wrong, because the
> pg_combinebackup that took ~150s on xfs/btrfs, takes ~900s on ZFS.
>
> I'm not sure it's a ZFS config issue, though, because it's not CPU or
> I/O bound, and I see this on both machines. And some simple dd tests
> show the zpool can do 10x the throughput. Could this be due to the file
> header / pool alignment?

Could ZFS recordsize > 8kB be making it worse, repeatedly dealing with
the same 128kB record as you copy_file_range 16 x 8kB blocks?
(Guessing you might be using the default recordsize?)

> I admit I'm not very familiar with the format, but you're probably right
> there's a header, and header_length does not seem to consider alignment.
> make_incremental_rfile simply does this:
>
> /* Remember length of header. */
> rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
> sizeof(rf->truncation_block_length) +
> sizeof(BlockNumber) * rf->num_blocks;
>
> and sendFile() does the same thing when creating incremental basebackup.
> I guess it wouldn't be too difficult to make sure to align this to
> BLCKSZ or something like this. I wonder if the file format is documented
> somewhere ... It'd certainly be nicer to tweak before v18, if necessary.
>
> Anyway, is that really a problem? I mean, in my tests the CoW stuff
> seemed to work quite fine - at least on the XFS/BTRFS. Although, maybe
> that's why it took longer on XFS ...

Yeah I'm not sure, I assume it did more allocating and copying because
of that.  It doesn't matter and it would be fine if a first version
weren't as good as possible, and fine if we tune the format later once
we know more, ie leaving improvements on the table.  I just wanted to
share the observation.  I wouldn't be surprised if the block-at-a-time
coding makes it slower and maybe makes the on disk data structures
worse, but I dunno I'm just guessing.

It's also interesting but not required to figure out how to tune ZFS
well for this purpose right now...




Re: pg_combinebackup --copy-file-range

2024-03-30 Thread Tomas Vondra
On 3/31/24 03:03, Thomas Munro wrote:
> On Sun, Mar 31, 2024 at 1:37 PM Tomas Vondra
>  wrote:
>> So I decided to take a stab at Thomas' idea, i.e. reading the data to
>> ...
>> I'll see how this works on EXT4/ZFS next ...
> 
> Wow, very cool!  A couple of very quick thoughts/notes:
> 
> ZFS: the open source version only gained per-file block cloning in
> 2.2, so if you're on an older release I expect copy_file_range() to
> work but not really do the magic thing.  On the FreeBSD version you
> also have to turn cloning on with a sysctl because people were worried
> about bugs in early versions so by default you still get actual
> copying, not sure if you need something like that on the Linux
> version...  (Obviously ZFS is always completely COW-based, but before
> the new block cloning stuff it could only share blocks by its own
> magic deduplication if enabled, or by cloning/snapshotting a whole
> dataset/mountpoint; there wasn't a way to control it explicitly like
> this.)
> 

I'm on 2.2.2 (on Linux). But there's something wrong, because the
pg_combinebackup that took ~150s on xfs/btrfs, takes ~900s on ZFS.

I'm not sure it's a ZFS config issue, though, because it's not CPU or
I/O bound, and I see this on both machines. And some simple dd tests
show the zpool can do 10x the throughput. Could this be due to the file
header / pool alignment?

> Alignment: block sharing on any fs requires it.  I haven't re-checked
> recently but IIRC the incremental file format might have a
> non-block-sized header?  That means that if you copy_file_range() from
> both the older backup and also the incremental backup, only the former
> will share blocks, and the latter will silently be done by copying to
> newly allocated blocks.  If that's still true, I'm not sure how hard
> it would be to tweak the format to inject some padding and to make
> sure that there isn't any extra header before each block.

I admit I'm not very familiar with the format, but you're probably right
there's a header, and header_length does not seem to consider alignment.
make_incremental_rfile simply does this:

/* Remember length of header. */
rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
sizeof(rf->truncation_block_length) +
sizeof(BlockNumber) * rf->num_blocks;

and sendFile() does the same thing when creating incremental basebackup.
I guess it wouldn't be too difficult to make sure to align this to
BLCKSZ or something like this. I wonder if the file format is documented
somewhere ... It'd certainly be nicer to tweak before v18, if necessary.
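
Presumably it'd be enough to round the header up to the next multiple of
BLCKSZ and pad it with zeroes on the sending side, something like this
(just a sketch):

/* natural header length, as today */
rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
    sizeof(rf->truncation_block_length) +
    sizeof(BlockNumber) * rf->num_blocks;

/* pad so that the first block starts at a BLCKSZ boundary */
rf->header_length = ((rf->header_length + BLCKSZ - 1) / BLCKSZ) * BLCKSZ;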

Anyway, is that really a problem? I mean, in my tests the CoW stuff
seemed to work quite fine - at least on the XFS/BTRFS. Although, maybe
that's why it took longer on XFS ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_combinebackup --copy-file-range

2024-03-30 Thread Thomas Munro
+wb = copy_file_range(s->fd, &offsetmap[i], wfd, NULL, BLCKSZ, 0);

Can you collect adjacent blocks in one multi-block call?  And then I
think the contract is that you need to loop if it returns short.
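
Something like this, roughly (untested sketch, names illustrative):

/* copy a run of nblocks adjacent blocks in one call, looping because
 * copy_file_range() may return after copying fewer bytes than requested */
size_t      remaining = (size_t) nblocks * BLCKSZ;

while (remaining > 0)
{
    ssize_t wb = copy_file_range(s->fd, &offsetmap[i], wfd, NULL,
                                 remaining, 0);

    if (wb <= 0)
        pg_fatal("error while copying file range: %m");

    remaining -= wb;    /* the kernel advances the input offset for us */
}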




Re: pg_combinebackup --copy-file-range

2024-03-30 Thread Thomas Munro
On Sun, Mar 31, 2024 at 1:37 PM Tomas Vondra
 wrote:
> So I decided to take a stab at Thomas' idea, i.e. reading the data to
> ...
> I'll see how this works on EXT4/ZFS next ...

Wow, very cool!  A couple of very quick thoughts/notes:

ZFS: the open source version only gained per-file block cloning in
2.2, so if you're on an older release I expect copy_file_range() to
work but not really do the magic thing.  On the FreeBSD version you
also have to turn cloning on with a sysctl because people were worried
about bugs in early versions so by default you still get actual
copying, not sure if you need something like that on the Linux
version...  (Obviously ZFS is always completely COW-based, but before
the new block cloning stuff it could only share blocks by its own
magic deduplication if enabled, or by cloning/snapshotting a whole
dataset/mountpoint; there wasn't a way to control it explicitly like
this.)

Alignment: block sharing on any fs requires it.  I haven't re-checked
recently but IIRC the incremental file format might have a
non-block-sized header?  That means that if you copy_file_range() from
both the older backup and also the incremental backup, only the former
will share blocks, and the latter will silently be done by copying to
newly allocated blocks.  If that's still true, I'm not sure how hard
it would be to tweak the format to inject some padding and to make
sure that there isn't any extra header before each block.




Re: pg_combinebackup --copy-file-range

2024-03-29 Thread Robert Haas
On Tue, Mar 26, 2024 at 2:03 PM Tomas Vondra
 wrote:
> [ new patches ]

Tomas, thanks for pointing me to this email; as you speculated, gmail
breaks threading if the subject line changes.

The documentation still needs work here:

- It refers to --link mode, which is not a thing.

- It should talk about the fact that in some cases block-by-block
copying will still be needed, possibly mentioning that it specifically
happens when the old backup manifest is not available or does not
contain checksums or does not contain checksums of the right type, or
maybe just being a bit vague.

In copy_file.c:

- You added an unnecessary blank line to the beginning of a comment block.

- You could keep the strategy_implementation variable here. I don't
think that's 100% necessary, but now that you've got the code
structured this way, there's no compelling reason to remove it.

- I don't know what the +/* XXX maybe this should do the check
internally, same as the other functions? */ comment is talking about.

- Maybe these functions should have header comments.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_combinebackup --copy-file-range

2024-03-27 Thread Jakub Wartak
On Tue, Mar 26, 2024 at 7:03 PM Tomas Vondra
 wrote:
[..]
>
> That's really strange.

Hi Tomas, but it looks like it's fixed now :)

> > --manifest-checksums=NONE --copy-file-range without v20240323-2-0002: 
> > 27m23.887s
> > --manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and
> > loop-fix: 5m1.986s but it creates corruption as it stands
> >
>
> Thanks. I plan to do more similar tests, once my machines get done with
> some other stuff.

Please do so as I do not trust my fingers :-)

> > Issues:
> >
> > 1. https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327
> > compains about win32/mingw:
> >
[..]
> >
>
> Yup, missing break.

Now it's 
https://cirrus-ci.com/task/4997185324974080?logs=headers_headerscheck#L10
, reproducible with "make -s headerscheck
EXTRAFLAGS='-fmax-errors=10'":
/tmp/postgres/src/bin/pg_combinebackup/reconstruct.h:34:91: error:
unknown type name ‘CopyMode’ / CopyMode copy_mode);
to me it looks like reconstruct.h needs to include the definition of
CopyMode, which is in copy_file.h (i.e. add #include "copy_file.h")

> > 2. I do not know what's the consensus between --clone and
> > --copy-file-range , but if we have #ifdef FICLONE clone_works() #elif
> > HAVE_COPY_FILE_RANGE copy_file_range_only_works() then we should also
> > apply the same logic to the --help so that --clone is not visible
> > there (for consistency?). Also the "error: file cloning not supported
> > on this platform " is technically incorrect, Linux does support
> > ioctl(FICLONE) and copy_file_range(), but we are just not choosing one
> > over another (so technically it is "available"). Nitpicking I know.
> >
>
> That's a good question, I'm not sure. But whatever we do, we should do
> the same thing in pg_upgrade. Maybe there's some sort of precedent?

Sigh, you are right... It's consistency hell.

> > 3. [v20240323-2-0002-write_reconstructed_file.patch]: The mentioned
> > ENOSPC spiral-of-death bug symptoms are like this:
[..]
>
> Yeah, that retry logic is wrong. I ended up copying the check from the
> "regular" copy branch, which simply bails out if copy_file_range returns
> anything but the expected 8192.
>
> I wonder if this should deal with partial writes, though. I mean,
> copy_file_range() is allowed to copy only some of the bytes - I don't know
> how often / in what situations that happens, though ... And if we want
> to handle that for copy_file_range(), pwrite() needs the same treatment.

Maybe that helps?
https://github.com/coreutils/coreutils/blob/606f54d157c3d9d558bdbe41da8d108993d86aeb/src/copy.c#L1427
, it's harder than I anticipated (we can ignore the sparse logic
though, I think)
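
For the pwrite() side it would presumably be the usual retry loop,
something like this (just a sketch, out_offset / buffer are made-up
names):

ssize_t written = 0;

while (written < BLCKSZ)
{
    ssize_t rc = pwrite(wfd, buffer + written, BLCKSZ - written,
                        out_offset + written);

    if (rc < 0 && errno == EINTR)
        continue;
    if (rc <= 0)
        pg_fatal("could not write block: %m");

    written += rc;
}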

> > 3. .. but on startup I've got this after trying psql login: invalid
> > page in block 0 of relation base/5/1259  .
[..]
>
> Can you see if you can still reproduce this with the attached version?

Looks like it's fixed now and it works great (~3min, multiple times)!

BTW: I've also tried it over NFSv4 over loopback with XFS, as
copy_file_range() probably does server-side optimization, but somehow
it was so slow there that it is close to being unusable (~9GB out of
104GB reconstructed after 45 mins) - maybe it's due to NFS mount opts;
I don't think we should worry too much. I think it's related to
missing the optimization below, if that matters. I think it's too early
to warn users about NFS (I've spent just 10 mins on it), but on the
other hand people might complain it's broken...

> > Apparently there's no merging of adjacent I/Os, so pg_combinebackup
> > wastes lots of time issuing small syscalls instead, when it could,
> > let's say, do a single pread/pwrite (or even copy_file_range()). I think
> > it was not evident in my earlier testing (200GB; 39min vs ~40s) as I
> > had much smaller modifications in my incremental (think of 99% of
> > static data).
> >
>
> Yes, I've been thinking about exactly this optimization, but I think
> we're way past proposing this for PG17. The changes that it would require
> in reconstruct_from_incremental_file are way too significant. Has to
> wait for PG18 ;-)

Sure thing!

> I do think there's more on the table, as mentioned by Thomas a couple
> days ago - maybe we shouldn't approach clone/copy_file_range merely as
> an optimization to save time, it might be entirely reasonable to do this
> simply to allow the filesystem to do CoW magic and save space (even if
> we need to read the data and recalculate the checksum, which now
> disables these copy methods).

Sure! I think time will still be a priority though, as
pg_combinebackup duration impacts RTO while disk space is relatively
cheap.

One could argue that reconstructing 50TB will be a challenge though.
Now my tests indicate space saving is already happening with the 0002
patch - 100GB DB / full backup stats look like this (so we are good I
think when using CoW - not so without CoW) -- or did I misunderstand
something?:

root@jw-test-1:/backups# du -sm /backups/
214612  /backups/
root@jw-test-1:/backups# du -sm *
106831  

Re: pg_combinebackup --copy-file-range

2024-03-26 Thread Tomas Vondra


On 3/26/24 15:09, Jakub Wartak wrote:
> On Sat, Mar 23, 2024 at 6:57 PM Tomas Vondra
>  wrote:
> 
>> On 3/23/24 14:47, Tomas Vondra wrote:
>>> On 3/23/24 13:38, Robert Haas wrote:
>>>> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro
>>>> wrote:
> [..]
>>> Yeah, that's in write_reconstructed_file() and the patch does not touch
>>> that at all. I agree it would be nice to use copy_file_range() in this
>>> part too, and it doesn't seem it'd be that hard to do, I think.
>>>
>>> It seems we'd just need a "fork" that either calls pread/pwrite or
>>> copy_file_range, depending on checksums and what was requested.
>>>
>>
>> Here's a patch to use copy_file_range in write_reconstructed_file too,
>> when requested/possible. One thing that I'm not sure about is whether to
>> do pg_fatal() if --copy-file-range is specified but the platform does not
>> support it.
> [..]
> 
> Hi Tomas, so I gave a go to the below patches today:
> - v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch
> - v20240323-2-0002-write_reconstructed_file.patch
> 
> My assessment:
> 
> v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch -
> looks like more or less good to go

There are some issues with --dry-run, pointed out by Robert. They should be
fixed in the attached version.

> v20240323-2-0002-write_reconstructed_file.patch - needs work, and
> without it the good effects of clone/copy_file_range() are unlikely
> 
> Given Debian 12, ~100GB DB, (pgbench -i -s 7000 , and some additional
> tables with GiST and GIN indexes , just to see more WAL record types)
> and with backups sizes in MB like that:
> 
> 106831  full
> 2823    incr.1 # captured after some time with pgbench -R 100
> 165 incr.2 # captured after some time with pgbench -R 100
> 
> Test cmd: rm -rf outtest; sync; sync ; sync; echo 3 | sudo tee
> /proc/sys/vm/drop_caches ; time /usr/pgsql17/bin/pg_combinebackup -o
> outtest full incr.1 incr.2
> 
> Test results of various copies on small I/O constrained XFS device:
> normal copy: 31m47.407s
> --clone copy: error: file cloning not supported on this platform (it's
> due to the #ifdef of having COPY_FILE_RANGE available)
> --copy-file-range: aborted, as it was taking too long; I was
> expecting it to accelerate, but it did not... obviously this is the
> transparent failover in case of calculating checksums...
> --manifest-checksums=NONE --copy-file-range: BUG, it keeps on appending
> to just one file e.g. outtest/base/5/16427.29 with 200GB+ ?? and ended
> up with ENOSPC [more on this later]

That's really strange.

> --manifest-checksums=NONE --copy-file-range without v20240323-2-0002: 
> 27m23.887s
> --manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and
> loop-fix: 5m1.986s but it creates corruption as it stands
> 

Thanks. I plan to do more similar tests, once my machines get done with
some other stuff.

> Issues:
> 
> 1. https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327
> compains about win32/mingw:
> 
> [15:47:27.184] In file included from copy_file.c:22:
> [15:47:27.184] copy_file.c: In function ‘copy_file’:
> [15:47:27.184] ../../../src/include/common/logging.h:134:6: error:
> this statement may fall through [-Werror=implicit-fallthrough=]
> [15:47:27.184]   134 |   if (unlikely(__pg_log_level <= PG_LOG_DEBUG)) \
> [15:47:27.184]   |  ^
> [15:47:27.184] copy_file.c:96:5: note: in expansion of macro ‘pg_log_debug’
> [15:47:27.184]96 | pg_log_debug("would copy \"%s\" to \"%s\"
> (copy_file_range)",
> [15:47:27.184]   | ^~~~
> [15:47:27.184] copy_file.c:99:4: note: here
> [15:47:27.184]99 |case COPY_MODE_COPYFILE:
> [15:47:27.184]   |^~~~
> [15:47:27.184] cc1: all warnings being treated as errors
> 

Yup, missing break.

> 2. I do not know what's the consensus between --clone and
> --copy-file-range , but if we have #ifdef FICLONE clone_works() #elif
> HAVE_COPY_FILE_RANGE copy_file_range_only_works() then we should also
> apply the same logic to the --help so that --clone is not visible
> there (for consistency?). Also the "error: file cloning not supported
> on this platform " is technically incorrect, Linux does support
> ioctl(FICLONE) and copy_file_range(), but we are just not choosing one
> over another (so technically it is "available"). Nitpicking I know.
> 

That's a good question, I'm not sure. But whatever we do, we should do
the same thing in pg_upgrade. Maybe there's some sort of precedent?

> 3. [v20240323-2-0002-write_reconstructed_file.patch]: The mentioned
> ENOSPC spiral-of-death bug symptoms are like this:
> 
> strace:
> copy_file_range(8, [697671680], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697679872], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697688064], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697696256], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697704448], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697712640], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697720832], 9, NULL, 8192, 0) = 8192
>