Re: Large files for relations

2024-05-13 Thread Peter Eisentraut

On 06.03.24 22:54, Thomas Munro wrote:

Rebased.  I had intended to try to get this into v17, but a couple of
unresolved problems came up while rebasing over the new incremental
backup stuff.  You snooze, you lose.  Hopefully we can sort these out
in time for the next commitfest:

* should pg_combinebasebackup read the control file to fetch the segment size?
* hunt for other segment-size related problems that may be lurking in
new incremental backup stuff
* basebackup_incremental.c wants to use memory in proportion to
segment size, which looks like a problem, and I wrote about that in a
new thread[1]


Overall, I like this idea, and the patch seems to have many bases covered.

The patch will need a rebase.  I was able to test it on 
master@{2024-03-13}, but after that there are conflicts.


In .cirrus.tasks.yml, one of the test tasks uses 
--with-segsize-blocks=6, but you are removing that option.  You could 
replace that with something like


PG_TEST_INITDB_EXTRA_OPTS='--rel-segsize=48kB'

But that won't work exactly because

initdb: error: argument of --rel-segsize must be a power of two

I suppose that's ok as a change, since it makes the arithmetic more 
efficient.  But maybe it should be called out explicitly in the commit 
message.
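
For reference, the check initdb presumably performs here is the classic
single-bit test; a minimal sketch (the function name is mine, not the
patch's):

/* A positive value is a power of two iff exactly one bit is set. */
static bool
is_power_of_two(int64 value)
{
    return value > 0 && (value & (value - 1)) == 0;
}

/* e.g. 48kB = 49152 = 0xC000 has two bits set, so it is rejected,
 * while 64kB = 65536 = 0x10000 passes. */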


If I run it with 64kB, the test pgbench/001_pgbench_with_server fails 
consistently, so it seems there is still a gap somewhere.


A minor point, the initdb error message

initdb: error: argument of --rel-segsize must be a multiple of BLCKSZ

would be friendlier if it actually showed the value of the block size 
instead of just the symbol.  Similarly for the nearby error message 
about the off_t size.
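
Something along these lines would do (a sketch, not a proposed final
wording; pg_fatal() is the usual frontend error helper):

pg_fatal("argument of --rel-segsize must be a multiple of the block size (%d bytes)",
         BLCKSZ);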


In the control file, all the other fields use unsigned types.  Should 
relseg_size be uint64?


PG_CONTROL_VERSION needs to be changed.





Re: Large files for relations

2024-03-06 Thread Thomas Munro
Rebased.  I had intended to try to get this into v17, but a couple of
unresolved problems came up while rebasing over the new incremental
backup stuff.  You snooze, you lose.  Hopefully we can sort these out
in time for the next commitfest:

* should pg_combinebasebackup read the control file to fetch the segment size?
* hunt for other segment-size related problems that may be lurking in
new incremental backup stuff
* basebackup_incremental.c wants to use memory in proportion to
segment size, which looks like a problem, and I wrote about that in a
new thread[1]

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2B2hZ0sBztPW4mkLfng0qfkNtAHFUfxOMLizJ0BPmi5%2Bg%40mail.gmail.com
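
On the first point, reading the segment size out of pg_control from a
frontend tool could be as small as the following sketch.
get_controlfile() exists in src/common today; treating relseg_size as
int64 assumes this patch is applied, and the function name is mine:

#include "postgres_fe.h"
#include "common/controldata_utils.h"
#include "common/logging.h"

static int64
get_rel_segment_size(const char *datadir)
{
    bool        crc_ok;
    ControlFileData *controlfile = get_controlfile(datadir, &crc_ok);

    if (!crc_ok)
        pg_fatal("calculated CRC checksum does not match value stored in control file");
    return (int64) controlfile->relseg_size;
}
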
From 85678257fef94aa3ca3efb39ce55fb66df7c889e Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 26 May 2023 01:41:11 +1200
Subject: [PATCH v3] Allow relation segment size to be set by initdb.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously, relation segment size was a rarely modified compile time
option.  Make it an initdb option, so that users with very large tables
can avoid using so many files and file descriptors.

The initdb option --rel-segsize is modeled on the existing --wal-segsize
option.

The data type used to store the size is int64, not BlockNumber, because
it seems reasonable to want to be able to say --rel-segsize=32TB (=
don't use segments at all), but that would overflow uint32.

It should be fairly straightforward to teach pg_upgrade (or some new
dedicated tool) to convert an existing cluster to a new segment size,
but that is not done yet, so for now this is only useful for entirely
new clusters.

The default behavior is unchanged: 1GB segments.  On Windows, we can't
go above 2GB for now (we'd have to make a lot of changes due to
Windows' small off_t).

XXX work remains to be done for incremental backups

Reviewed-by: David Steele 
Reviewed-by: Peter Eisentraut 
Reviewed-by: Stephen Frost 
Reviewed-by: Jim Mlodgenski 
Reviewed-by: Dagfinn Ilmari Mannsåker 
Reviewed-by: Pavel Stehule 
Discussion: https://postgr.es/m/CA%2BhUKG%2BBGXwMbrvzXAjL8VMGf25y_ga_XnO741g10y0%3Dm6dDiA%40mail.gmail.com
---
 configure   |  91 --
 configure.ac|  55 -
 doc/src/sgml/config.sgml|   7 +-
 doc/src/sgml/ref/initdb.sgml|  24 
 meson.build |  14 ---
 src/backend/access/transam/xlog.c   |  11 +-
 src/backend/backup/basebackup.c |   7 +-
 src/backend/backup/basebackup_incremental.c |  31 +++--
 src/backend/bootstrap/bootstrap.c   |   5 +-
 src/backend/storage/file/buffile.c  |   6 +-
 src/backend/storage/smgr/md.c   | 128 
 src/backend/storage/smgr/smgr.c |  14 +++
 src/backend/utils/misc/guc.c|  16 +++
 src/backend/utils/misc/guc_tables.c |  12 +-
 src/bin/initdb/initdb.c |  47 ++-
 src/bin/pg_checksums/pg_checksums.c |   2 +-
 src/bin/pg_combinebackup/reconstruct.c  |  18 ++-
 src/bin/pg_controldata/pg_controldata.c |   2 +-
 src/bin/pg_resetwal/pg_resetwal.c   |   4 +-
 src/bin/pg_rewind/filemap.c |   4 +-
 src/bin/pg_rewind/pg_rewind.c   |   3 +
 src/bin/pg_rewind/pg_rewind.h   |   1 +
 src/bin/pg_upgrade/relfilenumber.c  |   2 +-
 src/include/catalog/pg_control.h|   2 +-
 src/include/pg_config.h.in  |  13 --
 src/include/storage/smgr.h  |   3 +
 src/include/utils/guc_tables.h  |   1 +
 27 files changed, 249 insertions(+), 274 deletions(-)

diff --git a/configure b/configure
index 36feeafbb23..49a7f0f2c4a 100755
--- a/configure
+++ b/configure
@@ -842,8 +842,6 @@ enable_dtrace
 enable_tap_tests
 enable_injection_points
 with_blocksize
-with_segsize
-with_segsize_blocks
 with_wal_blocksize
 with_llvm
 enable_depend
@@ -1551,9 +1549,6 @@ Optional Packages:
   --with-pgport=PORTNUM   set default port number [5432]
   --with-blocksize=BLOCKSIZE
   set table block size in kB [8]
-  --with-segsize=SEGSIZE  set table segment size in GB [1]
-  --with-segsize-blocks=SEGSIZE_BLOCKS
-  set table segment size in blocks [0]
   --with-wal-blocksize=BLOCKSIZE
   set WAL block size in kB [8]
   --with-llvm build with LLVM based JIT support
@@ -3759,85 +3754,6 @@ cat >>confdefs.h <<_ACEOF
 _ACEOF
 
 
-#
-# Relation segment size
-#
-
-
-
-# Check whether --with-segsize was given.
-if test "${with_segsize+set}" = set; then :
-  withval=$with_segsize;
-  case $withval in
-yes)
-  as_fn_error $? "argument required for --with-segsize option" "$LINENO" 5
-  ;;
-no)
-  as_fn_error $? "argument required for --with-segsize option" "$LINENO" 5
-  ;;
-*)
-  segsize=$withval
-  

Re: Large files for relations

2023-07-03 Thread Thomas Munro
On Mon, Jun 12, 2023 at 8:53 PM David Steele  wrote:
> +   if (strcmp(endptr, "kB") == 0)
>
> Why kB here instead of KB to match MB, GB, TB below?

Those are SI prefixes[1], and we use kB elsewhere too.  ("K" was used
for kelvins, so they went with "k" for kilo.  Obviously these aren't
fully SI, because B is supposed to mean bel.  A gigabel would be
pretty loud... more than "sufficient power to create a black hole"[2],
hehe.)

> +   int64   relseg_size;/* blocks per segment of large 
> relation */
>
> This will require PG_CONTROL_VERSION to be bumped -- but you are
> probably waiting until commit time to avoid annoying conflicts, though I
> don't think it is as likely as with CATALOG_VERSION_NO.

Oh yeah, thanks.

> > Another
> > idea would be to make it static in md.c and call smgrsetsegmentsize(),
> > or something like that.  That could be a nice place to compute the
> > "shift" value up front, instead of computing it each time in
> > blockno_to_segno(), but that's probably not worth bothering with (?).
> > BSR/LZCNT/CLZ instructions are pretty fast on modern chips.  That's
> > about the only place where someone could say that this change makes
> > things worse for people not interested in the new feature, so I was
> > careful to get rid of / and % operations with no-longer-constant RHS.
>
> Right -- not sure we should be troubling ourselves with trying to
> optimize away ops that are very fast, unless they are computed trillions
> of times.

This obviously has some things in common with David Christensen's
nearby patch for block sizes[3], and we should be shifting and masking
there too if that route is taken (as opposed to a specialise-the-code
route or something else).  My binary-log trick is probably a little
too cute though... I should probably just go and set a shift variable.
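
Concretely, setting the shift up front could look like this (a sketch,
assuming the segment size has already been validated as a power of two;
smgrsetsegmentsize() is the name floated earlier in this thread, and
pg_leftmost_one_pos64() is from port/pg_bitutils.h):

#include "port/pg_bitutils.h"

static int64 rel_segment_size;   /* blocks per segment, a power of two */
static int   rel_segment_shift;  /* log2(rel_segment_size) */

void
smgrsetsegmentsize(int64 segment_size)
{
    Assert(segment_size > 0 &&
           (segment_size & (segment_size - 1)) == 0);
    rel_segment_size = segment_size;
    rel_segment_shift = pg_leftmost_one_pos64((uint64) segment_size);
}

static inline BlockNumber
blockno_to_segno(BlockNumber blockno)
{
    return blockno >> rel_segment_shift;
}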

Thanks for looking!

[1] https://en.wikipedia.org/wiki/Metric_prefix
[2] https://en.wiktionary.org/wiki/gigabel
[3] 
https://www.postgresql.org/message-id/flat/CAOxo6XKx7DyDgBkWwPfnGSXQYNLpNrSWtYnK6-1u%2BQHUwRa1Gg%40mail.gmail.com




Re: Large files for relations

2023-06-12 Thread David Steele

On 5/28/23 08:48, Thomas Munro wrote:


Alright, since I had some time to kill in an airport, here is a
starter patch for initdb --rel-segsize.  


I've gone through this patch and it looks pretty good to me. A few things:

+* rel_setment_size, we will truncate the K+1st segment 
to 0 length

rel_setment_size -> rel_segment_size

+* We used a phony GUC with a custome show function, because we don't

custome -> custom

+   if (strcmp(endptr, "kB") == 0)

Why kB here instead of KB to match MB, GB, TB below?

+   int64   relseg_size;/* blocks per segment of large relation 
*/

This will require PG_CONTROL_VERSION to be bumped -- but you are 
probably waiting until commit time to avoid annoying conflicts, though I 
don't think it is as likely as with CATALOG_VERSION_NO.



Some random thoughts:

Another potential option name would be --segsize, if we think we're
going to use this for temp files too eventually.


I feel like temp file segsize should be separately configurable for the 
same reason that we are leaving it as 1GB for now.



Maybe it's not so beautiful to have that global variable
rel_segment_size (which replaces REL_SEGSIZE everywhere).  


Maybe not, but it is the way these things are done in general, e.g. 
wal_segment_size, so I don't think it will be too controversial.



Another
idea would be to make it static in md.c and call smgrsetsegmentsize(),
or something like that.  That could be a nice place to compute the
"shift" value up front, instead of computing it each time in
blockno_to_segno(), but that's probably not worth bothering with (?).
BSR/LZCNT/CLZ instructions are pretty fast on modern chips.  That's
about the only place where someone could say that this change makes
things worse for people not interested in the new feature, so I was
careful to get rid of / and % operations with no-longer-constant RHS.


Right -- not sure we should be troubling ourselves with trying to 
optimize away ops that are very fast, unless they are computed trillions 
of times.



I had to promote segment size to int64 (global variable, field in
control file), because otherwise it couldn't represent
--rel-segsize=32TB (it'd be too big by one).  Other ideas would be to
store the shift value instead of the size, or store the max block
number, eg subtract one, or use InvalidBlockNumber to mean "no limit"
(with more branches to test for it).  The only problem I ran into with
the larger type was that 'SHOW segment_size' now needs a custom show
function because we don't have int64 GUCs.


A custom show function seems like a reasonable solution here.


A C type confusion problem that I noticed: some code uses BlockNumber
and some code uses int for segment numbers.  It's not really a
reachable problem for practical reasons (you'd need over 2 billion
directories and VFDs to reach it), but it's wrong to use int if
segment size can be set as low as BLCKSZ (one file per block); you
could have more segments than an int can represent.  We could go for
uint32, BlockNumber or create SegmentNumber (which I think I've
proposed before, and lost track of...).  We can address that
separately (perhaps by finding my old patch...)


I think addressing this separately is fine, though maybe enforcing some 
reasonable minimum in initdb would be a good idea for this patch. For my 
2c, SEGSIZE == BLCKSZ just makes very little sense.


Lastly, I think the blockno_to_segno(), blockno_within_segment(), and 
blockno_to_seekpos() functions add enough readability that they should 
be committed regardless of how this patch proceeds.
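
For readers following along, those helpers are roughly of this shape
(reconstructed from the discussion, not copied from the patch;
rel_segment_size is the patch's global, and the / and % forms shown
here are what the shift-and-mask versions replace):

static inline BlockNumber
blockno_to_segno(BlockNumber blockno)
{
    return (BlockNumber) (blockno / rel_segment_size);
}

static inline BlockNumber
blockno_within_segment(BlockNumber blockno)
{
    return (BlockNumber) (blockno % rel_segment_size);
}

static inline off_t
blockno_to_seekpos(BlockNumber blockno)
{
    return (off_t) BLCKSZ * blockno_within_segment(blockno);
}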


Regards,
-David




Re: Large files for relations

2023-05-30 Thread Peter Eisentraut

On 28.05.23 02:48, Thomas Munro wrote:

Another potential option name would be --segsize, if we think we're
going to use this for temp files too eventually.

Maybe it's not so beautiful to have that global variable
rel_segment_size (which replaces REL_SEGSIZE everywhere).  Another
idea would be to make it static in md.c and call smgrsetsegmentsize(),
or something like that.


I think one way to look at this is that the segment size is a 
configuration property of the md.c smgr.  I have been thinking a bit 
about how smgr-level configuration could look.  You can't use a catalog 
table, but we also can't have smgr plugins get space in pg_control.


Anyway, I'm not asking you to design this now.  A global variable via 
pg_control seems fine for now.  But it wouldn't be an smgr API call, I 
think.





Re: Large files for relations

2023-05-28 Thread Thomas Munro
On Sun, May 28, 2023 at 2:48 AM Thomas Munro  wrote:
> (you'd need over 2 billion
> directories ...

directory *entries* (segment files), I meant to write there.




Re: Large files for relations

2023-05-28 Thread Thomas Munro
On Thu, May 25, 2023 at 1:08 PM Stephen Frost  wrote:
> * Peter Eisentraut (peter.eisentr...@enterprisedb.com) wrote:
> > On 24.05.23 02:34, Thomas Munro wrote:
> > > * pg_upgrade would convert if source and target don't match
> >
> > This would be good, but it could also be an optional or later feature.
>
> Agreed.

OK.  I do have a patch for that, but I'll put that (+ copy_file_range)
aside for now so we can talk about the basic feature.  Without that,
pg_upgrade just rejects mismatching clusters as it always did, no
change required.

> > > I would probably also leave out those Windows file API changes, too.
> > > --rel-segsize would simply refuse larger sizes until someone does the
> > > work on that platform, to keep the initial proposal small.
> >
> > Those changes from off_t to pgoff_t?  Yes, it would be good to do without
> > those.  Apart from the practical problems that have been brought up, this was
> > a major annoyance with the proposed patch set IMO.

+1, it was not nice.

Alright, since I had some time to kill in an airport, here is a
starter patch for initdb --rel-segsize.  Some random thoughts:

Another potential option name would be --segsize, if we think we're
going to use this for temp files too eventually.

Maybe it's not so beautiful to have that global variable
rel_segment_size (which replaces REL_SEGSIZE everywhere).  Another
idea would be to make it static in md.c and call smgrsetsegmentsize(),
or something like that.  That could be a nice place to compute the
"shift" value up front, instead of computing it each time in
blockno_to_segno(), but that's probably not worth bothering with (?).
BSR/LZCNT/CLZ instructions are pretty fast on modern chips.  That's
about the only place where someone could say that this change makes
things worse for people not interested in the new feature, so I was
careful to get rid of / and % operations with no-longer-constant RHS.

I had to promote segment size to int64 (global variable, field in
control file), because otherwise it couldn't represent
--rel-segsize=32TB (it'd be too big by one).  Other ideas would be to
store the shift value instead of the size, or store the max block
number, eg subtract one, or use InvalidBlockNumber to mean "no limit"
(with more branches to test for it).  The only problem I ran into with
the larger type was that 'SHOW segment_size' now needs a custom show
function because we don't have int64 GUCs.
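
For illustration, a show hook of roughly this shape would do (a sketch;
the names are assumptions, and GUC show hooks return const char *):

static const char *
show_segment_size(void)
{
    static char buf[32];

    snprintf(buf, sizeof(buf), INT64_FORMAT, rel_segment_size);
    return buf;
}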

A C type confusion problem that I noticed: some code uses BlockNumber
and some code uses int for segment numbers.  It's not really a
reachable problem for practical reasons (you'd need over 2 billion
directories and VFDs to reach it), but it's wrong to use int if
segment size can be set as low as BLCKSZ (one file per block); you
could have more segments than an int can represent.  We could go for
uint32, BlockNumber or create SegmentNumber (which I think I've
proposed before, and lost track of...).  We can address that
separately (perhaps by finding my old patch...)
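
For what it's worth, the SegmentNumber idea could be as small as this
(a sketch of the proposed typedef, not an existing header):

typedef uint32 SegmentNumber;

#define InvalidSegmentNumber ((SegmentNumber) 0xFFFFFFFF)
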
From c6809aafd147d0ac286ab73c2d8fbe571c698550 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 26 May 2023 01:41:11 +1200
Subject: [PATCH 1/2] Allow relation segment size to be set by initdb.

Previously, relation segment size was a rarely modified compile time
option.  Make it an initdb option, so that users with very large tables
can avoid using so many files and file descriptors.

The initdb option --rel-segsize is modeled on the existing --wal-segsize
option.

The data type used to store the size is int64, not BlockNumber, because
it seems reasonable to want to be able to say --rel-segsize=32TB (=
don't use segments at all), but that would overflow uint32.

The default behavior is unchanged: 1GB segments.  On Windows, we can't
go above 2GB for now (we'd have to make a lot of changes due to
Windows' small off_t).

Discussion: https://postgr.es/m/CA%2BhUKG%2BBGXwMbrvzXAjL8VMGf25y_ga_XnO741g10y0%3Dm6dDiA%40mail.gmail.com

diff --git a/configure b/configure
index 1b415142d1..a3dee3ea74 100755
--- a/configure
+++ b/configure
@@ -841,8 +841,6 @@ enable_coverage
 enable_dtrace
 enable_tap_tests
 with_blocksize
-with_segsize
-with_segsize_blocks
 with_wal_blocksize
 with_CC
 with_llvm
@@ -1551,9 +1549,6 @@ Optional Packages:
   --with-pgport=PORTNUM   set default port number [5432]
   --with-blocksize=BLOCKSIZE
   set table block size in kB [8]
-  --with-segsize=SEGSIZE  set table segment size in GB [1]
-  --with-segsize-blocks=SEGSIZE_BLOCKS
-  set table segment size in blocks [0]
   --with-wal-blocksize=BLOCKSIZE
   set WAL block size in kB [8]
   --with-CC=CMD   set compiler (deprecated)
@@ -3731,85 +3726,6 @@ cat >>confdefs.h <<_ACEOF
 _ACEOF
 
 
-#
-# Relation segment size
-#
-
-
-
-# Check whether --with-segsize was given.
-if test "${with_segsize+set}" = set; then :
-  withval=$with_segsize;
-  case $withval in
-yes)
-  as_fn_error $? "argument required for 

Re: Large files for relations

2023-05-25 Thread Stephen Frost
Greetings,

* Peter Eisentraut (peter.eisentr...@enterprisedb.com) wrote:
> On 24.05.23 02:34, Thomas Munro wrote:
> > Thanks all for the feedback.  It was a nice idea and it *almost*
> > works, but it seems like we just can't drop segmented mode.  And the
> > automatic transition schemes I showed don't make much sense without
> > that goal.
> > 
> > What I'm hearing is that something simple like this might be more 
> > acceptable:
> > 
> > * initdb --rel-segsize (cf --wal-segsize), default unchanged
> 
> makes sense

Agreed, this seems alright in general.  Having more initdb-time options
to help with certain use-cases rather than having things be compile-time
is definitely just generally speaking a good direction to be going in,
imv.

> > * pg_upgrade would convert if source and target don't match
> 
> This would be good, but it could also be an optional or later feature.

Agreed.

> Maybe that should be a different mode, like --copy-and-adjust-as-necessary,
> so that users would have to opt into what would presumably be slower than
> plain --copy, rather than being surprised by it, if they unwittingly used
> incompatible initdb options.

I'm curious as to why it would be slower than a regular copy..?

> > I would probably also leave out those Windows file API changes, too.
> > --rel-segsize would simply refuse larger sizes until someone does the
> > work on that platform, to keep the initial proposal small.
> 
> Those changes from off_t to pgoff_t?  Yes, it would be good to do without
> those.  Apart from the practical problems that have been brought up, this was
> a major annoyance with the proposed patch set IMO.
> 
> > I would probably leave the experimental copy_on_write() ideas out too,
> > for separate discussion in a separate proposal.
> 
> right

You mean copy_file_range() here, right?

Shouldn't we just add support for that today into pg_upgrade,
independently of this?  Seems like a worthwhile improvement even without
the benefit it would provide to changing segment sizes.

Thanks,

Stephen




Re: Large files for relations

2023-05-24 Thread Robert Haas
On Wed, May 24, 2023 at 2:18 AM Peter Eisentraut
 wrote:
> > What I'm hearing is that something simple like this might be more 
> > acceptable:
> >
> > * initdb --rel-segsize (cf --wal-segsize), default unchanged
>
> makes sense

+1.

> > * pg_upgrade would convert if source and target don't match
>
> This would be good, but it could also be an optional or later feature.

+1. I think that would be nice to have, but not absolutely required.

IMHO it's best not to overcomplicate these projects. Not everything
needs to be part of the initial commit. If the initial commit happens
2 months from now and then stuff like this gets added over the next 8,
that's strictly better than trying to land the whole patch set next
March.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Large files for relations

2023-05-24 Thread Peter Eisentraut

On 24.05.23 02:34, Thomas Munro wrote:

Thanks all for the feedback.  It was a nice idea and it *almost*
works, but it seems like we just can't drop segmented mode.  And the
automatic transition schemes I showed don't make much sense without
that goal.

What I'm hearing is that something simple like this might be more acceptable:

* initdb --rel-segsize (cf --wal-segsize), default unchanged


makes sense


* pg_upgrade would convert if source and target don't match


This would be good, but it could also be an optional or later feature.

Maybe that should be a different mode, like 
--copy-and-adjust-as-necessary, so that users would have to opt into 
what would presumably be slower than plain --copy, rather than being 
surprised by it, if they unwittingly used incompatible initdb options.



I would probably also leave out those Windows file API changes, too.
--rel-segsize would simply refuse larger sizes until someone does the
work on that platform, to keep the initial proposal small.


Those changes from off_t to pgoff_t?  Yes, it would be good to do 
without those.  Apart from the practical problems that have been brought 
up, this was a major annoyance with the proposed patch set IMO.



I would probably leave the experimental copy_on_write() ideas out too,
for separate discussion in a separate proposal.


right





Re: Large files for relations

2023-05-23 Thread Thomas Munro
Thanks all for the feedback.  It was a nice idea and it *almost*
works, but it seems like we just can't drop segmented mode.  And the
automatic transition schemes I showed don't make much sense without
that goal.

What I'm hearing is that something simple like this might be more acceptable:

* initdb --rel-segsize (cf --wal-segsize), default unchanged
* pg_upgrade would convert if source and target don't match

I would probably also leave out those Windows file API changes, too.
--rel-segsize would simply refuse larger sizes until someone does the
work on that platform, to keep the initial proposal small.

I would probably leave the experimental copy_on_write() ideas out too,
for separate discussion in a separate proposal.




Re: Large files for relations

2023-05-15 Thread Robert Haas
On Fri, May 12, 2023 at 9:53 AM Stephen Frost  wrote:
> While I tend to agree that 1GB is too small, 1TB seems like it's
> possibly going to end up on the too big side of things, or at least,
> if we aren't getting rid of the segment code then it's possibly throwing
> away the benefits we have from the smaller segments without really
> giving us all that much.  Going from 1G to 10G would reduce the number
> of open file descriptors by quite a lot without having much of a net
> change on other things.  50G or 100G would reduce the FD handles further
> but starts to make us lose out a bit more on some of the nice parts of
> having multiple segments.

This is my view as well, more or less. I don't really like our current
handling of relation segments; we know it has bugs, and making it
non-buggy feels difficult. And there are performance issues as well --
file descriptor consumption, for sure, but also probably that crossing
a file boundary likely breaks the operating system's ability to do
readahead to some degree. However, I think we're going to find that
moving to a system where we have just one file per relation fork and
that file can be arbitrarily large is not fantastic, either. Jim's
point about running into filesystem limits is a good one (hi Jim, long
time no see!) and the problem he points out with ext4 is almost
certainly not the only one. It doesn't just have to be filesystems,
either. It could be a limitation of an archiving tool (tar, zip, cpio)
or a file copy utility or whatever as well. A quick Google search
suggests that most such things have been updated to use 64-bit sizes,
but my point is that the set of things that can potentially cause
problems is broader than just the filesystem. Furthermore, even when
there's no hard limit at play, a smaller file size can occasionally be
*convenient*, as in Pavel's example of using hard links to share
storage between backups. From that point of view, a 16GB or 64GB or
256GB file size limit seems more convenient than no limit and more
convenient than a large limit like 1TB.

However, the bugs are the flies in the ointment (ahem). If we just
make the segment size bigger but don't get rid of segments altogether,
then we still have to fix the bugs that can occur when you do have
multiple segments. I think part of Thomas's motivation is to dodge
that whole category of problems. If we gradually deprecate
multi-segment mode in favor of single-file-per-relation-fork, then the
fact that the segment handling code has bugs becomes progressively
less relevant. While that does make some sense, I'm not sure I really
agree with the approach. The problem is that we're trading problems
that we at least theoretically can fix somehow by hitting our code
with a big enough hammer for an unknown set of problems that stem from
limitations of software we don't control, maybe don't even know about.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Large files for relations

2023-05-15 Thread MARK CALLAGHAN
On Fri, May 12, 2023 at 4:02 PM Thomas Munro  wrote:

> On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN  wrote:
> > Repeating what was mentioned on Twitter, because I had some experience
> with the topic. With fewer files per table there will be more contention on
> the per-inode mutex (which might now be the per-inode rwsem). I haven't
> read filesystem source in a long time. Back in the day, and perhaps today,
> it was locked for the duration of a write to storage (locked within the
> kernel) and was briefly locked while setting up a read.
> >
> > The workaround for writes was one of:
> > 1) enable disk write cache or use battery-backed HW RAID to make writes
> faster (yes disks, I encountered this prior to 2010)
> > 2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't
> locked for the duration of a write
> >
> > I have a vague memory that filesystems have improved in this regard.
>
> (I am interpreting your "use XFS" to mean "use XFS instead of ext4".)
>

Yes, although when the decision was made it was probably ext-3 -> XFS.  We
suffered from fsync a file == fsync the filesystem
because MySQL binlogs use buffered IO and are appended on write. Switching
from ext-? to XFS was an easy perf win
so I don't have much experience with ext-? over the past decade.


> Right, 80s file systems like UFS (and I suspect ext and ext2, which
>

Late 80s is when I last hacked on Unix filesystem code, excluding browsing XFS
and ext source. Unix was easy back then -- one big kernel lock covers
everything.


> some time sooner).  Currently our code believes that it is not safe to
> call fdatasync() for files whose size might have changed.  There is no
>

Long ago we added code for InnoDB to avoid fsync/fdatasync in some cases
when O_DIRECT was used. While great for performance
we also forgot to make sure they were still done when files were extended.
Eventually we fixed that.

Thanks for all of the details.

-- 
Mark Callaghan
mdcal...@gmail.com


Re: Large files for relations

2023-05-12 Thread Thomas Munro
On Sat, May 13, 2023 at 11:01 AM Thomas Munro  wrote:
> On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN  wrote:
> > use XFS and O_DIRECT

As for direct I/O, we're only just getting started on that.  We
currently can't produce more than one concurrent WAL write, and then
for relation data, we just got very basic direct I/O support but we
haven't yet got the asynchronous machinery to drive it properly (work
in progress, more soon).  I was just now trying to find out what the
state of parallel direct writes is in ext4, and it looks like it's
finally happening:

https://www.phoronix.com/news/Linux-6.3-EXT4




Re: Large files for relations

2023-05-12 Thread Thomas Munro
On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN  wrote:
> Repeating what was mentioned on Twitter, because I had some experience with 
> the topic. With fewer files per table there will be more contention on the 
> per-inode mutex (which might now be the per-inode rwsem). I haven't read 
> filesystem source in a long time. Back in the day, and perhaps today, it was 
> locked for the duration of a write to storage (locked within the kernel) and 
> was briefly locked while setting up a read.
>
> The workaround for writes was one of:
> 1) enable disk write cache or use battery-backed HW RAID to make writes 
> faster (yes disks, I encountered this prior to 2010)
> 2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't 
> locked for the duration of a write
>
> I have a vague memory that filesystems have improved in this regard.

(I am interpreting your "use XFS" to mean "use XFS instead of ext4".)

Right, 80s file systems like UFS (and I suspect ext and ext2, which
were probably based on similar ideas and ran on non-SMP machines?)
used coarse grained locking including vnodes/inodes level.  Then over
time various OSes and file systems have improved concurrency.  Brief
digression, as someone who got started on IRIX in the 90s and still
thinks those were probably the coolest computers: At SGI, first they
replaced SysV UFS with EFS (E for extent-based allocation) and
invented O_DIRECT to skip the buffer pool, and then blew the doors off
everything with XFS, which maximised I/O concurrency and possibly (I
guess, it's not open source so who knows?) involved a revamped VFS to
lower stuff like inode locks, motivated by monster IRIX boxes with up
to 1024 CPUs and huge storage arrays.  In the Linux ext3 era, I
remember hearing lots of reports of various kinds of large systems
going faster just by switching to XFS and there is lots of writing
about that.  ext4 certainly changed enormously.  One reason back in
those days (mid 2000s?) was the old
fsync-actually-fsyncs-everything-in-the-known-universe-and-not-just-your-file
thing, and another was the lack of write concurrency especially for
direct I/O, and probably lots more things.  But that's all ancient
history...

As for ext4, we've detected and debugged clues about the gradual
weakening of locking over time on this list: we know that concurrent
read/write to the same page of a file was previously atomic, but when
we switched to pread/pwrite for most data (ie not making use of the
current file position), it ceased to be (a concurrent reader can see a
mash-up of old and new data with visible cache line-ish stripes in it,
so there isn't even a write-lock for the page); then we noticed that
in later kernels even read/write ceased to be atomic (implicating a
change in file size/file position interlocking, I guess).  I also
vaguely recall reading on here a long time ago that lseek()
performance was dramatically improved with weaker inode interlocking,
perhaps even in response to this very program's pathological SEEK_END
call frequency (something I hope to fix, but I digress).  So I think
it's possible that the effect you mentioned is gone?

I can think of a few differences compared to those other RDBMSs.
There the discussion was about one-file-per-relation vs
one-big-file-for-everything, whereas we're talking about
one-file-per-relation vs many-files-per-relation (which doesn't change
the point much, just making clear that I'm not proposing a 42PB file
to hold everything, so you can still partition to get different
files).  We also usually call fsync in series in our checkpointer
(after first getting the writebacks started with sync_file_range()
some time sooner).  Currently our code believes that it is not safe to
call fdatasync() for files whose size might have changed.  There is no
basis for that in POSIX or in any system that I currently know of
(though I haven't looked into it seriously), but I believe there was a
historical file system that at some point in history interpreted
"non-essential meta data" (the stuff POSIX allows it not to flush to
disk) to include "the size of the file" (whereas POSIX really just
meant that you don't have to synchronise the mtime and similar), which
is probably why PostgreSQL has some code that calls fsync() on newly
created empty WAL segments to "make sure the indirect blocks are down
on disk" before allowing itself to use only fdatasync() later to
overwrite it with data.  The point being that, for the most important
kind of interactive/user facing I/O latency, namely WAL flushes, we
already use fdatasync().  It's possible that we could use it to flush
relation data too (ie the relation files in question here, usually
synchronised by the checkpointer) according to POSIX but it doesn't
immediately seem like something that should be at all hot and it's
background work.  But perhaps I lack imagination.
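
To make that WAL-segment trick concrete, the pattern is roughly this
(a simplified sketch with error handling omitted, not the actual
xlog.c code):

#include <fcntl.h>
#include <unistd.h>

/* Pay for one full fsync() at creation time, so the file's size and
 * indirect blocks are durable before the file is ever reused. */
static void
create_preallocated_file(const char *path, const char *zeros, size_t size)
{
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);

    write(fd, zeros, size);
    fsync(fd);              /* flushes data and all metadata */
    close(fd);
}

/* Later overwrites never change the file's size, so the cheaper
 * fdatasync() is enough to make the new data durable. */
static void
overwrite_durably(int fd, const char *buf, size_t len, off_t offset)
{
    pwrite(fd, buf, len, offset);
    fdatasync(fd);
}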

Thanks, thought-provoking stuff.




Re: Large files for relations

2023-05-12 Thread MARK CALLAGHAN
Repeating what was mentioned on Twitter, because I had some experience with
the topic. With fewer files per table there will be more contention on the
per-inode mutex (which might now be the per-inode rwsem). I haven't read
filesystem source in a long time. Back in the day, and perhaps today, it
was locked for the duration of a write to storage (locked within the
kernel) and was briefly locked while setting up a read.

The workaround for writes was one of:
1) enable disk write cache or use battery-backed HW RAID to make writes
faster (yes disks, I encountered this prior to 2010)
2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't
locked for the duration of a write

I have a vague memory that filesystems have improved in this regard.


On Thu, May 11, 2023 at 4:38 PM Thomas Munro  wrote:

> On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski  wrote:
> > On Mon, May 1, 2023 at 9:29 PM Thomas Munro 
> wrote:
> >> I am not aware of any modern/non-historic filesystem[2] that can't do
> >> large files with ease.  Anyone know of anything to worry about on that
> >> front?
> >
> > There is some trouble in the ambiguity of what we mean by "modern" and
> "large files". There are still a large number of users of ext4 where the
> max file size is 16TB. Switching to a single large file per relation would
> effectively cut the max table size in half for those users. How would a
> user with say a 20TB table running on ext4 be impacted by this change?
>
> Hrmph.  Yeah, that might be a bit of a problem.  I see it discussed in
> various places that MySQL/InnoDB can't have tables bigger than 16TB on
> ext4 because of this, when it's in its default one-file-per-object
> mode (as opposed to its big-tablespace-files-to-hold-all-the-objects
> mode like DB2, Oracle etc, in which case I think you can have multiple
> 16TB segment files and get past that ext4 limit).  It's frustrating
> because 16TB is still really, really big and you probably should be
> using partitions, or more partitions, to avoid all kinds of other
> scalability problems at that size.  But however hypothetical the
> scenario might be, it should work, and this is certainly a plausible
> argument against the "aggressive" plan described above with the hard
> cut-off where we get to drop the segmented mode.
>
> Concretely, a 20TB pg_upgrade in copy mode would fail while trying to
> concatenate with the above patches, so you'd have to use link or
> reflink mode (you'd probably want to use that anyway due to the sheer
> volume of data to copy otherwise, since ext4 is also not capable
> of block-range sharing), but then you'd be out of luck after N future
> major releases, according to that plan where we start deleting the
> code, so you'd need to organise some smaller partitions before that
> time comes.  Or pg_upgrade to a target on xfs etc.  I wonder if a
> future version of extN will increase its max file size.
>
> A less aggressive version of the plan would be that we just keep the
> segment code for the foreseeable future with no planned cut off, and
> we make all of those "piggy back" transformations that I showed in the
> patch set optional.  For example, I had it so that CLUSTER would
> quietly convert your relation to large format, if it was still in
> segmented format (might as well if you're writing all the data out
> anyway, right?), but perhaps that could depend on a GUC.  Likewise for
> base backup.  Etc.  Then someone concerned about hitting the 16TB
> limit on ext4 could opt out.  Or something like that.  It seems funny
> though, that's exactly the user who should want this feature (they
> have 16,000 relation segment files).
>
>
>

-- 
Mark Callaghan
mdcal...@gmail.com


Re: Large files for relations

2023-05-12 Thread Stephen Frost
Greetings,

* Dagfinn Ilmari Mannsåker (ilm...@ilmari.org) wrote:
> Thomas Munro  writes:
> > On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski  wrote:
> >> On Mon, May 1, 2023 at 9:29 PM Thomas Munro  wrote:
> >>> I am not aware of any modern/non-historic filesystem[2] that can't do
> >>> large files with ease.  Anyone know of anything to worry about on that
> >>> front?
> >>
> >> There is some trouble in the ambiguity of what we mean by "modern" and
> >> "large files". There are still a large number of users of ext4 where
> >> the max file size is 16TB. Switching to a single large file per
> >> relation would effectively cut the max table size in half for those
> >> users. How would a user with say a 20TB table running on ext4 be
> >> impacted by this change?
> […]
> > A less aggressive version of the plan would be that we just keep the
> > segment code for the foreseeable future with no planned cut off, and
> > we make all of those "piggy back" transformations that I showed in the
> > patch set optional.  For example, I had it so that CLUSTER would
> > quietly convert your relation to large format, if it was still in
> > segmented format (might as well if you're writing all the data out
> > anyway, right?), but perhaps that could depend on a GUC.  Likewise for
> > base backup.  Etc.  Then someone concerned about hitting the 16TB
> > limit on ext4 could opt out.  Or something like that.  It seems funny
> > though, that's exactly the user who should want this feature (they
> > have 16,000 relation segment files).
> 
> If we're going to have to keep the segment code for the foreseeable
> future anyway, could we not get most of the benefit by increasing the
> segment size to something like 1TB?  The vast majority of tables would
> fit in one file, and there would be less risk of hitting filesystem
> limits.

While I tend to agree that 1GB is too small, 1TB seems like it's
possibly going to end up on the too big side of things, or at least,
if we aren't getting rid of the segment code then it's possibly throwing
away the benefits we have from the smaller segments without really
giving us all that much.  Going from 1G to 10G would reduce the number
of open file descriptors by quite a lot without having much of a net
change on other things.  50G or 100G would reduce the FD handles further
but starts to make us lose out a bit more on some of the nice parts of
having multiple segments.
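
Back-of-envelope, for the 20TB table used as the example upthread, the
file count per fork works out to:

    segment size    segment files
    1GB             20480
    10GB            2048
    100GB           205
    1TB             20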

Just some thoughts.

Thanks,

Stephen




Re: Large files for relations

2023-05-12 Thread Jim Mlodgenski
On Thu, May 11, 2023 at 7:38 PM Thomas Munro  wrote:

> On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski  wrote:
> > On Mon, May 1, 2023 at 9:29 PM Thomas Munro 
> wrote:
> >> I am not aware of any modern/non-historic filesystem[2] that can't do
> >> large files with ease.  Anyone know of anything to worry about on that
> >> front?
> >
> > There is some trouble in the ambiguity of what we mean by "modern" and
> "large files". There are still a large number of users of ext4 where the
> max file size is 16TB. Switching to a single large file per relation would
> effectively cut the max table size in half for those users. How would a
> user with say a 20TB table running on ext4 be impacted by this change?
>
> Hrmph.  Yeah, that might be a bit of a problem.  I see it discussed in
> various places that MySQL/InnoDB can't have tables bigger than 16TB on
> ext4 because of this, when it's in its default one-file-per-object
> mode (as opposed to its big-tablespace-files-to-hold-all-the-objects
> mode like DB2, Oracle etc, in which case I think you can have multiple
> 16TB segment files and get past that ext4 limit).  It's frustrating
> because 16TB is still really, really big and you probably should be
> using partitions, or more partitions, to avoid all kinds of other
> scalability problems at that size.  But however hypothetical the
> scenario might be, it should work,
>

Agreed, it is frustrating, but it is not hypothetical. I have seen a
number of users with single tables larger than 16TB who don't use
partitioning because of the limitations we have today. The most common
reason is needing multiple unique constraints on the table that don't
include the partition key, something like a user_id and email. There
are workarounds for those cases, but usually it's easier to deal with
a single large table than to deal with the sharp edges those
workarounds introduce.


Re: Large files for relations

2023-05-12 Thread Dagfinn Ilmari Mannsåker
Thomas Munro  writes:

> On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski  wrote:
>> On Mon, May 1, 2023 at 9:29 PM Thomas Munro  wrote:
>>> I am not aware of any modern/non-historic filesystem[2] that can't do
>>> large files with ease.  Anyone know of anything to worry about on that
>>> front?
>>
>> There is some trouble in the ambiguity of what we mean by "modern" and
>> "large files". There are still a large number of users of ext4 where
>> the max file size is 16TB. Switching to a single large file per
>> relation would effectively cut the max table size in half for those
>> users. How would a user with say a 20TB table running on ext4 be
>> impacted by this change?
[…]
> A less aggressive version of the plan would be that we just keep the
> segment code for the foreseeable future with no planned cut off, and
> we make all of those "piggy back" transformations that I showed in the
> patch set optional.  For example, I had it so that CLUSTER would
> quietly convert your relation to large format, if it was still in
> segmented format (might as well if you're writing all the data out
> anyway, right?), but perhaps that could depend on a GUC.  Likewise for
> base backup.  Etc.  Then someone concerned about hitting the 16TB
> limit on ext4 could opt out.  Or something like that.  It seems funny
> though, that's exactly the user who should want this feature (they
> have 16,000 relation segment files).

If we're going to have to keep the segment code for the foreseeable
future anyway, could we not get most of the benefit by increasing the
segment size to something like 1TB?  The vast majority of tables would
fit in one file, and there would be less risk of hitting filesystem
limits.

- ilmari




Re: Large files for relations

2023-05-11 Thread Thomas Munro
On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski  wrote:
> On Mon, May 1, 2023 at 9:29 PM Thomas Munro  wrote:
>> I am not aware of any modern/non-historic filesystem[2] that can't do
>> large files with ease.  Anyone know of anything to worry about on that
>> front?
>
> There is some trouble in the ambiguity of what we mean by "modern" and "large 
> files". There are still a large number of users of ext4 where the max file 
> size is 16TB. Switching to a single large file per relation would effectively 
> cut the max table size in half for those users. How would a user with say a 
> 20TB table running on ext4 be impacted by this change?

Hrmph.  Yeah, that might be a bit of a problem.  I see it discussed in
various places that MySQL/InnoDB can't have tables bigger than 16TB on
ext4 because of this, when it's in its default one-file-per-object
mode (as opposed to its big-tablespace-files-to-hold-all-the-objects
mode like DB2, Oracle etc, in which case I think you can have multiple
16TB segment files and get past that ext4 limit).  It's frustrating
because 16TB is still really, really big and you probably should be
using partitions, or more partitions, to avoid all kinds of other
scalability problems at that size.  But however hypothetical the
scenario might be, it should work, and this is certainly a plausible
argument against the "aggressive" plan described above with the hard
cut-off where we get to drop the segmented mode.

Concretely, a 20TB pg_upgrade in copy mode would fail while trying to
concatenate with the above patches, so you'd have to use link or
reflink mode (you'd probably want to use that anyway due to the sheer
volume of data to copy otherwise, since ext4 is also not capable
of block-range sharing), but then you'd be out of luck after N future
major releases, according to that plan where we start deleting the
code, so you'd need to organise some smaller partitions before that
time comes.  Or pg_upgrade to a target on xfs etc.  I wonder if a
future version of extN will increase its max file size.

A less aggressive version of the plan would be that we just keep the
segment code for the foreseeable future with no planned cut off, and
we make all of those "piggy back" transformations that I showed in the
patch set optional.  For example, I had it so that CLUSTER would
quietly convert your relation to large format, if it was still in
segmented format (might as well if you're writing all the data out
anyway, right?), but perhaps that could depend on a GUC.  Likewise for
base backup.  Etc.  Then someone concerned about hitting the 16TB
limit on ext4 could opt out.  Or something like that.  It seems funny
though, that's exactly the user who should want this feature (they
have 16,000 relation segment files).




Re: Large files for relations

2023-05-11 Thread Jim Mlodgenski
On Mon, May 1, 2023 at 9:29 PM Thomas Munro  wrote:

>
> I am not aware of any modern/non-historic filesystem[2] that can't do
> large files with ease.  Anyone know of anything to worry about on that
> front?


There is some trouble in the ambiguity of what we mean by "modern" and
"large files". There are still a large number of users of ext4 where the
max file size is 16TB. Switching to a single large file per relation would
effectively cut the max table size in half for those users. How would a
user with say a 20TB table running on ext4 be impacted by this change?


Re: Large files for relations

2023-05-09 Thread Stephen Frost
Greetings,

* Corey Huinker (corey.huin...@gmail.com) wrote:
> On Wed, May 3, 2023 at 1:37 AM Thomas Munro  wrote:
> > On Wed, May 3, 2023 at 5:21 PM Thomas Munro 
> > wrote:
> > > rsync --link-dest

... rsync isn't really a safe tool to use for PG backups by itself
unless you're using it with archiving and with start/stop backup and
with checksums enabled.

> > I wonder if rsync will grow a mode that can use copy_file_range() to
> > share blocks with a reference file (= previous backup).  Something
> > like --copy-range-dest.  That'd work for large-file relations
> > (assuming a file system that has block sharing, like XFS and ZFS).
> > You wouldn't get the "mtime is enough, I don't even need to read the
> > bytes" optimisation, which I assume makes all database hackers feel a
> > bit queasy anyway, but you'd get the space savings via the usual
> > rolling checksum or a cheaper version that only looks for strong
> > checksum matches at the same offset, or whatever other tricks rsync
> > might have up its sleeve.

There are also really good reasons to have multiple full backups and not
just a single full backup and then lots and lots of incrementals, which
basically boils down to "are you really sure that one copy of that one
really important file won't ever disappear from your backup
repository..?"

That said, pgbackrest does now have block-level incremental backups
(where we define our own block size ...) and there's reasons we decided
against going down the LSN-based approach (not the least of which is
that the LSN isn't always updated...), but long story short, moving to
larger than 1G files should be something that pgbackrest will be able
to handle without as much impact as there would have been previously in
terms of incremental backups.  There is a loss in the ability to use
mtime to scan just the parts of the relation that changed and that's
unfortunate but I wouldn't see it as really a game changer (and yes,
there's certainly an argument for not trusting mtime, though I don't
think we've yet had a report where there was an mtime issue that our
mtime-validity checking didn't catch and force pgbackrest into
checksum-based revalidation automatically which resulted in an invalid
backup... of course, not enough people test their backups...).

> I understand the need to reduce open file handles, despite the
> possibilities enabled by using large numbers of small file sizes.

I'm also generally in favor of reducing the number of open file handles
that we have to deal with.  Addressing the concerns raised nearby about
weird corner-cases of non-1G length ABCDEF.1 files existing while
ABCDEF.2, and more, files exist is certainly another good argument in
favor of getting rid of segments.

> I am curious whether a move like this to create a generational change in
> file format shouldn't be more ambitious, perhaps altering the block
> format to insert a block format version number, whether that be at every
> block, or every megabyte, or some other interval, and whether we store it
> in-file or in a separate file to accompany the first non-segmented. Having
> such versioning information would allow blocks of different formats to
> co-exist in the same table, which could be critical to future changes such
> as 64 bit XIDs, etc.

To the extent you're interested in this, there are patches posted which
are already trying to move us in a direction that would allow for
different page formats that add in space for other features such as
64bit XIDs, better checksums, and TDE tags to be supported.

https://commitfest.postgresql.org/43/3986/

Currently those patches are expecting it to be declared at initdb time,
but the way they're currently written that's more of a soft requirement
as you can tell on a per-page basis what features are enabled for that
page.  Might make sense to support it in that form first anyway though,
before going down the more ambitious route of allowing different pages
to have different sets of features enabled for them concurrently.

When it comes to 'a separate file', we do have forks already and those
serve a very valuable but distinct use-case where you can get
information from the much smaller fork (be it the FSM or the VM or some
future thing) while something like 64bit XIDs or a stronger checksum is
something you'd really need on every page.  I have serious doubts about
a proposal where we'd store information needed on every page read in
some far-away block that's still in the same file, such as something
every 1MB, as that would turn every block access into two.

Thanks,

Stephen




Re: Large files for relations

2023-05-09 Thread Corey Huinker
On Wed, May 3, 2023 at 1:37 AM Thomas Munro  wrote:

> On Wed, May 3, 2023 at 5:21 PM Thomas Munro 
> wrote:
> > rsync --link-dest
>
> I wonder if rsync will grow a mode that can use copy_file_range() to
> share blocks with a reference file (= previous backup).  Something
> like --copy-range-dest.  That'd work for large-file relations
> (assuming a file system that has block sharing, like XFS and ZFS).
> You wouldn't get the "mtime is enough, I don't even need to read the
> bytes" optimisation, which I assume makes all database hackers feel a
> bit queasy anyway, but you'd get the space savings via the usual
> rolling checksum or a cheaper version that only looks for strong
> checksum matches at the same offset, or whatever other tricks rsync
> might have up its sleeve.
>

I understand the need to reduce open file handles, despite the
possibilities enabled by using large numbers of small file sizes.
Snowflake, for instance, sees everything in 1MB chunks, which makes
massively parallel sequential scans (Snowflake's _only_ query plan)
possible, though I don't know if they accomplish that via separate files,
or via segments within a large file.

I am curious whether a move like this to create a generational change in
file format shouldn't be more ambitious, perhaps altering the block
format to insert a block format version number, whether that be at every
block, or every megabyte, or some other interval, and whether we store it
in-file or in a separate file to accompany the first non-segmented. Having
such versioning information would allow blocks of different formats to
co-exist in the same table, which could be critical to future changes such
as 64 bit XIDs, etc.


Re: Large files for relations

2023-05-02 Thread Thomas Munro
On Wed, May 3, 2023 at 5:21 PM Thomas Munro  wrote:
> rsync --link-dest

I wonder if rsync will grow a mode that can use copy_file_range() to
share blocks with a reference file (= previous backup).  Something
like --copy-range-dest.  That'd work for large-file relations
(assuming a file system that has block sharing, like XFS and ZFS).
You wouldn't get the "mtime is enough, I don't even need to read the
bytes" optimisation, which I assume makes all database hackers feel a
bit queasy anyway, but you'd get the space savings via the usual
rolling checksum or a cheaper version that only looks for strong
checksum matches at the same offset, or whatever other tricks rsync
might have up its sleeve.
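
For anyone who hasn't used it, the call in question has this shape; a
sketch of copying one range from a reference file (say, the previous
backup) so that a sharing-capable filesystem can avoid duplicating
storage (the wrapper name is mine):

#define _GNU_SOURCE
#include <unistd.h>

static ssize_t
share_range(int ref_fd, int out_fd, off_t offset, size_t len)
{
    off_t in_off = offset;
    off_t out_off = offset;

    /* On XFS/ZFS/btrfs the kernel may reflink instead of copying. */
    return copy_file_range(ref_fd, &in_off, out_fd, &out_off, len, 0);
}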




Re: Large files for relations

2023-05-02 Thread Thomas Munro
On Tue, May 2, 2023 at 3:28 PM Pavel Stehule  wrote:
> I like this patch - it can save some system resources - I am not sure how
> much, because bigger tables usually use partitioning.

Yeah, if you only use partitions of < 1GB it won't make a difference.
Larger partitions are not uncommon, though.

> Important note - this feature breaks sharing files on the backup side - so 
> before disabling 1GB sized files, this issue should be solved.

Hmm, right, so there is a backup granularity continuum with "whole
database cluster" at one end, "only files whose size, mtime [or
optionally also checksum] changed since last backup" in the middle,
and "only blocks that changed since LSN of last backup" at the other
end.  Getting closer to the right end of that continuum can make
backups require less reading, less network transfer, less writing
and/or less storage space depending on details.  But this proposal
moves the middle thing further to the left by changing the granularity
from 1GB to whole relation, which can be gargantuan with this patch.
Ultimately we need to be all the way at the right on that continuum,
and there are clearly several people working on that goal.

I'm not involved in any of those projects, but it's fun to think about
an alien technology that produces complete standalone backups like
rsync --link-dest (as opposed to "full" backups followed by a chain of
"incremental" backups that depend on it so you need to retain them
carefully) while still sharing disk blocks with older backups, and
doing so with block granularity.  TL;DW something something WAL
something something copy_file_range().




Re: Large files for relations

2023-05-01 Thread Pavel Stehule
Hi

I like this patch - it can save some system resources - I am not sure how
much, because bigger tables usually use partitioning.

Important note - this feature breaks sharing files on the backup side - so
before disabling 1GB sized files, this issue should be solved.

Regards

Pavel