Re: Large files for relations
On 06.03.24 22:54, Thomas Munro wrote:
> Rebased. I had intended to try to get this into v17, but a couple of unresolved problems came up while rebasing over the new incremental backup stuff. You snooze, you lose. Hopefully we can sort these out in time for the next commitfest:
> * should pg_combinebasebackup read the control file to fetch the segment size?
> * hunt for other segment-size related problems that may be lurking in the new incremental backup stuff
> * basebackup_incremental.c wants to use memory in proportion to segment size, which looks like a problem, and I wrote about that in a new thread[1]

Overall, I like this idea, and the patch seems to have many bases covered. The patch will need a rebase. I was able to test it on master@{2024-03-13}, but after that there are conflicts.

In .cirrus.tasks.yml, one of the test tasks uses --with-segsize-blocks=6, but you are removing that option. You could replace that with something like

    PG_TEST_INITDB_EXTRA_OPTS='--rel-segsize=48kB'

But that won't work exactly, because

    initdb: error: argument of --rel-segsize must be a power of two

I suppose that's ok as a change, since it makes the arithmetic more efficient. But maybe it should be called out explicitly in the commit message.

If I run it with 64kB, the test pgbench/001_pgbench_with_server fails consistently, so it seems there is still a gap somewhere.

A minor point: the initdb error message

    initdb: error: argument of --rel-segsize must be a multiple of BLCKSZ

would be friendlier if it actually showed the value of the block size instead of just the symbol. Similarly for the nearby error message about the off_t size.

In the control file, all the other fields use unsigned types. Should relseg_size be uint64?

PG_CONTROL_VERSION needs to be changed.
Re: Large files for relations
Rebased. I had intended to try to get this into v17, but a couple of unresolved problems came up while rebasing over the new incremental backup stuff. You snooze, you lose. Hopefully we can sort these out in time for the next commitfest:

* should pg_combinebasebackup read the control file to fetch the segment size?
* hunt for other segment-size related problems that may be lurking in the new incremental backup stuff
* basebackup_incremental.c wants to use memory in proportion to segment size, which looks like a problem, and I wrote about that in a new thread[1]

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2B2hZ0sBztPW4mkLfng0qfkNtAHFUfxOMLizJ0BPmi5%2Bg%40mail.gmail.com

From 85678257fef94aa3ca3efb39ce55fb66df7c889e Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Fri, 26 May 2023 01:41:11 +1200
Subject: [PATCH v3] Allow relation segment size to be set by initdb.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously, relation segment size was a rarely modified compile time option. Make it an initdb option, so that users with very large tables can avoid using so many files and file descriptors. The initdb option --rel-segsize is modeled on the existing --wal-segsize option.

The data type used to store the size is int64, not BlockNumber, because it seems reasonable to want to be able to say --rel-segsize=32TB (= don't use segments at all), but that would overflow uint32.

It should be fairly straightforward to teach pg_upgrade (or some new dedicated tool) to convert an existing cluster to a new segment size, but that is not done yet, so for now this is only useful for entirely new clusters.

The default behavior is unchanged: 1GB segments. On Windows, we can't go above 2GB for now (we'd have to make a lot of changes due to Windows' small off_t).
XXX work remains to be done for incremental backups Reviewed-by: David Steele Reviewed-by: Peter Eisentraut Reviewed-by: Stephen Frost Reviewed-by: Jim Mlodgenski Reivewed-by: Dagfinn Ilmari Mannsåker Reviewed-by: Pavel Stehule Discussion: https://postgr.es/m/CA%2BhUKG%2BBGXwMbrvzXAjL8VMGf25y_ga_XnO741g10y0%3Dm6dDiA%40mail.gmail.com --- configure | 91 -- configure.ac| 55 - doc/src/sgml/config.sgml| 7 +- doc/src/sgml/ref/initdb.sgml| 24 meson.build | 14 --- src/backend/access/transam/xlog.c | 11 +- src/backend/backup/basebackup.c | 7 +- src/backend/backup/basebackup_incremental.c | 31 +++-- src/backend/bootstrap/bootstrap.c | 5 +- src/backend/storage/file/buffile.c | 6 +- src/backend/storage/smgr/md.c | 128 src/backend/storage/smgr/smgr.c | 14 +++ src/backend/utils/misc/guc.c| 16 +++ src/backend/utils/misc/guc_tables.c | 12 +- src/bin/initdb/initdb.c | 47 ++- src/bin/pg_checksums/pg_checksums.c | 2 +- src/bin/pg_combinebackup/reconstruct.c | 18 ++- src/bin/pg_controldata/pg_controldata.c | 2 +- src/bin/pg_resetwal/pg_resetwal.c | 4 +- src/bin/pg_rewind/filemap.c | 4 +- src/bin/pg_rewind/pg_rewind.c | 3 + src/bin/pg_rewind/pg_rewind.h | 1 + src/bin/pg_upgrade/relfilenumber.c | 2 +- src/include/catalog/pg_control.h| 2 +- src/include/pg_config.h.in | 13 -- src/include/storage/smgr.h | 3 + src/include/utils/guc_tables.h | 1 + 27 files changed, 249 insertions(+), 274 deletions(-) diff --git a/configure b/configure index 36feeafbb23..49a7f0f2c4a 100755 --- a/configure +++ b/configure @@ -842,8 +842,6 @@ enable_dtrace enable_tap_tests enable_injection_points with_blocksize -with_segsize -with_segsize_blocks with_wal_blocksize with_llvm enable_depend @@ -1551,9 +1549,6 @@ Optional Packages: --with-pgport=PORTNUM set default port number [5432] --with-blocksize=BLOCKSIZE set table block size in kB [8] - --with-segsize=SEGSIZE set table segment size in GB [1] - --with-segsize-blocks=SEGSIZE_BLOCKS - set table segment size in blocks [0] --with-wal-blocksize=BLOCKSIZE set WAL 
block size in kB [8] --with-llvm build with LLVM based JIT support @@ -3759,85 +3754,6 @@ cat >>confdefs.h <<_ACEOF _ACEOF -# -# Relation segment size -# - - - -# Check whether --with-segsize was given. -if test "${with_segsize+set}" = set; then : - withval=$with_segsize; - case $withval in -yes) - as_fn_error $? "argument required for --with-segsize option" "$LINENO" 5 - ;; -no) - as_fn_error $? "argument required for --with-segsize option" "$LINENO" 5 - ;; -*) - segsize=$withval -
Re: Large files for relations
On Mon, Jun 12, 2023 at 8:53 PM David Steele wrote:
> + if (strcmp(endptr, "kB") == 0)
>
> Why kB here instead of KB to match MB, GB, TB below?

Those are SI prefixes[1], and we use kB elsewhere too. ("K" was used for kelvins, so they went with "k" for kilo. Obviously these aren't fully SI, because B is supposed to mean bel. A gigabel would be pretty loud... more than "sufficient power to create a black hole"[2], hehe.)

> + int64 relseg_size;    /* blocks per segment of large relation */
>
> This will require PG_CONTROL_VERSION to be bumped -- but you are probably waiting until commit time to avoid annoying conflicts, though I don't think it is as likely as with CATALOG_VERSION_NO.

Oh yeah, thanks.

> > Another idea would be to make it static in md.c and call smgrsetsegmentsize(), or something like that. That could be a nice place to compute the "shift" value up front, instead of computing it each time in blockno_to_segno(), but that's probably not worth bothering with (?). BSR/LZCNT/CLZ instructions are pretty fast on modern chips. That's about the only place where someone could say that this change makes things worse for people not interested in the new feature, so I was careful to get rid of / and % operations with no-longer-constant RHS.
>
> Right -- not sure we should be troubling ourselves with trying to optimize away ops that are very fast, unless they are computed trillions of times.

This obviously has some things in common with David Christensen's nearby patch for block sizes[3], and we should be shifting and masking there too if that route is taken (as opposed to a specialise-the-code route or something else). My binary-log trick is probably a little too cute though... I should probably just go and set a shift variable.

Thanks for looking!
[1] https://en.wikipedia.org/wiki/Metric_prefix [2] https://en.wiktionary.org/wiki/gigabel [3] https://www.postgresql.org/message-id/flat/CAOxo6XKx7DyDgBkWwPfnGSXQYNLpNrSWtYnK6-1u%2BQHUwRa1Gg%40mail.gmail.com
Re: Large files for relations
On 5/28/23 08:48, Thomas Munro wrote:
> Alright, since I had some time to kill in an airport, here is a starter patch for initdb --rel-segsize.

I've gone through this patch and it looks pretty good to me. A few things:

+ * rel_setment_size, we will truncate the K+1st segment to 0 length

rel_setment_size -> rel_segment_size

+ * We used a phony GUC with a custome show function, because we don't

custome -> custom

+ if (strcmp(endptr, "kB") == 0)

Why kB here instead of KB to match MB, GB, TB below?

+ int64 relseg_size;    /* blocks per segment of large relation */

This will require PG_CONTROL_VERSION to be bumped -- but you are probably waiting until commit time to avoid annoying conflicts, though I don't think it is as likely as with CATALOG_VERSION_NO.

Some random thoughts:

> Another potential option name would be --segsize, if we think we're going to use this for temp files too eventually.

I feel like temp file segsize should be separately configurable for the same reason that we are leaving it as 1GB for now.

> Maybe it's not so beautiful to have that global variable rel_segment_size (which replaces REL_SEGSIZE everywhere).

Maybe not, but it is the way these things are done in general, e.g. wal_segment_size, so I don't think it will be too controversial.

> Another idea would be to make it static in md.c and call smgrsetsegmentsize(), or something like that. That could be a nice place to compute the "shift" value up front, instead of computing it each time in blockno_to_segno(), but that's probably not worth bothering with (?). BSR/LZCNT/CLZ instructions are pretty fast on modern chips. That's about the only place where someone could say that this change makes things worse for people not interested in the new feature, so I was careful to get rid of / and % operations with no-longer-constant RHS.

Right -- not sure we should be troubling ourselves with trying to optimize away ops that are very fast, unless they are computed trillions of times.
> I had to promote segment size to int64 (global variable, field in control file), because otherwise it couldn't represent --rel-segsize=32TB (it'd be too big by one). Other ideas would be to store the shift value instead of the size, or store the max block number, e.g. subtract one, or use InvalidBlockNumber to mean "no limit" (with more branches to test for it). The only problem I ran into with the larger type was that 'SHOW segment_size' now needs a custom show function because we don't have int64 GUCs.

A custom show function seems like a reasonable solution here.

> A C type confusion problem that I noticed: some code uses BlockNumber and some code uses int for segment numbers. It's not really a reachable problem for practical reasons (you'd need over 2 billion directories and VFDs to reach it), but it's wrong to use int if segment size can be set as low as BLCKSZ (one file per block); you could have more segments than an int can represent. We could go for uint32, BlockNumber or create SegmentNumber (which I think I've proposed before, and lost track of...). We can address that separately (perhaps by finding my old patch...)

I think addressing this separately is fine, though maybe enforcing some reasonable minimum in initdb would be a good idea for this patch. For my 2c, SEGSIZE == BLOCKSZ just makes very little sense.

Lastly, I think the blockno_to_segno(), blockno_within_segment(), and blockno_to_seekpos() functions add enough readability that they should be committed regardless of how this patch proceeds.

Regards,
-David
Re: Large files for relations
On 28.05.23 02:48, Thomas Munro wrote:
> Another potential option name would be --segsize, if we think we're going to use this for temp files too eventually.
>
> Maybe it's not so beautiful to have that global variable rel_segment_size (which replaces REL_SEGSIZE everywhere). Another idea would be to make it static in md.c and call smgrsetsegmentsize(), or something like that.

I think one way to look at this is that the segment size is a configuration property of the md.c smgr. I have been thinking a bit about how smgr-level configuration could look. You can't use a catalog table, but we also can't have smgr plugins get space in pg_control. Anyway, I'm not asking you to design this now. A global variable via pg_control seems fine for now. But it wouldn't be an smgr API call, I think.
Re: Large files for relations
On Sun, May 28, 2023 at 2:48 AM Thomas Munro wrote: > (you'd need over 2 billion > directories ... directory *entries* (segment files), I meant to write there.
Re: Large files for relations
On Thu, May 25, 2023 at 1:08 PM Stephen Frost wrote:
> * Peter Eisentraut (peter.eisentr...@enterprisedb.com) wrote:
> > On 24.05.23 02:34, Thomas Munro wrote:
> > > * pg_upgrade would convert if source and target don't match
> >
> > This would be good, but it could also be an optional or later feature.
>
> Agreed.

OK. I do have a patch for that, but I'll put that (+ copy_file_range) aside for now so we can talk about the basic feature. Without that, pg_upgrade just rejects mismatching clusters as it always did, no change required.

> > > I would probably also leave out those Windows file API changes, too. --rel-segsize would simply refuse larger sizes until someone does the work on that platform, to keep the initial proposal small.
> >
> > Those changes from off_t to pgoff_t? Yes, it would be good to do without those. Apart from the practical problems that have been brought up, this was a major annoyance with the proposed patch set IMO.

+1, it was not nice.

Alright, since I had some time to kill in an airport, here is a starter patch for initdb --rel-segsize. Some random thoughts:

Another potential option name would be --segsize, if we think we're going to use this for temp files too eventually.

Maybe it's not so beautiful to have that global variable rel_segment_size (which replaces REL_SEGSIZE everywhere). Another idea would be to make it static in md.c and call smgrsetsegmentsize(), or something like that. That could be a nice place to compute the "shift" value up front, instead of computing it each time in blockno_to_segno(), but that's probably not worth bothering with (?). BSR/LZCNT/CLZ instructions are pretty fast on modern chips. That's about the only place where someone could say that this change makes things worse for people not interested in the new feature, so I was careful to get rid of / and % operations with no-longer-constant RHS.
I had to promote segment size to int64 (global variable, field in control file), because otherwise it couldn't represent --rel-segsize=32TB (it'd be too big by one). Other ideas would be to store the shift value instead of the size, or store the max block number, e.g. subtract one, or use InvalidBlockNumber to mean "no limit" (with more branches to test for it). The only problem I ran into with the larger type was that 'SHOW segment_size' now needs a custom show function because we don't have int64 GUCs.

A C type confusion problem that I noticed: some code uses BlockNumber and some code uses int for segment numbers. It's not really a reachable problem for practical reasons (you'd need over 2 billion directories and VFDs to reach it), but it's wrong to use int if segment size can be set as low as BLCKSZ (one file per block); you could have more segments than an int can represent. We could go for uint32, BlockNumber or create SegmentNumber (which I think I've proposed before, and lost track of...). We can address that separately (perhaps by finding my old patch...)

From c6809aafd147d0ac286ab73c2d8fbe571c698550 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Fri, 26 May 2023 01:41:11 +1200
Subject: [PATCH 1/2] Allow relation segment size to be set by initdb.

Previously, relation segment size was a rarely modified compile time option. Make it an initdb option, so that users with very large tables can avoid using so many files and file descriptors. The initdb option --rel-segsize is modeled on the existing --wal-segsize option.

The data type used to store the size is int64, not BlockNumber, because it seems reasonable to want to be able to say --rel-segsize=32TB (= don't use segments at all), but that would overflow uint32.

The default behavior is unchanged: 1GB segments. On Windows, we can't go above 2GB for now (we'd have to make a lot of changes due to Windows' small off_t).
Discussion: https://postgr.es/m/CA%2BhUKG%2BBGXwMbrvzXAjL8VMGf25y_ga_XnO741g10y0%3Dm6dDiA%40mail.gmail.com diff --git a/configure b/configure index 1b415142d1..a3dee3ea74 100755 --- a/configure +++ b/configure @@ -841,8 +841,6 @@ enable_coverage enable_dtrace enable_tap_tests with_blocksize -with_segsize -with_segsize_blocks with_wal_blocksize with_CC with_llvm @@ -1551,9 +1549,6 @@ Optional Packages: --with-pgport=PORTNUM set default port number [5432] --with-blocksize=BLOCKSIZE set table block size in kB [8] - --with-segsize=SEGSIZE set table segment size in GB [1] - --with-segsize-blocks=SEGSIZE_BLOCKS - set table segment size in blocks [0] --with-wal-blocksize=BLOCKSIZE set WAL block size in kB [8] --with-CC=CMD set compiler (deprecated) @@ -3731,85 +3726,6 @@ cat >>confdefs.h <<_ACEOF _ACEOF -# -# Relation segment size -# - - - -# Check whether --with-segsize was given. -if test "${with_segsize+set}" = set; then : - withval=$with_segsize; - case $withval in -yes) - as_fn_error $? "argument required for
Re: Large files for relations
Greetings,

* Peter Eisentraut (peter.eisentr...@enterprisedb.com) wrote:
> On 24.05.23 02:34, Thomas Munro wrote:
> > Thanks all for the feedback. It was a nice idea and it *almost* works, but it seems like we just can't drop segmented mode. And the automatic transition schemes I showed don't make much sense without that goal.
> >
> > What I'm hearing is that something simple like this might be more acceptable:
> >
> > * initdb --rel-segsize (cf --wal-segsize), default unchanged
>
> makes sense

Agreed, this seems alright in general. Having more initdb-time options to help with certain use-cases rather than having things be compile-time is definitely just generally speaking a good direction to be going in, imv.

> > * pg_upgrade would convert if source and target don't match
>
> This would be good, but it could also be an optional or later feature.

Agreed.

> Maybe that should be a different mode, like --copy-and-adjust-as-necessary, so that users would have to opt into what would presumably be slower than plain --copy, rather than being surprised by it, if they unwittingly used incompatible initdb options.

I'm curious as to why it would be slower than a regular copy..?

> > I would probably also leave out those Windows file API changes, too. --rel-segsize would simply refuse larger sizes until someone does the work on that platform, to keep the initial proposal small.
>
> Those changes from off_t to pgoff_t? Yes, it would be good to do without those. Apart from the practical problems that have been brought up, this was a major annoyance with the proposed patch set IMO.

> > I would probably leave the experimental copy_on_write() ideas out too, for separate discussion in a separate proposal.
>
> right

You mean copy_file_range() here, right? Shouldn't we just add support for that today into pg_upgrade, independently of this? Seems like a worthwhile improvement even without the benefit it would provide to changing segment sizes.
Thanks,

Stephen
Re: Large files for relations
On Wed, May 24, 2023 at 2:18 AM Peter Eisentraut wrote:
> > What I'm hearing is that something simple like this might be more acceptable:
> >
> > * initdb --rel-segsize (cf --wal-segsize), default unchanged
>
> makes sense

+1.

> > * pg_upgrade would convert if source and target don't match
>
> This would be good, but it could also be an optional or later feature.

+1. I think that would be nice to have, but not absolutely required. IMHO it's best not to overcomplicate these projects. Not everything needs to be part of the initial commit. If the initial commit happens 2 months from now and then stuff like this gets added over the next 8, that's strictly better than trying to land the whole patch set next March.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Large files for relations
On 24.05.23 02:34, Thomas Munro wrote:
> Thanks all for the feedback. It was a nice idea and it *almost* works, but it seems like we just can't drop segmented mode. And the automatic transition schemes I showed don't make much sense without that goal.
>
> What I'm hearing is that something simple like this might be more acceptable:
>
> * initdb --rel-segsize (cf --wal-segsize), default unchanged

makes sense

> * pg_upgrade would convert if source and target don't match

This would be good, but it could also be an optional or later feature. Maybe that should be a different mode, like --copy-and-adjust-as-necessary, so that users would have to opt into what would presumably be slower than plain --copy, rather than being surprised by it, if they unwittingly used incompatible initdb options.

> I would probably also leave out those Windows file API changes, too. --rel-segsize would simply refuse larger sizes until someone does the work on that platform, to keep the initial proposal small.

Those changes from off_t to pgoff_t? Yes, it would be good to do without those. Apart from the practical problems that have been brought up, this was a major annoyance with the proposed patch set IMO.

> I would probably leave the experimental copy_on_write() ideas out too, for separate discussion in a separate proposal.

right
Re: Large files for relations
Thanks all for the feedback. It was a nice idea and it *almost* works, but it seems like we just can't drop segmented mode. And the automatic transition schemes I showed don't make much sense without that goal.

What I'm hearing is that something simple like this might be more acceptable:

* initdb --rel-segsize (cf --wal-segsize), default unchanged
* pg_upgrade would convert if source and target don't match

I would probably also leave out those Windows file API changes, too. --rel-segsize would simply refuse larger sizes until someone does the work on that platform, to keep the initial proposal small.

I would probably leave the experimental copy_on_write() ideas out too, for separate discussion in a separate proposal.
Re: Large files for relations
On Fri, May 12, 2023 at 9:53 AM Stephen Frost wrote:
> While I tend to agree that 1GB is too small, 1TB seems like it's possibly going to end up on the too big side of things, or at least, if we aren't getting rid of the segment code then it's possibly throwing away the benefits we have from the smaller segments without really giving us all that much. Going from 1G to 10G would reduce the number of open file descriptors by quite a lot without having much of a net change on other things. 50G or 100G would reduce the FD handles further but starts to make us lose out a bit more on some of the nice parts of having multiple segments.

This is my view as well, more or less. I don't really like our current handling of relation segments; we know it has bugs, and making it non-buggy feels difficult. And there are performance issues as well -- file descriptor consumption, for sure, but also probably that crossing a file boundary likely breaks the operating system's ability to do readahead to some degree.

However, I think we're going to find that moving to a system where we have just one file per relation fork and that file can be arbitrarily large is not fantastic, either. Jim's point about running into filesystem limits is a good one (hi Jim, long time no see!) and the problem he points out with ext4 is almost certainly not the only one. It doesn't just have to be filesystems, either. It could be a limitation of an archiving tool (tar, zip, cpio) or a file copy utility or whatever as well. A quick Google search suggests that most such things have been updated to use 64-bit sizes, but my point is that the set of things that can potentially cause problems is broader than just the filesystem. Furthermore, even when there's no hard limit at play, a smaller file size can occasionally be *convenient*, as in Pavel's example of using hard links to share storage between backups.

From that point of view, a 16GB or 64GB or 256GB file size limit seems more convenient than no limit and more convenient than a large limit like 1TB. However, the bugs are the flies in the ointment (ahem). If we just make the segment size bigger but don't get rid of segments altogether, then we still have to fix the bugs that can occur when you do have multiple segments. I think part of Thomas's motivation is to dodge that whole category of problems. If we gradually deprecate multi-segment mode in favor of single-file-per-relation-fork, then the fact that the segment handling code has bugs becomes progressively less relevant. While that does make some sense, I'm not sure I really agree with the approach. The problem is that we're trading problems that we at least theoretically can fix somehow by hitting our code with a big enough hammer for an unknown set of problems that stem from limitations of software we don't control, maybe don't even know about.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Large files for relations
On Fri, May 12, 2023 at 4:02 PM Thomas Munro wrote:
> On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN wrote:
> > Repeating what was mentioned on Twitter, because I had some experience with the topic. With fewer files per table there will be more contention on the per-inode mutex (which might now be the per-inode rwsem). I haven't read filesystem source in a long time. Back in the day, and perhaps today, it was locked for the duration of a write to storage (locked within the kernel) and was briefly locked while setting up a read.
> >
> > The workaround for writes was one of:
> > 1) enable disk write cache or use battery-backed HW RAID to make writes faster (yes disks, I encountered this prior to 2010)
> > 2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't locked for the duration of a write
> >
> > I have a vague memory that filesystems have improved in this regard.
>
> (I am interpreting your "use XFS" to mean "use XFS instead of ext4".)

Yes, although when the decision was made it was probably ext-3 -> XFS. We suffered from fsync a file == fsync the filesystem because MySQL binlogs use buffered IO and are appended on write. Switching from ext-? to XFS was an easy perf win so I don't have much experience with ext-? over the past decade.

> Right, 80s file systems like UFS (and I suspect ext and ext2, which

Late 80s is when I last hacked on Unix filesystem code, excluding browsing XFS and ext source. Unix was easy back then -- one big kernel lock covers everything.

> some time sooner). Currently our code believes that it is not safe to call fdatasync() for files whose size might have changed. There is no

Long ago we added code for InnoDB to avoid fsync/fdatasync in some cases when O_DIRECT was used. While great for performance we also forgot to make sure they were still done when files were extended. Eventually we fixed that.

Thanks for all of the details.

--
Mark Callaghan
mdcal...@gmail.com
Re: Large files for relations
On Sat, May 13, 2023 at 11:01 AM Thomas Munro wrote: > On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN wrote: > > use XFS and O_DIRECT As for direct I/O, we're only just getting started on that. We currently can't produce more than one concurrent WAL write, and then for relation data, we just got very basic direct I/O support but we haven't yet got the asynchronous machinery to drive it properly (work in progress, more soon). I was just now trying to find out what the state of parallel direct writes is in ext4, and it looks like it's finally happening: https://www.phoronix.com/news/Linux-6.3-EXT4
Re: Large files for relations
On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN wrote:
> Repeating what was mentioned on Twitter, because I had some experience with the topic. With fewer files per table there will be more contention on the per-inode mutex (which might now be the per-inode rwsem). I haven't read filesystem source in a long time. Back in the day, and perhaps today, it was locked for the duration of a write to storage (locked within the kernel) and was briefly locked while setting up a read.
>
> The workaround for writes was one of:
> 1) enable disk write cache or use battery-backed HW RAID to make writes faster (yes disks, I encountered this prior to 2010)
> 2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't locked for the duration of a write
>
> I have a vague memory that filesystems have improved in this regard.

(I am interpreting your "use XFS" to mean "use XFS instead of ext4".)

Right, 80s file systems like UFS (and I suspect ext and ext2, which were probably based on similar ideas and ran on non-SMP machines?) used coarse grained locking including vnodes/inodes level. Then over time various OSes and file systems have improved concurrency.

Brief digression, as someone who got started on IRIX in the '90s and still thinks those were probably the coolest computers: At SGI, first they replaced SysV UFS with EFS (E for extent-based allocation) and invented O_DIRECT to skip the buffer pool, and then blew the doors off everything with XFS, which maximised I/O concurrency and possibly (I guess, it's not open source so who knows?) involved a revamped VFS to lower stuff like inode locks, motivated by monster IRIX boxes with up to 1024 CPUs and huge storage arrays. In the Linux ext3 era, I remember hearing lots of reports of various kinds of large systems going faster just by switching to XFS and there is lots of writing about that. ext4 certainly changed enormously. One reason back in those days (mid 2000s?) was the old fsync-actually-fsyncs-everything-in-the-known-universe-and-not-just-your-file thing, and another was the lack of write concurrency especially for direct I/O, and probably lots more things. But that's all ancient history...

As for ext4, we've detected and debugged clues about the gradual weakening of locking over time on this list: we know that concurrent read/write to the same page of a file was previously atomic, but when we switched to pread/pwrite for most data (ie not making use of the current file position), it ceased to be (a concurrent reader can see a mash-up of old and new data with visible cache line-ish stripes in it, so there isn't even a write-lock for the page); then we noticed that in later kernels even read/write ceased to be atomic (implicating a change in file size/file position interlocking, I guess). I also vaguely recall reading on here a long time ago that lseek() performance was dramatically improved with weaker inode interlocking, perhaps even in response to this very program's pathological SEEK_END call frequency (something I hope to fix, but I digress). So I think it's possible that the effect you mentioned is gone?

I can think of a few differences compared to those other RDBMSs. There the discussion was about one-file-per-relation vs one-big-file-for-everything, whereas we're talking about one-file-per-relation vs many-files-per-relation (which doesn't change the point much, just making clear that I'm not proposing a 42PB file to hold everything, so you can still partition to get different files). We also usually call fsync in series in our checkpointer (after first getting the writebacks started with sync_file_range() some time sooner). Currently our code believes that it is not safe to call fdatasync() for files whose size might have changed.
There is no basis for that in POSIX or in any system that I currently know of (though I haven't looked into it seriously), but I believe there was a historical file system that at some point in history interpreted "non-essential meta data" (the stuff POSIX allows it not to flush to disk) to include "the size of the file" (whereas POSIX really just meant that you don't have to synchronise the mtime and similar), which is probably why PostgreSQL has some code that calls fsync() on newly created empty WAL segments to "make sure the indirect blocks are down on disk" before allowing itself to use only fdatasync() later to overwrite it with data. The point being that, for the most important kind of interactive/user facing I/O latency, namely WAL flushes, we already use fdatasync(). It's possible that we could use it to flush relation data too (ie the relation files in question here, usually synchronised by the checkpointer) according to POSIX but it doesn't immediately seem like something that should be at all hot and it's background work. But perhaps I lack imagination. Thanks, thought-provoking stuff.
Re: Large files for relations
Repeating what was mentioned on Twitter, because I had some experience with the topic. With fewer files per table there will be more contention on the per-inode mutex (which might now be the per-inode rwsem). I haven't read filesystem source in a long time. Back in the day, and perhaps today, it was locked for the duration of a write to storage (locked within the kernel) and was briefly locked while setting up a read. The workaround for writes was one of: 1) enable disk write cache or use battery-backed HW RAID to make writes faster (yes disks, I encountered this prior to 2010) 2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't locked for the duration of a write I have a vague memory that filesystems have improved in this regard. On Thu, May 11, 2023 at 4:38 PM Thomas Munro wrote: > On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski wrote: > > On Mon, May 1, 2023 at 9:29 PM Thomas Munro > wrote: > >> I am not aware of any modern/non-historic filesystem[2] that can't do > >> large files with ease. Anyone know of anything to worry about on that > >> front? > > > > There is some trouble in the ambiguity of what we mean by "modern" and > "large files". There are still a large number of users of ext4 where the > max file size is 16TB. Switching to a single large file per relation would > effectively cut the max table size in half for those users. How would a > user with say a 20TB table running on ext4 be impacted by this change? > > Hrmph. Yeah, that might be a bit of a problem. I see it discussed in > various places that MySQL/InnoDB can't have tables bigger than 16TB on > ext4 because of this, when it's in its default one-file-per-object > mode (as opposed to its big-tablespace-files-to-hold-all-the-objects > mode like DB2, Oracle etc, in which case I think you can have multiple > 16TB segment files and get past that ext4 limit). 
It's frustrating > because 16TB is still really, really big and you probably should be > using partitions, or more partitions, to avoid all kinds of other > scalability problems at that size. But however hypothetical the > scenario might be, it should work, and this is certainly a plausible > argument against the "aggressive" plan described above with the hard > cut-off where we get to drop the segmented mode. > > Concretely, a 20TB pg_upgrade in copy mode would fail while trying to > concatenate with the above patches, so you'd have to use link or > reflink mode (you'd probably want to use that anyway unless due to > sheer volume of data to copy otherwise, since ext4 is also not capable > of block-range sharing), but then you'd be out of luck after N future > major releases, according to that plan where we start deleting the > code, so you'd need to organise some smaller partitions before that > time comes. Or pg_upgrade to a target on xfs etc. I wonder if a > future version of extN will increase its max file size. > > A less aggressive version of the plan would be that we just keep the > segment code for the foreseeable future with no planned cut off, and > we make all of those "piggy back" transformations that I showed in the > patch set optional. For example, I had it so that CLUSTER would > quietly convert your relation to large format, if it was still in > segmented format (might as well if you're writing all the data out > anyway, right?), but perhaps that could depend on a GUC. Likewise for > base backup. Etc. Then someone concerned about hitting the 16TB > limit on ext4 could opt out. Or something like that. It seems funny > though, that's exactly the user who should want this feature (they > have 16,000 relation segment files). > > > -- Mark Callaghan mdcal...@gmail.com
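For reference, workaround (2) from Mark's list has roughly this shape in code. This is a hedged sketch of the O_DIRECT API only, not a benchmark: the 4kB alignment constant is an assumption, and it quietly falls back to a buffered open on filesystems that reject O_DIRECT (such as tmpfs):

```c
#define _GNU_SOURCE     /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Open with O_DIRECT so writes bypass the page cache; on XFS this
 * historically avoided holding the per-inode lock for the duration of the
 * write. O_DIRECT requires the buffer, offset and length to be aligned,
 * hence posix_memalign(). */
int direct_write_demo(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_DIRECT, 0600);
    if (fd < 0)
        fd = open(path, O_RDWR | O_CREAT, 0600); /* O_DIRECT unsupported */
    if (fd < 0)
        return -1;

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) {
        close(fd);
        return -1;
    }
    memset(buf, 0, 4096);

    int rc = (pwrite(fd, buf, 4096, 0) == 4096) ? 0 : -1;
    free(buf);
    close(fd);
    return rc;
}
```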
Re: Large files for relations
Greetings, * Dagfinn Ilmari Mannsåker (ilm...@ilmari.org) wrote: > Thomas Munro writes: > > On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski wrote: > >> On Mon, May 1, 2023 at 9:29 PM Thomas Munro wrote: > >>> I am not aware of any modern/non-historic filesystem[2] that can't do > >>> large files with ease. Anyone know of anything to worry about on that > >>> front? > >> > >> There is some trouble in the ambiguity of what we mean by "modern" and > >> "large files". There are still a large number of users of ext4 where > >> the max file size is 16TB. Switching to a single large file per > >> relation would effectively cut the max table size in half for those > >> users. How would a user with say a 20TB table running on ext4 be > >> impacted by this change? > […] > > A less aggressive version of the plan would be that we just keep the > > segment code for the foreseeable future with no planned cut off, and > > we make all of those "piggy back" transformations that I showed in the > > patch set optional. For example, I had it so that CLUSTER would > > quietly convert your relation to large format, if it was still in > > segmented format (might as well if you're writing all the data out > > anyway, right?), but perhaps that could depend on a GUC. Likewise for > > base backup. Etc. Then someone concerned about hitting the 16TB > > limit on ext4 could opt out. Or something like that. It seems funny > > though, that's exactly the user who should want this feature (they > > have 16,000 relation segment files). > > If we're going to have to keep the segment code for the foreseeable > future anyway, could we not get most of the benefit by increasing the > segment size to something like 1TB? The vast majority of tables would > fit in one file, and there would be less risk of hitting filesystem > limits. 
While I tend to agree that 1GB is too small, 1TB seems like it's possibly going to end up on the too big side of things, or at least, if we aren't getting rid of the segment code then it's possibly throwing away the benefits we have from the smaller segments without really giving us all that much. Going from 1G to 10G would reduce the number of open file descriptors by quite a lot without having much of a net change on other things. 50G or 100G would reduce the FD handles further but starts to make us lose out a bit more on some of the nice parts of having multiple segments. Just some thoughts. Thanks, Stephen
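The file-count arithmetic behind this trade-off is simple enough to write down as a sketch (sizes in bytes):

```c
#include <stdint.h>

/* How many segment files (and thus potential file descriptors) a single
 * relation needs at a given segment size. */
uint64_t segments_needed(uint64_t rel_size, uint64_t seg_size)
{
    return (rel_size + seg_size - 1) / seg_size;   /* ceiling division */
}
```

For a hypothetical 20TB table this gives 20480 segment files at 1GB segments, 2048 at 10GB, and 410 at 50GB, which is roughly the shape of the trade-off described above.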
Re: Large files for relations
On Thu, May 11, 2023 at 7:38 PM Thomas Munro wrote: > On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski wrote: > > On Mon, May 1, 2023 at 9:29 PM Thomas Munro > wrote: > >> I am not aware of any modern/non-historic filesystem[2] that can't do > >> large files with ease. Anyone know of anything to worry about on that > >> front? > > > > There is some trouble in the ambiguity of what we mean by "modern" and > "large files". There are still a large number of users of ext4 where the > max file size is 16TB. Switching to a single large file per relation would > effectively cut the max table size in half for those users. How would a > user with say a 20TB table running on ext4 be impacted by this change? > > Hrmph. Yeah, that might be a bit of a problem. I see it discussed in > various places that MySQL/InnoDB can't have tables bigger than 16TB on > ext4 because of this, when it's in its default one-file-per-object > mode (as opposed to its big-tablespace-files-to-hold-all-the-objects > mode like DB2, Oracle etc, in which case I think you can have multiple > 16TB segment files and get past that ext4 limit). It's frustrating > because 16TB is still really, really big and you probably should be > using partitions, or more partitions, to avoid all kinds of other > scalability problems at that size. But however hypothetical the > scenario might be, it should work, >

Agreed, it is frustrating, but it is not hypothetical. I have seen a number of users with single tables larger than 16TB who don't use partitioning because of the limitations we have today. The most common reason is needing multiple unique constraints on the table that don't include the partition key. Something like a user_id and email. There are workarounds for those cases, but usually it's easier to deal with a single large table than to deal with the sharp edges those workarounds introduce.
Re: Large files for relations
Thomas Munro writes: > On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski wrote: >> On Mon, May 1, 2023 at 9:29 PM Thomas Munro wrote: >>> I am not aware of any modern/non-historic filesystem[2] that can't do >>> large files with ease. Anyone know of anything to worry about on that >>> front? >> >> There is some trouble in the ambiguity of what we mean by "modern" and >> "large files". There are still a large number of users of ext4 where >> the max file size is 16TB. Switching to a single large file per >> relation would effectively cut the max table size in half for those >> users. How would a user with say a 20TB table running on ext4 be >> impacted by this change? […] > A less aggressive version of the plan would be that we just keep the > segment code for the foreseeable future with no planned cut off, and > we make all of those "piggy back" transformations that I showed in the > patch set optional. For example, I had it so that CLUSTER would > quietly convert your relation to large format, if it was still in > segmented format (might as well if you're writing all the data out > anyway, right?), but perhaps that could depend on a GUC. Likewise for > base backup. Etc. Then someone concerned about hitting the 16TB > limit on ext4 could opt out. Or something like that. It seems funny > though, that's exactly the user who should want this feature (they > have 16,000 relation segment files). If we're going to have to keep the segment code for the foreseeable future anyway, could we not get most of the benefit by increasing the segment size to something like 1TB? The vast majority of tables would fit in one file, and there would be less risk of hitting filesystem limits. - ilmari
Re: Large files for relations
On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski wrote: > On Mon, May 1, 2023 at 9:29 PM Thomas Munro wrote: >> I am not aware of any modern/non-historic filesystem[2] that can't do >> large files with ease. Anyone know of anything to worry about on that >> front? > > There is some trouble in the ambiguity of what we mean by "modern" and "large > files". There are still a large number of users of ext4 where the max file > size is 16TB. Switching to a single large file per relation would effectively > cut the max table size in half for those users. How would a user with say a > 20TB table running on ext4 be impacted by this change?

Hrmph. Yeah, that might be a bit of a problem. I see it discussed in various places that MySQL/InnoDB can't have tables bigger than 16TB on ext4 because of this, when it's in its default one-file-per-object mode (as opposed to its big-tablespace-files-to-hold-all-the-objects mode like DB2, Oracle etc, in which case I think you can have multiple 16TB segment files and get past that ext4 limit). It's frustrating because 16TB is still really, really big and you probably should be using partitions, or more partitions, to avoid all kinds of other scalability problems at that size. But however hypothetical the scenario might be, it should work, and this is certainly a plausible argument against the "aggressive" plan described above with the hard cut-off where we get to drop the segmented mode.

Concretely, a 20TB pg_upgrade in copy mode would fail while trying to concatenate with the above patches, so you'd have to use link or reflink mode (you'd probably want to use that anyway due to the sheer volume of data to copy otherwise, since ext4 is also not capable of block-range sharing), but then you'd be out of luck after N future major releases, according to that plan where we start deleting the code, so you'd need to organise some smaller partitions before that time comes. Or pg_upgrade to a target on xfs etc.
I wonder if a future version of extN will increase its max file size. A less aggressive version of the plan would be that we just keep the segment code for the foreseeable future with no planned cut off, and we make all of those "piggy back" transformations that I showed in the patch set optional. For example, I had it so that CLUSTER would quietly convert your relation to large format, if it was still in segmented format (might as well if you're writing all the data out anyway, right?), but perhaps that could depend on a GUC. Likewise for base backup. Etc. Then someone concerned about hitting the 16TB limit on ext4 could opt out. Or something like that. It seems funny though, that's exactly the user who should want this feature (they have 16,000 relation segment files).
Re: Large files for relations
On Mon, May 1, 2023 at 9:29 PM Thomas Munro wrote: > > I am not aware of any modern/non-historic filesystem[2] that can't do > large files with ease. Anyone know of anything to worry about on that > front? There is some trouble in the ambiguity of what we mean by "modern" and "large files". There are still a large number of users of ext4 where the max file size is 16TB. Switching to a single large file per relation would effectively cut the max table size in half for those users. How would a user with say a 20TB table running on ext4 be impacted by this change?
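As I understand it, the 16TB figure falls out of ext4 addressing a file's blocks with 32-bit logical block numbers; a quick sketch of the arithmetic, assuming the usual 4KiB filesystem block size:

```c
#include <stdint.h>

/* Max ext4 file size: 2^32 addressable logical blocks times the
 * filesystem block size. With 4KiB blocks that's 16TiB; a larger block
 * size would raise the limit, but 4KiB is the norm on x86. */
uint64_t ext4_max_file_bytes(uint64_t fs_block_size)
{
    return (UINT64_C(1) << 32) * fs_block_size;
}
```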
Re: Large files for relations
Greetings, * Corey Huinker (corey.huin...@gmail.com) wrote: > On Wed, May 3, 2023 at 1:37 AM Thomas Munro wrote: > > On Wed, May 3, 2023 at 5:21 PM Thomas Munro > > wrote: > > > rsync --link-dest ... rsync isn't really a safe tool to use for PG backups by itself unless you're using it with archiving and with start/stop backup and with checksums enabled.

> > I wonder if rsync will grow a mode that can use copy_file_range() to > > share blocks with a reference file (= previous backup). Something > > like --copy-range-dest. That'd work for large-file relations > > (assuming a file system that has block sharing, like XFS and ZFS). > > You wouldn't get the "mtime is enough, I don't even need to read the > > bytes" optimisation, which I assume makes all database hackers feel a > > bit queasy anyway, but you'd get the space savings via the usual > > rolling checksum or a cheaper version that only looks for strong > > checksum matches at the same offset, or whatever other tricks rsync > > might have up its sleeve.

There's also really good reasons to have multiple full backups and not just a single full backup and then lots and lots of incrementals, which basically boils down to "are you really sure that one copy of that one really important file won't ever disappear from your backup repository..?"

That said, pgbackrest does now have block-level incremental backups (where we define our own block size ...) and there's reasons we decided against going down the LSN-based approach (not the least of which is that the LSN isn't always updated...), but long story short, moving to larger than 1G files should be something that pgbackrest will be able to handle without as much impact as there would have been previously in terms of incremental backups.
There is a loss in the ability to use mtime to scan just the parts of the relation that changed and that's unfortunate, but I wouldn't see it as really a game changer (and yes, there's certainly an argument for not trusting mtime, though I don't think we've yet had a report where there was an mtime issue that our mtime-validity checking didn't catch and force pgbackrest into checksum-based revalidation automatically which resulted in an invalid backup... of course, not enough people test their backups...).

> I understand the need to reduce open file handles, despite the > possibilities enabled by using large numbers of small file sizes.

I'm also generally in favor of reducing the number of open file handles that we have to deal with. Addressing the concerns raised nearby about weird corner cases of a non-1GB-length ABCDEF.1 file existing while ABCDEF.2 and later files exist is certainly another good argument in favor of getting rid of segments.

> I am curious whether a move like this to create a generational change in > file file format shouldn't be more ambitious, perhaps altering the block > format to insert a block format version number, whether that be at every > block, or every megabyte, or some other interval, and whether we store it > in-file or in a separate file to accompany the first non-segmented. Having > such versioning information would allow blocks of different formats to > co-exist in the same table, which could be critical to future changes such > as 64 bit XIDs, etc.

To the extent you're interested in this, there are patches posted which are already trying to move us in a direction that would allow for different page formats that add in space for other features such as 64bit XIDs, better checksums, and TDE tags to be supported.
https://commitfest.postgresql.org/43/3986/ Currently those patches are expecting it to be declared at initdb time, but the way they're currently written that's more of a soft requirement as you can tell on a per-page basis what features are enabled for that page. Might make sense to support it in that form first anyway though, before going down the more ambitious route of allowing different pages to have different sets of features enabled for them concurrently.

When it comes to 'a separate file', we do have forks already and those serve a very valuable but distinct use-case where you can get information from the much smaller fork (be it the FSM or the VM or some future thing), while something like 64bit XIDs or a stronger checksum is something you'd really need on every page. I have serious doubts about a proposal where we'd store information needed on every page read in some far away block that's still in the same file, such as using something every 1MB, as that would turn every block access into two.

Thanks, Stephen
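To make the per-page-feature idea concrete, here is a purely hypothetical sketch; none of these flag names or byte counts come from the actual patch set at that commitfest link, they are invented for illustration:

```c
#include <stdint.h>

/* Hypothetical per-page feature bitmask: each bit says that an optional
 * extension consumes reserved space on this page. */
enum {
    PAGE_FEAT_XID64    = 1 << 0,   /* widened transaction IDs */
    PAGE_FEAT_CHECKSUM = 1 << 1,   /* stronger checksum */
    PAGE_FEAT_TDE_TAG  = 1 << 2,   /* encryption authentication tag */
};

/* Extra bytes reserved on a page for a given feature set (sizes invented).
 * Because the flags travel with each page, pages with different feature
 * sets could in principle co-exist in one table. */
uint32_t page_feature_overhead(uint32_t flags)
{
    uint32_t n = 0;
    if (flags & PAGE_FEAT_XID64)
        n += 8;
    if (flags & PAGE_FEAT_CHECKSUM)
        n += 32;
    if (flags & PAGE_FEAT_TDE_TAG)
        n += 16;
    return n;
}
```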
Re: Large files for relations
On Wed, May 3, 2023 at 1:37 AM Thomas Munro wrote: > On Wed, May 3, 2023 at 5:21 PM Thomas Munro > > wrote: > > rsync --link-dest > > I wonder if rsync will grow a mode that can use copy_file_range() to > share blocks with a reference file (= previous backup). Something > like --copy-range-dest. That'd work for large-file relations > (assuming a file system that has block sharing, like XFS and ZFS). > You wouldn't get the "mtime is enough, I don't even need to read the > bytes" optimisation, which I assume makes all database hackers feel a > bit queasy anyway, but you'd get the space savings via the usual > rolling checksum or a cheaper version that only looks for strong > checksum matches at the same offset, or whatever other tricks rsync > might have up its sleeve. >

I understand the need to reduce open file handles, despite the possibilities enabled by using large numbers of small file sizes. Snowflake, for instance, sees everything in 1MB chunks, which makes massively parallel sequential scans (Snowflake's _only_ query plan) possible, though I don't know if they accomplish that via separate files, or via segments within a large file.

I am curious whether a move like this to create a generational change in file format shouldn't be more ambitious, perhaps altering the block format to insert a block format version number, whether that be at every block, or every megabyte, or some other interval, and whether we store it in-file or in a separate file to accompany the first non-segmented. Having such versioning information would allow blocks of different formats to co-exist in the same table, which could be critical to future changes such as 64 bit XIDs, etc.
Re: Large files for relations
On Wed, May 3, 2023 at 5:21 PM Thomas Munro wrote: > rsync --link-dest I wonder if rsync will grow a mode that can use copy_file_range() to share blocks with a reference file (= previous backup). Something like --copy-range-dest. That'd work for large-file relations (assuming a file system that has block sharing, like XFS and ZFS). You wouldn't get the "mtime is enough, I don't even need to read the bytes" optimisation, which I assume makes all database hackers feel a bit queasy anyway, but you'd get the space savings via the usual rolling checksum or a cheaper version that only looks for strong checksum matches at the same offset, or whatever other tricks rsync might have up its sleeve.
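The hypothetical --copy-range-dest mode might boil down to something like this sketch, which assumes Linux >= 4.5 / glibc >= 2.27 for copy_file_range(); whether blocks are actually shared rather than duplicated depends on the filesystem (XFS with reflink, Btrfs, etc.):

```c
#define _GNU_SOURCE     /* for copy_file_range() */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Ask the kernel to copy a range from the reference file (the previous
 * backup) into the new backup file; a block-sharing filesystem can satisfy
 * this without duplicating the data. Returns bytes copied, or -1. */
long copy_range_from_reference(const char *ref, const char *dst,
                               off_t off, size_t len)
{
    int rfd = open(ref, O_RDONLY);
    if (rfd < 0)
        return -1;
    int wfd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (wfd < 0) {
        close(rfd);
        return -1;
    }
    off_t in_off = off, out_off = off;
    long n = copy_file_range(rfd, &in_off, wfd, &out_off, len, 0);
    close(rfd);
    close(wfd);
    return n;
}

/* Tiny self-test: make a reference file, "back it up" via the kernel copy,
 * and check the bytes arrived. Returns 0 on success. */
int copy_range_selftest(void)
{
    const char ref[] = "/tmp/gr_cfr_ref", dst[] = "/tmp/gr_cfr_dst";
    int fd = open(ref, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0 || pwrite(fd, "hello", 5, 0) != 5)
        return -1;
    close(fd);
    if (copy_range_from_reference(ref, dst, 0, 5) != 5)
        return -1;
    char buf[6] = {0};
    fd = open(dst, O_RDONLY);
    if (fd < 0 || pread(fd, buf, 5, 0) != 5)
        return -1;
    close(fd);
    return strcmp(buf, "hello") == 0 ? 0 : -1;
}
```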
Re: Large files for relations
On Tue, May 2, 2023 at 3:28 PM Pavel Stehule wrote: > I like this patch - it can save some system sources - I am not sure how much, > because bigger tables usually use partitioning usually. Yeah, if you only use partitions of < 1GB it won't make a difference. Larger partitions are not uncommon, though. > Important note - this feature breaks sharing files on the backup side - so > before disabling 1GB sized files, this issue should be solved. Hmm, right, so there is a backup granularity continuum with "whole database cluster" at one end, "only files whose size, mtime [or optionally also checksum] changed since last backup" in the middle, and "only blocks that changed since LSN of last backup" at the other end. Getting closer to the right end of that continuum can make backups require less reading, less network transfer, less writing and/or less storage space depending on details. But this proposal moves the middle thing further to the left by changing the granularity from 1GB to whole relation, which can be gargantuan with this patch. Ultimately we need to be all the way at the right on that continuum, and there are clearly several people working on that goal. I'm not involved in any of those projects, but it's fun to think about an alien technology that produces complete standalone backups like rsync --link-dest (as opposed to "full" backups followed by a chain of "incremental" backups that depend on it so you need to retain them carefully) while still sharing disk blocks with older backups, and doing so with block granularity. TL;DW something something WAL something something copy_file_range().
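The middle of that continuum ("files whose size or mtime changed since the last backup") is cheap precisely because it needs only a stat() per segment file; a minimal sketch of the test, in the spirit of rsync's default quick check (the decision to treat an unreadable file as changed is an assumption of this example):

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

/* Does this file need re-copying? Compare its current size and mtime
 * against what the previous backup recorded. Cheap, but trusts mtime,
 * which is exactly the part that makes database hackers queasy. */
bool file_changed_since(const char *path, off_t old_size, time_t old_mtime)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return true;    /* gone or unreadable: treat as changed */
    return st.st_size != old_size || st.st_mtime != old_mtime;
}
```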
Re: Large files for relations
Hi I like this patch - it can save some system resources - I am not sure how much, because bigger tables usually use partitioning. Important note - this feature breaks sharing files on the backup side - so before disabling 1GB sized files, this issue should be solved. Regards Pavel