Re: [PATCH] Refactor SLRU to always use long file names
Hi, > Here is an updated patch. The steps to test it manually are as follows. > > Compile and install PostgreSQL from the REL_17_STABLE branch: > > [...] > > As always, your feedback and suggestions are most welcomed. Rebased. -- Best regards, Aleksander Alekseev v3-0001-Always-use-long-SLRU-segment-file-names.patch Description: Binary data
Re: [PATCH] Refactor SLRU to always use long file names
Hi, > For the record, Michael and I had a brief discussion about this > offlist and decided to abandon the idea of adding TAP tests, relying > only on buildfarm. Also I will check if we have a clear error message > in case when a user forgot to run pg_upgrade and running new slru.c > with old filenames. If the user doesn't get such an error message I > will see if it's possible to add it somewhere in slru.c without > introducing much performance overhead. > > Also I'm going to submit precise steps to test this migration manually > for the reviewers convenience. Here is an updated patch. The steps to test it manually are as follows. Compile and install PostgreSQL from the REL_17_STABLE branch: ``` git checkout REL_17_STABLE git fetch origin git rebase -i origin/REL_17_STABLE git clean -dfx meson setup --buildtype debug -Dcassert=true -Dtap_tests=enabled -Dprefix=/home/eax/pginstall-17 build ninja -C build meson install -C build ~/pginstall-17/bin/initdb --data-checksums -D ~/pginstall-17/data ~/pginstall-17/bin/pg_ctl -D ~/pginstall-17/data -l ~/pginstall-17/data/logfile start ~/pginstall-17/bin/createdb $(whoami) # fill DB (or even better - use a copy of an existing one), e.g: ~/pginstall-17/bin/pgbench -i -s 100 ~/pginstall-17/bin/pgbench -j 16 -c 16 -T 10 -P 5 # should see 4-digit SLRU segment filenames, more files is better ls -la ~/pginstall-17/data/pg_xact/ \ ~/pginstall-17/data/pg_commit_ts/ \ ~/pginstall-17/data/pg_multixact/members/ \ ~/pginstall-17/data/pg_multixact/offsets/ \ ~/pginstall-17/data/pg_subtrans/ \ ~/pginstall-17/data/pg_serial/ ~/pginstall-17/bin/pg_ctl -D ~/pginstall-17/data stop ``` Apply the patch to the `master` branch, recompile PostgreSQL, install to the different location: ``` git checkout slru_pg_upgrade_v2 git clean -dfx meson setup --buildtype debug -Dcassert=true -Dtap_tests=enabled -Dprefix=/home/eax/pginstall-18 build ninja -C build meson install -C build ``` Try to start PostgreSQL without running pg_upgrade: ``` cp -r ~/pginstall-17/data ~/pginstall-18/data ~/pginstall-18/bin/pg_ctl -D ~/pginstall-18/data -l ~/pginstall-18/data/logfile start ``` You should get: ``` waiting for server to start stopped waiting pg_ctl: could not start server Examine the log output. $ tail ~/pginstall-18/data/logfile FATAL: database files are incompatible with server DETAIL: The data directory was initialized by PostgreSQL version 17, which is not compatible with this version 18devel ``` Run pg_upgrade: ``` rm -r ~/pginstall-18/data ~/pginstall-18/bin/initdb --data-checksums -D ~/pginstall-18/data ~/pginstall-18/bin/pg_upgrade --old-datadir=/home/eax/pginstall-17/data --new-datadir=/home/eax/pginstall-18/data --old-bindir=/home/eax/pginstall-17/bin --new-bindir=/home/eax/pginstall-18/bin ``` Make sure the output contains: ``` Renaming SLRU segments in pg_xact ok Renaming SLRU segments in pg_commit_tsok Renaming SLRU segments in pg_multixact/offsetsok Renaming SLRU segments in pg_multixact/membersok Renaming SLRU segments in pg_subtrans ok Renaming SLRU segments in pg_serial ok ``` Make sure PostgreSQL starts after the upgrade: ``` ~/pginstall-18/bin/pg_ctl -D ~/pginstall-18/data -l ~/pginstall-18/data/logfile start ~/pginstall-18/bin/psql -c 'select count(*) from pgbench_accounts' ~/pginstall-18/bin/pg_ctl -D ~/pginstall-18/data stop # should see 15-digit SLRU segment filenames ls -la ~/pginstall-18/data/pg_xact/ \ ~/pginstall-18/data/pg_commit_ts/ \ ~/pginstall-18/data/pg_multixact/members/ \ ~/pginstall-18/data/pg_multixact/offsets/ \ ~/pginstall-18/data/pg_subtrans/ \ ~/pginstall-18/data/pg_serial/ ``` Make sure that the second run of pg_upgrade doesn't produce "Renaming SLRU segments" messages: ``` mv ~/pginstall-18/data ~/pginstall-18/data.bak ~/pginstall-18/bin/initdb --data-checksums -D ~/pginstall-18/data ~/pginstall-18/bin/pg_upgrade --old-datadir=/home/eax/pginstall-18/data.bak --new-datadir=/home/eax/pginstall-18/data --old-bindir=/home/eax/pginstall-18/bin --new-bindir=/home/eax/pginstall-18/bin ``` As always, your feedback and suggestions are most welcomed. -- Best regards, Aleksander Alekseev v2-0001-Always-use-long-SLRU-segment-file-names.patch Description: Binary data
Re: [PATCH] Refactor SLRU to always use long file names
Hi, > I guess we are going to need either a `pg_writecontoldata` tool or > `pg_controldata -w` flag. I wonder which option you find more > attractive, or maybe you have better ideas? For the record, Michael and I had a brief discussion about this offlist and decided to abandon the idea of adding TAP tests, relying only on buildfarm. Also I will check if we have a clear error message in case when a user forgot to run pg_upgrade and running new slru.c with old filenames. If the user doesn't get such an error message I will see if it's possible to add it somewhere in slru.c without introducing much performance overhead. Also I'm going to submit precise steps to test this migration manually for the reviewers convenience. -- Best regards, Aleksander Alekseev
Re: [PATCH] Refactor SLRU to always use long file names
Hi Michael, > The scans may be quite long as well, actually, which could be a > bottleneck. Did you measure the runtime with a maximized (still > realistic) pool of files for these SLRUs in the upgrade time? For > upgrades, data would be the neck. Good question. In theory SLRUs are not supposed to grow large and their size is a small fraction of the rest of the database. As an example CLOG ( pg_xact/ ) stores 2 bits per transaction. Since every SLRU has a dedicated directory and we scan just it, non-SLRU files don't affect the scan time. To make sure I asked several people to check how many SLRUs they have in the prod environment. The typical response looked like this: ``` $PGDATA/pg_xact: 191 segments $PGDATA/pg_commit_ts: 3 $PGDATA/pg_multixact/offsets: 148 $PGDATA/pg_multixact/members: 400 $PGDATA/pg_subtrans: 4 $PGDATA/pg_serial: 3 ``` This is a 800 Gb database. Interestingly larger databases (4.2Tb) may have much less SLRU segments (220 in total, most of them are pg_xact). And here is the *worst* case that was reported to me: ``` $PGDATA/pg_xact: 171 segments $PGDATA/pg_commit_ts: 3 $PGDATA/pg_multixact/offsets: 4864 $PGDATA/pg_multixact/members: 40996 $PGDATA/pg_subtrans: 5 $PGDATA/pg_serial: 3 ``` I was told this is a "1Tb+" database. For this user pg_upgrade will rename 45 000 files. I wrote a little script to check how much time it will take: ``` #!/usr/bin/env perl use strict; my $from = "test_0001.tmp"; my $to = "test_0002.tmp"; system("touch $from"); for my $i (1..45000) { rename($from, $to); ($from, $to) = ($to, $from); } ``` On my laptop I get 0.5 seconds. Note that I don't do scanning, only renaming, assuming that the recent should take most of the time. I think this should be multiplied by 10 to take into account the role of the filesystem cache and other factors. All in all in the absolutely worst case scenario this shouldn't take more than 5 seconds, in reality it will probably be orders of magnitude less. > Note that this also depends on the system endianness, see 039_end_of_wal.pl. Sure, I think I took it into account when using pack("L!"). My understanding is that "L" takes care of the endiness since I see special flags to force little- or big-endiness independently from the platform [1]. This of course should be tested in practice on different machines. Using an exclamation mark in "L!" was a mistake since cat_ver is not an int, but rather an uint32. > You don't really need the lookup part, actually? For lookup we already have the pg_controldata tool, that's not a problem. > Control file manipulation may be useful as a routine in Cluster.pm, > based on an offset in the file and a format to pack as argument? > [...] > It's one of these things I could see myself reuse to force a state in > the cluster and make a test cheaper, for example. > You would just need the part where > the control file is rewritten, which should be OK as long as the > cluster is freshly initdb'd meaning that there should be nothing that > interacts with the new value set. Agree. Still I don't see a good way of figuring out sizeof(ControlFileData) from Perl. The structure has int's in it (e.g. wal_level, MaxConnections, etc) thus the size is platform-dependent. The CRC should be placed at the end of the structure. If we want to manipulate MaxConnections etc their offsets are going to be platform-dependent as well. And my understanding is that the alignment is platform/compiler dependent too. I guess we are going to need either a `pg_writecontoldata` tool or `pg_controldata -w` flag. I wonder which option you find more attractive, or maybe you have better ideas? [1]: https://perldoc.perl.org/functions/pack -- Best regards, Aleksander Alekseev
Re: [PATCH] Refactor SLRU to always use long file names
On Tue, Nov 12, 2024 at 05:37:02PM +0300, Aleksander Alekseev wrote: > Also it occured to me that as a 4th option we could just get rid of > this check. Users however will pay the price every time they execute > pg_upgrade so I doubt we are going to do this. We cannot remove the check, or Nathan will come after us as he's working hard on reducing the time pg_upgrade takes. We should not make it longer if there is no need to. The scans may be quite long as well, actually, which could be a bottleneck. Did you measure the runtime with a maximized (still realistic) pool of files for these SLRUs in the upgrade time? For upgrades, data would be the neck. # equals SLRU_SEG_FILENAMES_CHANGE_CAT_VER in pg_upgrade.h my $slru_seg_filenames_change_cat_ver = 202411121; [...] open my $fh, "+<", $pg_control_fname or die $!; binmode($fh); sysseek($fh, 12, 0); my $binval = pack("L!", $slru_seg_filenames_change_cat_ver - 1); syswrite($fh, $binval, 4); close($fh); Control file manipulation may be useful as a routine in Cluster.pm, based on an offset in the file and a format to pack as argument? Note that this also depends on the system endianness, see 039_end_of_wal.pl. It's one of these things I could see myself reuse to force a state in the cluster and make a test cheaper, for example. You don't really need the lookup part, actually? You would just need the part where the control file is rewritten, which should be OK as long as the cluster is freshly initdb'd meaning that there should be nothing that interacts with the new value set. pg_upgrade only has CAT_VER flags for some multixact changes and the jsonb check from 9.4. -- Michael signature.asc Description: PGP signature
Re: [PATCH] Refactor SLRU to always use long file names
Hi again, Just a quick follow-up. > (*) BTW I noticed a mistake in the commented code. The condition > should be `>=`, not `<`, i.e: > > ``` > if(new_cluster.controldata.cat_ver >= SLRU_SEG_FILENAMES_CHANGE_CAT_VER) > return; > ``` The concentration of caffeine in my blood is a bit low right now. I suspect I may need to re-check this statement with a fresh head. Also it occured to me that as a 4th option we could just get rid of this check. Users however will pay the price every time they execute pg_upgrade so I doubt we are going to do this. -- Best regards, Aleksander Alekseev
Re: [PATCH] Refactor SLRU to always use long file names
Hi Michael, Thanks for your feedback! > Your patch is just doing a rename() of the files from short to long > names. How about adding a new TAP script in pg_upgrade that creates a > couple of empty files with short files names in each path that needs > to do the transfer? Then the test could run one pg_upgrade command > and check that the new names are in place. You could use a array of > objects, with the base path, the old name and the new name, then loop > over it. With the check in check_slru_segment_filenames() based on > SLRU_SEG_FILENAMES_CHANGE_CAT_VER, that should work. OK I gave it a try and discovered that the test becomes very ugly very fast. Attached is the draft of the test to give you an idea of how it's going to look like. In order to trigger renaming of SLRU segments first we have to downgrade the catalog version in the pg_control file. Otherwise the check in check_slru_segment_filenames() is not going to pass and pg_upgrade will do nothing (*). This per se is easily done with binmode() and pack() however the file has a CRC. Calculating it is not difficult since we have pg_read_binary_file() and crc32c() SQL functions, although personally I don't find a need for starting a cluster for this quite satisfactory. The CRC is stored by the offset `sizeof(ControlData)-4` and unless I'm wrong is platform-dependent. I see several solutions for this problem: * We could add sizeof(ControlData) to the output of `pg_controldata` so we could use it from Perl * We could teach `initdb` to override the catalog version * We could implement a new tool for editing pg_control file On top of that I should add that the test takes about 7 seconds on my laptop. Apparently executing two initdb's and one pg_upgrade is not very cheap. This makes me wonder if instead of writing a new test we should modify 002_pg_upgrade.pl. This however will make the test even less readable and maintainable. What do you think? (*) BTW I noticed a mistake in the commented code. The condition should be `>=`, not `<`, i.e: ``` if(new_cluster.controldata.cat_ver >= SLRU_SEG_FILENAMES_CHANGE_CAT_VER) return; ``` -- Best regards, Aleksander Alekseev # Copyright (c) 2024, PostgreSQL Global Development Group use strict; use warnings FATAL => 'all'; use PostgreSQL::Test::Cluster; use PostgreSQL::Test::Utils; use Test::More; # This test ensures that pg_upgrade renames SLRU segments. # After the upgrade all segments should have long file names. # equals SLRU_SEG_FILENAMES_CHANGE_CAT_VER in pg_upgrade.h my $slru_seg_filenames_change_cat_ver = 202411121; my @slru_dirs = ( "pg_xact", "pg_commit_ts", "pg_multixact/offsets", "pg_multixact/members", "pg_subtrans", "pg_serial", ); my $short_segment_name = "1234"; my $long_segment_name = "0001234"; my $oldnode = PostgreSQL::Test::Cluster->new('old_node'); $oldnode->init(); my $oldbindir = $oldnode->config_data('--bindir'); my $newnode = PostgreSQL::Test::Cluster->new('new_node'); $newnode->init(); my $newbindir = $newnode->config_data('--bindir'); # Fill data_dir of the old node with SLRU segments that use short file names. # pg_upgrade renames the files without looking at the content, so the content # is not important. foreach my $dir (@slru_dirs) { my $fname = $oldnode->data_dir."/".$dir."/".$short_segment_name; open my $fh, ">", $fname or die $!; close $fh; } # Modify pg_control of the old node to make it look like a version that needs # migration (decrease ControlData->cat_ver). Otherwise pg_upgrade will skip it. my $pg_control_fname = $oldnode->data_dir."/global/pg_control"; open my $fh, "+<", $pg_control_fname or die $!; binmode($fh); sysseek($fh, 12, 0); my $binval = pack("L!", $slru_seg_filenames_change_cat_ver - 1); syswrite($fh, $binval, 4); close($fh); # Calculate CRC of the updated file using pg_read_binary_file() and crc32c() my $fsize = -s $pg_control_fname; # WRONG! should be sizeof(ControlFile) $newnode->start; my $newcrc; $newnode->psql( "postgres", "SELECT crc32(pg_read_binary_file('".$pg_control_fname."',0,".($fsize-4)."));", stdout => \$newcrc, on_error_die => 1, ); $newnode->stop; # Update CRC open $fh, "+<", $pg_control_fname or die $!; binmode($fh); sysseek($fh, $fsize-4, 0); my $bincrc = pack("L!", $newcrc); syswrite($fh, $bincrc, 4); close($fh); $newnode->command_ok( [ 'pg_upgrade', '--old-datadir', $oldnode->data_dir, '--new-datadir', $newnode->data_dir, '--old-bindir', $oldbindir, '--new-bindir', $newbindir, ], 'run of pg_upgrade'); # Check that pg_upgrade renamed the SLRU segments we created foreach my $dir (@slru_dirs) { ok(-e $newnode->data_dir."/".$dir."/".$long_segment_name); } done_testing();
Re: [PATCH] Refactor SLRU to always use long file names
On Thu, Sep 12, 2024 at 12:33:14PM +0300, Aleksander Alekseev wrote: > It wouldn't hurt re-checking the segment file names in the TAP test > but this would mean hardcoding catalog names which as I understand you > want to avoid. With high probability PG wouldn't start if the > corresponding piece of pg_upgrade is wrong (I checked more than once > :). So I'm not entirely sure if it's worth the effort, but let's see > what others think. +segno = strtoi64(de->d_name, NULL, 16); +snprintf(new_path, MAXPGPATH, "%s/%015llX", dir_path, +(long long) segno); +snprintf(old_path, MAXPGPATH, "%s/%s", dir_path, de->d_name); + +if (pg_mv_file(old_path, new_path) != 0) +pg_fatal("could not rename file \"%s\" to \"%s\": %m", + old_path, new_path); Your patch is just doing a rename() of the files from short to long names. How about adding a new TAP script in pg_upgrade that creates a couple of empty files with short files names in each path that needs to do the transfer? Then the test could run one pg_upgrade command and check that the new names are in place. You could use a array of objects, with the base path, the old name and the new name, then loop over it. With the check in check_slru_segment_filenames() based on SLRU_SEG_FILENAMES_CHANGE_CAT_VER, that should work. + static const char* dirs[] = { + "pg_xact", + "pg_commit_ts", + "pg_multixact/offsets", + "pg_multixact/members", + "pg_subtrans", + "pg_serial", + }; Hardcoding that is always annoying, but well, that's not going to change. So living with this is not going to be a maintenance burden. -- Michael signature.asc Description: PGP signature
Re: [PATCH] Refactor SLRU to always use long file names
Hi Michael, > On Wed, Sep 11, 2024 at 04:07:06PM +0300, Aleksander Alekseev wrote: > > Commit 4ed8f0913bfd introduced long SLRU file names. The proposed > > patch removes SlruCtl->long_segment_names flag and makes SLRU always > > use long file names. This simplifies both the code and the API. > > Corresponding changes to pg_upgrade are included. > > That's leaner, indeed. > > > One drawback I see is that technically SLRU is an exposed API and > > changing it may affect third-party code. I'm not sure if we should > > seriously worry about this. Firstly, the change is trivial and > > secondly, it's not clear whether such third-party code even exists (we > > broke this API just recently in 4ed8f0913bfd and no one complained). > > Any third-party code using custom SLRUs would need to take care of > handling their upgrade path outside pg_upgrade. Not sure there are > any of them, TBH, but let's see. > > > I didn't include any tests for the new pg_upgrade code. To my > > knowledge we test it manually, with buildfarm members and during > > alpha- and beta-testing periods. Please let me know if you think there > > should be a corresponding TAP test. > > Removing the old API means that it is impossible to test a move from > short to long file names. That's OK by me to rely on the pg_upgrade > paths in the buildfarm code. We have a few of them. Thanks for the feedback. > There is one thing I am wondering, here, though, which is to think > harder about a validity check at the end of 002_pg_upgrade.pl to make > sure that all the SLRU use long file names after running the tests. > That would mean thinking about a mechanism to list all of them from a > backend, rather than hardcode a list of them. Perhaps that's not > worth it, just dropping an idea in the bucket of ideas. I would guess > in the shape of a catalog that's able to represent at SQL level all > the SLRUs that exist in a backend. Hmm... IMO it would be a rather niche facility to maintain in PG core. At least I'm not aware of cases when a DBA wanted to list initialized SLRUs. Would it be convenient for core / extensions developers? Creating a breakpoint on SimpleLruInit() or adding a temporary elog() sounds simpler to me. It wouldn't hurt re-checking the segment file names in the TAP test but this would mean hardcoding catalog names which as I understand you want to avoid. With high probability PG wouldn't start if the corresponding piece of pg_upgrade is wrong (I checked more than once :). So I'm not entirely sure if it's worth the effort, but let's see what others think. -- Best regards, Aleksander Alekseev
Re: [PATCH] Refactor SLRU to always use long file names
On Wed, Sep 11, 2024 at 04:07:06PM +0300, Aleksander Alekseev wrote: > Commit 4ed8f0913bfd introduced long SLRU file names. The proposed > patch removes SlruCtl->long_segment_names flag and makes SLRU always > use long file names. This simplifies both the code and the API. > Corresponding changes to pg_upgrade are included. That's leaner, indeed. > One drawback I see is that technically SLRU is an exposed API and > changing it may affect third-party code. I'm not sure if we should > seriously worry about this. Firstly, the change is trivial and > secondly, it's not clear whether such third-party code even exists (we > broke this API just recently in 4ed8f0913bfd and no one complained). Any third-party code using custom SLRUs would need to take care of handling their upgrade path outside pg_upgrade. Not sure there are any of them, TBH, but let's see. > I didn't include any tests for the new pg_upgrade code. To my > knowledge we test it manually, with buildfarm members and during > alpha- and beta-testing periods. Please let me know if you think there > should be a corresponding TAP test. Removing the old API means that it is impossible to test a move from short to long file names. That's OK by me to rely on the pg_upgrade paths in the buildfarm code. We have a few of them. There is one thing I am wondering, here, though, which is to think harder about a validity check at the end of 002_pg_upgrade.pl to make sure that all the SLRU use long file names after running the tests. That would mean thinking about a mechanism to list all of them from a backend, rather than hardcode a list of them. Perhaps that's not worth it, just dropping an idea in the bucket of ideas. I would guess in the shape of a catalog that's able to represent at SQL level all the SLRUs that exist in a backend. -- Michael signature.asc Description: PGP signature