Re: pgsql: Add TAP tests for pg_verify_checksums
On Wed, Nov 28, 2018 at 06:36:25AM +0900, Michael Paquier wrote: > Yes, let's do so rather sooner than later if there are no objections > from others. I am going to ping the other thread about what's discussed > here additionally in case folks missed what you reported. Fixing the > temp file issue which has been reported by Andres first is an additional > bonus. For the archive's sake, this is fixed by 5c99513 and a1c91dd. -- Michael signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
On Tue, Nov 27, 2018 at 04:26:45PM +0100, Michael Banck wrote: > Oh, I kinda followed that thread a bit, but I think that patch fixes > things more by matter of moving code around, at least I haven't noticed > that tablespaces were explicitly claimed to be fixed in that thread. That may have been an unconscious move ;) It seems to me the root issue is that the original implementation of pg_verify_checksums naively assumed that files can be skipped without first checking at their types, which I got confused by in d55241af. You should definitely be mentioned as this reporter anyway, and get author credits for the additional test. > As pg_verify_checksums appears to be broken with respect to tablespaces > in REL_11_STABLE (so I think 11.1, but not 11.0) as well, I think this > merits a short-term minimal invasive fix (i.e. the patch you posted, > just that there's no TAP testsuite for pg_verify_checksums in v11 > unfortunately) on its own, regardless of the wider changes proposed in > the other thread. Yes, let's do so rather sooner than later if there are no objections from others. I am going to ping the other thread about what's discussed here additionally in case folks missed what you reported. Fixing the temp file issue which has been reported by Andres first is an additional bonus. -- Michael signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
Hi, Am Dienstag, den 27.11.2018, 22:52 +0900 schrieb Michael Paquier: > On Tue, Nov 27, 2018 at 02:09:05PM +0100, Michael Banck wrote: > > I had a quick look at fixing this but did not manage to immediately come > > up with a solution, so posting here for now. > > If you look at another thread, the patch posted on the top would > actually solve this issue: > https://www.postgresql.org/message-id/20181021134206.ga14...@paquier.xyz Oh, I kinda followed that thread a bit, but I think that patch fixes things more by matter of moving code around, at least I haven't noticed that tablespaces were explicitly claimed to be fixed in that thread. > Your problem could also be solved with the minimalistic patch attached, > so fixing on the way the problems with temporary files present in PGDATA > something like the attached could be used... Thanks! > Based on the stale status > of the other thread I am unsure what should be done though. As pg_verify_checksums appears to be broken with respect to tablespaces in REL_11_STABLE (so I think 11.1, but not 11.0) as well, I think this merits a short-term minimal invasive fix (i.e. the patch you posted, just that there's no TAP testsuite for pg_verify_checksums in v11 unfortunately) on its own, regardless of the wider changes proposed in the other thread. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.ba...@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Re: pgsql: Add TAP tests for pg_verify_checksums
On Tue, Nov 27, 2018 at 02:09:05PM +0100, Michael Banck wrote: > I had a quick look at fixing this but did not manage to immediately come > up with a solution, so posting here for now. If you look at another thread, the patch posted on the top would actually solve this issue: https://www.postgresql.org/message-id/20181021134206.ga14...@paquier.xyz Your problem could also be solved with the minimalistic patch attached, so fixing on the way the problems with temporary files present in PGDATA something like the attached could be used... Based on the stale status of the other thread I am unsure what should be done though. -- Michael diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c index f0e09bea20..148aa511f6 100644 --- a/src/bin/pg_verify_checksums/pg_verify_checksums.c +++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c @@ -21,6 +21,7 @@ #include "storage/bufpage.h" #include "storage/checksum.h" #include "storage/checksum_impl.h" +#include "storage/fd.h" static int64 files = 0; @@ -189,9 +190,22 @@ scan_directory(const char *basedir, const char *subdir) char fn[MAXPGPATH]; struct stat st; - if (!isRelFileName(de->d_name)) + if (strcmp(de->d_name, ".") == 0 || + strcmp(de->d_name, "..") == 0) continue; + /* Skip temporary files */ + if (strncmp(de->d_name, + PG_TEMP_FILE_PREFIX, + strlen(PG_TEMP_FILE_PREFIX)) == 0) + continue; + + /* Skip temporary folders */ + if (strncmp(de->d_name, + PG_TEMP_FILES_DIR, + strlen(PG_TEMP_FILES_DIR)) == 0) + return; + snprintf(fn, sizeof(fn), "%s/%s", path, de->d_name); if (lstat(fn, ) < 0) { @@ -206,6 +220,13 @@ scan_directory(const char *basedir, const char *subdir) *segmentpath; BlockNumber segmentno = 0; + /* + * Only normal relation files can be analyzed. Note that this + * skips temporary relations. + */ + if (!isRelFileName(de->d_name)) +continue; + /* * Cut off at the segment boundary (".") to get the segment number * in order to mix it into the checksum. Then also cut off at the diff --git a/src/bin/pg_verify_checksums/t/002_actions.pl b/src/bin/pg_verify_checksums/t/002_actions.pl index 0e1725d9f2..fd64de050e 100644 --- a/src/bin/pg_verify_checksums/t/002_actions.pl +++ b/src/bin/pg_verify_checksums/t/002_actions.pl @@ -5,7 +5,7 @@ use strict; use warnings; use PostgresNode; use TestLib; -use Test::More tests => 36; +use Test::More tests => 42; # Initialize node with checksums enabled. my $node = get_new_node('node_checksum'); @@ -54,31 +54,7 @@ command_fails(['pg_verify_checksums', '-D', $pgdata], "fails with online cluster"); # Create table to corrupt and get its relfilenode -$node->safe_psql('postgres', - "SELECT a INTO corrupt1 FROM generate_series(1,1) AS a; - ALTER TABLE corrupt1 SET (autovacuum_enabled=false);"); - -my $file_corrupted = $node->safe_psql('postgres', - "SELECT pg_relation_filepath('corrupt1')"); -my $relfilenode_corrupted = $node->safe_psql('postgres', - "SELECT relfilenode FROM pg_class WHERE relname = 'corrupt1';"); - -# Set page header and block size -my $pageheader_size = 24; -my $block_size = $node->safe_psql('postgres', 'SHOW block_size;'); -$node->stop; - -# Checksums are correct for single relfilenode as the table is not -# corrupted yet. -command_ok(['pg_verify_checksums', '-D', $pgdata, - '-r', $relfilenode_corrupted], - "succeeds for single relfilenode with offline cluster"); - -# Time to create some corruption -open my $file, '+<', "$pgdata/$file_corrupted"; -seek($file, $pageheader_size, 0); -syswrite($file, '\0\0\0\0\0\0\0\0\0'); -close $file; +my $relfilenode_corrupted = create_corruption($node, 'corrupt1', 'pg_default'); # Global checksum checks fail $node->command_checks_all([ 'pg_verify_checksums', '-D', $pgdata], @@ -95,6 +71,72 @@ $node->command_checks_all([ 'pg_verify_checksums', '-D', $pgdata, '-r', [qr/checksum verification failed/], 'fails for corrupted data on single relfilenode'); +# Drop corrupt table again and make sure there is no more corruption +$node->start; +$node->safe_psql('postgres', 'DROP TABLE corrupt1;'); +$node->stop; +$node->command_ok(['pg_verify_checksums', '-D', $pgdata], +'succeeds again: '.$node->data_dir); + +# Create table to corrupt in a non-default tablespace and get its relfilenode +my $tablespace_dir = $node->data_dir."/../ts_corrupt_dir"; +mkdir ($tablespace_dir); +$node->start; +$node->safe_psql('postgres', "CREATE TABLESPACE ts_corrupt LOCATION '".$tablespace_dir."';"); +$relfilenode_corrupted = create_corruption($node, 'corrupt2', 'ts_corrupt'); +$node->command_checks_all([ 'pg_verify_checksums', '-D', $pgdata], + 1, + [qr/Bad checksums:.*1/], + [qr/checksum verification failed/], + 'fails with corrupted data in non-default tablespace'); + +# Drop corrupt table again and make sure there is no more corruption +$node->start;
Re: pgsql: Add TAP tests for pg_verify_checksums
Hi, Am Freitag, den 19.10.2018, 22:50 +0900 schrieb Michael Paquier: > On Wed, Oct 17, 2018 at 05:30:05PM -0400, Andrew Dunstan wrote: > Thanks. This is now committed after some tweaks to the comments, a bit > earlier than I thought first. I found an issue with this [d55241af7, "Use whitelist to choose files scanned with pg_verify_checksums"] commit, namely, it makes pg_verify_checksums no longer scan non-default tablespaces. So if you have all of your data in tablespaces, it will more-or-less immediately return with success. I've extended the test suite to induce corruption in a table located in a non-default tablespace, see the attached patch. If fails like this, i.e. does not detect the corruption: t/002_actions.pl .. 14/42 # Failed test 'fails with corrupted data in non-default tablespace status (got 0 vs expected 1)' # at t/002_actions.pl line 87. # Failed test 'fails with corrupted data in non-default tablespace stdout /(?^:Bad checksums:.*1)/' # at t/002_actions.pl line 87. # 'Checksum scan completed # Data checksum version: 1 # Files scanned: 1102 # Blocks scanned: 2861 # Bad checksums: 0 # ' # doesn't match '(?^:Bad checksums:.*1)' # Failed test 'fails with corrupted data in non-default tablespace stderr /(?^:checksum verification failed)/' # at t/002_actions.pl line 87. # '' # doesn't match '(?^:checksum verification failed)' # Looks like you failed 3 tests of 42. t/002_actions.pl .. Dubious, test returned 3 (wstat 768, 0x300) Failed 3/42 subtests The problem is that "PG_12_201811201" is not considered a valid relation file by isRelFileName(), so it skips it and the rest of the tablespace. I had a quick look at fixing this but did not manage to immediately come up with a solution, so posting here for now. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.ba...@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutzdiff --git a/src/bin/pg_verify_checksums/t/002_actions.pl b/src/bin/pg_verify_checksums/t/002_actions.pl index 0e1725d9f2..fd64de050e 100644 --- a/src/bin/pg_verify_checksums/t/002_actions.pl +++ b/src/bin/pg_verify_checksums/t/002_actions.pl @@ -5,7 +5,7 @@ use strict; use warnings; use PostgresNode; use TestLib; -use Test::More tests => 36; +use Test::More tests => 42; # Initialize node with checksums enabled. my $node = get_new_node('node_checksum'); @@ -54,31 +54,7 @@ command_fails(['pg_verify_checksums', '-D', $pgdata], "fails with online cluster"); # Create table to corrupt and get its relfilenode -$node->safe_psql('postgres', - "SELECT a INTO corrupt1 FROM generate_series(1,1) AS a; - ALTER TABLE corrupt1 SET (autovacuum_enabled=false);"); - -my $file_corrupted = $node->safe_psql('postgres', - "SELECT pg_relation_filepath('corrupt1')"); -my $relfilenode_corrupted = $node->safe_psql('postgres', - "SELECT relfilenode FROM pg_class WHERE relname = 'corrupt1';"); - -# Set page header and block size -my $pageheader_size = 24; -my $block_size = $node->safe_psql('postgres', 'SHOW block_size;'); -$node->stop; - -# Checksums are correct for single relfilenode as the table is not -# corrupted yet. -command_ok(['pg_verify_checksums', '-D', $pgdata, - '-r', $relfilenode_corrupted], - "succeeds for single relfilenode with offline cluster"); - -# Time to create some corruption -open my $file, '+<', "$pgdata/$file_corrupted"; -seek($file, $pageheader_size, 0); -syswrite($file, '\0\0\0\0\0\0\0\0\0'); -close $file; +my $relfilenode_corrupted = create_corruption($node, 'corrupt1', 'pg_default'); # Global checksum checks fail $node->command_checks_all([ 'pg_verify_checksums', '-D', $pgdata], @@ -95,6 +71,72 @@ $node->command_checks_all([ 'pg_verify_checksums', '-D', $pgdata, '-r', [qr/checksum verification failed/], 'fails for corrupted data on single relfilenode'); +# Drop corrupt table again and make sure there is no more corruption +$node->start; +$node->safe_psql('postgres', 'DROP TABLE corrupt1;'); +$node->stop; +$node->command_ok(['pg_verify_checksums', '-D', $pgdata], +'succeeds again: '.$node->data_dir); + +# Create table to corrupt in a non-default tablespace and get its relfilenode +my $tablespace_dir = $node->data_dir."/../ts_corrupt_dir"; +mkdir ($tablespace_dir); +$node->start; +$node->safe_psql('postgres', "CREATE TABLESPACE ts_corrupt LOCATION '".$tablespace_dir."';"); +$relfilenode_corrupted = create_corruption($node, 'corrupt2', 'ts_corrupt'); +$node->command_checks_all([ 'pg_verify_checksums', '-D', $pgdata], + 1, + [qr/Bad checksums:.*1/], + [qr/checksum verification failed/], + 'fails with corrupted data in
Re: pgsql: Add TAP tests for pg_verify_checksums
On Mon, Nov 19, 2018 at 07:11:19PM +0100, Michael Banck wrote: > First off, I think those fail_corrupt files should have different > filenames than the empty ones above (I left `9_vm.123' as an > example of a duplicated file). A comment about that is a good idea. So added. > So either we remove or truncate the files again after > $node->command_checks_all(), or the tests use the relfilenode > feature. > > However, in the latter case we would have to use a different main > relfilenode ID for each subtest, as otherwise pg_verify_checksums would > again fail on the first file it scans. The take here is to make sure that the correct file name is showing up in the produced error message, and removing the files makes future test more robust, so I have committed a version which removes the files instead. -- Michael signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
Hi, On Sun, Oct 14, 2018 at 10:59:02AM +0900, Michael Paquier wrote: > On Sat, Oct 13, 2018 at 05:53:00PM -0400, Andrew Dunstan wrote: > > It occurred to me that a pretty simple fix could just be to blacklist > > everything that didn't start with a digit. The whitelist approach is > > probably preferable... depends how urgent we see this as. > > Yeah, possibly, still that's not a correct long-term fix. So attached > is what I think we should do for HEAD and REL_11_STABLE with an approach > using a whitelist. I have added positive and negative tests on top of > the existing TAP tests, as suggested by Michael B, and I made the code > use relpath.h to make sure that we don't miss any fork types. I was looking at extending the testsuite for the online checksum verification feature as requested by Fabien, and it seems some of the additional tests are not working as intended: > diff --git a/src/bin/pg_verify_checksums/t/002_actions.pl > b/src/bin/pg_verify_checksums/t/002_actions.pl > index dc29da09af..33408ced89 100644 > --- a/src/bin/pg_verify_checksums/t/002_actions.pl > +++ b/src/bin/pg_verify_checksums/t/002_actions.pl [...] > +append_to_file "$pgdata/global/9_vm.123", ""; > + > # Checksums pass on a newly-created cluster > command_ok(['pg_verify_checksums', '-D', $pgdata], > "succeeds with offline cluster"); > @@ -67,3 +89,30 @@ $node->command_checks_all([ 'pg_verify_checksums', '-D', > $pgdata, '-r', > [qr/Bad checksums:.*1/], > [qr/checksum verification > failed/], > 'fails for corrupted data on > single relfilenode'); > + > +# Utility routine to check that pg_verify_checksums is correctly > +# able to detect correctly-shaped relation files with some data. > +sub fail_corrupt > +{ > + my $node = shift; > + my $file = shift; > + my $pgdata = $node->data_dir; > + > + # Create the file with some data in it. > + append_to_file "$pgdata/global/$file", "foo"; > + > + $node->command_checks_all([ 'pg_verify_checksums', '-D', $pgdata], > + 1, > + [qr/^$/], > + [qr/could not read block/], > + "fails for corrupted data in > $file"); > +} > + > +fail_corrupt($node, "9"); > +fail_corrupt($node, "9.123"); > +fail_corrupt($node, "9_fsm"); > +fail_corrupt($node, "9_init"); > +fail_corrupt($node, "9_vm"); > +fail_corrupt($node, "9_init.123"); > +fail_corrupt($node, "9_fsm.123"); > +fail_corrupt($node, "9_vm.123"); First off, I think those fail_corrupt files should have different filenames than the empty ones above (I left `9_vm.123' as an example of a duplicated file). Second, and this is more severe, the subroutine never removes those files so in practise, we are checking the first (or possibly some randon other) file instead of the one we think we are looking at (or at least what I think the intention is). This can be shown if you expand the regexp from `qr/could not read block/' to `qr/could not read block 0 in file.*$file\":/': |t/002_actions.pl .. 6/38 |# Failed test 'fails for corrupted data in 9.123 stderr /(?^:could |# not read block 0 in file.*9.123\":)/' |# at t/002_actions.pl line 109. |# 'pg_verify_checksums: could not read block 0 in file |# "[...]src/bin/pg_verify_checksums/tmp_check/t_002_actions_node_checksum_data/pgdata/global/9": |# read 3 of 8192 |# ' |# doesn't match '(?^:could not read block 0 in file.*9.123\":)' So either we remove or truncate the files again after $node->command_checks_all(), or the tests use the relfilenode feature. However, in the latter case we would have to use a different main relfilenode ID for each subtest, as otherwise pg_verify_checksums would again fail on the first file it scans. So I am proposing something like the attached. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.ba...@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz diff --git a/src/bin/pg_verify_checksums/t/002_actions.pl b/src/bin/pg_verify_checksums/t/002_actions.pl index 91adc7fef1..96fbd96e4f 100644 --- a/src/bin/pg_verify_checksums/t/002_actions.pl +++ b/src/bin/pg_verify_checksums/t/002_actions.pl @@ -100,16 +100,17 @@ $node->command_checks_all([ 'pg_verify_checksums', '-D', $pgdata, '-r', sub fail_corrupt { my $node = shift; + my $relfilenode = shift; my $file = shift; my
Re: pgsql: Add TAP tests for pg_verify_checksums
On Sat, Oct 20, 2018 at 02:03:32AM -0400, Stephen Frost wrote: > On Sat, Oct 20, 2018 at 01:11 Andres Freund wrote: >> or the new checksum validation logic complaining about them, and such >>> when doing backups and I wonder if that is because people simply don’t >>> use the two together much, making me wonder how much of an issue this >>> really is or would be with the account-for-everything approach I’ve >>> been advocating for. >> >> I mean obviously pg_verify_checksum simply hasn't been actually tested >> much with plain postgres without extensions, given the all weaknesses >> identified in this thread. > > No, it hasn’t, but pg_basebackup has been around quite a while and has > always copied everything, as best as I can recall anyway. At this point, let's create a new thread with a description of what has been discussed and what we'd like to do for HEAD and v11. I got something in mind which would result in a minimal patch. Let's start from that. -- Michael signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
Greetings, On Sat, Oct 20, 2018 at 01:11 Andres Freund wrote: > Hi, > > On 2018-10-20 01:07:43 -0400, Stephen Frost wrote: > > I have to say that I can’t recall hearing much in the way of complaints > > about pg_basebackup copying all the random cstore files > > Why would somebody complain about that? It's usually desirable. Even though they’re as likely as not to be invalid or corrupted..? Maybe things have moved forward here, I know there’s been discussion about it, but last I heard those files weren’t WAL’d and therefore the result of copying them from a running server was indeterminate. Yes, sometimes they’ll be fine, but you could say the same about regular PG relations too and yet we certainly wouldn’t be accepting of that. It certainly seems reasonable that people would complain about pg_basebackup misperforming when a backup that it did results in an invalid restore, though it tends to be a lot rarer to get complaints about partial failures like a corrupt or partial file being copied during a backup- but then that’s part of why we stress so much about trying to make sure we don’t do that as it can be hard to detect. People certainly did complain about unlogged tables being backed up and that was just because they took up space in the backup and time on the backup and restore just to be nuked when the server is started. > or the new checksum validation logic complaining about them, and such > > when doing backups and I wonder if that is because people simply don’t > > use the two together much, making me wonder how much of an issue this > > really is or would be with the account-for-everything approach I’ve > > been advocating for. > > I mean obviously pg_verify_checksum simply hasn't been actually tested > much with plain postgres without extensions, given the all weaknesses > identified in this thread. No, it hasn’t, but pg_basebackup has been around quite a while and has always copied everything, as best as I can recall anyway. Thanks, Stephen >
Re: pgsql: Add TAP tests for pg_verify_checksums
Greetings, On Sat, Oct 20, 2018 at 01:16 Andres Freund wrote: > Hi, > > On 2018-10-20 00:25:19 -0400, Stephen Frost wrote: > > If we are going to decide that we only deal with files in our directories > > matching these whitelisted patterns, then shouldn’t we apply similar > logic > > in things like DROP DATABASE and any other cases where we perform actions > > in a recursive manner across our database and table space directories? > > I'm honestly not sure if you're just trolling at this point. Why would > anybody reasonable be concerned about DROP DATABASE dropping the > directory for the database? You're not honestly suggesting that anybody > would write an extension or anything like that that stores data in the > wrong database's directory, right? Other iterations like fsyncing > files, copying the entire template database directory, etc are similarly > harmless or positive. No, I’m not trolling, what I was trying to do is make the point that this is moving us away from having a very clear idea of what’s PG’s responsibility and what we feel comfortable operating on and this new half-and-half stance where we’ll happily nuke files in a directory that we didn’t create, and back them up even if we have no idea if they’ll be consistent at all or not when restored, but we won’t try to check the checksums on them or do some other set of operations on them. I suspect it’s pretty clear already, but just to make it plain, I really don’t like the half-and-half approach and it seems we’re being backed into this because it happens to fit some specific cases and not because there was any real thought or design put into supporting these use cases or being able to extend PG in this way. I do also see real risks with a whitelisting kind of approach that we end up missing things, and I get that you don’t see that risk but just stating that doesn’t change my opinion on it. Based on what you said up thread this whole thing also seems like it may be just going away anyway- because there’s real design being done to do this properly and allow PG to be extended in a way that we will know about what files are associated with what extensions or storage mechanisms and that seems like just another reason why we shouldn’t be moving to explicitly support random files being dropped into PG directories. Thanks, Stephen
Re: pgsql: Add TAP tests for pg_verify_checksums
Hi, On 2018-10-20 00:25:19 -0400, Stephen Frost wrote: > If we are going to decide that we only deal with files in our directories > matching these whitelisted patterns, then shouldn’t we apply similar logic > in things like DROP DATABASE and any other cases where we perform actions > in a recursive manner across our database and table space directories? I'm honestly not sure if you're just trolling at this point. Why would anybody reasonable be concerned about DROP DATABASE dropping the directory for the database? You're not honestly suggesting that anybody would write an extension or anything like that that stores data in the wrong database's directory, right? Other iterations like fsyncing files, copying the entire template database directory, etc are similarly harmless or positive. Greetings, Andres Freund
Re: pgsql: Add TAP tests for pg_verify_checksums
Hi, On 2018-10-20 01:07:43 -0400, Stephen Frost wrote: > I have to say that I can’t recall hearing much in the way of complaints > about pg_basebackup copying all the random cstore files Why would somebody complain about that? It's usually desirable. > or the new checksum validation logic complaining about them, and such > when doing backups and I wonder if that is because people simply don’t > use the two together much, making me wonder how much of an issue this > really is or would be with the account-for-everything approach I’ve > been advocating for. I mean obviously pg_verify_checksum simply hasn't been actually tested much with plain postgres without extensions, given the all weaknesses identified in this thread. Greetings, Andres Freund
Re: pgsql: Add TAP tests for pg_verify_checksums
Greetings, On Sat, Oct 20, 2018 at 00:58 Michael Paquier wrote: > On Sat, Oct 20, 2018 at 12:41:04AM -0400, Stephen Frost wrote: > > I’d also like to give David Steele a chance to comment on the specific > API, > > and any other backup tools authors, which I don’t think we should be > > rushing into anyway and I would think we’d only put into master.. > > By the way, we need to do something for the checksum verification code > in base backups for v11 as well. If you enable checksums and take a > base backup of a build with EXEC_BACKEND, then this creates spurious > checksums failures. That's a bug. So while I agree that having a > larger robust API is fine for HEAD, I would most likely not back-patch > it. This is why I would suggest as a first step for HEAD and v11 to use > a whitelist for base backups, to check for temporary tablespaces in > pg_verify_checksums, to move isRelFileName into src/common/ and to keep > the change minimalistic. I’m all for keeping the changes which are backpatched minimal, which updating the blacklist as your original patch on this thread did would certainly be. Even adding in the logic to skip temp files as pg_basebackup has would be simpler and based on existing well-tested and extensively used code, unlike this new pattern-based whitelist of files approach. I have to say that I can’t recall hearing much in the way of complaints about pg_basebackup copying all the random cstore files, or the new checksum validation logic complaining about them, and such when doing backups and I wonder if that is because people simply don’t use the two together much, making me wonder how much of an issue this really is or would be with the account-for-everything approach I’ve been advocating for. Thanks! Stephen >
Re: pgsql: Add TAP tests for pg_verify_checksums
On Sat, Oct 20, 2018 at 12:41:04AM -0400, Stephen Frost wrote: > I’d also like to give David Steele a chance to comment on the specific API, > and any other backup tools authors, which I don’t think we should be > rushing into anyway and I would think we’d only put into master.. By the way, we need to do something for the checksum verification code in base backups for v11 as well. If you enable checksums and take a base backup of a build with EXEC_BACKEND, then this creates spurious checksums failures. That's a bug. So while I agree that having a larger robust API is fine for HEAD, I would most likely not back-patch it. This is why I would suggest as a first step for HEAD and v11 to use a whitelist for base backups, to check for temporary tablespaces in pg_verify_checksums, to move isRelFileName into src/common/ and to keep the change minimalistic. -- Michael signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
Greetings, On Sat, Oct 20, 2018 at 00:43 Michael Paquier wrote: > On Sat, Oct 20, 2018 at 12:25:19AM -0400, Stephen Frost wrote: > > On Fri, Oct 19, 2018 at 23:40 Michael Paquier > wrote: > >> - I agree with not doing a simple revert to not turn the buildfarm red > >> again. This is annoying for animal maintainers. Andrew has done a very > >> nice work in disabling manually those tests temporarily. > > > > This is a red herring, and always was, so I’m rather unimpressed at how > it > > keeps coming up- no, I’m not advocating that we should just make the > build > > farm red and just leave it that way. Yes, we should fix this case, and > fix > > pg_basebackup, and maybe even try to add some regression tests which test > > this exact same case in pg_basebackup, but making the build farm green is > > *not* the only thing we should care about. > > Well, the root of the problem was that pg_verify_checksums has been > committed without any tests on its own. If we had those tests from the > start, then we would not be having this discussion post-release time, > still trying to figure out if whitelisting or blacklisting is > appropriate. > > The validation checksums in base backups has been added in 4eb77d50, a > couple of days before pg_verify_checksums has been introduced. This has > added corruption-related tests in src/bin/pg_basebackup, which is a good > thing. However the feature has been designed so as checksum mismatches > are ignored after 5 failures, which actually *masked* the fact that > EXEC_BACKEND files like CONFIG_EXEC_PARAMS should have been skipped > instead of getting checksum failures. And that's a bad thing. So this > gives in my opinion a good point about using a whitelist. That counter of checksum failures should have been getting reset between files.. that’s certainly what I had understood was intended. The idea of the counter is to not flood the log with errors when a file is discovered that’s full of checksum failures (as can happen if large chunks of the file got replaced with something else, for example), but it should be reporting on each file that does have failures in it. I don’t see how having a whitelist changes that or would have impacted that logic to make it correct initially or not. I’m also trying to understand how this checksum logging limit is getting hit in the tests but they’re passing..? If this was an intended failure check then surely there’s a test done that’s intended to be successful and it should have complained about this file too, or perhaps that’s what’s missing. I’m happy to look into this later this weekend. Certainly seems like something here isn’t really making sense tho. Thanks! Stephen
Re: pgsql: Add TAP tests for pg_verify_checksums
On Sat, Oct 20, 2018 at 12:41:04AM -0400, Stephen Frost wrote: > I’d also like to give David Steele a chance to comment on the specific API, > and any other backup tools authors, which I don’t think we should be > rushing into anyway and I would think we’d only put into master.. Getting David input would be nice! -- Michael signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
On Sat, Oct 20, 2018 at 12:25:19AM -0400, Stephen Frost wrote: > On Fri, Oct 19, 2018 at 23:40 Michael Paquier wrote: >> - I agree with not doing a simple revert to not turn the buildfarm red >> again. This is annoying for animal maintainers. Andrew has done a very >> nice work in disabling manually those tests temporarily. > > This is a red herring, and always was, so I’m rather unimpressed at how it > keeps coming up- no, I’m not advocating that we should just make the build > farm red and just leave it that way. Yes, we should fix this case, and fix > pg_basebackup, and maybe even try to add some regression tests which test > this exact same case in pg_basebackup, but making the build farm green is > *not* the only thing we should care about. Well, the root of the problem was that pg_verify_checksums has been committed without any tests on its own. If we had those tests from the start, then we would not be having this discussion post-release time, still trying to figure out if whitelisting or blacklisting is appropriate. The validation checksums in base backups has been added in 4eb77d50, a couple of days before pg_verify_checksums has been introduced. This has added corruption-related tests in src/bin/pg_basebackup, which is a good thing. However the feature has been designed so as checksum mismatches are ignored after 5 failures, which actually *masked* the fact that EXEC_BACKEND files like CONFIG_EXEC_PARAMS should have been skipped instead of getting checksum failures. And that's a bad thing. So this gives in my opinion a good point about using a whitelist. -- Michael signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
Greetings, On Sat, Oct 20, 2018 at 00:31 Michael Paquier wrote: > On Fri, Oct 19, 2018 at 08:50:04PM -0700, Andres Freund wrote: > > On 2018-10-20 12:39:55 +0900, Michael Paquier wrote: > >> So what I think we ought to do is the following: > >> - Start a new thread, this one about TAP tests is not adapted. > >> - Add in src/common/relpath.c the API from d55241af called > >> isRelFileName(), make use of it in the base backup code, and basically > >> remove is_checksummed_file() and the checksum skip list. > > > > I think it probably shouldn't quite be that as an API. The code should > > not just check whether the file matches a pattern, but also importantly > > needs to exclude files that are in a temp tablespace. isRelFileName() > > doesn't quite describe an API meaning that. I assume we should keep > > something like isRelFileName() but use an API ontop of that that also > > exclude temp files / relations. > > From what I can see we would need to check for a couple of patterns if > we go to this extent: > - Look for PG_TEMP_FILE_PREFIX and exclude those. > - looks_like_temp_rel_name() which checks for temp files names. This is > similar to isRelFileName except that it works on temporary files. > Moving it to relpath.c and renaming it IsTempRelFileName is an idea. > But this one would not be necessary as isRelFileName discards temporary > relations, no? > - parse_filename_for_nontemp_relation() is also too similar to > isRelFileName(). > > At the end, do we really need to do anything more than adding some > checks on PG_TEMP_FILE_PREFIX? I am not sure that there is much need > for a global API like isChecksummedFile for only those two places. I > have already a patch doing the work of moving isRelFileName() into > src/common/. Adding one check on PG_TEMP_FILE_PREFIX is not much work > on top of it. I’m not at my computer at the moment so may not be entirely following the question here but to be clear, whatever we do here will have downstream impact into other tools, such as pgbackrest, and therefore we definitely want to have the code in libpgcommon so that these external tools can leverage it and know that they’re doing what PG does. I’d also like to give David Steele a chance to comment on the specific API, and any other backup tools authors, which I don’t think we should be rushing into anyway and I would think we’d only put into master.. Thanks! Stephen
Re: pgsql: Add TAP tests for pg_verify_checksums
Hi, On 2018-10-20 13:30:46 +0900, Michael Paquier wrote: > On Fri, Oct 19, 2018 at 08:50:04PM -0700, Andres Freund wrote: > > On 2018-10-20 12:39:55 +0900, Michael Paquier wrote: > >> So what I think we ought to do is the following: > >> - Start a new thread, this one about TAP tests is not adapted. > >> - Add in src/common/relpath.c the API from d55241af called > >> isRelFileName(), make use of it in the base backup code, and basically > >> remove is_checksummed_file() and the checksum skip list. > > > > I think it probably shouldn't quite be that as an API. The code should > > not just check whether the file matches a pattern, but also importantly > > needs to exclude files that are in a temp tablespace. isRelFileName() > > doesn't quite describe an API meaning that. I assume we should keep > > something like isRelFileName() but use an API ontop of that that also > > exclude temp files / relations. > > From what I can see we would need to check for a couple of patterns if > we go to this extent: > - Look for PG_TEMP_FILE_PREFIX and exclude those. > - looks_like_temp_rel_name() which checks for temp files names. This is > similar to isRelFileName except that it works on temporary files. > Moving it to relpath.c and renaming it IsTempRelFileName is an idea. > But this one would not be necessary as isRelFileName discards temporary > relations, no? I think it's not good to have those necessary intermingled in an exposed function. > At the end, do we really need to do anything more than adding some > checks on PG_TEMP_FILE_PREFIX? I think we also need the exclusion list basebackup.c has that generally skips files. They might be excluded anyway, but I think it'd be safer to make sure. > I am not sure that there is much need for a global API like > isChecksummedFile for only those two places. Seems likely that other tools would want to have access too. Greetings, Andres Freund
Re: pgsql: Add TAP tests for pg_verify_checksums
On Fri, Oct 19, 2018 at 08:50:04PM -0700, Andres Freund wrote: > On 2018-10-20 12:39:55 +0900, Michael Paquier wrote: >> So what I think we ought to do is the following: >> - Start a new thread, this one about TAP tests is not adapted. >> - Add in src/common/relpath.c the API from d55241af called >> isRelFileName(), make use of it in the base backup code, and basically >> remove is_checksummed_file() and the checksum skip list. > > I think it probably shouldn't quite be that as an API. The code should > not just check whether the file matches a pattern, but also importantly > needs to exclude files that are in a temp tablespace. isRelFileName() > doesn't quite describe an API meaning that. I assume we should keep > something like isRelFileName() but use an API ontop of that that also > exclude temp files / relations. From what I can see we would need to check for a couple of patterns if we go to this extent: - Look for PG_TEMP_FILE_PREFIX and exclude those. - looks_like_temp_rel_name() which checks for temp files names. This is similar to isRelFileName except that it works on temporary files. Moving it to relpath.c and renaming it IsTempRelFileName is an idea. But this one would not be necessary as isRelFileName discards temporary relations, no? - parse_filename_for_nontemp_relation() is also too similar to isRelFileName(). At the end, do we really need to do anything more than adding some checks on PG_TEMP_FILE_PREFIX? I am not sure that there is much need for a global API like isChecksummedFile for only those two places. I have already a patch doing the work of moving isRelFileName() into src/common/. Adding one check on PG_TEMP_FILE_PREFIX is not much work on top of it. -- Michael signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
Greetings, On Fri, Oct 19, 2018 at 23:40 Michael Paquier wrote: > On Fri, Oct 19, 2018 at 06:43:18PM -0400, Tom Lane wrote: > > Andrew Dunstan writes: > >> I don't think just reverting it is really acceptable. > > > > +several. I do not mind somebody writing and installing a better fix. > > I do object to turning the buildfarm red again. > > I did not expect this thread to turn into a war zone. Anyway, there are > a couple of things I agree with on this thread: > - I agree with Andres point here: > https://postgr.es/m/20181019171747.4uithw2sjkt6m...@alap3.anarazel.de > A blacklist is fundamentally more difficult to maintain as there are > way more things added in a data folder which do not have data checksums > than things which have checksums. So using a blacklist approach looks > unmaintainable in the long term. Future patches related to enabling > online checksum verification make me worry if we keep the code like > that. I can also easily imagine that anybody willing to use the > pluggable storage API would like to put new files in tablespace-related > data folders, relying on "base/" being the default system tablespace > If we are going to decide that we only deal with files in our directories matching these whitelisted patterns, then shouldn’t we apply similar logic in things like DROP DATABASE and any other cases where we perform actions in a recursive manner across our database and table space directories? Should we really be removing arbitrary files that we know nothing about, after all? What about pg_basebackup? Shall we update it to only stream through files matching these patterns as those are the only files we consider ourselves to be aware of? I don’t buy off on any argument that presents pluggable storage as not including some way for us to track and be aware of what files are associated with that pluggable storage mechanism and which of those files have checksums and how to verify them if they do. In other words, I sure hope we don’t accept an approach like cstore *fdw* uses for pluggable storage where the core system has no idea whatsoever about what these random files dropped into our tablespace directories are. - I agree with Stephen's point that we should decide if a file has > checksums or not in a single place, and that we should use the same > logic for base backups and pg_verify_checksums. Despite it being a lot of the discussion, I don’t think there was ever disagreement on this point. - I agree with not doing a simple revert to not turn the buildfarm red > again. This is annoying for animal maintainers. Andrew has done a very > nice work in disabling manually those tests temporarily. This is a red herring, and always was, so I’m rather unimpressed at how it keeps coming up- no, I’m not advocating that we should just make the build farm red and just leave it that way. Yes, we should fix this case, and fix pg_basebackup, and maybe even try to add some regression tests which test this exact same case in pg_basebackup, but making the build farm green is *not* the only thing we should care about. - The base backup logic deciding if a file has checksums looks broken to > me: it misses files generated by EXEC_BACKEND, and any instance of > Postgres using an extension with custom files and data checksums has its > backups broken. cstore_fdw has been mentioned above, and I recall that > Heroku for example enables data checksums. If you combine both, it > basically means that such an instance cannot take base backups anymore > while it was able to do so with pre-10 with default options. That's not > cool. This is incorrect, pg_basebackup will still back up and keep files which fail checksum checks- logic which David Steele and I pushed for when the checksumming logic was added to pg_basebackup, as I recall, but it’ll complain and warn about these files ... As it *should*, even if we weren’t doing checksum checks, because these are random files that have been dropped into a PG directory that PG doesn’t know anything about, which aren’t handled through WAL and therefore there’s no way to know if they’ll be at all valid when they’re copied, or that backing them up is at all useful. Backups are absolutely important and we wouldn’t want a backup to be aborted or failed due to checksum failures, in general, but the user should be made aware of these failures. This approach of only taking responsibility for the files we know of through these patterns could also imply that we shouldn’t back up in pg_basebackup files that don’t match- but that strikes me as another similarly risky approach as any mistake in this whitelist of known files could then result in an inconsistent backup that’s missing things that should have been included. As for which approach is easier to maintain, I don’t see one as being meaningfully much easier to maintain than the other, code wise, while having these patterns looks to me as having a lot more risk, with the only rather nebulous gain
Re: pgsql: Add TAP tests for pg_verify_checksums
Hi, On 2018-10-20 12:39:55 +0900, Michael Paquier wrote: > - I agree with Stephen's point that we should decide if a file has > checksums or not in a single place, and that we should use the same > logic for base backups and pg_verify_checksums. To be clear, I wholeheartedly agree on that. We probably only want to tackle that for "is checksummable file" in a backpatchable fashion, but we probably should move to having more common infrastructure like that. > So what I think we ought to do is the following: > - Start a new thread, this one about TAP tests is not adapted. > - Add in src/common/relpath.c the API from d55241af called > isRelFileName(), make use of it in the base backup code, and basically > remove is_checksummed_file() and the checksum skip list. I think it probably shouldn't quite be that as an API. The code should not just check whether the file matches a pattern, but also importantly needs to exclude files that are in a temp tablespace. isRelFileName() doesn't quite describe an API meaning that. I assume we should keep something like isRelFileName() but use an API ontop of that that also exclude temp files / relations. Greetings, Andres Freund Date: Fri, 19 Oct 2018 20:48:02 -0700 Message-ID: <87in1xmg7h@alap4.lan>
Re: pgsql: Add TAP tests for pg_verify_checksums
On Fri, Oct 19, 2018 at 06:43:18PM -0400, Tom Lane wrote: > Andrew Dunstan writes: >> I don't think just reverting it is really acceptable. > > +several. I do not mind somebody writing and installing a better fix. > I do object to turning the buildfarm red again. I did not expect this thread to turn into a war zone. Anyway, there are a couple of things I agree with on this thread: - I agree with Andres point here: https://postgr.es/m/20181019171747.4uithw2sjkt6m...@alap3.anarazel.de A blacklist is fundamentally more difficult to maintain as there are way more things added in a data folder which do not have data checksums than things which have checksums. So using a blacklist approach looks unmaintainable in the long term. Future patches related to enabling online checksum verification make me worry if we keep the code like that. I can also easily imagine that anybody willing to use the pluggable storage API would like to put new files in tablespace-related data folders, relying on "base/" being the default system tablespace - I agree with Stephen's point that we should decide if a file has checksums or not in a single place, and that we should use the same logic for base backups and pg_verify_checksums. - I agree with not doing a simple revert to not turn the buildfarm red again. This is annoying for animal maintainers. Andrew has done a very nice work in disabling manually those tests temporarily. - The base backup logic deciding if a file has checksums looks broken to me: it misses files generated by EXEC_BACKEND, and any instance of Postgres using an extension with custom files and data checksums has its backups broken. cstore_fdw has been mentioned above, and I recall that Heroku for example enables data checksums. If you combine both, it basically means that such an instance cannot take base backups anymore while it was able to do so with pre-10 with default options. That's not cool. So what I think we ought to do is the following: - Start a new thread, this one about TAP tests is not adapted. - Add in src/common/relpath.c the API from d55241af called isRelFileName(), make use of it in the base backup code, and basically remove is_checksummed_file() and the checksum skip list. -- Michael signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
Andrew Dunstan writes: > I don't think just reverting it is really acceptable. +several. I do not mind somebody writing and installing a better fix. I do object to turning the buildfarm red again. regards, tom lane
Re: pgsql: Add TAP tests for pg_verify_checksums
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2018-10-19 17:32:58 -0400, Stephen Frost wrote: > > As you pointed out previously, the current code *doesn't* work, before > > or after this change, and we clearly need to rework this to move things > > into libpgcommon and also fix pg_basebackup. Reverting this would at > > least get us back to having similar code between this and pg_basebackup, > > and then it'll be cleaner and clearer to have one patch which moves that > > similar logic into libpgcommon and fixes the missing exceptions for the > > EXEC_BACKEND case. > > > > Keeping the patch doesn't do anything for the pg_basebackup case, and > > confuses the issue by having these filename-pattern-whitelists which > > weren't there before and that should be removed, imv. > > The current state has pg_verify_checksum work on windows for casual > usage, which includes the regression tests that previously were broken. > Whereas reverting it has the advantage that a fixup diff would be a few > lines smaller. If you want to push a revert together with a fix, be my > guest, but until that time it seems unhelpful to revert. Clearly, pg_basebackup *doesn't* work though, so it's at best only half a fix and only because our regression tests don't cover the pg_basebackup case (which is a sad state of affairs, to say the least, but an independent issue). I'm not in any specific rush on this and I hope to hear Michael's thoughts once he's had a chance to review our discussion; it's his commit, after all. Michael's initial patch is in the archives also, so my suggestion would be to revert this one and then take Michael's prior patch and add to it the fix of pg_basebackup, in the same manner, and commit those changes together with additional comments to note that the code exists in both places. We could then work on moving the logic into libpgcommon, in master. Thanks! Stephen signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
On 10/19/2018 05:32 PM, Stephen Frost wrote: As you pointed out previously, the current code *doesn't* work, before or after this change, and we clearly need to rework this to move things into libpgcommon and also fix pg_basebackup. Reverting this would at least get us back to having similar code between this and pg_basebackup, and then it'll be cleaner and clearer to have one patch which moves that similar logic into libpgcommon and fixes the missing exceptions for the EXEC_BACKEND case. Keeping the patch doesn't do anything for the pg_basebackup case, and confuses the issue by having these filename-pattern-whitelists which weren't there before and that should be removed, imv. I don't think just reverting it is really acceptable. The patch was a response to buildfarm breakage, and moreover was published and discussed before it was applied. If you don't like it I think you need to publish a better solution that will not involve reintroducing the buildfarm error. I don't have a strong opinion about the mechanism. The current conversation does seem to me to be generating more heat than light, TBH. But I do have a strong opinion about not having to enable/disable the TAP test in question constantly. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgsql: Add TAP tests for pg_verify_checksums
Hi, On 2018-10-19 17:32:58 -0400, Stephen Frost wrote: > As you pointed out previously, the current code *doesn't* work, before > or after this change, and we clearly need to rework this to move things > into libpgcommon and also fix pg_basebackup. Reverting this would at > least get us back to having similar code between this and pg_basebackup, > and then it'll be cleaner and clearer to have one patch which moves that > similar logic into libpgcommon and fixes the missing exceptions for the > EXEC_BACKEND case. > > Keeping the patch doesn't do anything for the pg_basebackup case, and > confuses the issue by having these filename-pattern-whitelists which > weren't there before and that should be removed, imv. The current state has pg_verify_checksum work on windows for casual usage, which includes the regression tests that previously were broken. Whereas reverting it has the advantage that a fixup diff would be a few lines smaller. If you want to push a revert together with a fix, be my guest, but until that time it seems unhelpful to revert. Greetings, Andres Freund
Re: pgsql: Add TAP tests for pg_verify_checksums
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2018-10-19 16:35:46 -0400, Stephen Frost wrote: > > The only justification for *not* doing this is that some extension > > author might have dumped files into our tablespace directory, something > > we've never claimed to support nor generally worried about in all the > > time that I can recall before this. > > No, it's really not the only reason. As I said, testing is much less > likely to find cases where we're checksumming a short-lived file even > though we shouldn't, than making sure that checksummed files are > actually checksummed. While I agree that we might possibly end up trying to checksum a short-lived and poorly-named file, I'd rather that than risk missing the checking of a file which does have checksums that should have been checked and claiming that we checked all of the checksums. > > > > What about two different extensions wanting to create files with the > > > > same names in the tablespace directories..? > > > > > > > > Experimentation is fine, of course, this is open source code and people > > > > should feel free to play with it, but we are not obligated to avoid > > > > breaking things when an extension author, through their experimentation, > > > > does something which is clearly not supported, like dropping files into > > > > PG's tablespace directories. Further, when it comes to our user's data, > > > > we should be taking a strict approach and accounting for everything, > > > > something that this whitelist-of-patterns-based approach to finding > > > > files to verify the checksums on doesn't do. > > > > > > It's not economically feasible to work on extensions that will only be > > > usable a year down the road. > > > > I listed out multiple other solutions to this problem which you > > summarily ignored. > > Except that none of them really achieves the goals you can achieve by > having the files in the database (like DROP DATABASE working, for > starters). The only reason things like DROP DATABASE "work" in this example case is exactly that it assumes that all of the files which exist are ones which we put there. Or, to put it another way, if we're only going to checksum files based on some whitelist of files we expect to be there, shouldn't we go back and make things like DROP DATABASE only remove those files that we expect to be there and not random other files that we have no knowledge of..? > > It's unfortunate that those other solutions weren't used and that, > > instead, this extension decided to drop files into PG's tablespace > > directories, but that doesn't mean we should condone or implicitly > > support that action. > > This just seems pointlessly rigid. Our extension related APIs aren't > generally very good or well documented. Most non-trivial extensions that > add interesting things to the postgres ecosystem are going to need a few > not so pretty things to get going. PostGIS is a fantastic example of an extension that is far from trivial, extends PG in ways which are clearly supported, and has worked with PG to improve things in core to be able to then use in the extension. > > I stand by my position that this patch should be reverted (with no > > offense or ill-will towards Michael, of course, I certainly appreciate > > his efforts to address the issues with pg_verify_checksums) and that we > > should move more of this code to identify files to verify the checksums > > on into libpgcommon and then use it from both pg_basebackup and > > pg_verify_checksums, to the extent possible, but that we make sure to > > account for all of the files in our tablespace and database directories. > > What does that do, except break things that currently work? Sure, work > on a patch that fixes the architectural concerns, but what's the point > in reverting it until that's done? As you pointed out previously, the current code *doesn't* work, before or after this change, and we clearly need to rework this to move things into libpgcommon and also fix pg_basebackup. Reverting this would at least get us back to having similar code between this and pg_basebackup, and then it'll be cleaner and clearer to have one patch which moves that similar logic into libpgcommon and fixes the missing exceptions for the EXEC_BACKEND case. Keeping the patch doesn't do anything for the pg_basebackup case, and confuses the issue by having these filename-pattern-whitelists which weren't there before and that should be removed, imv. Thanks! Stephen signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
Hi, On 2018-10-19 16:35:46 -0400, Stephen Frost wrote: > The only justification for *not* doing this is that some extension > author might have dumped files into our tablespace directory, something > we've never claimed to support nor generally worried about in all the > time that I can recall before this. No, it's really not the only reason. As I said, testing is much less likely to find cases where we're checksumming a short-lived file even though we shouldn't, than making sure that checksummed files are actually checksummed. > > > Playing this further and assuming that extensions dropping files into > > > tablespace directories is acceptable, what are we supposed to do when > > > there's some direct conflict between a file that we want to create in a > > > PG tablespace directory and a file that some extension dropped there? > > > Create yet another subdirectory which we call > > > "THIS_IS_REALLY_ONLY_FOR_PG"? > > > > Then it's a buggy extension. And we error out. > > Extensions writing into directories they shouldn't be makes them buggy > even if the core code isn't likely to write to a particular filename, > imv. I'll just stop talking to you about this for now. > > > What about two different extensions wanting to create files with the > > > same names in the tablespace directories..? > > > > > > Experimentation is fine, of course, this is open source code and people > > > should feel free to play with it, but we are not obligated to avoid > > > breaking things when an extension author, through their experimentation, > > > does something which is clearly not supported, like dropping files into > > > PG's tablespace directories. Further, when it comes to our user's data, > > > we should be taking a strict approach and accounting for everything, > > > something that this whitelist-of-patterns-based approach to finding > > > files to verify the checksums on doesn't do. > > > > It's not economically feasible to work on extensions that will only be > > usable a year down the road. > > I listed out multiple other solutions to this problem which you > summarily ignored. Except that none of them really achieves the goals you can achieve by having the files in the database (like DROP DATABASE working, for starters). > It's unfortunate that those other solutions weren't used and that, > instead, this extension decided to drop files into PG's tablespace > directories, but that doesn't mean we should condone or implicitly > support that action. This just seems pointlessly rigid. Our extension related APIs aren't generally very good or well documented. Most non-trivial extensions that add interesting things to the postgres ecosystem are going to need a few not so pretty things to get going. > I stand by my position that this patch should be reverted (with no > offense or ill-will towards Michael, of course, I certainly appreciate > his efforts to address the issues with pg_verify_checksums) and that we > should move more of this code to identify files to verify the checksums > on into libpgcommon and then use it from both pg_basebackup and > pg_verify_checksums, to the extent possible, but that we make sure to > account for all of the files in our tablespace and database directories. What does that do, except break things that currently work? Sure, work on a patch that fixes the architectural concerns, but what's the point in reverting it until that's done? > To the extent that this is an issue for extension authors, perhaps it > would encourage them to work with us to have supported mechanisms > instead of using hacks like dropping files into our tablespace > directories and such instead of using another approach to manage files > across multiple filesystems. I'd be kind of surprised if they really > had an issue doing that and hopefully everyone would feel better about > what these extensions are doing once they start using a mechanism that > we actually support. Haribabu has a patch, but it's on-top of pluggable storage, so not exactly a small prerequisite. Greetings, Andres Freund
Re: pgsql: Add TAP tests for pg_verify_checksums
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2018-10-19 15:57:28 -0400, Stephen Frost wrote: > > Perhaps it doesn't today but surely one goal of pg_verify_checksum is to > > be able to run it on a running cluster eventually. > > I was saying *precisely* that above. I give up. I'm glad we agree on that (I thought we did before, in fact, so it was odd to see it coming up again). > > > pg_basebackup's case is *not* really comparable because basebackup.c > > > already performed a lot of filtering before noChecksumFiles is applied. > > > > This all really just points out that we should have the code for > > handling this somewhere common that both pg_verify_checksum and > > pg_basebackup can leverage without duplicating all of it. > > I never argued against that. My point is that your argument that they > started out the same isn't true. I overstated the relationship, clearly, but it hardly matters, as you say.. > > The specific case that started all of this certainly looks pretty > > clearly like a case that both need to deal with. > > Yep. ... and it's in the part of the code that was actually copied and was the same. > > > FWIW, as far as I can tell, pg_verify_checksum appears to be broken in a > > > lot of (unfortunately) pretty normal scenarios right now. Not just on > > > windows. Besides the narrow window of crashing while a .tmp file is > > > present (and then shutting down normally after a crash restart), it also > > > has the much of wider window of crashing while *any* backend has > > > temporary files/relations. As crash-restarts don't release temp files, > > > they'll still be present after the crash, and a single graceful shutdown > > > afterwards will leave them in place. No? > > > > We do go through and do some cleanup at crash/restart > > We don't clean up temp files tho. > > * NOTE: we could, but don't, call this during a post-backend-crash restart > * cycle. The argument for not doing it is that someone might want to examine > * the temp files for debugging purposes. This does however mean that > * OpenTemporaryFile had better allow for collision with an existing temp > * file name. Then, yes, we should go through and fix pg_verify_checksums to work correctly in this case, likely using more-or-less the same code that pg_basebackup has. After that, we can add the remaining code to check the last checkpoint and skip pages which have a more recent LSN... > > As you say, some of those are likely to have checksums, which should be > > handled by pg_basebackup and pg_verify_checksums, and that goes back to > > the point I was making up-thread that we want to make sure an account > > for everything. With this pattern-based approach, we could easily end > > up forgetting to add the correct new pattern into pg_verify_checksums. > > If you're adding checksums for something, you better test it I don't buy > this. In contrast it's much more likely that there's a file that's > short-lived that you won't easily test against pg_verify_checksum > running in that moment. The only justification for *not* doing this is that some extension author might have dumped files into our tablespace directory, something we've never claimed to support nor generally worried about in all the time that I can recall before this. As for saying that someone is obviously going to use pg_verify_checksums to check their checksum code- I simply don't agree that we should be happy to rely on that assumption when the only reason for any of this is that some extension decided to do something that wasn't supported and likely has issues in other parts of the code anyway (pg_basebackup would happily copy those files too, even though there's obviously no code for making sure it's a consistent copy or any such in pg_basebackup or core). > > Playing this further and assuming that extensions dropping files into > > tablespace directories is acceptable, what are we supposed to do when > > there's some direct conflict between a file that we want to create in a > > PG tablespace directory and a file that some extension dropped there? > > Create yet another subdirectory which we call > > "THIS_IS_REALLY_ONLY_FOR_PG"? > > Then it's a buggy extension. And we error out. Extensions writing into directories they shouldn't be makes them buggy even if the core code isn't likely to write to a particular filename, imv. > > What about two different extensions wanting to create files with the > > same names in the tablespace directories..? > > > > Experimentation is fine, of course, this is open source code and people > > should feel free to play with it, but we are not obligated to avoid > > breaking things when an extension author, through their experimentation, > > does something which is clearly not supported, like dropping files into > > PG's tablespace directories. Further, when it comes to our user's data, > > we should be taking a strict approach and accounting for everything, > > something that
Re: pgsql: Add TAP tests for pg_verify_checksums
On 2018-10-19 15:57:28 -0400, Stephen Frost wrote: > > > > > Of course- the same is true with a crash/restart case, so I'm not sure > > > > > what you're getting at here. > > > > > > > > pg_verify_checksum doesn't support running on a crashed cluster, and I'm > > > > not sure it'd make sense to teach it to so (there's not really much it > > > > could do to discern whether a block is a torn page that replay will fix, > > > > or corrupted). > > > > > > ... and this isn't at all relevant, because pg_basebackup does run on a > > > running cluster. > > > > I wasn't ever denying that or anything close to it? My point is that > > pg_verify_checksum needs much more filtering than it has now to deal > > with that, because it needs to handle all files that could be present, > > not just files that could be present after a graceful shutdown. > > Perhaps it doesn't today but surely one goal of pg_verify_checksum is to > be able to run it on a running cluster eventually. I was saying *precisely* that above. I give up. > > pg_basebackup's case is *not* really comparable because basebackup.c > > already performed a lot of filtering before noChecksumFiles is applied. > > This all really just points out that we should have the code for > handling this somewhere common that both pg_verify_checksum and > pg_basebackup can leverage without duplicating all of it. I never argued against that. My point is that your argument that they started out the same isn't true. > The specific case that started all of this certainly looks pretty > clearly like a case that both need to deal with. Yep. > > > As pointed out elsewhere on this thread, the logic was the same between > > > the two before this commit... The code in pg_basebackup came from the > > > prior pg_verify_checksums code. Certainly, some mention of the code > > > existing in two places, at least, should have been in the comments. > > > > But the filter for basebackup only comes after the pre-existing > > filtering that the basebackup.c code already does. All of: > > > > /* > > * List of files excluded from backups. > > */ > > static const char *excludeFiles[] = > > { > > /* Skip auto conf temporary file. */ > > PG_AUTOCONF_FILENAME ".tmp", > > > > /* Skip current log file temporary file */ > > LOG_METAINFO_DATAFILE_TMP, > > > > /* Skip relation cache because it is rebuilt on startup */ > > RELCACHE_INIT_FILENAME, > > > > /* > > * If there's a backup_label or tablespace_map file, it belongs to a > > * backup started by the user with pg_start_backup(). It is *not* > > correct > > * for this backup. Our backup_label/tablespace_map is injected into > > the > > * tar separately. > > */ > > BACKUP_LABEL_FILE, > > TABLESPACE_MAP, > > > > "postmaster.pid", > > "postmaster.opts", > > > > /* end of list */ > > NULL > > }; > > > > is not applied, for example. Nor is: > > > > /* Skip temporary files */ > > if (strncmp(de->d_name, > > PG_TEMP_FILE_PREFIX, > > strlen(PG_TEMP_FILE_PREFIX)) == 0) > > continue; > > > > Nor is > > /* Exclude temporary relations */ > > if (isDbDir && looks_like_temp_rel_name(de->d_name)) > > { > > elog(DEBUG2, > > "temporary relation file \"%s\" excluded from > > backup", > > de->d_name); > > > > continue; > > } > > > > So, yea, they had: > > > > static const char *const skip[] = { > > "pg_control", > > "pg_filenode.map", > > "pg_internal.init", > > "PG_VERSION", > > NULL, > > }; > > > > in common, but not more. All the above exclusions are strictly > > necessary. > > ... but all of that code doesn't change that pg_basebackup also needs to > ignore the EXEC_BACKEND created config_exec_params/.new files. Right. > You're right, pg_verify_checksums, with the assumption that it only runs > on a cleanly shut-down cluster, doesn't need the temp-file-skipping > logic, today, but it's going to need it tomorrow, isn't it? No, it needs it today, as explain below in the email you're replaying to. > > FWIW, as far as I can tell, pg_verify_checksum appears to be broken in a > > lot of (unfortunately) pretty normal scenarios right now. Not just on > > windows. Besides the narrow window of crashing while a .tmp file is > > present (and then shutting down normally after a crash restart), it also > > has the much of wider window of crashing while *any* backend has > > temporary files/relations. As crash-restarts don't release temp files, > > they'll still be present after the crash, and a single graceful shutdown > > afterwards will leave them in place. No? > > We do go through and do some cleanup at crash/restart We don't clean up temp files tho. * NOTE: we could, but
Re: pgsql: Add TAP tests for pg_verify_checksums
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2018-10-19 14:47:51 -0400, Stephen Frost wrote: > > * Andres Freund (and...@anarazel.de) wrote: > > > Oh, for crying out loud. Yes, obviously you can create conflicts, nobody > > > ever doubted that. How on earth is that a useful point for anything? The > > > point isn't that it's a good idea for extensions to do so, but that it > > > has happened because we didn't provide better alternative. > > > > The point is that you're making a case because you happen to know of an > > extension which does this, but neither of us know about every extension > > out there and I, at least, don't believe we should be coming up with > > hacks to try and avoid looking at files that some extension has dumped > > into the PG tablespace directories because that's never been a > > documented or supported thing to do. > > It's not a hack if it's a quite defendable choice on its own, and the > presense of such extensions is just one further argument in one > direction (for explicit whitelisting here). You've not convinced me that an extension dropping files into our tablespace directories is either something we should accept or that we should code around such cases, nor that we are doing anything but putting a hack in place to deal with what is pretty clearly a hack in its own right, imv. > > > > Of course- the same is true with a crash/restart case, so I'm not sure > > > > what you're getting at here. > > > > > > pg_verify_checksum doesn't support running on a crashed cluster, and I'm > > > not sure it'd make sense to teach it to so (there's not really much it > > > could do to discern whether a block is a torn page that replay will fix, > > > or corrupted). > > > > ... and this isn't at all relevant, because pg_basebackup does run on a > > running cluster. > > I wasn't ever denying that or anything close to it? My point is that > pg_verify_checksum needs much more filtering than it has now to deal > with that, because it needs to handle all files that could be present, > not just files that could be present after a graceful shutdown. Perhaps it doesn't today but surely one goal of pg_verify_checksum is to be able to run it on a running cluster eventually. > pg_basebackup's case is *not* really comparable because basebackup.c > already performed a lot of filtering before noChecksumFiles is applied. This all really just points out that we should have the code for handling this somewhere common that both pg_verify_checksum and pg_basebackup can leverage without duplicating all of it. The specific case that started all of this certainly looks pretty clearly like a case that both need to deal with. > > As pointed out elsewhere on this thread, the logic was the same between > > the two before this commit... The code in pg_basebackup came from the > > prior pg_verify_checksums code. Certainly, some mention of the code > > existing in two places, at least, should have been in the comments. > > But the filter for basebackup only comes after the pre-existing > filtering that the basebackup.c code already does. All of: > > /* > * List of files excluded from backups. > */ > static const char *excludeFiles[] = > { > /* Skip auto conf temporary file. */ > PG_AUTOCONF_FILENAME ".tmp", > > /* Skip current log file temporary file */ > LOG_METAINFO_DATAFILE_TMP, > > /* Skip relation cache because it is rebuilt on startup */ > RELCACHE_INIT_FILENAME, > > /* >* If there's a backup_label or tablespace_map file, it belongs to a >* backup started by the user with pg_start_backup(). It is *not* > correct >* for this backup. Our backup_label/tablespace_map is injected into > the >* tar separately. >*/ > BACKUP_LABEL_FILE, > TABLESPACE_MAP, > > "postmaster.pid", > "postmaster.opts", > > /* end of list */ > NULL > }; > > is not applied, for example. Nor is: > > /* Skip temporary files */ > if (strncmp(de->d_name, > PG_TEMP_FILE_PREFIX, > strlen(PG_TEMP_FILE_PREFIX)) == 0) > continue; > > Nor is > /* Exclude temporary relations */ > if (isDbDir && looks_like_temp_rel_name(de->d_name)) > { > elog(DEBUG2, >"temporary relation file \"%s\" excluded from > backup", >de->d_name); > > continue; > } > > So, yea, they had: > > static const char *const skip[] = { > "pg_control", > "pg_filenode.map", > "pg_internal.init", > "PG_VERSION", > NULL, > }; > > in common, but not more. All the above exclusions are strictly > necessary. ... but all of that code doesn't change that pg_basebackup also needs to ignore the EXEC_BACKEND created
Re: pgsql: Add TAP tests for pg_verify_checksums
Hi, On 2018-10-19 14:47:51 -0400, Stephen Frost wrote: > * Andres Freund (and...@anarazel.de) wrote: > > On 2018-10-19 14:11:03 -0400, Stephen Frost wrote: > > > * Andres Freund (and...@anarazel.de) wrote: > > > > cstore e.g. does this, and it's already out there. So yes, we should > > > > provide better infrastructure. But for now, we gotta deal with reality, > > > > unless we just want to gratuituously break things. > > > > > > No, I don't agree that we are beholden to an external extension that > > > decided to start writing into directories that clearly belong to PG. > > > > Did it have an alternative? > > Yes, work with us to come up with a way to accomplish what they wanted > without creating unknown/untracked files in our tablespaces. > > Or, have their own mechanism for storing files on other filesystems. > > And likely other options. Saying that we should account for their > putting arbitrary files into the PG tablespace directories because they > "didn't have any other choice" seems pretty ridiculous, imv. > > > > How do we even know that what cstore does in the tablespace directory > > > wouldn't get caught up in the checksum file pattern-matching that this > > > commit put in? > > > > You listen to people? > > > > > What if there was an extension which did create files that would > > > match, what would we do then? I'm happy to go create one, if that'd > > > help make the point that this "pattern whitelist" isn't actually a > > > solution but is really rather just a hack that'll break in the future. > > > > Oh, for crying out loud. Yes, obviously you can create conflicts, nobody > > ever doubted that. How on earth is that a useful point for anything? The > > point isn't that it's a good idea for extensions to do so, but that it > > has happened because we didn't provide better alternative. > > The point is that you're making a case because you happen to know of an > extension which does this, but neither of us know about every extension > out there and I, at least, don't believe we should be coming up with > hacks to try and avoid looking at files that some extension has dumped > into the PG tablespace directories because that's never been a > documented or supported thing to do. It's not a hack if it's a quite defendable choice on its own, and the presense of such extensions is just one further argument in one direction (for explicit whitelisting here). > > > > > > And I fail to see why a blacklist is architecturally better. There's > > > > > > plenty cases where we might want to create temporary files, > > > > > > non-checksummed data or such that we'd need a teach a blacklist > > > > > > about, > > > > > > but there's far fewer cases where add new checksummed files. > > > > > > Basically > > > > > > never since checksums have been introduced. And if checksums were > > > > > > introduced for something new, say slrus, we'd ceertainly use > > > > > > pg_verify_checksum during development. > > > > > > > > > > In cases where we need something temporary, we're going to need to be > > > > > prepared to clean those up and we had better make it very plain that > > > > > they're temporary and easy to write code for. Anything we aren't > > > > > prepared to blow away on a crash-restart should be checksum'd and in > > > > > an > > > > > ideal world, we'd be looking to reduce the blacklist to only those > > > > > things which are temporary. > > > > > > > > There's pending patches that add support for pg_verify_checksums running > > > > against a running cluster. We'll not just need to recognize files that > > > > are there after a graceful shutdown, but with anything that a cluster > > > > can have there while running. > > > > > > Of course- the same is true with a crash/restart case, so I'm not sure > > > what you're getting at here. > > > > pg_verify_checksum doesn't support running on a crashed cluster, and I'm > > not sure it'd make sense to teach it to so (there's not really much it > > could do to discern whether a block is a torn page that replay will fix, > > or corrupted). > > ... and this isn't at all relevant, because pg_basebackup does run on a > running cluster. I wasn't ever denying that or anything close to it? My point is that pg_verify_checksum needs much more filtering than it has now to deal with that, because it needs to handle all files that could be present, not just files that could be present after a graceful shutdown. pg_basebackup's case is *not* really comparable because basebackup.c already performed a lot of filtering before noChecksumFiles is applied. > > > I wasn't ever thinking of only the graceful shutdown case- certainly > > > pg_basebackup operates when the cluster is running also and that's > > > more what I was thinking about from the start and which is part of > > > what started the discussion about this commit as that was entirely > > > ignored in this change. > > > > It was mainly ignored in the original pg_verify_checksums change,
Re: pgsql: Add TAP tests for pg_verify_checksums
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2018-10-19 14:11:03 -0400, Stephen Frost wrote: > > * Andres Freund (and...@anarazel.de) wrote: > > > cstore e.g. does this, and it's already out there. So yes, we should > > > provide better infrastructure. But for now, we gotta deal with reality, > > > unless we just want to gratuituously break things. > > > > No, I don't agree that we are beholden to an external extension that > > decided to start writing into directories that clearly belong to PG. > > Did it have an alternative? Yes, work with us to come up with a way to accomplish what they wanted without creating unknown/untracked files in our tablespaces. Or, have their own mechanism for storing files on other filesystems. And likely other options. Saying that we should account for their putting arbitrary files into the PG tablespace directories because they "didn't have any other choice" seems pretty ridiculous, imv. > > How do we even know that what cstore does in the tablespace directory > > wouldn't get caught up in the checksum file pattern-matching that this > > commit put in? > > You listen to people? > > > What if there was an extension which did create files that would > > match, what would we do then? I'm happy to go create one, if that'd > > help make the point that this "pattern whitelist" isn't actually a > > solution but is really rather just a hack that'll break in the future. > > Oh, for crying out loud. Yes, obviously you can create conflicts, nobody > ever doubted that. How on earth is that a useful point for anything? The > point isn't that it's a good idea for extensions to do so, but that it > has happened because we didn't provide better alternative. The point is that you're making a case because you happen to know of an extension which does this, but neither of us know about every extension out there and I, at least, don't believe we should be coming up with hacks to try and avoid looking at files that some extension has dumped into the PG tablespace directories because that's never been a documented or supported thing to do. > > > > > And I fail to see why a blacklist is architecturally better. There's > > > > > plenty cases where we might want to create temporary files, > > > > > non-checksummed data or such that we'd need a teach a blacklist about, > > > > > but there's far fewer cases where add new checksummed files. Basically > > > > > never since checksums have been introduced. And if checksums were > > > > > introduced for something new, say slrus, we'd ceertainly use > > > > > pg_verify_checksum during development. > > > > > > > > In cases where we need something temporary, we're going to need to be > > > > prepared to clean those up and we had better make it very plain that > > > > they're temporary and easy to write code for. Anything we aren't > > > > prepared to blow away on a crash-restart should be checksum'd and in an > > > > ideal world, we'd be looking to reduce the blacklist to only those > > > > things which are temporary. > > > > > > There's pending patches that add support for pg_verify_checksums running > > > against a running cluster. We'll not just need to recognize files that > > > are there after a graceful shutdown, but with anything that a cluster > > > can have there while running. > > > > Of course- the same is true with a crash/restart case, so I'm not sure > > what you're getting at here. > > pg_verify_checksum doesn't support running on a crashed cluster, and I'm > not sure it'd make sense to teach it to so (there's not really much it > could do to discern whether a block is a torn page that replay will fix, > or corrupted). ... and this isn't at all relevant, because pg_basebackup does run on a running cluster. > > I wasn't ever thinking of only the graceful shutdown case- certainly > > pg_basebackup operates when the cluster is running also and that's > > more what I was thinking about from the start and which is part of > > what started the discussion about this commit as that was entirely > > ignored in this change. > > It was mainly ignored in the original pg_verify_checksums change, not in > the commit discussed here. That was broken. That built separate logic. As pointed out elsewhere on this thread, the logic was the same between the two before this commit... The code in pg_basebackup came from the prior pg_verify_checksums code. Certainly, some mention of the code existing in two places, at least, should have been in the comments. Also, just to be clear, as I tried to say up-thread, I'm not trying to lay blame here or suggest that Michael shouldn't have been trying to fix the issue that came up, but I don't agree that this is the way to fix it, and, in any case, we need to make sure that pg_basebackup behaves correctly as well. As mentioned elsewhere, that'd be best done by adding something to libpgcommon and then changing both pg_basebackup and pg_verify_checksums to use that. > Both basebackup an
Re: pgsql: Add TAP tests for pg_verify_checksums
On 2018-10-19 14:11:03 -0400, Stephen Frost wrote: > Greetings, > > * Andres Freund (and...@anarazel.de) wrote: > > On 2018-10-19 13:52:28 -0400, Stephen Frost wrote: > > > > > I also categorically disagree with the notion that it's ok for > > > > > extensions to dump files into our tablespace diretories or that we > > > > > should try to work around random code dumping extra files in the > > > > > directories which we maintain- it's not like this additional code will > > > > > actually protect us from that, after all, and it's foolish to think we > > > > > really could protect against that. > > > > > > > > I'm unconvinced. There already are extensions doing so, and it's not > > > > like we've given them any sort of reasonable alternative. You can't just > > > > create relations in a "registered" / "supported" kinda way right now, so > > > > uh, yea, extension gotta make do. And that has worked for many years. > > > > > > I don't agree that any of this is an argument for accepting random code > > > writing into arbitrary places in the data directory or tablespace > > > directories as being something which we support or which we write code > > > to work-around. > > > > > > If this is a capability that extension authors need then I'm all for > > > having a way for them to have that capability in a clearly defined and > > > documented way- and one which we could actually write code to handle and > > > work with. > > > > cstore e.g. does this, and it's already out there. So yes, we should > > provide better infrastructure. But for now, we gotta deal with reality, > > unless we just want to gratuituously break things. > > No, I don't agree that we are beholden to an external extension that > decided to start writing into directories that clearly belong to PG. Did it have an alternative? > How do we even know that what cstore does in the tablespace directory > wouldn't get caught up in the checksum file pattern-matching that this > commit put in? You listen to people? > What if there was an extension which did create files that would > match, what would we do then? I'm happy to go create one, if that'd > help make the point that this "pattern whitelist" isn't actually a > solution but is really rather just a hack that'll break in the future. Oh, for crying out loud. Yes, obviously you can create conflicts, nobody ever doubted that. How on earth is that a useful point for anything? The point isn't that it's a good idea for extensions to do so, but that it has happened because we didn't provide better alternative. > > > > And I fail to see why a blacklist is architecturally better. There's > > > > plenty cases where we might want to create temporary files, > > > > non-checksummed data or such that we'd need a teach a blacklist about, > > > > but there's far fewer cases where add new checksummed files. Basically > > > > never since checksums have been introduced. And if checksums were > > > > introduced for something new, say slrus, we'd ceertainly use > > > > pg_verify_checksum during development. > > > > > > In cases where we need something temporary, we're going to need to be > > > prepared to clean those up and we had better make it very plain that > > > they're temporary and easy to write code for. Anything we aren't > > > prepared to blow away on a crash-restart should be checksum'd and in an > > > ideal world, we'd be looking to reduce the blacklist to only those > > > things which are temporary. > > > > There's pending patches that add support for pg_verify_checksums running > > against a running cluster. We'll not just need to recognize files that > > are there after a graceful shutdown, but with anything that a cluster > > can have there while running. > > Of course- the same is true with a crash/restart case, so I'm not sure > what you're getting at here. pg_verify_checksum doesn't support running on a crashed cluster, and I'm not sure it'd make sense to teach it to so (there's not really much it could do to discern whether a block is a torn page that replay will fix, or corrupted). > I wasn't ever thinking of only the graceful shutdown case- certainly > pg_basebackup operates when the cluster is running also and that's > more what I was thinking about from the start and which is part of > what started the discussion about this commit as that was entirely > ignored in this change. It was mainly ignored in the original pg_verify_checksums change, not in the commit discussed here. That was broken. That built separate logic. Both basebackup an verify checksums already only scan certain directories. They *already* are encoding directory structure information.
Re: pgsql: Add TAP tests for pg_verify_checksums
On 2018-10-19 14:08:19 -0400, J Chapman Flack wrote: > On 10/19/2018 01:17 PM, Andres Freund wrote: > > On 2018-10-19 10:36:59 -0400, Stephen Frost wrote: > > > >> I also categorically disagree with the notion that it's ok for > >> extensions to dump files into our tablespace diretories > > > > I'm unconvinced. There already are extensions doing so, and it's not > > like we've given them any sort of reasonable alternative. You can't just > > create relations in a "registered" / "supported" kinda way right now, so > > uh, yea, extension gotta make do. And that has worked for many years. > > For some of us following along at home, this got interesting right here. > Could someone elaborate on what extensions do this, for what purpose? > And what would it mean to create relations in a "registered" / > "supported" kinda way? Has that been the topic of a past discussion > I could catch up with? An fdw, like cstore, might be doing it, to store data, for example. The registered / supported thing would be to have a catalog representation of on-disk files that is represented in the catalogs somewhow. Right now you can't really do that because you need a proper relation (table, index, ...) to get a relfilenode. Greetings, Andres Freund
Re: Re: pgsql: Add TAP tests for pg_verify_checksums
On 10/19/2018 01:17 PM, Andres Freund wrote: > On 2018-10-19 10:36:59 -0400, Stephen Frost wrote: > >> I also categorically disagree with the notion that it's ok for >> extensions to dump files into our tablespace diretories > > I'm unconvinced. There already are extensions doing so, and it's not > like we've given them any sort of reasonable alternative. You can't just > create relations in a "registered" / "supported" kinda way right now, so > uh, yea, extension gotta make do. And that has worked for many years. For some of us following along at home, this got interesting right here. Could someone elaborate on what extensions do this, for what purpose? And what would it mean to create relations in a "registered" / "supported" kinda way? Has that been the topic of a past discussion I could catch up with? -Chap
Re: pgsql: Add TAP tests for pg_verify_checksums
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2018-10-19 13:52:28 -0400, Stephen Frost wrote: > > > > I also categorically disagree with the notion that it's ok for > > > > extensions to dump files into our tablespace diretories or that we > > > > should try to work around random code dumping extra files in the > > > > directories which we maintain- it's not like this additional code will > > > > actually protect us from that, after all, and it's foolish to think we > > > > really could protect against that. > > > > > > I'm unconvinced. There already are extensions doing so, and it's not > > > like we've given them any sort of reasonable alternative. You can't just > > > create relations in a "registered" / "supported" kinda way right now, so > > > uh, yea, extension gotta make do. And that has worked for many years. > > > > I don't agree that any of this is an argument for accepting random code > > writing into arbitrary places in the data directory or tablespace > > directories as being something which we support or which we write code > > to work-around. > > > > If this is a capability that extension authors need then I'm all for > > having a way for them to have that capability in a clearly defined and > > documented way- and one which we could actually write code to handle and > > work with. > > cstore e.g. does this, and it's already out there. So yes, we should > provide better infrastructure. But for now, we gotta deal with reality, > unless we just want to gratuituously break things. No, I don't agree that we are beholden to an external extension that decided to start writing into directories that clearly belong to PG. How do we even know that what cstore does in the tablespace directory wouldn't get caught up in the checksum file pattern-matching that this commit put in? What if there was an extension which did create files that would match, what would we do then? I'm happy to go create one, if that'd help make the point that this "pattern whitelist" isn't actually a solution but is really rather just a hack that'll break in the future. > > > And I fail to see why a blacklist is architecturally better. There's > > > plenty cases where we might want to create temporary files, > > > non-checksummed data or such that we'd need a teach a blacklist about, > > > but there's far fewer cases where add new checksummed files. Basically > > > never since checksums have been introduced. And if checksums were > > > introduced for something new, say slrus, we'd ceertainly use > > > pg_verify_checksum during development. > > > > In cases where we need something temporary, we're going to need to be > > prepared to clean those up and we had better make it very plain that > > they're temporary and easy to write code for. Anything we aren't > > prepared to blow away on a crash-restart should be checksum'd and in an > > ideal world, we'd be looking to reduce the blacklist to only those > > things which are temporary. > > There's pending patches that add support for pg_verify_checksums running > against a running cluster. We'll not just need to recognize files that > are there after a graceful shutdown, but with anything that a cluster > can have there while running. Of course- the same is true with a crash/restart case, so I'm not sure what you're getting at here. I wasn't ever thinking of only the graceful shutdown case- certainly pg_basebackup operates when the cluster is running also and that's more what I was thinking about from the start and which is part of what started the discussion about this commit as that was entirely ignored in this change. > > Of course, we may need different code for > > calculating the checksum of different types of files, which moves it > > from really being a 'blacklist' to a list of file-types and their > > associated checksum'ing mechansim. > > Which pretty much is a whitelist. Given that we need a > filetype->checksumtype mapping or something, how is a blacklist a > reasonable approach for this? We only need the mapping for special files- call that list of special files a whitelist or a blacklist or a bluelist, it doesn't matter, what's relevant here is that we don't skip over anything- we account for everything and make sure that everything that would survive a crash/restart is checked or at least accounted for if we can't, yet, check it. Thanks! Stephen signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
Hi, On 2018-10-19 13:52:28 -0400, Stephen Frost wrote: > > > I also categorically disagree with the notion that it's ok for > > > extensions to dump files into our tablespace diretories or that we > > > should try to work around random code dumping extra files in the > > > directories which we maintain- it's not like this additional code will > > > actually protect us from that, after all, and it's foolish to think we > > > really could protect against that. > > > > I'm unconvinced. There already are extensions doing so, and it's not > > like we've given them any sort of reasonable alternative. You can't just > > create relations in a "registered" / "supported" kinda way right now, so > > uh, yea, extension gotta make do. And that has worked for many years. > > I don't agree that any of this is an argument for accepting random code > writing into arbitrary places in the data directory or tablespace > directories as being something which we support or which we write code > to work-around. > > If this is a capability that extension authors need then I'm all for > having a way for them to have that capability in a clearly defined and > documented way- and one which we could actually write code to handle and > work with. cstore e.g. does this, and it's already out there. So yes, we should provide better infrastructure. But for now, we gotta deal with reality, unless we just want to gratuituously break things. > > And I fail to see why a blacklist is architecturally better. There's > > plenty cases where we might want to create temporary files, > > non-checksummed data or such that we'd need a teach a blacklist about, > > but there's far fewer cases where add new checksummed files. Basically > > never since checksums have been introduced. And if checksums were > > introduced for something new, say slrus, we'd ceertainly use > > pg_verify_checksum during development. > > In cases where we need something temporary, we're going to need to be > prepared to clean those up and we had better make it very plain that > they're temporary and easy to write code for. Anything we aren't > prepared to blow away on a crash-restart should be checksum'd and in an > ideal world, we'd be looking to reduce the blacklist to only those > things which are temporary. There's pending patches that add support for pg_verify_checksums running against a running cluster. We'll not just need to recognize files that are there after a graceful shutdown, but with anything that a cluster can have there while running. > Of course, we may need different code for > calculating the checksum of different types of files, which moves it > from really being a 'blacklist' to a list of file-types and their > associated checksum'ing mechansim. Which pretty much is a whitelist. Given that we need a filetype->checksumtype mapping or something, how is a blacklist a reasonable approach for this? Greetings, Andres Freund
Re: pgsql: Add TAP tests for pg_verify_checksums
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2018-10-19 10:36:59 -0400, Stephen Frost wrote: > > * Michael Paquier (mich...@paquier.xyz) wrote: > > > On Wed, Oct 17, 2018 at 05:30:05PM -0400, Andrew Dunstan wrote: > > > > Fine by me. > > > > > > Thanks. This is now committed after some tweaks to the comments, a bit > > > earlier than I thought first. > > > > I just saw this committed and I'm trying to figure out why we are > > creating yet-another-list when it comes to deciding what should be > > checksum'd and what shouldn't be. > > I'm not sure it's fair to blame Michael here. He's picking up slack, > because the original authors of the tool didn't even bother to reply to > the issue for quite a while. pg_verify_checksum was obviously never > tested on windows, it just didn't have tests showing that. I wasn't really going for 'blame' here, just that we've now got two different methods for doing the same thing- and those will give different answers; presumably this issue impacts pg_basebackup too, since the logic there wasn't changed. > > I also categorically disagree with the notion that it's ok for > > extensions to dump files into our tablespace diretories or that we > > should try to work around random code dumping extra files in the > > directories which we maintain- it's not like this additional code will > > actually protect us from that, after all, and it's foolish to think we > > really could protect against that. > > I'm unconvinced. There already are extensions doing so, and it's not > like we've given them any sort of reasonable alternative. You can't just > create relations in a "registered" / "supported" kinda way right now, so > uh, yea, extension gotta make do. And that has worked for many years. I don't agree that any of this is an argument for accepting random code writing into arbitrary places in the data directory or tablespace directories as being something which we support or which we write code to work-around. If this is a capability that extension authors need then I'm all for having a way for them to have that capability in a clearly defined and documented way- and one which we could actually write code to handle and work with. > Also, as made obvious here, it's pretty clear that there's platform > differences about which files exists and which don't, so it's not that a > blacklist approach automatically is more maintainable. Platform differences are certainly something we can manage and we have code in a few different places for doing exactly that. A platform difference is a heck of a lot more reasonable than trying to guess at what random code that we've never seen before has done and try to work around that. > And I fail to see why a blacklist is architecturally better. There's > plenty cases where we might want to create temporary files, > non-checksummed data or such that we'd need a teach a blacklist about, > but there's far fewer cases where add new checksummed files. Basically > never since checksums have been introduced. And if checksums were > introduced for something new, say slrus, we'd ceertainly use > pg_verify_checksum during development. In cases where we need something temporary, we're going to need to be prepared to clean those up and we had better make it very plain that they're temporary and easy to write code for. Anything we aren't prepared to blow away on a crash-restart should be checksum'd and in an ideal world, we'd be looking to reduce the blacklist to only those things which are temporary. Of course, we may need different code for calculating the checksum of different types of files, which moves it from really being a 'blacklist' to a list of file-types and their associated checksum'ing mechansim. Thanks! Stephen signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
Hi, On 2018-10-19 10:36:59 -0400, Stephen Frost wrote: > * Michael Paquier (mich...@paquier.xyz) wrote: > > On Wed, Oct 17, 2018 at 05:30:05PM -0400, Andrew Dunstan wrote: > > > Fine by me. > > > > Thanks. This is now committed after some tweaks to the comments, a bit > > earlier than I thought first. > > I just saw this committed and I'm trying to figure out why we are > creating yet-another-list when it comes to deciding what should be > checksum'd and what shouldn't be. I'm not sure it's fair to blame Michael here. He's picking up slack, because the original authors of the tool didn't even bother to reply to the issue for quite a while. pg_verify_checksum was obviously never tested on windows, it just didn't have tests showing that. > I also categorically disagree with the notion that it's ok for > extensions to dump files into our tablespace diretories or that we > should try to work around random code dumping extra files in the > directories which we maintain- it's not like this additional code will > actually protect us from that, after all, and it's foolish to think we > really could protect against that. I'm unconvinced. There already are extensions doing so, and it's not like we've given them any sort of reasonable alternative. You can't just create relations in a "registered" / "supported" kinda way right now, so uh, yea, extension gotta make do. And that has worked for many years. Also, as made obvious here, it's pretty clear that there's platform differences about which files exists and which don't, so it's not that a blacklist approach automatically is more maintainable. And I fail to see why a blacklist is architecturally better. There's plenty cases where we might want to create temporary files, non-checksummed data or such that we'd need a teach a blacklist about, but there's far fewer cases where add new checksummed files. Basically never since checksums have been introduced. And if checksums were introduced for something new, say slrus, we'd ceertainly use pg_verify_checksum during development. Greetings, Andres Freund
Re: pgsql: Add TAP tests for pg_verify_checksums
Greetings, * Tom Lane (t...@sss.pgh.pa.us) wrote: > Stephen Frost writes: > > * Tom Lane (t...@sss.pgh.pa.us) wrote: > >> Seems like a job for a new(?) module in src/common/. > > > Yeah, that would be nice... Even nicer would be a way for non-core PG > > tools to be able to use it (we have basically the same thing in > > pgbackrest, unsurprisingly), though I realize that's a much larger step. > > We install libpgcommon.a (and, as of HEAD, libpgcommon_shlib.a) for > precisely that sort of reason. Hmm, yeah, we might be able to use that for this.. One challenge we've been thinking about has been dealing with multiple major versions of PG with one build of pgbackrest since we'd really like to do things like inspect the control file, but that changes between versions. We do have multi-version code in some places (quite a bit in pg_dump..) so perhaps we could have libpgcommon support also. Probably a topic for another thread. Thanks! Stephen signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
Stephen Frost writes: > * Tom Lane (t...@sss.pgh.pa.us) wrote: >> Seems like a job for a new(?) module in src/common/. > Yeah, that would be nice... Even nicer would be a way for non-core PG > tools to be able to use it (we have basically the same thing in > pgbackrest, unsurprisingly), though I realize that's a much larger step. We install libpgcommon.a (and, as of HEAD, libpgcommon_shlib.a) for precisely that sort of reason. regards, tom lane
Re: pgsql: Add TAP tests for pg_verify_checksums
Greetings, * Tom Lane (t...@sss.pgh.pa.us) wrote: > Michael Banck writes: > > Am Freitag, den 19.10.2018, 10:36 -0400 schrieb Stephen Frost: > >> I just saw this committed and I'm trying to figure out why we are > >> creating yet-another-list when it comes to deciding what should be > >> checksum'd and what shouldn't be. > > > To be fair, the list in src/backend/replication/basebackup.c was a copy- > > paste from the one in pg_verify_checksums (or from other parts of the > > online activation patch). > > Seems like a job for a new(?) module in src/common/. Yeah, that would be nice... Even nicer would be a way for non-core PG tools to be able to use it (we have basically the same thing in pgbackrest, unsurprisingly), though I realize that's a much larger step. Thanks! Stephen signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
Michael Banck writes: > Am Freitag, den 19.10.2018, 10:36 -0400 schrieb Stephen Frost: >> I just saw this committed and I'm trying to figure out why we are >> creating yet-another-list when it comes to deciding what should be >> checksum'd and what shouldn't be. > To be fair, the list in src/backend/replication/basebackup.c was a copy- > paste from the one in pg_verify_checksums (or from other parts of the > online activation patch). Seems like a job for a new(?) module in src/common/. regards, tom lane
Re: pgsql: Add TAP tests for pg_verify_checksums
Hi, Am Freitag, den 19.10.2018, 10:36 -0400 schrieb Stephen Frost: > Greetings, > > * Michael Paquier (mich...@paquier.xyz) wrote: > > On Wed, Oct 17, 2018 at 05:30:05PM -0400, Andrew Dunstan wrote: > > > Fine by me. > > > > Thanks. This is now committed after some tweaks to the comments, a bit > > earlier than I thought first. > > I just saw this committed and I'm trying to figure out why we are > creating yet-another-list when it comes to deciding what should be > checksum'd and what shouldn't be. > > Specifically, pg_basebackup (or, really, > src/backend/replication/basebackup.c) has 'is_checksummed_file' which > operates differently from pg_verify_checksum with this change, and that > seems rather wrong. To be fair, the list in src/backend/replication/basebackup.c was a copy- paste from the one in pg_verify_checksums (or from other parts of the online activation patch). I agree it makes sense to have both in sync or, better yet, factored out in a central place, but I don't currently have further opinions on whether it should be a black- or whitelist. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.ba...@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Re: pgsql: Add TAP tests for pg_verify_checksums
Greetings, * Michael Paquier (mich...@paquier.xyz) wrote: > On Wed, Oct 17, 2018 at 05:30:05PM -0400, Andrew Dunstan wrote: > > Fine by me. > > Thanks. This is now committed after some tweaks to the comments, a bit > earlier than I thought first. I just saw this committed and I'm trying to figure out why we are creating yet-another-list when it comes to deciding what should be checksum'd and what shouldn't be. Specifically, pg_basebackup (or, really, src/backend/replication/basebackup.c) has 'is_checksummed_file' which operates differently from pg_verify_checksum with this change, and that seems rather wrong. Maybe we need to fix both places but I *really* don't like this approach of "well, we'll just guess if the file should have a checksum or not" and it certainly seems likely that we'll end up forgetting to update this when we introduce things in the future which have checksums (which seems pretty darn likely to happen). I also categorically disagree with the notion that it's ok for extensions to dump files into our tablespace diretories or that we should try to work around random code dumping extra files in the directories which we maintain- it's not like this additional code will actually protect us from that, after all, and it's foolish to think we really could protect against that. Basically, I think this commit should be reverted, we should go back to having a blacklist, update it in both places (and add comments to both places to make it clear that this list exists in two different places) and add code to handle the temp tablespace case explicitly. Even better would be to find a way to expose the list and the code for walking the directories and identifying the files which contain checksums, so that we're only doing that in one place instead of multiple different places. Thanks! Stephen signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
On Wed, Oct 17, 2018 at 05:30:05PM -0400, Andrew Dunstan wrote: > Fine by me. Thanks. This is now committed after some tweaks to the comments, a bit earlier than I thought first. -- Michael signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
On 10/16/2018 06:09 PM, Michael Paquier wrote: On Mon, Oct 15, 2018 at 11:20:28AM -0400, Andrew Dunstan wrote: On 10/15/2018 11:05 AM, Tom Lane wrote: Andrew Dunstan writes: We should fix this in PG11 rather than ship a broken utility. If everyone is happy, I can apply this. At this point I'd be happier if you waited till after 11.0 wraps... there is no margin for error today. If there is no urgency for this week, could you let me wrap this patch then? I cannot do that this week, but the beginning of next week will be fine per the current odds. Fine by me. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgsql: Add TAP tests for pg_verify_checksums
On October 16, 2018 3:47:55 PM PDT, Tom Lane wrote: >Michael Paquier writes: >> On Mon, Oct 15, 2018 at 11:20:28AM -0400, Andrew Dunstan wrote: >>> OK. In that case should we note that the utility is broken on >Windows? > >> Not sure if that's worth adding in the documentation, something in >the >> press release would be more adapted? > >I can't get too excited about it. I can guarantee you that there are >worse bugs than this one in 11.0. I would not be too surprised if we >know of some by next week. So I don't see much point in loudly >advertising this particular bug ... Same. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: pgsql: Add TAP tests for pg_verify_checksums
Michael Paquier writes: > On Mon, Oct 15, 2018 at 11:20:28AM -0400, Andrew Dunstan wrote: >> OK. In that case should we note that the utility is broken on Windows? > Not sure if that's worth adding in the documentation, something in the > press release would be more adapted? I can't get too excited about it. I can guarantee you that there are worse bugs than this one in 11.0. I would not be too surprised if we know of some by next week. So I don't see much point in loudly advertising this particular bug ... regards, tom lane
Re: pgsql: Add TAP tests for pg_verify_checksums
On Mon, Oct 15, 2018 at 11:20:28AM -0400, Andrew Dunstan wrote: > On 10/15/2018 11:05 AM, Tom Lane wrote: >> Andrew Dunstan writes: >>> We should fix this in PG11 rather than ship a broken utility. If >>> everyone is happy, I can apply this. >> >> At this point I'd be happier if you waited till after 11.0 wraps... >> there is no margin for error today. If there is no urgency for this week, could you let me wrap this patch then? I cannot do that this week, but the beginning of next week will be fine per the current odds. > OK. In that case should we note that the utility is broken on Windows? Not sure if that's worth adding in the documentation, something in the press release would be more adapted? -- Michael signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
On 10/15/2018 11:05 AM, Tom Lane wrote: Andrew Dunstan writes: We should fix this in PG11 rather than ship a broken utility. If everyone is happy, I can apply this. At this point I'd be happier if you waited till after 11.0 wraps... there is no margin for error today. OK. In that case should we note that the utility is broken on Windows? cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgsql: Add TAP tests for pg_verify_checksums
Andrew Dunstan writes: > We should fix this in PG11 rather than ship a broken utility. If > everyone is happy, I can apply this. At this point I'd be happier if you waited till after 11.0 wraps... there is no margin for error today. regards, tom lane
Re: pgsql: Add TAP tests for pg_verify_checksums
On 10/14/2018 09:11 PM, Andrew Dunstan wrote: What do you think about the updated version attached? Looks good to me. We should fix this in PG11 rather than ship a broken utility. If everyone is happy, I can apply this. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgsql: Add TAP tests for pg_verify_checksums
On 10/14/2018 06:41 PM, Michael Paquier wrote: On Sun, Oct 14, 2018 at 10:24:55AM -0400, Andrew Dunstan wrote: [snip] Thanks a lot for the review, Andrew! This code now seems redundant: if (strcmp(fn, ".") == 0 || strcmp(fn, "..") == 0) return true; Indeed. I have renamed skipfile() to isRelFileName on the way, which looks better in the context. I would probably reverse the order of these two tests. It might not make any difference, assuming fn is never an empty string, but it seems more logical to me. + /* good to go if only digits */ + if (fn[pos] == '\0') + return false; + /* leave if no digits */ + if (pos == 0) + return true; No objections to that. Done. It also looks to me like the check for a segment number doesn't ensure there is at least one digit, so "123." might pass, but I could be wrong. In any case, there isn't a test for that, and there probably should be. You are right here. This name pattern has been failing. Fixed in the attached. I have added "123_" and "123_." as extra patterns to check. What do you think about the updated version attached? Looks good to me. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgsql: Add TAP tests for pg_verify_checksums
On Sun, Oct 14, 2018 at 10:24:55AM -0400, Andrew Dunstan wrote: > [snip] Thanks a lot for the review, Andrew! > This code now seems redundant: > > if (strcmp(fn, ".") == 0 || > strcmp(fn, "..") == 0) > return true; Indeed. I have renamed skipfile() to isRelFileName on the way, which looks better in the context. > I would probably reverse the order of these two tests. It might not make any > difference, assuming fn is never an empty string, but it seems more logical > to me. > >+ /* good to go if only digits */ >+ if (fn[pos] == '\0') >+ return false; >+ /* leave if no digits */ >+ if (pos == 0) >+ return true; No objections to that. Done. > It also looks to me like the check for a segment number doesn't ensure > there is at least one digit, so "123." might pass, but I could be > wrong. In any case, there isn't a test for that, and there probably > should be. You are right here. This name pattern has been failing. Fixed in the attached. I have added "123_" and "123_." as extra patterns to check. What do you think about the updated version attached? -- Michael diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c index 1bc020ab6c..666ab3f21e 100644 --- a/src/bin/pg_verify_checksums/pg_verify_checksums.c +++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c @@ -15,6 +15,7 @@ #include "catalog/pg_control.h" #include "common/controldata_utils.h" +#include "common/relpath.h" #include "getopt_long.h" #include "pg_getopt.h" #include "storage/bufpage.h" @@ -49,27 +50,69 @@ usage(void) printf(_("Report bugs to .\n")); } -static const char *const skip[] = { - "pg_control", - "pg_filenode.map", - "pg_internal.init", - "PG_VERSION", - NULL, -}; - +/* + * isRelFileName + * + * Check if the given file name has a shape authorized for scanning. + */ static bool -skipfile(const char *fn) +isRelFileName(const char *fn) { - const char *const *f; + int pos; - if (strcmp(fn, ".") == 0 || - strcmp(fn, "..") == 0) + /*-- + * Only files including data checksums are authorized for scanning. This + * is guessed based on the file name, by reverse-engineering + * GetRelationPath so make sure to update both code paths if any updates + * are done. The following file names are allowed: + * + * . + * _ + * _. + * + * Note that temporary files, beginning with 't', are also skipped. + * + *-- + */ + + /* A non-empty string of digits should follow */ + for (pos = 0; isdigit((unsigned char) fn[pos]); ++pos) + ; + /* leave if no digits */ + if (pos == 0) + return false; + /* good to go if only digits */ + if (fn[pos] == '\0') return true; - for (f = skip; *f; f++) - if (strcmp(*f, fn) == 0) - return true; - return false; + /* Authorized fork files can be scanned */ + if (fn[pos] == '_') + { + int forkchar = forkname_chars([pos + 1], NULL); + + if (forkchar <= 0) + return false; + + pos += forkchar + 1; + } + + /* Check for an optional segment number */ + if (fn[pos] == '.') + { + int segchar; + + for (segchar = 1; isdigit((unsigned char) fn[pos + segchar]); ++segchar) + ; + + if (segchar <= 1) + return false; + pos += segchar; + } + + /* Now this should be the end */ + if (fn[pos] != '\0') + return false; + return true; } static void @@ -146,7 +189,7 @@ scan_directory(const char *basedir, const char *subdir) char fn[MAXPGPATH]; struct stat st; - if (skipfile(de->d_name)) + if (!isRelFileName(de->d_name)) continue; snprintf(fn, sizeof(fn), "%s/%s", path, de->d_name); diff --git a/src/bin/pg_verify_checksums/t/002_actions.pl b/src/bin/pg_verify_checksums/t/002_actions.pl index dc29da09af..8442534e95 100644 --- a/src/bin/pg_verify_checksums/t/002_actions.pl +++ b/src/bin/pg_verify_checksums/t/002_actions.pl @@ -5,7 +5,7 @@ use strict; use warnings; use PostgresNode; use TestLib; -use Test::More tests => 12; +use Test::More tests => 36; # Initialize node with checksums enabled. my $node = get_new_node('node_checksum'); @@ -17,6 +17,31 @@ command_like(['pg_controldata', $pgdata], qr/Data page checksum version:.*1/, 'checksums enabled in control file'); +# Add set of dummy files with some contents. These should not be scanned +# by the tool. +append_to_file "$pgdata/global/123.", "foo"; +append_to_file "$pgdata/global/123_", "foo"; +append_to_file "$pgdata/global/123_.", "foo"; +append_to_file "$pgdata/global/123.12t", "foo"; +append_to_file "$pgdata/global/foo", "foo2"; +append_to_file "$pgdata/global/t123", "bar"; +append_to_file "$pgdata/global/123a", "bar2"; +append_to_file "$pgdata/global/.123", "foobar"; +append_to_file "$pgdata/global/_fsm", "foobar2"; +append_to_file "$pgdata/global/_init", "foobar3"; +append_to_file "$pgdata/global/_vm.123", "foohoge"; +append_to_file "$pgdata/global/123_vm.123t", "foohoge2"; + +# Those are correct but empty files, so they
Re: pgsql: Add TAP tests for pg_verify_checksums
On 10/13/2018 09:59 PM, Michael Paquier wrote: On Sat, Oct 13, 2018 at 05:53:00PM -0400, Andrew Dunstan wrote: It occurred to me that a pretty simple fix could just be to blacklist everything that didn't start with a digit. The whitelist approach is probably preferable... depends how urgent we see this as. Yeah, possibly, still that's not a correct long-term fix. So attached is what I think we should do for HEAD and REL_11_STABLE with an approach using a whitelist. I have added positive and negative tests on top of the existing TAP tests, as suggested by Michael B, and I made the code use relpath.h to make sure that we don't miss any fork types. Any objections to this patch? This code now seems redundant: if (strcmp(fn, ".") == 0 || strcmp(fn, "..") == 0) return true; I would probably reverse the order of these two tests. It might not make any difference, assuming fn is never an empty string, but it seems more logical to me. + /* good to go if only digits */ + if (fn[pos] == '\0') + return false; + /* leave if no digits */ + if (pos == 0) + return true; It also looks to me like the check for a segment number doesn't ensure there is at least one digit, so "123." might pass, but I could be wrong. In any case, there isn't a test for that, and there probably should be. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgsql: Add TAP tests for pg_verify_checksums
On Sat, Oct 13, 2018 at 05:53:00PM -0400, Andrew Dunstan wrote: > It occurred to me that a pretty simple fix could just be to blacklist > everything that didn't start with a digit. The whitelist approach is > probably preferable... depends how urgent we see this as. Yeah, possibly, still that's not a correct long-term fix. So attached is what I think we should do for HEAD and REL_11_STABLE with an approach using a whitelist. I have added positive and negative tests on top of the existing TAP tests, as suggested by Michael B, and I made the code use relpath.h to make sure that we don't miss any fork types. Any objections to this patch? -- Michael diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c index 1bc020ab6c..7fb9d656fa 100644 --- a/src/bin/pg_verify_checksums/pg_verify_checksums.c +++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c @@ -15,6 +15,7 @@ #include "catalog/pg_control.h" #include "common/controldata_utils.h" +#include "common/relpath.h" #include "getopt_long.h" #include "pg_getopt.h" #include "storage/bufpage.h" @@ -49,26 +50,66 @@ usage(void) printf(_("Report bugs to .\n")); } -static const char *const skip[] = { - "pg_control", - "pg_filenode.map", - "pg_internal.init", - "PG_VERSION", - NULL, -}; - static bool skipfile(const char *fn) { - const char *const *f; + int pos; + int savepos; if (strcmp(fn, ".") == 0 || strcmp(fn, "..") == 0) return true; - for (f = skip; *f; f++) - if (strcmp(*f, fn) == 0) + /*-- + * Only files including data checksums are authorized for scanning. This + * is guessed based on the file name, by reverse-engineering + * GetRelationPath so make sure to update both code paths if any updates + * are done. The following file names are allowed: + * + * . + * _ + * _. + * + * Note that temporary files, beginning with 't', are also skipped. + * + *-- + */ + + /* A non-empty string of digits should follow */ + for (pos = 0; isdigit((unsigned char) fn[pos]); ++pos) + ; + /* good to go if only digits */ + if (fn[pos] == '\0') + return false; + /* leave if no digits */ + if (pos == 0) + return true; + + /* Authorized fork files can be scanned */ + if (fn[pos] == '_') + { + int forkchar = forkname_chars([pos + 1], NULL); + + if (forkchar <= 0) return true; + + pos += forkchar + 1; + } + + /* Check for an optional segment number */ + if (fn[pos] == '.') + { + savepos = pos + 1; /* count dot */ + /* This should use only digits */ + for (pos = savepos; isdigit((unsigned char) fn[pos]); ++pos) + ; + if (fn[pos] != '\0') + return true; + } + + /* Now this should be the end */ + if (fn[pos] != '\0') + return true; return false; } diff --git a/src/bin/pg_verify_checksums/t/002_actions.pl b/src/bin/pg_verify_checksums/t/002_actions.pl index dc29da09af..33408ced89 100644 --- a/src/bin/pg_verify_checksums/t/002_actions.pl +++ b/src/bin/pg_verify_checksums/t/002_actions.pl @@ -5,7 +5,7 @@ use strict; use warnings; use PostgresNode; use TestLib; -use Test::More tests => 12; +use Test::More tests => 36; # Initialize node with checksums enabled. my $node = get_new_node('node_checksum'); @@ -17,6 +17,28 @@ command_like(['pg_controldata', $pgdata], qr/Data page checksum version:.*1/, 'checksums enabled in control file'); +# Add set of dummy files with some contents. These should not be scanned +# by the tool. +append_to_file "$pgdata/global/123.12t", "foo"; +append_to_file "$pgdata/global/foo", "foo2"; +append_to_file "$pgdata/global/t123", "bar"; +append_to_file "$pgdata/global/123a", "bar2"; +append_to_file "$pgdata/global/.123", "foobar"; +append_to_file "$pgdata/global/_fsm", "foobar2"; +append_to_file "$pgdata/global/_init", "foobar3"; +append_to_file "$pgdata/global/_vm.123", "foohoge"; +append_to_file "$pgdata/global/123_vm.123t", "foohoge2"; + +# Those are correct but empty files, so they should pass through. +append_to_file "$pgdata/global/9", ""; +append_to_file "$pgdata/global/9.123", ""; +append_to_file "$pgdata/global/9_fsm", ""; +append_to_file "$pgdata/global/9_init", ""; +append_to_file "$pgdata/global/9_vm", ""; +append_to_file "$pgdata/global/9_init.123", ""; +append_to_file "$pgdata/global/9_fsm.123", ""; +append_to_file "$pgdata/global/9_vm.123", ""; + # Checksums pass on a newly-created cluster command_ok(['pg_verify_checksums', '-D', $pgdata], "succeeds with offline cluster"); @@ -67,3 +89,30 @@ $node->command_checks_all([ 'pg_verify_checksums', '-D', $pgdata, '-r', [qr/Bad checksums:.*1/], [qr/checksum verification failed/], 'fails for corrupted data on single relfilenode'); + +# Utility routine to check that pg_verify_checksums is correctly +# able to detect correctly-shaped relation files with some data. +sub fail_corrupt +{ + my $node = shift; + my $file = shift; + my $pgdata = $node->data_dir; + + # Create the file
Re: pgsql: Add TAP tests for pg_verify_checksums
On 10/13/2018 10:00 AM, Andrew Dunstan wrote: On 10/13/2018 04:30 AM, Michael Paquier wrote: On Fri, Oct 12, 2018 at 12:14:48PM +0200, Michael Banck wrote: On Fri, Oct 12, 2018 at 03:05:43PM +0900, Michael Paquier wrote: On Fri, Oct 12, 2018 at 12:11:58PM +0900, Michael Paquier wrote: Agreed. I am just working on a patch for v11- which uses a whitelist-based method instead of what is present now. Reverting the tests to put the buildfarm to green could be done, but that's not the root of the problem. I think that's the right solution; the online verification patch adds even more logic to the blacklist so getting rid of it in favor of a whitelist is +1 with me. Thanks Michael for the input! So, I have coded this thing, and finish with the attached. The following file patterns are accepted for the checksums: . _ _. I have added some tests on the way to make sure that all the patterns get covered. Please note that this skips the temporary files. Maybe also add some correct (to be scanned) but non-empty garbage files later on that it should barf on? I was not sure about doing that in the main patch so I tweaked manually the test to make sure that the tool still complained with "could not read block" as it should. That's easy enough to add, so I'll add them with multiple file patterns. Those are cheap checks as well if they are placed in global/. Another problem that the patch has is that it is not using forkname_to_number() to scan for all the fork types, and I forgot init forks in the previous version. Using forkname_to_number() also makes the tool more bug-proof, it is also not complicated to plug into the existing patch. Anyway, I have a bit of a problem here, which prevents me to stay in front of a computer or to look at a screen for more than a couple of minutes in a row for a couple of days at least, and I don't like to keep the buildfarm unhappy for the time being. There are three options: 1) Revert the TAP tests of pg_verify_checksums. 2) Push the patch which adds new entries for EXEC_BACKEND files in the skip list. That's a short workaround, and that would allow default deployments of Postgres to use the tool. 3) Finish the patch with the whitelist approach. I can do 1) or 2) in my condition. 3) requires more work than I can do now, though the patch to do is getting in shape, so the buildfarm would stay unhappy. Any preference of the course of action to take? I have disabled the test temporarily on my two animals since I want to make sure they are working OK with other changes, and we know what the problem is. Andres might want to do that with his animal also just add "--skip-steps=pg_verify_checksums-check" to the command line. If you want to throw what you have for 3) over to wall to me I can see if I can finish it. It occurred to me that a pretty simple fix could just be to blacklist everything that didn't start with a digit. The whitelist approach is probably preferable ... depends how urgent we see this as. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgsql: Add TAP tests for pg_verify_checksums
On 10/13/2018 04:30 AM, Michael Paquier wrote: On Fri, Oct 12, 2018 at 12:14:48PM +0200, Michael Banck wrote: On Fri, Oct 12, 2018 at 03:05:43PM +0900, Michael Paquier wrote: On Fri, Oct 12, 2018 at 12:11:58PM +0900, Michael Paquier wrote: Agreed. I am just working on a patch for v11- which uses a whitelist-based method instead of what is present now. Reverting the tests to put the buildfarm to green could be done, but that's not the root of the problem. I think that's the right solution; the online verification patch adds even more logic to the blacklist so getting rid of it in favor of a whitelist is +1 with me. Thanks Michael for the input! So, I have coded this thing, and finish with the attached. The following file patterns are accepted for the checksums: . _ _. I have added some tests on the way to make sure that all the patterns get covered. Please note that this skips the temporary files. Maybe also add some correct (to be scanned) but non-empty garbage files later on that it should barf on? I was not sure about doing that in the main patch so I tweaked manually the test to make sure that the tool still complained with "could not read block" as it should. That's easy enough to add, so I'll add them with multiple file patterns. Those are cheap checks as well if they are placed in global/. Another problem that the patch has is that it is not using forkname_to_number() to scan for all the fork types, and I forgot init forks in the previous version. Using forkname_to_number() also makes the tool more bug-proof, it is also not complicated to plug into the existing patch. Anyway, I have a bit of a problem here, which prevents me to stay in front of a computer or to look at a screen for more than a couple of minutes in a row for a couple of days at least, and I don't like to keep the buildfarm unhappy for the time being. There are three options: 1) Revert the TAP tests of pg_verify_checksums. 2) Push the patch which adds new entries for EXEC_BACKEND files in the skip list. That's a short workaround, and that would allow default deployments of Postgres to use the tool. 3) Finish the patch with the whitelist approach. I can do 1) or 2) in my condition. 3) requires more work than I can do now, though the patch to do is getting in shape, so the buildfarm would stay unhappy. Any preference of the course of action to take? I have disabled the test temporarily on my two animals since I want to make sure they are working OK with other changes, and we know what the problem is. Andres might want to do that with his animal also just add "--skip-steps=pg_verify_checksums-check" to the command line. If you want to throw what you have for 3) over to wall to me I can see if I can finish it. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgsql: Add TAP tests for pg_verify_checksums
On Fri, Oct 12, 2018 at 12:14:48PM +0200, Michael Banck wrote: > On Fri, Oct 12, 2018 at 03:05:43PM +0900, Michael Paquier wrote: >> On Fri, Oct 12, 2018 at 12:11:58PM +0900, Michael Paquier wrote: >>> Agreed. I am just working on a patch for v11- which uses a >>> whitelist-based method instead of what is present now. Reverting the >>> tests to put the buildfarm to green could be done, but that's not the >>> root of the problem. > > I think that's the right solution; the online verification patch adds > even more logic to the blacklist so getting rid of it in favor of a > whitelist is +1 with me. Thanks Michael for the input! >> So, I have coded this thing, and finish with the attached. The >> following file patterns are accepted for the checksums: >> . >> _ >> _. >> I have added some tests on the way to make sure that all the patterns >> get covered. Please note that this skips the temporary files. > > Maybe also add some correct (to be scanned) but non-empty garbage files > later on that it should barf on? I was not sure about doing that in the main patch so I tweaked manually the test to make sure that the tool still complained with "could not read block" as it should. That's easy enough to add, so I'll add them with multiple file patterns. Those are cheap checks as well if they are placed in global/. Another problem that the patch has is that it is not using forkname_to_number() to scan for all the fork types, and I forgot init forks in the previous version. Using forkname_to_number() also makes the tool more bug-proof, it is also not complicated to plug into the existing patch. Anyway, I have a bit of a problem here, which prevents me to stay in front of a computer or to look at a screen for more than a couple of minutes in a row for a couple of days at least, and I don't like to keep the buildfarm unhappy for the time being. There are three options: 1) Revert the TAP tests of pg_verify_checksums. 2) Push the patch which adds new entries for EXEC_BACKEND files in the skip list. That's a short workaround, and that would allow default deployments of Postgres to use the tool. 3) Finish the patch with the whitelist approach. I can do 1) or 2) in my condition. 3) requires more work than I can do now, though the patch to do is getting in shape, so the buildfarm would stay unhappy. Any preference of the course of action to take? -- Michael signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
Hi, On Fri, Oct 12, 2018 at 03:05:43PM +0900, Michael Paquier wrote: > On Fri, Oct 12, 2018 at 12:11:58PM +0900, Michael Paquier wrote: > > Agreed. I am just working on a patch for v11- which uses a > > whitelist-based method instead of what is present now. Reverting the > > tests to put the buildfarm to green could be done, but that's not the > > root of the problem. I think that's the right solution; the online verification patch adds even more logic to the blacklist so getting rid of it in favor of a whitelist is +1 with me. > So, I have coded this thing, and finish with the attached. The > following file patterns are accepted for the checksums: > . > _ > _. > I have added some tests on the way to make sure that all the patterns > get covered. Please note that this skips the temporary files. Maybe also add some correct (to be scanned) but non-empty garbage files later on that it should barf on? Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.ba...@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Re: pgsql: Add TAP tests for pg_verify_checksums
On Fri, Oct 12, 2018 at 12:11:58PM +0900, Michael Paquier wrote: > Agreed. I am just working on a patch for v11- which uses a > whitelist-based method instead of what is present now. Reverting the > tests to put the buildfarm to green could be done, but that's not the > root of the problem. So, I have coded this thing, and finish with the attached. The following file patterns are accepted for the checksums: . _ _. I have added some tests on the way to make sure that all the patterns get covered. Please note that this skips the temporary files. Thoughts? -- Michael diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c index 1bc020ab6c..81abb021d9 100644 --- a/src/bin/pg_verify_checksums/pg_verify_checksums.c +++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c @@ -49,26 +49,64 @@ usage(void) printf(_("Report bugs to .\n")); } -static const char *const skip[] = { - "pg_control", - "pg_filenode.map", - "pg_internal.init", - "PG_VERSION", - NULL, -}; - static bool skipfile(const char *fn) { - const char *const *f; + int pos; + int savepos; if (strcmp(fn, ".") == 0 || strcmp(fn, "..") == 0) return true; - for (f = skip; *f; f++) - if (strcmp(*f, fn) == 0) + /*-- + * Only files including data checksums are authorized for scanning. This + * is guessed based on the file name, by reverse-engineering + * GetRelationPath so make sure to update both code paths if any updates + * are done. The following file names are allowed: + * . + * _ + * _. + * + * Note that temporary files, beginning with 't', are also skipped. + * + *-- + */ + + /* A non-empty string of digits should follow */ + for (pos = 0; isdigit((unsigned char) fn[pos]); ++pos) + ; + /* good to go if only digits */ + if (fn[pos] == '\0') + return false; + /* leave if no digits */ + if (pos == 0) + return true; + + /* VM and FSM files can be scanned */ + if (fn[pos] == '_') + { + if (strncmp(fn + pos + 1, "vm", 2) == 0) + pos += 3; + else if (strncmp(fn + pos + 1, "fsm", 3) == 0) + pos += 4; + else return true; + } + + /* Check for an optional segment number */ + if (fn[pos] == '.') + { + savepos = pos + 1; /* count the dot */ + /* This should use only digits */ + for (pos = savepos; isdigit((unsigned char) fn[pos]); ++pos) + ; + if (fn[pos] != '\0') + return true; + } + + if (fn[pos] != '\0') + return true; return false; } diff --git a/src/bin/pg_verify_checksums/t/002_actions.pl b/src/bin/pg_verify_checksums/t/002_actions.pl index dc29da09af..6ad1d019f1 100644 --- a/src/bin/pg_verify_checksums/t/002_actions.pl +++ b/src/bin/pg_verify_checksums/t/002_actions.pl @@ -17,6 +17,25 @@ command_like(['pg_controldata', $pgdata], qr/Data page checksum version:.*1/, 'checksums enabled in control file'); +# Add set of dummy files with some contents. These should not be scanned +# by the tool. +append_to_file "$pgdata/global/123.12t", "foo"; +append_to_file "$pgdata/global/foo", "foo2"; +append_to_file "$pgdata/global/t123", "bar"; +append_to_file "$pgdata/global/123a", "bar2"; +append_to_file "$pgdata/global/.123", "foobar"; +append_to_file "$pgdata/global/_fsm", "foobar2"; +append_to_file "$pgdata/global/_vm.123", "foohoge"; +append_to_file "$pgdata/global/123_vm.123t", "foohoge2"; + +# Those are correct but empty files, so they should pass through. +append_to_file "$pgdata/global/9", ""; +append_to_file "$pgdata/global/9.123", ""; +append_to_file "$pgdata/global/9_vm", ""; +append_to_file "$pgdata/global/9_fsm", ""; +append_to_file "$pgdata/global/9_vm.123", ""; +append_to_file "$pgdata/global/9_fsm.123", ""; + # Checksums pass on a newly-created cluster command_ok(['pg_verify_checksums', '-D', $pgdata], "succeeds with offline cluster"); signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
On Thu, Oct 11, 2018 at 07:41:18PM -0700, Andres Freund wrote: > We only remove temp dirs on startup, and I'm pretty sure there at least > not too long ago were a few error cases where we can leak temp files in > individual sessions. And there's talk about allowing > pg_verify_checksums when online. So this seems to be an angle that > needs to be thought about... Agreed. I am just working on a patch for v11- which uses a whitelist-based method instead of what is present now. Reverting the tests to put the buildfarm to green could be done, but that's not the root of the problem. -- Michael signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
Hi, On 2018-10-12 11:07:58 +0900, Michael Paquier wrote: > On Thu, Oct 11, 2018 at 06:53:19PM -0700, Andres Freund wrote: > > Hm. Maybe I'm confused, but how is it correct to use a blacklisting > > approach here? It's far from absurd to have files inside a tablespacs > > when using temp_tablespace that aren't written through the buffer > > manager. Like, uh, temp files, when using temp_tablespaces. > > pg_verify_checksums assumes that the cluster has been cleanly shut > down so temp files are not an issue, no? However... We only remove temp dirs on startup, and I'm pretty sure there at least not too long ago were a few error cases where we can leak temp files in individual sessions. And there's talk about allowing pg_verify_checksums when online. So this seems to be an angle that needs to be thought about... Greetings, Andres Freund
Re: pgsql: Add TAP tests for pg_verify_checksums
On Thu, Oct 11, 2018 at 06:53:19PM -0700, Andres Freund wrote: > On 2018-10-12 10:39:18 +0900, Michael Paquier wrote: >> I have been able to reproduce the problem, and that's a bug within >> pg_verify_checksums as it fails to consider that config_exec_params is >> not a file it should scan when using EXEC_BACKEND. The same can >> happen with config_exec_params.new. > > Ugh, so it's actively borked on windows right now? It looks so. Automated testing proves to be useful sometimes. > Hm. Maybe I'm confused, but how is it correct to use a blacklisting > approach here? It's far from absurd to have files inside a tablespacs > when using temp_tablespace that aren't written through the buffer > manager. Like, uh, temp files, when using temp_tablespaces. pg_verify_checksums assumes that the cluster has been cleanly shut down so temp files are not an issue, no? However... > But even > leaving that aside, there's a few extensions that place additional files > into the tablespace directories (and we don't give them an alternative > solution). This is a good argument. I have not played much with such extensions myself though. > I think at the very least you're going to need to pattern match to > relation looking files. Yes, it seems so. A whitelist approach would consist in checking for relfilenodes, VMs and FSMs as files authorized for scan, in a way rather similar to what looks_like_temp_rel_name() does for temp files... -- Michael signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
Hi, On 2018-10-12 10:39:18 +0900, Michael Paquier wrote: > On Thu, Oct 11, 2018 at 06:04:11PM -0700, Andres Freund wrote: > > culicidae tests EXEC_BACKEND, so there's an explanation as to why it > > sometimes behaves differently. But here I don't immediately see how > > that'd matter. Probably still worth verifying that it's not somehow > > caused by that. > > Thanks, that's the point of detail I needed about culicidae Note that that's also mentioned on the BF page when you hover over the yellow stick-it symbol. > (you will need to explain me one day face-to-face how you pronounce > it). I didn't choose it ;) > I have been able to reproduce the problem, and that's a bug within > pg_verify_checksums as it fails to consider that config_exec_params is > not a file it should scan when using EXEC_BACKEND. The same can > happen with config_exec_params.new. Ugh, so it's actively borked on windows right now? > The attached, which fixes the issue for me, needs to be back-patched to > v11. Looks consistent with what we have rn. But: > diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c > b/src/bin/pg_verify_checksums/pg_verify_checksums.c > index 1bc020ab6c..a6c884f149 100644 > --- a/src/bin/pg_verify_checksums/pg_verify_checksums.c > +++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c > @@ -54,6 +54,10 @@ static const char *const skip[] = { > "pg_filenode.map", > "pg_internal.init", > "PG_VERSION", > +#ifdef EXEC_BACKEND > + "config_exec_params", > + "config_exec_params.new", > +#endif > NULL, > }; > Hm. Maybe I'm confused, but how is it correct to use a blacklisting approach here? It's far from absurd to have files inside a tablespacs when using temp_tablespace that aren't written through the buffer manager. Like, uh, temp files, when using temp_tablespaces. But even leaving that aside, there's a few extensions that place additional files into the tablespace directories (and we don't give them an alternative solution). I think at the very least you're going to need to pattern match to relation looking files. Greetings, Andres Freund
Re: pgsql: Add TAP tests for pg_verify_checksums
On Thu, Oct 11, 2018 at 06:04:11PM -0700, Andres Freund wrote: > culicidae tests EXEC_BACKEND, so there's an explanation as to why it > sometimes behaves differently. But here I don't immediately see how > that'd matter. Probably still worth verifying that it's not somehow > caused by that. Thanks, that's the point of detail I needed about culicidae (you will need to explain me one day face-to-face how you pronounce it). I have been able to reproduce the problem, and that's a bug within pg_verify_checksums as it fails to consider that config_exec_params is not a file it should scan when using EXEC_BACKEND. The same can happen with config_exec_params.new. The attached, which fixes the issue for me, needs to be back-patched to v11. -- Michael diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c index 1bc020ab6c..a6c884f149 100644 --- a/src/bin/pg_verify_checksums/pg_verify_checksums.c +++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c @@ -54,6 +54,10 @@ static const char *const skip[] = { "pg_filenode.map", "pg_internal.init", "PG_VERSION", +#ifdef EXEC_BACKEND + "config_exec_params", + "config_exec_params.new", +#endif NULL, }; signature.asc Description: PGP signature
Re: pgsql: Add TAP tests for pg_verify_checksums
Hi, On 2018-10-12 09:56:14 +0900, Michael Paquier wrote: > On Fri, Oct 12, 2018 at 12:17:57AM +, Michael Paquier wrote: > > Add TAP tests for pg_verify_checksums > > > > All options available in the utility get coverage: > > - Tests with disabled page checksums. > > - Tests with enabled test checksums. > > - Emulation of corruption and broken checksums with a full scan and > > single relfilenode scan. > > culicidae is failing after this commit in an interesting way, so we > really needed some tests for this utility: > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=culicidae=HEAD > > ok 7 - fails with corrupted data status (got 1 vs expected 1) > not ok 8 - fails with corrupted data stdout /(?^:Bad checksums:.*1)/ > > # Failed test 'fails with corrupted data stdout /(?^:Bad checksums:.*1)/' > # at t/002_actions.pl line 57. > # '' > # doesn't match '(?^:Bad checksums:.*1)' > not ok 9 - fails with corrupted data stderr /(?^:checksum verification > failed)/ > > So the checksum failure is happening, but the output is not present. > What's not clear to me is that all the other hosts running Debian SID > don't complain, and that the follow-up test on a single relfilenode is > able to pass with a test much similar to the previous one. culicidae tests EXEC_BACKEND, so there's an explanation as to why it sometimes behaves differently. But here I don't immediately see how that'd matter. Probably still worth verifying that it's not somehow caused by that. > Is the issue > that we should just call fflush() on stderr and stdout at the end of > main() in pg_verify_checksum.c? Shouldn't we back-patch that? I don't see how that'd would change anything. A "missing" fflush() isn't a licence to simply throw away buffer contents, so I doubt it's just that. Greetings, Andres Freund
Re: pgsql: Add TAP tests for pg_verify_checksums
On Fri, Oct 12, 2018 at 12:17:57AM +, Michael Paquier wrote: > Add TAP tests for pg_verify_checksums > > All options available in the utility get coverage: > - Tests with disabled page checksums. > - Tests with enabled test checksums. > - Emulation of corruption and broken checksums with a full scan and > single relfilenode scan. culicidae is failing after this commit in an interesting way, so we really needed some tests for this utility: https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=culicidae=HEAD ok 7 - fails with corrupted data status (got 1 vs expected 1) not ok 8 - fails with corrupted data stdout /(?^:Bad checksums:.*1)/ # Failed test 'fails with corrupted data stdout /(?^:Bad checksums:.*1)/' # at t/002_actions.pl line 57. # '' # doesn't match '(?^:Bad checksums:.*1)' not ok 9 - fails with corrupted data stderr /(?^:checksum verification failed)/ So the checksum failure is happening, but the output is not present. What's not clear to me is that all the other hosts running Debian SID don't complain, and that the follow-up test on a single relfilenode is able to pass with a test much similar to the previous one. Is the issue that we should just call fflush() on stderr and stdout at the end of main() in pg_verify_checksum.c? Shouldn't we back-patch that? -- Michael signature.asc Description: PGP signature
Re: TAP tests for pg_verify_checksums
On Wed, Oct 10, 2018 at 10:50:02AM +0900, Michael Paquier wrote: > The resulting patch is attached. Does that look good? And committed. Thanks all for taking the time to review. -- Michael signature.asc Description: PGP signature
Re: TAP tests for pg_verify_checksums
On Tue, Oct 09, 2018 at 05:14:50PM +0200, Michael Banck wrote: > Am Dienstag, den 09.10.2018, 16:54 +0200 schrieb Peter Eisentraut: >> On 06/10/2018 13:46, Michael Paquier wrote: > >>> +# Time to create a corruption > > That looks a bit weird, maybe "some corruption"? Or maybe it's just me > not being a native speaker. Okay, let's do with your suggestion. >> I would also like to see a test that runs against a cluster without >> checksums enabled. OK. I have added a test within initdb to save one initdb run. >> +# Checks cannot happen for an online cluster >> +$node->start; >> +command_fails(['pg_verify_checksums', '-D', $pgdata], >> + "checksum checks not done"); >> >> The test name should be something like "fails with online cluster". Done. I have put more thoughts into those. > One more thing we could check is the relfilenode after we corrupted it, > it should also catch the corruption then. Or is that too trivial? There is one as of v3: +# Checksum checks on single relfilenode fail +$node->command_checks_all([ 'pg_verify_checksums', '-D', $pgdata, '-r', + $relfilenode_corrupted], + 1, + [qr/Bad checksums:.*1/], + [qr/checksum verification failed/], + ''); The resulting patch is attached. Does that look good? -- Michael diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl index 5dc629fd5e..759779adb2 100644 --- a/src/bin/initdb/t/001_initdb.pl +++ b/src/bin/initdb/t/001_initdb.pl @@ -8,7 +8,7 @@ use Fcntl ':mode'; use File::stat qw{lstat}; use PostgresNode; use TestLib; -use Test::More tests => 18; +use Test::More tests => 22; my $tempdir = TestLib::tempdir; my $xlogdir = "$tempdir/pgxlog"; @@ -58,6 +58,18 @@ mkdir $datadir; "check PGDATA permissions"); } } + +# Control file should tell that data checksums are disabled by default. +command_like(['pg_controldata', $datadir], + qr/Data page checksum version:.*0/, + 'checksums are disabled in control file'); +# pg_verify_checksums fails with checksums disabled by default. This is +# not part of the tests included in pg_verify_checksums to save from +# the creation of an extra instance. +command_fails( + [ 'pg_verify_checksums', '-D', $datadir], + "pg_verify_checksums fails with data checksum disabled"); + command_ok([ 'initdb', '-S', $datadir ], 'sync only'); command_fails([ 'initdb', $datadir ], 'existing data directory'); diff --git a/src/bin/pg_verify_checksums/.gitignore b/src/bin/pg_verify_checksums/.gitignore index d1dcdaf0dd..0e5e569a54 100644 --- a/src/bin/pg_verify_checksums/.gitignore +++ b/src/bin/pg_verify_checksums/.gitignore @@ -1 +1,3 @@ /pg_verify_checksums + +/tmp_check/ diff --git a/src/bin/pg_verify_checksums/Makefile b/src/bin/pg_verify_checksums/Makefile index d16261571f..cfe4ab1b8b 100644 --- a/src/bin/pg_verify_checksums/Makefile +++ b/src/bin/pg_verify_checksums/Makefile @@ -34,3 +34,9 @@ uninstall: clean distclean maintainer-clean: rm -f pg_verify_checksums$(X) $(OBJS) rm -rf tmp_check + +check: + $(prove_check) + +installcheck: + $(prove_installcheck) diff --git a/src/bin/pg_verify_checksums/t/001_basic.pl b/src/bin/pg_verify_checksums/t/001_basic.pl new file mode 100644 index 00..1fa2e12db2 --- /dev/null +++ b/src/bin/pg_verify_checksums/t/001_basic.pl @@ -0,0 +1,8 @@ +use strict; +use warnings; +use TestLib; +use Test::More tests => 8; + +program_help_ok('pg_verify_checksums'); +program_version_ok('pg_verify_checksums'); +program_options_handling_ok('pg_verify_checksums'); diff --git a/src/bin/pg_verify_checksums/t/002_actions.pl b/src/bin/pg_verify_checksums/t/002_actions.pl new file mode 100644 index 00..dc29da09af --- /dev/null +++ b/src/bin/pg_verify_checksums/t/002_actions.pl @@ -0,0 +1,69 @@ +# Do basic sanity checks supported by pg_verify_checksums using +# an initialized cluster. + +use strict; +use warnings; +use PostgresNode; +use TestLib; +use Test::More tests => 12; + +# Initialize node with checksums enabled. +my $node = get_new_node('node_checksum'); +$node->init(extra => ['--data-checksums']); +my $pgdata = $node->data_dir; + +# Control file should know that checksums are enabled. +command_like(['pg_controldata', $pgdata], + qr/Data page checksum version:.*1/, + 'checksums enabled in control file'); + +# Checksums pass on a newly-created cluster +command_ok(['pg_verify_checksums', '-D', $pgdata], + "succeeds with offline cluster"); + +# Checks cannot happen with an online cluster +$node->start; +command_fails(['pg_verify_checksums', '-D', $pgdata], + "fails with online cluster"); + +# Create table to corrupt and get its relfilenode +$node->safe_psql('postgres', + "SELECT a INTO corrupt1 FROM generate_series(1,1) AS a; + ALTER TABLE corrupt1 SET (autovacuum_enabled=false);"); + +my $file_corrupted = $node->safe_psql('postgres', + "SELECT
Re: TAP tests for pg_verify_checksums
Hi, Am Dienstag, den 09.10.2018, 16:54 +0200 schrieb Peter Eisentraut: > On 06/10/2018 13:46, Michael Paquier wrote: > > What do you think about the updated version attached? > > +# Time to create a corruption That looks a bit weird, maybe "some corupption"? Or maybe it's just me not being a native speaker. > Other than that, it appears to cover everything. One more thing we could check is the relfilenode after we corrupted it, it should also catch the corruption then. Or is that too trivial? Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.ba...@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Re: TAP tests for pg_verify_checksums
On 06/10/2018 13:46, Michael Paquier wrote: > On Fri, Oct 05, 2018 at 01:38:05PM +0200, Michael Banck wrote: >> It's too late for v11 though at this point I guess? > > Unfortunately yes. > >> I think it would be easy to also test the -r command-line option, as we >> already create a table. > > Good idea. Let's add this test. > >> That comment should read 'that checksums are enabled', right? > > Indeed. Fixed. > >> Otherwise, LGTM and I've tested it without finding any problems. > > What do you think about the updated version attached? Looks pretty good to me. Let's make sure the test names are useful: +# Checks cannot happen for an online cluster +$node->start; +command_fails(['pg_verify_checksums', '-D', $pgdata], + "checksum checks not done"); The test name should be something like "fails with online cluster". I would also like to see a test that runs against a cluster without checksums enabled. Other than that, it appears to cover everything. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: TAP tests for pg_verify_checksums
On Fri, Oct 05, 2018 at 01:38:05PM +0200, Michael Banck wrote: > It's too late for v11 though at this point I guess? Unfortunately yes. > I think it would be easy to also test the -r command-line option, as we > already create a table. Good idea. Let's add this test. > That comment should read 'that checksums are enabled', right? Indeed. Fixed. > Otherwise, LGTM and I've tested it without finding any problems. What do you think about the updated version attached? -- Michael diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl index 5dc629fd5e..610887e5d4 100644 --- a/src/bin/initdb/t/001_initdb.pl +++ b/src/bin/initdb/t/001_initdb.pl @@ -8,7 +8,7 @@ use Fcntl ':mode'; use File::stat qw{lstat}; use PostgresNode; use TestLib; -use Test::More tests => 18; +use Test::More tests => 21; my $tempdir = TestLib::tempdir; my $xlogdir = "$tempdir/pgxlog"; @@ -58,6 +58,12 @@ mkdir $datadir; "check PGDATA permissions"); } } + +# Control file should tell that data checksums are disabled by default. +command_like(['pg_controldata', $datadir], + qr/Data page checksum version:.*0/, + 'checksums are disabled in control file'); + command_ok([ 'initdb', '-S', $datadir ], 'sync only'); command_fails([ 'initdb', $datadir ], 'existing data directory'); diff --git a/src/bin/pg_verify_checksums/.gitignore b/src/bin/pg_verify_checksums/.gitignore index d1dcdaf0dd..0e5e569a54 100644 --- a/src/bin/pg_verify_checksums/.gitignore +++ b/src/bin/pg_verify_checksums/.gitignore @@ -1 +1,3 @@ /pg_verify_checksums + +/tmp_check/ diff --git a/src/bin/pg_verify_checksums/Makefile b/src/bin/pg_verify_checksums/Makefile index d16261571f..cfe4ab1b8b 100644 --- a/src/bin/pg_verify_checksums/Makefile +++ b/src/bin/pg_verify_checksums/Makefile @@ -34,3 +34,9 @@ uninstall: clean distclean maintainer-clean: rm -f pg_verify_checksums$(X) $(OBJS) rm -rf tmp_check + +check: + $(prove_check) + +installcheck: + $(prove_installcheck) diff --git a/src/bin/pg_verify_checksums/t/001_basic.pl b/src/bin/pg_verify_checksums/t/001_basic.pl new file mode 100644 index 00..1fa2e12db2 --- /dev/null +++ b/src/bin/pg_verify_checksums/t/001_basic.pl @@ -0,0 +1,8 @@ +use strict; +use warnings; +use TestLib; +use Test::More tests => 8; + +program_help_ok('pg_verify_checksums'); +program_version_ok('pg_verify_checksums'); +program_options_handling_ok('pg_verify_checksums'); diff --git a/src/bin/pg_verify_checksums/t/002_actions.pl b/src/bin/pg_verify_checksums/t/002_actions.pl new file mode 100644 index 00..2c329bbf8b --- /dev/null +++ b/src/bin/pg_verify_checksums/t/002_actions.pl @@ -0,0 +1,69 @@ +# Do basic sanity checks supported by pg_verify_checksums using +# an initialized cluster. + +use strict; +use warnings; +use PostgresNode; +use TestLib; +use Test::More tests => 12; + +# Initialize node with checksums enabled. +my $node = get_new_node('node_checksum'); +$node->init(extra => ['--data-checksums']); +my $pgdata = $node->data_dir; + +# Control file should know that checksums are enabled. +command_like(['pg_controldata', $pgdata], + qr/Data page checksum version:.*1/, + 'checksums enabled in control file'); + +# Checksums pass on a newly-created cluster +command_ok(['pg_verify_checksums', '-D', $pgdata], + "checksum checks done and passing"); + +# Checks cannot happen for an online cluster +$node->start; +command_fails(['pg_verify_checksums', '-D', $pgdata], + "checksum checks not done"); + +# Create table to corrupt and get its relfilenode +$node->safe_psql('postgres', + "SELECT a INTO corrupt1 FROM generate_series(1,1) AS a; + ALTER TABLE corrupt1 SET (autovacuum_enabled=false);"); + +my $file_corrupted = $node->safe_psql('postgres', + "SELECT pg_relation_filepath('corrupt1')"); +my $relfilenode_corrupted = $node->safe_psql('postgres', + "SELECT relfilenode FROM pg_class WHERE relname = 'corrupt1';"); + +# Set page header and block size +my $pageheader_size = 24; +my $block_size = $node->safe_psql('postgres', 'SHOW block_size;'); +$node->stop; + +# Checksums pass for single relfilenode as the table is not corrupted +# yet. +command_ok(['pg_verify_checksums', '-D', $pgdata, + '-r', $relfilenode_corrupted], + "checksum checks for table relfilenode done and passing"); + +# Time to create a corruption +open my $file, '+<', "$pgdata/$file_corrupted"; +seek($file, $pageheader_size, 0); +syswrite($file, '\0\0\0\0\0\0\0\0\0'); +close $file; + +# Checksum checks on single relfilenode fail +$node->command_checks_all([ 'pg_verify_checksums', '-D', $pgdata, '-r', + $relfilenode_corrupted], + 1, + [qr/Bad checksums:.*1/], + [qr/checksum verification failed/], + 'pg_checksums reports checksum mismatch'); + +# Global checksum checks fail +$node->command_checks_all([ 'pg_verify_checksums', '-D', $pgdata], + 1, + [qr/Bad checksums:.*1/], + [qr/checksum verification failed/], + 'pg_checksums reports checksum mismatch');
Re: TAP tests for pg_verify_checksums
Hi, On Fri, Oct 05, 2018 at 10:26:45AM +0900, Michael Paquier wrote: > The topic of $subject has been discussed a bit times, resulting in a > couple of patches on the way: > https://www.postgresql.org/message-id/20180830200258.gg15...@paquier.xyz > https://www.postgresql.org/message-id/cabuevezekrwpegk2o-ov4z2mjft6cu8clfa-v1s1j4z8x7w...@mail.gmail.com > > However nothing has actually happened. Based on the previous feedback, > attached is an updated patch to do the actual job. Thanks! It's too late for v11 though at this point I guess? I think it would be easy to also test the -r command-line option, as we already create a table. Something like |my $relfilenode_corrupted = $node->safe_psql('postgres', |"SELECT relfilenode FROM pg_class WHERE relname = 'corrupt1'"); [...] |# Checksums pass solely on that table |command_ok(['pg_verify_checksums', '-D', $pgdata, '-r', $relfilenode_corrupted], | "checksum checks for table relfilenode done and passing"); While making sure the $node->stop is between the two. One nitpick: > diff --git a/src/bin/pg_verify_checksums/t/002_actions.pl > b/src/bin/pg_verify_checksums/t/002_actions.pl > new file mode 100644 > index 00..7423595181 > --- /dev/null > +++ b/src/bin/pg_verify_checksums/t/002_actions.pl [...] > +# Control file should know that checksums are disabled. > +command_like(['pg_controldata', $pgdata], > + qr/Data page checksum version:.*1/, > + 'checksums enabled in control file'); That comment should read 'that checksums are enabled', right? Otherwise, LGTM and I've tested it without finding any problems. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.ba...@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
TAP tests for pg_verify_checksums
Hi all, The topic of $subject has been discussed a bit times, resulting in a couple of patches on the way: https://www.postgresql.org/message-id/20180830200258.gg15...@paquier.xyz https://www.postgresql.org/message-id/cabuevezekrwpegk2o-ov4z2mjft6cu8clfa-v1s1j4z8x7w...@mail.gmail.com However nothing has actually happened. Based on the previous feedback, attached is an updated patch to do the actual job. Magnus Hagander and Michael Banck need to be credited as authors, as this patch is roughly a merge of what they proposed. Thoughts? -- Michael diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl index 5dc629fd5e..610887e5d4 100644 --- a/src/bin/initdb/t/001_initdb.pl +++ b/src/bin/initdb/t/001_initdb.pl @@ -8,7 +8,7 @@ use Fcntl ':mode'; use File::stat qw{lstat}; use PostgresNode; use TestLib; -use Test::More tests => 18; +use Test::More tests => 21; my $tempdir = TestLib::tempdir; my $xlogdir = "$tempdir/pgxlog"; @@ -58,6 +58,12 @@ mkdir $datadir; "check PGDATA permissions"); } } + +# Control file should tell that data checksums are disabled by default. +command_like(['pg_controldata', $datadir], + qr/Data page checksum version:.*0/, + 'checksums are disabled in control file'); + command_ok([ 'initdb', '-S', $datadir ], 'sync only'); command_fails([ 'initdb', $datadir ], 'existing data directory'); diff --git a/src/bin/pg_verify_checksums/.gitignore b/src/bin/pg_verify_checksums/.gitignore index d1dcdaf0dd..0e5e569a54 100644 --- a/src/bin/pg_verify_checksums/.gitignore +++ b/src/bin/pg_verify_checksums/.gitignore @@ -1 +1,3 @@ /pg_verify_checksums + +/tmp_check/ diff --git a/src/bin/pg_verify_checksums/Makefile b/src/bin/pg_verify_checksums/Makefile index d16261571f..cfe4ab1b8b 100644 --- a/src/bin/pg_verify_checksums/Makefile +++ b/src/bin/pg_verify_checksums/Makefile @@ -34,3 +34,9 @@ uninstall: clean distclean maintainer-clean: rm -f pg_verify_checksums$(X) $(OBJS) rm -rf tmp_check + +check: + $(prove_check) + +installcheck: + $(prove_installcheck) diff --git a/src/bin/pg_verify_checksums/t/001_basic.pl b/src/bin/pg_verify_checksums/t/001_basic.pl new file mode 100644 index 00..1fa2e12db2 --- /dev/null +++ b/src/bin/pg_verify_checksums/t/001_basic.pl @@ -0,0 +1,8 @@ +use strict; +use warnings; +use TestLib; +use Test::More tests => 8; + +program_help_ok('pg_verify_checksums'); +program_version_ok('pg_verify_checksums'); +program_options_handling_ok('pg_verify_checksums'); diff --git a/src/bin/pg_verify_checksums/t/002_actions.pl b/src/bin/pg_verify_checksums/t/002_actions.pl new file mode 100644 index 00..7423595181 --- /dev/null +++ b/src/bin/pg_verify_checksums/t/002_actions.pl @@ -0,0 +1,52 @@ +# Do basic sanity checks supported by pg_verify_checksums using +# an initialized cluster. + +use strict; +use warnings; +use PostgresNode; +use TestLib; +use Test::More tests => 8; + +# Initialize node with checksums enabled. +my $node = get_new_node('node_checksum'); +$node->init(extra => ['--data-checksums']); +my $pgdata = $node->data_dir; + +# Control file should know that checksums are disabled. +command_like(['pg_controldata', $pgdata], + qr/Data page checksum version:.*1/, + 'checksums enabled in control file'); + +# Checksums pass on a newly-created cluster +command_ok(['pg_verify_checksums', '-D', $pgdata], + "checksum checks done and passing"); + +# Checks cannot happen for an online cluster +$node->start; +command_fails(['pg_verify_checksums', '-D', $pgdata], + "checksum checks not done"); + +# Create table to corrupt and get its relfilenode +$node->safe_psql('postgres', + "SELECT a INTO corrupt1 FROM generate_series(1,1) AS a; + ALTER TABLE corrupt1 SET (autovacuum_enabled=false);"); + +my $file_corrupted = $node->safe_psql('postgres', + "SELECT pg_relation_filepath('corrupt1')"); + +# Set page header and block size +my $pageheader_size = 24; +my $block_size = $node->safe_psql('postgres', 'SHOW block_size;'); +$node->stop; + +# Time to create a corruption +open my $file, '+<', "$pgdata/$file_corrupted"; +seek($file, $pageheader_size, 0); +syswrite($file, '\0\0\0\0\0\0\0\0\0'); +close $file; + +$node->command_checks_all([ 'pg_verify_checksums', '-D', $pgdata], + 1, + [qr/Bad checksums: 1/s], + [qr/checksum verification failed/s], + 'pg_checksums reports checksum mismatch'); signature.asc Description: PGP signature