Re: [HACKERS] Measuring replay lag
Hi Thomas. At 2017-02-15 00:48:41 +1300, thomas.mu...@enterprisedb.com wrote: > > Here is a new version with the buffer on the sender side as requested. This looks good. > + write_lag > + interval > + Estimated time taken for recent WAL records to be written on this > + standby server I think I would find a slightly more detailed explanation helpful here. A few tiny nits: > + * If the lsn hasn't advanced since last time, then do nothing. This > way > + * we only record a new sample when new WAL has been written, which is > + * simple proxy for the time at which the log was written. "which is simple" → "which is a simple" > + * If the buffer if full, for now we just rewind by one slot and > overwrite > + * the last sample, as a simple if somewhat uneven way to lower the > + * sampling rate. There may be better adaptive compaction algorithms. "buffer if" → "buffer is" > + * Find out how much time has elapsed since WAL position 'lsn' or earlier was > + * written to the lag tracking buffer and 'now'. Return -1 if no time is > + * available, and otherwise the elapsed time in microseconds. Find out how much time has elapsed "between X and 'now'", or "since X". (I prefer the former, i.e., s/since/between/.) -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Update copyright for 2017
At 2017-01-03 18:46:32 +0100, mag...@hagander.net wrote: > > Thoughts? I think we should stop making wholesale changes to copyright notices every year. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_basebackups and slots
At 2016-12-16 07:32:24 +0100, mag...@hagander.net wrote: > > I really think the default should be "what most people want", and not > "whatever is compatible with a mode that was necessary because we > lacked infrastructure". Very much agreed. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal for changes to recovery.conf API
At 2016-11-04 10:35:05 +, si...@2ndquadrant.com wrote: > > Since the "lots of parameters" approach is almost exactly what we have > now, I think we should do that first, get this patch committed and > then sit back and discuss an improved API and see what downsides it > introduces. Thanks, I agree strongly with this. Someone (Michael?) earlier mentioned the potential for introducing bugs with this patch (the idea of merging recovery.conf into postgresql.conf at all, not this particular incarnation). I think the current proposed approach with recovery_target=xid recovery_target_xid=xxx is preferable because (a) it doesn't introduce much new code, e.g., new parsing functions, nor (b) need many changes in the documentation—all we need is to say that of the various recovery_target_xxx parameters, the one that has priority is the one named by recovery_target. If I were to introduce recovery_target='xid xxx', I would at a minimum need to factor out (or duplicate) parsing and error handling code, make a type/value union-in-struct to store in the GUC *extra, then make sure that we handle the older values in a backwards-compatible way, and move a bunch of documentation around. Can it be done? Yes, of course; and I'll do it if that's the consensus. But it would be a backwards-compatible change to the current approach anyway, and I think it would be better to put in the simple way now before discussing the new API further. Let's get the basic new *functionality* out so that people can play with it, I'm sure we'll find a few non-API things that need tweaking as a result of the change. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal for changes to recovery.conf API
At 2016-09-28 13:13:56 -0400, robertmh...@gmail.com wrote: > > I hope that the fact that there's been no discussion for the last > three weeks doesn't mean this effort is dead; I would like very > much to see it move forward. Here's an updated patch. Sorry, I got busy elswhere. I struggled with the handling of recovery_target a little. For example, one suggested alternative was: recovery_target_type = xid recovery_target_value = … The problem with implementing it this way is that the _value setting cannot be parsed without already having parsed the _type, and I didn't want to force that sort of dependency. What I've done instead is to make this work: recovery_target = xid|time|name|lsn|immediate recovery_target_xid = … recovery_target_time = … recovery_target_name = … recovery_target_lsn = … The recovery_target_xxx values are parsed as they used to be, but the one that's used is the one that's set in recovery_target. That's easy to explain, and the patch is much less intrusive, but I'm certainly open to suggestions to improve this, and I have the time to work on this patch with a view towards getting it committed in this cycle. -- Abhijit P.S. Sorry, I haven't been able to resolve the conflicts between Simon's earlier recovery_startup_r10_api.v1b patch and the "pg_ctl promote -w" changes in master. I was distracted by some illness in the family, but I will post another update very soon. >From 231ccefbc99c07f00560f4e88a961db186db704e Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen <a...@2ndquadrant.com> Date: Mon, 31 Oct 2016 11:32:51 +0530 Subject: Convert recovery.conf settings to GUCs Based on unite_recoveryconf_postgresqlconf_v3.patch by Fujii Masao. --- contrib/pg_standby/pg_standby.c | 2 +- doc/src/sgml/backup.sgml| 31 +- doc/src/sgml/config.sgml| 480 ++ doc/src/sgml/filelist.sgml | 1 - doc/src/sgml/func.sgml | 2 +- doc/src/sgml/high-availability.sgml | 42 +- doc/src/sgml/pgstandby.sgml | 4 +- doc/src/sgml/postgres.sgml | 1 - doc/src/sgml/recovery-config.sgml | 508 --- doc/src/sgml/ref/pgarchivecleanup.sgml | 6 +- doc/src/sgml/release-9.1.sgml | 5 +- doc/src/sgml/release.sgml | 2 +- src/backend/access/transam/recovery.conf.sample | 29 +- src/backend/access/transam/xlog.c | 520 +++- src/backend/access/transam/xlogarchive.c| 8 +- src/backend/postmaster/startup.c| 2 + src/backend/replication/walreceiver.c | 24 ++ src/backend/utils/misc/guc-file.l | 18 + src/backend/utils/misc/guc.c| 430 src/backend/utils/misc/postgresql.conf.sample | 24 ++ src/bin/pg_archivecleanup/pg_archivecleanup.c | 2 +- src/include/access/xlog.h | 21 + src/include/access/xlog_internal.h | 2 + src/include/utils/guc_tables.h | 2 + 24 files changed, 1202 insertions(+), 964 deletions(-) delete mode 100644 doc/src/sgml/recovery-config.sgml diff --git a/contrib/pg_standby/pg_standby.c b/contrib/pg_standby/pg_standby.c index e4136f9..8fcb85c 100644 --- a/contrib/pg_standby/pg_standby.c +++ b/contrib/pg_standby/pg_standby.c @@ -520,7 +520,7 @@ usage(void) printf(" -w MAXWAITTIME max seconds to wait for a file (0=no limit) (default=0)\n"); printf(" -?, --help show this help, then exit\n"); printf("\n" - "Main intended use as restore_command in recovery.conf:\n" + "Main intended use as restore_command in postgresql.conf:\n" " restore_command = 'pg_standby [OPTION]... ARCHIVELOCATION %%f %%p %%r'\n" "e.g.\n" " restore_command = 'pg_standby /mnt/server/archiverdir %%f %%p %%r'\n"); diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml index 6eaed1e..2df0dc6 100644 --- a/doc/src/sgml/backup.sgml +++ b/doc/src/sgml/backup.sgml @@ -1190,8 +1190,15 @@ SELECT pg_stop_backup(); - Create a recovery command file recovery.conf in the cluster - data directory (see ). You might + Set up recovery parameters in postgresql.conf (see + and + ). + + + + + Create a recovery trigger file recovery.trigger + in the cluster data directory. You might also want to temporarily modify pg_hba.conf to prevent ordinary users from connecting until you are sure the recovery was successful. @@ -1203,7 +1210,7 @@ SELECT pg_stop_backup(); recovery be terminated because of an external error, the server can simply be restarted and it will continue recovery. Upon completion of the r
Re: [HACKERS] Proposal for changes to recovery.conf API
At 2016-09-06 10:56:53 +0200, p...@2ndquadrant.com wrote: > > So +1 on the recovery_target = 'xid:xxx' idea. OK, I'll make it work that way. New patch in a couple of days. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal for changes to recovery.conf API
At 2016-09-06 14:40:54 +0900, michael.paqu...@gmail.com wrote: > > my best advice here is to make all those recovery_target_* parameters > PGC_POSTMASTER so as they are loaded only once when the server starts, > and then we define the recovery target type used in the startup > process instead of trying to do so at GUC level. I understand your approach in light of the GUC code, but I see things a bit differently—the complexity comes largely from the specific handling of recovery_target. I'll try to come up with a way to do it better. If not, we have your suggestion to fall back on. > +else if (recovery_target_lsn != 0) > +recovery_target = RECOVERY_TARGET_LSN; > This needs to check for InvalidXLogRecPtr. Of course, thanks. Fixed locally in all the relevant places. Will repost whenever there are other substantive changes. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal for changes to recovery.conf API
Hi. Here's an updated version of my patch, which now applies on top of the patch that Simon posted earlier (recovery_startup_r10_api.v1b.patch). A few notes: 1. I merged in recovery_target_lsn as a new GUC setting. 2. I fixed various minor nits in the earlier patch, not worth mentioning individually. 3. I haven't added back trigger_file, because Simon's patch removes it. I can add it back in separately after discussion (otherwise Simon's and my patches will conflict). 4. I've tested this to the extent that setting things in postgresql.conf works, and recovery.conf is still read if it exists, and so on. One open issue is the various assign_recovery_target_xxx functions, which Michael noted in his earlier review: +static void +assign_recovery_target_xid(const char *newval, void *extra) +{ + recovery_target_xid = *((TransactionId *) extra); + + if (recovery_target_xid != InvalidTransactionId) + recovery_target = RECOVERY_TARGET_XID; + else if (recovery_target_name && *recovery_target_name) + recovery_target = RECOVERY_TARGET_NAME; + else if (recovery_target_time != 0) + recovery_target = RECOVERY_TARGET_TIME; + else if (recovery_target_lsn != 0) + recovery_target = RECOVERY_TARGET_LSN; + else + recovery_target = RECOVERY_TARGET_UNSET; +} (Note how recovery_target_lsn caused this—and three other functions besides—to grow an extra branch.) I don't like this code, but I'm not yet sure what to replace it with. I think we should address the underlying problem—that the UI doesn't map cleanly to what the code wants. There's been some discussion about this earlier, but not any consensus that I could see. Do we want something like this (easy to implement and document, perhaps not especially convenient to use): recovery_target = 'xid' # or 'time'/'name'/'lsn'/'immediate' recovery_target_xid = xxx? # the only setting we care about now recovery_target_otherthings = parsed_but_ignored Or something like this (a bit harder to implement): recovery_target = 'xid:xxx' # or 'time:xxx' etc. Alternatively, the do-nothing option is to move the tests from guc.c to StartupXLOG and do it only once in some defined order (which would also break the current last-mention-wins behaviour). Thoughts? (I've added Fujii to the Cc: list, in case he has any comments, since this is based on his earlier patch.) -- Abhijit commit c20d735648f5ea867fda5afc499a63d3877536a3 Author: Abhijit Menon-Sen <a...@2ndquadrant.com> Date: Thu Sep 1 09:01:04 2016 +0530 Convert recovery.conf settings to GUCs Based on unite_recoveryconf_postgresqlconf_v3.patch by Fujii Masao. diff --git a/contrib/pg_standby/pg_standby.c b/contrib/pg_standby/pg_standby.c index e4136f9..8fcb85c 100644 --- a/contrib/pg_standby/pg_standby.c +++ b/contrib/pg_standby/pg_standby.c @@ -520,7 +520,7 @@ usage(void) printf(" -w MAXWAITTIME max seconds to wait for a file (0=no limit) (default=0)\n"); printf(" -?, --help show this help, then exit\n"); printf("\n" - "Main intended use as restore_command in recovery.conf:\n" + "Main intended use as restore_command in postgresql.conf:\n" " restore_command = 'pg_standby [OPTION]... ARCHIVELOCATION %%f %%p %%r'\n" "e.g.\n" " restore_command = 'pg_standby /mnt/server/archiverdir %%f %%p %%r'\n"); diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml index 0f09d82..f43e41e 100644 --- a/doc/src/sgml/backup.sgml +++ b/doc/src/sgml/backup.sgml @@ -1174,8 +1174,15 @@ SELECT pg_stop_backup(); - Create a recovery command file recovery.conf in the cluster - data directory (see ). You might + Set up recovery parameters in postgresql.conf (see + and + ). + + + + + Create a recovery trigger file recovery.trigger + in the cluster data directory. You might also want to temporarily modify pg_hba.conf to prevent ordinary users from connecting until you are sure the recovery was successful. @@ -1187,7 +1194,7 @@ SELECT pg_stop_backup(); recovery be terminated because of an external error, the server can simply be restarted and it will continue recovery. Upon completion of the recovery process, the server will rename - recovery.conf to recovery.done (to prevent + recovery.trigger to recovery.done (to prevent accidentally re-entering recovery mode later) and then commence normal database operations. @@ -1203,12 +1210,11 @@ SELECT pg_stop_backup(); -The key part of all this is to set up a recovery configuration file that -describes how you want to recover and how far the recovery should -run. You can use recovery.conf.sample (normally -located in the installation's share/ directory) as a -prototype
Re: [HACKERS] LSN as a recovery target
At 2016-09-04 07:02:01 +0100, si...@2ndquadrant.com wrote: > > > By the way, what has been committed does not include the patch > > adding the parsing context in case of an error as wanted upthread. > > Perhaps that's not worth adding now as there is the GUC refactoring > > potentially happening for the recovery parameters, so I don't mind > > much. Just that's worth mentioning. > > Hmm, that was unintentional. If something stalls the recovery > parameter project, please remind me to commit that as well. I'm aware of this, and will adjust accordingly in the GUC patch. Thanks for the heads up. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal for changes to recovery.conf API
At 2016-09-01 15:51:11 +0900, michael.paqu...@gmail.com wrote: > > - (errmsg("starting point-in-time recovery to XID %u", > - recoveryTargetXid))); > User loses information if those logs are removed. Agreed. I'm revising the patch right now, and I decided to leave them. I'll consider and comment on the remainder of your points after I've posted an update. Thanks for reading. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal for changes to recovery.conf API
At 2016-08-31 17:15:59 +0100, si...@2ndquadrant.com wrote: > > * Recovery parameters would now be part of the main postgresql.conf > infrastructure > Any parameters set in $DATADIR/recovery.conf will be read after the > main parameter file, similar to the way that postgresql.conf.auto is > read. > (Abhijit) > > * Parameters > All of the parameters formerly set in recovery.conf can be set in > postgresql.conf using RELOAD > These parameters will have no defaults in postgresql.conf.sample > Setting them has no effect during normal running, or once recovery ends. > https://www.postgresql.org/docs/devel/static/archive-recovery-settings.html > https://www.postgresql.org/docs/devel/static/recovery-target-settings.html > https://www.postgresql.org/docs/devel/static/standby-settings.html > (Abhijit) I've attached a WIP patch for the above (recovery_guc_v20160831.patch). This was based on the unite_recoveryconf_postgresqlconf_v3.patch posted by Fujii Masao. Unfortunately, some parts conflict with the patch that Simon just posted (e.g., his patch removes trigger_file altogether, whereas mine converts it into a GUC along the lines of the original patch). Rather than trying to untangle that right now, I'm posting what I have as-is, and I'll post an updated version tomorrow. -- Abhijit commit 999c0c2632f0d4c20d20b9ac30cd258305f74bac Author: Abhijit Menon-Sen <a...@2ndquadrant.com> Date: Wed Aug 31 22:18:07 2016 +0530 Convert recovery.conf settings into GUCs Based on unite_recoveryconf_postgresqlconf_v3.patch by Fujii Masao. diff --git a/contrib/pg_standby/pg_standby.c b/contrib/pg_standby/pg_standby.c index e4136f9..8fcb85c 100644 --- a/contrib/pg_standby/pg_standby.c +++ b/contrib/pg_standby/pg_standby.c @@ -520,7 +520,7 @@ usage(void) printf(" -w MAXWAITTIME max seconds to wait for a file (0=no limit) (default=0)\n"); printf(" -?, --help show this help, then exit\n"); printf("\n" - "Main intended use as restore_command in recovery.conf:\n" + "Main intended use as restore_command in postgresql.conf:\n" " restore_command = 'pg_standby [OPTION]... ARCHIVELOCATION %%f %%p %%r'\n" "e.g.\n" " restore_command = 'pg_standby /mnt/server/archiverdir %%f %%p %%r'\n"); diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml index 0f09d82..f43e41e 100644 --- a/doc/src/sgml/backup.sgml +++ b/doc/src/sgml/backup.sgml @@ -1174,8 +1174,15 @@ SELECT pg_stop_backup(); - Create a recovery command file recovery.conf in the cluster - data directory (see ). You might + Set up recovery parameters in postgresql.conf (see + and + ). + + + + + Create a recovery trigger file recovery.trigger + in the cluster data directory. You might also want to temporarily modify pg_hba.conf to prevent ordinary users from connecting until you are sure the recovery was successful. @@ -1187,7 +1194,7 @@ SELECT pg_stop_backup(); recovery be terminated because of an external error, the server can simply be restarted and it will continue recovery. Upon completion of the recovery process, the server will rename - recovery.conf to recovery.done (to prevent + recovery.trigger to recovery.done (to prevent accidentally re-entering recovery mode later) and then commence normal database operations. @@ -1203,12 +1210,11 @@ SELECT pg_stop_backup(); -The key part of all this is to set up a recovery configuration file that -describes how you want to recover and how far the recovery should -run. You can use recovery.conf.sample (normally -located in the installation's share/ directory) as a -prototype. The one thing that you absolutely must specify in -recovery.conf is the restore_command, +The key part of all this is to set up recovery parameters that +specify how you want to recover and how far the recovery should +run. The one thing that you absolutely must specify in +postgresql.conf to recover from the backup is +the restore_command, which tells PostgreSQL how to retrieve archived WAL file segments. Like the archive_command, this is a shell command string. It can contain %f, which is @@ -1270,7 +1276,7 @@ restore_command = 'cp /mnt/server/archivedir/%f %p' If you want to recover to some previous point in time (say, right before the junior DBA dropped your main transaction table), just specify the -required stopping point in recovery.conf. You can specify +required stopping point in postgresql.conf. You can specify the stop point, known as the recovery target, either by date/time, named restore point or by completion of a specific transaction ID. As of this writing only the date/time and named restore point options @@ -1367,8 +
Re: [HACKERS] PoC: Make it possible to disallow WHERE-less UPDATE and DELETE
At 2016-07-21 12:46:29 -0400, robertmh...@gmail.com wrote: > > I join with others in thinking it's a reasonable contrib module. I don't like the use of the term "empty" to describe an UPDATE or DELETE without a WHERE clause. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.6 and fsync=off
At 2016-04-28 13:44:23 -0700, and...@anarazel.de wrote: > > Abhijit had a patch implementing automatically running fsync whenever > reenabled IIRC. Abhijit? The patch I had written is attached, and it's not quite the same thing. Here's how I originally described it in response to a question from Robert: «In 20150115133245.gg5...@awork2.anarazel.de, Andres explained his rationale as follows: «What I am thinking of is that, currently, if you start the server for initial loading with fsync=off, and then restart it, you're open to data loss. So when the current config file setting is changed from off to on, we should fsync the data directory. Even if there was no crash restart.» That's what I tried to implement.» I remember there was some subsequent discussion about it being better to issue fsync during a checkpoint when we see that its value has changed, but if I did any work on it (which I have a vague memory of), I can't find it now. Sorry. Do you want a patch along those lines now, or is it too late? -- Abhijit >From 1768680b672bcb037446230323cabcf9960f7f9a Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen <a...@2ndquadrant.com> Date: Fri, 1 May 2015 17:59:51 +0530 Subject: Recursively fsync PGDATA on the next restart after fsync was disabled Even if we didn't crash, we want to fsync PGDATA on startup if we know the server ran with fsync=off at some point since the previous restart. Otherwise, starting the server with fsync=off for initial data loading and then restarting it opens you to data loss on a power failure after the restart. --- src/backend/access/transam/xlog.c | 9 +++-- src/bin/pg_controldata/pg_controldata.c | 2 ++ src/include/catalog/pg_control.h| 8 +++- 3 files changed, 16 insertions(+), 3 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 084174d..af12992 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -4823,6 +4823,7 @@ BootStrapXLOG(void) ControlFile->checkPoint = checkPoint.redo; ControlFile->checkPointCopy = checkPoint; ControlFile->unloggedLSN = 1; + ControlFile->fsync_disabled = false; /* Set important parameter values for use when replaying WAL */ ControlFile->MaxConnections = MaxConnections; @@ -5893,10 +5894,12 @@ StartupXLOG(void) */ if (enableFsync && - (ControlFile->state != DB_SHUTDOWNED && - ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)) + (ControlFile->fsync_disabled || + (ControlFile->state != DB_SHUTDOWNED && + ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY))) { fsync_pgdata(data_directory); + ControlFile->fsync_disabled = false; } if (ControlFile->state == DB_SHUTDOWNED) @@ -8272,6 +8275,8 @@ CreateCheckPoint(int flags) /* crash recovery should always recover to the end of WAL */ ControlFile->minRecoveryPoint = InvalidXLogRecPtr; ControlFile->minRecoveryPointTLI = 0; + if (!enableFsync) + ControlFile->fsync_disabled = true; /* * Persist unloggedLSN value. It's reset on crash recovery, so this goes diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c index d8cfe5e..e99014f 100644 --- a/src/bin/pg_controldata/pg_controldata.c +++ b/src/bin/pg_controldata/pg_controldata.c @@ -290,6 +290,8 @@ main(int argc, char *argv[]) (uint32) ControlFile.backupEndPoint); printf(_("End-of-backup record required:%s\n"), ControlFile.backupEndRequired ? _("yes") : _("no")); + printf(_("Fsync disabled at runtime:%s\n"), + ControlFile.fsync_disabled ? _("yes") : _("no")); printf(_("Current wal_level setting:%s\n"), wal_level_str(ControlFile.wal_level)); printf(_("Current wal_log_hints setting:%s\n"), diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h index 2e4c381..a71d73e 100644 --- a/src/include/catalog/pg_control.h +++ b/src/include/catalog/pg_control.h @@ -21,7 +21,7 @@ /* Version identifier for this pg_control format */ -#define PG_CONTROL_VERSION 942 +#define PG_CONTROL_VERSION 943 /* * Body of CheckPoint XLOG records. This is declared here because we keep @@ -182,6 +182,12 @@ typedef struct ControlFileData bool track_commit_timestamp; /* + * Indicates whether fsync was ever disabled since the last restart. + * Tested and set at checkpoints, reset at startup. + */ + bool fsync_disabled; + + /* * This data is used to check for hardware-architecture compatibility of * the database and the backend executable. We need not check endianness * explicitly, since the pg_control version will surely look wrong to a -- 1.9.1 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.6 and fsync=off
Here's a patch just to help things along. -- Abhijit diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index f3e3de0..353de2e 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -182,7 +182,9 @@ #wal_level = minimal # minimal, replica, or logical # (change requires restart) -#fsync = on# turns forced synchronization on or off +#fsync = on# provide crash safety by flushing disk writes + # (disabling this can lead to unrecoverable + # data corruption) #synchronous_commit = on # synchronization level; # off, local, remote_write, remote_apply, or on #wal_sync_method = fsync # the default is the first option -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.6 and fsync=off
At 2016-04-27 17:58:08 +0800, cr...@2ndquadrant.com wrote: > > #fsync = on # turns forced synchronization on or > off I suggest:# provide crash safety by flushing disk writes # (Disabling this can lead to unrecoverable data # loss if the system crashes.) -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] Breakage with VACUUM ANALYSE + partitions
At 2016-04-12 09:00:57 -0400, robertmh...@gmail.com wrote: > > On Mon, Apr 11, 2016 at 1:17 PM, Andres Freundwrote: > > > > 3) Actually handle the case of the last open segment not being > >RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so. > > #3 seems like it's probably about 15 years overdue, so let's do that > anyway. I don't have anything useful to contribute, I'm just trying to understand this fix. _mdfd_getseg() looks like this: targetseg = blkno / ((BlockNumber) RELSEG_SIZE); for (nextsegno = 1; nextsegno <= targetseg; nextsegno++) { Assert(nextsegno == v->mdfd_segno + 1); if (v->mdfd_chain == NULL) { /* * Normally we will create new segments only if authorized by the * caller (i.e., we are doing mdextend()). […] */ if (behavior == EXTENSION_CREATE || InRecovery) { if (_mdnblocks(reln, forknum, v) < RELSEG_SIZE) /* mdextend */ v->mdfd_chain = _mdfd_openseg(reln, forknum, +nextsegno, O_CREAT); } else { /* We won't create segment if not existent */ v->mdfd_chain = _mdfd_openseg(reln, forknum, nextsegno, 0); } if (v->mdfd_chain == NULL) /* return NULL or ERROR */ } v = v->mdfd_chain; } Do I understand correctly that the case of the "last open segment" (i.e., the one for which mdfd_chain == NULL) not being RELSEG_SIZE (i.e., _mdnblocks(reln, forknum, v) < RELSEG_SIZE) should not call _mfdf_openseg on nextsegno if behaviour is not EXTENSION_CREATE or InRecovery? And that "We won't create segment if not existent" should happen, but doesn't only because the next segment file wasn't removed earlier, so we have to add an extra check for that case? In other words, is something like the following what's needed here, or is there more to it? -- Abhijit diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 578276d..58ea5a0 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -1783,6 +1783,8 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno, if (v->mdfd_chain == NULL) { + bool thisSegNeedsExtension = _mdnblocks(reln, forknum, v) < RELSEG_SIZE; + /* * Normally we will create new segments only if authorized by the * caller (i.e., we are doing mdextend()). But when doing WAL @@ -1799,7 +1801,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno, */ if (behavior == EXTENSION_CREATE || InRecovery) { - if (_mdnblocks(reln, forknum, v) < RELSEG_SIZE) + if (thisSegNeedsExtension) { char *zerobuf = palloc0(BLCKSZ); @@ -1810,7 +1812,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno, } v->mdfd_chain = _mdfd_openseg(reln, forknum, +nextsegno, O_CREAT); } - else + else if (!thisSegNeedsExtension) { /* We won't create segment if not existent */ v->mdfd_chain = _mdfd_openseg(reln, forknum, nextsegno, 0); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pglogical questions and problems
At 2016-04-13 03:46:21 -0700, j...@commandprompt.com wrote: > > ALTER DATABASE ADD NODE; > ALTER SCHEMA SUBSCRIBE ALL; > CREATE REPLICATION SET; > > But I am unaware if that is possible within the constraints of the > extensions API. It is not possible. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Incomplete startup packet errors
At 2016-04-13 10:02:22 +0200, mag...@hagander.net wrote: > > I wonder if it would make sense to only log that error if *at least > one byte* has been received and then it becomes empty. Yes, it would be very nice to eliminate that logspam, as you say. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'
At 2016-04-05 18:45:58 -0300, alvhe...@2ndquadrant.com wrote: > > I changed the regression test a bit more, so please recheck. Looks good, thank you. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'
OK, thanks for the clarification. Here's the earlier patch, but with the relevant added docs and tests retained. -- Abhijit >From dfb6ded15246ec65cc911864bfcff285eef1c4d4 Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen <a...@2ndquadrant.com> Date: Tue, 5 Apr 2016 11:55:09 +0530 Subject: =?UTF-8?q?Implement=20"ALTER=20=E2=80=A6=20DEPENDS=20ON=20EXTENSI?= =?UTF-8?q?ON=20=E2=80=A6"=20for=20functions=20and=20triggers?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A new dependency type, DEPENDENCY_AUTO_EXTENSION, indicates that an object depends on an extension (but doesn't belong to it, so pg_dump should continue to dump it) and should be dropped when the extension itself is dropped. A new statement type, AlterObjectDependsStmt, is used to declare such dependencies. Includes documentation and tests. --- doc/src/sgml/catalogs.sgml | 13 ++ doc/src/sgml/ref/alter_function.sgml | 20 + doc/src/sgml/ref/alter_trigger.sgml| 16 +++ src/backend/catalog/dependency.c | 2 + src/backend/catalog/objectaddress.c| 25 +++ src/backend/commands/alter.c | 32 ++ src/backend/nodes/copyfuncs.c | 17 src/backend/nodes/equalfuncs.c | 15 +++ src/backend/parser/gram.y | 36 +++- src/backend/tcop/utility.c | 27 src/include/catalog/dependency.h | 9 +++- src/include/catalog/objectaddress.h| 4 ++ src/include/commands/alter.h | 1 + src/include/nodes/nodes.h | 1 + src/include/nodes/parsenodes.h | 15 +++ src/include/parser/kwlist.h| 1 + src/test/modules/test_extensions/Makefile | 2 +- .../test_extensions/expected/test_extdepend.out| 50 ++ .../modules/test_extensions/sql/test_extdepend.sql | 32 ++ src/tools/pgindent/typedefs.list | 1 + 20 files changed, 315 insertions(+), 4 deletions(-) create mode 100644 src/test/modules/test_extensions/expected/test_extdepend.out create mode 100644 src/test/modules/test_extensions/sql/test_extdepend.sql diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index bb75229..3c128a0 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -2888,6 +2888,19 @@ + + + DEPENDENCY_AUTO_EXTENSION (x) + + + The dependent object is not a member of the extension that is the + referenced object (and so should not be ignored by pg_dump), but + cannot function without it and should be dropped when the + extension itself is. The dependent object may be dropped on its + own as well. + + + Other dependency flavors might be needed in future. diff --git a/doc/src/sgml/ref/alter_function.sgml b/doc/src/sgml/ref/alter_function.sgml index f40363e..1ef905f 100644 --- a/doc/src/sgml/ref/alter_function.sgml +++ b/doc/src/sgml/ref/alter_function.sgml @@ -29,6 +29,8 @@ ALTER FUNCTION name ( [ [ argmode ] [ argname ] argtype [, ...] ] ) SET SCHEMA new_schema +ALTER FUNCTION name ( [ [ argmode ] [ argname ] argtype [, ...] ] ) +DEPENDS ON EXTENSION extension_name where action is one of: @@ -148,6 +150,15 @@ ALTER FUNCTION name ( [ [ extension_name + + + The name of an extension that the function depends on. + + + + CALLED ON NULL INPUT RETURNS NULL ON NULL INPUT @@ -300,6 +311,15 @@ ALTER FUNCTION sqrt(integer) SET SCHEMA maths; + To mark the function sqrt for type + integer as being dependent on the extension + mathlib: + +ALTER FUNCTION sqrt(integer) DEPENDS ON EXTENSION mathlib; + + + + To adjust the search path that is automatically set for a function: ALTER FUNCTION check_password(text) SET search_path = admin, pg_temp; diff --git a/doc/src/sgml/ref/alter_trigger.sgml b/doc/src/sgml/ref/alter_trigger.sgml index c63b5df..37c4d03 100644 --- a/doc/src/sgml/ref/alter_trigger.sgml +++ b/doc/src/sgml/ref/alter_trigger.sgml @@ -22,6 +22,7 @@ PostgreSQL documentation ALTER TRIGGER name ON table_name RENAME TO new_name +ALTER TRIGGER name ON table_name DEPENDS ON EXTENSION extension_name @@ -70,6 +71,15 @@ ALTER TRIGGER name ON + + +extension_name + + + The name of an extension that the trigger depends on. + + + @@ -93,6 +103,12 @@ ALTER TRIGGER name ON ALTER TRIGGER emp_stamp ON emp RENAME TO emp_track_chgs; + + + To mark a trigger as being dependent on an extension: + +ALTER TRIGGER emp_stamp ON emp DEPENDS ON EXTENSION emplib; + diff --git a/src/backend/catalog/dependency.c b/src/backend/c
Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'
At 2016-04-05 12:33:56 +0530, a...@2ndquadrant.com wrote: > > Álvaro: I did document and test the extra types you added, but now > that I think about it a bit more, it's hard to argue that it's useful > to have a table, for example, depend on an extension. There's really > nothing about a table that "doesn't work without" an extension. I think it makes sense to implement this for triggers and functions. It may also be useful for indexes and materialised views, which can refer to functions in an extension (and in future, sequences as well). It's certainly good to know the grammar would work if we wanted to add other object types in future, but I think we should leave it at that. Thoughts? -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'
Álvaro: I did document and test the extra types you added, but now that I think about it a bit more, it's hard to argue that it's useful to have a table, for example, depend on an extension. There's really nothing about a table that "doesn't work without" an extension. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'
Hi. Here's the updated patch. It fixes a couple of small problems, and includes documentation and tests (which I placed in a new file in src/test/modules/test_extensions, on Petr's advice). I wanted to post this before I went on to attempt any more grammar cleanups. Please let me know if there's anything else you'd like to see here. Thanks again for your review and suggestions. -- Abhijit >From d4882446de50c49c6563dac6dbdd386c01f9477a Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen <a...@2ndquadrant.com> Date: Tue, 5 Apr 2016 11:55:09 +0530 Subject: =?UTF-8?q?Implement=20"ALTER=20=E2=80=A6=20DEPENDS=20ON=20EXTENSI?= =?UTF-8?q?ON=20=E2=80=A6"=20for=20various=20object=20types?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A new dependency type, DEPENDENCY_AUTO_EXTENSION, indicates that an object depends on an extension (but doesn't belong to it, so pg_dump should continue to dump it) and should be dropped when the extension itself is dropped. A new statement type, AlterObjectDependsStmt, is used to declare such dependencies. Includes documentation and tests. --- doc/src/sgml/catalogs.sgml | 13 +++ doc/src/sgml/ref/alter_function.sgml | 20 doc/src/sgml/ref/alter_operator.sgml | 18 doc/src/sgml/ref/alter_rule.sgml | 18 +++- doc/src/sgml/ref/alter_table.sgml | 25 + doc/src/sgml/ref/alter_trigger.sgml| 16 +++ doc/src/sgml/ref/alter_tsconfig.sgml | 23 - doc/src/sgml/ref/alter_type.sgml | 27 + src/backend/catalog/dependency.c | 2 + src/backend/catalog/objectaddress.c| 25 + src/backend/commands/alter.c | 32 ++ src/backend/nodes/copyfuncs.c | 17 src/backend/nodes/equalfuncs.c | 15 +++ src/backend/parser/gram.y | 86 +++- src/backend/tcop/utility.c | 27 + src/include/catalog/dependency.h | 9 +- src/include/catalog/objectaddress.h| 4 + src/include/commands/alter.h | 1 + src/include/nodes/nodes.h | 1 + src/include/nodes/parsenodes.h | 15 +++ src/include/parser/kwlist.h| 1 + src/test/modules/test_extensions/Makefile | 2 +- .../test_extensions/expected/test_extdepend.out| 110 + .../modules/test_extensions/sql/test_extdepend.sql | 56 +++ src/tools/pgindent/typedefs.list | 1 + 25 files changed, 558 insertions(+), 6 deletions(-) create mode 100644 src/test/modules/test_extensions/expected/test_extdepend.out create mode 100644 src/test/modules/test_extensions/sql/test_extdepend.sql diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index bb75229..3c128a0 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -2888,6 +2888,19 @@ + + + DEPENDENCY_AUTO_EXTENSION (x) + + + The dependent object is not a member of the extension that is the + referenced object (and so should not be ignored by pg_dump), but + cannot function without it and should be dropped when the + extension itself is. The dependent object may be dropped on its + own as well. + + + Other dependency flavors might be needed in future. diff --git a/doc/src/sgml/ref/alter_function.sgml b/doc/src/sgml/ref/alter_function.sgml index f40363e..1ef905f 100644 --- a/doc/src/sgml/ref/alter_function.sgml +++ b/doc/src/sgml/ref/alter_function.sgml @@ -29,6 +29,8 @@ ALTER FUNCTION name ( [ [ argmode ] [ argname ] argtype [, ...] ] ) SET SCHEMA new_schema +ALTER FUNCTION name ( [ [ argmode ] [ argname ] argtype [, ...] ] ) +DEPENDS ON EXTENSION extension_name where action is one of: @@ -148,6 +150,15 @@ ALTER FUNCTION name ( [ [ extension_name + + + The name of an extension that the function depends on. + + + + CALLED ON NULL INPUT RETURNS NULL ON NULL INPUT @@ -300,6 +311,15 @@ ALTER FUNCTION sqrt(integer) SET SCHEMA maths; + To mark the function sqrt for type + integer as being dependent on the extension + mathlib: + +ALTER FUNCTION sqrt(integer) DEPENDS ON EXTENSION mathlib; + + + + To adjust the search path that is automatically set for a function: ALTER FUNCTION check_password(text) SET search_path = admin, pg_temp; diff --git a/doc/src/sgml/ref/alter_operator.sgml b/doc/src/sgml/ref/alter_operator.sgml index b2eaa7a..456243f 100644 --- a/doc/src/sgml/ref/alter_operator.sgml +++ b/doc/src/sgml/ref/alter_operator.sgml @@ -28,6 +28,9 @@ ALTER OPERATOR name ( { left_typenew_schema ALTER OPERATOR name ( { left_type | NO
Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'
At 2016-04-04 18:55:03 -0300, alvhe...@2ndquadrant.com wrote: > > At this point I think we're missing user-level docs for the additional > clause in each ALTER command. Thanks for having a look. Now that you're happy with the grammar, I'll write the remaining docs and resubmit the patch later today. > Do you have any ideas on how this would appear in regression tests? No specific ideas. At a high level, I think we should install an empty extension, create one of each kind of object, alter them to depend on the extension, check that pg_depend has the right 'x' rows, then drop the extension and make sure the objects have gone away. Does that sound reasonable? Suggestions welcome. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for N synchronous standby servers - take 2
At 2016-04-04 17:28:07 +0900, masao.fu...@gmail.com wrote: > > Barring any objections, I'll commit this patch. No objections, just a minor wording tweak: doc/src/sgml/config.sgml: "The synchronous standbys will be the standbys that their names appear early in this list" should be "The synchronous standbys will be those whose names appear earlier in this list". doc/src/sgml/high-availability.sgml: "The standbys that their names appear early in this list are given higher priority and will be considered as synchronous" should be "The standbys whose names appear earlier in the list are given higher priority and will be considered as synchronous". "The standbys that their names appear early in the list will be used as the synchronous standby" should be "The standbys whose names appear earlier in the list will be used as synchronous standbys". You may prefer to reword this in some other way, but the current "that their names appear" wording should be changed. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'
At 2016-03-29 10:15:51 -0400, da...@pgmasters.net wrote: > > Either way it looks like you need to post a patch with more > documentation - do you know when you'll have that ready? Here it is. (I was actually looking for other potential callers, but I couldn't find any. There are some places that take a RangeVar and make a list from it, but they are creating new nodes, and don't quite fit. So the only change in this patch is to add a comment above the get_object_address_rv function.) Álvaro, do you like this one any better? -- Abhijit diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 951f59b..189b771 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -2864,6 +2864,19 @@ + + + DEPENDENCY_AUTO_EXTENSION (x) + + + The dependent object is not a member of the extension that is the + referenced object (and so should not be ignored by pg_dump), but + cannot function without it and should be dropped when the + extension itself is. The dependent object may be dropped on its + own as well. + + + Other dependency flavors might be needed in future. diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c index c48e37b..a284bed 100644 --- a/src/backend/catalog/dependency.c +++ b/src/backend/catalog/dependency.c @@ -587,6 +587,7 @@ findDependentObjects(const ObjectAddress *object, { case DEPENDENCY_NORMAL: case DEPENDENCY_AUTO: + case DEPENDENCY_AUTO_EXTENSION: /* no problem */ break; case DEPENDENCY_INTERNAL: @@ -786,6 +787,7 @@ findDependentObjects(const ObjectAddress *object, subflags = DEPFLAG_NORMAL; break; case DEPENDENCY_AUTO: + case DEPENDENCY_AUTO_EXTENSION: subflags = DEPFLAG_AUTO; break; case DEPENDENCY_INTERNAL: diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c index d2aaa6d..39128c1 100644 --- a/src/backend/catalog/objectaddress.c +++ b/src/backend/catalog/objectaddress.c @@ -999,6 +999,32 @@ get_object_address(ObjectType objtype, List *objname, List *objargs, } /* + * Return an ObjectAddress based on a RangeVar and an obect name. The + * name of the relation identified by the RangeVar is prepended to the + * list passed in as objname. This is useful to find the ObjectAddress + * of objects that depend on a relation. All other considerations are + * exactly as for get_object_address above. + */ + +ObjectAddress +get_object_address_rv(ObjectType objtype, RangeVar *rel, List *objname, + List *objargs, Relation *relp, LOCKMODE lockmode, + bool missing_ok) +{ + if (rel) + { + objname = lcons(makeString(rel->relname), objname); + if (rel->schemaname) + objname = lcons(makeString(rel->schemaname), objname); + if (rel->catalogname) + objname = lcons(makeString(rel->catalogname), objname); + } + + return get_object_address(objtype, objname, objargs, + relp, lockmode, missing_ok); +} + +/* * Find an ObjectAddress for a type of object that is identified by an * unqualified name. */ diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c index 5af0f2f..dd0b4c9f 100644 --- a/src/backend/commands/alter.c +++ b/src/backend/commands/alter.c @@ -23,6 +23,7 @@ #include "catalog/pg_collation.h" #include "catalog/pg_conversion.h" #include "catalog/pg_event_trigger.h" +#include "catalog/pg_extension.h" #include "catalog/pg_foreign_data_wrapper.h" #include "catalog/pg_foreign_server.h" #include "catalog/pg_language.h" @@ -32,6 +33,7 @@ #include "catalog/pg_opclass.h" #include "catalog/pg_opfamily.h" #include "catalog/pg_proc.h" +#include "catalog/pg_trigger.h" #include "catalog/pg_ts_config.h" #include "catalog/pg_ts_dict.h" #include "catalog/pg_ts_parser.h" @@ -391,6 +393,32 @@ ExecRenameStmt(RenameStmt *stmt) } /* + * Executes an ALTER OBJECT / DEPENDS ON EXTENSION statement. + * + * Return value is the address of the altered object. + */ +ObjectAddress +ExecAlterObjectDependsStmt(AlterObjectDependsStmt *stmt) +{ + ObjectAddress address; + ObjectAddress extAddr; + Relation rel = NULL; + + address = get_object_address_rv(stmt->objectType, stmt->relation, stmt->objname, + stmt->objargs, , AccessExclusiveLock, false); + + if (rel) + heap_close(rel, NoLock); + + extAddr = get_object_address(OBJECT_EXTENSION, stmt->extname, NULL, + , AccessExclusiveLock, false); + + recordDependencyOn(, , DEPENDENCY_AUTO_EXTENSION); + + return address; +} + +/* * Executes an ALTER OBJECT / SET SCHEMA statement. Based on the object * type, the function appropriate to that type is executed. * diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c index 6b5d1d6..ecc260d 100644 --- a/src/backend/nodes/copyfuncs.c +++ b/src/backend/nodes/copyfuncs.c @@ -3203,6 +3203,20 @@ _copyRenameStmt(const RenameStmt *from) return newnode; } +static AlterObjectDependsStmt *
Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'
At 2016-03-24 22:48:51 +0530, a...@2ndquadrant.com wrote: > > > I think I would like to see code implement both alternatives to see > > which one is least ugly. Maybe a third idea will manifest itself > > upon seeing those. > > Here's the first one. ExecAlterObjectDependsStmt() looks like this: Here's the second one, which is only slightly different from the first. ExecAlterObjectDependsStmt() now looks like this: +ObjectAddress +ExecAlterObjectDependsStmt(AlterObjectDependsStmt *stmt) +{ +ObjectAddress address; +ObjectAddress extAddr; +Relationrel = NULL; + +address = get_object_address_rv(stmt->objectType, stmt->relation, stmt->objname, +stmt->objargs, , AccessExclusiveLock, false); + +if (rel) +heap_close(rel, NoLock); + +extAddr = get_object_address(OBJECT_EXTENSION, stmt->extname, NULL, + , AccessExclusiveLock, false); + +recordDependencyOn(, , DEPENDENCY_AUTO_EXTENSION); + +return address; +} And the new get_object_address_rv() looks like this: +ObjectAddress +get_object_address_rv(ObjectType objtype, RangeVar *rel, List *objname, + List *objargs, Relation *relp, LOCKMODE lockmode, + bool missing_ok) +{ +if (rel) +{ +objname = lcons(makeString(rel->relname), objname); +if (rel->schemaname) +objname = lcons(makeString(rel->schemaname), objname); +if (rel->catalogname) +objname = lcons(makeString(rel->catalogname), objname); +} + +return get_object_address(objtype, objname, objargs, + relp, lockmode, missing_ok); +} Complete patch attached for reference, as before. (I know I haven't documented the function. I will go through the code to see if there are any other potential callers, but I wanted to share what I had already.) -- Abhijit diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 951f59b..189b771 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -2864,6 +2864,19 @@ + + + DEPENDENCY_AUTO_EXTENSION (x) + + + The dependent object is not a member of the extension that is the + referenced object (and so should not be ignored by pg_dump), but + cannot function without it and should be dropped when the + extension itself is. The dependent object may be dropped on its + own as well. + + + Other dependency flavors might be needed in future. diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c index c48e37b..a284bed 100644 --- a/src/backend/catalog/dependency.c +++ b/src/backend/catalog/dependency.c @@ -587,6 +587,7 @@ findDependentObjects(const ObjectAddress *object, { case DEPENDENCY_NORMAL: case DEPENDENCY_AUTO: + case DEPENDENCY_AUTO_EXTENSION: /* no problem */ break; case DEPENDENCY_INTERNAL: @@ -786,6 +787,7 @@ findDependentObjects(const ObjectAddress *object, subflags = DEPFLAG_NORMAL; break; case DEPENDENCY_AUTO: + case DEPENDENCY_AUTO_EXTENSION: subflags = DEPFLAG_AUTO; break; case DEPENDENCY_INTERNAL: diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c index d2aaa6d..137f94d 100644 --- a/src/backend/catalog/objectaddress.c +++ b/src/backend/catalog/objectaddress.c @@ -998,6 +998,24 @@ get_object_address(ObjectType objtype, List *objname, List *objargs, return address; } +ObjectAddress +get_object_address_rv(ObjectType objtype, RangeVar *rel, List *objname, + List *objargs, Relation *relp, LOCKMODE lockmode, + bool missing_ok) +{ + if (rel) + { + objname = lcons(makeString(rel->relname), objname); + if (rel->schemaname) + objname = lcons(makeString(rel->schemaname), objname); + if (rel->catalogname) + objname = lcons(makeString(rel->catalogname), objname); + } + + return get_object_address(objtype, objname, objargs, + relp, lockmode, missing_ok); +} + /* * Find an ObjectAddress for a type of object that is identified by an * unqualified name. diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c index 5af0f2f..dd0b4c9f 100644 --- a/src/backend/commands/alter.c +++ b/src/backend/commands/alter.c @@ -23,6 +23,7 @@ #include "catalog/pg_collation.h" #include "catalog/pg_conversion.h" #include "catalog/pg_event_trigger.h" +#include "catalog/pg_extension.h" #include "catalog/pg_foreign_data_wrapper.h" #include "catalog/pg_foreign_server.h" #include "catalog/pg_language.h" @@ -32,6 +33,7 @@ #include "catalog/pg_opclass.h" #include "catalog/pg_opfamily.h" #include "catalog/pg_proc.h" +#include "catalog/pg_trigger.h" #include "catalog/pg_ts_config.h" #include "catalog/pg_ts_dict.h" #include "catalog/pg_ts_parser.h" @@ -391,6 +393,32 @@ ExecRenameStmt(RenameStmt *stmt) } /* + * Executes an ALTER OBJECT / DEPENDS ON
Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'
At 2016-03-24 12:31:16 -0300, alvhe...@2ndquadrant.com wrote: > > In other words I think the conclusion here is that we must use > qualified_name in the new production rather than switching the old > production to any_name. Makes sense. > I think I would like to see code implement both alternatives to see > which one is least ugly. Maybe a third idea will manifest itself upon > seeing those. Here's the first one. ExecAlterObjectDependsStmt() looks like this: +ObjectAddress +ExecAlterObjectDependsStmt(AlterObjectDependsStmt *stmt) +{ +ObjectAddress address; +ObjectAddress extAddr; +Relationrel = NULL; + +/* + * If the parser handed us a RangeVar, we add the relation's name to + * stmt->objname so that we can pass it to get_object_address(). + */ +if (stmt->relation) +{ +stmt->objname = lcons(makeString(stmt->relation->relname), stmt->objname); +if (stmt->relation->schemaname) +stmt->objname = lcons(makeString(stmt->relation->schemaname), stmt->objname); +if (stmt->relation->catalogname) +stmt->objname = lcons(makeString(stmt->relation->catalogname), stmt->objname); +} + +address = get_object_address(stmt->objectType, stmt->objname, stmt->objargs, + , AccessExclusiveLock, false); + +if (rel) +heap_close(rel, NoLock); + +extAddr = get_object_address(OBJECT_EXTENSION, stmt->extname, NULL, + , AccessExclusiveLock, false); + +recordDependencyOn(, , DEPENDENCY_AUTO_EXTENSION); + +return address; +} (This works fine for both functions and triggers, I tested it.) Complete patch attached for reference. I'll post the get_object_address_rv() variant tomorrow, but comments are welcome in the meantime. -- Abhijit diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 951f59b..189b771 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -2864,6 +2864,19 @@ + + + DEPENDENCY_AUTO_EXTENSION (x) + + + The dependent object is not a member of the extension that is the + referenced object (and so should not be ignored by pg_dump), but + cannot function without it and should be dropped when the + extension itself is. The dependent object may be dropped on its + own as well. + + + Other dependency flavors might be needed in future. diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c index c48e37b..a284bed 100644 --- a/src/backend/catalog/dependency.c +++ b/src/backend/catalog/dependency.c @@ -587,6 +587,7 @@ findDependentObjects(const ObjectAddress *object, { case DEPENDENCY_NORMAL: case DEPENDENCY_AUTO: + case DEPENDENCY_AUTO_EXTENSION: /* no problem */ break; case DEPENDENCY_INTERNAL: @@ -786,6 +787,7 @@ findDependentObjects(const ObjectAddress *object, subflags = DEPFLAG_NORMAL; break; case DEPENDENCY_AUTO: + case DEPENDENCY_AUTO_EXTENSION: subflags = DEPFLAG_AUTO; break; case DEPENDENCY_INTERNAL: diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c index 5af0f2f..339a313 100644 --- a/src/backend/commands/alter.c +++ b/src/backend/commands/alter.c @@ -23,6 +23,7 @@ #include "catalog/pg_collation.h" #include "catalog/pg_conversion.h" #include "catalog/pg_event_trigger.h" +#include "catalog/pg_extension.h" #include "catalog/pg_foreign_data_wrapper.h" #include "catalog/pg_foreign_server.h" #include "catalog/pg_language.h" @@ -32,6 +33,7 @@ #include "catalog/pg_opclass.h" #include "catalog/pg_opfamily.h" #include "catalog/pg_proc.h" +#include "catalog/pg_trigger.h" #include "catalog/pg_ts_config.h" #include "catalog/pg_ts_dict.h" #include "catalog/pg_ts_parser.h" @@ -391,6 +393,45 @@ ExecRenameStmt(RenameStmt *stmt) } /* + * Executes an ALTER OBJECT / DEPENDS ON EXTENSION statement. + * + * Return value is the address of the altered object. + */ +ObjectAddress +ExecAlterObjectDependsStmt(AlterObjectDependsStmt *stmt) +{ + ObjectAddress address; + ObjectAddress extAddr; + Relation rel = NULL; + + /* + * If the parser handed us a RangeVar, we add the relation's name to + * stmt->objname so that we can pass it to get_object_address(). + */ + if (stmt->relation) + { + stmt->objname = lcons(makeString(stmt->relation->relname), stmt->objname); + if (stmt->relation->schemaname) + stmt->objname = lcons(makeString(stmt->relation->schemaname), stmt->objname); + if (stmt->relation->catalogname) + stmt->objname = lcons(makeString(stmt->relation->catalogname), stmt->objname); + } + + address = get_object_address(stmt->objectType, stmt->objname, stmt->objargs, + , AccessExclusiveLock, false); + + if (rel) + heap_close(rel, NoLock); + + extAddr = get_object_address(OBJECT_EXTENSION, stmt->extname, NULL, + , AccessExclusiveLock, false); + + recordDependencyOn(, ,
Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'
Hi. I implemented "ALTER FUNCTION … DEPENDS ON EXTENSION" using a new node (AlterObjectDependsStmt), and tried to add "ALTER TRIGGER … DEPENDS ON EXTENSION" (partly because I wanted to make sure the code could support multiple object types, partly because it's convenient in this particular use case). Then I ran into a problem, which I could use some help with. The main ExecAlterObjectDependsStmt() function is very simple: +ObjectAddress +ExecAlterObjectDependsStmt(AlterObjectDependsStmt *stmt) +{ + ObjectAddress address; + ObjectAddress extAddr; + + address = get_object_address(stmt->objectType, stmt->objname, stmt->objargs, +NULL, AccessExclusiveLock, false); + extAddr = get_object_address(OBJECT_EXTENSION, stmt->extname, NULL, +NULL, AccessExclusiveLock, false); + + recordDependencyOn(, , DEPENDENCY_AUTO_EXTENSION); + + return address; +} All that's needed is to construct the node based on variants of the ALTER command: +AlterObjectDependsStmt: +ALTER FUNCTION function_with_argtypes DEPENDS ON EXTENSION name +{ +AlterObjectDependsStmt *n = makeNode(AlterObjectDependsStmt); +n->objectType = OBJECT_FUNCTION; +n->objname = $3->funcname; +n->objargs = $3->funcargs; +n->extname = list_make1(makeString($7)); +$$ = (Node *)n; +} +| ALTER TRIGGER name ON any_name DEPENDS ON EXTENSION name +{ +AlterObjectDependsStmt *n = makeNode(AlterObjectDependsStmt); +n->objectType = OBJECT_TRIGGER; +n->objname = list_make1(lappend($5, makeString($3))); +n->objargs = NIL; +n->extname = list_make1(makeString($9)); +$$ = (Node *)n; +} +; Now, the first part of this works fine. But with the second part, I get a reduce/reduce conflict if I use any_name. Here's an excerpt from the verbose bison output: State 2920 1181 AlterObjectDependsStmt: ALTER TRIGGER name ON any_name DEPENDS . ON EXTENSION name ON shift, and go to state 3506 State 2921 1165 RenameStmt: ALTER TRIGGER name ON qualified_name RENAME . TO name TO shift, and go to state 3507 State 2922 829 attrs: '.' . attr_name 2006 indirection_el: '.' . attr_name 2007 | '.' . '*' … attr_name go to state 3508 ColLabelgo to state 1937 unreserved_keyword go to state 1938 col_name_keywordgo to state 1939 type_func_name_keyword go to state 1940 reserved_keywordgo to state 1941 State 3506 1181 AlterObjectDependsStmt: ALTER TRIGGER name ON any_name DEPENDS ON . EXTENSION name EXTENSION shift, and go to state 4000 State 3507 1165 RenameStmt: ALTER TRIGGER name ON qualified_name RENAME TO . name … namego to state 4001 ColId go to state 543 unreserved_keyword go to state 544 col_name_keywordgo to state 545 State 3508 829 attrs: '.' attr_name . 2006 indirection_el: '.' attr_name . RENAMEreduce using rule 2006 (indirection_el) '[' reduce using rule 2006 (indirection_el) '.' reduce using rule 829 (attrs) '.' [reduce using rule 2006 (indirection_el)] $default reduce using rule 829 (attrs) I can see that the problem is that it can reduce '.' in two ways, but I'm afraid I don't remember enough about yacc to know how to fix it. If anyone has suggestions, I would be grateful. Meanwhile, I tried using qualified_name instead of any_name, which does work without conflicts, but I end up with a RangeVar instead of the sort of List* that I can feed to get_object_address. I could change ExecAlterObjectDependsStmt() to check if stmt->relation is set, and if so, convert the RangeVar back to a List* in the right format before passing it to get_object_address. That would work fine, but it doesn't seem like a good solution. I could write some get_object_address_rv() wrapper that does essentially the same conversion, but that's only slightly better. Are there any other options? (Apart from the above, the rest of the patch is really just boilerplate for the new node type, but I've attached it anyway for reference.) -- Abhijit diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 951f59b..189b771 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -2864,6 +2864,19 @@ + + + DEPENDENCY_AUTO_EXTENSION (x) + + + The dependent object is not a member of the extension that is the + referenced object (and so should not be ignored by pg_dump), but + cannot function without it and should be dropped when the + extension itself is. The dependent object may be
Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'
At 2016-03-21 15:43:09 -0400, robertmh...@gmail.com wrote: > > I also think we should allow a function to depend on multiple > extensions, as Alvaro mentions downthread. I'm working on an updated patch, will post shortly. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'
At 2016-03-21 12:04:40 +0530, a...@2ndquadrant.com wrote: > > I'll write up a patch for this. Thanks for the suggestions. Here's a patch to implement ALTER FUNCTION x DEPENDS ON EXTENSION y. The changes to functioncmds.c:AlterFunction were less intrusive than I had originally feared. -- Abhijit >From 722f23b75369f035f094753e47663c50580b052c Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen <a...@2ndquadrant.com> Date: Tue, 1 Mar 2016 06:44:28 +0530 Subject: =?UTF-8?q?Add=20DEPENDENCY=5FAUTO=5FEXTENSION/ALTER=20FUNCTION=20?= =?UTF-8?q?=E2=80=A6=20DEPENDS=20ON=20EXTENSION=20=E2=80=A6?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The new dependency type is used for functions that require an extension to operate, but do not actually belong to the extension per se and, in particular, should not be ignored by pg_dump. The new command allows a function to be marked as depending on an extension in this way. --- doc/src/sgml/catalogs.sgml | 13 + src/backend/catalog/dependency.c| 2 ++ src/backend/commands/functioncmds.c | 31 --- src/backend/parser/gram.y | 7 ++- src/include/catalog/dependency.h| 9 - src/include/parser/kwlist.h | 1 + 6 files changed, 58 insertions(+), 5 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 951f59b..189b771 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -2864,6 +2864,19 @@ + + + DEPENDENCY_AUTO_EXTENSION (x) + + + The dependent object is not a member of the extension that is the + referenced object (and so should not be ignored by pg_dump), but + cannot function without it and should be dropped when the + extension itself is. The dependent object may be dropped on its + own as well. + + + Other dependency flavors might be needed in future. diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c index c48e37b..a284bed 100644 --- a/src/backend/catalog/dependency.c +++ b/src/backend/catalog/dependency.c @@ -587,6 +587,7 @@ findDependentObjects(const ObjectAddress *object, { case DEPENDENCY_NORMAL: case DEPENDENCY_AUTO: + case DEPENDENCY_AUTO_EXTENSION: /* no problem */ break; case DEPENDENCY_INTERNAL: @@ -786,6 +787,7 @@ findDependentObjects(const ObjectAddress *object, subflags = DEPFLAG_NORMAL; break; case DEPENDENCY_AUTO: + case DEPENDENCY_AUTO_EXTENSION: subflags = DEPFLAG_AUTO; break; case DEPENDENCY_INTERNAL: diff --git a/src/backend/commands/functioncmds.c b/src/backend/commands/functioncmds.c index a745d73..b0f0f88 100644 --- a/src/backend/commands/functioncmds.c +++ b/src/backend/commands/functioncmds.c @@ -41,6 +41,7 @@ #include "catalog/objectaccess.h" #include "catalog/pg_aggregate.h" #include "catalog/pg_cast.h" +#include "catalog/pg_extension.h" #include "catalog/pg_language.h" #include "catalog/pg_namespace.h" #include "catalog/pg_proc.h" @@ -50,6 +51,7 @@ #include "catalog/pg_type_fn.h" #include "commands/alter.h" #include "commands/defrem.h" +#include "commands/extension.h" #include "commands/proclang.h" #include "miscadmin.h" #include "optimizer/var.h" @@ -466,7 +468,8 @@ compute_common_attribute(DefElem *defel, List **set_items, DefElem **cost_item, DefElem **rows_item, - DefElem **parallel_item) + DefElem **parallel_item, + DefElem **extdepend_item) { if (strcmp(defel->defname, "volatility") == 0) { @@ -521,6 +524,13 @@ compute_common_attribute(DefElem *defel, *parallel_item = defel; } + else if (strcmp(defel->defname, "extdepend") == 0) + { + if (*extdepend_item) + goto duplicate_error; + + *extdepend_item = defel; + } else return false; @@ -682,7 +692,8 @@ compute_attributes_sql_style(List *options, _items, _item, _item, - _item)) + _item, + NULL)) { /* recognized common option */ continue; @@ -1179,6 +1190,7 @@ AlterFunction(AlterFunctionStmt *stmt) DefElem*cost_item = NULL; DefElem*rows_item = NULL; DefElem*parallel_item = NULL; + DefElem*extdepend_item = NULL; ObjectAddress address; rel = heap_open(ProcedureRelationId, RowExclusiveLock); @@ -1217,7 +1229,8 @@ AlterFunction(AlterFunctionStmt *stmt) _items, _item, _item, - _item) == false) + _item, + _item) == false) elog(ERROR, "option \"%s\" not recognized", defel->defname); } @@ -1303,6 +1316,18 @@ AlterFunction(AlterFunctionStmt *stmt) heap_close(rel, NoLock); heap_fr
Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'
At 2016-03-21 13:04:33 +0300, a.korot...@postgrespro.ru wrote: > > I'm not sure why we want to make new dependency type by ALTER FUNCTION > command, not ALTER EXTENSION? It's a matter of semantics. It means something very different than what an 'e' dependency means. The extension doesn't own the function (and so pg_dump shouldn't ignore it), but the function depends on the extension (and so dropping the extension should drop it). > The argument could be that 'x' dependency type would be used for other > objects not extensions. I suppose this is possible, but yes, I agree with you that it's not clear how or why this would be useful. > So, I would prefer this patch to extend ALTER EXTENSION command while > it's aimed to this particular problem. OK, so that's what the patch does, and it's certainly the simplest approach for reasons discussed earlier (though perhaps not as clear semantically as the ALTER FUNCTION approach). But: > I even think we could extend existent grammar rule > > | ALTER EXTENSION name add_drop FUNCTION function_with_argtypes > > *** AlterExtensionContentsStmt: > > *** 3982,3987 > > --- 3987,3993 > > n->objtype = OBJECT_FUNCTION; > > n->objname = $6->funcname; > > n->objargs = $6->funcargs; > > + n->deptype = 'e'; > > $$ = (Node *)n; > > } How exactly do you propose to do this, i.e., what would the final command to declare an 'x' dependency look like? -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'
At 2016-03-19 17:46:25 -0300, alvhe...@2ndquadrant.com wrote: > > I don't think the first patch is acceptable standalone -- we need both > things together. OK. > But in reality, pg_depend handling is mixed up with other changes all > over the place. Yes, I noticed that. :-/ > Anyway I think this should be something along the lines of > ALTER FUNCTION foo() DEPENDS ON EXTENSION bar; OK. That's reasonable. > ALTER FUNCTION foo() OWNED BY EXTENSION bar; If the function is really OWNED BY EXTENSION, then the right way to declare it would seem to be ALTER EXTENSION … ADD FUNCTION. I prefer DEPENDS ON EXTENSION for this reason, there's no ambiguity about what we're declaring. > Another argument to focus only on extensions is that pg_dump knows > specifically about extensions for supressing objects to dump, and we > don't have any other object type doing the same kind of thing; so > perhaps extensions-only is fine. That's the argument that motivates this particular patch. I think if we have a DEPENDS ON EXTENSION framework, it (a) addresses the immediate need, and (b) gives us a straightforward way to add DEPENDS ON in future when we find some need for it. I'll write up a patch for this. Thanks for the suggestions. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_rewind just doesn't fsync *anything*?
At 2016-03-10 08:35:43 +0100, michael.paqu...@gmail.com wrote: > > > I guess the easiest fix would be to shell out to initdb -s? > > What do you mean? I am not sure what initdb has to do with that as we > have no need for it in pg_rewind. initdb -S/--sync-only fsyncs everything in the data directory and exits. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'
At 2016-02-29 19:56:07 -0600, jim.na...@bluetreble.com wrote: > > I don't see why this would be limited to just functions. […] Am I > missing something? No, you are not missing anything. The specific problem I was trying to solve involved a function, so I sketched out a solution for functions. Once we have some consensus on whether that's an acceptable approach, I'll extend the patch in whatever way we agree seems appropriate. > Maybe the better way to handle this would be through ALTER EXTENSION? That's what this (second) patch does. > Given the audience for this, I think it'd probably be OK to just > provide a function that does this, instead of DDL. That seems like a promising idea. Can you suggest some possible usage? Thanks. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'
At 2016-01-18 11:08:19 +0530, a...@2ndquadrant.com wrote: > > I'm proposing to address a part of that problem by allowing extension > dependencies to be explicitly declared for functions and objects > created either by a user or dynamically by the extension itself—things > that need the extension to function, but aren't a part of it. I didn't hear any further suggestions, so here's a patch for discussion. 1. This adds the 'x'/DEPENDENCY_AUTO_EXTENSION type. 2. This adds an 'ALTER FUNCTION … ADD DEPENDENT FUNCTION …' command. I split up the two because we may want the new dependency type without going to the trouble of adding a new command. Maybe extension authors should just insert an 'x' row into pg_depend directly? I was inclined to implement it using ALTER FUNCTION, but AlterFunction() is focused on altering the pg_proc entry for a function, so the new code didn't fit. Ultimately, ExecAlterExtensionContentsStmt() was the closest match, so that's where I did it. Comments welcome. I'll add this patch to the CF. -- Abhijit >From 9835f0990a015431393d608c8710d9effe301c9d Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen <a...@2ndquadrant.com> Date: Tue, 1 Mar 2016 06:44:28 +0530 Subject: Add a DEPENDENCY_AUTO_EXTENSION dependency type This is useful for functions that require the extension to operate, but do not belong to the extension per se and, in particular, should not be ignored by pg_dump. --- doc/src/sgml/catalogs.sgml | 13 + src/backend/catalog/dependency.c | 2 ++ src/include/catalog/dependency.h | 9 - 3 files changed, 23 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 951f59b..189b771 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -2864,6 +2864,19 @@ + + + DEPENDENCY_AUTO_EXTENSION (x) + + + The dependent object is not a member of the extension that is the + referenced object (and so should not be ignored by pg_dump), but + cannot function without it and should be dropped when the + extension itself is. The dependent object may be dropped on its + own as well. + + + Other dependency flavors might be needed in future. diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c index c48e37b..a284bed 100644 --- a/src/backend/catalog/dependency.c +++ b/src/backend/catalog/dependency.c @@ -587,6 +587,7 @@ findDependentObjects(const ObjectAddress *object, { case DEPENDENCY_NORMAL: case DEPENDENCY_AUTO: + case DEPENDENCY_AUTO_EXTENSION: /* no problem */ break; case DEPENDENCY_INTERNAL: @@ -786,6 +787,7 @@ findDependentObjects(const ObjectAddress *object, subflags = DEPFLAG_NORMAL; break; case DEPENDENCY_AUTO: + case DEPENDENCY_AUTO_EXTENSION: subflags = DEPFLAG_AUTO; break; case DEPENDENCY_INTERNAL: diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h index 049bf9f..380f74a 100644 --- a/src/include/catalog/dependency.h +++ b/src/include/catalog/dependency.h @@ -61,6 +61,12 @@ * created only during initdb. The fields for the dependent object * contain zeroes. * + * DEPENDENCY_AUTO_EXTENSION ('x'): the dependent object is not a member + * of the extension that is the referenced object (and so should not be + * ignored by pg_dump), but cannot function without it and should be + * dropped when the extension itself is. The dependent object may be + * dropped on its own as well. + * * Other dependency flavors may be needed in future. */ @@ -70,7 +76,8 @@ typedef enum DependencyType DEPENDENCY_AUTO = 'a', DEPENDENCY_INTERNAL = 'i', DEPENDENCY_EXTENSION = 'e', - DEPENDENCY_PIN = 'p' + DEPENDENCY_PIN = 'p', + DEPENDENCY_AUTO_EXTENSION = 'x' } DependencyType; /* -- 2.1.4 >From f06f59e8f7f25406510178fbf62b59dcfd59428c Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen <a...@2ndquadrant.com> Date: Tue, 1 Mar 2016 06:46:11 +0530 Subject: =?UTF-8?q?Add=20experimental=20'ALTER=20EXTENSION=20=E2=80=A6=20A?= =?UTF-8?q?DD=20DEPENDENT=20FUNCTION=20=E2=80=A6'=20command?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This uses the new 'x' dependency type. --- src/backend/commands/extension.c | 7 +-- src/backend/nodes/copyfuncs.c| 1 + src/backend/nodes/equalfuncs.c | 1 + src/backend/parser/gram.y| 39 ++- src/include/nodes/parsenodes.h | 1 + src/include/parser/kwlist.h | 1 + 6 files changed, 47 insertions(+), 3 deletions(-) diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c index 9d84b79..c33dca6 100644 --- a/src/backend/commands/extension.c +++ b/src/backend/commands/extension.c @@ -2985,6 +2985,7 @@ ExecAlterExtensionContentsStmt(AlterExtensionContentsStmt *stmt, ObjectAddress object; Relati
Re: [HACKERS] Writing new unit tests with PostgresNode
At 2016-02-22 07:45:37 +, e...@xs4all.nl wrote: > > I think what's needed is: > > use 5.008_008; That is meant to fail if the code is run on a version of Perl older than 5.8.8, not fail if the code uses language features not present in 5.8.8. It is little help for those trying to maintain backwards compatibility. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'
At 2016-01-16 12:18:53 -0500, robertmh...@gmail.com wrote: > > This seems like one manifestation of the more general problem that we > don't have any real idea what objects a function definition depends > on. Yes. I'm proposing to address a part of that problem by allowing extension dependencies to be explicitly declared for functions and objects created either by a user or dynamically by the extension itself—things that need the extension to function, but aren't a part of it. Put that way, ALTER EXTENSION doesn't sound like the way to do it. Maybe ALTER FUNCTION … DEPENDS ON EXTENSION …? I don't particularly care how the dependency is recorded, it's the dependency type that's important. I'll post a patch along those lines in a bit, just so we have something concrete to discuss; meanwhile, suggestions for another syntax to record the dependency are welcome. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'
Right, here's another try. The extension does trigger-based DML auditing. You install it using CREATE EXTENSION and then call one of its functions to enable auditing for a particular table. That function will create a customised trigger function based on the table's columns and a trigger that uses it: CREATE FUNCTION fn_audit_$table_name() RETURNS TRIGGER … CREATE TRIGGER … ON $table_name … EXECUTE fn_audit_$table_name; All that works fine (with pg_dump too). But if you drop the extension, the triggers stop working because the trigger function calls functions in the extension that are now gone. To mitigate this problem, the extension actually does: CREATE FUNCTION fn_audit… ALTER EXTENSION … ADD FUNCTION fn_audit… Now the trigger depends on the trigger function (as before), and the trigger function depends on the extension, so you can't inadvertently break the system by dropping the extension. But now pg_dump has a problem: it'll dump the trigger definitions, but not the trigger functions (because of their new 'e' dependency on the extension). So if you restore, you get the extension and the triggers, but the trigger functions are gone, and things break. *This* is the problem I'm trying to solve. Sorry, my earlier explanation was not clear, because I didn't fully understand the problem and what the extension was doing. One possible solution is to make the trigger function depend on the extension with a dependency type that isn't 'e', and therefore doesn't prevent pg_dump from including the function in its output. We would need some way to record the dependency, but no changes to pg_dump would be needed. Thoughts? -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'
I'm sorry, it wasn't clear from my earlier post that the triggers are dependent on a function provided by the extension. So when you do CREATE EXTENSION foo, it creates foo_somefunc() that RETURNS TRIGGER. Later, a trigger is created (somehow; in this case it is by some other function in the extension, but it could be by the user directly as well) that executes this function. But that's only a partial answer to the questions raised here, and I'll return to the drawing board and draw up a more complete explanation. Thanks for reading. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] dealing with extension dependencies that aren't quite 'e'
Hi. I'm looking at an extension that creates some triggers (on user tables) dynamically (i.e., not during CREATE EXTENSION, but afterwards). The author has two problems with it: * «DROP EXTENSION ext» won't work without adding CASCADE, which is an (admittedly relatively minor) inconvenience to users. * More importantly, pg_dump will dump all those trigger definitions, which is inappropriate in this case because the extension will try to create them. As an experiment, I implemented "ALTER EXTENSION … ADD TRIGGER … ON …" (a few-line patch to gram.y) and taught pg_dump.c's getTriggers() to skip triggers with any 'e' dependency. That works, in the sense that DROP EXTENSION will drop the triggers (but the triggers can't be dropped on their own any more), and pg_dump will ignore them. I'm trying to find a more generally useful mechanism that addresses this problem and has a chance of being accepted upstream. Rather than overloading 'e', we could introduce a new dependency type that references an extension, but means that the dependent object should be dropped when the extension is, but it can also be dropped on its own, and pg_dump should ignore it. That would work for this case, and I can imagine it *may* be useful for other types of objects (e.g., sequences that depend on a seqam function provided by an extension, indexes that depend on index functions provided by an extension). But that leaves open the question of how exactly to record the dependency: ALTER EXTENSION … ADD … is the easiest way, but it doesn't feel right to introduce object-type-specific logic there to determine the dependency type to use. Besides, I'm not entirely comfortable with tying pg_dump behaviour so closely with the dependency, though there's some sort of precedent there with deptype = 'i'. In short, I can see some scope for improvement, but not clearly enough to make a concrete proposal. I'm hoping for advice or suggestions with a view towards submitting something to the next commitfest (2016-03). Thanks. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] introduce XLogLockBlockRangeForCleanup()
either skip it until we reach the RBM_ZERO_AND_CLEANUP_LOCK part, or add a new branch before the «if (!isExtend)» one. In summary, we're either adding one new branch that handles everything related to RBM_CACHE_ONLY and side-stepping some things on the way that happen to not apply (isLocalBuf, isExtend, TRACE_*), or we're adding a bunch of tests to ReadBuffer_common. But splitting BufferAlloc into BufferAlloc_shared and the rest of BufferAlloc doesn't look especially nice either. If you look at the shared_buffers lookup from BufferAlloc: /* see if the block is in the buffer pool already */ LWLockAcquire(newPartitionLock, LW_SHARED); buf_id = BufTableLookup(newTag, newHash); if (buf_id = 0) { /* * Found it. Now, pin the buffer so no one can steal it from the * buffer pool, and check to see if the correct data has been loaded * into the buffer. */ buf = GetBufferDescriptor(buf_id); valid = PinBuffer(buf, strategy); /* Can release the mapping lock as soon as we've pinned it */ LWLockRelease(newPartitionLock); *foundPtr = TRUE; if (!valid) { /* * We can only get here if (a) someone else is still reading in * the page, or (b) a previous read attempt failed. We have to * wait for any active read attempt to finish, and then set up our * own read attempt if the page is still not BM_VALID. * StartBufferIO does it all. */ if (StartBufferIO(buf, true)) { /* * If we get here, previous attempts to read the buffer must * have failed ... but we shall bravely try again. */ *foundPtr = FALSE; } } return buf; } /* * Didn't find it in the buffer pool. We'll have to initialize a new * buffer. Remember to unlock the mapping lock while doing the work. */ LWLockRelease(newPartitionLock); /* Loop here in case we have to try another victim buffer */ for (;;) { There are two options here. I split out BufferAlloc_shared and either make BufferAlloc call it, or not. If I make BufferAlloc call it, the LWLockAcquire/Release could move to the _shared function without confusion, but I'll either have to return the PinBuffer BM_VALID test result via a validPtr, or repeat the test in the caller before calling StartBufferIO. Both the INIT_BUFFERTAG and BufTableHashCode() would also have to be in both functions, since they are used for BufTableInsert() later in BufferAlloc. On the other hand, if I don't make BufferAlloc call the _shared function, then we end up with a new few-line special-case function plus either a self-contained new branch in ReadBuffer_common, or a bunch of RBM_CACHE_ONLY tests. That doesn't really seem an improvement over what the original patch did, i.e. introduce a single special-case function to handle a single special-case access. I'm open to suggestions. -- Abhijit P.S. For reference, I've attached the patch that I posted earlier, rebased to apply to master. commit 4cbbbd242b7ee6333e1d1a09794aa1cd9d39035f Author: Abhijit Menon-Sen a...@2ndquadrant.com Date: Tue Jun 23 13:38:01 2015 +0530 Introduce XLogLockBlockRangeForCleanup() When replaying B-Tree vacuum records, we need to confirm that a range of blocks is unpinned, but there was no way to ask the buffer manager if a block was pinned without having it read in uncached pages. This patch exposes a new interface that returns pages if they're in the cache, or InvalidBuffer, and uses that in a function that locks a range of blocks without reading them in from disk. diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c index 2f85867..e49a9ba 100644 --- a/src/backend/access/nbtree/nbtxlog.c +++ b/src/backend/access/nbtree/nbtxlog.c @@ -416,33 +416,10 @@ btree_xlog_vacuum(XLogReaderState *record) { RelFileNode thisrnode; BlockNumber thisblkno; - BlockNumber blkno; XLogRecGetBlockTag(record, 0, thisrnode, NULL, thisblkno); - - for (blkno = xlrec-lastBlockVacuumed + 1; blkno thisblkno; blkno++) - { - /* - * We use RBM_NORMAL_NO_LOG mode because it's not an error - * condition to see all-zero pages. The original btvacuumpage - * scan would have skipped over all-zero pages, noting them in FSM - * but not bothering to initialize them just yet; so we mustn't - * throw an error here. (We could skip acquiring the cleanup lock - * if PageIsNew, but it's probably not worth the cycles to test.) - * - * XXX we don't actually need to read the block, we just need to - * confirm it is unpinned. If we had a special call into the - * buffer manager we could optimise this so that if the block is - * not in shared_buffers
Re: [HACKERS] skipping pg_log in basebackup (was Re: pg_basebackup and pg_stat_tmp directory)
At 2015-06-11 14:38:03 +0900, langote_amit...@lab.ntt.co.jp wrote: On the other hand, I don't like the idea of doing (3) by adding command line arguments to pg_basebackup and adding a new option to the command. I don't think that level of flexibility is justified; it would also make it easier to end up with a broken base backup (by inadvertently excluding more than you meant to). Maybe a combination of (2) and part of (3). In absence of any command line argument, the behavior is (2), to exclude. Provide an option to *include* it (-S/--serverlog). I don't like that idea any more than having the command-line argument to exclude pg_log. (And people who store torrented files in PGDATA may like it even less.) -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] skipping pg_log in basebackup (was Re: pg_basebackup and pg_stat_tmp directory)
At 2015-06-11 14:28:36 +0900, michael.paqu...@gmail.com wrote: After spending the night thinking about that, honestly, I think that we should go with (2) and keep the base backup as light-weight as possible and not bother about a GUC. OK. Then the patch I posted earlier should be sufficient. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] skipping pg_log in basebackup (was Re: pg_basebackup and pg_stat_tmp directory)
At 2015-06-10 13:22:27 -0400, robertmh...@gmail.com wrote: I'm not clear on which of these options you are voting for: (1) include pg_log in pg_basebackup as we do currently (2) exclude it (3) add a switch controlling whether or not it gets excluded I can live with (3), but I bet most people want (2). Thanks for spelling out the options. I strongly prefer (2), but I could live with (3) if it were done as a GUC setting. (And if that's what we decide to do, I'm willing to write up the patch.) Whether or not it's a good idea to let one's logfiles grow to 8GB, the fact that doing so breaks base backups means that being able to exclude pg_log *somehow* is more of a necessity than personal preference. On the other hand, I don't like the idea of doing (3) by adding command line arguments to pg_basebackup and adding a new option to the command. I don't think that level of flexibility is justified; it would also make it easier to end up with a broken base backup (by inadvertently excluding more than you meant to). -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] skipping pg_log in basebackup (was Re: pg_basebackup and pg_stat_tmp directory)
Hi. This is a followup to a 2014-02 discussion that led to pg_stats_temp being excluded from pg_basebackup. At the time, it was discussed to exclude pg_log as well, but nothing eventually came of that. Recently, one of our customers has had a basebackup fail because pg_log contained files that were 8GB: FATAL: archive member pg_log/postgresql-20150119.log too large for tar format I think pg_basebackup should also skip pg_log entries, as it does for pg_stats_temp and pg_replslot, etc. I've attached a patch along those lines for discussion. -- Abhijit P.S. Aren't we leaking statrelpath? From 8db162c1385b1cdd2b0e666975b76aa814f09f35 Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen a...@2ndquadrant.com Date: Mon, 27 Apr 2015 12:58:52 +0530 Subject: Skip files in pg_log during basebackup --- src/backend/replication/basebackup.c | 29 + 1 file changed, 29 insertions(+) diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index 1e86e4c..cc75a03 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -27,6 +27,7 @@ #include nodes/pg_list.h #include pgtar.h #include pgstat.h +#include postmaster/syslogger.h #include replication/basebackup.h #include replication/walsender.h #include replication/walsender_private.h @@ -72,6 +73,9 @@ static bool backup_started_in_recovery = false; /* Relative path of temporary statistics directory */ static char *statrelpath = NULL; +/* Relative path to log directory */ +static char logpath[MAXPGPATH]; + /* * Size of each block sent into the tar stream for larger files. */ @@ -157,6 +161,18 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir) else statrelpath = pgstat_stat_directory; + /* + * Do the same for the log_directory. + */ + + if (is_absolute_path(Log_directory) + path_is_prefix_of_path(DataDir, Log_directory)) + snprintf(logpath, MAXPGPATH, ./%s, Log_directory + datadirpathlen + 1); + else if (strncmp(Log_directory, ./, 2) != 0) + snprintf(logpath, MAXPGPATH, ./%s, Log_directory); + else + strncpy(logpath, Log_directory, MAXPGPATH); + /* Add a node for the base directory at the end */ ti = palloc0(sizeof(tablespaceinfo)); ti-size = opt-progress ? sendDir(., 1, true, tablespaces, true) : -1; @@ -965,6 +981,19 @@ sendDir(char *path, int basepathlen, bool sizeonly, List *tablespaces, } /* + * We can skip pg_log (or whatever log_directory is set to, if + * it's under the data directory), but include it as an empty + * directory anyway, so we get permissions right. + */ + if (strcmp(pathbuf, logpath) == 0) + { + if (!sizeonly) +_tarWriteHeader(pathbuf + basepathlen + 1, NULL, statbuf); + size += 512; + continue; + } + + /* * Skip pg_replslot, not useful to copy. But include it as an empty * directory anyway, so we get permissions right. */ -- 1.9.1 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] skipping pg_log in basebackup
At 2015-06-08 13:09:02 +0900, michael.paqu...@gmail.com wrote: It seems to be that: http://www.postgresql.org/message-id/cahgqgwh0okz6ckpjkcwojga3ejwffm1enrmro3dkdoteaai...@mail.gmail.com (Note that this is about calculating the wrong size, whereas my bug is about the file being too large to be written to a tar archive.) And a recent discussion about that is this one: http://www.postgresql.org/message-id/82897A1301080E4B8E461DDAA0FFCF142A1B2660@SYD1216 Oh, sorry, I somehow did miss that thread entirely. Thanks for the pointer. (I've added Vaishnavi to the Cc: list here.) I'm not convinced that we need a mechanism to let people exclude the torrent files they've stored in their data directory, but if we have to do it, the idea of having a GUC setting rather than specifying excludes on the basebackup command line each time does have a certain appeal. Anyone else interested in doing it that way? -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_xlog - pg_xjournal?
At 2015-06-01 23:35:23 -0500, htf...@gmail.com wrote: No, it won't prevent the incredibly stupid from doing incredibly stupid things, nothing will. I hate to speechify, but I think we should try hard to avoid framing such questions in terms of incredibly stupid people and the things they might do. We have anecdotal and circumstantial evidence that the names pg_xlog and pg_clog have given some people the impression that they can delete files therein. Sometimes do this when their server is in imminent danger of running out of space, sometimes not. But our documentation makes it clear that these files are important. I think naming these directories to convey the right impression is a straightforward interface design problem, and we also know that big flashing red warnings are less effective than one might hope for. I do not think a bigger, stripier warning is worth doing in isolation. I do think it's worth choosing better names. -- Abhijit P.S. Unrelated to Michael's mail, but I also don't think it's worth debating whether people will run rm -rf *log or rm -rf log/* or whatever other variant you can think of. I'm arguing for correcting a mis-perception, not try to dodge specific harmful commands. Tom's proposal of using a symlink but dropping it after third-party tools have had time to catch up seems like the best approach to me. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_xlog - pg_xjournal?
At 2015-05-31 13:46:33 -0400, t...@sss.pgh.pa.us wrote: just always create pg_xlog as a symlink to pg_xjournal during initdb. At first glance, the Subject: of this thread made me think that *was* Joel's proposal. :-) But I think it's a great idea, and worth doing. I think pg_journal (no x) is sufficient. The journal is an idea that people are familiar with from filesystems anyway. Note that we'd really also have to rename pg_clog etc pg_clog could become pg_commits or pg_xactstatus or pg_commit_status or something. What else is there? I'd hope pg_logical can be left alone. A more difficult question is whether we'd also rename pg_resetxlog, pg_receivexlog, etc. I don't think it's necessary. (Of course, people have wanted to rename pg_resetxlog to make it sound more scary anyway, but that's a different matter.) In the end though, this is a lot of thrashing for a problem that only comes up rarely ... I'll agree with Joel that it comes up far too often for comfort anyway. I've known a number of people who were on the verge of deleting stuff from pg_xlog, but just happened to check with me first. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [PATCH, TRIVIAL] don't specify S_IRUSR|S_IWUSR without O_CREAT
Just for the record: a minor nit I noticed yesterday. -- Abhijit From 07353c86483f7e26d44a9bbe94b32315537cee73 Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen a...@2ndquadrant.com Date: Fri, 29 May 2015 23:15:15 +0530 Subject: The file mode is ignored without O_CREAT, so set it to 0 --- src/backend/storage/file/fd.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index b4f6590..2f313ff 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -416,6 +416,7 @@ void fsync_fname(char *fname, bool isdir) { int fd; + int flags; int returncode; /* @@ -423,14 +424,13 @@ fsync_fname(char *fname, bool isdir) * systems don't allow us to fsync files opened read-only; so we need both * cases here */ + flags = PG_BINARY; if (!isdir) - fd = OpenTransientFile(fname, - O_RDWR | PG_BINARY, - S_IRUSR | S_IWUSR); + flags |= O_RDWR; else - fd = OpenTransientFile(fname, - O_RDONLY | PG_BINARY, - S_IRUSR | S_IWUSR); + flags |= O_RDONLY; + + fd = OpenTransientFile(fname, flags, 0); /* * Some OSs don't allow us to open directories at all (Windows returns -- 1.9.1 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
At 2015-05-29 13:14:18 -0400, t...@sss.pgh.pa.us wrote: Pushed with minor revisions. Thanks, looks good. Since we're only logging the failures anyway, I think it is reasonable to log a complaint for any unwritable file in the data directory. Sounds reasonable, patch attached. ETXTBSY has no special meaning that I can find under Windows, so I included that change as well. -- Abhijit diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index b4f6590..70e2347 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -2646,11 +2646,12 @@ pre_sync_fname(const char *fname, bool isdir, int elevel) if (fd 0) { - if (errno == EACCES || (isdir errno == EISDIR)) + /* Some platforms don't support opening directories at all. */ + if (isdir errno == EISDIR) return; -#ifdef ETXTBSY - if (errno == ETXTBSY) +#ifdef WIN32 + if (isdir errno == EACCES) return; #endif @@ -2701,11 +2702,12 @@ fsync_fname_ext(const char *fname, bool isdir, int elevel) fd = OpenTransientFile((char *) fname, flags, 0); if (fd 0) { - if (errno == EACCES || (isdir errno == EISDIR)) + /* Some platforms don't support opening directories at all. */ + if (isdir errno == EISDIR) return; -#ifdef ETXTBSY - if (errno == ETXTBSY) +#ifdef WIN32 + if (isdir errno == EACCES) return; #endif diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index f0d66fa..f343168 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -617,11 +617,12 @@ pre_sync_fname(const char *fname, bool isdir) if (fd 0) { - if (errno == EACCES || (isdir errno == EISDIR)) + /* Some platforms don't support opening directories at all. */ + if (isdir errno == EISDIR) return; -#ifdef ETXTBSY - if (errno == ETXTBSY) +#ifdef WIN32 + if (isdir errno == EACCES) return; #endif @@ -682,11 +683,12 @@ fsync_fname_ext(const char *fname, bool isdir) fd = open(fname, flags); if (fd 0) { - if (errno == EACCES || (isdir errno == EISDIR)) + /* Some platforms don't support opening directories at all. */ + if (isdir errno == EISDIR) return; -#ifdef ETXTBSY - if (errno == ETXTBSY) +#ifdef WIN32 + if (isdir errno == EACCES) return; #endif -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
At 2015-05-28 19:56:15 +0530, a...@2ndquadrant.com wrote: I have a separate patch to initdb with the corresponding changes, which I will post after dinner and a bit more testing. Here's that patch too. I was a bit unsure how far to go with this. It fixes the problem of not following pg_xlog if it's a symlink (Andres) and the one where it died on bogus stuff in pg_tblspc (Tom) and unreadable files anywhere else (it now spews errors to stderr, but carries on for as long as it can). I've done a bit more than strictly necessary to fix those problems, though, and made the code largely similar to what's in the other patch. If you want something minimal, let me know and I'll cut it down to size. -- Abhijit P.S. If this patch gets applied, then the Adapted from walktblspc_links in initdb.c comment in fd.c should be changed: s/links/entries/. From 8e69433fae0b4f51879793f6594c76b99d764818 Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen a...@2ndquadrant.com Date: Thu, 28 May 2015 22:02:29 +0530 Subject: Make initdb -S behave like xlog.c:fsync_pgdata() In particular, it should silently skip unreadable/unexpected files in the data directory and follow symlinks only for pg_xlog and under pg_tblspc. --- src/bin/initdb/initdb.c | 235 +--- 1 file changed, 122 insertions(+), 113 deletions(-) diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index f1c4920..8ebaa55 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -218,9 +218,9 @@ static char **filter_lines_with_token(char **lines, const char *token); static char **readfile(const char *path); static void writefile(char *path, char **lines); static void walkdir(char *path, void (*action) (char *fname, bool isdir)); -static void walktblspc_links(char *path, void (*action) (char *fname, bool isdir)); +static void walktblspc_entries(char *path, void (*action) (char *fname, bool isdir)); static void pre_sync_fname(char *fname, bool isdir); -static void fsync_fname(char *fname, bool isdir); +static void fsync_fname_ext(char *fname, bool isdir); static FILE *popen_check(const char *command, const char *mode); static void exit_nicely(void); static char *get_id(void); @@ -248,7 +248,7 @@ static void load_plpgsql(void); static void vacuum_db(void); static void make_template0(void); static void make_postgres(void); -static void perform_fsync(void); +static void fsync_pgdata(void); static void trapsig(int signum); static void check_ok(void); static char *escape_quotes(const char *src); @@ -518,38 +518,38 @@ writefile(char *path, char **lines) * walkdir: recursively walk a directory, applying the action to each * regular file and directory (including the named directory itself). * - * Adapted from copydir() in copydir.c. + * Adapted from walkdir() in backend/storage/file/fd.c. */ static void walkdir(char *path, void (*action) (char *fname, bool isdir)) { DIR *dir; - struct dirent *direntry; - char subpath[MAXPGPATH]; + struct dirent *de; dir = opendir(path); if (dir == NULL) { fprintf(stderr, _(%s: could not open directory \%s\: %s\n), progname, path, strerror(errno)); - exit_nicely(); + return; } - while (errno = 0, (direntry = readdir(dir)) != NULL) + while (errno = 0, (de = readdir(dir)) != NULL) { + char subpath[MAXPGPATH]; struct stat fst; - if (strcmp(direntry-d_name, .) == 0 || - strcmp(direntry-d_name, ..) == 0) + if (strcmp(de-d_name, .) == 0 || + strcmp(de-d_name, ..) == 0) continue; - snprintf(subpath, MAXPGPATH, %s/%s, path, direntry-d_name); + snprintf(subpath, MAXPGPATH, %s/%s, path, de-d_name); if (lstat(subpath, fst) 0) { fprintf(stderr, _(%s: could not stat file \%s\: %s\n), progname, subpath, strerror(errno)); - exit_nicely(); + continue; } if (S_ISDIR(fst.st_mode)) @@ -559,75 +559,72 @@ walkdir(char *path, void (*action) (char *fname, bool isdir)) } if (errno) - { fprintf(stderr, _(%s: could not read directory \%s\: %s\n), progname, path, strerror(errno)); - exit_nicely(); - } - if (closedir(dir)) - { - fprintf(stderr, _(%s: could not close directory \%s\: %s\n), -progname, path, strerror(errno)); - exit_nicely(); - } + (void) closedir(dir); - /* - * It's important to fsync the destination directory itself as individual - * file fsyncs don't guarantee that the directory entry for the file is - * synced. Recent versions of ext4 have made the window much wider but - * it's been an issue for ext3 and other filesystems in the past. - */ (*action) (path, true); } /* - * walktblspc_links: call walkdir on each entry under the given - * pg_tblspc directory, or do nothing if pg_tblspc doesn't exist. + * walktblspc_entries -- apply the action to the entries in pb_tblspc + * + * We expect to find only symlinks to tablespace directories here, but + * we'll apply the action to regular files, and call walkdir() on any + * directories as well
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
At 2015-05-28 17:37:16 -0400, t...@sss.pgh.pa.us wrote: I've committed this after some mostly-cosmetic rearrangements. Thank you. I have to leave shortly, so I'll look at the initdb cleanup tomorrow. Here's a revision of that patch that's more along the lines of what you committed. It occurred to me that we should probably also silently skip EACCES on opendir and stat/lstat in walkdir. That's what the attached initdb patch does. If you think it's a good idea, there's also a small fixup attached for the backend version too. -- Abhijit From 26bb25e99a2954381587bbfe296a50b19a7f047c Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen a...@2ndquadrant.com Date: Thu, 28 May 2015 22:02:29 +0530 Subject: Make initdb -S behave like xlog.c:fsync_pgdata() In particular, it should silently skip unreadable/unexpected files in the data directory and follow symlinks only for pg_xlog and under pg_tblspc. --- src/bin/initdb/initdb.c | 305 1 file changed, 151 insertions(+), 154 deletions(-) diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index f1c4920..9d5478c 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -71,6 +71,13 @@ /* Ideally this would be in a .h file, but it hardly seems worth the trouble */ extern const char *select_default_timezone(const char *share_path); +/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */ +#if defined(HAVE_SYNC_FILE_RANGE) +#define PG_FLUSH_DATA_WORKS 1 +#elif defined(USE_POSIX_FADVISE) defined(POSIX_FADV_DONTNEED) +#define PG_FLUSH_DATA_WORKS 1 +#endif + static const char *auth_methods_host[] = {trust, reject, md5, password, ident, radius, #ifdef ENABLE_GSS gss, @@ -217,10 +224,13 @@ static char **filter_lines_with_token(char **lines, const char *token); #endif static char **readfile(const char *path); static void writefile(char *path, char **lines); -static void walkdir(char *path, void (*action) (char *fname, bool isdir)); -static void walktblspc_links(char *path, void (*action) (char *fname, bool isdir)); -static void pre_sync_fname(char *fname, bool isdir); -static void fsync_fname(char *fname, bool isdir); +static void walkdir(const char *path, + void (*action) (const char *fname, bool isdir), + bool process_symlinks); +#ifdef PG_FLUSH_DATA_WORKS +static void pre_sync_fname(const char *fname, bool isdir); +#endif +static void fsync_fname_ext(const char *fname, bool isdir); static FILE *popen_check(const char *command, const char *mode); static void exit_nicely(void); static char *get_id(void); @@ -248,7 +258,7 @@ static void load_plpgsql(void); static void vacuum_db(void); static void make_template0(void); static void make_postgres(void); -static void perform_fsync(void); +static void fsync_pgdata(void); static void trapsig(int signum); static void check_ok(void); static char *escape_quotes(const char *src); @@ -521,56 +531,58 @@ writefile(char *path, char **lines) * Adapted from copydir() in copydir.c. */ static void -walkdir(char *path, void (*action) (char *fname, bool isdir)) +walkdir(const char *path, + void (*action) (const char *fname, bool isdir), + bool process_symlinks) { DIR *dir; - struct dirent *direntry; - char subpath[MAXPGPATH]; + struct dirent *de; dir = opendir(path); if (dir == NULL) { - fprintf(stderr, _(%s: could not open directory \%s\: %s\n), -progname, path, strerror(errno)); - exit_nicely(); + if (errno != EACCES) + fprintf(stderr, _(%s: could not open directory \%s\: %s\n), + progname, path, strerror(errno)); + return; } - while (errno = 0, (direntry = readdir(dir)) != NULL) + while (errno = 0, (de = readdir(dir)) != NULL) { + char subpath[MAXPGPATH]; struct stat fst; + int sret; - if (strcmp(direntry-d_name, .) == 0 || - strcmp(direntry-d_name, ..) == 0) + if (strcmp(de-d_name, .) == 0 || + strcmp(de-d_name, ..) == 0) continue; - snprintf(subpath, MAXPGPATH, %s/%s, path, direntry-d_name); + snprintf(subpath, MAXPGPATH, %s/%s, path, de-d_name); + + if (process_symlinks) + sret = stat(subpath, fst); + else + sret = lstat(subpath, fst); - if (lstat(subpath, fst) 0) + if (sret 0) { - fprintf(stderr, _(%s: could not stat file \%s\: %s\n), - progname, subpath, strerror(errno)); - exit_nicely(); + if (errno != EACCES) +fprintf(stderr, _(%s: could not stat file \%s\: %s\n), + progname, subpath, strerror(errno)); + continue; } - if (S_ISDIR(fst.st_mode)) - walkdir(subpath, action); - else if (S_ISREG(fst.st_mode)) + if (S_ISREG(fst.st_mode)) (*action) (subpath, false); + else if (S_ISDIR(fst.st_mode)) + walkdir(subpath, action, false); } if (errno) - { fprintf(stderr, _(%s: could not read directory \%s\: %s\n), progname, path, strerror(errno)); - exit_nicely(); - } - if (closedir(dir)) - { - fprintf(stderr, _(%s: could not close directory \%s\: %s\n), -progname, path, strerror
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Hi. Here's an updated patch for the fsync problem(s). A few points that may be worth considering: 1. I've made the ReadDir → ReadDirExtended change, but in retrospect I don't think it's really worth it. Using ReadDir and letting it die is probably fine. (Aside: I left ReadDirExtended static for now.) 2. Robert, are you comfortable with what fsync_pgdata() does in xlog.c? I wasn't sure if I should move that to fd.c as well. I think it's borderline OK for now. 3. There's a comment in fsync_fname that we need to open directories with O_RDONLY and non-directories with O_RDWR. Does this also apply to pre_sync_fname (i.e. sync_file_range and posix_fadvise) on any platforms we care about? It doesn't seem so, but I'm not sure. 4. As a partial aside, why does fsync_fname use OpenTransientFile? It looks like it should use BasicOpenFile as pre_sync_fname does. We close the fd immediately after calling fsync anyway. Or is there some reason I'm missing that we should use OpenTransientFile in pre_sync_fname too? (Aside²: it's pointless to specify S_IRUSR|S_IWUSR in fsync_fname since we're not setting O_CREAT. Minor, so will submit separate patch.) 5. I made walktblspc_entries use stat() instead of lstat(), so that we can handle symlinks and directories the same way. Note that we won't continue to follow links inside the sub-directories because we use walkdir instead of recursing. But this depends on S_ISDIR() returning true after stat() on Windows with a junction. Does that work? And are the entries in pb_tblspc on Windows actually junctions? I've tested this with unreadable files in PGDATA and junk inside pb_tblspc. Feedback welcome. I have a separate patch to initdb with the corresponding changes, which I will post after dinner and a bit more testing. -- Abhijit From 491dcb8c7203fd992dd9c441f5463a75c88e6f52 Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen a...@2ndquadrant.com Date: Thu, 28 May 2015 17:22:18 +0530 Subject: Skip unreadable files in fsync_pgdata; follow only expected symlinks --- src/backend/access/transam/xlog.c | 40 +- src/backend/storage/file/fd.c | 262 ++ src/include/storage/fd.h | 6 +- 3 files changed, 226 insertions(+), 82 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 087b6be..32447e7 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -11605,20 +11605,53 @@ SetWalWriterSleeping(bool sleeping) /* * Issue fsync recursively on PGDATA and all its contents. + * + * We fsync regular files and directories wherever they are, but we + * follow symlinks only for pg_xlog and under pg_tblspc. */ static void fsync_pgdata(char *datadir) { + bool xlog_is_symlink; + char pg_xlog[MAXPGPATH]; + char pg_tblspc[MAXPGPATH]; + struct stat st; + if (!enableFsync) return; + snprintf(pg_xlog, MAXPGPATH, %s/pg_xlog, datadir); + snprintf(pg_tblspc, MAXPGPATH, %s/pg_tblspc, datadir); + + /* + * If pg_xlog is a symlink, then we need to recurse into it separately, + * because the first walkdir below will ignore it. + */ + xlog_is_symlink = false; + + if (lstat(pg_xlog, st) 0) + ereport(ERROR, +(errcode_for_file_access(), +errmsg(could not stat directory \pg_xlog\: %m))); + +#ifndef WIN32 + if (S_ISLNK(st.st_mode)) + xlog_is_symlink = true; +#else + if (pgwin32_is_junction(subpath)) + xlog_is_symlink = true; +#endif + /* * If possible, hint to the kernel that we're soon going to fsync the data * directory and its contents. */ #if defined(HAVE_SYNC_FILE_RANGE) || \ (defined(USE_POSIX_FADVISE) defined(POSIX_FADV_DONTNEED)) - walkdir(datadir, pre_sync_fname); + walkdir(DEBUG1, datadir, pre_sync_fname); + if (xlog_is_symlink) + walkdir(DEBUG1, pg_xlog, pre_sync_fname); + walktblspc_entries(DEBUG1, pg_tblspc, pre_sync_fname); #endif /* @@ -11628,5 +11661,8 @@ fsync_pgdata(char *datadir) * file fsyncs don't guarantee that the directory entry for the file is * synced. */ - walkdir(datadir, fsync_fname); + walkdir(LOG, datadir, fsync_fname_ext); + if (xlog_is_symlink) + walkdir(LOG, pg_xlog, fsync_fname_ext); + walktblspc_entries(LOG, pg_tblspc, fsync_fname_ext); } diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 68d43c6..916f8b5 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -1921,6 +1921,37 @@ TryAgain: } /* + * Read a directory opened with AllocateDir and return the next dirent. + * On error, ereports at a caller-specified level and returns NULL if + * the level is not ERROR or more severe. + */ +static struct dirent * +ReadDirExtended(int elevel, DIR *dir, const char *dirname) +{ + struct dirent *dent; + + /* Give a generic message for AllocateDir failure, if caller didn't */ + if (dir == NULL) + { + ereport(elevel, +(errcode_for_file_access(), + errmsg(could
[HACKERS] [PATCH] readlink missing nul-termination in pg_rewind
This is just something I noticed in passing. (I did a quick check of all the other uses of readlink in the source, and they do get this right.) -- Abhijit P.S. Also in passing, I note that pg_rewind will follow links under any directory anywhere named pg_tblspc (which probably doesn't matter), and does not follow pg_xlog if it's a symlink (which probably does). If you want, I can submit a trivial patch for the latter. From d1e5cbea21bb916253bce2ad189deb1924864508 Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen a...@2ndquadrant.com Date: Thu, 28 May 2015 21:03:50 +0530 Subject: readlink() doesn't nul-terminate the buffer, so we have to --- src/bin/pg_rewind/copy_fetch.c | 1 + 1 file changed, 1 insertion(+) diff --git a/src/bin/pg_rewind/copy_fetch.c b/src/bin/pg_rewind/copy_fetch.c index 9e317a2..4a7150b 100644 --- a/src/bin/pg_rewind/copy_fetch.c +++ b/src/bin/pg_rewind/copy_fetch.c @@ -126,6 +126,7 @@ recurse_dir(const char *datadir, const char *parentpath, fullpath); } + link_target[len] = '\0'; callback(path, FILE_TYPE_SYMLINK, 0, link_target); /* -- 1.9.1 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
At 2015-05-27 11:46:39 +0530, a...@2ndquadrant.com wrote: I'm trying a couple of approaches to that (e.g. using readdir directly instead of ReadDir), but other suggestions are welcome. Here's what that looks like, but not yet fully tested. -- Abhijit diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 087b6be..6956e88 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -11609,16 +11609,41 @@ SetWalWriterSleeping(bool sleeping) static void fsync_pgdata(char *datadir) { + bool xlog_is_symlink; + char pg_xlog[MAXPGPATH]; + char pg_tblspc[MAXPGPATH]; + struct stat st; + if (!enableFsync) return; + snprintf(pg_xlog, MAXPGPATH, %s/pg_xlog, datadir); + snprintf(pg_tblspc, MAXPGPATH, %s/pg_tblspc, datadir); + + if (lstat(pg_xlog, st) 0) + ereport(ERROR, +(errcode_for_file_access(), +errmsg(could not stat directory \pg_xlog\: %m))); + + xlog_is_symlink = false; +#ifndef WIN32 + if (S_ISLNK(st.st_mode)) + xlog_is_symlink = true; +#else + if (pgwin32_is_junction(subpath)) + xlog_is_symlink = true; +#endif + /* * If possible, hint to the kernel that we're soon going to fsync the data * directory and its contents. */ #if defined(HAVE_SYNC_FILE_RANGE) || \ (defined(USE_POSIX_FADVISE) defined(POSIX_FADV_DONTNEED)) - walkdir(datadir, pre_sync_fname); + walkdir(DEBUG1, datadir, pre_sync_fname); + if (xlog_is_symlink) + walkdir(DEBUG1, pg_xlog, pre_sync_fname); + walktblspc_links(DEBUG1, pg_tblspc, pre_sync_fname); #endif /* @@ -11628,5 +11653,8 @@ fsync_pgdata(char *datadir) * file fsyncs don't guarantee that the directory entry for the file is * synced. */ - walkdir(datadir, fsync_fname); + walkdir(WARNING, datadir, fsync_fname_ext); + if (xlog_is_symlink) + walkdir(DEBUG1, pg_xlog, fsync_fname_ext); + walktblspc_links(WARNING, pg_tblspc, fsync_fname_ext); } diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 68d43c6..711b4b7 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -2443,48 +2443,113 @@ looks_like_temp_rel_name(const char *name) /* * Hint to the OS that it should get ready to fsync() this file. * - * Adapted from pre_sync_fname in initdb.c + * Ignores errors trying to open unreadable files, and logs other errors at a + * caller-specified level. */ void -pre_sync_fname(char *fname, bool isdir) +pre_sync_fname(int elevel, char *fname, bool isdir) { int fd; fd = BasicOpenFile(fname, O_RDONLY | PG_BINARY, 0); + if (fd 0) + { + if (errno == EACCES || (isdir errno == EISDIR)) + return; + +#ifdef ETXTBSY + if (errno == ETXTBSY) + return; +#endif + + ereport(elevel, +(errcode_for_file_access(), + errmsg(could not open file \%s\: %m, fname))); + } + + (void) pg_flush_data(fd, 0, 0); + (void) close(fd); +} + +/* + * fsync_fname_ext -- Try to fsync a file or directory + * + * Ignores errors trying to open unreadable files, or trying to fsync + * directories on systems where that isn't allowed/required, and logs other + * errors at a caller-specified level. + */ +void +fsync_fname_ext(int elevel, char *fname, bool isdir) +{ + int fd; + int returncode; + /* - * Some OSs don't allow us to open directories at all (Windows returns - * EACCES) + * Some OSs require directories to be opened read-only whereas other + * systems don't allow us to fsync files opened read-only; so we need both + * cases here */ - if (fd 0 isdir (errno == EISDIR || errno == EACCES)) - return; + if (!isdir) + fd = OpenTransientFile(fname, + O_RDWR | PG_BINARY, + S_IRUSR | S_IWUSR); + else + fd = OpenTransientFile(fname, + O_RDONLY | PG_BINARY, + S_IRUSR | S_IWUSR); if (fd 0) - ereport(FATAL, -(errmsg(could not open file \%s\: %m, - fname))); + { + if (errno == EACCES || (isdir errno == EISDIR)) + return; - pg_flush_data(fd, 0, 0); +#ifdef ETXTBSY + if (errno == ETXTBSY) + return; +#endif - close(fd); + ereport(elevel, +(errcode_for_file_access(), + errmsg(could not open file \%s\: %m, fname))); + } + + returncode = pg_fsync(fd); + + /* + * Some OSes don't allow us to fsync directories at all, so we can ignore + * those errors. Anything else needs to be logged. + */ + if (returncode != 0 !(isdir errno == EBADF)) + ereport(elevel, +(errcode_for_file_access(), + errmsg(could not fsync file \%s\: %m, fname))); + + CloseTransientFile(fd); } /* * walkdir: recursively walk a directory, applying the action to each - * regular file and directory (including the named directory itself) - * and following symbolic links. + * regular file and directory (including the named directory itself). * - * NB: There is another version of walkdir in initdb.c, but that version - * behaves differently with respect to symbolic links. Caveat emptor! + * Adapted from walkdir in initdb.c */ void -walkdir(char *path, void (*action) (char
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
At 2015-05-27 20:22:18 -0400, t...@sss.pgh.pa.us wrote: I doubt that that (not using AllocateDir) […] (Disregarding per your followup.) What I think is a reasonable compromise is to treat AllocateDir failure as nonfatal, but to continue using ReadDir etc which means that any post-open failure in reading a successfully-opened directory would be fatal. OK, that's halfway between the two patches I posted. Will do. Meanwhile, it seems like you have copied a couple of very dubious decisions in initdb's walktblspc_links(): […] Now, we don't really have to do anything about these things in order to fix the immediate problem, but I wonder if we shouldn't try to clean up initdb's behavior while we're at it. OK, I'll fix that in both. Independently of that, I thought the plan was to complain about any problems at LOG message level, not DEBUG1, and definitely not WARNING (which is lower than LOG for this purpose). I'll change the level to LOG for fsync_fname_ext, but I think DEBUG1 is more appropriate for the pre_sync_fname run. Or do you think I should use LOG for that too? And why would you use a different message level for pg_xlog/ than for other files anyway? That was just a mistake, I forgot to change it after copying. Thanks for having a look. I'll post a revised patch shortly. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
At 2015-05-27 23:41:29 -0400, t...@sss.pgh.pa.us wrote: What about turning ReadDir into a wrapper around a ReadDirExtended function that adds an elevel argument? Sounds reasonable, doing. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
At 2015-05-26 19:07:20 +0200, and...@anarazel.de wrote: Abhijit, do you recall why the code was changed to follow all symlinks in contrast to explicitly going through the tablespaces as initdb -S does? I'm pretty sure early versions of the patch pretty much had a verbatim copy of the initdb logic? Yes, earlier versions of the patch did follow symlinks only in pg_tblspc. I changed this because Álvaro pointed out that this was likely to miss important stuff, e.g. in 20150403163232.ga28...@eldon.alvh.no-ip.org, 20150406131636.gd4...@alvh.no-ip.org -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
At 2015-05-26 22:44:03 +0200, and...@anarazel.de wrote: So, this was discussed in the following thread, starting at: http://archives.postgresql.org/message-id/20150403163232.GA28444%40eldon.alvh.no-ip.org Sorry, I didn't see this before replying. There are no other places we it's allowed to introduce symlinks and we have refuted bugreports of people having problems after doing that. OK. So what I propose is: 1) Remove the automatic symlink following 2) Follow pg_tbspc/*, pg_xlog if it's a symlink, fix the latter in initdb -S 3) Add a elevel argument to walkdir(), return if AllocateDir() fails, continue for stat() failures in the readdir() loop. 4) Add elevel argument to pre_sync_fname, fsync_fname, return after errors. 5) Accept EACCESS, ETXTBSY (if defined) when open()ing the files. By virtue of not following symlinks we should not need to worry about EROFS Makes sense. I've got most of that done, I'll just remove the symlink following (or rather, restrict it to the cases mentioned), test a bit, and post. I'm inclined to think that 4) is a big enough compat break that a fsync_fname_ext with the new argument is a good idea. Agreed. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
At 2015-05-26 03:54:51 +0200, and...@anarazel.de wrote: Say a symlink goes to a binary, which is currently being executed: ETXTBSY. Or the file is in a readonly filesystem: EROFS. So we'd need to ignore a lot of errors, possibly ignoring valid ones. Right. That's why I started out by being conservative and following only the expected symlinks in pg_tblspc (as other parts of the code do). Another thing is whether we should handle a recursive symlink in pgdata? I personally think not, but... I think not too. It's also not just as simple as making fsync_fname fail gracefully upon EACCESS - the opendir() could fail just as well. I'll post a proposed patch shortly. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)
At 2015-05-13 03:04:11 +0200, and...@anarazel.de wrote: I can live with that, although I'd personally go with pgstattuple_approx()... I could live with that too. Here's an incremental patch to rename the function. (I'm not resubmitting the whole thing since you said you've made some other changes too.) Thank you. -- Abhijit From 57d597f176294ebfe30efa6d73569505cbb41e31 Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen a...@2ndquadrant.com Date: Wed, 13 May 2015 09:19:07 +0530 Subject: Rename pgstatapprox to pgstattuple_approx --- contrib/pgstattuple/pgstatapprox.c| 4 ++-- contrib/pgstattuple/pgstattuple--1.2--1.3.sql | 4 ++-- contrib/pgstattuple/pgstattuple--1.3.sql | 4 ++-- doc/src/sgml/pgstattuple.sgml | 12 ++-- 4 files changed, 12 insertions(+), 12 deletions(-) diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c index 46f5bc0..f3b5671 100644 --- a/contrib/pgstattuple/pgstatapprox.c +++ b/contrib/pgstattuple/pgstatapprox.c @@ -21,7 +21,7 @@ #include utils/tqual.h #include commands/vacuum.h -PG_FUNCTION_INFO_V1(pgstatapprox); +PG_FUNCTION_INFO_V1(pgstattuple_approx); typedef struct output_type { @@ -234,7 +234,7 @@ build_tuple(output_type *stat, FunctionCallInfo fcinfo) */ Datum -pgstatapprox(PG_FUNCTION_ARGS) +pgstattuple_approx(PG_FUNCTION_ARGS) { Oid relid = PG_GETARG_OID(0); Relation rel; diff --git a/contrib/pgstattuple/pgstattuple--1.2--1.3.sql b/contrib/pgstattuple/pgstattuple--1.2--1.3.sql index bfcf464..99301a2 100644 --- a/contrib/pgstattuple/pgstattuple--1.2--1.3.sql +++ b/contrib/pgstattuple/pgstattuple--1.2--1.3.sql @@ -3,7 +3,7 @@ -- complain if script is sourced in psql, rather than via ALTER EXTENSION \echo Use ALTER EXTENSION pgstattuple UPDATE TO '1.3' to load this file. \quit -CREATE FUNCTION pgstatapprox(IN reloid regclass, +CREATE FUNCTION pgstattuple_approx(IN reloid regclass, OUT table_len BIGINT, -- physical table length in bytes OUT scanned_percent FLOAT8, -- what percentage of the table's pages was scanned OUT approx_tuple_count BIGINT, -- estimated number of live tuples @@ -14,5 +14,5 @@ CREATE FUNCTION pgstatapprox(IN reloid regclass, OUT dead_tuple_percent FLOAT8, -- dead tuples in % (based on estimate) OUT approx_free_space BIGINT, -- estimated free space in bytes OUT approx_free_percent FLOAT8) -- free space in % (based on estimate) -AS 'MODULE_PATHNAME', 'pgstatapprox' +AS 'MODULE_PATHNAME', 'pgstattuple_approx' LANGUAGE C STRICT; diff --git a/contrib/pgstattuple/pgstattuple--1.3.sql b/contrib/pgstattuple/pgstattuple--1.3.sql index f69b619..f3996e7 100644 --- a/contrib/pgstattuple/pgstattuple--1.3.sql +++ b/contrib/pgstattuple/pgstattuple--1.3.sql @@ -80,7 +80,7 @@ LANGUAGE C STRICT; /* New stuff in 1.3 begins here */ -CREATE FUNCTION pgstatapprox(IN reloid regclass, +CREATE FUNCTION pgstattuple_approx(IN reloid regclass, OUT table_len BIGINT, -- physical table length in bytes OUT scanned_percent FLOAT8, -- what percentage of the table's pages was scanned OUT approx_tuple_count BIGINT, -- estimated number of live tuples @@ -91,5 +91,5 @@ CREATE FUNCTION pgstatapprox(IN reloid regclass, OUT dead_tuple_percent FLOAT8, -- dead tuples in % (based on estimate) OUT approx_free_space BIGINT, -- estimated free space in bytes OUT approx_free_percent FLOAT8) -- free space in % (based on estimate) -AS 'MODULE_PATHNAME', 'pgstatapprox' +AS 'MODULE_PATHNAME', 'pgstattuple_approx' LANGUAGE C STRICT; diff --git a/doc/src/sgml/pgstattuple.sgml b/doc/src/sgml/pgstattuple.sgml index 4cba006..cca6dc9 100644 --- a/doc/src/sgml/pgstattuple.sgml +++ b/doc/src/sgml/pgstattuple.sgml @@ -361,19 +361,19 @@ pending_tuples | 0 varlistentry term indexterm - primarypgstatapprox/primary + primarypgstattuple_approx/primary /indexterm - functionpgstatapprox(regclass) returns record/ + functionpgstattuple_approx(regclass) returns record/ /term listitem para - functionpgstatapprox/function is a faster alternative to + functionpgstattuple_approx/function is a faster alternative to functionpgstattuple/function that returns approximate results. The argument is the target relation's OID. For example: programlisting -test= SELECT * FROM pgstatapprox('pg_catalog.pg_proc'::regclass); +test= SELECT * FROM pgstattuple_approx('pg_catalog.pg_proc'::regclass); -[ RECORD 1 ]+--- table_len| 573440 scanned_percent | 2 @@ -392,7 +392,7 @@ approx_free_percent | 2.09 para Whereas functionpgstattuple/function always performs a full-table scan and returns an exact count of live and dead tuples - (and their sizes) and free space, functionpgstatapprox/function + (and their sizes) and free space, functionpgstattuple_approx
Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)
Hi Andres. I've attached an updated patch for pgstatbloat, as well as a patch to replace two uses of BuildTupleFromCStrings() elsewhere in pgstattuple. I've made the changes you mentioned in your earlier mail, except that I have not changed the name pending further suggestions about what would be the best name. Also: At 2015-05-09 15:36:49 +0530, a...@2ndquadrant.com wrote: At 2015-05-09 02:20:51 +0200, and...@anarazel.de wrote: I haven't checked, but I'm not sure that it's safe/meaningful to call PageGetHeapFreeSpace() on a new page. OK, I'll check and fix if necessary. You're right, PageGetHeapFreeSpace() isn't safe on a new page. I've added a guard to that call in the attached patch, but I'm not sure that's the right thing to do. Should I copy the orphaned new-page handling from lazy_scan_heap? What about for empty pages? Neither feels like a very good thing to do, but then neither does skipping the new page silently. Should I count the space it would have free if it were initialised, but leave the page alone for VACUUM to deal with? Or just leave it as it is? -- Abhijit From 421267bba8394255feed8f9b9424d25d0be9f979 Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen a...@2ndquadrant.com Date: Mon, 11 May 2015 15:54:48 +0530 Subject: Make pgstattuple use heap_form_tuple instead of BuildTupleFromCStrings --- contrib/pgstattuple/pgstatindex.c | 43 ++--- contrib/pgstattuple/pgstattuple.c | 45 ++- 2 files changed, 37 insertions(+), 51 deletions(-) diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c index a2ea5d7..608f729 100644 --- a/contrib/pgstattuple/pgstatindex.c +++ b/contrib/pgstattuple/pgstatindex.c @@ -257,7 +257,8 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo) { TupleDesc tupleDesc; int j; - char *values[10]; + Datum values[10]; + bool nulls[10]; HeapTuple tuple; /* Build a tuple descriptor for our result type */ @@ -265,33 +266,31 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo) elog(ERROR, return type must be a row type); j = 0; - values[j++] = psprintf(%d, indexStat.version); - values[j++] = psprintf(%d, indexStat.level); - values[j++] = psprintf(INT64_FORMAT, - (indexStat.root_pages + -indexStat.leaf_pages + -indexStat.internal_pages + -indexStat.deleted_pages + -indexStat.empty_pages) * BLCKSZ); - values[j++] = psprintf(%u, indexStat.root_blkno); - values[j++] = psprintf(INT64_FORMAT, indexStat.internal_pages); - values[j++] = psprintf(INT64_FORMAT, indexStat.leaf_pages); - values[j++] = psprintf(INT64_FORMAT, indexStat.empty_pages); - values[j++] = psprintf(INT64_FORMAT, indexStat.deleted_pages); + values[j++] = Int32GetDatum(indexStat.version); + values[j++] = Int32GetDatum(indexStat.level); + values[j++] = Int64GetDatum((indexStat.root_pages + + indexStat.leaf_pages + + indexStat.internal_pages + + indexStat.deleted_pages + + indexStat.empty_pages) * BLCKSZ); + values[j++] = Int32GetDatum(indexStat.root_blkno); + values[j++] = Int32GetDatum(indexStat.internal_pages); + values[j++] = Int32GetDatum(indexStat.leaf_pages); + values[j++] = Int32GetDatum(indexStat.empty_pages); + values[j++] = Int32GetDatum(indexStat.deleted_pages); if (indexStat.max_avail 0) - values[j++] = psprintf(%.2f, - 100.0 - (double) indexStat.free_space / (double) indexStat.max_avail * 100.0); + values[j++] = Float8GetDatum(100.0 - (double) indexStat.free_space / (double) indexStat.max_avail * 100.0); else - values[j++] = pstrdup(NaN); + values[j++] = CStringGetDatum(NaN); if (indexStat.leaf_pages 0) - values[j++] = psprintf(%.2f, - (double) indexStat.fragments / (double) indexStat.leaf_pages * 100.0); + values[j++] = Float8GetDatum((double) indexStat.fragments / (double) indexStat.leaf_pages * 100.0); else - values[j++] = pstrdup(NaN); + values[j++] = CStringGetDatum(NaN); - tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc), - values); + for (j = 0; j 10; j++) + nulls[j] = false; + tuple = heap_form_tuple(tupleDesc, values, nulls); result = HeapTupleGetDatum(tuple); } diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c index c3a8b1d..552f188 100644 --- a/contrib/pgstattuple/pgstattuple.c +++ b/contrib/pgstattuple/pgstattuple.c @@ -85,28 +85,20 @@ static Datum build_pgstattuple_type(pgstattuple_type *stat, FunctionCallInfo fcinfo) { #define NCOLUMNS 9 -#define NCHARS 32 HeapTuple tuple; - char *values[NCOLUMNS]; - char values_buf[NCOLUMNS][NCHARS]; + Datum values[NCOLUMNS]; + bool nulls[NCOLUMNS]; int i; double tuple_percent; double dead_tuple_percent; double free_percent; /* free/reusable space in % */ TupleDesc tupdesc; - AttInMetadata *attinmeta; /* Build a tuple descriptor for our result type
Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)
At 2015-05-11 19:15:47 +0200, and...@anarazel.de wrote: TBH, I'd rather not touch unrelated things right now. We're pretty badly behind... OK. That patch is independent; just ignore it. I don't really care how it's named, as long as it makes clear that it's not an exact measurement. Not having heard any better suggestions, I picked pgstatapprox as a compromise between length and familiarity/consistency with pgstattuple. Should I count the space it would have free if it were initialised, but leave the page alone for VACUUM to deal with? And this is what the attached patch does. I also cleaned up a few things that I didn't like but had left alone to make the code look similar to pgstattuple. In particular, build_tuple() now does nothing but build a tuple from values calculated earlier in pgstatapprox_heap(). Thank you. -- Abhijit P.S. What, if anything, should be done about the complicated and likely not very useful skip-only-min#-blocks logic in lazy_scan_heap? From 0a99fc61d36e3a3ff4003db95e5c299a5f07a850 Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen a...@2ndquadrant.com Date: Fri, 26 Dec 2014 12:37:13 +0530 Subject: Add pgstatapprox to pgstattuple --- contrib/pgstattuple/Makefile | 4 +- contrib/pgstattuple/pgstatapprox.c | 277 + contrib/pgstattuple/pgstattuple--1.2--1.3.sql | 18 ++ .../{pgstattuple--1.2.sql = pgstattuple--1.3.sql} | 18 +- contrib/pgstattuple/pgstattuple.control| 2 +- doc/src/sgml/pgstattuple.sgml | 135 ++ 6 files changed, 450 insertions(+), 4 deletions(-) create mode 100644 contrib/pgstattuple/pgstatapprox.c create mode 100644 contrib/pgstattuple/pgstattuple--1.2--1.3.sql rename contrib/pgstattuple/{pgstattuple--1.2.sql = pgstattuple--1.3.sql} (72%) diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile index 862585c..6083dab 100644 --- a/contrib/pgstattuple/Makefile +++ b/contrib/pgstattuple/Makefile @@ -1,10 +1,10 @@ # contrib/pgstattuple/Makefile MODULE_big = pgstattuple -OBJS = pgstattuple.o pgstatindex.o $(WIN32RES) +OBJS = pgstattuple.o pgstatindex.o pgstatapprox.o $(WIN32RES) EXTENSION = pgstattuple -DATA = pgstattuple--1.2.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql +DATA = pgstattuple--1.3.sql pgstattuple--1.2--1.3.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql PGFILEDESC = pgstattuple - tuple-level statistics REGRESS = pgstattuple diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c new file mode 100644 index 000..46f5bc0 --- /dev/null +++ b/contrib/pgstattuple/pgstatapprox.c @@ -0,0 +1,277 @@ +/* + * contrib/pgstattuple/pgstatapprox.c + * Fast bloat estimation functions + */ + +#include postgres.h + +#include access/visibilitymap.h +#include access/transam.h +#include access/xact.h +#include access/multixact.h +#include access/htup_details.h +#include catalog/namespace.h +#include funcapi.h +#include miscadmin.h +#include storage/bufmgr.h +#include storage/freespace.h +#include storage/procarray.h +#include storage/lmgr.h +#include utils/builtins.h +#include utils/tqual.h +#include commands/vacuum.h + +PG_FUNCTION_INFO_V1(pgstatapprox); + +typedef struct output_type +{ + uint64 table_len; + uint64 scanned_percent; + uint64 tuple_count; + uint64 tuple_len; + double tuple_percent; + uint64 dead_tuple_count; + uint64 dead_tuple_len; + double dead_tuple_percent; + uint64 free_space; + double free_percent; +} output_type; + +#define NUM_OUTPUT_COLUMNS 10 + +/* + * This function takes an already open relation and scans its pages, + * skipping those that have the corresponding visibility map bit set. + * For pages we skip, we find the free space from the free space map + * and approximate tuple_len on that basis. For the others, we count + * the exact number of dead tuples etc. + * + * This scan is loosely based on vacuumlazy.c:lazy_scan_heap(), but + * we do not try to avoid skipping single pages. + */ + +static void +pgstatapprox_heap(Relation rel, output_type *stat) +{ + BlockNumber scanned, +nblocks, +blkno; + Buffer vmbuffer = InvalidBuffer; + BufferAccessStrategy bstrategy; + TransactionId OldestXmin; + uint64 misc_count = 0; + + OldestXmin = GetOldestXmin(rel, true); + bstrategy = GetAccessStrategy(BAS_BULKREAD); + + nblocks = RelationGetNumberOfBlocks(rel); + scanned = 0; + + for (blkno = 0; blkno nblocks; blkno++) + { + Buffer buf; + Page page; + OffsetNumber offnum, + maxoff; + Size freespace; + + CHECK_FOR_INTERRUPTS(); + + /* + * If the page has only visible tuples, then we can find out the + * free space from the FSM and move on. + */ + + if (visibilitymap_test(rel, blkno, vmbuffer)) + { + freespace = GetRecordedFreeSpace(rel, blkno); + stat-tuple_len += BLCKSZ - freespace; + stat-free_space += freespace; + continue; + } + + buf
Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)
At 2015-05-09 02:20:51 +0200, and...@anarazel.de wrote: + * Abhijit Menon-Sen a...@2ndquadrant.com + * Portions Copyright (c) 2001,2002Tatsuo Ishii (from pgstattuple) I think for new extension we don't really add authors and such anymore. OK, I'll get rid of the boilerplate. Hm. I do wonder if this should really be called 'statbloat'. Perhaps it'd more appropriately be called pg_estimate_bloat or somesuch? Since we've moved it into pgstattuple, perhaps pgstattuple_approximate() or pgstattuple_estimated() would be a better name. I don't really care, I'll change it to whatever people like. Also, is free_space really exact? The fsm isn't byte exact. You're right, I'll fix that. Why go through C strings? I personally think we really shouldn't add more callers to BuildTupleFromCStrings, it's such an absurd interface. I just copied this more or less blindly from pgstattuple. I'll fix it and submit a separate patch to fix the equivalent code in pgstattuple. I think it'd actually be fine to just say that the relation has to be a table or matview. Good point. Agreed. I don't believe that rationale is really true these days. I'd much rather just rip this out here than copy the rather complex logic. Yes, thanks, I very much agree that this isn't really worth copying. I'll drop it in my next submission. I haven't checked, but I'm not sure that it's safe/meaningful to call PageGetHeapFreeSpace() on a new page. OK, I'll check and fix if necessary. Thanks for the feedback. I'll submit a revised patch shortly. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
At 2015-05-01 09:57:28 -0400, robertmh...@gmail.com wrote: If you don't object to this version, I'll commit it. Looks fine to me, thank you. As for the non-backpatchable 0002, I agree with Andres that it should be included in 9.5; but I take it you're still not convinced? Should I add that to the CF separately for discussion, or what? -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
At 2015-05-01 08:10:16 -0400, robertmh...@gmail.com wrote: It seems to me that, at a minimum, it would be good to split those controversial and definitely not-back-patchable changes into their own patch. OK, split here (0002*). I do mind putting it into xlog.c instead of some place that's actually appropriate. OK, moved to storage/file/fd.c (0001*). -- Abhijit From 088b80eb0825339eca688e4347a4ef547edcadbb Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen a...@2ndquadrant.com Date: Thu, 6 Nov 2014 00:45:56 +0530 Subject: Recursively fsync PGDATA at startup after a crash This is so that we don't lose older unflushed writes in the event of a power failure after crash recovery, where more recent writes are preserved. See 20140918083148.ga17...@alap3.anarazel.de for details. --- src/backend/access/transam/xlog.c | 49 + src/backend/storage/file/fd.c | 112 ++ src/include/storage/fd.h | 2 + 3 files changed, 163 insertions(+) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 6cf4415..084174d 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -845,6 +845,8 @@ static void WALInsertLockAcquireExclusive(void); static void WALInsertLockRelease(void); static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt); +static void fsync_pgdata(char *datadir); + /* * Insert an XLOG record represented by an already-constructed chain of data * chunks. This is a low-level routine; to construct the WAL record header @@ -5878,6 +5880,25 @@ StartupXLOG(void) ereport(FATAL, (errmsg(control file contains invalid data))); + /* + * If we need to perform crash recovery, we issue an fsync on the + * data directory and its contents to try to ensure that any data + * written before the crash are flushed to disk. Otherwise a power + * failure in the near future might cause earlier unflushed writes + * to be lost, even though more recent data written to disk from + * here on would be persisted. + * + * We also do this if the control file indicates that fsync was + * disabled at some point while the server was running earlier. + */ + + if (enableFsync + (ControlFile-state != DB_SHUTDOWNED + ControlFile-state != DB_SHUTDOWNED_IN_RECOVERY)) + { + fsync_pgdata(data_directory); + } + if (ControlFile-state == DB_SHUTDOWNED) { /* This is the expected case, so don't be chatty in standalone mode */ @@ -11123,3 +11144,31 @@ SetWalWriterSleeping(bool sleeping) XLogCtl-WalWriterSleeping = sleeping; SpinLockRelease(XLogCtl-info_lck); } + +/* + * Issue fsync recursively on PGDATA and all its contents. + */ +static void +fsync_pgdata(char *datadir) +{ + if (!enableFsync) + return; + + /* + * If possible, hint to the kernel that we're soon going to fsync + * the data directory and its contents. + */ +#if defined(HAVE_SYNC_FILE_RANGE) || \ + (defined(USE_POSIX_FADVISE) defined(POSIX_FADV_DONTNEED)) + walkdir(datadir, pre_sync_fname); +#endif + + /* + * Now we do the fsync()s in the same order. + * + * It's important to fsync the destination directory itself as individual + * file fsyncs don't guarantee that the directory entry for the file is + * synced. + */ + walkdir(datadir, fsync_fname); +} diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index f796717..aba12ca 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -2439,3 +2439,115 @@ looks_like_temp_rel_name(const char *name) return false; return true; } + +/* + * Hint to the OS that it should get ready to fsync() this file. + * + * Adapted from pre_sync_fname in initdb.c + */ +void +pre_sync_fname(char *fname, bool isdir) +{ + int fd; + + fd = open(fname, O_RDONLY | PG_BINARY); + + /* + * Some OSs don't allow us to open directories at all (Windows returns + * EACCES) + */ + if (fd 0 isdir (errno == EISDIR || errno == EACCES)) + return; + + if (fd 0) + ereport(FATAL, +(errmsg(could not open file \%s\ before fsync, + fname))); + + pg_flush_data(fd, 0, 0); + + close(fd); +} + +/* + * walkdir: recursively walk a directory, applying the action to each + * regular file and directory (including the named directory itself) + * and following symbolic links. + */ +void +walkdir(char *path, void (*action) (char *fname, bool isdir)) +{ + DIR *dir; + struct dirent *de; + + dir = AllocateDir(path); + while ((de = ReadDir(dir, path)) != NULL) + { + char subpath[MAXPGPATH]; + struct stat fst; + + CHECK_FOR_INTERRUPTS(); + + if (strcmp(de-d_name, .) == 0 || + strcmp(de-d_name, ..) == 0) + continue; + + snprintf(subpath, MAXPGPATH, %s/%s, path, de-d_name); + + if (lstat(subpath, fst) 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg(could not stat file \%s\: %m, subpath))); + + if (S_ISREG(fst.st_mode)) + (*action) (subpath, false); + else if (S_ISDIR(fst.st_mode)) + walkdir
Re: [HACKERS] initdb -S and tablespaces
At 2015-04-30 15:37:44 -0400, robertmh...@gmail.com wrote: 1. It doesn't do that. As soon as we fsync the data directory, we reset the flag. That's not what ever disabled means to me. Could you suggest an acceptable alternative wording? I can't immediately think of anything better than disabled since the last restart. That is conditional on our resetting the flag, which we will do only if fsync is enabled at startup. So it's true, but not the whole truth. 2. I don't know why it's part of this patch. In 20150115133245.gg5...@awork2.anarazel.de, Andres explained his rationale as follows: «What I am thinking of is that, currently, if you start the server for initial loading with fsync=off, and then restart it, you're open to data loss. So when the current config file setting is changed from off to on, we should fsync the data directory. Even if there was no crash restart.» That's what I tried to implement. Also, it seems awfully unfortunate to me that we're duplicating a whole pile of code into xlog.c here. I have pointed out and discussed the duplication several times. I did it this way only because we were considering backporting the changes, and at the time it seemed better to do this and fix the duplication in a separate patch. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
At 2015-04-30 16:56:17 -0700, t...@sss.pgh.pa.us wrote: As for the notion that this needs to be back-patched, I would say no. Not even just the fsync after crash part? I could separate that out from the control file changes and try to eliminate the duplication. I think that would be worth back-patching, at least. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?
At 2015-01-30 21:36:42 +0100, and...@2ndquadrant.com wrote: Here's an alternative approach. I think it generally is superior and going in the right direction, but I'm not sure it's backpatchable. It basically consists out of: 1) Make GetLockConflicts() actually work. already commited as being a independent problem. 2) Allow the startup process to actually acquire locks other than AccessExclusiveLocks. There already is code acquiring other locks, but it's currently broken because they're acquired in blocking mode which isn't actually supported in the startup mode. Using this infrastructure we can actually fix a couple small races around database creation/drop. 3) Allow session locks during recovery to be heavier than RowExclusiveLock - since relation/object session locks aren't ever taken out by the user (without a plain lock first) that's safe. merged and submitted independently. 5) Make walsender actually react to recovery conflict interrupts submitted here. (0003) 4) Perform streaming base backups from within a transaction command, to provide error handling. 6) Prevent access to the template database of a CREATE DATABASE during WAL replay. 7) Add an interlock to prevent base backups and movedb() to run concurrently. What we actually came here for. combined, submitted here. (0004) I think this actually doesn't look that bad. Hi Andres. Do you intend to commit these patches? (I just verified that they apply without trouble to HEAD.) -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)
At 2015-04-24 07:22:27 +0530, amit.kapil...@gmail.com wrote: Few minor issues: 1. warning on windows \pgstatbloat.c(193): warning C4715: 'pgstatbloat' : not all control paths return a value This is because the function ends with an ereport(ERROR, …). What would you suggest here? Just stick a PG_RETURN_NULL() at the end? 2. Failed to install pgstattuple--1.3.sql You have to name sql file as pgstattuple--1.3.sql rather than pgstattuple--1.2.sql How are you applying the patch? It already contains this: diff --git a/contrib/pgstattuple/pgstattuple--1.2.sql b/contrib/pgstattuple/pgstattuple--1.3.sql similarity index 73% rename from contrib/pgstattuple/pgstattuple--1.2.sql rename to contrib/pgstattuple/pgstattuple--1.3.sql index e5fa2f5..93593ee 100644 --- a/contrib/pgstattuple/pgstattuple--1.2.sql +++ b/contrib/pgstattuple/pgstattuple--1.3.sql So it should do the right thing with git apply. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)
At 2015-04-24 08:35:40 +0530, amit.kapil...@gmail.com wrote: Just stick a PG_RETURN_NULL() at the end? That should also work. OK, updated patch attached with just that one change. I'm not doing anything about the rename. I don't know if it's possible to make patch(1) rename a file at all (it wasn't, but maybe something has changed recently?). Note to friendly neighbourhood committers: the patch is expected to rename pgstattuple--1.2.sql to pgstattuple--1.3.sql, which it does if applied using git apply. -- Abhijit From f67ddc68ab03fe8237058dcdca1e410ef9e95b5e Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen a...@2ndquadrant.com Date: Fri, 26 Dec 2014 12:37:13 +0530 Subject: Add pgstatbloat to pgstattuple --- contrib/pgstattuple/Makefile | 4 +- contrib/pgstattuple/pgstatbloat.c | 389 + contrib/pgstattuple/pgstattuple--1.2--1.3.sql | 18 + .../{pgstattuple--1.2.sql = pgstattuple--1.3.sql} | 18 +- contrib/pgstattuple/pgstattuple.control| 2 +- doc/src/sgml/pgstattuple.sgml | 135 +++ 6 files changed, 562 insertions(+), 4 deletions(-) create mode 100644 contrib/pgstattuple/pgstatbloat.c create mode 100644 contrib/pgstattuple/pgstattuple--1.2--1.3.sql rename contrib/pgstattuple/{pgstattuple--1.2.sql = pgstattuple--1.3.sql} (73%) diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile index 862585c..d7d27a5 100644 --- a/contrib/pgstattuple/Makefile +++ b/contrib/pgstattuple/Makefile @@ -1,10 +1,10 @@ # contrib/pgstattuple/Makefile MODULE_big = pgstattuple -OBJS = pgstattuple.o pgstatindex.o $(WIN32RES) +OBJS = pgstattuple.o pgstatindex.o pgstatbloat.o $(WIN32RES) EXTENSION = pgstattuple -DATA = pgstattuple--1.2.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql +DATA = pgstattuple--1.3.sql pgstattuple--1.2--1.3.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql PGFILEDESC = pgstattuple - tuple-level statistics REGRESS = pgstattuple diff --git a/contrib/pgstattuple/pgstatbloat.c b/contrib/pgstattuple/pgstatbloat.c new file mode 100644 index 000..612e22b --- /dev/null +++ b/contrib/pgstattuple/pgstatbloat.c @@ -0,0 +1,389 @@ +/* + * contrib/pgstattuple/pgstatbloat.c + * + * Abhijit Menon-Sen a...@2ndquadrant.com + * Portions Copyright (c) 2001,2002 Tatsuo Ishii (from pgstattuple) + * + * Permission to use, copy, modify, and distribute this software and + * its documentation for any purpose, without fee, and without a + * written agreement is hereby granted, provided that the above + * copyright notice and this paragraph and the following two + * paragraphs appear in all copies. + * + * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT, + * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING + * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS + * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED + * OF THE POSSIBILITY OF SUCH DAMAGE. + * + * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS ON AN AS + * IS BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE, + * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. + */ + +#include postgres.h + +#include access/visibilitymap.h +#include access/transam.h +#include access/xact.h +#include access/multixact.h +#include access/htup_details.h +#include catalog/namespace.h +#include funcapi.h +#include miscadmin.h +#include storage/bufmgr.h +#include storage/freespace.h +#include storage/procarray.h +#include storage/lmgr.h +#include utils/builtins.h +#include utils/tqual.h +#include commands/vacuum.h + +PG_FUNCTION_INFO_V1(pgstatbloat); + +/* + * tuple_percent, dead_tuple_percent and free_percent are computable, + * so not defined here. + */ +typedef struct pgstatbloat_output_type +{ + uint64 table_len; + uint64 tuple_count; + uint64 misc_count; + uint64 tuple_len; + uint64 dead_tuple_count; + uint64 dead_tuple_len; + uint64 free_space; + uint64 total_pages; + uint64 scanned_pages; +} pgstatbloat_output_type; + +static Datum build_output_type(pgstatbloat_output_type *stat, + FunctionCallInfo fcinfo); +static Datum pgstatbloat_heap(Relation rel, FunctionCallInfo fcinfo); + +/* + * build a pgstatbloat_output_type tuple + */ +static Datum +build_output_type(pgstatbloat_output_type *stat, FunctionCallInfo fcinfo) +{ +#define NCOLUMNS 10 +#define NCHARS 32 + + HeapTuple tuple; + char *values[NCOLUMNS]; + char values_buf[NCOLUMNS][NCHARS]; + int i; + double tuple_percent; + double dead_tuple_percent; + double free_percent; /* free/reusable space in % */ + double scanned_percent; + TupleDesc tupdesc; + AttInMetadata *attinmeta; + + /* Build a tuple descriptor for our result type
Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)
At 2015-04-18 12:28:36 +0530, amit.kapil...@gmail.com wrote: I think you have missed to address the below point: 4) prefetch Updated patch attached, as well as the diff against the earlier version just to make it easier to see the exact change I made (which is to copy the skip logic from lazy_scan_heap, as you suggested). I'm not entirely convinced that this change is worthwhile, but it's easy to make and difficult to argue against, so here it is. (I did test, and it seems to work OK after the change.) Amit, Tomas, thanks again for your review comments. -- Abhijit From 3edb5426292d6097cb66339b865e99bf4f766646 Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen a...@2ndquadrant.com Date: Fri, 26 Dec 2014 12:37:13 +0530 Subject: Add pgstatbloat to pgstattuple --- contrib/pgstattuple/Makefile | 4 +- contrib/pgstattuple/pgstatbloat.c | 387 + contrib/pgstattuple/pgstattuple--1.2--1.3.sql | 18 + .../{pgstattuple--1.2.sql = pgstattuple--1.3.sql} | 18 +- contrib/pgstattuple/pgstattuple.control| 2 +- doc/src/sgml/pgstattuple.sgml | 135 +++ 6 files changed, 560 insertions(+), 4 deletions(-) create mode 100644 contrib/pgstattuple/pgstatbloat.c create mode 100644 contrib/pgstattuple/pgstattuple--1.2--1.3.sql rename contrib/pgstattuple/{pgstattuple--1.2.sql = pgstattuple--1.3.sql} (73%) diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile index 862585c..d7d27a5 100644 --- a/contrib/pgstattuple/Makefile +++ b/contrib/pgstattuple/Makefile @@ -1,10 +1,10 @@ # contrib/pgstattuple/Makefile MODULE_big = pgstattuple -OBJS = pgstattuple.o pgstatindex.o $(WIN32RES) +OBJS = pgstattuple.o pgstatindex.o pgstatbloat.o $(WIN32RES) EXTENSION = pgstattuple -DATA = pgstattuple--1.2.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql +DATA = pgstattuple--1.3.sql pgstattuple--1.2--1.3.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql PGFILEDESC = pgstattuple - tuple-level statistics REGRESS = pgstattuple diff --git a/contrib/pgstattuple/pgstatbloat.c b/contrib/pgstattuple/pgstatbloat.c new file mode 100644 index 000..b19459a --- /dev/null +++ b/contrib/pgstattuple/pgstatbloat.c @@ -0,0 +1,387 @@ +/* + * contrib/pgstattuple/pgstatbloat.c + * + * Abhijit Menon-Sen a...@2ndquadrant.com + * Portions Copyright (c) 2001,2002 Tatsuo Ishii (from pgstattuple) + * + * Permission to use, copy, modify, and distribute this software and + * its documentation for any purpose, without fee, and without a + * written agreement is hereby granted, provided that the above + * copyright notice and this paragraph and the following two + * paragraphs appear in all copies. + * + * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT, + * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING + * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS + * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED + * OF THE POSSIBILITY OF SUCH DAMAGE. + * + * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS ON AN AS + * IS BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE, + * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. + */ + +#include postgres.h + +#include access/visibilitymap.h +#include access/transam.h +#include access/xact.h +#include access/multixact.h +#include access/htup_details.h +#include catalog/namespace.h +#include funcapi.h +#include miscadmin.h +#include storage/bufmgr.h +#include storage/freespace.h +#include storage/procarray.h +#include storage/lmgr.h +#include utils/builtins.h +#include utils/tqual.h +#include commands/vacuum.h + +PG_FUNCTION_INFO_V1(pgstatbloat); + +/* + * tuple_percent, dead_tuple_percent and free_percent are computable, + * so not defined here. + */ +typedef struct pgstatbloat_output_type +{ + uint64 table_len; + uint64 tuple_count; + uint64 misc_count; + uint64 tuple_len; + uint64 dead_tuple_count; + uint64 dead_tuple_len; + uint64 free_space; + uint64 total_pages; + uint64 scanned_pages; +} pgstatbloat_output_type; + +static Datum build_output_type(pgstatbloat_output_type *stat, + FunctionCallInfo fcinfo); +static Datum pgstatbloat_heap(Relation rel, FunctionCallInfo fcinfo); + +/* + * build a pgstatbloat_output_type tuple + */ +static Datum +build_output_type(pgstatbloat_output_type *stat, FunctionCallInfo fcinfo) +{ +#define NCOLUMNS 10 +#define NCHARS 32 + + HeapTuple tuple; + char *values[NCOLUMNS]; + char values_buf[NCOLUMNS][NCHARS]; + int i; + double tuple_percent; + double dead_tuple_percent; + double free_percent; /* free/reusable space in % */ + double scanned_percent; + TupleDesc tupdesc; + AttInMetadata *attinmeta; + + /* Build
Re: [HACKERS] initdb -S and tablespaces
Hi. Here's a variation of the earlier patch that follows all links in PGDATA. Does this look more like what you had in mind? -- Abhijit From d86888b0d2f5a3a57027d26ce050a3bbb58670d3 Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen a...@2ndquadrant.com Date: Thu, 6 Nov 2014 00:45:56 +0530 Subject: Recursively fsync PGDATA at startup if needed It's needed if we need to perform crash recovery or if fsync was disabled at some point while the server was running earlier (and we must store that information in the control file). This is so that we don't lose older unflushed writes in the event of a power failure after crash recovery, where more recent writes are preserved. See 20140918083148.ga17...@alap3.anarazel.de for details. --- src/backend/access/transam/xlog.c | 168 src/bin/pg_controldata/pg_controldata.c | 2 + src/include/catalog/pg_control.h| 8 +- 3 files changed, 177 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 2580996..263d2d0 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -835,6 +835,8 @@ static void WALInsertLockAcquireExclusive(void); static void WALInsertLockRelease(void); static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt); +static void fsync_pgdata(char *datadir); + /* * Insert an XLOG record represented by an already-constructed chain of data * chunks. This is a low-level routine; to construct the WAL record header @@ -4811,6 +4813,7 @@ BootStrapXLOG(void) ControlFile-checkPoint = checkPoint.redo; ControlFile-checkPointCopy = checkPoint; ControlFile-unloggedLSN = 1; + ControlFile-fsync_disabled = false; /* Set important parameter values for use when replaying WAL */ ControlFile-MaxConnections = MaxConnections; @@ -5868,6 +5871,27 @@ StartupXLOG(void) ereport(FATAL, (errmsg(control file contains invalid data))); + /* + * If we need to perform crash recovery, we issue an fsync on the + * data directory and its contents to try to ensure that any data + * written before the crash are flushed to disk. Otherwise a power + * failure in the near future might cause earlier unflushed writes + * to be lost, even though more recent data written to disk from + * here on would be persisted. + * + * We also do this if the control file indicates that fsync was + * disabled at some point while the server was running earlier. + */ + + if (enableFsync + (ControlFile-fsync_disabled || + (ControlFile-state != DB_SHUTDOWNED + ControlFile-state != DB_SHUTDOWNED_IN_RECOVERY))) + { + fsync_pgdata(data_directory); + ControlFile-fsync_disabled = false; + } + if (ControlFile-state == DB_SHUTDOWNED) { /* This is the expected case, so don't be chatty in standalone mode */ @@ -8236,6 +8260,8 @@ CreateCheckPoint(int flags) /* crash recovery should always recover to the end of WAL */ ControlFile-minRecoveryPoint = InvalidXLogRecPtr; ControlFile-minRecoveryPointTLI = 0; + if (!enableFsync) + ControlFile-fsync_disabled = true; /* * Persist unloggedLSN value. It's reset on crash recovery, so this goes @@ -11107,3 +11133,145 @@ SetWalWriterSleeping(bool sleeping) XLogCtl-WalWriterSleeping = sleeping; SpinLockRelease(XLogCtl-info_lck); } + +/* + * Hint to the OS that it should get ready to fsync() this file. + * + * Adapted from pre_sync_fname in initdb.c + */ +static void +pre_sync_fname(char *fname, bool isdir) +{ + int fd; + + fd = open(fname, O_RDONLY | PG_BINARY); + + /* + * Some OSs don't allow us to open directories at all (Windows returns + * EACCES) + */ + if (fd 0 isdir (errno == EISDIR || errno == EACCES)) + return; + + if (fd 0) + ereport(FATAL, +(errmsg(could not open file \%s\ before fsync, + fname))); + + pg_flush_data(fd, 0, 0); + + close(fd); +} + +/* + * walkdir: recursively walk a directory, applying the action to each + * regular file and directory (including the named directory itself) + * and following symbolic links. + */ +static void +walkdir(char *path, void (*action) (char *fname, bool isdir)) +{ + DIR *dir; + struct dirent *de; + + dir = AllocateDir(path); + while ((de = ReadDir(dir, path)) != NULL) + { + char subpath[MAXPGPATH]; + struct stat fst; + + CHECK_FOR_INTERRUPTS(); + + if (strcmp(de-d_name, .) == 0 || + strcmp(de-d_name, ..) == 0) + continue; + + snprintf(subpath, MAXPGPATH, %s/%s, path, de-d_name); + + if (lstat(subpath, fst) 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg(could not stat file \%s\: %m, subpath))); + + if (S_ISREG(fst.st_mode)) + (*action) (subpath, false); + else if (S_ISDIR(fst.st_mode)) + walkdir(subpath, action); +#ifndef WIN32 + else if (S_ISLNK(fst.st_mode)) +#else + else if (pg_win32_is_junction(subpath)) +#endif + { +#if defined(HAVE_READLINK) || defined(WIN32) + char linkpath[MAXPGPATH]; + int len; + struct stat lst; + + len
Re: [HACKERS] initdb -S and tablespaces
At 2015-04-06 10:16:36 -0300, alvhe...@2ndquadrant.com wrote: Well, we have many things that can be set as symlinks; some you can have initdb create the links for you (initdb --xlogdir), others you can move manually. Looking at sendDir() in backend/replication/basebackup.c, I notice that the only places where it expects/allows symlinks are for pg_xlog and in pg_tblspc. Still, thanks to the example in basebackup.c, I've got most of a patch that should follow any symlinks in PGDATA. I just need to test a little more before I post it. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
At 2015-04-15 11:40:34 +0530, a...@2ndquadrant.com wrote: Still, thanks to the example in basebackup.c, I've got most of a patch that should follow any symlinks in PGDATA. I notice that src/bin/pg_rewind/copy_fetch.c has a traverse_datadir() function that does what we want (but it recurses into symlinks only inside pg_tblspc). -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
Hi Álvaro. Thanks for taking a look at the patch. At 2015-04-03 13:32:32 -0300, alvhe...@2ndquadrant.com wrote: But then, since the name is already telling us that it's all about PGDATA, maybe we can remove the recursively part of the name? Sure, that makes sense too. Since you and Andres both like «fsync_pgdata(data_directory)», I'll change it accordingly. I also noticed that walkdir() thinks it is completely agnostic on what the passed action is; except that the comment at the bottom talks about how fsync on directories is important, which seems out of place. Yes. The function behaves as documented, but the comment is clearly too specific. I'm not sure where to put it. I could make walkdir() NOT do it, and instead do it in the caller with the same comment. Thoughts? Hmm ... Actually, since surely we must follow symlinks everywhere, why do we have to do this separately for pg_tblspc? Shouldn't that link-following occur automatically when walking PGDATA in the first place? I'm not sure about that (and that's why I've not attached an updated patch here). The original idea was to follow only those links that we expect to be in PGDATA. I suppose it would be easier in terms of the code to follow all links, but I don't know if it's the right thing. If that's what you think we should do, I can post a simplified patch. Maybe fsync_recursively should Assert() that it's only being called with enableFsync=on; or perhaps we can have it return early if it's unset. I prefer the latter. Will change. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)
At 2015-03-31 22:43:49 +0530, a...@2ndquadrant.com wrote: I'm just posting this WIP patch where I've renamed fastbloat to pgstatbloat as suggested by Tomas, and added in the documentation, and so on. Here's the revised version that also adds the count of RECENTLY_DEAD and INSERT/DELETE_IN_PROGRESS tuples to the call to vac_estimate_reltuples. Tomas, I agree that the build_output_tuple function could use cleaning up, but it's the same as the corresponding code in pgstattuple, and it seems to me reasonable to clean both up together in a separate patch. -- Abhijit From d11b84627438fca12955dfad77042be0a27c9acc Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen a...@2ndquadrant.com Date: Fri, 26 Dec 2014 12:37:13 +0530 Subject: Add pgstatbloat to pgstattuple --- contrib/pgstattuple/Makefile | 4 +- contrib/pgstattuple/pgstatbloat.c | 340 + contrib/pgstattuple/pgstattuple--1.2--1.3.sql | 18 ++ .../{pgstattuple--1.2.sql = pgstattuple--1.3.sql} | 18 +- contrib/pgstattuple/pgstattuple.control| 2 +- doc/src/sgml/pgstattuple.sgml | 135 6 files changed, 513 insertions(+), 4 deletions(-) create mode 100644 contrib/pgstattuple/pgstatbloat.c create mode 100644 contrib/pgstattuple/pgstattuple--1.2--1.3.sql rename contrib/pgstattuple/{pgstattuple--1.2.sql = pgstattuple--1.3.sql} (73%) diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile index 862585c..d7d27a5 100644 --- a/contrib/pgstattuple/Makefile +++ b/contrib/pgstattuple/Makefile @@ -1,10 +1,10 @@ # contrib/pgstattuple/Makefile MODULE_big = pgstattuple -OBJS = pgstattuple.o pgstatindex.o $(WIN32RES) +OBJS = pgstattuple.o pgstatindex.o pgstatbloat.o $(WIN32RES) EXTENSION = pgstattuple -DATA = pgstattuple--1.2.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql +DATA = pgstattuple--1.3.sql pgstattuple--1.2--1.3.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql PGFILEDESC = pgstattuple - tuple-level statistics REGRESS = pgstattuple diff --git a/contrib/pgstattuple/pgstatbloat.c b/contrib/pgstattuple/pgstatbloat.c new file mode 100644 index 000..2dd1900 --- /dev/null +++ b/contrib/pgstattuple/pgstatbloat.c @@ -0,0 +1,340 @@ +/* + * contrib/pgstattuple/pgstatbloat.c + * + * Abhijit Menon-Sen a...@2ndquadrant.com + * Portions Copyright (c) 2001,2002 Tatsuo Ishii (from pgstattuple) + * + * Permission to use, copy, modify, and distribute this software and + * its documentation for any purpose, without fee, and without a + * written agreement is hereby granted, provided that the above + * copyright notice and this paragraph and the following two + * paragraphs appear in all copies. + * + * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT, + * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING + * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS + * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED + * OF THE POSSIBILITY OF SUCH DAMAGE. + * + * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS ON AN AS + * IS BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE, + * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. + */ + +#include postgres.h + +#include access/visibilitymap.h +#include access/transam.h +#include access/xact.h +#include access/multixact.h +#include access/htup_details.h +#include catalog/namespace.h +#include funcapi.h +#include miscadmin.h +#include storage/bufmgr.h +#include storage/freespace.h +#include storage/procarray.h +#include storage/lmgr.h +#include utils/builtins.h +#include utils/tqual.h +#include commands/vacuum.h + +PG_FUNCTION_INFO_V1(pgstatbloat); + +/* + * tuple_percent, dead_tuple_percent and free_percent are computable, + * so not defined here. + */ +typedef struct pgstatbloat_output_type +{ + uint64 table_len; + uint64 tuple_count; + uint64 misc_count; + uint64 tuple_len; + uint64 dead_tuple_count; + uint64 dead_tuple_len; + uint64 free_space; + uint64 total_pages; + uint64 scanned_pages; +} pgstatbloat_output_type; + +static Datum build_output_type(pgstatbloat_output_type *stat, + FunctionCallInfo fcinfo); +static Datum pgstatbloat_heap(Relation rel, FunctionCallInfo fcinfo); + +/* + * build a pgstatbloat_output_type tuple + */ +static Datum +build_output_type(pgstatbloat_output_type *stat, FunctionCallInfo fcinfo) +{ +#define NCOLUMNS 10 +#define NCHARS 32 + + HeapTuple tuple; + char *values[NCOLUMNS]; + char values_buf[NCOLUMNS][NCHARS]; + int i; + double tuple_percent; + double dead_tuple_percent; + double free_percent; /* free/reusable space in % */ + double scanned_percent; + TupleDesc tupdesc; + AttInMetadata *attinmeta; + + /* Build a tuple
Re: [HACKERS] What exactly is our CRC algorithm?
At 2015-03-25 19:18:51 +0200, hlinn...@iki.fi wrote: I think we'll need a version check there. […] You want to write that or should I? I'm not familiar with MSVC at all, so it would be nice if you did it. How do you like this latest version of the patch otherwise? I'm sorry, but I'm still not especially fond of it. Apart from removing the startup check so that client programs can also use the best available CRC32C implementation without jumping through hoops, I don't feel that the other changes buy us very much. Also, assuming that the point is that people who don't care about CRCs deeply should nevertheless be able to produce special-purpose binaries with only the optimal implementation included, we should probably have some instructions about how to do that. Thinking about what you were trying to do, I would find an arrangement roughly like the following to be clearer to follow in terms of adding new implementations and so on: #if defined(USE_CRC_SSE42) || …can build SSE4.2 CRC code… #define HAVE_CRC_SSE42 1 pg_crc32c_sse42() { … } bool sse42_crc32c_available() { … } pg_comp_crc32c = pg_crc32c_sse42; #elif defined(USE_CRC_ARM) || …can build ARM CRC code… #define HAVE_CRC_ARM 1 pg_crc32c_arm() { … } bool arm_crc32c_available() { … } pg_comp_crc32c = pg_crc32c_arm; #endif #define CRC_SELECTION 1 #if defined(USE_CRC_SSE42) || defined(USE_CRC_ARM) || … #undef CRC_SELECTION #endif #ifdef CRC_SELECTION pg_crc32c_sb8() { … } pg_comp_crc32c_choose(…) { pg_comp_crc32c = pg_crc32c_sb8; #ifdef HAVE_CRC_SSE42 if (sse42_crc32c_available()) pg_comp_crc32c = pg_crc32c_sse42; #elif … … #endif return pg_comp_crc32c(…); } pg_comp_crc32c = pg_crc32c_choose; #endif Then someone who wants to force the building of (only) the SSE4.2 implementation can build with -DUSE_CRC_SSE42. And if you turn on USE_CRC_ARM when you can't build ARM code, it won't build. (The HAVE_CRC_xxx #defines could also move to configure tests.) If you don't specify any USE_CRC_xxx explicitly, then it'll build whichever (probably) one it can, and select it at runtime if it's available. All that said, I do recognise that there are all relatively cosmetic concerns, and I don't object seriously to any of it. On the contrary, thanks for taking the time to review and work on the patch. Nobody else has expressed an opinion, so I'll leave it to you to decide whether to commit as-is, or if you want me to pursue the above approach instead. In the realm of very minor nitpicking, here are a couple of points I noticed in crc_instructions.h: 1. I think «if (pg_crc32_instructions_runtime_check())» would read better if the function were named crc32c_instructions_available or pg_crc32c_is_hw_accelerated, or something like that. 2. It documents pg_accumulate_crc32c_byte and pg_accumulate_crc32c_uint64, but actually defines pg_asm_crc32b and pg_asm_crc32q. If you update the code rather than the documentation, _update_ may be slightly preferable to _accumulate_, and maybe the suffixes should be _uint8 and _uint64. 3. The descriptions (e.g. Add one byte to the current crc value.) should also probably read Update the current crc value for the given byte/eight bytes of data. Thanks. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] What exactly is our CRC algorithm?
At 2015-04-03 00:33:10 +0300, hlinn...@iki.fi wrote: I came up with the attached. I like it very much. src/port/Makefile has (note src/srv): +# pg_crc32c_sse42.o and its _src.o version need CFLAGS_SSE42 +pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_SSE42) +pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_SSE42) Other than that, this looks great. Thank you. BTW, we might want to move the traditional and legacy crc32 implementations out of src/common. They are only used in backend code now. I agree. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] What exactly is our CRC algorithm?
At 2015-04-02 17:58:23 +0300, hlinn...@iki.fi wrote: We're only using inline assembly to force producing SSE 4.2 code, even when -msse4.2 is not used. That feels wrong. Why? It feels OK to me (and to Andres, per earlier discussions about exactly this topic). Doing it this way allows the binary to run on a non-SSE4.2 platform (and not use the CRC instructions). Also, -msse4.2 was added to the compiler later than support for the instructions was added to the assembler. We have a buildfarm animal that still uses gcc 2.95.3, which was released in 2001. I don't have a compiler of that vintage to test with, but I assume an old enough assembler would not know about the crc32q instruction and fail to compile. GCC from 2002 wouldn't support the symbolic operand names in inline assembly. binutils from 2007 (IIRC) wouldn't support the assembler instructions themselves. We could work around the latter by using the appropriate sequence of bytes. We could work around the former by using the old syntax for operands. I believe the GCC way to do this would be to put the SSE4.2-specific code into a separate source file, and compile that file with -msse4.2. And when you compile with -msse4.2, gcc actually also supports the _mm_crc32_u8/u64 intrinsics. I have no objection to this. Building only that file with -msse4.2 would resolve the problem of the output binary requiring SSE4.2; and the only compilers to be excluded are old enough to be uninteresting at least to me personally. Have you done/are you doing this already, or do you want me to? I could use advice on how to add build flags to only one file, since I don't know of any precendent for that. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)
Hi. I'm just posting this WIP patch where I've renamed fastbloat to pgstatbloat as suggested by Tomas, and added in the documentation, and so on. I still have to incorporate Amit's comments about the estimation of reltuples according to the way vacuum does it, and I expect to post that tomorrow (I just need to test a little more). In the meantime, if anyone else was having trouble installing the extension due to the incorrect version in the control file, this is the patch you should be using. -- Abhijit From f809e070e8ea13b74c6206ca67a7eaf2a32e60fa Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen a...@2ndquadrant.com Date: Fri, 26 Dec 2014 12:37:13 +0530 Subject: Add pgstatbloat to pgstattuple --- contrib/pgstattuple/Makefile | 4 +- contrib/pgstattuple/pgstatbloat.c | 346 + contrib/pgstattuple/pgstattuple--1.2--1.3.sql | 18 ++ .../{pgstattuple--1.2.sql = pgstattuple--1.3.sql} | 18 +- contrib/pgstattuple/pgstattuple.control| 2 +- doc/src/sgml/pgstattuple.sgml | 135 6 files changed, 519 insertions(+), 4 deletions(-) create mode 100644 contrib/pgstattuple/pgstatbloat.c create mode 100644 contrib/pgstattuple/pgstattuple--1.2--1.3.sql rename contrib/pgstattuple/{pgstattuple--1.2.sql = pgstattuple--1.3.sql} (73%) diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile index 862585c..d7d27a5 100644 --- a/contrib/pgstattuple/Makefile +++ b/contrib/pgstattuple/Makefile @@ -1,10 +1,10 @@ # contrib/pgstattuple/Makefile MODULE_big = pgstattuple -OBJS = pgstattuple.o pgstatindex.o $(WIN32RES) +OBJS = pgstattuple.o pgstatindex.o pgstatbloat.o $(WIN32RES) EXTENSION = pgstattuple -DATA = pgstattuple--1.2.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql +DATA = pgstattuple--1.3.sql pgstattuple--1.2--1.3.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql PGFILEDESC = pgstattuple - tuple-level statistics REGRESS = pgstattuple diff --git a/contrib/pgstattuple/pgstatbloat.c b/contrib/pgstattuple/pgstatbloat.c new file mode 100644 index 000..15c2cb9 --- /dev/null +++ b/contrib/pgstattuple/pgstatbloat.c @@ -0,0 +1,346 @@ +/* + * contrib/pgstattuple/pgstatbloat.c + * + * Abhijit Menon-Sen a...@2ndquadrant.com + * Portions Copyright (c) 2001,2002 Tatsuo Ishii (from pg_stattuple) + * + * Permission to use, copy, modify, and distribute this software and + * its documentation for any purpose, without fee, and without a + * written agreement is hereby granted, provided that the above + * copyright notice and this paragraph and the following two + * paragraphs appear in all copies. + * + * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT, + * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING + * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS + * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED + * OF THE POSSIBILITY OF SUCH DAMAGE. + * + * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS ON AN AS + * IS BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE, + * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. + */ + +#include postgres.h + +#include access/visibilitymap.h +#include access/transam.h +#include access/xact.h +#include access/multixact.h +#include access/htup_details.h +#include catalog/namespace.h +#include funcapi.h +#include miscadmin.h +#include storage/bufmgr.h +#include storage/freespace.h +#include storage/procarray.h +#include storage/lmgr.h +#include utils/builtins.h +#include utils/tqual.h +#include commands/vacuum.h + +PG_FUNCTION_INFO_V1(pgstatbloat); + +/* + * tuple_percent, dead_tuple_percent and free_percent are computable, + * so not defined here. + */ +typedef struct pgstatbloat_output_type +{ + uint64 table_len; + uint64 tuple_count; + uint64 tuple_len; + uint64 dead_tuple_count; + uint64 dead_tuple_len; + uint64 free_space; + uint64 total_pages; + uint64 scanned_pages; +} pgstatbloat_output_type; + +static Datum build_output_type(pgstatbloat_output_type *stat, + FunctionCallInfo fcinfo); +static Datum fbstat_relation(Relation rel, FunctionCallInfo fcinfo); +static Datum fbstat_heap(Relation rel, FunctionCallInfo fcinfo); + +/* + * build a pgstatbloat_output_type tuple + */ +static Datum +build_output_type(pgstatbloat_output_type *stat, FunctionCallInfo fcinfo) +{ +#define NCOLUMNS 10 +#define NCHARS 32 + + HeapTuple tuple; + char *values[NCOLUMNS]; + char values_buf[NCOLUMNS][NCHARS]; + int i; + double tuple_percent; + double dead_tuple_percent; + double free_percent; /* free/reusable space in % */ + double scanned_percent; + TupleDesc tupdesc; + AttInMetadata *attinmeta; + + /* Build a tuple descriptor for our
Re: [HACKERS] Auditing extension for PostgreSQL (Take 2)
At 2015-02-24 11:22:41 -0500, da...@pgmasters.net wrote: Patch v3 is attached. […] +/* Log class enum used to represent bits in auditLogBitmap */ +enum LogClass +{ + LOG_NONE = 0, + + /* SELECT */ + LOG_READ = (1 0), + + /* INSERT, UPDATE, DELETE, TRUNCATE */ + LOG_WRITE = (1 1), + + /* DDL: CREATE/DROP/ALTER */ + LOG_DDL = (1 2), + + /* Function execution */ + LOG_FUNCTION = (1 4), + + /* Function execution */ + LOG_MISC = (1 5), + + /* Absolutely everything */ + LOG_ALL = ~(uint64)0 +}; The comment above LOG_MISC should be changed. More fundamentally, this classification makes it easy to reuse LOGSTMT_* (and a nice job you've done of that, with just a few additional special cases), I don't think this level is quite enough for our needs. I think it should at least be possible to specifically log commands that affect privileges and roles. I'm fond of finer categorisation for DDL as well, but I could live with all DDL being lumped together. I'm experimenting with a few approaches to do this without reintroducing switch statements to test every command. That will require core changes, but I think we can find an acceptable arrangement. I'll post a proof of concept in a few days. + * Takes an AuditEvent and, if it log_check(), writes it to the audit log. I don't think log_check is the most useful name, because this sentence doesn't tell me what the function may do. Similarly, I would prefer to have log_acl_check be renamed acl_grants_audit or similar. (These are all static functions anyway, I don't think a log_ prefix is needed.) + /* Free the column set */ + bms_free(tmpSet); (An aside, really: there are lots of comments like this, which I don't think add anything to understanding the code, and should be removed.) + /* + * We don't have access to the parsetree here, so we have to generate + * the node type, object type, and command tag by decoding + * rte-requiredPerms and rte-relkind. + */ + auditEvent.logStmtLevel = LOGSTMT_MOD; (I am also trying to find a way to avoid having to do this.) + /* Set object type based on relkind */ + switch (class-relkind) + { + case RELKIND_RELATION: + utilityAuditEvent.objectType = OBJECT_TYPE_TABLE; + break; This occurs elsewhere too. But I suppose new relkinds are added less frequently than new commands. Again on a larger level, I'm not sure how I feel about _avoiding_ the use of event triggers for audit logging. Regardless of whether we use the deparse code (which I personally think is a good idea; Álvaro has been working on it, and it looks very nice) to log extra information, using the object access hook inevitably means we have to reimplement the identification/classification code here. In old pgaudit, I think that extra effort is justified by the need to be backwards compatible with pre-event trigger releases. In a 9.5-only version, I am not at all convinced that this makes sense. Thoughts? -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 authentication needs help -SCRAM
At 2015-03-14 09:44:02 +0200, hlinn...@iki.fi wrote: Perhaps it would be time to restart the discussion on standardizing SRP as a SASL mechanism in IETF. I haven't seen much evidence that there's any interest in doing this; in fact, I can't remember the author of the draft you pointed to being very active in the discussions either. Assume that the connection is not encrypted, and Eve captures the SCRAM handshake between Alice and Bob. Using the captured handshake, she can try to guess the password, offline. With a PAKE protocol, she cannot do that. OK. I agree that this is a nice property. SCRAM made the design decision to hinder such attacks by using PBKDF2 rather than a zero-knowledge key exchange mechanism as SRP does. This was partly due to the trend that I mentioned of wanting to require TLS everywhere. I'm obviously biased in this matter, but I think it's acceptable for the potential attack to be frustrated by the use of PBKDF2 and defeated by the use of TLS (which is already possible with Postgres); and that in the balance, SCRAM is easier to implement securely than SRP. Of course, if you want to use x as your password everywhere, then SRP is preferable. ;-) -- Abhijit P.S. I don't know why the SRP code was removed from LibreSSL; nor am I sure how seriously to take that. It's possible that it's only because it's (still) rather obscure. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 authentication needs help -SCRAM
As a followup, I spoke to an IETF friend who's used and implemented both SRP and SCRAM. He agrees that SRP is cryptographically solid, that it's significantly more difficult to implement (and therefore has a bit of a monoculture risk overall, though of course that wouldn't apply to us if we were to write the code from scratch). Apparently the patent status is still not entirely clear. Two of the patents expired, but there are others that may be relevant. Stanford claims a patent, but apparently grant a free license if you do meet certain conditions. But he doesn't know of anyone having to go to court over the use of SRP. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
At 2015-01-15 14:32:45 +0100, and...@2ndquadrant.com wrote: What I am thinking of is that, currently, if you start the server for initial loading with fsync=off, and then restart it, you're open to data loss. So when the current config file setting is changed from off to on, we should fsync the data directory. Even if there was no crash restart. Patch attached. Changes: 1. Renamed perform_fsync to fsync_recursively (otherwise it would read fsync_pgdata(pg_data)) 2. Added ControlData-fsync_disabled to record whether fsync was ever disabled while the server was running (tested in CreateCheckPoint) 3. Run fsync_recursively at startup only if fsync is enabled 4. Run it if we're doing crash recovery, or fsync was disabled 5. Use pg_flush_data in pre_sync_fname 6. Issue fsync on directories too 7. Tested that it works if pg_xlog is a symlink (no changes). (In short, everything you mentioned in your earlier mail.) Note that I set ControlData-fsync_disabled to false in BootstrapXLOG, but it gets set to true during a later CreateCheckPoint(). This means we run fsync again at startup after initdb. I'm not sure what to do about that. Is this about what you had in mind? -- Abhijit From bb2b5f130525dd44a10eab6829b9802b8f6f7eed Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen a...@2ndquadrant.com Date: Thu, 6 Nov 2014 00:45:56 +0530 Subject: Recursively fsync PGDATA at startup if needed It's needed if we need to perform crash recovery or if fsync was disabled at some point while the server was running earlier (and we must store that information in the control file). This is so that we don't lose older unflushed writes in the event of a power failure after crash recovery, where more recent writes are preserved. See 20140918083148.ga17...@alap3.anarazel.de for details. --- src/backend/access/transam/xlog.c | 171 src/bin/pg_controldata/pg_controldata.c | 2 + src/include/catalog/pg_control.h| 8 +- 3 files changed, 180 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 71cbe0e..75a6862 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -833,6 +833,12 @@ static void WALInsertLockAcquireExclusive(void); static void WALInsertLockRelease(void); static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt); +static void pre_sync_fname(char *fname, bool isdir); +static void walkdir(char *path, void (*action) (char *fname, bool isdir)); +static void walktblspc_links(char *path, + void (*action) (char *fname, bool isdir)); +static void fsync_recursively(char *pg_data); + /* * Insert an XLOG record represented by an already-constructed chain of data * chunks. This is a low-level routine; to construct the WAL record header @@ -4721,6 +4727,7 @@ BootStrapXLOG(void) ControlFile-checkPoint = checkPoint.redo; ControlFile-checkPointCopy = checkPoint; ControlFile-unloggedLSN = 1; + ControlFile-fsync_disabled = false; /* Set important parameter values for use when replaying WAL */ ControlFile-MaxConnections = MaxConnections; @@ -5787,6 +5794,27 @@ StartupXLOG(void) ereport(FATAL, (errmsg(control file contains invalid data))); + /* + * If we need to perform crash recovery, we issue an fsync on the + * data directory and its contents to try to ensure that any data + * written before the crash are flushed to disk. Otherwise a power + * failure in the near future might cause earlier unflushed writes + * to be lost, even though more recent data written to disk from + * here on would be persisted. + * + * We also do this if the control file indicates that fsync was + * disabled at some point while the server was running earlier. + */ + + if (enableFsync + (ControlFile-fsync_disabled || + (ControlFile-state != DB_SHUTDOWNED + ControlFile-state != DB_SHUTDOWNED_IN_RECOVERY))) + { + fsync_recursively(data_directory); + ControlFile-fsync_disabled = false; + } + if (ControlFile-state == DB_SHUTDOWNED) { /* This is the expected case, so don't be chatty in standalone mode */ @@ -8137,6 +8165,8 @@ CreateCheckPoint(int flags) /* crash recovery should always recover to the end of WAL */ ControlFile-minRecoveryPoint = InvalidXLogRecPtr; ControlFile-minRecoveryPointTLI = 0; + if (!enableFsync) + ControlFile-fsync_disabled = true; /* * Persist unloggedLSN value. It's reset on crash recovery, so this goes @@ -11008,3 +11038,144 @@ SetWalWriterSleeping(bool sleeping) XLogCtl-WalWriterSleeping = sleeping; SpinLockRelease(XLogCtl-info_lck); } + +/* + * Hint to the OS that it should get ready to fsync() this file. + * + * Adapted from pre_sync_fname in initdb.c + */ +static void +pre_sync_fname(char *fname, bool isdir) +{ + int fd; + + fd = open(fname, O_RDONLY | PG_BINARY); + + /* + * Some OSs don't allow us to open directories at all (Windows returns + * EACCES) + */ + if (fd 0 isdir
Re: [HACKERS] MD5 authentication needs help -SCRAM
At 2015-03-09 13:52:10 +0200, hlinn...@iki.fi wrote: Do you have any insight on why the IETF working group didn't choose a PAKE protocol instead of or in addition to SCRAM, when SCRAM was standardized? Hi Heikki. It was a long time ago, but I recall that SRP was patent-encumbered: https://datatracker.ietf.org/ipr/search/?rfc=2945submit=rfc The Wikipedia page says the relevant patents expired in 2011 and 2013. I haven't followed SRP development since then, maybe it's been revised. When SCRAM was being discussed, I can't recall any other proposals for PAKE protocols. Besides, as you may already know, anyone can submit an internet-draft about anything. It needs to gain general support for an extended period in order to advance through the standards process. Could you please explain what exactly you mean about a SCRAM eavesdropper gaining some advantage in being able to mount a dictionary attack? I didn't follow that part. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 authentication needs help -SCRAM
At 2015-03-08 12:48:44 -0700, j...@agliodbs.com wrote: Since SCRAM has been brought up a number of times here, I thought I'd loop in the PostgreSQL contributor who is co-author of the SCRAM standard to see if he has anything to say about implementing SCRAM as a built-in auth method for Postgres. I think it's a good idea. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL
At 2015-02-17 13:01:46 -0500, sfr...@snowman.net wrote: I have to admit that I'm confused by this. Patches don't stabilise through sitting in the archives, they stabilise when the comments are being addressed, the patch updated, and further comments are addressing less important issues. The issues which Robert and I had both commented on didn't appear to be getting addressed. I'm confused and unhappy about your characterisation of the state of this patch. You make it seem as though there was broad consensus about the changes that were needed, and that I left the patch sitting in the archives for a long time without addressing important issues. You revised and refined your GRANT proposal in stages, and I tried to adapt the code to suit. I'm sorry that my efforts were not fast enough or responsive enough to make you feel that progress was being made. But nobody else commented in detail on the GRANT changes except to express general misgivings, and nobody else even disagreed when I inferred, in light of the lack of consensus that Robert pointed out, that the code was unlikely to make it into 9.5. Given that I've maintained the code over the past year despite its being rejected in an earlier CF, and given the circumstances outlined above, I do not think it's reasonable to conclude after a couple of weeks without a new version that it was abandoned. As I had mentioned earlier, there are people who already use pgaudit as-is, and complain if I break it. Anyway, I guess there is no such thing as a constructive history discussion, so I'll drop it. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL
At 2015-02-17 10:52:55 -0500, sfr...@snowman.net wrote: From the old thread, David had offered to submit a pull request if there was interest and I didn't see any response... For whatever it's worth, that's because I've been away from work, and only just returned. I had it on my list to look at the code tomorrow. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] What exactly is our CRC algorithm?
At 2015-02-10 14:30:51 +0200, hlinnakan...@vmware.com wrote: I propose that we add a new header file in src/port, let's call it crc_instructions.h […] the point of the crc_instructions.h header is to hide the differences between compilers and architectures. Moving the assembly code/compiler intrinsics to a separate header is fine with me, but I really don't like the proposed implementation at all. Here are some comments: 1. I don't mind moving platform-specific tests for CRC32C instructions to configure, but if we only define PG_HAVE_CRC32C_INSTRUCTIONS, we would anyway have to reproduce all that platform-specific stuff in the header file. To do it properly, I think we should generate the right version of crc_instructions.h for the platform. Otherwise, what's the point? Might as well have only the complicated header. 2. At this point, I think we should stick with the _sse function rather than a generic _hw one to drive the platform-specific instructions. The structure of that function (n/8*(8bytes)+n%8*(bytes)) is specific to what works best for the SSE instructions. On another platform, we may need something very similar, or very different. With the proposed structure, _hw would inevitably become #ifdef'ed non-trivially, e.g. to run a single-byte loop at the start to align the main loop, but only for certain combinations of #defines. I think we should consider what makes the most sense when someone actually wants to add support for another platform/compiler, and not before. 3. I dislike everything about pg_crc32_instructions_runtime_check()—the name, the separate PG_CRC32C_INSTRUCTIONS_NEED_RUNTIME_CHECK #define, the fact that you try to do the test inline rather than at startup… Maybe you're trying to avoid checking separately at startup so that a program that links with code from src/common doesn't need to do its own test. But I don't think the results are worth it. As for omitting the slicing-by-8 tables, if there's a more compelling reason to do that than Gentoo users might like that ;-), I think it should be done with a straightforward -DUSE_CRC_SSE style arrangement rather than figuring out that you HAVE_CRC32C_INSTRUCTIONS but don't NEED_RUNTIME_CHECK. If you want to build special-purpose binaries, just pick whichever CRC implementation you like, done. I believe the CRC instructions are also available in 32-bit mode, so the check for __x86_64__ is not needed. Right? I'm not sure, but I guess you're right. It would be nice to use GCC's builtin intrinsics, __builtin_ia32_crc32* where available, but I guess those are only available if you build with -msse4.2, and that would defeat the point of the runtime check. Exactly. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] What exactly is our CRC algorithm?
At 2015-02-09 12:52:41 +0200, hlinnakan...@vmware.com wrote: Ok, I've committed a patch that just moves the existing code to common/pg_crc.c […] Thanks, looks good. Attached is a rebased version of the slicing-by-8 patch. Looks OK too. Do you have access to big-endian hardware to test this on? Yes, I tested it on a Linux/ppc system. I wasn't speculating when I said it does make a noticeable difference, though I'm afraid I did not keep the timings after submitting the revised patch. The speedup was some double-digit percentage, IIRC. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] What exactly is our CRC algorithm?
At 2015-02-08 18:46:30 +0200, hlinnakan...@vmware.com wrote: So I propose to move pg_crc.c to src/common, and move the tables from pg_crc_tables.h directly to pg_crc.c. Thoughts? Sounds fine to me. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL
At 2015-01-26 17:45:52 -0500, robertmh...@gmail.com wrote: Based on the recent emails, it appears there's been a shift of preference to having it be in-core […] Well, I'm not sure that anyone else here agreed with me on that Sure, an in-core AUDIT command would be great. Stephen has always said that would be the most preferable solution; and if we had the code to implement it, I doubt anyone would prefer the contrib module instead. But we don't have that code now, and we won't have it in time for 9.5. We had an opportunity to work on pgaudit in its current form, and I like to think that the result is useful. To me, the question has always been whether some variant of that code would be acceptable for 9.5's contrib. If so, I had some time to work on that. If not… well, hard luck. But the option to implement AUDIT was not available to me, which is why I have not commented much on it recently. The basic dynamic here seems to be you asking for changes and Abhijit making them but without any real confidence, and I don't feel good about that. I understand how I might have given you that impression, but I didn't mean to, and I don't really feel that way. I appreciate Stephen's suggestions and, although it took me some time to understand them fully, I think the use of GRANT to provide finer-grained auditing configuration has improved pgaudit. I am slightly concerned by the resulting complexity, but I think that can be addressed by examples and so on. I wouldn't be unhappy if this code were to go into contrib. (I should point out that it is also not the case that I do not hold any opinions and would be happy with anything pgaudit-shaped being included. For example, I strongly prefer GRANT to the 'alice:*:*' approach.) Anyway, I think it's reasonably clear now that pgaudit is unlikely to make it into 9.5 in any form, so I'll find something else to do. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)
At 2015-01-26 18:47:29 -0600, jim.na...@bluetreble.com wrote: Anything happen with this? Nothing so far. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)
At 2015-01-27 17:00:27 -0600, jim.na...@bluetreble.com wrote: It would be best to get this into a commit fest so it's not lost. It's there already. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Perl coding error in msvc build system?
At 2014-06-03 22:30:50 -0400, pete...@gmx.net wrote: I'm not sure whether the following coding actually detects any errors: Solution.pm: open(P, cl /? 21 |) || die cl command not found; Since nobody with a Windows system has commented, I'm just writing to say that from a Perl perspective, I agree with your analysis and the patch looks perfectly sensible. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_basebackup fails with long tablespace paths
At 2014-12-24 08:10:46 -0500, pete...@gmx.net wrote: As a demo for how this might look, attached is a wildly incomplete patch to produce GNU long-link headers. Hi Peter. In what way exactly is this patch wildly incomplete? (I ask because it's been added to the current CF). -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL
[After a discussion on IRC with Stephen…] At 2015-01-20 21:47:02 -0500, sfr...@snowman.net wrote: Review the opening of this email though and consider that we could look at what privileges has the audit role granted to the current role? as defining what is to be audited. Right, I understand now how that would work. I'll try to find time to (a) implement this, (b) remove the backwards-compatibility code, and (c) split up the USE_DEPARSE_FUNCTIONS stuff. For example, what if I want to see all the tables created and dropped by a particular user? I hadn't been intending to address that with this scheme, but I think we have that by looking for privilege grants where the audit role is the grantee and the role-to-be-audited the grantor. For CREATE, yes, with a bit of extra ACL-checking code in the utility hook; but I don't think we'll get very far without the ability to log ALTER/DROP too. :-) So there has to be some alternative mechanism for that, and I'm hoping Robert (or anyone!) has something in mind. Well, I was primairly digging for someone to say yes, the object access stuff will be reworked to be based on event triggers, thus removing the need for the object access bits. (This doesn't entirely make sense to me, but it's tangential anyway, so I won't comment on it for now.) -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL
At 2015-01-20 20:36:39 -0500, robertmh...@gmail.com wrote: I think this is throwing the baby out with the bathwater. Neither C functions nor all-or-nothing are going to be of any practical use. Do you see some approach that has a realistic chance of making 9.5 and would also actually be worth putting into 9.5? Suggestions very welcome. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers